Open Source Data Catalog Solutions

Amundsen

Amundsen is an open-source data discovery and metadata platform created at Lyft. It helps organizations find, understand, and trust their data.

Key Features

  • Data discovery through search functionality
  • Table and column-level metadata
  • Data lineage visualization
  • User-friendly UI with table popularity metrics
  • Integration with various data sources (Hive, Presto, Redshift, Snowflake, etc.)
  • Support for custom metadata

Architecture

Amundsen consists of three microservices:

  1. Metadata Service - serves table, column, and user metadata, stored in Neo4j by default
  2. Search Service - handles search queries, backed by Elasticsearch
  3. Frontend Service - Flask application with a React user interface
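
The division of labor between the three services can be illustrated with a toy in-memory model. This is a minimal sketch, not Amundsen's code or API: the class names, the TableMetadata fields, and the frontend_search function are all hypothetical, a plain dict stands in for Neo4j, and a small inverted index stands in for Elasticsearch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TableMetadata:
    """Hypothetical record for one table; not Amundsen's actual schema."""
    name: str
    description: str
    columns: list[str]
    popularity: int = 0  # assumed usage count, used only for ranking


class MetadataService:
    """Stands in for the metadata service; a dict replaces Neo4j."""

    def __init__(self) -> None:
        self._tables: dict[str, TableMetadata] = {}

    def put(self, table: TableMetadata) -> None:
        self._tables[table.name] = table

    def get(self, name: str) -> TableMetadata:
        return self._tables[name]


class SearchService:
    """Stands in for the search service; an inverted index replaces Elasticsearch."""

    def __init__(self) -> None:
        self._index: dict[str, set[str]] = {}

    def index(self, table: TableMetadata) -> None:
        # Split the qualified table name into words so "sales.orders" matches "orders".
        name_words = table.name.replace(".", " ").replace("_", " ")
        text = " ".join([name_words, table.description, *table.columns])
        for token in text.lower().split():
            self._index.setdefault(token, set()).add(table.name)

    def search(self, term: str) -> set[str]:
        # Exact single-token match only; a real search engine does far more.
        return set(self._index.get(term.lower(), set()))


def frontend_search(
    term: str, metadata: MetadataService, search: SearchService
) -> list[TableMetadata]:
    """Stands in for the frontend: ask search for hits, then enrich from metadata."""
    hits = [metadata.get(name) for name in search.search(term)]
    # Most popular tables first; ties broken by name for a stable order.
    return sorted(hits, key=lambda t: (-t.popularity, t.name))


if __name__ == "__main__":
    metadata, search = MetadataService(), SearchService()
    for table in [
        TableMetadata("sales.orders", "Customer orders", ["order_id", "amount"], 42),
        TableMetadata("sales.refunds", "Refunded orders", ["order_id", "reason"], 7),
    ]:
        metadata.put(table)
        search.index(table)
    for hit in frontend_search("orders", metadata, search):
        print(hit.name, hit.popularity)
```

The point of the sketch is the separation of concerns: the search component only returns matching table names, the metadata component owns the full records, and the frontend combines the two and orders results by popularity.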

Getting Started

DataHub

DataHub, originally developed at LinkedIn, is an open-source data catalog built to enable end-to-end data discovery, data observability, and data governance.

Apache Atlas

Apache Atlas provides open metadata management and governance capabilities for organizations to build a catalog of their data assets.

Metacat

Metacat by Netflix is a unified metadata exploration API service that provides metadata for various data sources.

Marquez

Marquez is an open-source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata.