The hidden side of data catalogs

When our team at dalatake began architecting a next-generation metadata management solution, we quickly realized that the industry’s proliferation of specialized catalogs—service catalogs, schema registries, data discovery tools, and more—offered a fractured view of organizational data assets. We set out to build Dalatake Catalogs as a unified metadata layer that could flexibly satisfy diverse requirements for latency, availability, governance, and interoperability without forcing teams to stitch together multiple systems. In this article, I’ll walk through several influential perspectives in the data-catalog landscape and explain how they’ve informed Dalatake’s design.

Data catalog. Author: GPT4o

Convergence of Catalog Systems?

“These Aren’t the Catalogs You’re Looking For” dives into how service catalogs, schema registries, and data discovery platforms often ingest and expose overlapping metadata, even though they evolved to solve distinct problems. The piece argues that latency constraints, high-availability needs, and compatibility with specific engines have driven the current sprawl of specialized catalogs. Yet it posits that a truly unified catalog could meet all these demands, reducing friction for data teams and eliminating redundant metadata silos. Platforms like Backstage and DataHub are already moving in this direction, broadening their scopes beyond narrow niches and hinting at the convergence we are aiming for with Dalatake Catalogs.

Different types of Data Catalogs

The number of emerging solutions keeps growing. Onehouse.ai’s article “Comprehensive Data Catalog Comparison” offers an in-depth look at the major players (Unity Catalog, Apache Polaris, DataHub, Apache Gravitino, and more), evaluated across features such as data discovery, governance, lineage, and access control. It categorizes catalogs into three archetypes:

Metastores (e.g., Hive Metastore): optimized for table definitions and schema operations.
Business Catalogs (e.g., Alation, Collibra): focused on data asset discovery, annotation, and stewardship workflows.
Catalogs of Catalogs (e.g., Backstage, DataHub): acting as a federated layer that aggregates metadata from multiple sources.

By laying out each solution’s trade-offs in terms of openness, community support, and integration breadth, the article underscores that no single catalog universally “wins.” Instead, organizations must match a catalog’s strengths to their governance models, performance SLAs, and engineering stacks. Dalatake Catalogs embraces this lesson by offering plugin-based connectors and a modular governance engine, allowing teams to tailor the system to their precise needs without vendor lock-in.

On data governance

Data Governance is a system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods. Source: The Data Governance Institute.

Apache Gravitino: A Geo-Distributed Metadata Lake

Apache Gravitino takes the notion of a “catalog” beyond mere indexing and into the realm of a full-blown metadata lake. Designed for geo-distributed deployments, Gravitino ingests metadata from file stores, relational databases, and streaming platforms in real time and distributes it across regions or clouds. Its key innovation is treating metadata changes as first-class events, ensuring that any update in a connected system (say, a new table created in one region) is reflected globally. While Dalatake Catalogs isn’t strictly a metadata lake, we borrowed Gravitino’s event-driven approach to keep our catalog state synchronized across multi-region clusters and heterogeneous engines.
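To make the event-driven idea concrete, here is a minimal sketch in Python. It is not Gravitino's or Dalatake's actual implementation: the names MetadataEvent, RegionalCatalog, and EventBus are hypothetical, and the in-memory bus stands in for whatever message broker a real deployment would use. The point it illustrates is that each regional replica applies versioned change events idempotently, so a redelivered or stale event does not corrupt state.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetadataEvent:
    """A single metadata change, e.g. a table created in one region."""
    entity: str    # fully qualified name, e.g. "sales.orders"
    version: int   # assumed to increase monotonically per entity
    payload: dict  # new metadata state for the entity
    origin: str    # region that produced the change


@dataclass
class RegionalCatalog:
    """Hypothetical per-region replica of the catalog state."""
    region: str
    state: dict = field(default_factory=dict)     # entity -> payload
    versions: dict = field(default_factory=dict)  # entity -> last applied version

    def apply(self, event: MetadataEvent) -> bool:
        """Apply an event unless an equal or newer version was already applied."""
        if self.versions.get(event.entity, 0) >= event.version:
            return False  # duplicate or stale delivery, safe to ignore
        self.state[event.entity] = event.payload
        self.versions[event.entity] = event.version
        return True


class EventBus:
    """In-memory stand-in for a real message broker (an assumption)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, catalog: RegionalCatalog) -> None:
        self.subscribers.append(catalog)

    def publish(self, event: MetadataEvent) -> None:
        # Fan the event out to every subscribed regional replica.
        for catalog in self.subscribers:
            catalog.apply(event)


bus = EventBus()
eu = RegionalCatalog("eu-west")
us = RegionalCatalog("us-east")
bus.subscribe(eu)
bus.subscribe(us)

created = MetadataEvent(
    "sales.orders", 1, {"columns": ["id", "amount"]}, origin="eu-west"
)
bus.publish(created)
bus.publish(created)  # redelivery changes nothing
```

After both publishes, the two replicas hold the same state for sales.orders. Keying idempotence on a per-entity version is one simple choice; a production system would also need ordering and conflict handling, which this sketch leaves out.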

Apache Polaris: Iceberg’s Native Catalog

For organizations standardizing on Apache Iceberg as their table format, Apache Polaris offers a dedicated, open-source catalog solution. Still in incubation at the Apache Software Foundation, Polaris aims to provide a RESTful API for managing Iceberg tables, supporting engines like Spark, Flink, Trino, and more. Its focus on multi-engine interoperability and community-driven governance makes it a compelling choice for Iceberg users. In designing Dalatake Catalogs, we drew inspiration from Polaris’s clear separation of control plane (catalog) and data plane (table storage), implementing our own abstractions so that new table formats or query engines can be onboarded without reengineering core metadata services.
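The control plane and data plane split can be sketched as follows. This is an illustrative assumption, not Polaris's API or Dalatake's code: TableFormat, TableEntry, ControlPlane, and FakeIcebergFormat are hypothetical names, and the bucket path is a placeholder. The control plane only records identity, format, and location, while format-specific logic lives in adapters registered separately.

```python
from dataclasses import dataclass
from typing import Protocol


class TableFormat(Protocol):
    """Data-plane adapter: knows how to inspect a table stored at a location."""
    name: str

    def describe(self, location: str) -> dict: ...


@dataclass(frozen=True)
class TableEntry:
    """Control-plane record: identity and a pointer, never the data itself."""
    identifier: str
    format: str
    location: str


class ControlPlane:
    """Hypothetical catalog core that is unaware of any specific table format."""

    def __init__(self):
        self._formats = {}
        self._tables = {}

    def register_format(self, fmt: TableFormat) -> None:
        self._formats[fmt.name] = fmt

    def register_table(self, entry: TableEntry) -> None:
        if entry.format not in self._formats:
            raise ValueError(f"unknown table format: {entry.format}")
        self._tables[entry.identifier] = entry

    def describe(self, identifier: str) -> dict:
        # Delegate format-specific work to the adapter for this table.
        entry = self._tables[identifier]
        return self._formats[entry.format].describe(entry.location)


class FakeIcebergFormat:
    """Placeholder adapter; a real one would read Iceberg metadata files."""
    name = "iceberg"

    def describe(self, location: str) -> dict:
        return {"format": self.name, "location": location}


plane = ControlPlane()
plane.register_format(FakeIcebergFormat())
plane.register_table(
    TableEntry("sales.orders", "iceberg", "s3://example-bucket/sales/orders")
)
info = plane.describe("sales.orders")
```

Onboarding another table format in this sketch means registering one more adapter. The ControlPlane class itself does not change, which is the property the paragraph above describes.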

DataHub

Although it is not an operational data catalog itself, DataHub started as a LinkedIn engineering project and has become perhaps the most widely preferred open metadata repository. Among other features on its roadmap, the project plans to step into the operational side by adding a metastore to the platform.

DataHub is an open-source metadata catalog designed to provide a unified, real-time view of all data and AI assets across an organization by modeling metadata as a graph of interconnected entities. At its core, DataHub ingests change events from diverse sources (data warehouses, lakehouse tables, pipelines, dashboards, ML models, and more) via a streaming layer, normalizes and enriches them, and stores them in a transactional metadata store paired with a high-performance search index.

Its key features include global search and faceted navigation for rapid asset discovery; interactive lineage visualization at both table and column levels; embedded observability with data quality metrics, freshness indicators, and SLA alerts; and robust governance capabilities such as ownership assignments, policy-as-code enforcement, and approval workflows. DataHub’s extensibility is enabled by a plugin architecture with over fifty out-of-the-box connectors, SDKs for custom integrations in Python and Java, and comprehensive REST and GraphQL APIs.

A React-based UI provides asset profiles, annotations, and discussion threads to foster data collaboration, while a vibrant open-source community ensures continuous innovation, with roadmap items such as a native metrics catalog, probabilistic lineage inference, and AI asset governance. Together, these capabilities empower teams to discover, understand, trust, and govern their data and models at scale.
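Modeling metadata as a graph of interconnected entities is easiest to see with a toy example. The sketch below is not DataHub's data model or SDK: MetadataGraph and its methods are hypothetical, and the identifiers are simplified strings rather than real DataHub URNs. It shows how lineage questions such as "what breaks if this table changes?" reduce to a graph traversal.

```python
from collections import defaultdict, deque


class MetadataGraph:
    """Toy asset graph; edges point from an upstream asset to a downstream one."""

    def __init__(self):
        self.entities = {}                  # identifier -> properties
        self.downstream = defaultdict(set)  # identifier -> downstream identifiers

    def add_entity(self, urn, **properties):
        self.entities[urn] = properties

    def add_lineage(self, upstream, downstream):
        self.downstream[upstream].add(downstream)

    def impacted_by(self, urn):
        """Breadth-first walk returning every asset downstream of `urn`."""
        seen, queue = set(), deque([urn])
        while queue:
            current = queue.popleft()
            for child in self.downstream.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen


graph = MetadataGraph()
graph.add_entity("table:raw.orders", owner="ingest-team")
graph.add_entity("table:mart.revenue", owner="analytics")
graph.add_entity("dashboard:revenue", owner="finance")
graph.add_lineage("table:raw.orders", "table:mart.revenue")
graph.add_lineage("table:mart.revenue", "dashboard:revenue")

impacted = graph.impacted_by("table:raw.orders")
```

Here impacted contains both the revenue mart and the dashboard built on it, which is the kind of answer a lineage view surfaces. Column-level lineage would follow the same pattern with columns as nodes.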

Key Takeaways from Catalog Comparisons

Revisiting the aforementioned Onehouse.ai comparison, a few common themes emerge:

No Silver Bullet: Each catalog excels in particular dimensions—governance, low-latency lookup, or engine compatibility—but falls short elsewhere.
Plugin Ecosystems Matter: Open-source catalogs with vibrant plugin registries (e.g., DataHub’s ingestion connectors) reduce integration overhead.
Event-Driven Sync: Real-time metadata propagation, as pioneered by Gravitino, is critical for ensuring that catalogs remain accurate and trustworthy.
Federation vs. Monolith: Some teams prefer a federated “catalog of catalogs” to unify existing tools, while others want an all-in-one solution. Dalatake Catalogs supports both modes, offering federation adapters alongside native indexing services.
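The last point, serving both federated and native lookups, might look roughly like the following sketch. It is an assumption for illustration only: HybridCatalog, its methods, and the "legacy-metastore" source are hypothetical names, and a plain dictionary stands in for an external catalog.

```python
from typing import Callable, Optional

# A federation adapter is any callable that resolves a name to metadata,
# or returns None when the external catalog does not know the asset.
Adapter = Callable[[str], Optional[dict]]


class HybridCatalog:
    """Hypothetical catalog that checks its own index before federated sources."""

    def __init__(self):
        self.native_index = {}  # name -> metadata held by the catalog itself
        self.adapters = []      # list of (source_name, adapter) pairs

    def index(self, name: str, metadata: dict) -> None:
        self.native_index[name] = metadata

    def federate(self, source: str, adapter: Adapter) -> None:
        self.adapters.append((source, adapter))

    def lookup(self, name: str) -> Optional[dict]:
        # Natively indexed assets win; otherwise ask each adapter in order.
        if name in self.native_index:
            return {"source": "native", **self.native_index[name]}
        for source, adapter in self.adapters:
            found = adapter(name)
            if found is not None:
                return {"source": source, **found}
        return None


# A dict stands in for an existing external catalog.
legacy = {"hr.employees": {"owner": "people-ops"}}

catalog = HybridCatalog()
catalog.index("sales.orders", {"owner": "sales-eng"})
catalog.federate("legacy-metastore", legacy.get)

native_hit = catalog.lookup("sales.orders")
federated_hit = catalog.lookup("hr.employees")
```

Tagging each result with its source lets callers see whether an answer came from the native index or a federated system. The native-first ordering is a design choice of this sketch, not something the comparison prescribes.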

Closing Thoughts

Building Dalatake Catalogs has been an exercise in balancing the competing demands laid out by these pioneering projects. We’ve fused the low-latency, high-availability principles championed by Gravitino; the schema-and-table governance abstractions from Polaris; and the plugin-driven, federated architecture discussed in “These Aren’t the Catalogs You’re Looking For” and the Onehouse.ai comparison. The result is a catalog platform that scales globally, integrates with any engine or format, and provides a consistent governance layer—yet remains flexible enough to let teams choose exactly which capabilities they need. As the metadata management space continues to evolve, we believe that this hybrid, modular approach will become the de facto standard for organizations seeking both power and agility in their data catalogs.

Who knows what the future holds. In the meantime, we will see the birth and death of many metastore projects, but the Iceberg metastore seems best positioned to win the race.

David Rey
