Unlocking the Full Potential of Snowflake Metadata in DataHub
In a world powered by data, knowing what you have and where to find it can make or break your organization's success. Data observability platforms like DataHub aim to address this by providing a unified view of your data ecosystem. However, integrating platforms like Snowflake into DataHub isn't always straightforward. Many teams face challenges with incomplete metadata ingestion, leaving critical table properties like creation timestamps, row counts, and schema-level details out of the picture.
This gap in metadata not only limits the power of data governance but also makes data lineage tracking and operational visibility harder to achieve. When we ran into this while integrating Snowflake with DataHub for one of our clients, we knew a custom solution was the only way forward.
In this blog, we’ll take you through the journey of overcoming these limitations. From identifying the gaps to designing a scalable solution, you’ll learn how we transformed the Snowflake-DataHub integration to unlock richer metadata insights and drive better data governance.
The Problem: Incomplete Metadata from Snowflake
Can you rely on a map that only shows the main roads but not the intersections or landmarks? That’s what incomplete metadata feels like. For Snowflake users integrating with DataHub, this is a common scenario. While the default Snowflake connector does a decent job of retrieving basic metadata, it falls short when it comes to deeper insights.
Key Missing Metadata:
- Table creation and modification timestamps
- Row counts and data sizes
- Schema-specific properties
Without these critical pieces, your DataHub environment can feel fragmented, making it harder to manage data lineage, automate governance processes, and ensure operational efficiency.
Key Challenges
Our client’s data team relied heavily on Snowflake for its scalability and DataHub for its data observability. However, the incomplete metadata ingestion posed several challenges:
Data Visibility: Without table creation dates or row counts, teams struggled to gain a comprehensive view of their data assets.
Governance Bottlenecks: Missing metadata hindered lineage tracking and compliance efforts, creating unnecessary manual work.
Performance Concerns: The solution needed to handle large-scale environments without taxing Snowflake’s performance.
With hundreds of tables spread across multiple schemas, it became clear that the default setup wouldn’t cut it. We needed a solution that was not only complete but also scalable and efficient.
The Solution: A Custom Transformer for DataHub
To bridge the gap, we built a custom Python-based transformer that integrates seamlessly with DataHub’s ingestion pipeline. This transformer dynamically retrieves the missing metadata properties from Snowflake and maps them to datasets in DataHub.
How the Solution Works
- Snowflake Property Fetcher
A Python module connects to Snowflake, executes an optimized SQL query, and retrieves key table properties like row counts and creation dates. In-memory caching ensures fast performance during ingestion.
- Custom Transformer
The fetched metadata is processed by a transformer class that maps properties to corresponding datasets in DataHub. This ensures the ingested data is enriched with all necessary attributes.
- Ingestion Recipe
An ingestion recipe integrates the custom transformer, enabling a smooth and automated metadata ingestion process.
Solution Architecture
Here’s a high-level view of how the solution works:
Architecture Diagram
Deep Dive: Technical Implementation
- Snowflake Property Fetcher
The Snowflake Property Fetcher is responsible for executing a dynamic SQL query to fetch table properties like row count, creation date, and data size. It uses in-memory caching to avoid repeated database calls during a single ingestion run.
Key Features
- Dynamic Query Execution: Adapts SQL queries to specific account and schema parameters at runtime.
- Caching: Results are stored in memory to reduce database load and speed up ingestion.
- Retry Logic: Robust error-handling mechanisms ensure reliable data retrieval.
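The retry behaviour is kept out of the snippet below for brevity; conceptually it is a thin wrapper around the Snowflake call. Here is a minimal sketch, with a hypothetical helper name and a simple linear back-off:

import time

def with_retries(fn, attempts=3, backoff_seconds=5):
    # Call fn(), retrying on transient errors such as brief network or
    # warehouse hiccups. Illustrative only; tune attempts and back-off to taste.
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(backoff_seconds * attempt)
    raise last_error

In the fetcher, the expensive call can then be wrapped as with_retries(SnowflakePropertyFetcher.fetch_properties).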
Code Snippet: Fetching Table Properties
import pandas as pd
import snowflake.connector


class SnowflakePropertyFetcher:
    cache = None

    @staticmethod
    def get_properties(dataset_urn):
        # Use cached results if available
        if SnowflakePropertyFetcher.cache is None:
            SnowflakePropertyFetcher.cache = SnowflakePropertyFetcher.fetch_properties()
        # Match dataset URN to table properties
        return SnowflakePropertyFetcher.cache.get(dataset_urn)

    @staticmethod
    def fetch_properties():
        # Connect to Snowflake
        conn = snowflake.connector.connect(
            user='username',
            password='password',
            account='account_id',
            database='database',
            schema='schema'
        )
        # Execute SQL query
        query = open('fetch_table_properties.sql').read()
        df = pd.read_sql(query, conn)
        conn.close()
        # Cache results in a dictionary
        return df.set_index('dataset_urn').to_dict('index')
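The snippet reads its SQL from fetch_table_properties.sql, which isn't shown above. As a rough illustration, assuming the properties come from Snowflake's INFORMATION_SCHEMA.TABLES view and that dataset URNs follow DataHub's default Snowflake convention (lower-cased database.schema.table, PROD environment), the query might look something like this, kept here as a Python constant:

# Illustrative contents of fetch_table_properties.sql. The URN construction is
# an assumption based on DataHub's default Snowflake naming; adjust it to match
# how your datasets are actually named.
FETCH_TABLE_PROPERTIES_SQL = """
SELECT
    'urn:li:dataset:(urn:li:dataPlatform:snowflake,'
        || LOWER(table_catalog || '.' || table_schema || '.' || table_name)
        || ',PROD)'   AS "dataset_urn",
    row_count         AS "row_count",
    bytes             AS "size_in_bytes",
    created           AS "created_on",
    last_altered      AS "last_modified"
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
  AND table_schema <> 'INFORMATION_SCHEMA'
"""

The quoted aliases keep the column names lower-case so they line up with the set_index('dataset_urn') call in the fetcher; the remaining columns become the custom properties attached to each dataset.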
- Custom Transformer
The custom transformer integrates with DataHub’s ingestion pipeline and uses the fetched metadata to enrich datasets.
Features
- Dynamic Property Mapping: Links table properties to corresponding datasets using schema and table names.
- Error Handling: Ensures invalid or missing data doesn’t disrupt the ingestion pipeline.
- Seamless Integration: Configured via the ingestion recipe for plug-and-play functionality.
Code Snippet: Adding Properties to Datasets
from datahub.ingestion.transformer.add_dataset_properties import AddDatasetPropertiesResolverBase

# SnowflakePropertyFetcher is the class from the previous snippet, imported
# from wherever that module lives in your project.


class SnowflakePropertiesTransformer(AddDatasetPropertiesResolverBase):
    def get_dataset_properties(self, dataset_urn):
        # Fetch table properties from the Snowflake fetcher
        properties = SnowflakePropertyFetcher.get_properties(dataset_urn)
        return properties
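The error handling mentioned above is not visible in this minimal class. One way to make the resolver defensive is sketched below as a hypothetical variant: lookup failures are logged rather than raised, missing tables simply get no extra properties, and values are stringified because DataHub custom properties are string key/value pairs.

import logging

from datahub.ingestion.transformer.add_dataset_properties import AddDatasetPropertiesResolverBase

logger = logging.getLogger(__name__)


class SafeSnowflakePropertiesTransformer(AddDatasetPropertiesResolverBase):
    # Hypothetical defensive variant of the resolver above; SnowflakePropertyFetcher
    # is the class from the earlier snippet.
    def get_dataset_properties(self, dataset_urn):
        try:
            properties = SnowflakePropertyFetcher.get_properties(dataset_urn)
        except Exception as exc:
            # A failed lookup for one table should not abort the whole ingestion run
            logger.warning("Property lookup failed for %s: %s", dataset_urn, exc)
            return {}
        if not properties:
            # Tables not covered by the query simply get no extra properties
            return {}
        # DataHub custom properties are string key/value pairs
        return {key: str(value) for key, value in properties.items()}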
- Ingestion Recipe Configuration
The ingestion recipe is configured to invoke the custom transformer and ensure seamless integration with the pipeline.
Recipe File
transformers:
  - type: add_dataset_properties
    config:
      add_properties_resolver_class: snowflake_transformer.custom_transformer_dss_snowflake.SnowflakePropertiesTransformer
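This transformers block slots into the same recipe file that defines the Snowflake source and the DataHub sink. With the transformer module available on the ingestion host's Python path, the recipe runs like any other DataHub recipe, for example with the standard datahub ingest -c <your-recipe>.yml command.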
Challenges and Trade-offs
- Balancing Caching with Freshness
While caching improved performance, it introduced the risk of stale data. We addressed this by implementing periodic refresh schedules to balance speed and accuracy.
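A simple way to implement that refresh is to attach a time-to-live to the in-memory cache. The sketch below illustrates the idea; the helper name and TTL value are illustrative, not the exact scheduling mechanism used in production.

import time

class TTLPropertyCache:
    # Illustrative wrapper that refreshes the cached properties once they are
    # older than a configurable time-to-live.
    def __init__(self, fetch_fn, ttl_seconds=3600):
        self.fetch_fn = fetch_fn          # e.g. SnowflakePropertyFetcher.fetch_properties
        self.ttl_seconds = ttl_seconds
        self._cache = None
        self._loaded_at = 0.0

    def get(self, dataset_urn):
        stale = time.time() - self._loaded_at > self.ttl_seconds
        if self._cache is None or stale:
            # Re-run the Snowflake query and remember when we did so
            self._cache = self.fetch_fn()
            self._loaded_at = time.time()
        return self._cache.get(dataset_urn)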
- Complexity in Customization
Developing and maintaining a custom solution required additional effort. However, the modular architecture made it reusable and easier to debug.
- Impact on Snowflake Resources
Optimized query design ensured that additional metadata retrieval didn’t disrupt Snowflake workloads, even during peak hours.
Results and Impact
Our custom solution delivered measurable improvements across the board:
Key Outcomes:
- Complete Metadata Ingestion:
All required table properties, including row counts, creation timestamps, and schema details, are now available in DataHub.
- Improved Data Governance:
Enriched metadata enables better tracking, lineage, and compliance across datasets.
- Scalability:
The solution scales smoothly to hundreds of tables across multiple Snowflake schemas.
Conclusion: Making Metadata Work for You
Metadata is the backbone of any data observability platform, and without it, your ability to fully manage, govern, and leverage your data is severely limited. By addressing the challenges posed by the default Snowflake connector, we were able to unlock the full potential of Snowflake metadata within DataHub. This custom solution has transformed the way our client interacts with their data, providing richer insights, more effective governance, and greater operational efficiency.
If your organization is grappling with similar metadata gaps, this approach could be the key to enhancing your data observability platform. Building a tailored solution not only ensures that all your critical metadata is captured but also enables you to harness the true power of your data.
After all, data is only as valuable as the insights it generates. With the right tools and strategies in place, you can transform your data into a strategic asset.
Don’t let incomplete metadata hold you back—take the steps to bridge those gaps and unlock the full potential of your data landscape today.