Unlocking the Full Potential of Snowflake Metadata in DataHub

In a world powered by data, knowing what you have and where to find it can make or break your organization's success. Data observability platforms like DataHub aim to address this by providing a unified view of your data ecosystem. However, integrating platforms like Snowflake into DataHub isn't always straightforward. Many teams face challenges with incomplete metadata ingestion, leaving critical table properties like creation timestamps, row counts, and schema-level details out of the picture.

This gap in metadata not only limits the power of data governance but also adds layers of complexity to tracking data lineage and operational visibility. When we encountered this challenge while integrating Snowflake with DataHub for one of our clients, we knew that a custom solution was the only way forward.

In this blog, we’ll take you through the journey of overcoming these limitations. From identifying the gaps to designing a scalable solution, you’ll learn how we transformed the Snowflake-DataHub integration to unlock richer metadata insights and drive better data governance. 

The Problem: Incomplete Metadata from Snowflake 

Can you rely on a map that only shows the main roads but not the intersections or landmarks? That’s what incomplete metadata feels like. For Snowflake users integrating with DataHub, this is a common scenario. While the default Snowflake connector does a decent job of retrieving basic metadata, it falls short when it comes to deeper insights. 

Key Missing Metadata: 

  • Table creation and modification timestamps 
  • Row counts and data sizes 
  • Schema-specific properties 

Without these critical pieces, your DataHub environment can feel fragmented, making it harder to manage data lineage, automate governance processes, and ensure operational efficiency.  

Key Challenges 

Our client’s data team relied heavily on Snowflake for its scalability and DataHub for its data observability. However, the incomplete metadata ingestion posed several challenges: 

Data Visibility: Without table creation dates or row counts, teams struggled to gain a comprehensive view of their data assets. 

Governance Bottlenecks: Missing metadata hindered lineage tracking and compliance efforts, creating unnecessary manual work. 

Performance Concerns: The solution needed to handle large-scale environments without taxing Snowflake’s performance. 

With hundreds of tables spread across multiple schemas, it became clear that the default setup wouldn’t cut it. We needed a solution that was not only complete but also scalable and efficient. 

The Solution: A Custom Transformer for DataHub

To bridge the gap, we built a custom Python-based transformer that integrates seamlessly with DataHub’s ingestion pipeline. This transformer dynamically retrieves the missing metadata properties from Snowflake and maps them to datasets in DataHub. 

How the Solution Works 

  1. Snowflake Property Fetcher 

A Python module connects to Snowflake, executes an optimized SQL query, and retrieves key table properties like row counts and creation dates. In-memory caching ensures fast performance during ingestion. 

  2. Custom Transformer

The fetched metadata is processed by a transformer class that maps properties to corresponding datasets in DataHub. This ensures the ingested data is enriched with all necessary attributes. 

  3. Ingestion Recipe

An ingestion recipe integrates the custom transformer, enabling a smooth and automated metadata ingestion process. 

Solution Architecture 

Here’s a high-level view of how the solution works: 

Architecture Diagram


Deep Dive: Technical Implementation

  1. Snowflake Property Fetcher 

The Snowflake Property Fetcher is responsible for executing a dynamic SQL query to fetch table properties like row count, creation date, and data size. It uses in-memory caching to avoid repeated database calls during a single ingestion run. 

Key Features 

  • Dynamic Query Execution: Adapts SQL queries to specific account and schema parameters at runtime. 
  • Caching: Results are stored in memory to reduce database load and speed up ingestion. 
  • Retry Logic: Robust error-handling mechanisms ensure reliable data retrieval.

Code Snippet: Fetching Table Properties 


import pandas as pd
import snowflake.connector


class SnowflakePropertyFetcher:
    cache = None

    @staticmethod
    def get_properties(dataset_urn):
        # Use cached results if available
        if SnowflakePropertyFetcher.cache is None:
            SnowflakePropertyFetcher.cache = SnowflakePropertyFetcher.fetch_properties()
        # Match dataset URN to table properties
        return SnowflakePropertyFetcher.cache.get(dataset_urn)

    @staticmethod
    def fetch_properties():
        # Connect to Snowflake
        conn = snowflake.connector.connect(
            user='username',
            password='password',
            account='account_id',
            database='database',
            schema='schema'
        )
        # Execute SQL query
        query = open('fetch_table_properties.sql').read()
        df = pd.read_sql(query, conn)
        conn.close()
        # Cache results in a dictionary keyed by dataset URN
        return df.set_index('dataset_urn').to_dict('index')
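
The contents of fetch_table_properties.sql aren't shown above. As a rough illustration, a query along these lines against Snowflake's INFORMATION_SCHEMA.TABLES view would return the missing properties keyed by a DataHub-style dataset URN; the column aliases and URN construction below are our own assumptions, not the exact query we shipped.

# Illustrative sketch of what fetch_table_properties.sql might contain; the exact
# columns, filters, and URN format depend on how datasets are identified during ingestion.
FETCH_TABLE_PROPERTIES_SQL = """
SELECT
    LOWER(
        'urn:li:dataset:(urn:li:dataPlatform:snowflake,'
        || table_catalog || '.' || table_schema || '.' || table_name || ',PROD)'
    )            AS dataset_urn,
    row_count,
    bytes        AS size_bytes,
    created      AS created_at,
    last_altered AS last_modified_at
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
"""
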
  2. Custom Transformer

The custom transformer integrates with DataHub’s ingestion pipeline and uses the fetched metadata to enrich datasets. 

Features 

  • Dynamic Property Mapping: Links table properties to corresponding datasets using schema and table names. 
  • Error Handling: Ensures invalid or missing data doesn’t disrupt the ingestion pipeline. 
  • Seamless Integration: Configured via the ingestion recipe for plug-and-play functionality. 

Code Snippet: Adding Properties to Datasets

from datahub.ingestion.transformer.add_dataset_properties import AddDatasetPropertiesResolverBase


class SnowflakePropertiesTransformer(AddDatasetPropertiesResolverBase):
    def get_dataset_properties(self, dataset_urn):
        # Fetch the cached table properties for this dataset from the Snowflake fetcher
        # (SnowflakePropertyFetcher is the class shown above, packaged alongside this transformer)
        properties = SnowflakePropertyFetcher.get_properties(dataset_urn)
        # Fall back to an empty dict so a table with no fetched properties
        # doesn't disrupt the ingestion pipeline
        return properties or {}
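
One practical detail: DataHub stores dataset custom properties as string key/value pairs, so numeric and timestamp values fetched from Snowflake generally need to be stringified before they are returned. A minimal helper along these lines can be applied inside the resolver; the function name and the None-filtering are illustrative choices, not part of the client's code.

def to_custom_properties(row):
    # DataHub custom properties are string-to-string pairs, so convert fetched
    # values (ints, timestamps, etc.) to strings and drop anything that is missing.
    return {key: str(value) for key, value in row.items() if value is not None}
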
  3. Ingestion Recipe Configuration

The ingestion recipe is configured to invoke the custom transformer and ensure seamless integration with the pipeline. 

Recipe File 

transformers:
  - type: add_dataset_properties
    config:
      add_properties_resolver_class: snowflake_transformer.custom_transformer_dss_snowflake.SnowflakePropertiesTransformer
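
With this recipe in place, ingestion runs as usual, for example via the DataHub CLI (datahub ingest -c recipe.yml). For teams that prefer to drive ingestion from code, the same wiring can be expressed with DataHub's Pipeline API; the sketch below is illustrative, and the Snowflake source and sink settings are placeholders rather than our client's configuration.

from datahub.ingestion.run.pipeline import Pipeline

# Programmatic equivalent of the YAML recipe; connection details are placeholders.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            "account_id": "account_id",
            "username": "username",
            "password": "password",
        },
    },
    "transformers": [
        {
            "type": "add_dataset_properties",
            "config": {
                "add_properties_resolver_class": "snowflake_transformer.custom_transformer_dss_snowflake.SnowflakePropertiesTransformer",
            },
        }
    ],
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}

pipeline = Pipeline.create(recipe)
pipeline.run()
pipeline.raise_from_status()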

Challenges and Trade-offs 

  1. Balancing Caching with Freshness 

While caching improved performance, it introduced the risk of stale data. We addressed this by implementing periodic refresh schedules to balance speed and accuracy.  
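
There are several ways to implement such a refresh. A minimal sketch of one option is a TTL check around the cached lookup; the six-hour TTL and function names below are illustrative, not the exact schedule we deployed.

import time

# Reload the cached properties once they are older than a configurable TTL.
CACHE_TTL_SECONDS = 6 * 60 * 60  # illustrative: refresh every six hours

_cache = None
_cache_loaded_at = 0.0

def get_properties_with_ttl(dataset_urn, fetch_fn):
    """Return cached properties, reloading via fetch_fn when the cache is stale."""
    global _cache, _cache_loaded_at
    if _cache is None or time.time() - _cache_loaded_at > CACHE_TTL_SECONDS:
        _cache = fetch_fn()  # e.g. SnowflakePropertyFetcher.fetch_properties
        _cache_loaded_at = time.time()
    return _cache.get(dataset_urn)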

  2. Complexity in Customization

Developing and maintaining a custom solution required additional effort. However, the modular architecture made it reusable and easier to debug. 

  3. Impact on Snowflake Resources

Optimized query design ensured that additional metadata retrieval didn’t disrupt Snowflake workloads, even during peak hours. 

Results and Impact 

Our custom solution delivered measurable improvements across the board:  

Key Outcomes: 

  1. Complete Metadata Ingestion: 

All required table properties, including row counts, creation timestamps, and schema details, are now available in DataHub. 

  2. Improved Data Governance:

Enriched metadata enables better tracking, lineage, and compliance across datasets. 

  3. Scalability:

The solution scales seamlessly to handle hundreds of tables across multiple Snowflake schemas.

Conclusion: Making Metadata Work for You 

Metadata is the backbone of any data observability platform, and without it, your ability to fully manage, govern, and leverage your data is severely limited. By addressing the challenges posed by the default Snowflake connector, we were able to unlock the full potential of Snowflake metadata within DataHub. This custom solution has transformed the way our client interacts with their data, providing richer insights, more effective governance, and greater operational efficiency. 

If your organization is grappling with similar metadata gaps, this approach could be the key to enhancing your data observability platform. Building a tailored solution not only ensures that all your critical metadata is captured but also enables you to harness the true power of your data. 

After all, data is only as valuable as the insights it generates. With the right tools and strategies in place, you can transform your data into a strategic asset.  

Don’t let incomplete metadata hold you back—take the steps to bridge those gaps and unlock the full potential of your data landscape today.