Medallion Architecture: A Complete Guide from a GAP Data Engineer Processing 5TB+ Retail Data

Published: 2026-01-27

Expert Insight by Jaswanth Malineni

Senior Azure Data Engineer, GAP

Data Engineering / Retail

Jaswanth has 5 years of hands-on data engineering experience across retail, education, and technology. He currently builds Azure Databricks Lakehouse pipelines processing 5TB+ retail data at GAP using Medallion Architecture. Previously, he designed Delta Lake frameworks at Illinois State University and built AWS Glue ETL pipelines at RI Logix Technologies. He holds an MS in Computer Science and is an AWS Certified Data Engineer.

TL;DR

Medallion Architecture is a data design pattern that organizes data into three layers — Bronze (raw), Silver (cleansed), and Gold (business-ready). The real power isn't just organization — it's the incremental quality improvement, schema evolution handling, and clear separation of concerns that make enterprise data pipelines maintainable. I've built Medallion pipelines at GAP that process 5TB+ of retail data daily, powering inventory decisions and sales analytics. This guide covers what I've learned about building lakehouse architecture that actually scales.

What You'll Learn
  • What Medallion Architecture is and why it's become the standard for modern data lakehouses
  • How to design Bronze, Silver, and Gold layers for real enterprise workloads
  • Production patterns for CDC/incremental processing with Delta Lake
  • Real-world lessons from building 5TB+ retail data pipelines at GAP
  • How to migrate from legacy ETL (SSIS) to modern lakehouse architecture
  • Common mistakes that derail Medallion implementations and how to avoid them

Quick Answers

What is Medallion Architecture?

Medallion Architecture is a data design pattern that organizes a lakehouse into three layers: Bronze (raw data), Silver (cleansed and conformed), and Gold (business-ready aggregations). Each layer progressively improves data quality, making it easier to maintain, debug, and scale data pipelines. It's the de facto standard for Azure Databricks and Delta Lake implementations.

Why is it called Bronze, Silver, Gold?

The names reflect data quality progression — like refining precious metals. Bronze is raw ore (data as-is from sources). Silver is refined metal (cleansed, deduplicated, standardized). Gold is the finished product (business-ready aggregations and features). This naming convention makes the purpose of each layer immediately clear to both technical and business stakeholders.

Is Medallion Architecture the same as a data lakehouse?

Medallion Architecture is a pattern for organizing a data lakehouse, not the lakehouse itself. A lakehouse combines data lake storage (like ADLS or S3) with data warehouse features (ACID transactions, schema enforcement). Medallion Architecture provides the layered structure within that lakehouse. You can build a lakehouse without Medallion, but Medallion has become the most popular organizational pattern.

When should I NOT use Medallion Architecture?

Skip Medallion for simple, single-source pipelines where the overhead isn't justified. Also avoid it for real-time streaming use cases that need sub-second latency (though you can use a modified version). If your data is already clean and well-structured from the source, a two-layer approach (raw + curated) might be simpler. Medallion shines for complex, multi-source enterprise data with varying quality.


What is Medallion Architecture?

Medallion Architecture

Medallion Architecture is a data design pattern that organizes a data lakehouse into three progressive layers: Bronze (raw ingestion), Silver (cleansed and validated), and Gold (aggregated and business-ready). Each layer applies incremental transformations, improving data quality while maintaining lineage and enabling reprocessing. It's the recommended pattern for Azure Databricks, Delta Lake, and modern lakehouse implementations.

When I joined GAP to build their retail data platform, I inherited a mix of legacy SSIS packages, scattered data files, and SQL Server databases that had grown organically over years. The first question wasn't "what tools should we use?" — it was "how do we organize this chaos into something maintainable?"

Medallion Architecture provided the answer. Here's why it works:

Progressive Data Quality: Data moves through layers like refining raw materials. Bronze captures everything as-is. Silver cleanses and standardizes. Gold serves business needs. Each layer has a clear purpose.

Separation of Concerns: Ingestion logic lives in Bronze. Data quality logic lives in Silver. Business logic lives in Gold. When something breaks, you know exactly where to look.

Reprocessability: Because each layer is persisted (not just a view), you can reprocess Silver from Bronze when business rules change, without re-ingesting from source systems.

Schema Evolution: Bronze tolerates schema changes from sources. Silver handles them gracefully. Gold provides stable interfaces for consumers.

Key Stats
  • 5TB+: Retail data processed daily at GAP
  • 30%: Faster ingestion with optimized ADF pipelines
  • 25%: Compute cost reduction after SSIS migration
  • 99%: Data reliability in production pipelines

The pattern has become the de facto standard for Azure Databricks and Delta Lake implementations. Microsoft, Databricks, and most data engineering teams I've worked with use some variation of Bronze-Silver-Gold.

🔑

Medallion Architecture isn't just folder organization — it's a framework for progressive data quality, clear ownership, and maintainable pipelines. Each layer has a specific job, making complex data platforms manageable.


Why Medallion for Enterprise Data Pipelines

I've built data pipelines on both Azure and AWS, using traditional ETL, ELT, and lakehouse patterns. For enterprise workloads with multiple sources and complex transformations, Medallion consistently wins. Here's why.

Data Quality as a Journey

Traditional ETL treats data quality as a single transformation: dirty data goes in, clean data comes out. This works until:

  • Source schemas change unexpectedly
  • Business rules evolve mid-flight
  • You need to debug why a number doesn't match

Medallion treats data quality as a journey with checkpoints:

  • Bronze: Capture raw data, preserve source fidelity, enable forensics
  • Silver: Apply cleaning rules, handle nulls, enforce types, deduplicate
  • Gold: Apply business logic, aggregate, optimize for consumption

When something looks wrong in a dashboard, I can trace it back: Gold aggregation issue? Silver cleaning bug? Bronze ingestion problem? Source system change?

Schema Evolution Without Pain

Retail data is messy. At GAP, we ingest from:

  • Point-of-sale systems (transactions, returns, voids)
  • Inventory management (stock levels, transfers, adjustments)
  • E-commerce platforms (orders, carts, sessions)
  • Marketing systems (campaigns, promotions, customer segments)

Each source has its own schema, its own quirks, and its own tendency to change without warning.

Bronze layer handles this by using schema-on-read. We ingest data as-is, often as JSON or with permissive schemas. When a source adds a field, Bronze doesn't break.

Silver layer applies schema enforcement. We handle nulls, type conversions, and missing fields. When a source changes, we update Silver logic — not ingestion.

Schema Evolution Tip

Use Delta Lake's schema evolution features (mergeSchema, overwriteSchema) carefully. Enable mergeSchema for Silver tables that need to handle new columns gracefully. But for Gold tables serving BI tools, schema stability is more important than flexibility.
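As a minimal sketch of that distinction, assuming silver_ready and gold_daily are already-prepared DataFrames and the table names are illustrative:

# Silver append that tolerates new columns arriving from the source
(silver_ready.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("silver.sales_transactions"))

# Gold write stays strict: no mergeSchema, so an unexpected column fails fast
(gold_daily.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("gold.fact_daily_sales"))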

Performance at Each Layer

Different layers have different performance profiles:

Bronze: Optimized for write throughput. We use append-only patterns, minimal transformation, and Auto Loader for streaming ingestion. The goal is getting data in fast.

Silver: Optimized for incremental processing. We use Delta Lake MERGE for CDC patterns, partitioning by date, and Z-ordering on common filter columns. The goal is efficient updates.

Gold: Optimized for read performance. We use aggressive aggregation, materialized views, and Snowflake for serving. The goal is fast queries for dashboards and reports.

Layer | Optimization Goal | Key Techniques | Typical Consumers
Bronze | Write throughput | Append-only, Auto Loader, minimal transforms | Silver layer pipelines
Silver | Incremental updates | MERGE, partitioning, Z-ordering, CDC | Gold layer, data scientists
Gold | Read performance | Aggregations, materialized views, caching | BI tools, reports, APIs

Governance and Auditability

In retail, we deal with financial data (sales, revenue), customer data (transactions, preferences), and inventory data (stock counts, valuations). All of this requires:

  • Lineage: Where did this number come from?
  • Auditability: What transformations were applied?
  • Access Control: Who can see what?

Medallion layers provide natural governance boundaries:

  • Bronze: Restricted access, raw data retention for compliance
  • Silver: Broader access, curated for analysis
  • Gold: Business-user access, aggregated and anonymized where needed

We use Unity Catalog to enforce these boundaries, with data lineage tracked automatically through Delta Lake.

🔑

Medallion Architecture provides progressive quality, schema evolution handling, layer-specific optimization, and natural governance boundaries. For enterprise data with multiple sources and complex requirements, it's the most maintainable pattern I've found.


The Bronze Layer: Raw Data Ingestion

Bronze is your data's safety net. The goal is simple: capture everything from source systems with minimal transformation, preserving the original structure for debugging and reprocessing.

Design Principles

Preserve Source Fidelity: Don't transform data in Bronze. If the source sends null, store null. If the source sends malformed JSON, store malformed JSON. Bronze is a historical record, not a cleaning layer.

Add Ingestion Metadata: While we don't transform source data, we add metadata: ingestion timestamp, source system, batch ID, file name. This enables debugging and incremental processing.

Use Schema-on-Read: For sources with unstable schemas, use permissive schemas or JSON storage. Let Silver handle schema enforcement.

Optimize for Writes: Bronze tables are write-heavy. Use append-only patterns, partition by ingestion date, and avoid expensive operations like MERGE.
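
Here's a minimal sketch of these principles with Auto Loader, assuming illustrative ADLS paths and table names; the _metadata.file_path column requires a reasonably recent Databricks runtime:

from pyspark.sql import functions as F

# Read new files incrementally from the landing zone (schema-on-read)
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/sales")  # illustrative path
    .load("/mnt/landing/sales/")                                        # illustrative path
    # Ingestion metadata only -- source columns pass through untouched
    .withColumn("_ingestion_timestamp", F.current_timestamp())
    .withColumn("_source_system", F.lit("pos_sql_server"))
    .withColumn("_source_file", F.col("_metadata.file_path"))
)

# Append-only write into the Bronze Delta table
(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/sales")     # illustrative path
    .trigger(availableNow=True)
    .toTable("bronze.sales_transactions"))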

Our ADF Pipeline Patterns

At GAP, we ingest from on-premises SQL Server databases into ADLS Gen2 using Azure Data Factory. Here's what improved our ingestion speed by 30%:

Parallel Copy Activities: We use dynamic pipelines that parallelize across tables. Instead of sequential copies, we batch tables by size and run them concurrently.

Incremental Extraction: For large tables, we use watermark columns (usually modified_date) to extract only changed records. This reduces data volume and source system load.

Change Data Capture: Where available, we use SQL Server CDC to capture inserts, updates, and deletes. This feeds directly into our Delta Lake MERGE patterns in Silver.

Staging Optimization: We land data in Parquet format with snappy compression. This balances write speed with downstream read performance.

Source SQL Server → ADF Copy Activity → ADLS Gen2 (Parquet) → Databricks Auto Loader → Bronze Delta Table
Common Bronze Mistake

Don't over-engineer Bronze. I've seen teams add validation, cleaning, and transformation logic to their ingestion layer. When the source system changes or business rules evolve, everything breaks. Keep Bronze dumb and fast — let Silver be smart.

Handling Diverse Sources

Retail data comes from everywhere:

Databases (SQL Server, Oracle): Use ADF or Databricks JDBC connectors. Extract with watermarks for incremental loads.

Files (CSV, JSON, XML): Land in ADLS, use Auto Loader for streaming ingestion into Bronze. Handle schema inference carefully.

APIs (REST, webhooks): Use Azure Functions or Logic Apps to capture payloads, land in ADLS, process with Auto Loader.

Streaming (Kafka, Event Hubs): Use Structured Streaming with writeStream to Bronze Delta tables. Handle late-arriving data with watermarks.

The key is consistency: regardless of source type, Bronze tables have the same structure — source data plus ingestion metadata.

🔑

Bronze is about capture, not cleanup. Preserve source fidelity, add ingestion metadata, optimize for writes, and let Silver handle the complexity. When something goes wrong downstream, Bronze is your forensic evidence.


The Silver Layer: Cleansed and Conformed

Silver is where the real data engineering happens. This layer transforms raw data into a cleansed, validated, and standardized form that's ready for analysis.

Design Principles

Single Source of Truth: Silver tables should be the definitive, cleansed version of each data entity. Duplicates removed, types enforced, relationships validated.

Incremental Processing: Use CDC patterns and Delta Lake MERGE to process only changed records. Full refreshes don't scale for 5TB+ datasets.

Schema Enforcement: Unlike Bronze's permissive schemas, Silver enforces strict types and constraints. Invalid records go to error tables for investigation.

Standardization: Apply naming conventions, date formats, currency conversions, and business key mappings. Silver data should be consistent across all sources.

CDC/Incremental Processing with Delta Lake

At GAP, most of our Silver tables use Delta Lake MERGE for incremental updates. Here's the pattern:

# Silver incremental update pattern
from delta.tables import DeltaTable
from pyspark.sql.functions import col, to_date, coalesce, lit

# Load Bronze changes since the last watermark
# (last_watermark is the high-water mark stored from the previous run)
bronze_changes = spark.read.table("bronze.sales_transactions") \
    .filter(col("_ingestion_timestamp") > last_watermark)

# Apply transformations
silver_ready = bronze_changes \
    .withColumn("transaction_date", to_date(col("trans_dt"))) \
    .withColumn("amount", col("amt").cast("decimal(18,2)")) \
    .withColumn("store_id", coalesce(col("store_id"), lit("UNKNOWN"))) \
    .dropDuplicates(["transaction_id"])

# MERGE into Silver
silver_table = DeltaTable.forName(spark, "silver.sales_transactions")
silver_table.alias("target") \
    .merge(silver_ready.alias("source"), "target.transaction_id = source.transaction_id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

Key optimizations:

  • Partition pruning: Partition Silver tables by date, filter on partition columns in MERGE conditions
  • Z-ordering: Apply Z-ORDER on frequently filtered columns (store_id, product_id)
  • Broadcast joins: For small dimension tables, use broadcast hints to avoid shuffles
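
As a sketch of the first two points, reusing silver_table and silver_ready from the example above and assuming the table is partitioned by transaction_date and the incoming batch only contains recent transactions:

# Restrict the MERGE to recent date partitions so Delta can prune files.
# Assumes the incoming batch only holds transactions from the last 7 days;
# older source rows would otherwise be inserted as duplicates.
silver_table.alias("target") \
    .merge(
        silver_ready.alias("source"),
        "target.transaction_date >= date_sub(current_date(), 7) "
        "AND target.transaction_id = source.transaction_id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

# Periodically co-locate data files on common filter columns
spark.sql("OPTIMIZE silver.sales_transactions ZORDER BY (store_id, product_id)")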

Schema Enforcement and Evolution

Silver is where we enforce schemas, but we also need to handle evolution gracefully:

Type Enforcement: Cast all columns to expected types. Invalid casts go to error tables.

Null Handling: Define explicit rules for each column — required vs. optional, default values for nulls.

New Columns: Use Delta Lake's mergeSchema option to handle new columns from sources. But review and validate — not all new columns belong in Silver.

Removed Columns: Don't remove columns from Silver immediately. Mark as deprecated, backfill with nulls, remove after downstream systems adapt.
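
A hedged sketch of the type-enforcement rule above (invalid casts routed to an error table), reusing bronze_changes from the earlier MERGE example; the error table name is illustrative:

from pyspark.sql import functions as F

# In Spark, a failed cast yields NULL rather than raising an error
typed = bronze_changes.withColumn("amount", F.col("amt").cast("decimal(18,2)"))

# Rows where the source value was present but the cast produced NULL are invalid
invalid = typed.filter(F.col("amt").isNotNull() & F.col("amount").isNull())
valid = typed.filter(F.col("amount").isNotNull() | F.col("amt").isNull())

# Quarantine invalid rows for investigation instead of dropping them silently
invalid.withColumn("_error_reason", F.lit("amt failed decimal(18,2) cast")) \
    .write.format("delta").mode("append") \
    .saveAsTable("silver.errors_sales_transactions")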

Pros
  • + Single source of cleansed truth for all downstream
  • + Reprocessable — can rebuild from Bronze when rules change
  • + CDC patterns enable efficient incremental processing
  • + Clear boundary for data quality enforcement
Cons
  • Additional storage and compute costs
  • Latency added by transformation layer
  • Requires careful schema evolution management
  • Complex for rapidly changing source systems

PySpark Transformations That Scale

When you're processing 5TB+ daily, small inefficiencies compound. Here's what we've learned:

Avoid UDFs When Possible: Native Spark functions are optimized and can be pushed down. UDFs serialize data to Python, killing performance.

Partition Strategically: Too few partitions = slow processing. Too many = small files problem. We aim for 100-200 partitions for most jobs, with partition sizes around 128MB.

Cache Wisely: Cache intermediate DataFrames that are reused multiple times. But unpersist when done — memory is finite.

Broadcast Small Tables: For joins with dimension tables under 100MB, broadcast to avoid shuffle.
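
Two of these optimizations in one short sketch, assuming silver.dim_store is a small lookup table (the table name is illustrative):

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

dim_store = spark.table("silver.dim_store")        # small dimension, well under 100MB
sales = spark.table("silver.sales_transactions")

# Native function instead of a Python UDF: stays in the JVM and can be optimized
enriched = (
    sales.withColumn("sale_month", F.date_format("transaction_date", "yyyy-MM"))
    # Broadcast the small dimension so the join avoids a full shuffle
    .join(broadcast(dim_store), "store_id", "left")
)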

The difference between a pipeline that runs in 20 minutes and one that runs in 2 hours is usually a few missing optimizations — broadcast joins, partition pruning, avoiding UDFs. Small changes, big impact at scale.

Jaswanth Malineni, Senior Azure Data Engineer, GAP
🔑

Silver is your data engineering workhorse — cleaning, validating, standardizing, and optimizing. Use CDC patterns for incremental processing, enforce schemas strictly, and invest in Spark optimization. Silver quality determines everything downstream.


The Gold Layer: Business-Ready

Gold is where data becomes business value. This layer contains aggregated, business-ready datasets optimized for specific consumption patterns — dashboards, reports, ML features, and APIs.

Design Principles

Business-Centric Modeling: Gold tables are organized around business concepts, not source systems. Think "daily_store_sales" not "pos_transactions_aggregated".

Consumption-Optimized: Pre-aggregate, pre-join, pre-filter. Gold should be fast to query without complex transformations.

Stable Interfaces: Gold schemas are contracts with consumers. Changes require versioning and communication.

Multiple Gold Tables: Different consumers need different views. It's normal to have multiple Gold tables from the same Silver sources.

Dimensional Modeling Patterns

For BI and reporting, we use dimensional modeling in Gold:

Fact Tables: Transactional events with measures. fact_daily_sales with store_id, product_id, date_id, units_sold, revenue.

Dimension Tables: Descriptive attributes. dim_store with store_id, store_name, region, format, open_date.

Star Schema: Facts reference dimensions via surrogate keys. Optimized for BI tool performance.

At GAP, our Gold layer includes:

  • gold.fact_daily_sales — Store-level daily sales aggregations
  • gold.fact_inventory_snapshots — End-of-day inventory positions
  • gold.dim_store — Store attributes and hierarchies
  • gold.dim_product — Product catalog with categories
  • gold.dim_date — Calendar dimension with fiscal periods
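
A hedged sketch of how a table like gold.fact_daily_sales can be built from Silver; the units column and the exact grain are assumptions:

from pyspark.sql import functions as F

fact_daily_sales = (
    spark.table("silver.sales_transactions")
    .groupBy("transaction_date", "store_id", "product_id")
    .agg(
        F.sum("units").alias("units_sold"),   # 'units' is an assumed Silver column
        F.sum("amount").alias("revenue"),
    )
)

(fact_daily_sales.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("transaction_date")
    .saveAsTable("gold.fact_daily_sales"))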

Analytics-Ready Datasets for Power BI/Tableau

Our BI team consumes Gold data through:

Direct Query: For real-time dashboards, Tableau connects directly to Databricks SQL endpoints querying Gold tables.

Snowflake Curated Datasets: For high-concurrency reporting, we replicate Gold tables to Snowflake. Better BI tool support, predictable performance.

Power BI Dataflows: For self-service analytics, we expose Gold tables through Power Query with pre-built transformations.

Key considerations for BI consumption:

  • Pre-aggregate: Don't make BI tools aggregate 5TB. Pre-compute at appropriate grain.
  • Denormalize selectively: Include key dimension attributes in facts to reduce joins.
  • Optimize for common filters: Z-ORDER on columns that BI users filter frequently.

AI/ML-Ready Feature Tables

Beyond BI, the Gold layer serves our data science team:

Feature Tables: Pre-computed features for ML models. gold.customer_features with recency, frequency, monetary values.

Training Datasets: Point-in-time correct snapshots for model training. Avoid leakage by using proper temporal joins.

Inference Inputs: Real-time feature serving for production models.

We use Databricks Feature Store to manage these, with Gold tables as the underlying storage.

Gold Layer Readiness Checklist
  • Tables organized around business concepts, not source systems
  • Pre-aggregated to appropriate grain for consumers
  • Schemas documented and versioned
  • Access controls aligned with data sensitivity
  • Refresh schedules meet business SLAs
  • Query performance validated with expected workloads
  • Lineage documented back to Silver/Bronze
  • Monitoring in place for data freshness and quality
🔑

Gold is business value realized. Model for consumers, optimize for queries, and maintain stable interfaces. Different consumers need different Gold tables — that's expected, not a design flaw.


Real Project: GAP Retail Analytics Pipeline

Let me walk you through the actual pipeline I built at GAP — from raw retail data to business dashboards.

The Challenge

When I joined, GAP had:

  • 5TB+ daily retail data: POS transactions, inventory movements, e-commerce orders
  • Legacy SSIS packages: Hundreds of them, some undocumented, many fragile
  • Multiple data silos: Different systems for stores, warehouses, and e-commerce
  • Growing analytics demand: Leadership wanted real-time inventory visibility and sales insights

The goal: build a unified data platform that could handle current scale and grow with the business.

The Architecture

[On-Prem SQL Server] → [Azure Data Factory] → [ADLS Gen2 Bronze]
                                                      ↓
                                           [Databricks Bronze Tables]
                                                      ↓
                                           [Databricks Silver Tables]
                                                      ↓
                                           [Databricks Gold Tables]
                                                      ↓
                                    [Snowflake] ← → [Power BI/Tableau]

Ingestion (ADF → Bronze):

  • ADF pipelines extract from on-prem SQL Server
  • Incremental extraction using watermarks
  • Land in ADLS Gen2 as Parquet
  • Auto Loader streams into Bronze Delta tables
  • Result: 30% faster ingestion than legacy SSIS

Transformation (Bronze → Silver):

  • PySpark jobs in Databricks
  • CDC patterns with Delta Lake MERGE
  • Schema enforcement and validation
  • Error handling with dead-letter tables
  • Result: 99% data reliability

Aggregation (Silver → Gold):

  • Dimensional modeling for BI
  • Pre-aggregated fact tables
  • Feature tables for ML
  • Result: Query times from minutes to seconds

Serving (Gold → Consumers):

  • Snowflake for high-concurrency BI
  • Databricks SQL for ad-hoc analysis
  • Feature Store for ML models
  • Result: Self-service analytics at scale

Performance Optimizations

Here's what moved the needle:

Partitioning Strategy:

  • Bronze: Partition by ingestion date (for retention management)
  • Silver: Partition by business date (for incremental processing)
  • Gold: Partition by report period (for query performance)

Z-Ordering:

  • Silver transactions: Z-ORDER BY (store_id, product_id)
  • Gold sales: Z-ORDER BY (date_id, store_id)
  • Reduced query times by 60% for common filters

Caching and Broadcast:

  • Cache Silver dimension lookups
  • Broadcast dim tables (< 100MB) in joins
  • Reduced shuffle in aggregation jobs

Cluster Right-Sizing:

  • Auto-scaling clusters for variable workloads
  • Photon-enabled clusters for SQL-heavy jobs
  • Spot instances for non-critical batch jobs

Results and Business Impact

After migrating to the Medallion Architecture:

  • 25% compute cost reduction: Efficient incremental processing vs. full refreshes
  • 30% faster ingestion: Parallelized ADF pipelines with optimized extraction
  • Real-time inventory visibility: From daily batches to hourly refreshes
  • Self-service analytics: Business users can explore Gold tables directly
  • ML-ready platform: Feature Store enables rapid model iteration

The biggest win wasn't technical — it was organizational. With Medallion Architecture, everyone knows where to find data, what quality to expect, and who owns what. That clarity is worth more than any performance optimization.

Jaswanth Malineni, Senior Azure Data Engineer, GAP
🔑

A real Medallion implementation is about orchestrating multiple components — ADF, Databricks, Delta Lake, Snowflake — into a coherent whole. The architecture is simple; the execution requires attention to detail at every layer.


Medallion + Delta Lake: The Perfect Pair

Medallion Architecture and Delta Lake were designed to work together. Delta Lake provides the transactional foundation that makes Medallion practical at scale.

ACID Transactions Across Layers

Traditional data lakes have a dirty secret: they're not transactional. A failed job can leave data in an inconsistent state. Concurrent writes can corrupt tables. There's no rollback.

Delta Lake solves this with ACID transactions:

  • Atomicity: Jobs succeed completely or fail completely. No partial writes.
  • Consistency: Schema enforcement ensures data integrity.
  • Isolation: Concurrent readers and writers don't interfere.
  • Durability: Committed changes persist through failures.

For Medallion, this means:

  • Bronze ingestion can fail mid-batch without corruption
  • Silver MERGE operations are atomic
  • Gold tables can be refreshed while users query

Time Travel for Debugging and Auditing

Delta Lake's time travel is a superpower for Medallion implementations:

Debugging: When a Gold metric looks wrong, query Silver as of yesterday:

SELECT * FROM silver.transactions VERSION AS OF 42
-- or
SELECT * FROM silver.transactions TIMESTAMP AS OF '2026-01-26'

Auditing: Demonstrate what data looked like at a specific point for compliance requirements.

Reprocessing: When business rules change, reprocess Gold from a historical Silver version.

Recovery: Accidentally deleted data? Restore from a previous version.

We retain 30 days of history for Silver tables (configurable via Delta Lake retention settings) — enough for most debugging and audit needs.
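
The retention window is driven by Delta table properties; a hedged sketch, noting that exact property names and defaults can vary by Delta/Databricks version:

# Keep roughly 30 days of transaction log history and of removed data files,
# which bounds how far back time travel can go on this table
spark.sql("""
    ALTER TABLE silver.sales_transactions SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 30 days'
    )
""")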

MERGE for CDC Patterns

Delta Lake MERGE is the engine behind our Silver incremental processing:

MERGE INTO silver.customers AS target
USING bronze.customers_changes AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.operation = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Key MERGE optimizations:

  • Partition filters: Always include partition columns in MERGE conditions
  • Predicate pushdown: Delta Lake pushes filters to file pruning
  • Optimized writes: Use optimizeWrite for better file sizes
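
For the optimized-writes point, a hedged sketch using Databricks-specific configuration keys (names may differ across runtimes):

# Encourage fewer, larger files from MERGE and other writes (Databricks settings)
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")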

Unity Catalog for Governance

Unity Catalog brings enterprise governance to the lakehouse:

Data Lineage: Automatic tracking of how data flows through Bronze → Silver → Gold.

Access Control: Fine-grained permissions at catalog, schema, and table levels.

Data Discovery: Searchable catalog of all lakehouse assets.

Audit Logging: Who accessed what, when, from where.

We use Unity Catalog to enforce that:

  • Bronze tables are accessible only to data engineering
  • Silver tables are accessible to data engineering and data science
  • Gold tables are accessible to business analysts and BI tools
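
In Unity Catalog SQL, those boundaries translate to grants along these lines; the catalog, schema, and group names are illustrative:

# Illustrative catalog ("retail"), schemas, and workspace groups
spark.sql("GRANT USE CATALOG ON CATALOG retail TO `data_engineers`")
spark.sql("GRANT USE SCHEMA, SELECT, MODIFY ON SCHEMA retail.bronze TO `data_engineers`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA retail.silver TO `data_scientists`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA retail.gold TO `business_analysts`")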
🔑

Delta Lake transforms Medallion from a design pattern into a reliable, transactional, auditable data platform. ACID transactions, time travel, MERGE operations, and Unity Catalog governance are the features that make Medallion work at enterprise scale.


Medallion vs Traditional ETL

I've worked with both traditional ETL (SSIS, Informatica) and modern lakehouse patterns. Here's an honest comparison.

Dimension | Traditional ETL | Medallion Lakehouse
Data Storage | Data warehouse (relational) | Data lake (object storage + Delta)
Processing | Extract-Transform-Load | Extract-Load-Transform (ELT)
Schema | Schema-on-write, strict | Schema-on-read (Bronze), schema-on-write (Silver/Gold)
Scalability | Vertical (bigger warehouse) | Horizontal (more compute nodes)
Cost Model | Storage + compute bundled | Storage + compute separate
Flexibility | Structured data focus | Any data format (JSON, Parquet, etc.)
Reprocessing | Complex, often requires re-extraction | Reprocess from Bronze without re-ingesting

When to Choose Medallion

  • High data volume: 5TB+ datasets benefit from distributed processing
  • Multiple sources: Diverse data formats and schemas
  • Evolving requirements: Business logic changes frequently
  • ML/AI workloads: Need raw data access for feature engineering
  • Cost sensitivity: Separate storage and compute for optimization

When to Keep Traditional ETL

  • Small data volumes: < 100GB where complexity isn't justified
  • Simple requirements: Single source, well-defined transformations
  • Existing investment: Mature ETL platform with institutional knowledge
  • Regulatory constraints: Industry-specific tools required for compliance

Migration from SSIS: Our Experience

At GAP, we migrated hundreds of SSIS packages to Databricks. Here's what we learned:

Don't Lift and Shift: SSIS packages encode assumptions (sequential processing, single-node) that don't apply. Redesign for distributed processing.

Map to Medallion Layers: Most SSIS packages combined ingestion and transformation. Separate into Bronze (ingestion) and Silver (transformation).

Embrace Incremental: SSIS often did full refreshes because change tracking was complex. Delta Lake MERGE makes incremental processing straightforward.

Document Business Logic: Many SSIS packages contained undocumented business rules in Script Tasks. Extract and document before migrating.

Parallel Migration: Don't try to migrate everything at once. Run old and new pipelines in parallel, validate output, switch over gradually.

Result: 25% compute cost reduction, faster development cycles, and happier data engineers.

🔑

Medallion isn't universally better than traditional ETL — it's better for high-volume, complex, evolving data environments. Migration is a redesign, not a port. Take the opportunity to clean up accumulated technical debt.


Common Mistakes in Medallion Implementation

After building Medallion architectures across multiple organizations, here are the mistakes I see most often.

Medallion Architecture Mistakes

  • Over-engineering Bronze layer with transformations that belong in Silver
  • Skipping Silver and going directly from Bronze to Gold
  • Creating too many Gold tables without clear ownership
  • Ignoring data quality until it's a crisis
  • Not planning for reprocessing from the start
  • Using full refreshes when incremental is possible
  • Neglecting monitoring and alerting
  • Treating schemas as static when sources evolve

Mistake Deep Dive: Over-Engineering Bronze

I've seen teams add validation, cleaning, and even business logic to their Bronze layer. The reasoning: "We should clean data as soon as possible."

Why it fails: When source systems change — and they always do — your Bronze layer breaks. Now you can't even ingest data, let alone transform it.

The fix: Bronze should be dumb. Capture data as-is, add metadata, write fast. All transformation logic belongs in Silver where it can be updated independently of ingestion.

Mistake Deep Dive: Skipping Silver

Some teams argue: "We have Bronze (raw) and Gold (business-ready). Why do we need Silver in between?"

Why it fails: Without Silver, Gold tables contain both cleaning logic AND business logic. When either changes, everything changes. You also lose the ability to have multiple Gold tables from the same cleansed source.

The fix: Silver is the cleansed, validated, entity-level truth. Gold is business aggregations. They serve different purposes and change for different reasons.

Mistake Deep Dive: Schema Complacency

Many teams design schemas once and treat them as static. Then a source system adds a field, removes a field, or changes a type.

Why it fails: Rigid schemas break on source changes. Overly flexible schemas (everything as STRING) create quality nightmares.

The fix:

  • Bronze: Permissive schemas that tolerate changes
  • Silver: Schema enforcement with explicit handling for nulls, type mismatches, new columns
  • Gold: Stable schemas with versioning for breaking changes
  • Monitor for schema drift and alert before it becomes a crisis
The Reprocessing Trap

If you can't reprocess Silver from Bronze without re-ingesting from sources, you've missed a key benefit of Medallion. Always retain enough Bronze history to rebuild Silver when business rules change.
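
A hedged sketch of what that rebuild looks like when rules change, reusing the Silver transformation from the earlier example; transform_to_silver is a hypothetical helper wrapping those rules:

# Re-read the full Bronze history instead of only the latest watermark window
bronze_full = spark.read.table("bronze.sales_transactions")

# Re-apply the (updated) Silver transformation logic to the full history
silver_rebuilt = transform_to_silver(bronze_full)   # hypothetical helper

# Replace Silver atomically; Delta keeps the previous version for rollback via time travel
(silver_rebuilt.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("silver.sales_transactions"))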

🔑

Most Medallion failures come from violating layer responsibilities: Bronze tries to do Silver's job, Silver gets skipped, Gold becomes a dumping ground. Respect the layers, and the architecture works.


Key Takeaways: Medallion Architecture

  1. Medallion Architecture organizes data into Bronze (raw), Silver (cleansed), and Gold (business-ready) layers with progressive quality improvement
  2. Bronze is about capture — preserve source fidelity, add metadata, optimize for writes
  3. Silver is about quality — cleanse, validate, standardize, use CDC patterns for incremental processing
  4. Gold is about value — business-centric modeling, consumption-optimized, stable interfaces
  5. Delta Lake provides the transactional foundation — ACID transactions, time travel, MERGE operations
  6. Unity Catalog adds governance — lineage, access control, data discovery
  7. The pattern shines for high-volume, multi-source, complex enterprise data — not every use case needs it
  8. Respect layer responsibilities — most failures come from Bronze doing Silver's job or Silver getting skipped

Frequently Asked Questions

What is Medallion Architecture?

Medallion Architecture is a data design pattern that organizes a lakehouse into three layers: Bronze (raw ingestion), Silver (cleansed and validated), and Gold (aggregated and business-ready). Each layer progressively improves data quality while maintaining lineage and enabling reprocessing. It's the de facto standard for Azure Databricks and Delta Lake implementations.

Why is it called Bronze, Silver, Gold?

The names reflect data quality progression, like refining precious metals. Bronze is raw ore (data as-is from sources). Silver is refined metal (cleansed, deduplicated, standardized). Gold is the finished product (business-ready aggregations). This naming makes the purpose of each layer clear to both technical and business stakeholders.

Is Medallion Architecture only for Databricks?

No, though it's most commonly associated with Azure Databricks and Delta Lake. The pattern can be implemented with any lakehouse technology: AWS Lake Formation, Google BigQuery, Snowflake, or open-source tools like Apache Spark with Iceberg or Hudi. The concepts of progressive data quality layers apply regardless of tooling.

How does Medallion Architecture differ from Lambda Architecture?

Lambda Architecture separates batch and real-time processing into different paths, then merges results. Medallion Architecture focuses on data quality layers, not processing timing. You can implement real-time Medallion (streaming Bronze → Silver → Gold) or batch Medallion. Many teams combine concepts, using streaming for Bronze ingestion and batch for Gold aggregations.

What goes in Bronze vs Silver vs Gold?

Bronze: Raw data as-is from sources, plus ingestion metadata (timestamp, source, batch ID). No transformations. Silver: Cleansed, validated, deduplicated data at entity level. Schema enforcement, null handling, type conversions. Gold: Business aggregations optimized for consumption. Dimensional models, feature tables, pre-computed metrics.

How do I handle schema changes in Medallion?

Bronze uses permissive schemas (schema-on-read) to absorb source changes. Silver handles evolution explicitly — new columns added with mergeSchema, removed columns deprecated with nulls before removal. Gold maintains stable interfaces with versioning for breaking changes. Monitor for schema drift and update Silver logic when sources change.

When should I NOT use Medallion Architecture?

Skip Medallion for simple, single-source pipelines where the overhead isn't justified. Also consider alternatives for sub-second latency requirements (streaming-first architectures) or when data is already clean and well-structured from sources (two-layer raw + curated might suffice). Medallion adds complexity that's only justified for enterprise-scale, multi-source data.

Sources & References

  1. Medallion Architecture, Databricks (2026)
  2. What is a Lakehouse?, Databricks (2020)
  3. Delta Lake Documentation, Delta Lake (2026)
  4. Azure Databricks Best Practices, Microsoft (2026)
  5. Unity Catalog Documentation, Databricks (2026)
  6. Azure Data Factory Documentation, Microsoft (2026)
  7. Spark SQL Performance Tuning, Apache Spark (2026)
  8. The Data Warehouse Toolkit, Ralph Kimball (2013)
  8. The Data Warehouse ToolkitRalph Kimball (2013)