Medallion Architecture: A Complete Guide from a GAP Data Engineer Processing 5TB+ Retail Data

Published: 2026-01-27

Expert Insight by Jaswanth Malineni

Senior Azure Data Engineer, GAP

Data Engineering / Retail

Jaswanth has 5 years of hands-on data engineering experience across retail, education, and technology. He currently builds Azure Databricks Lakehouse pipelines processing 5TB+ retail data at GAP using Medallion Architecture. Previously, he designed Delta Lake frameworks at Illinois State University and built AWS Glue ETL pipelines at RI Logix Technologies. He holds an MS in Computer Science and is an AWS Certified Data Engineer.

TL;DR

Medallion Architecture is a data design pattern that organizes data into three layers — Bronze (raw), Silver (cleansed), and Gold (business-ready). The real power isn't just organization — it's the incremental quality improvement, schema evolution handling, and clear separation of concerns that make enterprise data pipelines maintainable. I've built Medallion pipelines at GAP that process 5TB+ of retail data daily, powering inventory decisions and sales analytics. This guide covers what I've learned about building lakehouse architecture that actually scales.

What You'll Learn
  • What Medallion Architecture is and why it's become the standard for modern data lakehouses
  • How to design Bronze, Silver, and Gold layers for real enterprise workloads
  • Production patterns for CDC/incremental processing with Delta Lake
  • Real-world lessons from building 5TB+ retail data pipelines at GAP
  • How to migrate from legacy ETL (SSIS) to modern lakehouse architecture
  • Common mistakes that derail Medallion implementations and how to avoid them

Quick Answers

What is Medallion Architecture?

Medallion Architecture is a data design pattern that organizes a lakehouse into three layers: Bronze (raw data), Silver (cleansed and conformed), and Gold (business-ready aggregations). Each layer progressively improves data quality, making it easier to maintain, debug, and scale data pipelines. It's the de facto standard for Azure Databricks and Delta Lake implementations.

Why is it called Bronze, Silver, Gold?

The names reflect data quality progression — like refining precious metals. Bronze is raw ore (data as-is from sources). Silver is refined metal (cleansed, deduplicated, standardized). Gold is the finished product (business-ready aggregations and features). This naming convention makes the purpose of each layer immediately clear to both technical and business stakeholders.

Is Medallion Architecture the same as a data lakehouse?

Medallion Architecture is a pattern for organizing a data lakehouse, not the lakehouse itself. A lakehouse combines data lake storage (like ADLS or S3) with data warehouse features (ACID transactions, schema enforcement). Medallion Architecture provides the layered structure within that lakehouse. You can build a lakehouse without Medallion, but Medallion has become the most popular organizational pattern.

When should I NOT use Medallion Architecture?

Skip Medallion for simple, single-source pipelines where the overhead isn't justified. Also avoid it for real-time streaming use cases that need sub-second latency (though you can use a modified version). If your data is already clean and well-structured from the source, a two-layer approach (raw + curated) might be simpler. Medallion shines for complex, multi-source enterprise data with varying quality.


What is Medallion Architecture?

Medallion Architecture

Medallion Architecture is a data design pattern that organizes a data lakehouse into three progressive layers: Bronze (raw ingestion), Silver (cleansed and validated), and Gold (aggregated and business-ready). Each layer applies incremental transformations, improving data quality while maintaining lineage and enabling reprocessing. It's the recommended pattern for Azure Databricks, Delta Lake, and modern lakehouse implementations.

When I joined GAP to build their retail data platform, I inherited a mix of legacy SSIS packages, scattered data files, and SQL Server databases that had grown organically over years. The first question wasn't "what tools should we use?" — it was "how do we organize this chaos into something maintainable?"

Medallion Architecture provided the answer. Here's why it works:

Progressive Data Quality: Data moves through layers like refining raw materials. Bronze captures everything as-is. Silver cleanses and standardizes. Gold serves business needs. Each layer has a clear purpose.

Separation of Concerns: Ingestion logic lives in Bronze. Data quality logic lives in Silver. Business logic lives in Gold. When something breaks, you know exactly where to look.

Reprocessability: Because each layer is persisted (not just a view), you can reprocess Silver from Bronze when business rules change, without re-ingesting from source systems.

Schema Evolution: Bronze tolerates schema changes from sources. Silver handles them gracefully. Gold provides stable interfaces for consumers.

Key Stats
  • 5TB+: Retail data processed daily at GAP
  • 30%: Faster ingestion with optimized ADF pipelines
  • 25%: Compute cost reduction after SSIS migration
  • 99%: Data reliability in production pipelines

The pattern has become the de facto standard for Azure Databricks and Delta Lake implementations. Microsoft, Databricks, and most data engineering teams I've worked with use some variation of Bronze-Silver-Gold.

🔑

Medallion Architecture isn't just folder organization — it's a framework for progressive data quality, clear ownership, and maintainable pipelines. Each layer has a specific job, making complex data platforms manageable.


Why Medallion for Enterprise Data Pipelines

I've built data pipelines on both Azure and AWS, using traditional ETL, ELT, and lakehouse patterns. For enterprise workloads with multiple sources and complex transformations, Medallion consistently wins. Here's why.

Data Quality as a Journey

Traditional ETL treats data quality as a single transformation: dirty data goes in, clean data comes out. This works until:

  • Source schemas change unexpectedly
  • Business rules evolve mid-flight
  • You need to debug why a number doesn't match

Medallion treats data quality as a journey with checkpoints:

  • Bronze: Capture raw data, preserve source fidelity, enable forensics
  • Silver: Apply cleaning rules, handle nulls, enforce types, deduplicate
  • Gold: Apply business logic, aggregate, optimize for consumption

When something looks wrong in a dashboard, I can trace it back: Gold aggregation issue? Silver cleaning bug? Bronze ingestion problem? Source system change?

Schema Evolution Without Pain

Retail data is messy. At GAP, we ingest from:

  • Point-of-sale systems (transactions, returns, voids)
  • Inventory management (stock levels, transfers, adjustments)
  • E-commerce platforms (orders, carts, sessions)
  • Marketing systems (campaigns, promotions, customer segments)

Each source has its own schema, its own quirks, and its own tendency to change without warning.

Bronze layer handles this by using schema-on-read. We ingest data as-is, often as JSON or with permissive schemas. When a source adds a field, Bronze doesn't break.

Silver layer applies schema enforcement. We handle nulls, type conversions, and missing fields. When a source changes, we update Silver logic — not ingestion.

Schema Evolution Tip

Use Delta Lake's schema evolution features (mergeSchema, overwriteSchema) carefully. Enable mergeSchema for Silver tables that need to handle new columns gracefully. But for Gold tables serving BI tools, schema stability is more important than flexibility.
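As a minimal sketch of that distinction, assuming silver_ready and gold_daily are already-prepared DataFrames and the table names are illustrative:

# Silver append that tolerates new columns arriving from the source
(silver_ready.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("silver.sales_transactions"))

# Gold write stays strict: no mergeSchema, so an unexpected column fails fast
(gold_daily.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("gold.fact_daily_sales"))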

Performance at Each Layer

Different layers have different performance profiles:

Bronze: Optimized for write throughput. We use append-only patterns, minimal transformation, and Auto Loader for streaming ingestion. The goal is getting data in fast.

Silver: Optimized for incremental processing. We use Delta Lake MERGE for CDC patterns, partitioning by date, and Z-ordering on common filter columns. The goal is efficient updates.

Gold: Optimized for read performance. We use aggressive aggregation, materialized views, and Snowflake for serving. The goal is fast queries for dashboards and reports.

Layer | Optimization Goal | Key Techniques | Typical Consumers
Bronze | Write throughput | Append-only, Auto Loader, minimal transforms | Silver layer pipelines
Silver | Incremental updates | MERGE, partitioning, Z-ordering, CDC | Gold layer, data scientists
Gold | Read performance | Aggregations, materialized views, caching | BI tools, reports, APIs

Governance and Auditability

In retail, we deal with financial data (sales, revenue), customer data (transactions, preferences), and inventory data (stock counts, valuations). All of this requires:

  • Lineage: Where did this number come from?
  • Auditability: What transformations were applied?
  • Access Control: Who can see what?

Medallion layers provide natural governance boundaries:

  • Bronze: Restricted access, raw data retention for compliance
  • Silver: Broader access, curated for analysis
  • Gold: Business-user access, aggregated and anonymized where needed

We use Unity Catalog to enforce these boundaries, with data lineage tracked automatically through Delta Lake.

🔑

Medallion Architecture provides progressive quality, schema evolution handling, layer-specific optimization, and natural governance boundaries. For enterprise data with multiple sources and complex requirements, it's the most maintainable pattern I've found.


The Bronze Layer: Raw Data Ingestion

Bronze is your data's safety net. The goal is simple: capture everything from source systems with minimal transformation, preserving the original structure for debugging and reprocessing.

Design Principles

Preserve Source Fidelity: Don't transform data in Bronze. If the source sends null, store null. If the source sends malformed JSON, store malformed JSON. Bronze is a historical record, not a cleaning layer.

Add Ingestion Metadata: While we don't transform source data, we add metadata: ingestion timestamp, source system, batch ID, file name. This enables debugging and incremental processing.

Use Schema-on-Read: For sources with unstable schemas, use permissive schemas or JSON storage. Let Silver handle schema enforcement.

Optimize for Writes: Bronze tables are write-heavy. Use append-only patterns, partition by ingestion date, and avoid expensive operations like MERGE.
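
Here's a minimal sketch of these principles with Auto Loader, assuming illustrative ADLS paths and table names; the _metadata.file_path column requires a reasonably recent Databricks runtime:

from pyspark.sql import functions as F

# Read new files incrementally from the landing zone (schema-on-read)
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/sales")  # illustrative path
    .load("/mnt/landing/sales/")                                        # illustrative path
    # Ingestion metadata only -- source columns pass through untouched
    .withColumn("_ingestion_timestamp", F.current_timestamp())
    .withColumn("_source_system", F.lit("pos_sql_server"))
    .withColumn("_source_file", F.col("_metadata.file_path"))
)

# Append-only write into the Bronze Delta table
(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/sales")     # illustrative path
    .trigger(availableNow=True)
    .toTable("bronze.sales_transactions"))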

Our ADF Pipeline Patterns

At GAP, we ingest from on-premises SQL Server databases into ADLS Gen2 using Azure Data Factory. Here's what improved our ingestion speed by 30%:

Parallel Copy Activities: We use dynamic pipelines that parallelize across tables. Instead of sequential copies, we batch tables by size and run them concurrently.

Incremental Extraction: For large tables, we use watermark columns (usually modified_date) to extract only changed records. This reduces data volume and source system load.

Change Data Capture: Where available, we use SQL Server CDC to capture inserts, updates, and deletes. This feeds directly into our Delta Lake MERGE patterns in Silver.

Staging Optimization: We land data in Parquet format with snappy compression. This balances write speed with downstream read performance.

Source SQL Server → ADF Copy Activity → ADLS Gen2 (Parquet) → Databricks Auto Loader → Bronze Delta Table
Common Bronze Mistake

Don't over-engineer Bronze. I've seen teams add validation, cleaning, and transformation logic to their ingestion layer. When the source system changes or business rules evolve, everything breaks. Keep Bronze dumb and fast — let Silver be smart.

Handling Diverse Sources

Retail data comes from everywhere:

Databases (SQL Server, Oracle): Use ADF or Databricks JDBC connectors. Extract with watermarks for incremental loads.

Files (CSV, JSON, XML): Land in ADLS, use Auto Loader for streaming ingestion into Bronze. Handle schema inference carefully.

APIs (REST, webhooks): Use Azure Functions or Logic Apps to capture payloads, land in ADLS, process with Auto Loader.

Streaming (Kafka, Event Hubs): Use Structured Streaming with writeStream to Bronze Delta tables. Handle late-arriving data with watermarks.

The key is consistency: regardless of source type, Bronze tables have the same structure — source data plus ingestion metadata.

🔑

Bronze is about capture, not cleanup. Preserve source fidelity, add ingestion metadata, optimize for writes, and let Silver handle the complexity. When something goes wrong downstream, Bronze is your forensic evidence.


The Silver Layer: Cleansed and Conformed

Silver is where the real data engineering happens. This layer transforms raw data into a cleansed, validated, and standardized form that's ready for analysis.

Design Principles

Single Source of Truth: Silver tables should be the definitive, cleansed version of each data entity. Duplicates removed, types enforced, relationships validated.

Incremental Processing: Use CDC patterns and Delta Lake MERGE to process only changed records. Full refreshes don't scale for 5TB+ datasets.

Schema Enforcement: Unlike Bronze's permissive schemas, Silver enforces strict types and constraints. Invalid records go to error tables for investigation.

Standardization: Apply naming conventions, date formats, currency conversions, and business key mappings. Silver data should be consistent across all sources.

CDC/Incremental Processing with Delta Lake

At GAP, most of our Silver tables use Delta Lake MERGE for incremental updates. Here's the pattern:

# Silver incremental update pattern
from delta.tables import DeltaTable
from pyspark.sql.functions import col, to_date, coalesce, lit

# Load Bronze changes since the last watermark
# (last_watermark is the high-water mark stored from the previous run)
bronze_changes = spark.read.table("bronze.sales_transactions") \
    .filter(col("_ingestion_timestamp") > last_watermark)

# Apply transformations
silver_ready = bronze_changes \
    .withColumn("transaction_date", to_date(col("trans_dt"))) \
    .withColumn("amount", col("amt").cast("decimal(18,2)")) \
    .withColumn("store_id", coalesce(col("store_id"), lit("UNKNOWN"))) \
    .dropDuplicates(["transaction_id"])

# MERGE into Silver
silver_table = DeltaTable.forName(spark, "silver.sales_transactions")
silver_table.alias("target") \
    .merge(silver_ready.alias("source"), "target.transaction_id = source.transaction_id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

Key optimizations:

  • Partition pruning: Partition Silver tables by date, filter on partition columns in MERGE conditions
  • Z-ordering: Apply Z-ORDER on frequently filtered columns (store_id, product_id)
  • Broadcast joins: For small dimension tables, use broadcast hints to avoid shuffles
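
As a sketch of the first two points, reusing silver_table and silver_ready from the example above and assuming the table is partitioned by transaction_date and the incoming batch only contains recent transactions:

# Restrict the MERGE to recent date partitions so Delta can prune files.
# Assumes the incoming batch only holds transactions from the last 7 days;
# older source rows would otherwise be inserted as duplicates.
silver_table.alias("target") \
    .merge(
        silver_ready.alias("source"),
        "target.transaction_date >= date_sub(current_date(), 7) "
        "AND target.transaction_id = source.transaction_id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

# Periodically co-locate data files on common filter columns
spark.sql("OPTIMIZE silver.sales_transactions ZORDER BY (store_id, product_id)")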

Schema Enforcement and Evolution

Silver is where we enforce schemas, but we also need to handle evolution gracefully:

Type Enforcement: Cast all columns to expected types. Invalid casts go to error tables.

Null Handling: Define explicit rules for each column — required vs. optional, default values for nulls.

New Columns: Use Delta Lake's mergeSchema option to handle new columns from sources. But review and validate — not all new columns belong in Silver.

Removed Columns: Don't remove columns from Silver immediately. Mark as deprecated, backfill with nulls, remove after downstream systems adapt.
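
A hedged sketch of the type-enforcement rule above (invalid casts routed to an error table), reusing bronze_changes from the earlier MERGE example; the error table name is illustrative:

from pyspark.sql import functions as F

# In Spark, a failed cast yields NULL rather than raising an error
typed = bronze_changes.withColumn("amount", F.col("amt").cast("decimal(18,2)"))

# Rows where the source value was present but the cast produced NULL are invalid
invalid = typed.filter(F.col("amt").isNotNull() & F.col("amount").isNull())
valid = typed.filter(F.col("amount").isNotNull() | F.col("amt").isNull())

# Quarantine invalid rows for investigation instead of dropping them silently
invalid.withColumn("_error_reason", F.lit("amt failed decimal(18,2) cast")) \
    .write.format("delta").mode("append") \
    .saveAsTable("silver.errors_sales_transactions")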

Pros
  • + Single source of cleansed truth for all downstream
  • + Reprocessable — can rebuild from Bronze when rules change
  • + CDC patterns enable efficient incremental processing
  • + Clear boundary for data quality enforcement
Cons
  • Additional storage and compute costs
  • Latency added by transformation layer
  • Requires careful schema evolution management
  • Complex for rapidly changing source systems

PySpark Transformations That Scale

When you're processing 5TB+ daily, small inefficiencies compound. Here's what we've learned:

Avoid UDFs When Possible: Native Spark functions are optimized and can be pushed down. UDFs serialize data to Python, killing performance.

Partition Strategically: Too few partitions = slow processing. Too many = small files problem. We aim for 100-200 partitions for most jobs, with partition sizes around 128MB.

Cache Wisely: Cache intermediate DataFrames that are reused multiple times. But unpersist when done — memory is finite.

Broadcast Small Tables: For joins with dimension tables under 100MB, broadcast to avoid shuffle.
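
Two of these optimizations in one short sketch, assuming silver.dim_store is a small lookup table (the table name is illustrative):

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

dim_store = spark.table("silver.dim_store")        # small dimension, well under 100MB
sales = spark.table("silver.sales_transactions")

# Native function instead of a Python UDF: stays in the JVM and can be optimized
enriched = (
    sales.withColumn("sale_month", F.date_format("transaction_date", "yyyy-MM"))
    # Broadcast the small dimension so the join avoids a full shuffle
    .join(broadcast(dim_store), "store_id", "left")
)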

The difference between a pipeline that runs in 20 minutes and one that runs in 2 hours is usually a few missing optimizations — broadcast joins, partition pruning, avoiding UDFs. Small changes, big impact at scale.

Jaswanth Malineni, Senior Azure Data Engineer, GAP
🔑

Silver is your data engineering workhorse — cleaning, validating, standardizing, and optimizing. Use CDC patterns for incremental processing, enforce schemas strictly, and invest in Spark optimization. Silver quality determines everything downstream.


The Gold Layer: Business-Ready

Gold is where data becomes business value. This layer contains aggregated, business-ready datasets optimized for specific consumption patterns — dashboards, reports, ML features, and APIs.

Design Principles

Business-Centric Modeling: Gold tables are organized around business concepts, not source systems. Think "daily_store_sales" not "pos_transactions_aggregated".

Consumption-Optimized: Pre-aggregate, pre-join, pre-filter. Gold should be fast to query without complex transformations.

Stable Interfaces: Gold schemas are contracts with consumers. Changes require versioning and communication.

Multiple Gold Tables: Different consumers need different views. It's normal to have multiple Gold tables from the same Silver sources.

Dimensional Modeling Patterns

For BI and reporting, we use dimensional modeling in Gold:

Fact Tables: Transactional events with measures. fact_daily_sales with store_id, product_id, date_id, units_sold, revenue.

Dimension Tables: Descriptive attributes. dim_store with store_id, store_name, region, format, open_date.

Star Schema: Facts reference dimensions via surrogate keys. Optimized for BI tool performance.

At GAP, our Gold layer includes:

  • gold.fact_daily_sales — Store-level daily sales aggregations
  • gold.fact_inventory_snapshots — End-of-day inventory positions
  • gold.dim_store — Store attributes and hierarchies
  • gold.dim_product — Product catalog with categories
  • gold.dim_date — Calendar dimension with fiscal periods
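
A hedged sketch of how a table like gold.fact_daily_sales can be built from Silver; the units column and the exact grain are assumptions:

from pyspark.sql import functions as F

fact_daily_sales = (
    spark.table("silver.sales_transactions")
    .groupBy("transaction_date", "store_id", "product_id")
    .agg(
        F.sum("units").alias("units_sold"),   # 'units' is an assumed Silver column
        F.sum("amount").alias("revenue"),
    )
)

(fact_daily_sales.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("transaction_date")
    .saveAsTable("gold.fact_daily_sales"))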

Analytics-Ready Datasets for Power BI/Tableau

Our BI team consumes Gold data through:

Direct Query: For real-time dashboards, Tableau connects directly to Databricks SQL endpoints querying Gold tables.

Snowflake Curated Datasets: For high-concurrency reporting, we replicate Gold tables to Snowflake. Better BI tool support, predictable performance.

Power BI Dataflows: For self-service analytics, we expose Gold tables through Power Query with pre-built transformations.

Key considerations for BI consumption:

  • Pre-aggregate: Don't make BI tools aggregate 5TB. Pre-compute at appropriate grain.
  • Denormalize selectively: Include key dimension attributes in facts to reduce joins.
  • Optimize for common filters: Z-ORDER on columns that BI users filter frequently.

AI/ML-Ready Feature Tables

Beyond BI, the Gold layer serves our data science team:

Feature Tables: Pre-computed features for ML models. gold.customer_features with recency, frequency, monetary values.

Training Datasets: Point-in-time correct snapshots for model training. Avoid leakage by using proper temporal joins.

Inference Inputs: Real-time feature serving for production models.

We use Databricks Feature Store to manage these, with Gold tables as the underlying storage.

Gold Layer Readiness Checklist
  • Tables organized around business concepts, not source systems
  • Pre-aggregated to appropriate grain for consumers
  • Schemas documented and versioned
  • Access controls aligned with data sensitivity
  • Refresh schedules meet business SLAs
  • Query performance validated with expected workloads
  • Lineage documented back to Silver/Bronze
  • Monitoring in place for data freshness and quality
🔑

Gold is business value realized. Model for consumers, optimize for queries, and maintain stable interfaces. Different consumers need different Gold tables — that's expected, not a design flaw.


Real Project: GAP Retail Analytics Pipeline

Let me walk you through the actual pipeline I built at GAP — from raw retail data to business dashboards.

The Challenge

When I joined, GAP had:

  • 5TB+ daily retail data: POS transactions, inventory movements, e-commerce orders
  • Legacy SSIS packages: Hundreds of them, some undocumented, many fragile
  • Multiple data silos: Different systems for stores, warehouses, and e-commerce
  • Growing analytics demand: Leadership wanted real-time inventory visibility and sales insights

The goal: build a unified data platform that could handle current scale and grow with the business.

The Architecture

[On-Prem SQL Server] → [Azure Data Factory] → [ADLS Gen2 Bronze]
                                                      ↓
                                           [Databricks Bronze Tables]
                                                      ↓
                                           [Databricks Silver Tables]
                                                      ↓
                                           [Databricks Gold Tables]
                                                      ↓
                                    [Snowflake] ← → [Power BI/Tableau]

Ingestion (ADF → Bronze):

  • ADF pipelines extract from on-prem SQL Server
  • Incremental extraction using watermarks
  • Land in ADLS Gen2 as Parquet
  • Auto Loader streams into Bronze Delta tables
  • Result: 30% faster ingestion than legacy SSIS

Transformation (Bronze → Silver):

  • PySpark jobs in Databricks
  • CDC patterns with Delta Lake MERGE
  • Schema enforcement and validation
  • Error handling with dead-letter tables
  • Result: 99% data reliability

Aggregation (Silver → Gold):

  • Dimensional modeling for BI
  • Pre-aggregated fact tables
  • Feature tables for ML
  • Result: Query times from minutes to seconds

Serving (Gold → Consumers):

  • Snowflake for high-concurrency BI
  • Databricks SQL for ad-hoc analysis
  • Feature Store for ML models
  • Result: Self-service analytics at scale

Performance Optimizations

Here's what moved the needle:

Partitioning Strategy:

  • Bronze: Partition by ingestion date (for retention management)
  • Silver: Partition by business date (for incremental processing)
  • Gold: Partition by report period (for query performance)

Z-Ordering:

  • Silver transactions: Z-ORDER BY (store_id, product_id)
  • Gold sales: Z-ORDER BY (date_id, store_id)
  • Reduced query times by 60% for common filters

Caching and Broadcast:

  • Cache Silver dimension lookups
  • Broadcast dim tables (< 100MB) in joins
  • Reduced shuffle in aggregation jobs

Cluster Right-Sizing:

  • Auto-scaling clusters for variable workloads
  • Photon-enabled clusters for SQL-heavy jobs
  • Spot instances for non-critical batch jobs

Results and Business Impact

After migrating to the Medallion Architecture:

  • 25% compute cost reduction: Efficient incremental processing vs. full refreshes
  • 30% faster ingestion: Parallelized ADF pipelines with optimized extraction
  • Real-time inventory visibility: From daily batches to hourly refreshes
  • Self-service analytics: Business users can explore Gold tables directly
  • ML-ready platform: Feature Store enables rapid model iteration

The biggest win wasn't technical — it was organizational. With Medallion Architecture, everyone knows where to find data, what quality to expect, and who owns what. That clarity is worth more than any performance optimization.

Jaswanth Malineni, Senior Azure Data Engineer, GAP
🔑

A real Medallion implementation is about orchestrating multiple components — ADF, Databricks, Delta Lake, Snowflake — into a coherent whole. The architecture is simple; the execution requires attention to detail at every layer.


Medallion + Delta Lake: The Perfect Pair

Medallion Architecture and Delta Lake were designed to work together. Delta Lake provides the transactional foundation that makes Medallion practical at scale.

ACID Transactions Across Layers

Traditional data lakes have a dirty secret: they're not transactional. A failed job can leave data in an inconsistent state. Concurrent writes can corrupt tables. There's no rollback.

Delta Lake solves this with ACID transactions:

  • Atomicity: Jobs succeed completely or fail completely. No partial writes.
  • Consistency: Schema enforcement ensures data integrity.
  • Isolation: Concurrent readers and writers don't interfere.
  • Durability: Committed changes persist through failures.

For Medallion, this means:

  • Bronze ingestion can fail mid-batch without corruption
  • Silver MERGE operations are atomic
  • Gold tables can be refreshed while users query

Time Travel for Debugging and Auditing

Delta Lake's time travel is a superpower for Medallion implementations:

Debugging: When a Gold metric looks wrong, query Silver as of yesterday:

SELECT * FROM silver.transactions VERSION AS OF 42
-- or
SELECT * FROM silver.transactions TIMESTAMP AS OF '2026-01-26'

Auditing: Demonstrate what data looked like at a specific point for compliance requirements.

Reprocessing: When business rules change, reprocess Gold from a historical Silver version.

Recovery: Accidentally deleted data? Restore from a previous version.

We retain 30 days of history for Silver tables (configurable via Delta Lake retention settings) — enough for most debugging and audit needs.
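
The retention window is driven by Delta table properties; a hedged sketch, noting that exact property names and defaults can vary by Delta/Databricks version:

# Keep roughly 30 days of transaction log history and of removed data files,
# which bounds how far back time travel can go on this table
spark.sql("""
    ALTER TABLE silver.sales_transactions SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 30 days'
    )
""")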

MERGE for CDC Patterns

Delta Lake MERGE is the engine behind our Silver incremental processing:

MERGE INTO silver.customers AS target
USING bronze.customers_changes AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.operation = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Key MERGE optimizations:

  • Partition filters: Always include partition columns in MERGE conditions
  • Predicate pushdown: Delta Lake pushes filters to file pruning
  • Optimized writes: Use optimizeWrite for better file sizes
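
For the optimized-writes point, a hedged sketch using Databricks-specific configuration keys (names may differ across runtimes):

# Encourage fewer, larger files from MERGE and other writes (Databricks settings)
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")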

Unity Catalog for Governance

Unity Catalog brings enterprise governance to the lakehouse:

Data Lineage: Automatic tracking of how data flows through Bronze → Silver → Gold.

Access Control: Fine-grained permissions at catalog, schema, and table levels.

Data Discovery: Searchable catalog of all lakehouse assets.

Audit Logging: Who accessed what, when, from where.

We use Unity Catalog to enforce that:

  • Bronze tables are accessible only to data engineering
  • Silver tables are accessible to data engineering and data science
  • Gold tables are accessible to business analysts and BI tools
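
In Unity Catalog SQL, those boundaries translate to grants along these lines; the catalog, schema, and group names are illustrative:

# Illustrative catalog ("retail"), schemas, and workspace groups
spark.sql("GRANT USE CATALOG ON CATALOG retail TO `data_engineers`")
spark.sql("GRANT USE SCHEMA, SELECT, MODIFY ON SCHEMA retail.bronze TO `data_engineers`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA retail.silver TO `data_scientists`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA retail.gold TO `business_analysts`")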
🔑

Delta Lake transforms Medallion from a design pattern into a reliable, transactional, auditable data platform. ACID transactions, time travel, MERGE operations, and Unity Catalog governance are the features that make Medallion work at enterprise scale.


Medallion vs Traditional ETL

I've worked with both traditional ETL (SSIS, Informatica) and modern lakehouse patterns. Here's an honest comparison.

Dimension | Traditional ETL | Medallion Lakehouse
Data Storage | Data warehouse (relational) | Data lake (object storage + Delta)
Processing | Extract-Transform-Load | Extract-Load-Transform (ELT)
Schema | Schema-on-write, strict | Schema-on-read (Bronze), schema-on-write (Silver/Gold)
Scalability | Vertical (bigger warehouse) | Horizontal (more compute nodes)
Cost Model | Storage + compute bundled | Storage + compute separate
Flexibility | Structured data focus | Any data format (JSON, Parquet, etc.)
Reprocessing | Complex, often requires re-extraction | Reprocess from Bronze without re-ingesting

When to Choose Medallion

  • High data volume: 5TB+ datasets benefit from distributed processing
  • Multiple sources: Diverse data formats and schemas
  • Evolving requirements: Business logic changes frequently
  • ML/AI workloads: Need raw data access for feature engineering
  • Cost sensitivity: Separate storage and compute for optimization

When to Keep Traditional ETL

  • Small data volumes: < 100GB where complexity isn't justified
  • Simple requirements: Single source, well-defined transformations
  • Existing investment: Mature ETL platform with institutional knowledge
  • Regulatory constraints: Industry-specific tools required for compliance

Migration from SSIS: Our Experience

At GAP, we migrated hundreds of SSIS packages to Databricks. Here's what we learned:

Don't Lift and Shift: SSIS packages encode assumptions (sequential processing, single-node) that don't apply. Redesign for distributed processing.

Map to Medallion Layers: Most SSIS packages combined ingestion and transformation. Separate into Bronze (ingestion) and Silver (transformation).

Embrace Incremental: SSIS often did full refreshes because change tracking was complex. Delta Lake MERGE makes incremental processing straightforward.

Document Business Logic: Many SSIS packages contained undocumented business rules in Script Tasks. Extract and document before migrating.

Parallel Migration: Don't try to migrate everything at once. Run old and new pipelines in parallel, validate output, switch over gradually.

Result: 25% compute cost reduction, faster development cycles, and happier data engineers.

🔑

Medallion isn't universally better than traditional ETL — it's better for high-volume, complex, evolving data environments. Migration is a redesign, not a port. Take the opportunity to clean up accumulated technical debt.


Common Mistakes in Medallion Implementation

After building Medallion architectures across multiple organizations, here are the mistakes I see most often.

Medallion Architecture Mistakes

  • Over-engineering Bronze layer with transformations that belong in Silver
  • Skipping Silver and going directly from Bronze to Gold
  • Creating too many Gold tables without clear ownership
  • Ignoring data quality until it's a crisis
  • Not planning for reprocessing from the start
  • Using full refreshes when incremental is possible
  • Neglecting monitoring and alerting
  • Treating schemas as static when sources evolve

Mistake Deep Dive: Over-Engineering Bronze

I've seen teams add validation, cleaning, and even business logic to their Bronze layer. The reasoning: "We should clean data as soon as possible."

Why it fails: When source systems change — and they always do — your Bronze layer breaks. Now you can't even ingest data, let alone transform it.

The fix: Bronze should be dumb. Capture data as-is, add metadata, write fast. All transformation logic belongs in Silver where it can be updated independently of ingestion.

Mistake Deep Dive: Skipping Silver

Some teams argue: "We have Bronze (raw) and Gold (business-ready). Why do we need Silver in between?"

Why it fails: Without Silver, Gold tables contain both cleaning logic AND business logic. When either changes, everything changes. You also lose the ability to have multiple Gold tables from the same cleansed source.

The fix: Silver is the cleansed, validated, entity-level truth. Gold is business aggregations. They serve different purposes and change for different reasons.

Mistake Deep Dive: Schema Complacency

Many teams design schemas once and treat them as static. Then a source system adds a field, removes a field, or changes a type.

Why it fails: Rigid schemas break on source changes. Overly flexible schemas (everything as STRING) create quality nightmares.

The fix:

  • Bronze: Permissive schemas that tolerate changes
  • Silver: Schema enforcement with explicit handling for nulls, type mismatches, new columns
  • Gold: Stable schemas with versioning for breaking changes
  • Monitor for schema drift and alert before it becomes a crisis
The Reprocessing Trap

If you can't reprocess Silver from Bronze without re-ingesting from sources, you've missed a key benefit of Medallion. Always retain enough Bronze history to rebuild Silver when business rules change.
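
A hedged sketch of what that rebuild looks like when rules change, reusing the Silver transformation from the earlier example; transform_to_silver is a hypothetical helper wrapping those rules:

# Re-read the full Bronze history instead of only the latest watermark window
bronze_full = spark.read.table("bronze.sales_transactions")

# Re-apply the (updated) Silver transformation logic to the full history
silver_rebuilt = transform_to_silver(bronze_full)   # hypothetical helper

# Replace Silver atomically; Delta keeps the previous version for rollback via time travel
(silver_rebuilt.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("silver.sales_transactions"))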

🔑

Most Medallion failures come from violating layer responsibilities: Bronze tries to do Silver's job, Silver gets skipped, Gold becomes a dumping ground. Respect the layers, and the architecture works.


Key Takeaways: Medallion Architecture

  1. Medallion Architecture organizes data into Bronze (raw), Silver (cleansed), and Gold (business-ready) layers with progressive quality improvement
  2. Bronze is about capture — preserve source fidelity, add metadata, optimize for writes
  3. Silver is about quality — cleanse, validate, standardize, use CDC patterns for incremental processing
  4. Gold is about value — business-centric modeling, consumption-optimized, stable interfaces
  5. Delta Lake provides the transactional foundation — ACID transactions, time travel, MERGE operations
  6. Unity Catalog adds governance — lineage, access control, data discovery
  7. The pattern shines for high-volume, multi-source, complex enterprise data — not every use case needs it
  8. Respect layer responsibilities — most failures come from Bronze doing Silver's job or Silver getting skipped

Frequently Asked Questions

What is Medallion Architecture?

Medallion Architecture is a data design pattern that organizes a lakehouse into three layers: Bronze (raw ingestion), Silver (cleansed and validated), and Gold (aggregated and business-ready). Each layer progressively improves data quality while maintaining lineage and enabling reprocessing. It's the de facto standard for Azure Databricks and Delta Lake implementations.

Why is it called Bronze, Silver, Gold?

The names reflect data quality progression, like refining precious metals. Bronze is raw ore (data as-is from sources). Silver is refined metal (cleansed, deduplicated, standardized). Gold is the finished product (business-ready aggregations). This naming makes the purpose of each layer clear to both technical and business stakeholders.

Is Medallion Architecture only for Databricks?

No, though it's most commonly associated with Azure Databricks and Delta Lake. The pattern can be implemented with any lakehouse technology: AWS Lake Formation, Google BigQuery, Snowflake, or open-source tools like Apache Spark with Iceberg or Hudi. The concepts of progressive data quality layers apply regardless of tooling.

How does Medallion Architecture differ from Lambda Architecture?

Lambda Architecture separates batch and real-time processing into different paths, then merges results. Medallion Architecture focuses on data quality layers, not processing timing. You can implement real-time Medallion (streaming Bronze → Silver → Gold) or batch Medallion. Many teams combine concepts, using streaming for Bronze ingestion and batch for Gold aggregations.

What goes in Bronze vs Silver vs Gold?

Bronze: Raw data as-is from sources, plus ingestion metadata (timestamp, source, batch ID). No transformations. Silver: Cleansed, validated, deduplicated data at entity level. Schema enforcement, null handling, type conversions. Gold: Business aggregations optimized for consumption. Dimensional models, feature tables, pre-computed metrics.

How do I handle schema changes in Medallion?

Bronze uses permissive schemas (schema-on-read) to absorb source changes. Silver handles evolution explicitly — new columns added with mergeSchema, removed columns deprecated with nulls before removal. Gold maintains stable interfaces with versioning for breaking changes. Monitor for schema drift and update Silver logic when sources change.

When should I NOT use Medallion Architecture?

Skip Medallion for simple, single-source pipelines where the overhead isn't justified. Also consider alternatives for sub-second latency requirements (streaming-first architectures) or when data is already clean and well-structured from sources (two-layer raw + curated might suffice). Medallion adds complexity that's only justified for enterprise-scale, multi-source data.

Sources & References

  1. Medallion Architecture, Databricks (2026)
  2. What is a Lakehouse?, Databricks (2020)
  3. Delta Lake Documentation, Delta Lake (2026)
  4. Azure Databricks Best Practices, Microsoft (2026)
  5. Unity Catalog Documentation, Databricks (2026)
  6. Azure Data Factory Documentation, Microsoft (2026)
  7. Spark SQL Performance Tuning, Apache Spark (2026)
  8. The Data Warehouse Toolkit, Ralph Kimball (2013)
  8. The Data Warehouse ToolkitRalph Kimball (2013)