From Medicaid Claims Code to Enterprise Data Warehouse: How Vikram Reddy Built Healthcare Data Management from Both Sides


Feb 27, 2026

I was three years into writing C# code for Florida's Medicaid claims system when I realized something that changed the direction of my career.

Every reporting team pulling data from my application was producing different numbers for the same patients. The claims team had one count. The compliance team had another. The finance team had a third. All pulling from systems I had helped build. None of them wrong, exactly. But none of them agreeing.

The problem wasn't in the application. The problem was that patient data lived in five different systems — SQL Server databases, third-party APIs, a CRM platform, an ERP system, and flat files from state agencies — and nobody had built the layer that makes them talk to each other.

Three years later, I was the one building that layer. Same system. Different company. Completely different side of the stack.

I spent seven years inside FLMMIS — the Florida Medicaid Management Information System — serving 5 million beneficiaries. First as a .NET developer at Hewlett Packard Enterprise, writing the code that processes claims, eligibility, and provider enrollment. Then as a data analyst at DXC Technology, designing the enterprise data warehouse and ETL pipelines that turn that raw data into compliance reports and executive dashboards.

Most data professionals see one side. I built both.

This is what healthcare data management actually looks like from the inside — the integration nightmares, the architectural decisions, and the mistakes I've watched break projects across healthcare and insurance.

Expert Insight by

Vikram Reddy

Data Analyst / Power BI Developer at AIG Insurance
Healthcare & Insurance Data Analytics

Vikram Reddy has 8+ years of experience designing enterprise data systems across healthcare and insurance. He spent 7 years inside the Florida Medicaid Management Information System (FLMMIS) — first as a .NET developer at Hewlett Packard Enterprise building claims processing applications, then as a data analyst at DXC Technology designing the star schema data warehouse and ETL pipelines that power Medicaid reporting for 5M+ beneficiaries. He now leads Power BI development and data integration at AIG Insurance, where he unified SQL Server, SAP BW, Oracle, and Azure SQL into centralized enterprise analytics and improved reporting performance by 40% through data model restructuring.

Quick Answers (TL;DR)

What is healthcare data management?

Healthcare data management is the practice of integrating, structuring, and governing clinical and administrative data from multiple source systems — EHRs, claims systems, CRMs, ERPs, and flat files — into a unified data warehouse that supports compliance reporting, analytics, and executive decision-making. It requires dimensional modeling (typically star schema), ETL pipeline orchestration, data quality enforcement, and cross-system identity resolution for entities like providers, patients, and facilities.

Why is healthcare data integration so difficult?

Healthcare data integration is uniquely hard because source systems use different schemas, different update frequencies, and different definitions for the same entities. "Provider ID" may mean an NPI number in one system, an internal code in another, and a facility identifier in a third. Conformed dimensions that reconcile these identities across systems are the single most important — and most neglected — component of healthcare data warehouses.

What is the best data warehouse design for healthcare?

Star schema is the most effective design for healthcare data warehouses that serve compliance reporting and analytics dashboards. It delivers faster query performance than third normal form (3NF) because fact tables pre-join to dimension tables, eliminating complex runtime joins. Snowflake schema works better for deeply hierarchical provider or facility data that requires drill-down analysis. Most enterprise healthcare warehouses use star schema for claims and encounters with snowflake extensions for provider hierarchies.

What tools are used for healthcare data management?

Enterprise healthcare data management typically uses a combination of ETL tools (SSIS, Informatica PowerCenter, IBM DataStage, Azure Databricks), databases (SQL Server, Teradata, Oracle, Azure SQL), and BI platforms (Power BI, SSRS, Tableau). The tool matters less than the data model underneath — a well-designed star schema with conformed dimensions will outperform a poorly modeled warehouse on any platform.

The FLMMIS Arc: Same System, Both Sides


FLMMIS — the Florida Medicaid Management Information System — is one of the largest state-level healthcare data systems in America. It handles eligibility determination, claims adjudication, provider enrollment, and reporting for over 5 million Medicaid beneficiaries. Every doctor visit, every prescription fill, every hospital admission for a Medicaid patient in Florida flows through this system.

I spent seven years inside it. Not as a consultant passing through. Not as a vendor running demos. As someone who wrote the code and then designed the data architecture.

Chapter 1: Building the Machine (HPE, 2016–2019)

At Hewlett Packard Enterprise, I was a .NET developer on the FLMMIS platform. ASP.NET MVC applications. C# business logic. RESTful Web APIs handling claims processing and provider enrollment. Entity Framework talking to SQL Server on the backend. Stored procedures, triggers, performance-tuned views — the full stack that makes a Medicaid claims system run.

What I didn't fully understand at the time was where the data went after my application created it.

I'd write code that validated a claim, applied business rules, and committed the result to SQL Server. Job done. But downstream, five different teams were pulling that data into five different reporting systems, applying five different transformation rules, and producing five different versions of the truth.

The data wasn't wrong. It was fragmented. And fragmentation in healthcare data doesn't just mean inconsistent dashboards — it means compliance reports that regulators question.

Chapter 2: Building the Brain (DXC, 2020–2023)

When I joined DXC Technology in 2020, I switched from building the application to building the analytics layer for the same FLMMIS ecosystem. Now I was on the receiving end of the data my previous code had generated.

The difference was immediate. I could look at a claims table and understand not just what the columns meant, but how the application populated them, which business rules affected which fields, and where the edge cases lived. Most data analysts inherit a schema and read the documentation. I had written the business logic that created the schema.

Most data warehouse architects study the schema. I wrote the code that creates the schema. When I designed fact tables for Medicaid claims, I knew which fields were populated by which business rules, which ones had edge cases the documentation didn't mention, and which ones would break if a specific provider type submitted a claim in a non-standard format.

Vikram Reddy, Data Analyst, DXC Technology (FLMMIS)

That dual perspective shaped how I designed the data warehouse. I built star schema dimensional models with conformed dimensions and scalable fact tables, using Informatica PowerCenter and SSIS for ETL orchestration against a Teradata backend. The warehouse had to ingest from SQL Server, third-party APIs, CRM and ERP systems, and state-agency flat files.

The result was a unified data layer that powered SSRS dashboards for Medicaid analytics, compliance reporting, and stakeholder visibility. One version of the truth, not five.

Key Takeaway

Seeing both sides of a healthcare data system — the application that creates the data and the warehouse that analyzes it — is the single biggest advantage a data professional can have. Most integration problems exist because the data architect doesn't understand how the application populates the fields. Most application developers don't know how their schema will be consumed downstream.

What Is Healthcare Data Management?


Job descriptions make healthcare data management sound clean: "integrate clinical and administrative data sources to support reporting and analytics." Eleven words. Eleven words that skip over the hardest parts.

Here's what it actually means when you're staring at five source systems in a Medicaid context:

  • 30% of U.S. healthcare spending is attributed to administrative complexity (Shrank et al., JAMA, 2019)
  • 5M+ Florida Medicaid beneficiaries are served by FLMMIS (Florida Agency for Health Care Administration)
  • 97% of hospitals use certified EHR systems (ONC Health IT Dashboard)

Healthcare data management means taking data that was designed to support individual applications — a claims processing system, a provider enrollment portal, an eligibility determination engine — and making it behave as if it were designed for analytics. It wasn't. It was designed for transactions.

Transaction systems and analytics systems want completely different things. A claims processing application wants normalized tables, referential integrity, and fast single-record lookups. An analytics dashboard wants denormalized star schemas, pre-aggregated measures, and fast scans across millions of rows. You cannot serve both from the same schema.

Healthcare Data Management

Healthcare data management is the practice of integrating clinical, administrative, and financial data from multiple transaction-oriented source systems (EHRs, claims engines, CRMs, ERPs, and external feeds) into a unified analytical data warehouse. It requires dimensional modeling for query performance, ETL pipeline orchestration for data movement, conformed dimensions for cross-system identity resolution, and data quality enforcement to ensure compliance reporting accuracy.

The gap between what job descriptions say and what the work looks like is enormous. Nobody tells you that the "integration" part means spending weeks reconciling how "Provider" is defined across five systems. Nobody mentions that the "reporting" part means building SSRS dashboards that regulators will scrutinize line by line. Nobody warns you that the "data quality" part means building validation gates into your ETL pipeline because one bad date format from a flat file can cascade into a compliance violation.

Key Takeaway

Healthcare data management is not data storage. It is the translation layer between transaction systems that were designed for individual operations and analytics systems that need a unified, query-optimized view of millions of records. The hardest part is never the tooling. It is reconciling the differences between source systems that were never designed to agree with each other.

Integration Hell: When "Provider ID" Means Five Different Things


The Medicaid data warehouse at DXC had to ingest from five source categories: SQL Server databases running the core claims application, third-party APIs from state agencies, a CRM system for provider relationships, an ERP system for financial data, and flat files from external entities.

Each source had its own schema. Its own update frequency. Its own definition of every entity.

The worst offender was "Provider."

The Provider ID Problem in Healthcare Data
  • System A uses NPI (National Provider Identifier) — a 10-digit CMS-issued number
  • System B uses an internal provider code assigned during enrollment
  • System C stores the facility TIN (Tax Identification Number) as the provider identifier
  • System D maps providers by DEA number for pharmacy claims
  • Flat files from the state use a legacy code that predates the NPI system entirely

Five systems, five definitions of "Provider," and every one of them was correct within its own context. The problem was that a compliance report might need to show "claims by provider" — and the answer depends entirely on which definition of "provider" you use.

This is why conformed dimensions exist. A conformed dimension is a single, authoritative reference table that reconciles all the different identifiers into one canonical record. Every provider in the warehouse maps to one row in the conformed dim_provider table, regardless of which source system the data came from.

Building that conformed dimension took weeks. Mapping NPI to internal codes to TINs to DEA numbers to legacy state codes required cross-referencing five lookup tables, handling many-to-many relationships (one NPI can map to multiple TINs), and building exception-handling logic for records that didn't match anything.
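
The mapping logic can be sketched in a few lines. This is a minimal illustration with made-up identifiers, and the hard-coded cross-reference dictionary stands in for the curated mapping tables; the real pipeline ran in Informatica PowerCenter and SSIS, not Python.

```python
# Sketch of conformed-dimension identity resolution (hypothetical data).
# Each source system identifies the same provider differently; the goal is
# one canonical surrogate key per provider, with unmatched records routed
# to an exception queue instead of being silently dropped.

# Cross-reference: (identifier type, value) -> canonical provider key.
# In a real warehouse this comes from curated mapping tables, not literals.
XREF = {
    ("NPI", "1234567890"): "PROV-001",
    ("INTERNAL", "FL-98765"): "PROV-001",
    ("TIN", "59-1234567"): "PROV-001",
    ("DEA", "AB1234563"): "PROV-002",
}

def resolve_provider(id_type: str, id_value: str):
    """Return the canonical key, or None if the identifier is unmapped."""
    return XREF.get((id_type, id_value))

def conform(records):
    matched, exceptions = [], []
    for rec in records:
        key = resolve_provider(rec["id_type"], rec["id_value"])
        if key is None:
            exceptions.append(rec)  # routed to human review, never dropped
        else:
            matched.append({**rec, "provider_key": key})
    return matched, exceptions

claims = [
    {"claim_id": 1, "id_type": "NPI", "id_value": "1234567890"},
    {"claim_id": 2, "id_type": "TIN", "id_value": "59-1234567"},
    {"claim_id": 3, "id_type": "LEGACY", "id_value": "X-00017"},
]
matched, exceptions = conform(claims)
# Claims 1 and 2 resolve to the same canonical provider; claim 3 carries an
# unmapped legacy code and lands in the exception queue.
```

The exception queue is the important design choice: a record that matches nothing is a data quality signal, not noise to discard.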

The Day the Date Format Broke Everything

ETL pipelines are fragile. Not because the tools are bad — SSIS and Informatica PowerCenter are robust platforms. They're fragile because they depend on source systems behaving consistently. And source systems don't always cooperate.

One of the state-agency flat files we ingested daily changed its date format from MM/DD/YYYY to YYYY-MM-DD. No warning. No documentation update. The Informatica workflow that parsed that file ran on schedule, processed the records, and loaded dates into the staging table as garbage.

We caught it the next morning when the compliance dashboard showed zero claims for the previous day. Zero. The ETL pipeline didn't fail — it successfully loaded corrupted data. That's worse than a failure, because a failure triggers an alert. Silent corruption triggers nothing until someone looks at the output.

After that, I built validation gates into every ETL pipeline. Before any data moves from staging to the warehouse, automated checks verify format consistency, null rates, referential integrity, and value range constraints. If any check fails, the pipeline stops and sends an alert. The data stays in staging until a human investigates.
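
A gate of this kind can be sketched as follows. The checks and the null-rate threshold are illustrative values, not the actual FLMMIS gates; the point is the shape: the function returns a list of errors, and the batch is promoted only when that list is empty.

```python
from datetime import datetime

# Sketch of a staging-to-warehouse validation gate (illustrative checks).
DATE_FMT = "%m/%d/%Y"   # the format agreed with the source system

def check_batch(rows, max_null_rate=0.02):
    errors = []
    # Format consistency: every service date must parse in the agreed format.
    for i, row in enumerate(rows):
        try:
            datetime.strptime(row["service_date"], DATE_FMT)
        except (ValueError, KeyError):
            errors.append(f"row {i}: unparseable date {row.get('service_date')!r}")
    # Null-rate threshold on a required field.
    nulls = sum(1 for r in rows if not r.get("provider_id"))
    if rows and nulls / len(rows) > max_null_rate:
        errors.append(f"provider_id null rate {nulls / len(rows):.1%} over threshold")
    return errors   # an empty list means the batch may be promoted

good = [{"service_date": "02/27/2026", "provider_id": "P1"}]
bad  = [{"service_date": "2026-02-27", "provider_id": "P1"}]  # silent format drift
```

With a gate like this, the YYYY-MM-DD drift described above fails loudly in staging instead of loading as garbage.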

The Silent Corruption Rule

ETL failures that crash are manageable — they trigger alerts, they're visible, they get fixed. ETL pipelines that succeed while loading corrupted data are dangerous. Build validation gates between staging and production that catch format changes, null spikes, and referential integrity breaks before bad data reaches your warehouse.

Key Takeaway

Healthcare data integration is hard because the real challenge is not moving data between systems — it is reconciling what the data means across systems. "Provider ID" means five different things in five different systems. Conformed dimensions are the unglamorous, essential solution. Build them early, validate them constantly, and never trust that source systems will behave consistently.

The 40% Fix: Why Healthcare Data Problems Are Never Hardware Problems


When I joined AIG Insurance in 2024, one of the first problems I inherited was reporting performance. Executives were waiting minutes for Excel reports to refresh from Power BI datasets. The IT team had already scoped a server upgrade. New hardware, more RAM, faster storage. The estimate was six figures.

I asked to look at the Power BI data model first.

What I found was a model with dozens of table relationships, many of them unnecessary. Tables that should have been star-schema dimensions were connected in circular patterns. Measures that should have been calculated at the fact-table level were computed at query time across multiple joins. The model was doing ten times more work than it needed to for every single report refresh.

The fix wasn't hardware. It was architecture.

Every time someone says 'we need better hardware,' check the data model first. In eight years across healthcare and insurance, I have never seen a reporting performance problem that was actually a hardware problem. It was always a data model problem.

Vikram Reddy, Data Analyst / Power BI Developer, AIG Insurance

I restructured the backend Power BI data model. Removed circular relationships. Collapsed redundant lookup tables into proper star-schema dimensions. Moved expensive calculations from query-time DAX to pre-computed columns where appropriate. Re-indexed the underlying SQL Server views.

Reports ran 40% faster. No server upgrade. No new hardware. The same data, the same reports, the same Power BI Service deployment — just a model that stopped fighting itself.

At AIG, I also integrated SQL Server, SAP BW, Oracle, and Azure SQL into one centralized enterprise BI platform. Four different database engines, four different query languages, four different data models — all feeding into Power BI dashboards that executives treat as a single source of truth. The integration pattern was the same one I'd learned at DXC: conformed dimensions, star schema design, ETL pipelines that validate before they load.

Key Takeaway

When reports are slow, check the data model before approving a hardware upgrade. Circular relationships, missing indexes, and query-time calculations that should be pre-computed are responsible for the majority of BI performance problems. Architecture fixes are free. Server upgrades are not.

Star Schema vs Snowflake: The Decision That Changes Everything

Share to save for later

When I designed the Medicaid data warehouse at DXC, the first architectural decision was schema design. Star schema or snowflake. It's the kind of decision that sounds academic until you're sitting in front of a Teradata instance that needs to process millions of Medicaid claims records and produce compliance reports before the morning shift starts.

Star Schema in Healthcare Data Warehouses

A star schema organizes data into a central fact table (e.g., claims, encounters, prescriptions) surrounded by denormalized dimension tables (e.g., provider, patient, facility, date). By pre-joining dimension attributes into flat tables, star schema eliminates complex runtime joins and delivers the fast scan performance that healthcare analytics dashboards and compliance reports require.

I chose star schema for Medicaid. The reasoning was specific, not theoretical:

Compliance reports run on tight deadlines. Medicaid reporting to CMS has fixed submission windows. The queries behind those reports need to run fast and predictably. Star schema's denormalized dimensions mean fewer joins, which means more predictable query times.
Healthcare analysts write simple queries. The people consuming the data warehouse are business analysts and compliance officers, not database engineers. Star schema is intuitive: pick a fact table, filter by dimensions, aggregate. Snowflake requires understanding multiple levels of normalization, which increases the error rate in ad-hoc queries.
Teradata optimizes for star schema. Teradata's query optimizer is specifically designed to detect and optimize star join patterns. Building a snowflake schema on Teradata means giving up the platform's core strength.
Factor | Star Schema | Snowflake Schema
Query performance | Faster — fewer joins, denormalized dimensions | Slower — normalized dimensions require multi-level joins
Best for healthcare | Claims analytics, compliance reports, executive dashboards | Complex provider hierarchies, facility drill-downs, organizational rollups
Ease of use | Intuitive for business analysts and report writers | Requires deeper SQL knowledge to navigate normalized tables
Storage efficiency | Higher redundancy — dimension data is repeated | Lower redundancy — dimension data is normalized
ETL complexity | Simpler loads — dimensions are flat | More complex — must maintain referential integrity across normalized tables
Real example | FLMMIS Medicaid claims warehouse (Teradata) | AIG provider hierarchy for multi-level organizational reporting

The exception at AIG was provider hierarchies. Insurance providers have complex organizational structures: individual practitioners belong to group practices, which belong to networks, which belong to regional organizations. That kind of deeply hierarchical data is where snowflake schema actually earns its complexity — each level of the hierarchy is a separate normalized table, allowing drill-down queries without storing every combination.

In practice, most healthcare data warehouses use both. Star schema for the core fact tables (claims, encounters, prescriptions). Snowflake extensions for hierarchical dimensions (provider, facility, organizational structure). The mistake is treating it as an either/or decision.
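
To make the star pattern concrete, here is a minimal sketch using SQLite in place of Teradata, with made-up table contents. The query at the end is the canonical star-schema shape: join the fact table to its dimensions, filter on dimension attributes, aggregate the measure.

```python
import sqlite3

# Minimal star-schema sketch: SQLite stands in for Teradata, and the table
# contents are invented. One fact table, two flat dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_provider (provider_key INTEGER PRIMARY KEY, npi TEXT, name TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, cal_date TEXT, month TEXT);
CREATE TABLE fact_claims (
    claim_id     INTEGER PRIMARY KEY,
    provider_key INTEGER REFERENCES dim_provider(provider_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    paid_amount  REAL
);
INSERT INTO dim_provider VALUES (1, '1234567890', 'Clinic A'), (2, '9876543210', 'Clinic B');
INSERT INTO dim_date VALUES (20260201, '2026-02-01', '2026-02'), (20260202, '2026-02-02', '2026-02');
INSERT INTO fact_claims VALUES (1, 1, 20260201, 100.0), (2, 1, 20260202, 50.0), (3, 2, 20260201, 75.0);
""")

# The canonical star query: fact joined to dimensions, filtered on a
# dimension attribute, aggregated on the measure.
rows = con.execute("""
    SELECT p.name, SUM(f.paid_amount)
    FROM fact_claims f
    JOIN dim_provider p ON p.provider_key = f.provider_key
    JOIN dim_date d     ON d.date_key     = f.date_key
    WHERE d.month = '2026-02'
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
# rows == [('Clinic A', 150.0), ('Clinic B', 75.0)]
```

This is also why the design is intuitive for analysts: every ad-hoc question follows the same pick-fact, filter-dimension, aggregate pattern.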

Key Takeaway

Star schema is the right default for healthcare data warehouses. It delivers faster queries, simpler SQL, and better optimizer performance on platforms like Teradata. Use snowflake extensions only for deeply hierarchical dimensions like provider organizations or facility structures. The worst decision is no dimensional modeling at all — which is what most healthcare organizations are still running.

The ETL Stack Nobody Glamorizes


Nobody writes LinkedIn posts about SSIS packages. Nobody gives conference talks about Teradata FastLoad vs MultiLoad trade-offs. Nobody gets excited about Informatica PowerCenter mappings.

But these are the tools that keep healthcare data moving. Every Medicaid compliance report, every executive dashboard, every claims analysis depends on ETL pipelines that run overnight and load data before the first analyst logs in.

FastLoad vs MultiLoad: The Monday Morning Problem

On the Medicaid data warehouse, we used Teradata as the backend. Loading millions of records nightly required choosing between two utilities: FastLoad and MultiLoad. Both move data from source to target. They are not interchangeable.

FastLoad is faster. It bypasses the transactional log, streams data directly to the Access Module Processor, and can load millions of rows in minutes. The trade-off: it only works on empty tables. No updates. No upserts. No existing rows.

MultiLoad handles updates and mixed operations. It can insert, update, delete, and upsert. The trade-off: it's slower, and it has overhead for error handling and transaction management.

Pick the wrong one and the consequences are concrete. Use FastLoad on a table that has existing records, and the load fails. Use MultiLoad on a large initial load, and the batch window runs long. If your ETL batch window ends at 6 AM and the load finishes at 7 AM, your Monday morning compliance reports run on Saturday's data.

The Teradata Loading Rule

Use FastLoad for initial loads and full-refresh dimensions where the target table is truncated first. Use MultiLoad for incremental fact table loads and slowly changing dimensions that require upserts. Never assume one utility works for both patterns.
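
The rule is simple enough to encode. This hypothetical helper is not a Teradata API; it just captures the decision logic stated above.

```python
# Hypothetical helper encoding the Teradata loading rule. The name and
# signature are illustrative, not part of any real Teradata tooling.
def choose_load_utility(target_is_empty: bool, needs_upserts: bool) -> str:
    if needs_upserts:
        return "MultiLoad"   # only MultiLoad supports insert/update/delete/upsert
    if target_is_empty:
        return "FastLoad"    # fastest path, but valid only on empty tables
    # Appending to a populated table: FastLoad would fail outright.
    return "MultiLoad"
```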

SQL Profiler: Finding the Query That Ate 60% of ETL Time

At DXC, the Medicaid ETL pipeline was running close to its batch window limit. Not failing — just taking longer each month as data volume grew. I used SQL Profiler to trace every query in the pipeline and found a single stored procedure that consumed over 60% of total processing time. The procedure was doing a full table scan on a fact table with no index on the join column. Adding the right index dropped that procedure's runtime from 45 minutes to under 3. The entire pipeline gained back an hour of headroom.

Performance tuning is not glamorous. But that one index change is the reason the compliance team had fresh data on Monday mornings for the next two years.
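
The same class of fix can be demonstrated on any engine. The sketch below uses SQLite as a stand-in for the actual SQL Server and Teradata environment, with a made-up fact table: before the index, the plan scans the whole table; after, it seeks through the new index.

```python
import sqlite3

# Stand-in demonstration in SQLite (the production fix was on a Teradata-
# backed pipeline): an equality filter on an un-indexed column forces a
# full table scan; adding the index lets the optimizer seek instead.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE fact_claims (claim_id INTEGER PRIMARY KEY, provider_key INTEGER, paid REAL)"
)
q = "SELECT SUM(paid) FROM fact_claims WHERE provider_key = 42"

before = [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + q)]  # plan detail text
con.execute("CREATE INDEX ix_fact_provider ON fact_claims (provider_key)")
after = [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + q)]
# 'before' reports a SCAN of fact_claims; 'after' references ix_fact_provider.
```

On real volumes the difference is the 45-minutes-to-3 gap described above; the plan change is the same either way.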

The Shift to Azure Databricks

At AIG, the ETL paradigm is evolving. Azure Databricks processes high-volume datasets using distributed Spark clusters. IBM DataStage handles large-scale healthcare dataset transformations. The traditional SSIS packages still run, optimized with checkpoints and event handling for reliability.

The cloud isn't replacing on-premise ETL everywhere. It's supplementing it. Many healthcare organizations run hybrid stacks: SSIS for internal SQL Server pipelines, Informatica for cross-platform integration, and Databricks for the workloads that exceed what a single server can handle.

Key Takeaway

The ETL stack that powers healthcare data is not glamorous, but it is the difference between fresh data and stale data. FastLoad vs MultiLoad, index optimization, and batch window management are the skills that keep compliance reports running on time. Cloud tools like Azure Databricks are extending — not replacing — the traditional ETL stack.

5 Healthcare Data Management Mistakes That Break Projects


I've seen these across two industries and three employers. Different organizations, different tools, same patterns. Every one of these mistakes turns a solvable architecture problem into a six-figure remediation project.

Mistake 1: Treating the Warehouse as a Data Dump

The fastest way to build a useless data warehouse is to load every table from every source system without dimensional modeling. You end up with a staging area that someone relabeled "data warehouse." The data is all there. Nobody can query it efficiently because there's no star schema, no conformed dimensions, and no fact-to-dimension relationships.

A data warehouse is not a backup of your source systems. It is a dimensional model designed to answer specific business questions fast.

Mistake 2: Building Dashboards Before Fixing the Model

This is the "pretty charts, wrong numbers" trap. A team spends months building Power BI or Tableau dashboards on top of a poorly modeled warehouse. The dashboards look great. The numbers don't reconcile. When someone finally checks, the problem isn't the visualization — it's the data model underneath.

At AIG, the dashboards were fine. The data model behind them had circular relationships that made reports slow and some calculations incorrect. Fixing the model fixed everything downstream.

Mistake 3: Skipping Data Quality Gates in ETL

When the state agency flat file changed its date format, our ETL pipeline didn't fail. It loaded corrupted dates into the warehouse. No alert. No error. Just silent corruption that showed up 12 hours later when a compliance dashboard showed zero claims for an entire day.

After that, every pipeline I build has validation gates between staging and production. Format checks, null rate thresholds, referential integrity verification, value range constraints. If any check fails, the pipeline stops. Bad data stays in staging until a human investigates.

Mistake 4: Ignoring Conformed Dimensions

Five systems. Five definitions of "Provider ID." If you don't reconcile them into one conformed dimension, you get five different answers to the question "how many providers submitted claims last month?" This isn't a theoretical risk. It's what happens in most healthcare organizations that skip the dimensional modeling step.

Conformed dimensions are boring. They take weeks to build. They require mapping tables, cross-referencing multiple identifier systems, and handling edge cases. But they are the only thing that makes cross-system reporting trustworthy.

Mistake 5: Betting on the Tool Instead of the Architecture

"We're evaluating Power BI vs Tableau." I've heard this in every organization I've worked for. It's the wrong conversation. Power BI and Tableau are both excellent tools. Neither one will save a data warehouse that's a pile of denormalized tables with no fact-dimension relationships.

The Tool vs Architecture Trap
  • Power BI, Tableau, and QlikView all perform well on top of a properly designed star schema — the tool choice is secondary
  • A $500K Tableau deployment on top of a poorly modeled warehouse will produce the same wrong numbers as a free Excel pivot table
  • The fix for slow reports is almost never a better BI tool — it is a better data model
  • Architecture decisions (star schema, conformed dimensions, ETL validation) outlast any tool selection by years
Key Takeaway

The five mistakes that break healthcare data projects are all architecture failures, not tool failures. Dimensional modeling, conformed dimensions, data quality gates, and proper ETL design are the foundations. Choose your BI tool after the architecture is right — not before.

Key Takeaways: Healthcare Data Management
  1. Healthcare data management is the translation layer between transaction systems designed for individual operations and analytics systems that need a unified view of millions of records — the hardest part is reconciling source system differences, not choosing tools
  2. Star schema is the right default for healthcare data warehouses: faster queries, simpler SQL, and better optimizer performance on platforms like Teradata — use snowflake extensions only for deeply hierarchical dimensions
  3. Conformed dimensions that reconcile different entity identifiers (Provider ID, Patient ID, Facility ID) across source systems are the single most important and most neglected component of healthcare data integration
  4. ETL pipeline reliability depends on validation gates between staging and production — silent data corruption (successful loads of bad data) is more dangerous than pipeline failures that trigger alerts
  5. Reporting performance problems are almost never hardware problems — check the data model for circular relationships, missing indexes, and query-time calculations before approving infrastructure upgrades
  6. The five mistakes that break healthcare data projects are all architecture failures: treating the warehouse as a data dump, building dashboards before fixing the model, skipping data quality gates, ignoring conformed dimensions, and betting on tools instead of architecture
FAQ

What is healthcare data management?

Healthcare data management is the practice of integrating clinical, administrative, and financial data from multiple source systems — EHRs, claims engines, CRMs, ERPs, and external feeds — into a unified analytical data warehouse. It requires dimensional modeling (typically star schema), ETL pipeline orchestration, conformed dimensions for cross-system identity resolution, and data quality enforcement to ensure compliance reporting accuracy.

Why is healthcare data integration so difficult?

Healthcare data integration is difficult because source systems use different schemas, different update frequencies, and different definitions for the same entities. A "Provider ID" may be an NPI number in one system, an internal enrollment code in another, and a facility TIN in a third. Reconciling these identities into conformed dimensions requires cross-referencing multiple lookup tables, handling many-to-many relationships, and building exception logic for unmatched records.

What is a star schema in a healthcare data warehouse?

A star schema organizes healthcare data into a central fact table (such as claims, encounters, or prescriptions) surrounded by denormalized dimension tables (provider, patient, facility, date). By pre-joining dimension attributes into flat tables, star schema eliminates complex runtime joins and delivers the fast query performance that healthcare compliance reports and analytics dashboards require.

What tools are used for healthcare data management?

Enterprise healthcare data management typically uses ETL tools (SSIS, Informatica PowerCenter, IBM DataStage, Azure Databricks) for data movement, databases (SQL Server, Teradata, Oracle, Azure SQL) for storage, and BI platforms (Power BI, SSRS, Tableau) for reporting. The tool matters less than the data model — a well-designed star schema with conformed dimensions will outperform a poorly modeled warehouse on any platform.

What is the difference between star schema and snowflake schema for healthcare?

Star schema uses denormalized dimension tables for faster query performance and simpler SQL — best for claims analytics, compliance reports, and executive dashboards. Snowflake schema normalizes dimension tables into multiple related tables, reducing storage redundancy but requiring more complex joins — best for deeply hierarchical data like provider organizations or facility structures. Most healthcare warehouses use star schema as the default with snowflake extensions for hierarchical dimensions.

How do you ensure data quality in healthcare data warehouses?

Data quality in healthcare warehouses requires validation gates between the ETL staging area and the production warehouse. These gates check format consistency (date formats, code values), null rate thresholds, referential integrity against conformed dimensions, and value range constraints. If any check fails, the pipeline stops and the data stays in staging until investigated. Silent data corruption — where the pipeline succeeds but loads bad data — is more dangerous than visible pipeline failures.

What is a conformed dimension in healthcare data?

A conformed dimension is a single, authoritative reference table that reconciles all different identifiers for an entity (such as a provider, patient, or facility) across multiple source systems into one canonical record. In healthcare, a conformed provider dimension maps NPI numbers, internal codes, TINs, DEA numbers, and legacy identifiers to one row per provider, enabling consistent cross-system reporting.

Sources
  1. Shrank, W.H., Rogstad, T.L., Parekh, N. "Waste Not, Want Not: The Facts About Healthcare Spending." JAMA (2019)
  2. Office of the National Coordinator for Health Information Technology (ONC/ASTP). "Non-federal Acute Care Hospital Electronic Health Record Adoption." (2021)
  3. Kimball, R., Ross, M. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. Wiley Publishing (2013)
  4. Centers for Medicare & Medicaid Services (CMS). "Florida State Medicaid Profile." (2025)
  5. Teradata Developers Portal. "Run Large Bulkloads Efficiently with Fastload." (2024)