I was three years into writing C# code for Florida's Medicaid claims system when I realized something that changed the direction of my career.
Every reporting team pulling data from my application was producing different numbers for the same patients. The claims team had one count. The compliance team had another. The finance team had a third. All pulling from systems I had helped build. None of them wrong, exactly. But none of them agreeing.
The problem wasn't in the application. The problem was that patient data lived in five different systems — SQL Server databases, third-party APIs, a CRM platform, an ERP system, and flat files from state agencies — and nobody had built the layer that makes them talk to each other.
Three years later, I was the one building that layer. Same system. Different company. Completely different side of the stack.
I spent seven years inside FLMMIS — the Florida Medicaid Management Information System — serving 5 million beneficiaries. First as a .NET developer at Hewlett Packard Enterprise, writing the code that processes claims, eligibility, and provider enrollment. Then as a data analyst at DXC Technology, designing the enterprise data warehouse and ETL pipelines that turn that raw data into compliance reports and executive dashboards.
Most data professionals see one side. I built both.
This is what healthcare data management actually looks like from the inside — the integration nightmares, the architectural decisions, and the mistakes I've watched break projects across healthcare and insurance.

Vikram Reddy
Data Analyst / Power BI Developer at AIG Insurance
Vikram Reddy has 8+ years of experience designing enterprise data systems across healthcare and insurance. He spent 7 years inside the Florida Medicaid Management Information System (FLMMIS) — first as a .NET developer at Hewlett Packard Enterprise building claims processing applications, then as a data analyst at DXC Technology designing the star schema data warehouse and ETL pipelines that power Medicaid reporting for 5M+ beneficiaries. He now leads Power BI development and data integration at AIG Insurance, where he unified SQL Server, SAP BW, Oracle, and Azure SQL into centralized enterprise analytics and improved reporting performance by 40% through data model restructuring.
What is healthcare data management?
Healthcare data management is the practice of integrating, structuring, and governing clinical and administrative data from multiple source systems — EHRs, claims systems, CRMs, ERPs, and flat files — into a unified data warehouse that supports compliance reporting, analytics, and executive decision-making. It requires dimensional modeling (typically star schema), ETL pipeline orchestration, data quality enforcement, and cross-system identity resolution for entities like providers, patients, and facilities.
Why is healthcare data integration so difficult?
Healthcare data integration is uniquely hard because source systems use different schemas, different update frequencies, and different definitions for the same entities. "Provider ID" may mean an NPI number in one system, an internal code in another, and a facility identifier in a third. Conformed dimensions that reconcile these identities across systems are the single most important — and most neglected — component of healthcare data warehouses.
What is the best data warehouse design for healthcare?
Star schema is the most effective design for healthcare data warehouses that serve compliance reporting and analytics dashboards. It delivers faster query performance than third normal form (3NF) because fact tables pre-join to dimension tables, eliminating complex runtime joins. Snowflake schema works better for deeply hierarchical provider or facility data that requires drill-down analysis. Most enterprise healthcare warehouses use star schema for claims and encounters with snowflake extensions for provider hierarchies.
What tools are used for healthcare data management?
Enterprise healthcare data management typically uses a combination of ETL tools (SSIS, Informatica PowerCenter, IBM DataStage, Azure Databricks), databases (SQL Server, Teradata, Oracle, Azure SQL), and BI platforms (Power BI, SSRS, Tableau). The tool matters less than the data model underneath — a well-designed star schema with conformed dimensions will outperform a poorly modeled warehouse on any platform.
FLMMIS — the Florida Medicaid Management Information System — is one of the largest state-level healthcare data systems in America. It handles eligibility determination, claims adjudication, provider enrollment, and reporting for over 5 million Medicaid beneficiaries. Every doctor visit, every prescription fill, every hospital admission for a Medicaid patient in Florida flows through this system.
I spent seven years inside it. Not as a consultant passing through. Not as a vendor running demos. As someone who wrote the code and then designed the data architecture.
Chapter 1: Building the Machine (HPE, 2016–2019)
At Hewlett Packard Enterprise, I was a .NET developer on the FLMMIS platform. ASP.NET MVC applications. C# business logic. RESTful Web APIs handling claims processing and provider enrollment. Entity Framework talking to SQL Server on the backend. Stored procedures, triggers, performance-tuned views — the full stack that makes a Medicaid claims system run.
What I didn't fully understand at the time was where the data went after my application created it.
I'd write code that validated a claim, applied business rules, and committed the result to SQL Server. Job done. But downstream, five different teams were pulling that data into five different reporting systems, applying five different transformation rules, and producing five different versions of the truth.
The data wasn't wrong. It was fragmented. And fragmentation in healthcare data doesn't just mean inconsistent dashboards — it means compliance reports that regulators question.
Chapter 2: Building the Brain (DXC, 2020–2023)
When I joined DXC Technology in 2020, I switched from building the application to building the analytics layer for the same FLMMIS ecosystem. Now I was on the receiving end of the data my previous code had generated.
The difference was immediate. I could look at a claims table and understand not just what the columns meant, but how the application populated them, which business rules affected which fields, and where the edge cases lived. Most data analysts inherit a schema and read the documentation. I had written the business logic that created the schema.
When I designed fact tables for Medicaid claims, I knew which fields were populated by which business rules, which ones had edge cases the documentation didn't mention, and which ones would break if a specific provider type submitted a claim in a non-standard format.
That dual perspective shaped how I designed the data warehouse. I built star schema dimensional models with conformed dimensions and scalable fact tables, using Informatica PowerCenter and SSIS for ETL orchestration against a Teradata backend. The warehouse had to ingest from SQL Server, third-party APIs, CRM and ERP systems, and state-agency flat files.
The result was a unified data layer that powered SSRS dashboards for Medicaid analytics, compliance reporting, and stakeholder visibility. One version of the truth, not five.
Seeing both sides of a healthcare data system — the application that creates the data and the warehouse that analyzes it — is the single biggest advantage a data professional can have. Most integration problems exist because the data architect doesn't understand how the application populates the fields. Most application developers don't know how their schema will be consumed downstream.
Job descriptions make healthcare data management sound clean: "integrate clinical and administrative data sources to support reporting and analytics." Eleven words. Eleven words that skip over the hardest parts.
Here's what it actually means when you're staring at five source systems in a Medicaid context:
Healthcare data management means taking data that was designed to support individual applications — a claims processing system, a provider enrollment portal, an eligibility determination engine — and making it behave as if it were designed for analytics. It wasn't. It was designed for transactions.
Transaction systems and analytics systems want completely different things. A claims processing application wants normalized tables, referential integrity, and fast single-record lookups. An analytics dashboard wants denormalized star schemas, pre-aggregated measures, and fast scans across millions of rows. You cannot serve both from the same schema.
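To make the contrast concrete, here is a minimal Python sketch of the same claim shaped both ways. The field names are illustrative, not the FLMMIS schema:

```python
from collections import defaultdict

# Transactional shape: normalized, foreign keys, optimized for single-record writes
claim_txn = {"claim_id": 9001, "provider_id": 101, "patient_id": 555, "amount": 120.50}

# Analytical shape: denormalized fact row with dimension attributes pre-joined,
# optimized for fast scans and aggregation across millions of rows
claim_fact = {
    "claim_id": 9001,
    "provider_npi": "1234567890",
    "provider_specialty": "Pediatrics",
    "patient_county": "Miami-Dade",
    "service_month": "2023-04",
    "paid_amount": 120.50,
}

# An aggregate like "paid amount by specialty" needs no joins on the analytical shape
totals = defaultdict(float)
for row in [claim_fact]:
    totals[row["provider_specialty"]] += row["paid_amount"]
```

The transactional row answers "what happened to claim 9001" in one lookup; the fact row answers "how much did we pay pediatricians in April" in one scan. Neither shape can do the other's job well.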
Healthcare Data Management
Healthcare data management is the practice of integrating clinical, administrative, and financial data from multiple transaction-oriented source systems (EHRs, claims engines, CRMs, ERPs, and external feeds) into a unified analytical data warehouse. It requires dimensional modeling for query performance, ETL pipeline orchestration for data movement, conformed dimensions for cross-system identity resolution, and data quality enforcement to ensure compliance reporting accuracy.
The gap between what job descriptions say and what the work looks like is enormous. Nobody tells you that the "integration" part means spending weeks reconciling how "Provider" is defined across five systems. Nobody mentions that the "reporting" part means building SSRS dashboards that regulators will scrutinize line by line. Nobody warns you that the "data quality" part means building validation gates into your ETL pipeline because one bad date format from a flat file can cascade into a compliance violation.
Healthcare data management is not data storage. It is the translation layer between transaction systems that were designed for individual operations and analytics systems that need a unified, query-optimized view of millions of records. The hardest part is never the tooling. It is reconciling the differences between source systems that were never designed to agree with each other.
The Medicaid data warehouse at DXC had to ingest from five source categories: SQL Server databases running the core claims application, third-party APIs from state agencies, a CRM system for provider relationships, an ERP system for financial data, and flat files from external entities.
Each source had its own schema. Its own update frequency. Its own definition of every entity.
The worst offender was "Provider."
- System A uses NPI (National Provider Identifier) — a 10-digit CMS-issued number
- System B uses an internal provider code assigned during enrollment
- System C stores the facility TIN (Tax Identification Number) as the provider identifier
- System D maps providers by DEA number for pharmacy claims
- Flat files from the state use a legacy code that predates the NPI system entirely
Five systems, five definitions of "Provider," and every one of them was correct within its own context. The problem was that a compliance report might need to show "claims by provider" — and the answer depends entirely on which definition of "provider" you use.
The solution was a conformed provider dimension: one master table that resolves every identifier to a single canonical dim_provider record, regardless of which source system the data came from. Building that conformed dimension took weeks. Mapping NPI to internal codes to TINs to DEA numbers to legacy state codes required cross-referencing five lookup tables, handling many-to-many relationships (one NPI can map to multiple TINs), and building exception-handling logic for records that didn't match anything.
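The mechanics of that reconciliation can be sketched in a few lines. This is an illustrative Python sketch, not the production logic; the crosswalk values and the resolve_provider helper are hypothetical:

```python
# Crosswalk rows: (identifier_type, source_value) -> surrogate provider key.
# In production this lived in lookup tables, not a dict; values here are invented.
CROSSWALK = {
    ("NPI", "1234567890"): 101,
    ("INTERNAL", "PRV-0042"): 101,   # same provider, enrollment code
    ("TIN", "59-1234567"): 101,      # one NPI can map to multiple TINs
    ("TIN", "59-7654321"): 101,
    ("DEA", "AB1234563"): 101,
    ("LEGACY", "FL00017"): 101,
}

def resolve_provider(id_type, value, exceptions):
    """Return the canonical dim_provider key, or queue the record for review."""
    key = CROSSWALK.get((id_type, value))
    if key is None:
        exceptions.append((id_type, value))  # unmatched: a human investigates
    return key

exceptions = []
assert resolve_provider("NPI", "1234567890", exceptions) == 101
assert resolve_provider("TIN", "59-7654321", exceptions) == 101
assert resolve_provider("NPI", "9999999999", exceptions) is None
assert exceptions == [("NPI", "9999999999")]
```

The exception queue is the important part: records that match nothing must surface for review, not silently drop out of compliance counts.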
The Day the Date Format Broke Everything
ETL pipelines are fragile. Not because the tools are bad — SSIS and Informatica PowerCenter are robust platforms. They're fragile because they depend on source systems behaving consistently. And source systems don't always cooperate.
Then one day, a state agency flat file changed its date format from MM/DD/YYYY to YYYY-MM-DD. No warning. No documentation update. The Informatica workflow that parsed that file ran on schedule, processed the records, and loaded dates into the staging table as garbage.

We caught it the next morning when the compliance dashboard showed zero claims for the previous day. Zero. The ETL pipeline didn't fail — it successfully loaded corrupted data. That's worse than a failure, because a failure triggers an alert. Silent corruption triggers nothing until someone looks at the output.
After that, I built validation gates into every ETL pipeline. Before any data moves from staging to the warehouse, automated checks verify format consistency, null rates, referential integrity, and value range constraints. If any check fails, the pipeline stops and sends an alert. The data stays in staging until a human investigates.
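A validation gate of that kind might look like the following. This is a simplified Python sketch with hypothetical field names and thresholds, not the production Informatica/SSIS logic:

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # expected YYYY-MM-DD

def validate_batch(rows, known_provider_keys, max_null_rate=0.02):
    """Return a list of failure messages; an empty list means the batch may load."""
    failures = []

    # Format consistency: catch silent date-format drift from source files
    bad_dates = [r for r in rows if r["service_date"] and not DATE_RE.match(r["service_date"])]
    if bad_dates:
        failures.append(f"format: {len(bad_dates)} rows with unexpected date format")

    # Null rate threshold: a spike usually means an upstream extract broke
    null_rate = sum(1 for r in rows if r["service_date"] is None) / len(rows)
    if null_rate > max_null_rate:
        failures.append(f"nulls: service_date null rate {null_rate:.1%} exceeds threshold")

    # Referential integrity against the conformed provider dimension
    orphans = [r for r in rows if r["provider_key"] not in known_provider_keys]
    if orphans:
        failures.append(f"referential: {len(orphans)} rows with unknown provider_key")

    # Value range constraints on amounts
    out_of_range = [r for r in rows if not (0 <= r["paid_amount"] <= 1_000_000)]
    if out_of_range:
        failures.append(f"range: {len(out_of_range)} rows with implausible paid_amount")

    return failures

rows = [
    {"service_date": "2023-04-01", "provider_key": 101, "paid_amount": 120.50},
    {"service_date": "04/02/2023", "provider_key": 101, "paid_amount": 80.00},  # format drift
]
issues = validate_batch(rows, known_provider_keys={101})
assert issues == ["format: 1 rows with unexpected date format"]
```

If the list comes back non-empty, the pipeline halts, the batch stays in staging, and an alert goes out — exactly the behavior the silent date-format corruption lacked.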
ETL failures that crash are manageable — they trigger alerts, they're visible, they get fixed. ETL pipelines that succeed while loading corrupted data are dangerous. Build validation gates between staging and production that catch format changes, null spikes, and referential integrity breaks before bad data reaches your warehouse.
Healthcare data integration is hard because the real challenge is not moving data between systems — it is reconciling what the data means across systems. "Provider ID" means five different things in five different systems. Conformed dimensions are the unglamorous, essential solution. Build them early, validate them constantly, and never trust that source systems will behave consistently.
When I joined AIG Insurance in 2024, one of the first problems I inherited was reporting performance. Executives were waiting minutes for Excel reports to refresh from Power BI datasets. The IT team had already scoped a server upgrade. New hardware, more RAM, faster storage. The estimate was six figures.
I asked to look at the Power BI data model first.
What I found was a model with dozens of table relationships, many of them unnecessary. Tables that should have been star-schema dimensions were connected in circular patterns. Measures that should have been calculated at the fact-table level were computed at query time across multiple joins. The model was doing ten times more work than it needed to for every single report refresh.
The fix wasn't hardware. It was architecture.
Every time someone says 'we need better hardware,' check the data model first. In eight years across healthcare and insurance, I have never seen a reporting performance problem that was actually a hardware problem. It was always a data model problem.
I restructured the backend Power BI data model. Removed circular relationships. Collapsed redundant lookup tables into proper star-schema dimensions. Moved expensive calculations from query-time DAX to pre-computed columns where appropriate. Re-indexed the underlying SQL Server views.
Reports ran 40% faster. No server upgrade. No new hardware. The same data, the same reports, the same Power BI Service deployment — just a model that stopped fighting itself.
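The underlying principle — move a per-query calculation to load time — can be sketched outside of DAX. Here is an illustrative Python version; loss_ratio is a hypothetical derived measure:

```python
# Two insurance fact rows (values invented for illustration)
rows = [
    {"premium": 1000.0, "claims_paid": 650.0},
    {"premium": 2000.0, "claims_paid": 1700.0},
]

# Pre-compute the derived column once, at load/refresh time
for r in rows:
    r["loss_ratio"] = r["claims_paid"] / r["premium"]

# Every subsequent report read is now a plain column scan, not a recomputation
assert [r["loss_ratio"] for r in rows] == [0.65, 0.85]
```

In Power BI terms, this is the difference between a measure evaluated per query across multiple joins and a calculated column materialized at refresh: the work happens once, when the data loads, instead of every time an executive opens the report.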
At AIG, I also integrated SQL Server, SAP BW, Oracle, and Azure SQL into one centralized enterprise BI platform. Four different database engines, four different query languages, four different data models — all feeding into Power BI dashboards that executives treat as a single source of truth. The integration pattern was the same one I'd learned at DXC: conformed dimensions, star schema design, ETL pipelines that validate before they load.
When reports are slow, check the data model before approving a hardware upgrade. Circular relationships, missing indexes, and query-time calculations that should be pre-computed are responsible for the majority of BI performance problems. Architecture fixes are free. Server upgrades are not.
When I designed the Medicaid data warehouse at DXC, the first architectural decision was schema design. Star schema or snowflake. It's the kind of decision that sounds academic until you're sitting in front of a Teradata instance that needs to process millions of Medicaid claims records and produce compliance reports before the morning shift starts.
Star Schema in Healthcare Data Warehouses
A star schema organizes data into a central fact table (e.g., claims, encounters, prescriptions) surrounded by denormalized dimension tables (e.g., provider, patient, facility, date). By pre-joining dimension attributes into flat tables, star schema eliminates complex runtime joins and delivers the fast scan performance that healthcare analytics dashboards and compliance reports require.
I chose star schema for Medicaid. The reasoning was specific, not theoretical:
| Factor | Star Schema | Snowflake Schema |
|---|---|---|
| Query performance | Faster — fewer joins, denormalized dimensions | Slower — normalized dimensions require multi-level joins |
| Best for healthcare | Claims analytics, compliance reports, executive dashboards | Complex provider hierarchies, facility drill-downs, organizational rollups |
| Ease of use | Intuitive for business analysts and report writers | Requires deeper SQL knowledge to navigate normalized tables |
| Storage efficiency | Higher redundancy — dimension data is repeated | Lower redundancy — dimension data is normalized |
| ETL complexity | Simpler loads — dimensions are flat | More complex — must maintain referential integrity across normalized tables |
| Real example | FLMMIS Medicaid claims warehouse (Teradata) | AIG provider hierarchy for multi-level organizational reporting |
The exception at AIG was provider hierarchies. Insurance providers have complex organizational structures: individual practitioners belong to group practices, which belong to networks, which belong to regional organizations. That kind of deeply hierarchical data is where snowflake schema actually earns its complexity — each level of the hierarchy is a separate normalized table, allowing drill-down queries without storing every combination.
In practice, most healthcare data warehouses use both. Star schema for the core fact tables (claims, encounters, prescriptions). Snowflake extensions for hierarchical dimensions (provider, facility, organizational structure). The mistake is treating it as an either/or decision.
Star schema is the right default for healthcare data warehouses. It delivers faster queries, simpler SQL, and better optimizer performance on platforms like Teradata. Use snowflake extensions only for deeply hierarchical dimensions like provider organizations or facility structures. The worst decision is no dimensional modeling at all — which is what most healthcare organizations are still running.
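To make the "one hop per dimension" property concrete, here is a toy star schema in SQLite. Table and column names are illustrative, not the FLMMIS schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_provider (provider_key INTEGER PRIMARY KEY, npi TEXT, specialty TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_claims (claim_id INTEGER, provider_key INTEGER, date_key INTEGER, paid_amount REAL);

INSERT INTO dim_provider VALUES (101, '1234567890', 'Pediatrics'), (102, '2345678901', 'Cardiology');
INSERT INTO dim_date VALUES (20230401, '2023-04'), (20230501, '2023-05');
INSERT INTO fact_claims VALUES (1, 101, 20230401, 120.5), (2, 101, 20230501, 80.0), (3, 102, 20230401, 300.0);
""")

# Each dimension is one join away from the fact table -- no multi-level snowflake chains
rows = con.execute("""
    SELECT p.specialty, d.month, SUM(f.paid_amount)
    FROM fact_claims f
    JOIN dim_provider p ON p.provider_key = f.provider_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.specialty, d.month
    ORDER BY p.specialty, d.month
""").fetchall()

assert rows == [('Cardiology', '2023-04', 300.0),
                ('Pediatrics', '2023-04', 120.5),
                ('Pediatrics', '2023-05', 80.0)]
```

A snowflaked provider dimension would split specialty, group practice, and network into separate normalized tables, turning that single join into a chain of three or four — which is exactly the overhead you only accept when the hierarchy itself is the thing you need to query.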
Nobody writes LinkedIn posts about SSIS packages. Nobody gives conference talks about Teradata FastLoad vs MultiLoad trade-offs. Nobody gets excited about Informatica PowerCenter mappings.
But these are the tools that keep healthcare data moving. Every Medicaid compliance report, every executive dashboard, every claims analysis depends on ETL pipelines that run overnight and load data before the first analyst logs in.
FastLoad vs MultiLoad: The Monday Morning Problem
On the Medicaid data warehouse, we used Teradata as the backend. Loading millions of records nightly required choosing between two utilities: FastLoad and MultiLoad. Both move data from source to target. They are not interchangeable.
FastLoad is faster. It bypasses the transactional log, streams data directly to the Access Module Processor, and can load millions of rows in minutes. The trade-off: it only works on empty tables. No updates. No upserts. No existing rows.
MultiLoad handles updates and mixed operations. It can insert, update, delete, and upsert. The trade-off: it's slower, and it has overhead for error handling and transaction management.
Pick the wrong one and the consequences are concrete. Use FastLoad on a table that has existing records, and the load fails. Use MultiLoad on a large initial load, and the batch window runs long. If your ETL batch window ends at 6 AM and the load finishes at 7 AM, your Monday morning compliance reports run on Saturday's data.
Use FastLoad for initial loads and full-refresh dimensions where the target table is truncated first. Use MultiLoad for incremental fact table loads and slowly changing dimensions that require upserts. Never assume one utility works for both patterns.
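The two load patterns behind that choice can be sketched generically. The Teradata utilities differ in mechanics, but the shapes are full-refresh versus upsert; this illustrative example uses SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_provider (provider_key INTEGER PRIMARY KEY, specialty TEXT)")

# Full-refresh pattern (FastLoad territory): truncate, then bulk-insert into an empty table
con.execute("DELETE FROM dim_provider")
con.executemany("INSERT INTO dim_provider VALUES (?, ?)",
                [(101, "Pediatrics"), (102, "Cardiology")])

# Incremental upsert pattern (MultiLoad territory): update existing rows, insert new ones
con.executemany("""
    INSERT INTO dim_provider VALUES (?, ?)
    ON CONFLICT(provider_key) DO UPDATE SET specialty = excluded.specialty
""", [(102, "Interventional Cardiology"), (103, "Oncology")])

rows = sorted(con.execute("SELECT * FROM dim_provider").fetchall())
assert rows == [(101, "Pediatrics"),
                (102, "Interventional Cardiology"),
                (103, "Oncology")]
```

FastLoad gets its speed precisely because it skips the machinery the second pattern needs — conflict detection, row-level updates, transaction logging — which is also why it refuses to touch a table with existing rows.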
SQL Profiler: Finding the Query That Ate 60% of ETL Time
At DXC, the Medicaid ETL pipeline was running close to its batch window limit. Not failing — just taking longer each month as data volume grew. I used SQL Profiler to trace every query in the pipeline and found a single stored procedure that consumed over 60% of total processing time. The procedure was doing a full table scan on a fact table with no index on the join column. Adding the right index dropped that procedure's runtime from 45 minutes to under 3. The entire pipeline gained back an hour of headroom.
Performance tuning is not glamorous. But that one index change is the reason the compliance team had fresh data on Monday mornings for the next two years.
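The problem reproduces in miniature with any query planner. Here is an illustrative SQLite version; SQL Profiler was the diagnostic on SQL Server, and the table and index names here are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_claims (claim_id INTEGER, provider_key INTEGER, paid_amount REAL)")

# Before: filtering on an unindexed join column forces a full table scan
plan_before = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM fact_claims WHERE provider_key = 101"
).fetchall()
assert any("SCAN" in row[-1] for row in plan_before)

# The fix: index the column the joins and filters actually use
con.execute("CREATE INDEX ix_claims_provider ON fact_claims (provider_key)")

# After: the planner switches to an index search instead of scanning every row
plan_after = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM fact_claims WHERE provider_key = 101"
).fetchall()
assert any("ix_claims_provider" in row[-1] for row in plan_after)
```

On a toy table the difference is invisible; on a fact table with millions of claims rows, it is the difference between a 45-minute step and a 3-minute one.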
The Shift to Azure Databricks
At AIG, the ETL paradigm is evolving. Azure Databricks processes high-volume datasets using distributed Spark clusters. IBM DataStage handles large-scale healthcare dataset transformations. The traditional SSIS packages still run, optimized with checkpoints and event handling for reliability.
The cloud isn't replacing on-premise ETL everywhere. It's supplementing it. Many healthcare organizations run hybrid stacks: SSIS for internal SQL Server pipelines, Informatica for cross-platform integration, and Databricks for the workloads that exceed what a single server can handle.
The ETL stack that powers healthcare data is not glamorous, but it is the difference between fresh data and stale data. FastLoad vs MultiLoad, index optimization, and batch window management are the skills that keep compliance reports running on time. Cloud tools like Azure Databricks are extending — not replacing — the traditional ETL stack.
I've seen the same five mistakes across two industries and three employers. Different organizations, different tools, same patterns. Every one of them turns a solvable architecture problem into a six-figure remediation project.
Mistake 1: Treating the Warehouse as a Data Dump
The fastest way to build a useless data warehouse is to load every table from every source system without dimensional modeling. You end up with a staging area that someone relabeled "data warehouse." The data is all there. Nobody can query it efficiently because there's no star schema, no conformed dimensions, and no fact-to-dimension relationships.
A data warehouse is not a backup of your source systems. It is a dimensional model designed to answer specific business questions fast.
Mistake 2: Building Dashboards Before Fixing the Model
This is the "pretty charts, wrong numbers" trap. A team spends months building Power BI or Tableau dashboards on top of a poorly modeled warehouse. The dashboards look great. The numbers don't reconcile. When someone finally checks, the problem isn't the visualization — it's the data model underneath.
At AIG, the dashboards were fine. The data model behind them had circular relationships that made reports slow and some calculations incorrect. Fixing the model fixed everything downstream.
Mistake 3: Skipping Data Quality Gates in ETL
When the state agency flat file changed its date format, our ETL pipeline didn't fail. It loaded corrupted dates into the warehouse. No alert. No error. Just silent corruption that showed up 12 hours later when a compliance dashboard showed zero claims for an entire day.
After that, every pipeline I build has validation gates between staging and production. Format checks, null rate thresholds, referential integrity verification, value range constraints. If any check fails, the pipeline stops. Bad data stays in staging until a human investigates.
Mistake 4: Ignoring Conformed Dimensions
Five systems. Five definitions of "Provider ID." If you don't reconcile them into one conformed dimension, you get five different answers to the question "how many providers submitted claims last month?" This isn't a theoretical risk. It's what happens in most healthcare organizations that skip the dimensional modeling step.
Conformed dimensions are boring. They take weeks to build. They require mapping tables, cross-referencing multiple identifier systems, and handling edge cases. But they are the only thing that makes cross-system reporting trustworthy.
Mistake 5: Betting on the Tool Instead of the Architecture
"We're evaluating Power BI vs Tableau." I've heard this in every organization I've worked for. It's the wrong conversation. Power BI and Tableau are both excellent tools. Neither one will save a data warehouse that's a pile of denormalized tables with no fact-dimension relationships.
- Power BI, Tableau, and QlikView all perform well on top of a properly designed star schema — the tool choice is secondary
- A $500K Tableau deployment on top of a poorly modeled warehouse will produce the same wrong numbers as a free Excel pivot table
- The fix for slow reports is almost never a better BI tool — it is a better data model
- Architecture decisions (star schema, conformed dimensions, ETL validation) outlast any tool selection by years
The five mistakes that break healthcare data projects are all architecture failures, not tool failures. Dimensional modeling, conformed dimensions, data quality gates, and proper ETL design are the foundations. Choose your BI tool after the architecture is right — not before.
1. Healthcare data management is the translation layer between transaction systems designed for individual operations and analytics systems that need a unified view of millions of records — the hardest part is reconciling source system differences, not choosing tools
2. Star schema is the right default for healthcare data warehouses: faster queries, simpler SQL, and better optimizer performance on platforms like Teradata — use snowflake extensions only for deeply hierarchical dimensions
3. Conformed dimensions that reconcile different entity identifiers (Provider ID, Patient ID, Facility ID) across source systems are the single most important and most neglected component of healthcare data integration
4. ETL pipeline reliability depends on validation gates between staging and production — silent data corruption (successful loads of bad data) is more dangerous than pipeline failures that trigger alerts
5. Reporting performance problems are almost never hardware problems — check the data model for circular relationships, missing indexes, and query-time calculations before approving infrastructure upgrades
6. The five mistakes that break healthcare data projects are all architecture failures: treating the warehouse as a data dump, building dashboards before fixing the model, skipping data quality gates, ignoring conformed dimensions, and betting on tools instead of architecture
What is healthcare data management?
Healthcare data management is the practice of integrating clinical, administrative, and financial data from multiple source systems — EHRs, claims engines, CRMs, ERPs, and external feeds — into a unified analytical data warehouse. It requires dimensional modeling (typically star schema), ETL pipeline orchestration, conformed dimensions for cross-system identity resolution, and data quality enforcement to ensure compliance reporting accuracy.
Why is healthcare data integration so difficult?
Healthcare data integration is difficult because source systems use different schemas, different update frequencies, and different definitions for the same entities. A "Provider ID" may be an NPI number in one system, an internal enrollment code in another, and a facility TIN in a third. Reconciling these identities into conformed dimensions requires cross-referencing multiple lookup tables, handling many-to-many relationships, and building exception logic for unmatched records.
What is a star schema in a healthcare data warehouse?
A star schema organizes healthcare data into a central fact table (such as claims, encounters, or prescriptions) surrounded by denormalized dimension tables (provider, patient, facility, date). By pre-joining dimension attributes into flat tables, star schema eliminates complex runtime joins and delivers the fast query performance that healthcare compliance reports and analytics dashboards require.
What tools are used for healthcare data management?
Enterprise healthcare data management typically uses ETL tools (SSIS, Informatica PowerCenter, IBM DataStage, Azure Databricks) for data movement, databases (SQL Server, Teradata, Oracle, Azure SQL) for storage, and BI platforms (Power BI, SSRS, Tableau) for reporting. The tool matters less than the data model — a well-designed star schema with conformed dimensions will outperform a poorly modeled warehouse on any platform.
What is the difference between star schema and snowflake schema for healthcare?
Star schema uses denormalized dimension tables for faster query performance and simpler SQL — best for claims analytics, compliance reports, and executive dashboards. Snowflake schema normalizes dimension tables into multiple related tables, reducing storage redundancy but requiring more complex joins — best for deeply hierarchical data like provider organizations or facility structures. Most healthcare warehouses use star schema as the default with snowflake extensions for hierarchical dimensions.
How do you ensure data quality in healthcare data warehouses?
Data quality in healthcare warehouses requires validation gates between the ETL staging area and the production warehouse. These gates check format consistency (date formats, code values), null rate thresholds, referential integrity against conformed dimensions, and value range constraints. If any check fails, the pipeline stops and the data stays in staging until investigated. Silent data corruption — where the pipeline succeeds but loads bad data — is more dangerous than visible pipeline failures.
What is a conformed dimension in healthcare data?
A conformed dimension is a single, authoritative reference table that reconciles all different identifiers for an entity (such as a provider, patient, or facility) across multiple source systems into one canonical record. In healthcare, a conformed provider dimension maps NPI numbers, internal codes, TINs, DEA numbers, and legacy identifiers to one row per provider, enabling consistent cross-system reporting.