
Daniel Abraham Mamudgi
Data Engineer, MS Computer Science
Daniel has 4+ years of data engineering experience building scalable, cloud-based data pipelines. At Optum (UnitedHealth Group), he designed ETL workflows processing data from 20+ US states, built a JSON-to-Airflow DAG framework saving $120K annually, and integrated Kafka-based real-time streams. Recently completed his MS in Computer Science at University of Illinois Chicago with a 4.0 GPA, where he researched high-performance computing. Recognized twice as 'Talent in Spotlight' for high-impact contributions.
How long does it take to become a mid-level data engineer?
Typically 2-4 years of hands-on experience building production data pipelines. The key accelerator isn't time — it's exposure to complex, real-world problems: multi-source data integration, handling data quality issues at scale, and owning end-to-end pipeline delivery. I reached mid-level in about 3 years by working on high-impact healthcare data projects.
What skills do I need to become a data engineer?
Core skills: Python (data manipulation, scripting), SQL (complex queries, optimization), and at least one distributed processing framework (Spark/PySpark). Add cloud platform expertise (AWS or Azure), orchestration tools (Airflow), and data modeling fundamentals. Soft skills matter too — communicating with stakeholders and understanding business context separates mid-level from senior.
Is AWS or Azure better for data engineering?
Both are excellent. AWS has more market share and a broader ecosystem (S3, Redshift, Glue, EMR, Athena). Azure excels if your organization uses Microsoft products and offers tight Databricks integration. I've used both extensively — learn one deeply first, then add the other. The concepts transfer well.
Do I need a master's degree to become a data engineer?
No. I got my first data engineering role with a bachelor's degree and built 3 years of experience before pursuing my MS. A master's can accelerate certain career paths (research, specialized ML roles) but isn't required. Focus on building real projects and demonstrating impact.
Data Engineer
A Data Engineer designs, builds, and maintains the infrastructure and systems that enable organizations to collect, store, transform, and analyze data. This includes building ETL/ELT pipelines, managing data warehouses and lakes, ensuring data quality, and creating the foundation that data scientists, analysts, and business teams depend on for insights.
When I joined Optum as a Data Engineering Analyst in 2020, I quickly learned that data engineering is the backbone of any data-driven organization. Data scientists get the headlines, but without reliable data pipelines, they have nothing to work with.
Day-to-day, the role sits at the intersection of software engineering and data management: you need coding skills, but also an understanding of data modeling, distributed systems, and cloud infrastructure.
Data engineering is infrastructure work — you build the systems that make data accessible and reliable. Success is measured by pipeline uptime, data quality, and how effectively you enable downstream teams to do their work.
Based on my journey from entry-level to leading critical data initiatives, here's the realistic roadmap for data engineering career progression.
| Stage | Timeline | Focus Areas | Key Milestones |
|---|---|---|---|
| Foundation | 0-6 months | Python, SQL, basic ETL concepts | First pipeline deployed |
| Junior | 6-18 months | Cloud basics, orchestration, data modeling | Own end-to-end pipelines |
| Mid-Level | 18-36 months | Architecture, optimization, mentoring | Lead critical projects |
| Senior | 36+ months | Strategy, cross-team influence, technical vision | Define data architecture |
The timelines are guidelines, not rules. I've seen engineers reach mid-level in 18 months through intense project exposure, and others stay junior for 4+ years by avoiding challenging work.
Whether you're in a bootcamp, self-learning, or just starting your first job, the foundation stage is about building core competencies that everything else rests on.
Python for Data Engineering
Python is the lingua franca of data engineering. You need proficiency in:
- Pandas for tabular data processing
- Working with JSON, CSV, Parquet files
- API interactions with requests library
- File system operations
- Writing maintainable, production-quality code
- Error handling and logging
- Configuration management
- Unit testing basics
```python
# Example: A simple ETL pattern you'll use constantly
import pandas as pd
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def extract(source_path: str) -> pd.DataFrame:
    """Extract data from source file."""
    logger.info(f"Extracting from {source_path}")
    return pd.read_csv(source_path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply business transformations."""
    logger.info(f"Transforming {len(df)} rows")
    # Clean, validate, enrich
    df = df.dropna(subset=['id'])
    df['processed_at'] = pd.Timestamp.now()
    return df

def load(df: pd.DataFrame, target_path: str) -> None:
    """Load data to destination."""
    logger.info(f"Loading {len(df)} rows to {target_path}")
    df.to_parquet(target_path, index=False)

# This pattern scales from scripts to production pipelines
```
SQL Fundamentals
SQL is non-negotiable. At Optum, I wrote complex queries daily against massive healthcare datasets.
- JOINs (inner, left, right, full, cross)
- Window functions (ROW_NUMBER, RANK, LAG, LEAD, SUM OVER)
- CTEs (Common Table Expressions) for readable queries
- Subqueries and correlated subqueries
- Aggregations and GROUP BY with HAVING
- Query optimization basics (indexes, explain plans)
```sql
-- Example: Window functions are essential for data engineering
WITH member_activity AS (
    SELECT
        member_id,
        activity_date,
        activity_type,
        ROW_NUMBER() OVER (
            PARTITION BY member_id
            ORDER BY activity_date DESC
        ) AS recency_rank,
        COUNT(*) OVER (
            PARTITION BY member_id
        ) AS total_activities
    FROM member_activities
    WHERE activity_date >= CURRENT_DATE - INTERVAL '90 days'
)
SELECT *
FROM member_activity
WHERE recency_rank = 1;  -- Latest activity per member
```
Basic ETL Concepts
Understand the fundamentals before diving into tools:
- Batch vs. Streaming: When to use each, tradeoffs
- Data formats: CSV, JSON, Parquet, Avro — pros and cons
- Data modeling basics: Star schema, snowflake, normalization
- Idempotency: Why pipelines must be re-runnable safely
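To make the idempotency point concrete, here's a minimal sketch (the paths and partition layout are hypothetical): re-running the load replaces that day's partition instead of appending, so duplicates can't accumulate.

```python
# A minimal sketch of an idempotent load: overwrite the target partition
# rather than appending, so re-runs never duplicate rows.
import pandas as pd
from pathlib import Path

def load_partition(df: pd.DataFrame, base_path: str, run_date: str) -> None:
    """Write one day's data to its own partition, replacing any prior run."""
    target = Path(base_path) / f"date={run_date}"
    target.mkdir(parents=True, exist_ok=True)
    # Overwriting the whole partition file makes the load safe to re-run
    df.to_parquet(target / "data.parquet", index=False)
```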
The foundation stage is about becoming dangerous with Python and SQL. Don't rush to learn every tool — master these fundamentals first. Every advanced data engineering skill builds on them.
As a junior data engineer, you're contributing to production systems under guidance. This is where you learn what "production-grade" really means.
Cloud Platform Basics
Pick AWS or Azure and learn it properly. At Optum, I worked primarily with AWS initially, then expanded to Azure for specific projects.
AWS core data services:
- S3: Object storage (your data lake foundation)
- Redshift: Data warehouse for analytics
- Glue: Managed ETL service
- Athena: Serverless SQL queries on S3
- Lambda: Serverless compute for lightweight processing

Azure core data services:
- ADLS Gen2: Data lake storage
- Databricks: Unified analytics platform
- Data Factory: Orchestration and ETL
- Synapse Analytics: Data warehouse and analytics
Don't try to learn everything. Pick one cloud, learn the core data services deeply, then build a project that uses them together. Theory without practice doesn't stick.
Orchestration with Apache Airflow
Airflow is the industry standard for pipeline orchestration. At Optum, I built a JSON-to-Airflow DAG code generation framework that migrated 35+ pipelines from Talend.
- DAGs (Directed Acyclic Graphs) for workflow definition
- Operators (Python, Bash, cloud-specific)
- Task dependencies and parallelism
- XComs for passing data between tasks
- Connections and Variables for configuration
- Sensors for waiting on external conditions
```python
# Example: Basic Airflow DAG structure
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

# Assume extract_from_source, apply_transformations, and load_to_redshift
# are callables defined elsewhere in your module
default_args = {
    'owner': 'data_engineering',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'daily_data_pipeline',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract_task = PythonOperator(
        task_id='extract_data',
        python_callable=extract_from_source,
    )
    transform_task = PythonOperator(
        task_id='transform_data',
        python_callable=apply_transformations,
    )
    load_task = PythonOperator(
        task_id='load_to_warehouse',
        python_callable=load_to_redshift,
    )

    extract_task >> transform_task >> load_task
```
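XComs deserve a quick illustration, since they trip up newcomers. Here's a minimal sketch using the TaskFlow API (the task names and values are illustrative); note that XComs are meant for small metadata values, not for passing DataFrames between tasks:

```python
# A minimal sketch of passing small values between tasks via XComs,
# using the TaskFlow API (Airflow 2.x). Task names are illustrative.
from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule_interval='@daily', start_date=datetime(2024, 1, 1), catchup=False)
def xcom_example():
    @task
    def extract_row_count() -> int:
        return 42  # returned values are pushed to XCom automatically

    @task
    def validate(row_count: int) -> None:
        if row_count == 0:
            raise ValueError("No rows extracted, failing loudly")

    validate(extract_row_count())  # the XCom pull happens implicitly

xcom_example()
```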
Data Quality Fundamentals
This is where junior engineers often stumble. Production data is messy. At Optum, I built a Python-based data quality tool that automated monthly QA checks, reducing manual validation effort by 80%.
- Completeness: Are required fields populated?
- Accuracy: Does the data reflect reality?
- Consistency: Does data match across sources?
- Timeliness: Is data fresh enough for use cases?
- Uniqueness: Are there unexpected duplicates?
Build quality in from the start:
- Add validation steps to every pipeline
- Set up alerting for anomalies (row count drops, null spikes)
- Document expected data contracts
- Build self-healing mechanisms where possible
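As a minimal sketch of what those validation steps can look like in practice (the column names and the 5% alert threshold are assumptions, not a universal standard):

```python
# A minimal sketch of pipeline validation checks, assuming a pandas
# DataFrame with illustrative columns (id, email).
import pandas as pd

def validate(df: pd.DataFrame, min_rows: int = 1) -> None:
    """Fail loudly if the batch violates basic data contracts."""
    errors = []
    if len(df) < min_rows:
        errors.append(f"Row count {len(df)} below minimum {min_rows}")
    if df['id'].isna().any():
        errors.append("Null values in required column 'id'")
    if df['id'].duplicated().any():
        errors.append("Unexpected duplicate ids")
    null_email_pct = df['email'].isna().mean()
    if null_email_pct > 0.05:  # alert threshold is an assumption
        errors.append(f"Null email rate {null_email_pct:.1%} exceeds 5%")
    if errors:
        raise ValueError("Data quality checks failed: " + "; ".join(errors))
```

A pipeline that fails loudly here is doing its job: a wrong-but-successful run is the failure mode you most want to prevent.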
What Junior Engineers Should Focus On
Do:
- Write clean, documented code
- Ask questions when requirements are unclear
- Take ownership of assigned pipelines
- Learn from code reviews
- Understand the business context of your data

Don't:
- Over-engineer solutions before understanding requirements
- Skip testing because "it works locally"
- Ignore monitoring and alerting
- Stay in your comfort zone
The biggest leap from bootcamp to production was understanding that data quality isn't an afterthought — it's the core of the job. A pipeline that runs but produces wrong data is worse than one that fails loudly.
Junior stage is about learning to build reliable, production-grade pipelines. Focus on cloud fundamentals, orchestration, and data quality. Every pipeline you build should have monitoring and validation built in.
The jump from junior to mid-level is less about learning new tools and more about thinking differently. You're no longer just implementing — you're designing.
Architectural Thinking
Mid-level engineers make architectural decisions that affect performance, cost, and maintainability. Before committing to a design, ask:
- What's the data volume today? In 6 months? In 2 years?
- What are the latency requirements?
- How will this integrate with existing systems?
- What happens when this fails?
- What's the cost at scale?
At Optum, I led the design of transformation workflows for the Optum Care Delivery platform. This required thinking beyond single pipelines to how data flows across the entire organization.
Advanced Spark/PySpark
Spark becomes essential when data volumes exceed what Pandas can handle (typically > 1-10 GB).
- Lazy evaluation and DAG execution
- Partitioning strategies
- Broadcast joins vs. shuffle joins
- Caching and persistence
- Spark SQL and DataFrame API
- Performance tuning (spark.conf settings)
```python
# Example: Optimized PySpark transformation
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("OptimizedETL").getOrCreate()

# Read with schema enforcement (faster than inference);
# defined_schema and apply_business_rules are defined elsewhere
df = spark.read.schema(defined_schema).parquet("s3://bucket/raw/")

# Broadcast small dimension table for efficient join
dim_state = spark.read.parquet("s3://bucket/dim/states/")
df_enriched = df.join(
    broadcast(dim_state),  # Broadcast hint for small tables
    df.state_code == dim_state.code,
    "left"
)

# Repartition before expensive operations
df_processed = (
    df_enriched
    .repartition(200, "partition_key")  # Control parallelism
    .transform(apply_business_rules)
    .cache()  # Cache if reused
)

# Write with optimized partitioning
df_processed.write \
    .partitionBy("date", "region") \
    .mode("overwrite") \
    .parquet("s3://bucket/processed/")
```
Real-Time Data Processing
At Optum, I integrated Kafka-based real-time member identity streams into the data lake. This was a game-changer for enabling centralized, de-duplicated member data across teams.
- Apache Kafka for message streaming
- Event-driven architectures
- Exactly-once vs. at-least-once delivery
- Windowing and watermarks
- Stream-batch integration patterns
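To ground the stream-batch integration pattern, here's a minimal Spark Structured Streaming sketch that lands Kafka events in the lake for batch consumers. The topic, broker, schema, and paths are hypothetical (not the actual Optum setup), and running it requires the spark-sql-kafka connector package:

```python
# A minimal sketch of landing a Kafka stream in the data lake with
# Spark Structured Streaming. Names and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("IdentityStream").getOrCreate()

schema = StructType([
    StructField("member_id", StringType()),
    StructField("source_system", StringType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "member-identity")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Micro-batch writes land the stream in the lake for batch consumers
query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3://bucket/identity/")
    .option("checkpointLocation", "s3://bucket/checkpoints/identity/")
    .trigger(processingTime="1 minute")
    .start()
)
# query.awaitTermination()  # block here when running as a standalone job
```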
| Aspect | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Seconds to minutes |
| Complexity | Lower | Higher |
| Cost | Generally lower | Generally higher |
| Use Cases | Reports, analytics, ML training | Real-time dashboards, alerts, fraud detection |
| Tools | Spark, Airflow, dbt | Kafka, Flink, Spark Streaming |
Leading Projects
Mid-level engineers lead significant initiatives. At Optum, I built the JSON-to-Airflow DAG code generation framework that:
- Migrated 35+ ETL pipelines from Talend
- Reduced average execution time by 20%
- Saved $120K annually in infrastructure and licensing costs
Leading a project like this means:
- Scoping work and breaking it into phases
- Making technical decisions and documenting rationale
- Coordinating with dependent teams
- Handling blockers and escalating appropriately
- Delivering on commitments
Mentoring Junior Engineers
Teaching others is how you solidify your own understanding and demonstrate leadership potential.
- Code reviews that explain the "why"
- Pairing on complex problems
- Creating documentation and runbooks
- Being available without being a bottleneck
Mid-level is about ownership and impact. You design systems, lead projects, and multiply team effectiveness through mentoring. The technical skills matter, but the mindset shift to thinking about systems and teams is what defines this stage.
Senior data engineers shape technical direction and solve the hardest problems. My research at UIC's Electronic Visualization Laboratory, where I contributed to high-performance computing initiatives, gave me exposure to this level of technical thinking.
Technical Strategy
Seniors think beyond individual projects to organizational capabilities:
- What data infrastructure do we need in 2-3 years?
- How do we reduce technical debt while delivering features?
- What build-vs-buy decisions should we make?
- How do we scale the team's capabilities?
Cross-Team Influence
Seniors work across organizational boundaries:
- Aligning data architecture with business strategy
- Building relationships with engineering, analytics, and product teams
- Standardizing practices across the organization
- Representing data engineering in technical decisions
Platform Thinking
Instead of building pipelines, seniors build platforms that enable others to build pipelines:
- Self-service data infrastructure
- Reusable frameworks and templates
- Governance and security patterns
- Observability and debugging tools
Senior data engineers have organizational impact beyond their direct work. They shape technical direction, influence cross-team decisions, and build platforms that multiply the effectiveness of entire data organizations.
Let me break down the skills that matter most, based on what I've actually used in production.
Python for Data Engineering
Strengths:
- Universal language across the data stack
- Rich ecosystem (Pandas, PySpark, Airflow)
- Easy to read and maintain
- Great for scripting and automation
- Strong community and resources

Limitations:
- Slower than compiled languages for compute-heavy tasks
- GIL limits true parallelism
- Type safety requires discipline (use type hints!)
- Dependency management can be messy

What separates production engineers:
- Writing production-quality code (not just scripts that work)
- Understanding performance implications
- Using type hints and static analysis
- Testing your code properly
SQL Mastery
SQL is where most data engineers spend their time. Master these patterns:
```sql
-- Running totals and comparisons
SELECT
    date,
    revenue,
    SUM(revenue) OVER (ORDER BY date) AS cumulative_revenue,
    revenue - LAG(revenue) OVER (ORDER BY date) AS daily_change,
    revenue / NULLIF(LAG(revenue) OVER (ORDER BY date), 0) - 1 AS pct_change
FROM daily_sales;

-- Validate data completeness
SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT member_id) AS unique_members,
    SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_emails,
    SUM(CASE WHEN created_at > CURRENT_DATE THEN 1 ELSE 0 END) AS future_dates
FROM members
WHERE load_date = CURRENT_DATE;
```
Distributed Processing (Spark)
Spark is essential when data doesn't fit on a single machine. Key areas:
- Understanding the execution model
- Optimizing shuffles (the expensive operation)
- Partitioning strategies
- Memory management
- Integration with cloud storage
I've built production systems on both. Here's my honest comparison.
| Aspect | AWS | Azure |
|---|---|---|
| Market Share | Larger, more job opportunities | Growing, strong in enterprise |
| Data Lake | S3 (industry standard) | ADLS Gen2 (excellent) |
| Data Warehouse | Redshift (mature) | Synapse (newer, integrated) |
| Spark Platform | EMR, Glue | Databricks (excellent integration) |
| ETL Service | Glue (good) | Data Factory (comprehensive) |
| Serverless Query | Athena (great) | Synapse Serverless |
| Learning Curve | Steeper, more services | Gentler if you know Microsoft |
AWS Data Engineering Stack
My projects at Optum and personal work used this stack: S3 for the data lake, Glue and EMR for processing, Redshift for warehousing, Athena for ad-hoc queries, and Airflow for orchestration.
Azure Data Engineering Stack
My Formula 1 data engineering project used ADLS Gen2 for storage, Azure Databricks for Spark processing, and Data Factory for orchestration.
This follows the medallion architecture (bronze, silver, gold layers) that's becoming standard for lakehouse designs.
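Here's a minimal PySpark sketch of the three layers; the paths, columns, and aggregation are illustrative, not the actual project code:

```python
# A minimal sketch of bronze/silver/gold layers; paths and columns are
# illustrative. Swap in your cloud storage URIs (s3:// or abfss://).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("MedallionDemo").getOrCreate()

# Bronze: land raw data exactly as ingested
bronze = spark.read.json("/lake/raw/races/")
bronze.write.mode("overwrite").parquet("/lake/bronze/races/")

# Silver: typed, de-duplicated, validated
silver = (
    bronze
    .dropDuplicates(["race_id"])
    .withColumn("race_date", F.to_date("race_date"))
    .filter(F.col("race_id").isNotNull())
)
silver.write.mode("overwrite").parquet("/lake/silver/races/")

# Gold: aggregated, analytics-ready tables
gold = silver.groupBy("season").agg(F.count("race_id").alias("race_count"))
gold.write.mode("overwrite").parquet("/lake/gold/race_counts/")
```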
Learn one cloud deeply first. The concepts transfer — once you understand S3, ADLS makes sense. Once you know Airflow, Data Factory is familiar. Deep knowledge in one platform beats shallow knowledge in both.
Your portfolio demonstrates what you can build. Here's how to create projects that actually impress.
What Makes a Good Portfolio Project
Include:
- End-to-end pipelines (not just one component)
- Real data sources (APIs, public datasets)
- Data quality checks and monitoring
- Clear documentation and architecture diagrams
- Cloud deployment (not just local)

Avoid:
- Tutorial copy-paste projects
- Projects without data quality considerations
- Unfinished work
- Projects without clear business context
Project Ideas That Demonstrate Skill
1. Real-time streaming pipeline:
- Ingest streaming data (Twitter API, stock prices, IoT sensors)
- Process with Kafka + Spark Streaming
- Store in time-series database
- Visualize in dashboard

2. Medallion-architecture data lake:
- Raw (bronze) → Cleaned (silver) → Aggregated (gold)
- Implement on AWS or Azure
- Include data quality checks at each layer
- Document the transformation logic

3. Reusable pipeline framework (see the sketch after this list):
- Config-driven pipeline generation (like my JSON-to-Airflow work)
- Parameterized for different sources
- Includes logging, alerting, retry logic
- Deployed on cloud with CI/CD
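To make the third idea concrete, here's a toy sketch of config-driven DAG generation. It captures the general pattern, not the actual Optum framework; the config shape and task specs are invented for illustration:

```python
# A toy sketch of config-driven DAG generation: the general idea behind
# JSON-to-Airflow frameworks, not the actual Optum implementation.
import json
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

# Hypothetical config: one entry per pipeline
config = json.loads("""
{
  "dag_id": "sales_pipeline",
  "schedule": "@daily",
  "tasks": [
    {"id": "extract", "command": "python extract.py"},
    {"id": "load", "command": "python load.py", "upstream": "extract"}
  ]
}
""")

with DAG(
    config["dag_id"],
    schedule_interval=config["schedule"],
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    tasks = {}
    for spec in config["tasks"]:
        tasks[spec["id"]] = BashOperator(
            task_id=spec["id"], bash_command=spec["command"]
        )
    # Wire dependencies declared in the config
    for spec in config["tasks"]:
        if "upstream" in spec:
            tasks[spec["upstream"]] >> tasks[spec["id"]]
```

The payoff of this pattern is consistency: every generated pipeline gets the same retry, logging, and alerting behavior for free, and adding a pipeline becomes a config change instead of new code.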
Portfolio projects should demonstrate end-to-end thinking, not just technical skills. Include data quality, monitoring, and documentation. Deploy on cloud infrastructure. Make it easy for reviewers to understand what you built and why.
Let me walk through a real system I built — the healthcare data standardization platform at Optum.
The Problem
Optum's Care Delivery platform needed to analyze cost efficiency and operational outcomes across multiple business units. The challenge:
- Data from 20+ US states with different formats
- Multiple source systems with inconsistent schemas
- Millions of member records requiring identity resolution
- Strict healthcare compliance requirements (HIPAA)
- Multiple downstream teams depending on reliable data
The Solution Architecture
```
Source Systems → Extraction Layer → Standardization → Data Lake   → Analytics
 20+ states       Talend/Airflow     Transform         S3/HDFS       Redshift
 Various APIs     Scheduled jobs     Data model        Partitioned   Reports
 Files/DBs        Error handling     Quality checks    Gold layer    Dashboards
```
Key Components I Built
1. State data standardization. Each state had different data formats, field names, and business rules. I designed transformation workflows that:
- Mapped heterogeneous schemas to a unified data model
- Handled edge cases specific to each state
- Maintained data lineage for compliance
- Supported incremental and full loads
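A minimal sketch of the schema-mapping idea (the column names and state codes are hypothetical, not Optum's actual mappings):

```python
# A minimal sketch of mapping heterogeneous state schemas to one model.
# Column names and state codes are hypothetical.
import pandas as pd

# Per-source column mappings to the unified schema
STATE_MAPPINGS = {
    "TX": {"mbr_num": "member_id", "svc_dt": "service_date"},
    "CA": {"MemberNumber": "member_id", "ServiceDate": "service_date"},
}

REQUIRED = ["member_id", "service_date"]

def standardize(df: pd.DataFrame, state: str) -> pd.DataFrame:
    """Rename source columns to the unified model and check completeness."""
    out = df.rename(columns=STATE_MAPPINGS[state])
    missing = [c for c in REQUIRED if c not in out.columns]
    if missing:
        raise ValueError(f"{state}: missing required columns {missing}")
    out["source_state"] = state  # preserve lineage for compliance
    return out[REQUIRED + ["source_state"]]
```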
2. Automated data quality tooling. Manual QA was taking days of engineering time monthly. I built an automated tool that:
- Ran configurable validation rules against datasets
- Generated reports highlighting anomalies
- Compared metrics against historical baselines
- Reduced manual validation effort by 80%
3. Real-time identity integration. Member identity data was scattered across systems, leading to duplicates and inconsistencies. I integrated Kafka-based real-time streams that:
- Centralized member master data
- De-duplicated records across sources
- Enabled near-real-time updates
- Improved downstream analytics reliability
4. Pipeline migration framework. We needed to migrate 35+ pipelines from Talend to Airflow. Instead of manual conversion, I built a framework that:
- Read pipeline definitions from JSON config files
- Generated Airflow DAG code automatically
- Maintained consistency across all pipelines
- Reduced migration time dramatically
- Saved $120K annually in licensing costs
Lessons Learned
Getting recognized as 'Talent in Spotlight' twice wasn't about building the flashiest systems — it was about reliably delivering data that business teams could trust. In healthcare, that reliability directly affects patient care decisions.
Certifications can accelerate your career, but they're not a substitute for hands-on experience.
Recommended Certifications
AWS:
- AWS Certified Data Engineer – Associate (new, directly relevant)
- AWS Certified Solutions Architect – Associate (foundational)
- AWS Certified Database – Specialty (if database-focused)

Azure:
- Microsoft Certified: Azure Data Engineer Associate (DP-203)
- Microsoft Certified: Azure Fundamentals (AZ-900, start here)

Databricks:
- Databricks Certified Data Engineer Associate
- Databricks Certified Data Engineer Professional
Certification Strategy
Get certified:
- After 6-12 months of hands-on experience
- When job postings in your target role require them
- When you need structured learning for new platforms

Don't get certified:
- Before you have practical experience to contextualize the material
- To collect credentials without applying the knowledge
- At the expense of building real projects
I got my first data engineering role without certifications. What mattered was demonstrating I could build and deliver. Certifications helped later for specific opportunities, but they were never the primary factor.
Common Mistakes to Avoid
- Focusing on tools over fundamentals — Python and SQL mastery matters more than knowing every framework
- Building toy projects that don't demonstrate production thinking
- Avoiding the messy, complex work that builds real skills
- Skipping data quality — every pipeline needs validation and monitoring
- Working in isolation instead of understanding business context
- Over-engineering before understanding requirements
- Not documenting your work and impact for performance reviews
- Staying too long in comfort zone without taking on new challenges
Mistake Deep Dive: Tool Collecting
It's tempting to learn every new tool that appears on Hacker News. Don't. Depth in Python, SQL, Spark, and one cloud platform beats shallow familiarity with twenty tools.
Mistake Deep Dive: Avoiding Complex Work
Early in my career, I learned the most from the projects nobody wanted — the messy data integration work with unclear requirements and legacy systems.
1. Foundation (0-6 months): Master Python and SQL — everything builds on these fundamentals
2. Junior (6-18 months): Learn cloud basics, orchestration (Airflow), and data quality practices
3. Mid-Level (18-36 months): Develop architectural thinking, Spark proficiency, and project leadership skills
4. Senior (36+ months): Focus on technical strategy, cross-team influence, and platform thinking
5. Build portfolio projects that demonstrate end-to-end thinking, not just technical skills
6. Choose AWS or Azure and learn it deeply before expanding to the other
7. Certifications help but don't replace hands-on experience building real systems
8. The fastest path to growth: take on complex, messy problems that others avoid
How do I become a data engineer with no experience?
Start with Python and SQL fundamentals through online courses or bootcamps. Build 2-3 portfolio projects using public datasets and cloud free tiers (AWS Free Tier, Azure credits). Focus on end-to-end pipelines that include extraction, transformation, loading, and basic quality checks. Apply for entry-level data engineering or analytics engineering roles, or adjacent roles (data analyst, ETL developer) that can transition to data engineering.
What's the difference between data engineer and data scientist?
Data engineers build and maintain the infrastructure that makes data available — pipelines, warehouses, and data quality systems. Data scientists analyze that data to extract insights and build models. Data engineering is more software engineering; data science is more statistics and ML. Both are essential, and many organizations need more data engineers than data scientists.
Is data engineering a good career in 2026?
Yes. Organizations are drowning in data and need engineers to make it usable. Demand continues to grow as companies invest in data platforms, real-time analytics, and AI/ML infrastructure (which requires solid data foundations). Salaries remain strong, with mid-level engineers typically earning $120K-$160K in the US.
Should I learn SQL or Python first?
SQL. You'll use it immediately in any data role, and it's faster to become productive. Once comfortable with SQL, add Python for more complex transformations, automation, and working with APIs. Both are non-negotiable for data engineering.
How important is Spark for data engineers?
Very important for mid-level and beyond. When data volumes exceed what Pandas can handle (typically 1-10GB), you need distributed processing. Spark is the industry standard. Learn it after you're comfortable with Python and have built some pipelines — it's easier to understand when you know why you need it.
What's the best way to learn cloud platforms?
Hands-on projects using free tiers. AWS Free Tier and Azure credits give you enough resources to build real pipelines. Follow along with tutorials, then build your own project without guidance. The struggle of figuring things out yourself is where learning happens.
How do I transition from software engineering to data engineering?
You're already halfway there. Software engineers have the coding and system design skills. Add: SQL proficiency, understanding of data modeling, cloud data services (S3, Redshift or equivalent), and orchestration tools (Airflow). Build a portfolio project demonstrating data pipeline work. The transition is common and valued — software engineering rigor improves data engineering quality.