The interviewer pulls up a whiteboard and says: "Design a real-time pipeline that handles 10 million events per second with exactly-once delivery guarantees."
You know Kafka. You know Spark. You've built pipelines. But your brain goes blank because you've been studying LeetCode arrays and strings for the last two weeks — and this interview isn't about algorithms at all.
Data engineer interviews are unlike any other technical interview. They test systems thinking, not syntax. They care about trade-offs, not textbook answers. And most candidates prepare for the wrong thing because they follow software engineering interview advice that doesn't apply.
What questions are asked in a data engineer interview?
Data engineer interviews typically cover five areas: SQL (window functions, CTEs, query optimization, data modeling), Python (PySpark, pandas, data manipulation, coding challenges), system design (pipeline architecture, warehouse design, streaming systems), cloud platforms (AWS/Azure/GCP services), and behavioral questions (production incidents, stakeholder conflicts, technical trade-offs).
How do I prepare for a data engineer interview?
Focus on three pillars: (1) SQL fluency — practice window functions, CTEs, and optimization daily; (2) system design thinking — study Kleppmann's DDIA for distributed systems principles and practice designing data pipelines on a whiteboard; (3) behavioral preparation — prepare 4–5 STAR stories about production incidents, stakeholder conflicts, and technical decisions.
What is the hardest part of a data engineer interview?
System design rounds. Unlike SQL or Python coding, there's no single correct answer. Interviewers evaluate your ability to reason about trade-offs (batch vs streaming, cost vs latency, consistency vs availability), ask clarifying questions, and design systems that handle failure gracefully — concepts from distributed systems theory.
How long should I prepare for a data engineer interview?
2–4 weeks of focused preparation is typical for experienced engineers. If you're transitioning from another role (analyst, SWE), plan for 4–6 weeks. The key is daily practice: SQL on weekdays, system design on weekends, and behavioral prep throughout.
Data Engineer Interview Process
A typical data engineer interview consists of 4–5 rounds: recruiter screen (30 min), SQL/coding assessment (60 min), system design (60 min), behavioral/cultural (45 min), and sometimes a take-home project. The process takes 2–4 weeks from first contact to offer.
| Round | Duration | What's Tested | Difficulty |
|---|---|---|---|
| Recruiter Screen | 30 min | Experience fit, salary expectations, basic SQL knowledge | Low |
| SQL/Coding Assessment | 60 min | Window functions, CTEs, optimization, Python data manipulation | Medium–High |
| System Design | 60 min | Pipeline architecture, trade-offs, distributed systems thinking | High |
| Behavioral | 45 min | Production incidents, stakeholder management, technical decisions | Medium |
| Take-Home (some companies) | 2–4 hours | End-to-end pipeline build, data modeling, code quality | Medium |
SQL is the most heavily tested skill in data engineer interviews. You'll face questions ranging from basic aggregations to complex window functions and query optimization.
Q1: Explain the difference between ROW_NUMBER(), RANK(), and DENSE_RANK()
- `ROW_NUMBER()` assigns a unique sequential integer to each row — no ties, no gaps. Use it for deduplication (keeping the latest record per key).
- `RANK()` assigns the same rank to tied values but skips the next rank(s). If two rows tie at rank 2, the next row gets rank 4.
- `DENSE_RANK()` assigns the same rank to tied values but does NOT skip ranks. If two rows tie at rank 2, the next row gets rank 3.
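The difference is easiest to see side by side. A quick sketch using Python's built-in sqlite3 (window functions require SQLite 3.25+; the `scores` table is hypothetical):

```python
import sqlite3

# Toy data with a tie at 80 to expose the three behaviors
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 90), ("b", 80), ("c", 80), ("d", 70)],
)

rows = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
    FROM scores
    ORDER BY score DESC, name
""").fetchall()

# b and c tie: both get rnk = 2 and dense_rnk = 2, but distinct row_nums.
# d (the next value) gets rnk = 4 (gap) and dense_rnk = 3 (no gap).
```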
When to use each: `ROW_NUMBER()` for deduplication and pagination; `RANK()` when you need to show gaps (leaderboards where ties matter); `DENSE_RANK()` for "top N per group" queries where you want exactly N distinct values.

Q2: Write a query to find the second highest salary in each department
```sql
SELECT department_id, employee_id, salary
FROM (
    SELECT
        department_id,
        employee_id,
        salary,
        DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk
    FROM employees
) ranked
WHERE rnk = 2;
```
Why DENSE_RANK over ROW_NUMBER: if two employees share the highest salary, ROW_NUMBER would arbitrarily assign one of them rank 2 — giving a wrong answer. DENSE_RANK correctly identifies the actual second-highest salary value.

Q3: What's a CTE and when would you use one over a subquery?
A CTE (Common Table Expression) is a named temporary result set defined with `WITH`. CTEs are preferred when: (1) the same result set is referenced multiple times, (2) the query has multiple transformation stages (readability), or (3) you need recursive queries. Subqueries are fine for simple, one-off inline filters. Most query engines compile CTEs and subqueries to the same execution plan — the difference is readability and maintainability, not performance.

Q4: How would you optimize a slow-running query?
- Read the execution plan — identify full table scans, sort operations, and joins with high row estimates
- Check indexing — are filter columns and join keys indexed? Are indexes being used (watch for implicit type casts that bypass indexes)?
- Reduce data early — push filters before joins, use `WHERE` before `HAVING`, and limit the columns selected
- Evaluate join strategy — nested loop vs hash join vs merge join. For large-to-large table joins, hash joins are usually optimal
- Consider partitioning — if filtering on date ranges, is the table partitioned by date?
- Materialize intermediate results — if a complex subquery runs multiple times, materialize it as a temp table
Q5: Explain slowly changing dimensions (SCD Type 1, 2, and 3)
- Type 1 (Overwrite): Update the row in place. Simple, no history. Use when historical accuracy doesn't matter (fixing a typo in a customer name).
- Type 2 (Add Row): Insert a new row with a new surrogate key, mark the old row as expired (using `effective_date`/`expiration_date`/`is_current` flags). Preserves full history. Use for business-critical attributes (customer address, product pricing).
- Type 3 (Add Column): Add a `previous_value` column alongside the current value. Tracks one level of change. Rarely used — Type 2 is almost always preferred in practice.
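The Type 2 mechanics — expire the current row, append the new version — fit in a few lines. A minimal sketch with a hypothetical dict-based dimension table (a warehouse would do this with a MERGE statement):

```python
def scd2_update(history, key, new_attrs, today):
    """SCD Type 2 sketch: expire the current row for `key`, append a new one."""
    for row in history:
        if row["key"] == key and row["is_current"]:
            row["is_current"] = False
            row["expiration_date"] = today
    history.append({
        "key": key,
        **new_attrs,
        "effective_date": today,
        "expiration_date": None,  # open-ended until the next change
        "is_current": True,
    })
    return history

# A customer moves: the old address is preserved, not overwritten
dim = [{"key": "cust-1", "address": "Old St", "effective_date": "2023-01-01",
        "expiration_date": None, "is_current": True}]
scd2_update(dim, "cust-1", {"address": "New Ave"}, "2024-06-01")
```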
Q6: What is the difference between DELETE, TRUNCATE, and DROP?
- `DELETE` removes rows matching a condition. Fully logged, can be rolled back, triggers fire. Slowest for large deletes.
- `TRUNCATE` removes ALL rows from a table. Minimally logged, cannot target specific rows, resets identity counters. Much faster than `DELETE` for full clears.
- `DROP` removes the entire table object (data + schema + indexes). The table no longer exists.
Q7: How would you deduplicate a table with millions of rows?
```sql
-- Keep the most recent record per entity
DELETE FROM staging.events
WHERE id NOT IN (
    SELECT id FROM (
        SELECT
            id,
            ROW_NUMBER() OVER (
                PARTITION BY entity_id
                ORDER BY updated_at DESC
            ) AS rn
        FROM staging.events
    ) deduped
    WHERE rn = 1
);
```
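A runnable sketch of the create-and-swap alternative (often faster than a large DELETE), using sqlite3 with a hypothetical `events` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER, entity_id INTEGER, updated_at TEXT);
    INSERT INTO events VALUES
        (1, 100, '2024-01-01'), (2, 100, '2024-01-02'), (3, 200, '2024-01-01');

    -- Materialize only the latest row per entity, then swap table names
    CREATE TABLE events_dedup AS
    SELECT id, entity_id, updated_at FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY entity_id ORDER BY updated_at DESC
        ) AS rn
        FROM events
    ) WHERE rn = 1;

    DROP TABLE events;
    ALTER TABLE events_dedup RENAME TO events;
""")
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]  # 2 rows survive
```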
At scale, a `DELETE` with a subquery may be too slow. Instead: `CREATE TABLE ... AS SELECT` the deduplicated result, then swap the tables. This is a common pattern in Redshift and BigQuery, where DDL operations are cheap.

Q8: Explain star schema vs snowflake schema
- Star schema: Fact table at the center surrounded by denormalized dimension tables. One join to reach any dimension. Simpler queries, better query performance, more storage.
- Snowflake schema: Dimension tables are normalized into sub-dimension tables. Multiple joins required. Less storage redundancy, more complex queries.
Q9: Write a query to calculate a 7-day rolling average of daily revenue
```sql
SELECT
    order_date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7d_avg
FROM daily_revenue_summary
ORDER BY order_date;
```
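The frame semantics can be verified on a toy table with sqlite3 — note that the first six rows average over a partial window, which is worth mentioning in an interview:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_revenue_summary (order_date TEXT, daily_revenue REAL)"
)
# 8 days of revenue: 100, 200, ..., 800
conn.executemany(
    "INSERT INTO daily_revenue_summary VALUES (?, ?)",
    [(f"2024-01-0{d}", 100.0 * d) for d in range(1, 9)],
)

rows = conn.execute("""
    SELECT order_date,
           AVG(daily_revenue) OVER (
               ORDER BY order_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS rolling_7d_avg
    FROM daily_revenue_summary
    ORDER BY order_date
""").fetchall()

# Day 1 averages only itself (partial window); day 7 averages days 1-7;
# day 8 averages days 2-8 — the frame slides, always at most 7 rows.
```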
`ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` gives a window of at most 7 rows (6 prior + current; the first six rows have partial windows). Using `RANGE` instead of `ROWS` would handle gaps in dates differently — mention this distinction to show depth.

Q10: What is a fact table vs a dimension table?
- Fact tables store measurable business events (orders, clicks, transactions). Contain foreign keys to dimension tables and numeric measures (amount, quantity, duration). Typically very large (millions–billions of rows).
- Dimension tables store descriptive context about the entities involved in facts (customers, products, dates, locations). Contain attributes used for filtering, grouping, and labeling. Typically smaller (thousands–millions of rows).
SQL questions test three layers: syntax fluency (can you write window functions?), modeling knowledge (do you understand Kimball?), and production thinking (how would you handle this at scale?). Prepare across all three.
Python questions for data engineers differ from general Python interviews. They focus on data manipulation, PySpark, and practical pipeline coding — not algorithms and data structures.
Q11: What's the difference between a list and a generator in Python?
A list stores all of its elements in memory at once; a generator produces values lazily, one at a time, holding only its current state. For data engineering, generators let you stream large datasets with constant memory — the same idea behind chunked file reading and Spark's lazy evaluation.
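A quick demonstration of the memory difference, using `sys.getsizeof` (exact byte counts vary by Python version):

```python
import sys

squares_list = [i * i for i in range(1_000_000)]  # all values materialized now
squares_gen = (i * i for i in range(1_000_000))   # values produced on demand

print(sys.getsizeof(squares_list))  # megabytes — grows with the data
print(sys.getsizeof(squares_gen))   # ~a hundred bytes, regardless of range size

# Same values either way — but the generator can only be consumed once
assert sum(squares_gen) == sum(squares_list)
```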
Q12: How would you process a 50GB CSV file that doesn't fit in memory?
```python
import pandas as pd

# Chunked reading — process in memory-sized batches
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    # Transform each chunk, then write it out before reading the next
    transformed = chunk[chunk['status'] == 'active']
    part_id = chunk.index[0]  # offset of the chunk's first row
    transformed.to_parquet(f'output/part_{part_id}.parquet')
```
Alternatives: (1) use `dask` for parallel chunk processing, (2) use PySpark for distributed processing across a cluster, (3) convert to Parquet first (columnar format, much smaller), (4) use DuckDB for SQL on files without loading them into memory.

Q13: Explain the difference between map(), filter(), and reduce()
- `map()` — applies a function to each element, returning the transformed elements. `map(lambda x: x * 2, [1, 2, 3])` → `[2, 4, 6]`
- `filter()` — applies a boolean function, returning only the elements where it is True. `filter(lambda x: x > 1, [1, 2, 3])` → `[2, 3]`
- `reduce()` — applies a function cumulatively to reduce a sequence to a single value. `reduce(lambda a, b: a + b, [1, 2, 3])` → `6`
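These can be checked directly in a REPL — note that in Python 3, `reduce` must be imported from `functools`:

```python
from functools import reduce  # reduce moved out of builtins in Python 3

nums = [1, 2, 3]
doubled = list(map(lambda x: x * 2, nums))  # transform every element
kept = list(filter(lambda x: x > 1, nums))  # keep elements passing the test
total = reduce(lambda a, b: a + b, nums)    # fold the sequence to one value
```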
These concepts map directly to Spark's `rdd.map()`, `rdd.filter()`, and `rdd.reduce()`. Understanding them is fundamental to Spark programming.

Q14: What's the difference between PySpark DataFrames and pandas DataFrames?
- pandas: Single-machine, in-memory. Data must fit in RAM. Eager execution (operations run immediately). Great for exploration and datasets up to ~10GB.
- PySpark: Distributed across a cluster. Data is partitioned across nodes. Lazy execution (transformations are planned, only executed on action). Handles TB–PB scale.
In PySpark, `df.filter()` and `df.select()` build an execution plan but don't process data until an action (`collect()`, `count()`, `write()`) triggers execution. This allows Spark to reorder and combine operations to optimize the entire pipeline before running it — the kind of dataflow optimization Kleppmann discusses in DDIA.

Q15: How do you handle schema evolution in a pipeline?
- Detection: Compare incoming schema against expected schema before processing. Use schema registries (Confluent Schema Registry for Kafka, Glue Schema Registry for AWS).
- Strategy: Define what to do for each change type:
- New column added → accept it, default to NULL for downstream
- Column removed → alert, don't fail if column is optional
- Type changed → fail fast, require manual review
- Implementation: In Spark, use the `mergeSchema` option for Parquet reads. In dbt, use `on_schema_change: 'append_new_columns'`. Always version your schemas.
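The detection step can be sketched as a simple schema diff. This is a hypothetical helper (not a schema-registry API) comparing `{column: type}` mappings:

```python
def diff_schema(expected, incoming):
    """Classify schema changes so each type can get its own handling strategy."""
    added = {c: t for c, t in incoming.items() if c not in expected}
    removed = {c: t for c, t in expected.items() if c not in incoming}
    changed = {
        c: (expected[c], incoming[c])
        for c in expected.keys() & incoming.keys()
        if expected[c] != incoming[c]
    }
    return added, removed, changed

expected = {"user_id": "bigint", "email": "string", "signup_date": "date"}
incoming = {"user_id": "bigint", "email": "string", "country": "string"}
added, removed, changed = diff_schema(expected, incoming)
# added -> accept with NULL default; removed -> alert; changed -> fail fast
```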
Q16: Write a PySpark job to deduplicate a dataset by keeping the latest record per user
```python
from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# Rank each user's records by recency, then keep only the newest
window_spec = Window.partitionBy("user_id").orderBy(col("updated_at").desc())

deduped = (
    df.withColumn("rn", row_number().over(window_spec))
      .filter(col("rn") == 1)
      .drop("rn")
)
```
Q17: What is the GIL and why does it matter for data engineering?
The GIL (Global Interpreter Lock) allows only one thread to execute Python bytecode at a time, so threads don't speed up CPU-bound work. Workarounds: (1) use `multiprocessing` for CPU-bound work, (2) use PySpark/Dask, which distribute work across separate processes/nodes, (3) use libraries like NumPy/pandas that release the GIL during C-level operations. For I/O-bound work (API calls, file reads), threading still works fine because the GIL is released during I/O waits.

Q18: How would you write a custom PySpark UDF and when should you avoid them?
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def clean_email(email):
    return email.strip().lower() if email else None

df = df.withColumn("email_clean", clean_email(df.email))
```
Avoid UDFs when a built-in function exists (`lower()`, `trim()`, `regexp_replace()`) — built-ins execute natively in the JVM, while Python UDFs serialize each row between the JVM and Python. Use UDFs only when no built-in function exists. For complex logic, consider Pandas UDFs (vectorized), which process batches of rows via Arrow — much faster than row-at-a-time UDFs.

Q19: Explain *args and **kwargs
`*args` collects positional arguments into a tuple; `**kwargs` collects keyword arguments into a dictionary. Both let you write flexible functions that accept variable numbers of arguments. In data engineering, you'll see these in pipeline factories and configuration-driven frameworks where the number of parameters varies by data source.

Q20: How do you test a data pipeline?
- Unit tests — test individual transformation functions with known input/output. Use pytest with small sample DataFrames.
- Integration tests — test that pipeline stages connect correctly (source → transform → sink). Use test databases or mock external services.
- Data quality tests — validate output data: row counts, null percentages, value distributions, schema conformance. In production, use Great Expectations or dbt tests.
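A minimal example of the unit-test layer: a hypothetical `dedupe_latest` transform plus a pytest-style test with known input and output (runnable as plain asserts, no framework required):

```python
def dedupe_latest(rows, key="user_id", order="updated_at"):
    """Keep the most recent record per key — a typical unit-testable transform."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order] > latest[k][order]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[key])

def test_dedupe_latest():
    # Known input: user 1 has two versions, user 2 has one
    rows = [
        {"user_id": 1, "updated_at": "2024-01-01", "status": "old"},
        {"user_id": 1, "updated_at": "2024-01-02", "status": "new"},
        {"user_id": 2, "updated_at": "2024-01-01", "status": "only"},
    ]
    out = dedupe_latest(rows)
    assert [r["status"] for r in out] == ["new", "only"]
```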
Python interview questions for data engineers emphasize memory-efficient data processing, PySpark distributed computing, and production pipeline patterns — not LeetCode-style algorithms.
System design rounds are the highest-signal interview stage. They test your ability to reason about trade-offs in distributed systems — concepts directly from Kleppmann's DDIA.
Q21: Design a data pipeline that processes 10 million events per day
- Clarify requirements: What's the latency requirement? (Batch is fine for daily reporting; streaming needed for real-time dashboards.) What's the source? (API, Kafka, database CDC?) What are the downstream consumers?
- Propose architecture: Source → Ingestion → Storage → Transformation → Serving
- Make explicit trade-offs: batch vs streaming (10 million events/day is roughly 115 events/second — batch or micro-batch is usually sufficient unless consumers need real-time data), storage format (columnar Parquet for analytical reads), and orchestration (a scheduler such as Airflow or Dagster with retries and alerting)
Q22: Design a real-time data processing system
- Ingestion: Kafka (partitioned by key for ordering guarantees) or Kinesis
- Processing: Flink or Spark Structured Streaming (Flink for true event-time processing; Spark SS for teams already on Spark)
- State management: How to handle late-arriving data? Use event-time watermarks (Kleppmann Ch. 11)
- Exactly-once semantics: Idempotent writes + consumer offset management. Explain that "exactly-once" is really "effectively-once" — achieved through idempotent consumers, not through preventing duplicate delivery
- Serving: Write processed results to a low-latency store (Redis, DynamoDB) for real-time queries, and to the warehouse for historical analysis
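The "effectively-once" idea from the exactly-once bullet fits in a few lines: a toy sink (hypothetical, standing in for a keyed store like DynamoDB) where upserts by event ID absorb redelivery:

```python
class IdempotentSink:
    """Keyed upserts make redelivery harmless — 'effectively-once'."""

    def __init__(self):
        self.store = {}

    def write(self, event_id, payload):
        # Replaying the same event_id overwrites with identical data: a no-op
        self.store[event_id] = payload

sink = IdempotentSink()
# e1 is delivered twice (at-least-once transport), yet counted once
for event_id, payload in [("e1", 10), ("e2", 20), ("e1", 10)]:
    sink.write(event_id, payload)
assert len(sink.store) == 2
```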
Q23: Design a data warehouse for an e-commerce company
- Identify business processes: Orders, page views, inventory, customer interactions
- Design fact tables: `fact_orders` (grain: one row per order line item), `fact_page_views` (grain: one row per page view)
- Design dimension tables: `dim_customer` (SCD Type 2 for address changes), `dim_product` (Type 2 for price changes), `dim_date`, `dim_geography`
- Choose the grain explicitly — this is the most important Kimball concept: "What does one row represent?"
- Address performance: Partition fact tables by date, cluster by frequently filtered dimensions
Q24: How would you handle data quality in a production pipeline?
- Define expectations: Row count ranges, null percentage thresholds, value distributions, referential integrity
- Implement checks: Great Expectations framework, dbt tests (`not_null`, `unique`, `accepted_values`, custom SQL tests)
- Decide on failure strategy: Hard fail (stop pipeline) vs soft fail (alert + continue). Critical metrics hard-fail; informational metrics soft-fail
- Monitor trends: Track data quality metrics over time. A gradual increase in nulls may indicate an upstream problem before it becomes critical
- Data contracts: Define expectations with upstream teams. Schema registries + SLAs for data freshness and quality
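The hard-fail/soft-fail split can be sketched as a tiny check runner — a hypothetical mini-framework standing in for Great Expectations or dbt tests:

```python
def run_checks(rows, checks):
    """Each check is (name, severity, predicate); returns pass/fail per check."""
    return [(name, severity, predicate(rows)) for name, severity, predicate in checks]

rows = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": None}]
checks = [
    # Critical: an empty load should stop the pipeline
    ("row_count_min", "hard", lambda r: len(r) >= 1),
    # Informational: a rising null rate should alert, not halt
    ("email_null_pct", "soft",
     lambda r: sum(x["email"] is None for x in r) / len(r) <= 0.10),
]
results = run_checks(rows, checks)
hard_failures = [name for name, sev, ok in results if sev == "hard" and not ok]
# hard failures stop the pipeline; soft failures alert and continue
```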
Q25: Design a change data capture (CDC) pipeline
- Source: Use Debezium to capture change events from the database WAL (Write-Ahead Log) — this is non-intrusive, doesn't add load to the source database
- Transport: Stream change events to Kafka (one topic per table, or per database)
- Processing: Apply changes to target tables maintaining order per key. Handle schema evolution (new columns, type changes)
- Sink: Write to data warehouse using merge/upsert operations
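Applying ordered change events reduces to a keyed upsert/delete. A toy sketch (the op codes mirror Debezium's `c`/`u`/`d`; the dict stands in for the warehouse merge target):

```python
def apply_change(target, event):
    """Apply one ordered CDC event to the target table, keyed by primary key."""
    key = event["key"]
    if event["op"] == "d":          # delete
        target.pop(key, None)
    else:                           # create and update are both upserts — replay-safe
        target[key] = event["after"]
    return target

target = {}
events = [
    {"op": "c", "key": 1, "after": {"status": "new"}},
    {"op": "u", "key": 1, "after": {"status": "paid"}},
    {"op": "c", "key": 2, "after": {"status": "new"}},
    {"op": "d", "key": 2, "after": None},
]
for e in events:  # order per key must be preserved (Kafka partitioning by key)
    apply_change(target, e)
```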
Q26: How do you choose between batch and streaming processing?
| Factor | Choose Batch | Choose Streaming |
|---|---|---|
| Latency requirement | Hours are acceptable | Minutes or seconds needed |
| Data completeness | Need all data before processing | Can handle late/out-of-order data |
| Complexity | Simpler to implement and debug | More complex state management |
| Cost | Generally cheaper (process once) | Higher (continuous compute) |
| Use case | Daily reports, ML training, backfills | Real-time dashboards, alerting, fraud detection |
| Team experience | Most DE teams start here | Requires streaming expertise |
Q27: What happens when your pipeline fails at 3 AM?
- Detection: Monitoring + alerting (pipeline SLAs, data freshness checks, failure notifications via PagerDuty/Opsgenie)
- Triage: Is this a data issue (bad upstream data) or infrastructure issue (cluster OOM, network timeout)? Check logs, check upstream status
- Recovery: Idempotent pipelines are critical — you should be able to rerun any failed stage without duplicating data. This means: (a) use `INSERT OVERWRITE`, not `INSERT`, (b) write to partitions atomically, (c) maintain processing checkpoints
- Prevention: After resolution, add the failure mode to monitoring and create a runbook. The best on-call engineers reduce future incidents, not just resolve current ones
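Idempotent recovery in miniature: overwriting a whole partition (a hypothetical dict-backed "warehouse") makes a 3 AM rerun safe, because a retry replaces rather than appends:

```python
def load_partition(table, partition, rows):
    """Overwrite the entire partition so reruns can never duplicate data."""
    table[partition] = list(rows)  # atomic swap of the partition's contents
    return table

warehouse = {}
load_partition(warehouse, "2024-06-01", [{"id": 1}, {"id": 2}])
load_partition(warehouse, "2024-06-01", [{"id": 1}, {"id": 2}])  # rerun after failure
assert len(warehouse["2024-06-01"]) == 2  # no duplicates
```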
Q28: Design a data mesh for a large organization
Anchor your answer in the four data mesh principles: (1) domain ownership — each business domain owns and serves its own data; (2) data as a product — datasets have owners, SLAs, documentation, and quality guarantees; (3) self-serve data platform — a central platform team provides shared infrastructure so domains don't rebuild pipelines from scratch; (4) federated computational governance — global standards (security, interoperability, schemas) enforced through automation. Discuss the organizational trade-off: mesh reduces central-team bottlenecks but requires mature domain teams.
System design answers should follow a pattern: clarify requirements → propose architecture → make explicit trade-offs → discuss failure modes. Reference Kleppmann's DDIA principles (reliability, scalability, maintainability) to show depth.
These questions test whether you've built real infrastructure or just studied for certifications.
Q29: Compare AWS Glue, EMR, and Athena — when would you use each?
- Glue: Managed ETL service. Use for scheduled transformations when you want serverless (no cluster management). Good for: small-to-medium jobs, catalog/schema management, crawlers.
- EMR: Managed Spark/Hadoop clusters. Use for heavy processing (TBs of data), complex Spark jobs, ML workloads. More control, more operational overhead.
- Athena: Serverless SQL on S3 data. Use for ad-hoc queries and exploratory analysis — pay per query. Not for repeated ETL transformations (cost adds up).
Q30: What is the difference between a data lake and a data warehouse?
- Data lake: Stores raw data in any format (Parquet, JSON, CSV, Avro) on cheap object storage (S3, GCS, ADLS). Schema-on-read. Flexible but requires discipline to avoid becoming a "data swamp."
- Data warehouse: Stores structured, modeled data optimized for queries (Snowflake, BigQuery, Redshift). Schema-on-write. Performant for BI and analytics.
- Data lakehouse: Combines both — open formats on object storage with warehouse-like query performance (Delta Lake, Iceberg, Hudi). This is where the industry is converging.
Q31: How do you manage costs in a cloud data platform?
Q32: Explain Parquet vs Avro vs ORC
- Parquet: Columnar format. Best for analytical queries that read specific columns. Default choice for data lakes and warehouses.
- Avro: Row-based format with rich schema evolution support. Best for streaming/Kafka messages where you read entire records and schemas change frequently.
- ORC: Columnar format optimized for Hive. Similar to Parquet but less widely supported outside the Hadoop ecosystem.
Behavioral questions test whether you can communicate about technical work. The STAR method (Situation, Task, Action, Result) provides a reliable structure.
**Situation:** "In [role] at [company], we had [specific problem]..."
**Task:** "I was responsible for [your specific responsibility]..."
**Action:** "I [specific technical actions you took]..."
- What tools/technologies did you use?
- What trade-offs did you consider?
- Who did you collaborate with?
**Result:** "As a result, [quantifiable outcome]..."
- Pipeline processing time reduced from X to Y
- Data freshness improved from hours to minutes
- Saved X dollars/month in cloud costs
- Reduced on-call incidents by X%
Q33: Tell me about a time a pipeline failed in production
Structure this with STAR and cover detection (how you found out), diagnosis (root cause, not symptoms), resolution (including how you communicated with stakeholders), and prevention (monitoring or process changes you added afterward). Interviewers are listening for ownership and post-incident learning, not a flawless record.
Q34: How do you handle conflicting data requirements from different teams?
Example answer: "Two teams each needed a different definition of an 'active customer.' Rather than forcing one definition, we modeled both as explicitly named fields (`is_active_marketing`, `is_active_finance`). We added a data dictionary entry explaining the distinction. This resolved the immediate conflict and prevented future confusion."

Q35: Describe a technical decision you made that you'd change in hindsight
Pick a real decision with a real cost (for example, a tool choice that didn't scale), explain what you knew at the time, and name the signal you'd watch for earlier next time. Self-awareness beats defending the original choice.
Q36: How do you prioritize when you have multiple urgent requests?
Describe a concrete framework: assess business impact and urgency, make trade-offs visible to stakeholders ("if I take X first, Y slips to Thursday"), and negotiate rather than silently absorb everything. Interviewers want to see that you surface priorities, not that you work nights.
Q37: Tell me about a time you improved a system's performance significantly
Quantify it: what was slow, how you diagnosed it (execution plans, Spark UI, profiling), what you changed (partitioning, join strategy, file format), and the before/after numbers. A hypothetical example of the right shape: "reduced a 6-hour job to 40 minutes by fixing a skewed join."
Every behavioral answer needs three things: a specific situation (not hypothetical), the actions YOU took (not the team), and a quantifiable result. Prepare 5 stories that cover: production failure, stakeholder conflict, performance optimization, technical decision, and learning from a mistake.
- Jumping straight into coding without clarifying requirements — in system design, the clarification IS the test
- Giving textbook answers without production context — 'I would use Kafka' means nothing without explaining WHY for this specific use case
- Ignoring failure modes — every system design answer should address 'what happens when X fails?'
- Over-engineering solutions — proposing Kafka + Flink + Kubernetes for a batch pipeline processing 10K rows/day
- Not knowing your own resume — if you listed Spark, be ready to explain partition strategies, shuffle operations, and memory tuning
- Skipping the STAR structure in behavioral questions — rambling stories without clear situation/action/result lose the interviewer
- Not asking questions at the end — this signals low engagement. Always have 3 prepared questions about the data stack, team structure, and biggest data challenges
1. Data engineer interviews test five areas: SQL, Python, system design, cloud platforms, and behavioral — prepare across all five
2. SQL is the elimination round — practice window functions, CTEs, and query optimization daily
3. System design determines your level — structure answers as: clarify → design → trade-offs → failure modes
4. Ground your answers in DDIA (distributed systems) and Kimball (data modeling) — interviewers recognize these frameworks
5. Every behavioral answer needs the STAR structure with quantifiable results
6. Allow 2–4 weeks of focused preparation; use the 2-week plan as a starting framework
Do I need to know algorithms and data structures for a data engineer interview?
Less than software engineering interviews, but you should know the basics: hash maps (used in joins), trees (used in indexing), and time/space complexity. The focus is on data manipulation and system design, not LeetCode-style algorithm challenges. Some companies (Meta, Google) do include a coding round with medium-difficulty problems.
Which SQL dialect should I practice?
Standard ANSI SQL covers 90% of interview questions. The differences between PostgreSQL, Snowflake SQL, and BigQuery SQL are minor for interview purposes. If the company uses a specific warehouse, review its syntax quirks (BigQuery uses backticks for table names, Snowflake uses QUALIFY). Practice on any platform — the concepts transfer.
How important is knowing the company's specific tech stack?
Moderately important. You should know what tools they use (check the job description and tech blog) and be able to discuss trade-offs between their choices and alternatives. You don't need deep expertise in every tool, but saying 'I see you use Airflow for orchestration — I've used both Airflow and Dagster, and here's how I'd compare them' shows thoughtfulness.
What if I don't have production data engineering experience?
Focus on projects that demonstrate production thinking: error handling, idempotent processing, monitoring, and documentation. Build 2–3 portfolio projects that show end-to-end pipeline design. In interviews, be honest about your experience level but show that you understand production concerns even if your projects were personal. See our Data Engineer Projects guide for ideas.
How do Meta and Google data engineer interviews differ?
Meta emphasizes SQL heavily (2 SQL rounds are common) and focuses on data modeling at scale. Google emphasizes system design and coding (Python/Java) with more focus on distributed systems concepts. Amazon focuses on behavioral (Leadership Principles) alongside technical. All three test the same fundamentals — the weighting differs.
Should I mention books like DDIA in my interview answers?
Yes, when natural. Saying 'Kleppmann's concept of exactly-once semantics is really effectively-once achieved through idempotent consumers' shows deep understanding beyond surface-level tool knowledge. Don't force references, but naturally citing foundational concepts signals seniority.
Prepared by Careery Team
Researching Job Market & Building AI Tools for careerists · since December 2020
1. Designing Data-Intensive Applications — Martin Kleppmann (2017)
2. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling — Ralph Kimball, Margy Ross (2013)
3. Fundamentals of Data Engineering — Joe Reis, Matt Housley (2022)
4. Data Pipelines Pocket Reference — James Densmore (2021)
5. Research: Quantifying GitHub Copilot's impact on developer productivity — GitHub (2022)