S&OP Demand Forecasting at Scale: How Sanjeev Dhanush Challapalli Rebuilt the Forecast for 700+ SKUs at Thermo Fisher


May 12, 2026

Expert Insight by

Sanjeev Dhanush Challapalli

Supply Chain Analyst

Supply Chain / Demand Planning / S&OP

Sanjeev is a Supply Chain Analyst working across semiconductor, biotech, and pharma. At Thermo Fisher Scientific he supported S&OP demand planning for 700+ SKUs across North America, rebuilt the forecasting process using time-series modeling (ARIMA, moving averages) to close the gap between demand signals and supply decisions, and surfaced $180K in excess and slow-moving inventory through SKU-level consumption and ageing analysis. At ASM International he manages 70-140 weekly spares orders at 95% consignment reconciliation accuracy and recalibrated SAP S/4HANA inventory parameters to cut expedite dependency 20% and late deliveries 12%. At Vestas Pharmaceuticals he built ARIMA-based forecasting models that fed cost budgeting, and cut end-to-end procurement cycle time 20% via Value Stream Mapping. Tools: SAP S/4HANA, Python, SQL, Power BI, Tableau. Six Sigma Green Belt; CSCMP Demand Forecasting; APICS CPIM in progress. MS in Business Analytics, Northeastern University; BTech in Mechanical Engineering, SRM University.

Quick Answers (TL;DR)

What is S&OP demand forecasting?

S&OP demand forecasting is the cross-functional process of producing a single, reconciled demand number for every SKU at a defined horizon (typically 12-18 months, monthly buckets) that drives the supply, inventory, and financial plans. The forecast is produced statistically, adjusted by sales/marketing intelligence, reconciled across hierarchy levels (SKU, family, region, total), and signed off in a monthly S&OP consensus cycle.

Which forecasting method should I use for my SKUs?

Segment first, model second. Use ADI (average demand interval) and CV² (coefficient of variation squared) to classify each SKU into smooth, intermittent, lumpy, or erratic. Smooth SKUs win with exponential smoothing (ETS) or ARIMA. Intermittent/lumpy SKUs need Croston or its TSB variant. Erratic SKUs benefit from Prophet or regression-based models with external drivers. Running one model across the whole portfolio is the most common reason an S&OP forecast underperforms a hand-tuned spreadsheet.

How do you measure forecast accuracy in S&OP?

Track three metrics together, never one alone: WMAPE (weighted mean absolute percentage error, weighted by SKU value or volume — more meaningful than MAPE on a heterogeneous portfolio), bias (the signed forecast error, to detect systematic over- or under-forecasting), and forecast value-add (whether judgmental overrides actually improved or worsened the statistical baseline). MAPE alone is misleading because it weights every SKU equally, so a 50% miss on a $10 SKU dominates a 5% miss on a $100K SKU.

How long does it take to rebuild a broken S&OP forecast?

On a 700+ SKU portfolio, expect 8-12 weeks for the rebuild itself plus one full S&OP cycle (4-5 weeks) of parallel running before retiring the old forecast. The compressed timeline at Thermo Fisher worked because the rebuild was sequenced — data audit, segmentation, method matching, accuracy benchmarking, consensus integration — each phase gated on the previous one passing. Skipping the data audit and jumping to model selection is the most common reason rebuilds drift to 6+ months.

Most "demand forecasting at scale" content on the internet stops at three sentences about ARIMA and a screenshot of a Power BI line chart. The reality at S&OP scale — 700+ SKUs across North America, multi-tier inventory, monthly consensus cycles where finance, sales, ops, and supply all need the same number — is that the forecasting model is the easy half. The hard half is which model for which SKU, on which data, with which accuracy metric, reviewed in which cadence. Get any one of those wrong and the forecast is technically running but operationally useless.

The Thermo Fisher rebuild closed the gap between demand signals and supply decisions because every layer below the model — the segmentation, the data audit, the accuracy framework, the consensus cycle — was rebuilt with it. The model itself was the last decision, not the first.

Why Demand Forecasting Breaks at S&OP Scale

S&OP Demand Forecasting

S&OP demand forecasting is the cross-functional process of producing a single, reconciled demand number for every SKU at a defined horizon (typically 12-18 months in monthly buckets) that drives the supply, inventory, and financial plans. The forecast is produced statistically, adjusted with sales and marketing intelligence, reconciled across hierarchy levels (SKU, product family, region, total), and signed off in a monthly S&OP consensus cycle so that supply, finance, and commercial functions all act on the same number.

A working forecast at 50 SKUs is one engineer with a spreadsheet and a good demand signal. A working forecast at 700+ SKUs is a system. The shift is not a question of more compute — it is a question of structure. Four things silently degrade S&OP forecast quality once a portfolio grows past the spreadsheet limit:

  • Demand-pattern heterogeneity. A portfolio of 700 SKUs is rarely 700 instances of the same demand pattern. Some SKUs ship every day. Some ship once a quarter in batches of 100. Some have promotional spikes. A single statistical model averaged across all of them lands somewhere between mediocre and actively wrong on the SKUs that matter most.
  • Data-quality drift. SKU masters accumulate dirty entries — discontinued items still flagged active, new SKUs without enough history, hierarchy misclassifications. A statistical model trained on the dirty data inherits the noise and forecasts confidently around it.
  • Process drift. The consensus cycle that worked at 50 SKUs ("everyone reviews every line") collapses past 200. Without exception-based review, sales and finance silently stop participating, and the forecast becomes a stat-team output that ops doesn't trust.
  • Metric drift. A single accuracy metric — usually MAPE — is reported because it sounds rigorous. On a heterogeneous portfolio, MAPE rewards the wrong things: a 50% error on a $10 part counts the same as a 5% error on a $100K part.

By the numbers:

  • 700+ SKUs under S&OP demand planning across North America (Thermo Fisher Scientific)
  • $180K in excess and slow-moving inventory surfaced via SKU-level consumption and ageing analysis (Thermo Fisher Scientific)
  • Quarter-over-quarter shortage-frequency reduction after the forecast rebuild (Thermo Fisher Scientific)
  • 12-18 months: the standard S&OP forecast horizon (monthly buckets)

The Thermo Fisher rebuild treated all four of these failure modes as gates the new forecast had to clear, not as nice-to-haves. The data audit closed the data-quality gap before any model ran. The segmentation work prevented one-model-fits-all. The accuracy framework reported WMAPE and bias side by side. And the consensus cadence was redesigned around exception-based review so finance and sales actually participated.

Key Takeaway

S&OP demand forecasting at scale is not a modeling problem — it is a system problem. The four silent failure modes (demand-pattern heterogeneity, data-quality drift, process drift, metric drift) all degrade forecast quality independently of how good the statistical model is. A rebuild that addresses only the model and ignores the other three layers ships a forecast that runs but is not trusted.

The Demand Pattern Problem: One Model Can't Fit 700 SKUs


The single most common reason an S&OP forecast underperforms a hand-tuned spreadsheet is that the same model is applied to every SKU. ARIMA on the whole portfolio. Or Prophet on the whole portfolio. Or worse, simple moving average on the whole portfolio. Each of these is the right answer for some SKUs and the wrong answer for the others.

Demand patterns split into four canonical shapes, and each shape favors a different family of methods. The classification originates in the operations-research literature (Syntetos and Boylan, 2005) and has held up in production because it captures the two dimensions that actually matter: how often demand occurs (intermittence) and how variable it is when it does occur (variability).

  • Smooth: demand occurs in most periods and variability is low. Typical SKUs: high-volume, fast-moving consumables. Methods that work: moving average, ETS (exponential smoothing), ARIMA.
  • Intermittent: demand is absent in many periods, but quantities are stable when demand does occur. Typical SKUs: spare parts, low-velocity but predictable items. Methods that work: Croston's method, TSB (Teunter-Syntetos-Babai) variant.
  • Erratic: demand occurs in most periods but with high variability. Typical SKUs: promotional or seasonally driven items, externally influenced demand. Methods that work: Prophet, regression-based models with external drivers, ARIMAX.
  • Lumpy: demand is both intermittent and highly variable when it occurs. Typical SKUs: project-driven items, capital spares, large-batch items. Methods that work: Croston with bias correction, judgmental overlay, bootstrap simulation.

The mistake is not running ARIMA — ARIMA is excellent on smooth, autocorrelated demand. The mistake is running ARIMA on a lumpy SKU, where the model fits noise as if it were signal and produces a confident-looking forecast that misses every spike.

Demand Pattern (Syntetos-Boylan Classification)

A demand pattern is a categorical classification of a SKU's demand-time series along two dimensions: ADI (average demand interval — periods between non-zero demand) and CV² (squared coefficient of variation of non-zero demand). The four canonical patterns — smooth (low ADI, low CV²), intermittent (high ADI, low CV²), erratic (low ADI, high CV²), and lumpy (high ADI, high CV²) — predict which family of forecasting methods will outperform on that SKU.

Why one-model-fits-all keeps shipping anyway

The reason monolithic forecasts persist despite their poor performance is operational: a single model is easy to explain, easy to audit, easy to run in batch. Segmenting the portfolio means maintaining multiple model pipelines, multiple parameter sets, and a routing layer that decides which method runs for which SKU. The complexity is real — but so is the gap between a 30% WMAPE on the whole portfolio and a 12% WMAPE on the same portfolio with segmented methods.

Key Takeaway

No single forecasting method wins on a heterogeneous 700+ SKU portfolio. Smooth demand favors ETS and ARIMA; intermittent favors Croston; erratic favors Prophet or regression with external drivers; lumpy needs judgment plus simulation. Segmentation is what turns the model selection from a guess into a decision, and it is the single highest-leverage change in any forecast rebuild.

The ADI / CV² Segmentation Matrix


The four-pattern classification is operational only when it can be computed from the demand history without judgment calls. Two metrics are enough:

  • ADI (Average Demand Interval) — the average number of periods between non-zero demand. A SKU that ships every period has ADI = 1. A SKU that ships once every 4 months on a monthly bucket has ADI ≈ 4. The threshold separating "smooth/erratic" from "intermittent/lumpy" is conventionally set at ADI = 1.32, derived empirically by Syntetos and Boylan as the inflection point where Croston's method starts beating exponential smoothing.
  • CV² (Squared Coefficient of Variation) — the variance of non-zero demand divided by its squared mean. A SKU that ships exactly 100 units every time it ships has CV² = 0. A SKU whose non-zero quantities range wildly has CV² > 0.49 by convention.
ADI / CV² Segmentation

ADI / CV² segmentation is a two-dimensional classification scheme that routes every SKU in a portfolio to one of four demand-pattern categories — smooth, intermittent, erratic, or lumpy — based on the average interval between non-zero demand (ADI) and the squared coefficient of variation of non-zero demand (CV²). Conventional thresholds (ADI = 1.32, CV² = 0.49) come from the Syntetos-Boylan operations-research literature. Each category routes to a different family of forecasting methods.
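
The classification is simple enough to compute directly. Below is a minimal Python sketch, assuming monthly-bucketed demand in a pandas DataFrame with illustrative column names (sku, month, qty); the thresholds are the conventional Syntetos-Boylan values and can be tuned.

```python
import pandas as pd

def classify_sku(demand: pd.Series, adi_cut: float = 1.32, cv2_cut: float = 0.49) -> str:
    """Syntetos-Boylan quadrant for one SKU's bucketed demand history."""
    nonzero = demand[demand > 0]
    if nonzero.empty:
        return "no-demand"
    adi = len(demand) / len(nonzero)                    # avg periods per non-zero demand event
    cv2 = (nonzero.std(ddof=0) / nonzero.mean()) ** 2   # squared coefficient of variation
    if adi <= adi_cut:
        return "smooth" if cv2 <= cv2_cut else "erratic"
    return "intermittent" if cv2 <= cv2_cut else "lumpy"

def route_portfolio(history: pd.DataFrame) -> pd.Series:
    """history: long-format demand with columns sku, month, qty (illustrative names)."""
    wide = history.pivot_table(index="month", columns="sku", values="qty",
                               aggfunc="sum", fill_value=0)
    return wide.apply(classify_sku)                      # one pattern label per SKU
```

One detail matters: periods with no demand must appear as zeros in the series (hence fill_value=0). If they are silently dropped, ADI is understated and intermittent SKUs get misclassified as smooth.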

The four-quadrant matrix:

  • Smooth: ADI ≤ 1.32 (frequent demand), CV² ≤ 0.49 (low variability). Routes to: ETS, ARIMA, simple seasonal models.
  • Erratic: ADI ≤ 1.32 (frequent demand), CV² > 0.49 (high variability). Routes to: Prophet, regression with drivers, ARIMAX.
  • Intermittent: ADI > 1.32 (sporadic demand), CV² ≤ 0.49 (stable when occurring). Routes to: Croston, TSB (Teunter-Syntetos-Babai).
  • Lumpy: ADI > 1.32 (sporadic demand), CV² > 0.49 (variable when occurring). Routes to: Croston with bias correction plus judgmental overlay and bootstrap simulation.

In practice, the layered classification used at S&OP scale combines this demand-pattern matrix with ABC (Pareto by inventory value) and XYZ (variability of demand quantity, often computed differently from CV²). The result is a richer 3D classification — a high-value, smooth, A-class SKU is treated very differently from a low-value, lumpy, C-class SKU even when both have the same ADI/CV² coordinates.

  • A-class, X (stable), smooth: auto-forecast with ETS/ARIMA, low-touch monthly review. Review cadence: monthly, exceptions only.
  • A-class, Z (volatile), erratic: Prophet or regression with drivers, full S&OP consensus review. Review cadence: monthly plus a weekly hot-list.
  • B-class, intermittent: Croston/TSB auto-route, sample-based audit. Review cadence: quarterly methodology audit.
  • C-class, lumpy: Croston with bias correction; consider buy-to-order or vendor-managed. Review cadence: quarterly review only; flag for portfolio rationalization.

The classification refresh cadence matters

Demand patterns are not static. A SKU that was smooth a year ago can become lumpy after a customer change, a product transition, or a market shift. Re-classifying the portfolio on a fixed cadence (quarterly is typical for S&OP) and routing SKUs to the appropriate method automatically prevents the slow drift where 100 SKUs are silently being forecasted with the wrong method because their pattern shifted but the routing did not.

Key Takeaway

ADI and CV² turn demand-pattern classification from a judgment call into a deterministic, reproducible step. Layered with ABC (value) and XYZ (variability), the classification routes each SKU to the right method, the right review cadence, and the right level of human attention — which is what makes a 700+ SKU portfolio actually plannable instead of theoretically modeled.

Method Matching: Moving Average, ETS, ARIMA, Prophet, Croston


Once SKUs are segmented, method selection collapses to a small, defensible set of choices. The five families that handle 95%+ of S&OP demand are:

Moving Average (and Naive Baselines)

Best for: short-history SKUs, baseline benchmarking, or stable demand with no trend or seasonality. The moving-average forecast for the next period is the average of the last N actuals. The naive baseline (next period equals the last period) is even simpler and is the default benchmark every other method must beat.

The baseline is not a placeholder — it is a critical accuracy check. If ARIMA does not beat naive on a SKU, the model is doing nothing useful and should be replaced with naive.
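
As a concrete illustration of that gate, a short NumPy sketch (illustrative function names) computes the naive and moving-average baselines and the WMAPE check a candidate model has to clear on a holdout window:

```python
import numpy as np

def naive_forecast(history: np.ndarray, horizon: int) -> np.ndarray:
    """Next period(s) = last observed actual."""
    return np.repeat(history[-1], horizon)

def moving_average_forecast(history: np.ndarray, horizon: int, window: int = 3) -> np.ndarray:
    """Next period(s) = mean of the last `window` actuals."""
    return np.repeat(history[-window:].mean(), horizon)

def wmape(actual: np.ndarray, forecast: np.ndarray) -> float:
    return np.abs(actual - forecast).sum() / np.abs(actual).sum()

def beats_naive(history: np.ndarray, candidate: np.ndarray, holdout: int = 6) -> bool:
    """candidate: the candidate model's forecast for the last `holdout` periods."""
    train, test = history[:-holdout], history[-holdout:]
    return wmape(test, candidate) < wmape(test, naive_forecast(train, holdout))
```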

ETS (Exponential Smoothing State Space)

Best for: smooth demand with optional trend or seasonality. The Hyndman-Khandakar ETS framework automatically selects the right combination of error (additive/multiplicative), trend (none/additive/damped), and seasonality (none/additive/multiplicative) by minimizing AIC. It handles trend and seasonal patterns without manual configuration and runs fast on large portfolios.

ETS is often the right default for smooth, A-class SKUs when the rebuild needs a high-quality auto-routing layer rather than per-SKU hand-tuning.
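
A minimal fitting sketch using statsmodels' Holt-Winters implementation. It fixes one additive configuration rather than searching over error/trend/season forms, so treat it as a starting point; automatic selection is available in packages such as statsforecast's AutoETS.

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def ets_forecast(y: pd.Series, horizon: int = 6) -> pd.Series:
    """y: monthly demand for one smooth SKU, indexed by month (DatetimeIndex)."""
    fit = ExponentialSmoothing(
        y,
        trend="add",                 # additive trend; set damped_trend=True to damp it
        seasonal="add",              # additive 12-month seasonality
        seasonal_periods=12,
        initialization_method="estimated",
    ).fit(optimized=True)
    return fit.forecast(horizon)
```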

ARIMA (Auto-Regressive Integrated Moving Average)

Best for: smooth demand with autocorrelation, especially where lagged values predict next values. Auto-ARIMA (the auto.arima algorithm in Hyndman's forecast R package, or pmdarima in Python) automatically searches over (p, d, q) and seasonal (P, D, Q) terms.

ARIMA's strength is also its limitation: it assumes the demand-generating process is stationary or differenceable to stationary. On erratic or lumpy demand it overfits the historical noise and forecasts confidently into the wrong shape.
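
A sketch of the automated search using pmdarima (the Python port of auto.arima); the parameters shown are sensible defaults for a monthly series, not tuned values.

```python
import pmdarima as pm

def auto_arima_forecast(y, horizon: int = 6):
    """y: monthly demand for one smooth, autocorrelated SKU (array-like)."""
    model = pm.auto_arima(
        y,
        seasonal=True, m=12,         # search seasonal (P, D, Q) terms at a 12-month period
        stepwise=True,               # stepwise search over (p, d, q) orders
        suppress_warnings=True,
        error_action="ignore",
    )
    return model.predict(n_periods=horizon)
```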

Prophet (Meta's Decomposable Additive Model)

Best for: erratic or seasonal demand, especially where holidays, promotions, and external drivers matter. Prophet decomposes the time series into trend, seasonality, holidays, and external regressors. It tolerates missing data and outliers better than ARIMA.

Prophet's strength is also its weakness: the additive decomposition makes it intuitive and explainable, but on intermittent demand the additive trend term can drift in unrealistic directions. Prophet is the wrong tool for low-velocity spare parts; it is often the right tool for promotional consumer goods or capital-equipment demand with seasonality.
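
A sketch with an external driver wired in as a regressor. The promo column is an illustrative stand-in for whatever driver the business actually tracks, and its future values have to be supplied by the planner.

```python
import pandas as pd
from prophet import Prophet

def prophet_forecast(df: pd.DataFrame, future_drivers: pd.DataFrame, horizon: int = 6) -> pd.DataFrame:
    """df: columns ds (month start), y (demand), promo (driver flag).
    future_drivers: ds + promo for the forecast horizon. Column names are illustrative."""
    m = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
    m.add_regressor("promo")
    m.fit(df)
    drivers = pd.concat([df[["ds", "promo"]], future_drivers], ignore_index=True)
    future = m.make_future_dataframe(periods=horizon, freq="MS")
    future = future.merge(drivers, on="ds", how="left").fillna({"promo": 0})
    return m.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```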

Croston's Method (and TSB)

Best for: intermittent and lumpy demand. Croston decomposes the series into two separate components — non-zero demand size and the interval between non-zero demand events — and applies exponential smoothing to each. The TSB (Teunter-Syntetos-Babai) variant corrects a known bias in classical Croston when the demand interval is changing.

Croston is the only one of these methods that explicitly models the "many zeros" pattern of spare-parts and slow-mover demand. Running ARIMA or ETS on intermittent demand routinely produces forecasts of "0.3 units per month" — operationally meaningless. Croston produces an expected demand rate, which is what inventory parameter calculations actually need.
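
Both methods are simple enough to implement directly. A sketch follows; the initializations are pragmatic choices rather than canon, and both functions return a demand rate per period rather than a point forecast.

```python
import numpy as np

def croston(demand: np.ndarray, alpha: float = 0.1) -> float:
    """Classical Croston: smooth non-zero demand size and inter-demand interval separately."""
    nz = np.flatnonzero(demand)
    if nz.size == 0:
        return 0.0
    size, interval, prev = float(demand[nz[0]]), float(nz[0] + 1), nz[0]
    for t in nz[1:]:
        size = alpha * demand[t] + (1 - alpha) * size
        interval = alpha * (t - prev) + (1 - alpha) * interval
        prev = t
    return size / interval                     # expected demand per period

def tsb(demand: np.ndarray, alpha: float = 0.1, beta: float = 0.1) -> float:
    """TSB: smooths demand probability every period, correcting Croston's bias
    when the inter-demand interval is drifting."""
    nonzero = demand[demand > 0]
    if nonzero.size == 0:
        return 0.0
    prob = float(demand[0] > 0)
    size = float(nonzero[0])                   # pragmatic initialization: first non-zero demand
    for x in demand[1:]:
        prob = prob + beta * ((1.0 if x > 0 else 0.0) - prob)
        if x > 0:
            size = size + alpha * (x - size)
    return prob * size                         # expected demand per period
```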

  • Naive / Moving Average. Best for: baseline benchmarking and short-history SKUs. Wins when: demand has no trend, seasonality, or autocorrelation worth modeling. Loses when: treated as more than a baseline; it is the floor, not the winner.
  • ETS. Best for: smooth demand, A-class SKUs, the default auto-route. Wins when: trend or seasonal patterns are present and consistent. Loses when: demand is intermittent or driven by external factors.
  • ARIMA (auto). Best for: smooth, autocorrelated demand. Wins when: past values strongly predict next values. Loses when: demand is lumpy; it fits noise as if it were signal.
  • Prophet. Best for: erratic, promotional, or holiday-driven demand. Wins when: external drivers, seasonality, and changepoints matter. Loses when: demand is intermittent and low-velocity; the trend over-smooths.
  • Croston / TSB. Best for: intermittent and lumpy demand. Wins when: the demand series has many zeros. Loses when: demand becomes smooth; switch to ETS.

Forecast value-add: the override sanity check

Statistical methods are one layer; judgmental overrides from sales and marketing are another. Forecast Value-Add (FVA) is the discipline of measuring whether each override actually improved forecast accuracy or made it worse. A surprising fraction of overrides — often 30-50% in unaudited environments — degrade accuracy. Tracking FVA per contributor surfaces this and turns the consensus cycle from a debate into a data-driven review.

Key Takeaway

Method matching is not "pick the most sophisticated model." It is routing each SKU to the family of methods that fits its demand pattern, then benchmarking every choice against a naive baseline. Naive, ETS, ARIMA, Prophet, and Croston cover ~95% of S&OP demand when paired with the right segmentation; the remaining 5% lives in judgmental overlays that must be measured (FVA) to ensure they help rather than hurt.

The 7-Step Forecast Rebuild Playbook


The compressed timeline at Thermo Fisher worked because the rebuild was sequenced. Each phase gated on the previous one passing, which prevented the most common rebuild failure mode: jumping straight to model selection on dirty data.

Step 01

Audit the demand history before touching a model

Pull 24-36 months of demand history per SKU. Flag SKUs with less than 12 months of history (separate cold-start treatment). Identify obvious data anomalies: stockout-suppressed demand (where actual demand was capped by supply), one-time customer events that should not be repeated in the forecast, returns that were booked as negative demand. Clean before you classify; classify before you model. Skipping this step is the single most common reason forecast rebuilds drift past 6 months.

Step 02

Classify every SKU by demand pattern

Compute ADI and CV² for each SKU on the cleaned history. Apply the Syntetos-Boylan thresholds (ADI 1.32, CV² 0.49) to assign each SKU to smooth, intermittent, erratic, or lumpy. Overlay ABC (by inventory value) and XYZ (by quantity variability). The output is a portfolio-level routing table: every SKU has an assigned method family before any model is configured.

Step 03

Build the baseline forecast first

Run naive and simple moving average across the entire portfolio. Record WMAPE and bias per SKU and per segment. Every subsequent method must beat this baseline on the SKUs it claims to fit. If ETS does not beat naive on smooth A-class SKUs, ETS is not the answer for that segment — go investigate before tuning hyperparameters.

Step 04

Run the matched method per segment, benchmark, and route

Run ETS / ARIMA on smooth segments, Prophet on erratic, Croston/TSB on intermittent and lumpy. For each SKU, compute WMAPE, bias, and the lift over the baseline. Route each SKU to whichever method shows the highest skill — sometimes the matched-family method wins, sometimes a simpler method wins for SKUs near the segmentation boundary. The routing is data-driven, not rule-driven.
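
A routing sketch under stated assumptions: candidates is a hypothetical dict mapping method names to callables that take a training array and a horizon and return a forecast array (the naive, ETS, ARIMA, and Croston sketches above fit this shape with thin wrappers), and the "naive" entry must always be present so no SKU can route to a method that loses to the baseline.

```python
import numpy as np

def wmape(actual, forecast) -> float:
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.abs(actual - forecast).sum() / np.abs(actual).sum()

def route_sku(history: np.ndarray, candidates: dict, holdout: int = 6):
    """Score every candidate on a holdout window; route to the best, never below naive."""
    train, test = history[:-holdout], history[-holdout:]
    scores = {name: wmape(test, fn(train, holdout)) for name, fn in candidates.items()}
    best = min(scores, key=scores.get)
    chosen = best if scores[best] <= scores["naive"] else "naive"
    return chosen, scores
```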

Step 05

Layer judgmental overrides with FVA tracking

Sales, marketing, and category-management overrides go on top of the statistical baseline as a separate layer — never replacing it. Track Forecast Value-Add per contributor: did their override improve or degrade accuracy? Surface this monthly. Overrides that consistently degrade accuracy are not approved; overrides that consistently improve it are honored. This single discipline prevents the consensus cycle from devolving into political negotiation.

Step 06

Reconcile the forecast across the hierarchy

Forecasts at SKU level rarely sum cleanly to forecasts at family or region level — and the family-level forecast is usually more accurate (aggregation reduces noise). Use hierarchical reconciliation (top-down, bottom-up, or middle-out depending on the portfolio) so the SKU-level forecast respects the more reliable family-level signal. The MinT (minimum trace) reconciliation method from the Hyndman literature is a defensible default when no business reason favors a specific direction.
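
MinT needs the forecast-error covariance structure and is usually run through a library (Nixtla's hierarchicalforecast is one option). The simplest defensible version, sketched below, is proportional top-down: trust the less noisy family-level forecast for the total and split it across SKUs by historical share.

```python
import pandas as pd

def top_down_reconcile(family_forecast: float, sku_history: pd.DataFrame) -> pd.Series:
    """sku_history: recent monthly actuals, one column per SKU in the family.
    Returns SKU-level forecasts that sum exactly to the family-level forecast."""
    shares = sku_history.sum() / sku_history.values.sum()   # each SKU's share of family demand
    return shares * family_forecast
```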

Step 07

Run parallel for one full S&OP cycle before retiring the old forecast

Run the rebuilt forecast alongside the legacy forecast for one complete S&OP cycle (typically 4-5 weeks). Compare WMAPE, bias, and operational outcomes (shortages, expedites, dead stock). Only retire the legacy forecast after the rebuild has demonstrably outperformed on the metrics that matter. The parallel-run period is also when sales and finance build trust in the new number — without it, the rebuild is technically live but operationally rejected.

The audit-first discipline

The single line that separates a rebuild that lands in 8-12 weeks from one that drags to 6+ months is whether the data audit happens first. Modeling on dirty data produces confident-looking forecasts that fail the moment they hit production. The audit is unglamorous, takes 1-2 weeks, and is the highest-leverage activity in the entire rebuild.

Key Takeaway

The 7-step playbook gates each phase on the previous one passing. Audit before classify, classify before model, baseline before benchmark, statistical before judgmental, SKU-level before hierarchical, parallel-run before retire. Skipping a phase compresses the schedule on paper but extends it in reality, because the work that was skipped surfaces later as a forecast nobody trusts.

Forecast Accuracy: MAPE, WMAPE, Bias, and What Actually Matters


The accuracy framework is where good rebuilds quietly fail. Reporting a single number — usually MAPE — is rigorous-sounding and operationally misleading. The discipline that holds up under S&OP review is reporting three metrics together, every cycle.

MAPE (Mean Absolute Percentage Error)

MAPE is the average of the absolute percentage errors across all forecasts. Formula: MAPE = mean(|actual − forecast| / |actual|) × 100. Strength: intuitive and easy to communicate. Weakness: undefined when actual = 0 (a real problem on intermittent SKUs), and on a heterogeneous portfolio it weights every SKU equally — so a 50% miss on a $10 part counts the same as a 5% miss on a $100K part.

WMAPE (Weighted Mean Absolute Percentage Error)

WMAPE weights each SKU's absolute error by its actual demand (or inventory value, or revenue). Formula: WMAPE = sum(|actual − forecast|) / sum(|actual|) × 100. Strength: the metric reflects what actually matters to the business — errors on high-volume or high-value SKUs count more. WMAPE is the default S&OP accuracy metric in mature demand-planning organizations precisely because it does not let small SKUs dominate the score.

Forecast Bias

Bias is the signed (not absolute) average forecast error: bias = mean(forecast − actual). Positive bias means the forecast systematically over-predicts; negative bias means it systematically under-predicts. Bias is critical because absolute-error metrics (MAPE, WMAPE) hide directional patterns: a forecast can have low WMAPE while consistently over-forecasting, which translates directly into excess inventory. Tracking bias separately surfaces this.

  • MAPE: average of per-SKU percentage errors. Wins on homogeneous portfolios and small datasets where every SKU carries equal weight. Misleads on heterogeneous portfolios and intermittent demand (undefined when actual = 0); high-value SKUs get drowned out.
  • WMAPE: total absolute error divided by total demand. Wins for S&OP at scale and portfolios with mixed value and volume. Misleads almost never; this is the default S&OP metric for a reason.
  • Bias: signed mean error (over- or under-forecasting). Wins at detecting systematic forecast skew that absolute metrics hide. Misleads on its own; bias near zero with high WMAPE is still a bad forecast, so both must be tracked.
  • Forecast Value-Add (FVA): whether each step in the forecasting process improved or degraded accuracy. Wins at validating that judgmental overrides actually help and at auditing the consensus cycle. Misleads when the baseline is poorly chosen; FVA is meaningful only against a defensible baseline (typically naive).
  • Tracking Signal: cumulative bias normalized by mean absolute deviation. Wins at real-time alerting when a forecast starts drifting. Misleads on its own; it needs threshold tuning and pairs with bias for diagnosis.

The four-metric stack — WMAPE, bias, FVA, and a tracking signal — is what separates a forecast that is reportable from a forecast that is operationally trustworthy. WMAPE quantifies the magnitude of error in business terms. Bias surfaces direction. FVA validates the consensus cycle. The tracking signal alerts when a forecast that was working is starting to drift.
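
The stack itself is a handful of one-liners. A sketch with illustrative signatures (actuals and forecasts as aligned arrays) is enough to report all four per SKU or per segment every cycle:

```python
import numpy as np

def wmape(actual, forecast) -> float:
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return np.abs(a - f).sum() / np.abs(a).sum() * 100    # magnitude, demand-weighted

def bias(actual, forecast) -> float:
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return (f - a).mean()                                  # >0 over-forecasts, <0 under-forecasts

def fva(actual, statistical_forecast, final_forecast) -> float:
    """Forecast value-add of the overrides: positive means the consensus layer helped."""
    return wmape(actual, statistical_forecast) - wmape(actual, final_forecast)

def tracking_signal(actual, forecast) -> float:
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    errors = f - a
    mad = np.abs(errors).mean()
    return errors.sum() / mad if mad else 0.0              # alert when |TS| drifts past a tuned threshold
```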

The 'technically accurate, operationally useless' trap

A forecast can hit a 15% WMAPE target every cycle and still produce shortages and excess inventory if the bias is consistently negative on A-class SKUs and positive on C-class. Aggregate accuracy hides segment-level skew. The accuracy framework that holds up in production reports WMAPE and bias by segment (A/B/C × demand pattern), not just by portfolio total.

Key Takeaway

S&OP accuracy is reported as a stack, not a number. WMAPE for magnitude (weighted by what the business actually cares about), bias for direction, FVA to validate that human overrides help rather than hurt, and tracking signals to catch drift in real time. Reporting MAPE alone is the most common reason a forecast looks accurate on the dashboard and produces operational problems on the floor.

From Forecast to Decision: The S&OP Consensus Process


A forecast that nobody acts on is an academic exercise. The S&OP consensus process is the cross-functional cadence that turns the demand number into supply, inventory, and financial decisions. The model can be technically excellent and the operational outcomes can still be poor if this layer is broken.

S&OP Consensus Process

The S&OP consensus process is the monthly cross-functional cycle that produces a single, signed-off demand and supply plan. It typically runs as a 5-step cadence: data review, demand review (sales/marketing input), supply review (operations capacity check), pre-S&OP (resolve gaps), and executive S&OP (signoff and exception escalation). The output is one number — the consensus forecast — that finance, operations, and commercial all act on.

The conventional five-step cadence (sometimes described as the "5-step S&OP cycle"):

Step 01

Data Review (Week 1)

Demand planning team publishes the statistical baseline forecast plus actuals from the previous month. WMAPE, bias, and FVA from prior overrides are reported. Sales, marketing, and finance review the data before the demand review meeting — not in it. This pre-work is what makes the rest of the cycle exception-based instead of line-by-line.

Step 02

Demand Review (Week 2)

Sales, marketing, and category management review the statistical baseline and propose overrides backed by intelligence the model cannot see — promotional plans, pipeline movements, customer-specific events. Each override is logged with rationale and contributor (for FVA tracking). The output is the unconstrained demand forecast.

Step 03

Supply Review (Week 3)

Operations and supply planning evaluate the unconstrained demand against capacity, lead times, and inventory positions. Gaps are quantified — capacity shortfalls, supplier risks, inventory imbalances. The output is a supply plan that meets the demand or a documented gap to escalate.

Step 04

Pre-S&OP (Week 3-4)

Cross-functional resolution of gaps before the executive meeting. Trade-offs are quantified: expedite to meet demand vs accept stockout vs reduce demand commitment. Recommendations are prepared for executive signoff. This step prevents the executive S&OP from devolving into a debate.

Step 05

Executive S&OP (Week 4)

Executive review and signoff of the consensus plan. Only unresolved exceptions and major trade-offs are escalated here. The output is the signed-off demand and supply plan that drives finance, operations, and commercial execution for the next cycle.

Exception-based review is the scale unlock

At 50 SKUs, line-by-line review works. At 700+, it does not — the meetings stretch to 4 hours and stakeholders silently disengage. The cadence that scales is exception-based: SKUs flagged by accuracy threshold breach, bias drift, or override conflict are reviewed in detail; everything else is signed off on the baseline. Exception-based review is the operational mechanic that lets a stat team and a sales team actually collaborate at portfolio scale.

Key Takeaway

The forecast becomes a decision through the S&OP consensus process — the 5-step monthly cadence (data, demand, supply, pre-S&OP, executive) that produces one signed-off number for finance, ops, and commercial to execute against. At scale, the cadence works only when reviews are exception-based: the bottom 80% of SKUs accept the baseline; the top 20% by impact get the cross-functional debate. Without this layer, the rebuilt model produces accuracy that nobody operationalizes.

Common S&OP Forecasting Mistakes


After three roles' worth of demand-forecasting work across pharma, biotech, and semiconductor — and one full S&OP rebuild on a 700+ SKU portfolio — the mistakes that quietly degrade forecast quality are remarkably consistent.

Common Mistakes

Running one statistical method across the entire portfolio

ARIMA on lumpy demand fits noise as if it were signal. Prophet on intermittent demand forecasts fractional units that mean nothing operationally. Without segmentation, the same model that wins on smooth A-class SKUs loses on intermittent C-class SKUs — and the aggregate accuracy hides both.

The fix: Segment first using ADI / CV² classification. Route smooth and erratic SKUs to ETS or Prophet, intermittent and lumpy to Croston/TSB. Benchmark every method against a naive baseline; do not use a method that does not beat naive on its segment.

Reporting MAPE only

MAPE weights every SKU equally. On a heterogeneous portfolio, a 50% miss on a $10 SKU shows up the same as a 5% miss on a $100K SKU — so accuracy improvements on the SKUs that actually matter get drowned out, and the forecast can hit its MAPE target while shipping operational pain.

The fix: Report WMAPE (weighted by demand or value), bias (signed error), and FVA (forecast value-add of overrides) together every cycle. Report by segment, not just portfolio total.

Skipping the data audit

Modeling on dirty data — stockout-suppressed history, returns booked as negative demand, discontinued SKUs still flagged active — produces confident-looking forecasts that fail the moment they hit production. The audit feels unglamorous but is the highest-leverage activity in any rebuild.

The fix: Spend the first 1-2 weeks of any forecast rebuild on data audit before any model runs. Document the demand-history cleaning rules; apply them consistently; check them quarterly.

Treating judgmental overrides as ground truth

Sales and marketing intelligence is real and valuable — and unaudited overrides routinely degrade forecast accuracy 30-50% of the time. Without measuring whether each override improved or worsened the baseline, the consensus cycle becomes political negotiation instead of data-driven review.

The fix: Track Forecast Value-Add per contributor every cycle. Surface negative-FVA overrides; require justification. Approve overrides on a rolling basis, not by default.

Forecasting at SKU level only

SKU-level forecasts are noisier than family- or region-level forecasts because aggregation reduces noise. A SKU-level forecast that does not respect the more reliable family signal will systematically miss in the same direction, even if the per-SKU model is well configured.

The fix: Apply hierarchical reconciliation across the SKU/family/region/total hierarchy. The MinT (minimum trace) method is a defensible default; top-down or bottom-up are appropriate when business reasoning favors one direction.

No parallel-run period before retiring the legacy forecast

Switching cold from old to new forecast guarantees stakeholder distrust. Sales, finance, and ops have built mental models around the old forecast's quirks. The new forecast is not just a different number — it is a different process — and trust is built only by running them side by side and showing the rebuild outperforming on metrics that matter.

The fix: Run the rebuilt forecast in parallel with the legacy for at least one full S&OP cycle. Report both numbers, both WMAPEs, and the operational outcomes (shortages, expedites, dead stock). Retire the legacy forecast only after the rebuild has demonstrably won.

Reviewing every SKU in every consensus meeting

At 700+ SKUs, line-by-line review stretches meetings to four hours and stakeholders silently disengage. The consensus process becomes a ritual nobody respects. The forecast that emerges is statistical and untrusted, or judgmental and untracked.

The fix: Switch to exception-based review. SKUs flagged by accuracy threshold breach, bias drift, override conflict, or business event get the cross-functional debate. The bottom 80% of SKUs are signed off on the baseline. Reviews compress to 60-90 minutes and stay focused on what matters.

Key Takeaway

The seven mistakes are remarkably consistent across organizations and industries. Each is preventable with the discipline already documented in the playbook above — segment before model, report WMAPE plus bias plus FVA, audit before classify, track override value-add, reconcile hierarchically, parallel-run before retire, and review by exception. None of these is technically novel; what makes them rare in practice is the discipline of doing all seven, every cycle.

Building and Running the Forecast In-House vs Buying a Vendor Planning Suite

Pros
  • Full control over segmentation logic, method choice, and accuracy framework — tunable to the actual portfolio
  • Lower long-term total cost of ownership once the team is built; no per-seat or per-SKU licensing fees
  • The team owning the rebuild becomes the team that operates the forecast, which compresses learning cycles
  • Tooling stays modular: ARIMA from statsmodels, Prophet from Meta, Croston from any time-series library, BI in Power BI or Tableau — all can be swapped without vendor lock-in
  • Forecast logic is auditable end-to-end, which matters in regulated industries (pharma, biotech) where forecast inputs to financial planning are scrutinized
Cons
  • Higher upfront investment: data engineering, modeling, BI, and S&OP cadence design are non-trivial
  • Requires retaining domain talent; turnover in the demand-planning team can stall the operating cadence
  • No vendor accountability when accuracy drifts; the team owns the failure mode
  • Hierarchical reconciliation, FVA tracking, and exception-based review need to be built rather than configured
  • Less defensible to executives unfamiliar with the build vs buy trade-off — vendor solutions are often easier to justify on the budget line even when in-house outperforms operationally
Key Takeaways: S&OP Demand Forecasting at Scale
  1. S&OP forecasting at 700+ SKUs is a system problem, not a modeling problem — demand-pattern heterogeneity, data drift, process drift, and metric drift are the four silent failure modes a rebuild must address
  2. Demand-pattern segmentation (smooth, intermittent, erratic, lumpy) via ADI / CV² classification routes each SKU to the family of methods that fits its pattern — the highest-leverage change in any rebuild
  3. Method matching: ETS or ARIMA on smooth, Prophet on erratic, Croston/TSB on intermittent and lumpy, with naive as the baseline every method must beat
  4. Report WMAPE, bias, and FVA together — not MAPE alone — and report them by segment, not just portfolio total
  5. The 7-step rebuild playbook gates each phase on the previous: audit before classify, classify before model, baseline before benchmark, statistical before judgmental, SKU-level before hierarchical, parallel-run before retire
  6. The S&OP consensus process turns the forecast into decisions through a 5-step monthly cadence — and only scales when reviews are exception-based, not line-by-line
  7. The seven common mistakes (one-model-fits-all, MAPE only, skipping audit, unaudited overrides, no hierarchical reconciliation, no parallel run, exhaustive review) are preventable with the discipline of doing all seven preventives every cycle

FAQ


How big does a portfolio need to be before segmentation matters?

Segmentation starts paying off around 50-100 SKUs and becomes essential past 200. Below 50, the overhead of maintaining multiple model pipelines may exceed the accuracy gain. Past 200, running one model across the portfolio leaves accuracy on the table for the SKUs that matter most. The 700+ SKU rebuild at Thermo Fisher sat squarely in the 'segmentation is essential' range.

Is Prophet better than ARIMA for S&OP demand forecasting?

Neither is universally better — they win on different demand patterns. Prophet excels on erratic, seasonal, or externally driven demand (promotions, holidays, events). ARIMA excels on smooth, autocorrelated demand without strong external drivers. Running them head-to-head on the same SKU and picking the winner is more useful than picking a global default. Both should always be benchmarked against a naive baseline.

What forecast horizon should S&OP target?

Typical S&OP horizon is 12-18 months in monthly buckets. The first 3 months drive operational decisions (orders, expedites, inventory); months 4-12 drive supply planning, capacity, and procurement contracts; months 13-18 drive financial planning and longer-term capacity decisions. Different horizons may need different methods — short-horizon forecasts often benefit from more reactive models, longer-horizon from more stable seasonal/trend models.

How do you handle new SKUs with no demand history?

Cold-start SKUs need a separate process: analog-based forecasting (use the demand profile of a similar existing SKU as the starting point), parameter overlays from product management's launch plan, and aggressive review cadence (weekly or bi-weekly) until the SKU accumulates 6-12 months of history and can transition to the standard segmented forecast. Forcing a new SKU into the standard pipeline produces forecasts that look statistical but are essentially fabricated.

What is the relationship between demand forecasting and inventory parameters (safety stock, reorder points)?

The forecast drives the inventory parameters: safety stock formulas use forecast error variance, reorder points use forecast lead-time demand, and economic order quantity uses forecast demand rate. A bad forecast cascades — the inventory parameters built on it are wrong, and the operational outcomes (stockouts and excess) follow. This is why the forecast rebuild is the upstream lever that needs to land before parameter recalibration is meaningful.
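
The textbook linkage is short enough to show. A sketch with the standard formulas, assuming normally distributed forecast error and lead time expressed in the same period buckets as the forecast (real policies also depend on review period and service definition):

```python
import math
from scipy.stats import norm

def safety_stock(forecast_error_std: float, lead_time_periods: float, service_level: float = 0.95) -> float:
    """Safety stock sized from forecast error, not from demand history alone."""
    z = norm.ppf(service_level)                     # service-level z-factor
    return z * forecast_error_std * math.sqrt(lead_time_periods)

def reorder_point(forecast_per_period: float, lead_time_periods: float, ss: float) -> float:
    """Reorder point = forecast lead-time demand plus safety stock."""
    return forecast_per_period * lead_time_periods + ss
```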

How often should the demand-pattern classification be refreshed?

Quarterly is the typical cadence for S&OP. Demand patterns shift over time — a SKU that was smooth a year ago can become lumpy after a customer change, product transition, or market event. Re-running ADI / CV² classification quarterly and re-routing SKUs to the appropriate method prevents the slow drift where 100 SKUs are silently being forecasted with the wrong method.

How do you quantify the business value of a forecast accuracy improvement?

Translate the WMAPE improvement into operational outcomes: reduced safety stock (because forecast variance is lower), reduced expedite frequency (because forecast bias is lower), reduced stockouts (because demand spikes are anticipated better), and reduced excess inventory (because forecasts are not systematically over). Each of these has a dollar value the finance team can quantify. The Thermo Fisher rebuild surfaced $180K in excess inventory partly because the rebuilt forecast exposed which SKUs had been chronically over-forecasted.