2026-05-08 · 6 min read · #time-series #lessons

What Air Cargo Forecasting Taught Me About Real-World ML

My master's thesis was on forecasting air cargo demand. It sounds dry. It turned out to be the project that taught me more about applied ML than any course I took.

The setup: you have years of weekly cargo volumes for a major German airline. You also have oil prices, exchange rates, a calendar of public holidays, and a long memory of which lanes spike when. The question: can you predict next month's volume well enough to plan capacity?

Here are the five things I wish I'd known on day one.

1. The baseline is almost always closer than you think

Before you reach for an LSTM, fit a "seasonal naive" model with a trend term: predict that next week looks like the same week last year, shifted by the recent year-over-year change. On my data, that baseline was within 8% of the best deep model I trained. The deep model took two weeks to tune. The baseline took ten minutes.

This isn't a knock on deep learning. It's a reality check: if your fancy model isn't substantially better than the dumb one, the marginal accuracy isn't worth the operational cost of running it.

Diminishing returns are real. The jump from "nothing" to "decent features" is bigger than the jump from "decent" to "exotic."
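
A minimal sketch of the seasonal naive baseline from this section, assuming a gap-free weekly series and a 52-week season. The function name and the way the trend is estimated (average year-over-year difference over the last season) are illustrative assumptions, not the thesis code.

```python
from statistics import mean

SEASON = 52  # assumption: weekly data, one season = 52 weeks, no missing weeks


def seasonal_naive_forecast(history, horizon, season=SEASON):
    """Forecast `horizon` steps: same week last season plus a trend term."""
    if len(history) < 2 * season:
        raise ValueError("need at least two full seasons of history")
    if not 1 <= horizon <= season:
        raise ValueError("horizon must be between 1 and one season")

    # Trend: average change between the last season and the one before it.
    last = history[-season:]
    prev = history[-2 * season:-season]
    trend = mean(a - b for a, b in zip(last, prev))

    n = len(history)
    # For step h, the target sits at index n - 1 + h; look one season back.
    return [history[n - 1 + h - season] + trend for h in range(1, horizon + 1)]
```

Anything more elaborate then has to beat this on the same evaluation before it earns its running cost.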

2. Features beat models, almost always

The single biggest accuracy jump I got was from adding three features: a lagged oil price (because fuel costs ripple into freight demand 4–6 weeks later), a holiday calendar with country-specific weights, and a "Q4 ramp" indicator capturing the pre-Christmas surge.

Those three features did more for accuracy than switching between five different model architectures. The lesson sticks: spend two hours on features for every hour you spend on models.
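
A rough sketch of how those three features might be built from weekly data. Everything specific here is an assumption for illustration: the single 5-week lag (picked from the 4–6 week range), the country weights, and the shape of the Q4 ramp (linear from 1 October to 24 December).

```python
from datetime import date, timedelta

OIL_LAG_WEEKS = 5  # assumption: one lag chosen from the 4-6 week range
HOLIDAY_WEIGHTS = {"DE": 1.0, "US": 0.6, "CN": 0.8}  # hypothetical weights


def q4_ramp(day):
    """0.0 before 1 October and after 24 December, rising linearly between."""
    start, peak = date(day.year, 10, 1), date(day.year, 12, 24)
    if day < start or day > peak:
        return 0.0
    return (day - start).days / (peak - start).days


def build_features(week_starts, oil_prices, holidays):
    """Build one feature dict per week.

    week_starts: ordered list of dates (first day of each week)
    oil_prices:  list of floats aligned with week_starts
    holidays:    list of (date, country_code) tuples
    """
    rows = []
    for i, start in enumerate(week_starts):
        week_days = {start + timedelta(days=d) for d in range(7)}
        rows.append({
            "week_start": start,
            # None until enough history exists to look back OIL_LAG_WEEKS.
            "oil_price_lagged": oil_prices[i - OIL_LAG_WEEKS] if i >= OIL_LAG_WEEKS else None,
            # Sum the weights of all holidays that fall inside this week.
            "holiday_weight": sum(
                HOLIDAY_WEIGHTS.get(country, 0.0)
                for day, country in holidays
                if day in week_days
            ),
            "q4_ramp": q4_ramp(start),
        })
    return rows
```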

3. The evaluation metric matters more than the model

I started by optimising for RMSE. The airline didn't care about RMSE. They cared about being able to plan capacity a month ahead, so what mattered was the 4-week-ahead MAPE on the worst-performing lane, not the average across the whole network.

Once we changed the loss to weight the harder-to-predict lanes more heavily, the model's network-wide average got noticeably worse and its worst-lane numbers got noticeably better. Boss happy. Average RMSE: irrelevant.

If you can't write the loss as "we want to minimise X because Y will happen if we don't," you're optimising the wrong thing.
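
The worst-lane metric from this section can be sketched as below. The lane names in the usage note are made up, and the weighted variant is only one plausible way to "weight harder lanes higher"; the actual weighting used in the project isn't specified here.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error in percent, skipping zero actuals."""
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actuals to score against")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def worst_lane_mape(actuals_by_lane, forecasts_by_lane):
    """Return (lane, mape) for the lane with the highest MAPE."""
    scores = {
        lane: mape(actuals_by_lane[lane], forecasts_by_lane[lane])
        for lane in actuals_by_lane
    }
    worst = max(scores, key=scores.get)
    return worst, scores[worst]


def lane_weighted_mape(actuals_by_lane, forecasts_by_lane, weights):
    """Weighted average of per-lane MAPE; larger weight = lane matters more."""
    total = sum(weights[lane] for lane in actuals_by_lane)
    return sum(
        weights[lane] * mape(actuals_by_lane[lane], forecasts_by_lane[lane])
        for lane in actuals_by_lane
    ) / total
```

With two hypothetical lanes, `worst_lane_mape({"FRA-JFK": [100, 200], "FRA-PVG": [100, 100]}, {"FRA-JFK": [110, 180], "FRA-PVG": [150, 100]})` picks out FRA-PVG, even though a network-wide average would hide how far off that lane is.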

4. Distribution shift will eat your lunch

My validation strategy was time-based: train on 2017–2021, validate on 2022. Models did well on validation. They did much worse on the held-out 2023 test. Why? COVID. The shape of demand shifted, and our 2022 validation happened to look more like the training years than 2023 did.

The fix wasn't a better model. It was rolling-origin evaluation: train, predict 4 weeks, slide forward, repeat. This gave us a much better estimate of "how the model behaves when the world keeps changing," which is the only kind of evaluation that matters in production.
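
Rolling-origin evaluation, sketched under a few assumptions: an expanding training window, a 4-week horizon, and a `fit_predict(train, horizon)` callable standing in for whatever model is being tested. The names and the choice of MAPE as the per-fold score are illustrative.

```python
def rolling_origin_eval(series, fit_predict, initial_train, horizon=4, step=4):
    """Score a model over successive forecast origins.

    fit_predict(train, horizon) must return a list of `horizon` forecasts.
    Returns a list of (origin_index, mape_percent) tuples, one per fold.
    """
    results = []
    origin = initial_train
    while origin + horizon <= len(series):
        train = series[:origin]                    # everything before the origin
        actual = series[origin:origin + horizon]   # the next `horizon` weeks
        forecast = fit_predict(train, horizon)
        errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast) if a != 0]
        if errors:  # skip folds where every actual is zero
            results.append((origin, 100.0 * sum(errors) / len(errors)))
        origin += step                             # slide forward and repeat
    return results


def last_value_model(train, horizon):
    """Trivial stand-in model: repeat the last observed value."""
    return [train[-1]] * horizon
```

Looking at the spread of the per-fold scores, not just their mean, shows how the model copes as the data drifts.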

5. The "last mile" is operational, not statistical

The model could be 99% accurate, but if its output is a CSV that lands in an analyst's inbox every Monday at 6:30am — and they don't open Outlook until 10 — you have a 0%-useful model. The forecast needs to land where the planning decision is made, in the format that decision-makers already trust.

So we built a small Power BI dashboard with three views: this week, next 4 weeks, and "what changed since last forecast." That last view was the one people kept open in a browser tab. Without it, the project would have shipped, been admired, and quietly died.
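
The data behind a "what changed since last forecast" view could be prepared along these lines. This is a hypothetical sketch, not how the dashboard was actually built: the `(lane, week)` keys and the 5% threshold are assumptions.

```python
def forecast_changes(previous, current, threshold_pct=5.0):
    """Compare two forecast runs keyed by (lane, week).

    Returns (key, old, new, pct_change) tuples for entries whose forecast
    moved by at least threshold_pct, largest absolute move first.
    """
    changes = []
    for key, new in current.items():
        old = previous.get(key)
        if old is None or old == 0:
            continue  # no comparable earlier forecast for this lane and week
        pct = 100.0 * (new - old) / old
        if abs(pct) >= threshold_pct:
            changes.append((key, old, new, pct))
    return sorted(changes, key=lambda row: abs(row[3]), reverse=True)
```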

Closing

None of this is groundbreaking. But almost every "ML in production failed" story I've heard since traces back to one of these five points, usually two or three at once. The thesis is online if you want the technical detail; the list above is the version that's lasted me four years.
