Lesson 6 of 13

The data is the teacher

Explain why the training data sets both what a model can do and its blind spots.

01 · Learn · the idea

Two teams build the same thing: a machine that reads road signs from a camera. Same design, same training loop, same everything. One team’s machine works beautifully. The other team’s machine is dangerous. The only difference is the pile of photos each team handed it to learn from. That pile — nothing else — decides what the machine can do and where it falls apart. In machine learning, the data is the teacher, and the machine never learns a single thing its teacher didn’t show it.

The machine has no other source of truth

You’ve seen how this works. The model doesn’t get told the rules (i1). It gets shown examples and digs the pattern out of them (i1), tuning its numbers to fit what it saw (i2, i4). That’s the whole supply of knowledge. There is no back door — no textbook it consults, no world it looks out at, no common sense it was born with. The examples are the entire world the model ever meets.

So whatever is in the pile becomes what the model can do. And whatever is missing from the pile becomes a hole the model can’t see. It doesn’t know the hole is there. Remember, the model always computes an answer (i2) — for any input, it runs its numbers and returns a guess. Feed it something it was never shown, and it doesn’t stop and say “I don’t know.” It confidently returns an answer anyway. Just a bad one.

This is the part that catches people out. A gap in the data isn’t like a warning light. The model has no way to notice that it was never shown rain, or night, or a certain kind of face. It behaves exactly as sure of itself in the dark as in daylight. The confidence is the same; only the correctness collapses.

A hand-written rulebook often fails loudly: it hits a case with no rule and stops. A learned model fails quietly. It always has a number to give you, so it always gives one. The blindness is invisible from the inside.
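To see that quiet failure in miniature, here is a small Python sketch. It is an illustration only, not how any real sign reader is built: the one-number inputs, the `train` and `predict` helpers, and the step size are all made up for the example. A tiny model learns to split inputs between 0 and 1 into two classes, then gets an input of 25, far outside anything it was shown.

```python
import math

def sigmoid(z):
    """Squash any number into the range 0..1 (written to avoid overflow)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(xs, ys, steps=2000, lr=1.0):
    """Fit two numbers, w and b, by nudging them to shrink the error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return (label, confidence). Note there is no 'I don't know' branch."""
    p = sigmoid(w * x + b)
    label = 1 if p >= 0.5 else 0
    confidence = p if label == 1 else 1.0 - p
    return label, confidence

# Hypothetical training pile: every example lies between 0 and 1.
train_x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
train_y = [0, 0, 0, 1, 1, 1]
w, b = train(train_x, train_y)

print(predict(w, b, 0.9))   # an input like the ones it was shown
print(predict(w, b, 25.0))  # an input far outside the pile
```

Both calls return a label and a confidence. The second input is nothing like the training pile, yet its confidence comes back at least as high as the first, because nothing in `predict` checks whether the input resembles what the model saw.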

Representativeness beats raw size

You might think the fix is just “more data.” Not quite. A million examples that all look the same teach less than a smaller, varied pile. If every photo is a sunny street, then a million of them still teach the model nothing about rain. Size isn’t coverage. What matters is whether the examples span the situations the model will actually face. A small, well-spread pile beats a giant, lopsided one.
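One rough way to check coverage, as opposed to size, is to count examples per condition and list the conditions that have none. The sketch below assumes each photo is tagged with its condition; the two piles, the list of expected conditions, and the `missing_conditions` helper are all hypothetical.

```python
from collections import Counter

def missing_conditions(pile, expected):
    """List the expected conditions that have zero examples in the pile."""
    return [c for c in expected if pile.get(c, 0) == 0]

# Conditions we assume the model will meet on the road (illustrative).
expected = ["sunny", "rain", "snow", "night"]

# Hypothetical piles, stored as condition -> number of photos.
giant_lopsided = Counter({"sunny": 1_000_000})
small_varied = Counter({"sunny": 5_000, "rain": 5_000, "snow": 5_000, "night": 5_000})

for name, pile in [("giant, lopsided", giant_lopsided), ("small, varied", small_varied)]:
    total = sum(pile.values())
    print(name, "| photos:", total, "| missing:", missing_conditions(pile, expected))
```

The million-photo pile is fifty times larger and still misses three of the four conditions. A zero count is only the crudest test, though: a condition with a handful of examples is still barely covered.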

And it cuts the other way too. If the pile has errors, or leans heavily one direction, the model swallows all of it as truth. Wrong labels, missing cases, a skewed mix — the model can’t tell these apart from real signal. It just fits what it’s given. This is the old rule: garbage in, garbage out. The model is a mirror of its data, warts and all.
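Here is garbage in, garbage out at toy scale. The "model" in this sketch is deliberately simple and is an assumption for illustration: it learns, for each sign shape, whichever label appeared most often in its pile. The `fit` helper and both piles are made up.

```python
from collections import Counter, defaultdict

def fit(examples):
    """Learn, for each sign shape, the label seen most often in the pile."""
    votes = defaultdict(Counter)
    for shape, label in examples:
        votes[shape][label] += 1
    return {shape: counts.most_common(1)[0][0] for shape, counts in votes.items()}

# Hypothetical clean pile: octagons are labelled correctly.
clean = [("octagon", "stop")] * 9 + [("triangle", "yield")] * 9

# Same shapes, but most octagons were mislabelled by mistake.
dirty = [("octagon", "yield")] * 7 + [("octagon", "stop")] * 2 + [("triangle", "yield")] * 9

print(fit(clean)["octagon"])  # stop
print(fit(dirty)["octagon"])  # yield
```

The learning code is identical in both runs. Given the mislabelled pile, it learns that octagons mean "yield", and nothing inside it can flag that as wrong.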

A worked example: the road-sign reader

Take a vision model trained on 100,000 road photos. Sounds like plenty. But look at the mix: 95,000 were shot on clear sunny days, only 5,000 in rain, and none at night. The teacher was almost entirely a sunny teacher.

Now test it. On a fresh sunny test set, it reads signs correctly about 98% of the time. Excellent. Ship it? Watch what happens on a rainy test set: the same model drops to about 55%, which means it is wrong on nearly half the signs it sees. It never really learned rain, because rain was nearly absent from its teacher. At night it’s worse still. It saw zero night photos, so it isn’t reading signs at all. It’s guessing, while reporting the same steady confidence it had at noon.
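It helps to see why the two test sets have to be scored separately. The sketch below takes the 98% and 55% figures from the example and blends them into one headline number. The 95/5 test mix is an assumption (a test set that mirrors the training mix), and `blended_accuracy` is a made-up helper.

```python
def blended_accuracy(mix, accuracy):
    """Weighted average of per-condition accuracy, weighted by test-set share."""
    return sum(mix[c] * accuracy[c] for c in mix)

# Per-condition accuracy from the worked example.
accuracy = {"sunny": 0.98, "rain": 0.55}

# Assumption: the test set has the same 95/5 sunny-to-rain mix as training.
test_mix = {"sunny": 0.95, "rain": 0.05}

overall = blended_accuracy(test_mix, accuracy)
print(f"overall: {overall:.1%}")
for condition, acc in accuracy.items():
    print(f"{condition}: {acc:.0%}")
```

Under that assumed mix the single overall score lands near 96%, which looks ready to ship. The rain failure only shows up when each condition is scored on its own.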

The machine isn’t broken. It’s a faithful mirror of a lopsided pile.

Now fix the teacher, not the machine. Add 40,000 varied rain-and-night photos to the training set and run the training loop again. Rainy accuracy climbs toward about 95%. Nothing about the model’s design changed: same structure, same loop, same math. Only the numbers it tuned are new, because its teacher got more complete, and so the model got more capable. The blind spots closed because the data’s blind spots closed.
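The change in the teacher can be put in numbers. This sketch uses only the counts given in the example. The lesson does not say how the 40,000 added photos split between rain and night, so they are kept as one group here rather than guessed at. The `shares` helper is made up for the illustration.

```python
def shares(pile):
    """Turn photo counts into fractions of the whole pile."""
    total = sum(pile.values())
    return {condition: count / total for condition, count in pile.items()}

# Original training pile from the worked example.
before = {"sunny": 95_000, "rain": 5_000, "night": 0}

# After the fix. The rain/night split of the added photos is not given,
# so they stay together as a single group.
after = {"sunny": 95_000, "rain": 5_000, "added rain and night": 40_000}

before_shares = shares(before)
after_shares = shares(after)
non_sunny_before = 1 - before_shares["sunny"]
non_sunny_after = 1 - after_shares["sunny"]

print(f"non-sunny share before: {non_sunny_before:.1%}")
print(f"non-sunny share after:  {non_sunny_after:.1%}")
```

The non-sunny share of the pile goes from 5,000 in 100,000 to 45,000 in 140,000, roughly a third of everything the model sees.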

Why this is the whole game

This is why the same model can be brilliant in one setting and useless in another. The second setting simply wasn’t in its teacher. A tool trained on one kind of case, moved to a case it never saw, doesn’t gracefully adapt — it guesses with a straight face. The performance you measured says nothing about the situations your test set didn’t cover.

Hold on to this: a model knows exactly as much of the world as its examples showed it, and not one thing more. Its skills are its data’s skills. Its blind spots are its data’s blind spots, and it can’t feel either one. When you meet a machine that seems to understand, the honest question isn’t “how smart is it,” but “what was it shown, and what was quietly left out?” We built machines that mirror the pile we hand them. The pile is doing more of the deciding than the machine is, and it’s the part people most often overlook.

02 · Try · the lab

03 · Check · quick quiz

1. A sign-reading model was trained on 100,000 photos: 95,000 sunny, 5,000 rainy, none at night. It scores 98% on a sunny test and 55% on a rainy test. What best explains the rainy score?

The model is broken and needs to be rebuilt from scratch
Rain was almost absent from the training data, so the model barely learned it
Rainy photos are simply harder for any camera to read
The model got tired after processing so many sunny photos

Answer

Rain was almost absent from the training data, so the model barely learned it — The data is the teacher. Rain was nearly missing from the pile, so the model has a blind spot there. Nothing about the model is broken — it faithfully mirrors a lopsided training set. Fix the data, not the machine.

2. That same model is shown a night-time sign it never saw anything like in training. What does it do?

It stops and reports that it doesn't have enough information
It refuses to answer to stay safe
It confidently returns an answer anyway — most likely a wrong guess
It automatically fetches more night photos to learn from

Answer

It confidently returns an answer anyway — most likely a wrong guess — A model always computes an answer for any input (i2). It has no way to feel that night was missing from its teacher, so it doesn't hesitate — it returns a guess with the same confidence it shows in daylight. The blindness is invisible from the inside.

3. Team A has 2,000,000 training photos, all taken on sunny mornings on one street. Team B has 80,000 photos spread across sun, rain, snow, and night. Whose model will handle varied real-world driving better?

Team B — its data covers the situations the model will actually face
Team A — more photos always means a better model
They will perform the same, since size and variety cancel out
Impossible to say without knowing the camera resolution

Answer

Team B — its data covers the situations the model will actually face — Representativeness beats raw size. Two million near-identical sunny photos still teach nothing about rain or night. A smaller, well-spread pile that covers the real conditions wins. Coverage, not count, is what matters.

4. The road-sign model does great in the sunny city it was tested in, but reads signs poorly in a snowy region. What's the honest question to ask about it?

How much electricity does it use per photo?
How advanced is the model's internal design?
Is the model old and in need of an update?
What conditions were in its training data, and what was quietly left out?

Answer

What conditions were in its training data, and what was quietly left out? — A model's skills are its data's skills, and its blind spots are its data's blind spots. The snowy region likely wasn't in its teacher. The performance you measured in one setting says nothing about the situations the training data never covered.
