Lesson 7 of 13
Memorising versus learning
Tell overfitting from real learning, and explain the train/test split.
01 · Learn · the idea
Two students prepare for the same exam. One gets hold of last year’s paper and memorises the answers, word for word, question by question. On last year’s paper, they score a flawless 100%. The other student learns the actual subject — a little rusty on last year’s exact questions, maybe 90%, but they understand the material. Then the real exam arrives, with new questions. The memoriser is lost. The learner is fine. A machine can fail in exactly this way, and it is one of the most important failures to understand, because a model that has memorised looks brilliant right up until the moment it meets something new.
The trap hiding inside the training loop
Recall the training loop from earlier. It only ever does one thing: shrink the error on the examples it was shown. That is its entire job, and it is very good at it. But there is a trap folded inside that single-minded goal.
If the model is flexible enough — enough adjustable numbers, enough freedom to bend — it can drive the training error all the way to zero. Not by finding the real pattern, but by bending itself to pass through every single example exactly, flukes and all. Real data is noisy. Some points sit high or low by pure chance. A model chasing zero error will contort itself to hit those chance wobbles too, treating noise as if it were signal.
That is overfitting: fitting the noise in the examples instead of the pattern underneath. The model didn’t learn the subject. It memorised the past paper.
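To make the memorising concrete, here is a minimal sketch of the most extreme memoriser there is: a model that simply stores its training examples and answers with the closest one. This is a stand-in for illustration, not the flexible curve described above, and the four students and their scores are made up.

```python
# Hypothetical training examples: (hours studied, exam score).
# The scores wobble around a rising trend; the wobble is the noise.
training = [(1, 52), (2, 58), (3, 72), (4, 78)]

def memoriser(hours):
    """Return the score of the memorised student whose hours are closest."""
    nearest_hours, nearest_score = min(training, key=lambda ex: abs(ex[0] - hours))
    return nearest_score

# On the examples it was shown, the memoriser is never wrong.
training_error = sum(abs(memoriser(h) - s) for h, s in training) / len(training)
print(training_error)

# On a student it never saw, it can only repeat a memorised answer, noise included.
print(memoriser(2.4))
```

The training error comes out as zero, yet nothing here resembles a rule connecting hours to scores. The model has stored the past paper and nothing else.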
The tell: hold some examples back
Here is the fix, and it is beautifully simple. Before training, split your examples into two piles. Most go into a training set the model learns from. But hold some back — a test set the model never, ever sees during training. Then, after training finishes, you check the model on that held-back pile.
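The split itself takes only a few lines. The sketch below assumes the examples are a plain list, shuffles a copy, and holds back a quarter of them. The function name, the 25% fraction, the fixed seed and the 16 made-up students are all illustrative choices, not part of the lesson.

```python
import random

def split_examples(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples, then hold back a fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

# Hypothetical (hours, score) pairs for 16 students.
students = [(h, 40 + 3 * h) for h in range(1, 17)]
training_set, test_set = split_examples(students)
print(len(training_set), len(test_set))  # 12 4
```

Shuffling before splitting matters: if the list happened to be sorted by hours, an unshuffled split would put all the hardest-working students in one pile.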
Now you have two scores. The training score — how well it does on what it studied. And the test score — how well it does on questions it never saw. The gap between them is the overfitting alarm.
- Both scores high, and close together? The model learned the real pattern. It generalises — it works on new data.
- Training score high but test score low? It memorised. It aced the past paper and flunked the real exam.
Generalisation — doing well on data it never saw — is the actual goal. Not a perfect training score. A model that nails its training data and flops on new data is worse than useless, because it looks like a triumph and fails in the world.
A worked example: predicting exam scores
Say you want to predict a student’s exam score from the hours they studied. You collect data on 12 students, and these form your training set. Plot them and they roughly rise: more hours, higher scores. But it’s real life, so the points wobble. Some diligent student had a bad day; some lucky one over-performed.
You try three models and check each on the 12 training points, then on a held-back test set.
Too simple: a flat, straight line. It barely follows the upward trend. Training accuracy: 68%. Test accuracy: 66%. Both poor, and close together. This model is underfitting: too rigid to capture even the basic shape. It didn’t memorise; it just never learned.
Just right: a gentle curve. It follows the trend without chasing every wobble. Training accuracy: 90%. Test accuracy: 88%. Both high, and almost no gap. It learned the real pattern, and that pattern holds up on students it never saw.
Too flexible: a wildly wiggly curve. It threads through every single one of the 12 training points perfectly. Training accuracy: 100%, flawless on what it memorised. Test accuracy: 62%, worse than the gentle curve and worse even than the flat line’s 66%. The wiggle chased the noise in those 12 points, and the noise doesn’t repeat in new data, so on the test set it lurches all over the place.
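The same three-way comparison can be sketched in code on a smaller scale. Everything here is an assumption for illustration: four made-up training students instead of 12, four made-up held-back students, a constant prediction as the too-simple model, a least-squares straight line as the well-matched one (the made-up trend is roughly straight), and a curve forced through every training point as the too-flexible one. It also reports average error in points, where lower is better, rather than the accuracy percentages used above.

```python
# Hypothetical data: (hours studied, exam score), four students per pile.
training = [(1, 52), (2, 58), (3, 72), (4, 78)]
held_back = [(1.5, 56), (2.5, 64), (3.5, 76), (4.5, 84)]

def fit_constant(examples):
    """Too simple: always predict the average training score."""
    mean_score = sum(s for _, s in examples) / len(examples)
    return lambda hours: mean_score

def fit_line(examples):
    """Least-squares straight line through the examples."""
    n = len(examples)
    mean_h = sum(h for h, _ in examples) / n
    mean_s = sum(s for _, s in examples) / n
    slope = (sum((h - mean_h) * (s - mean_s) for h, s in examples)
             / sum((h - mean_h) ** 2 for h, _ in examples))
    return lambda hours: mean_s + slope * (hours - mean_h)

def fit_wiggle(examples):
    """Too flexible: the polynomial through every example (Lagrange form)."""
    def predict(hours):
        total = 0.0
        for i, (h_i, s_i) in enumerate(examples):
            weight = 1.0
            for j, (h_j, _) in enumerate(examples):
                if j != i:
                    weight *= (hours - h_j) / (h_i - h_j)
            total += s_i * weight
        return total
    return predict

def mean_abs_error(model, examples):
    """Average distance, in points, between predicted and actual scores."""
    return sum(abs(model(h) - s) for h, s in examples) / len(examples)

results = {}
for name, fit in [("constant", fit_constant), ("line", fit_line), ("wiggle", fit_wiggle)]:
    model = fit(training)  # every model sees only the training pile
    results[name] = (mean_abs_error(model, training), mean_abs_error(model, held_back))
    print(name, results[name])  # (training error, held-back error)
```

On this made-up data the constant is off by about 10 points on both piles, the line by under 2 points on both, and the wiggle by nothing at all on the training pile but by about 4 points on the held-back one. The shape matches the three cases above: poor on both, good on both, perfect on one and noticeably worse on the other.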
Read the pattern by watching the gap. Good fit: 90 versus 88, a gap of just 2 points. Overfit: 100 versus 62, a gap of 38 points. A perfect training score is a red flag, not a trophy.
The two ways to fail
Notice there were two failures, not one, sitting on either side of the good fit.
Underfitting is a model too simple to catch the pattern — both scores low, like the flat line at 68 and 66. It can’t even fit what it was shown.
Overfitting is a model too flexible, memorising noise — training score high, test score low, like the wiggle at 100 and 62. It fits what it was shown too well and learns the wrong lesson.
The sweet spot lives in between: flexible enough to catch the real pattern, disciplined enough to ignore the noise. Finding that spot is a large part of the craft of building a model that works.
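The two failures can be told apart by a simple rule of thumb over the two scores. The sketch below is one way to write that rule down. The cut-offs (a score of 80 counting as good, a gap above 10 points counting as an alarm) are illustrative assumptions, not standard values, and the right thresholds depend on the problem.

```python
def diagnose(train_score, test_score, good_enough=80, max_gap=10):
    """Label a model from its two scores (percentages).

    good_enough and max_gap are illustrative thresholds, not standard values.
    """
    gap = train_score - test_score
    if gap > max_gap:
        return "overfitting"   # did far better on what it studied than on new data
    if train_score < good_enough:
        return "underfitting"  # could not even fit what it was shown
    return "good fit"          # both scores high and close together

# The three models from the worked example: (training score, test score).
for name, scores in [("straight line", (68, 66)),
                     ("gentle curve", (90, 88)),
                     ("wiggly curve", (100, 62))]:
    print(name, scores[0] - scores[1], diagnose(*scores))
```

This prints a gap of 2 and "underfitting" for the straight line, a gap of 2 and "good fit" for the gentle curve, and a gap of 38 and "overfitting" for the wiggly curve. Note that the gap alone cannot separate the first two; you need the level of the scores as well.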
On the whole
A model that scores perfectly on its training data has told you almost nothing. It might have learned the subject, or it might have memorised the answers — and from the training score alone, you cannot tell which. That is why the held-back test set matters so much: it is the only honest question, the one the model couldn’t have studied for.
It is worth keeping this in mind. When a system is described as accurate, the real question is: accurate on what? On the examples it was tuned on, or on the world it will actually meet? Those can be very far apart. The machine is only ever shrinking one number on one pile of data. Whether that becomes real understanding or expensive memorisation only shows on a test it was never allowed to see, and it is easy, from the outside, to mistake one for the other.
02 · Try · the lab
03 · Check · quick quiz
1. A model scores 100% on its training examples. What does that tell you about how well it will do on new data?
- Almost nothing on its own — a perfect training score can mean it learned the pattern OR that it memorised the examples, and you can't tell which without a test
- It will do excellently on new data, since 100% is the best possible result
- It has definitely learned the real pattern behind the data
- It means the training data was too easy and should be made harder
Answer
Almost nothing on its own — a perfect training score can mean it learned the pattern OR that it memorised the examples, and you can't tell which without a test — The training loop only ever shrinks error on the examples shown. A perfect score could be genuine learning or pure memorisation of noise. Only a held-back test set reveals which — a perfect training score is a red flag, not a trophy.
2. You build a model to predict exam scores. It scores 100% on the training set but only 62% on a held-back test set. What has gone wrong?
- The test set is broken and should be thrown away
- The model is underfitting — too simple to capture the pattern
- The model is overfitting — it memorised the noise in the training points instead of the real pattern, so it fails on data it never saw
- Nothing is wrong; a 38-point gap is normal and healthy
Answer
The model is overfitting — it memorised the noise in the training points instead of the real pattern, so it fails on data it never saw — A huge gap between a high train score and a low test score is the classic overfitting alarm. The model bent itself through every training point, chasing chance wobbles that don't repeat in new data. The gap, not the training score, is what to watch.
3. Why do we hold back a test set that the model never trains on?
- To save computing power during training
- To have a batch of examples the model could not have memorised, giving an honest measure of how it does on data it never saw
- Because more data always makes training slower
- To give the model a second chance to lower its training error
Answer
To have a batch of examples the model could not have memorised, giving an honest measure of how it does on data it never saw — The test set is the only honest question — the model couldn't have studied for it. Doing well on data it never saw is generalisation, and generalisation, not a perfect training score, is the actual goal.
4. A model scores 68% on its training set and 66% on its test set — both low, and close together. Which failure is this, and why?
- Overfitting — the scores are too close to each other
- Neither failure; low scores that are close together always mean the model is fine
- Overfitting — any model scoring below 70% has memorised the data
- Underfitting — the model is too simple to capture even the basic pattern, so it does poorly on both the data it saw and the data it didn't
Answer
Underfitting — the model is too simple to capture even the basic pattern, so it does poorly on both the data it saw and the data it didn't — Underfitting is the opposite failure to overfitting: a model too rigid to learn the pattern at all. Both scores are low because it never captured the trend — there's no memorisation gap, just weakness everywhere. The sweet spot sits between underfitting and overfitting.