This is a story about something nobody thought was possible. For 70 years, computers couldn't understand a single word. Then, in the last decade, that changed. What you're holding is the story of how it happened. And what it means for you.
What's inside
Five chapters. Short ones. Each one builds on the last.
You can flip through in order. Or jump to whatever's pulling you in. About 30 minutes of reading.
With a few things to try along the way.
Before we start
Computers only understand one thing. Numbers. So how do you teach a computer what a word means?
It's been impossible. For 70 years. Until we figured something out.
Meaning doesn't live inside a word. Meaning lives in the relationships between words. That's the whole secret.
Every AI you've ever used is built on that one idea.
The idea that started it all
Back in 1957, long before any machine could act on it, a linguist named J.R. Firth said something.
You shall know a word by the company it keeps.
Think about that. You've never heard the word scalpel before. But you see it next to surgeons. Operating rooms. Incisions.
You know exactly what it means. You didn't need a dictionary. You learned from the company the word kept.
That's the entire idea modern AI is built on.
Chapter one. Words become numbers. Every word the AI knows is stored as a list of numbers. A coordinate. On a map. Words with similar meanings land close together. Here's how we got there.
The wall, 1960s–2012
Firth had the idea. But nobody could turn it into something a computer could actually use. The wall lasted half a century.
Here's the first thing they tried. Take a 50,000-word dictionary. Number every word.
Cat is 4,823. Feline is 12,047. Helicopter is some other number.
Done? No. Those are just three random numbers.
Cat and feline mean almost the same thing. But the numbers can't tell. The numbers don't mean anything.
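Here's that dead end in a few lines of Python. The cat and feline IDs are the ones above; the helicopter ID is invented for the sketch.

```python
ids = {"cat": 4823, "feline": 12047, "helicopter": 55}  # helicopter's ID is made up here

# The IDs are labels, not meanings. Nothing about them encodes similarity:
print(abs(ids["cat"] - ids["feline"]))      # 7224 -- a meaningless gap between near-synonyms
print(abs(ids["cat"] - ids["helicopter"]))  # 4768 -- "closer", yet the words are unrelated
```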
The wall, continued
Second attempt. Use Firth's idea directly. For every pair of words in a giant pile of text, count how often they appear near each other.
Doctor with hospital? Constantly. Doctor with kettle? Almost never. It worked. Sort of.
The problem was the size. A 100,000-word vocabulary needs a 100,000 by 100,000 table. 10 billion cells.
Almost all of them empty. Useful information was buried in oceans of nothing. The field needed something completely different.
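If you want to feel the problem, here's a toy version of the counting approach. Two sentences instead of a giant pile of text, but the same shape:

```python
from collections import Counter

sentences = [
    ["the", "doctor", "works", "at", "the", "hospital"],
    ["the", "kettle", "whistled", "on", "the", "stove"],
]
window = 4
counts = Counter()
for tokens in sentences:
    for i, word in enumerate(tokens):
        for neighbor in tokens[i + 1 : i + 1 + window]:
            counts[(word, neighbor)] += 1  # one cell of the giant table

vocab = {w for s in sentences for w in s}
print(f"{len(counts)} non-zero cells out of {len(vocab) ** 2} possible")
```

Even at this size, most cells stay empty. Scale the vocabulary to 100,000 words and the emptiness swallows everything.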
The hidden chapter, 2003–2011
Quietly. In academic papers nobody outside the field read. The cracks started years before the world was paying attention.
A researcher named Yoshua Bengio. He trained a tiny neural network to predict the next word in a sentence. While playing that game, the network quietly invented a list of numbers for every word.
Words with similar meanings ended up with similar numbers. That's the real birth of word embeddings.
Five years later, two more researchers, Collobert and Weston, proved you could train these vectors once and reuse them for dozens of different tasks. Pretrain. Fine-tune. The whole modern recipe. Already on the table.
What changed when the world finally noticed wasn't the idea. It was the scale. That's the page you're about to flip to.
The first breakthrough
2013. A small team at Google. Led by a researcher named Tomas Mikolov.
They took Bengio's approach. Stripped it down for speed. And ran it on billions of words.
The trick was almost embarrassingly simple. Teach a program a game. Show it a word. Ask it to guess the words around it. Have it play the game billions of times.
And here's the punchline. Throw away the game. Keep the addresses the program built while learning. Those are the embeddings.
They called it Word2Vec. And it changed everything.
Two clever tricks behind Word2Vec
Two clever tricks made this actually run.
Predicting the right word out of 50,000 was brutally slow. So they reframed the game. For each correct word, throw in a few random fakes. Just learn to tell them apart. Training got hundreds of times faster.
How many words around the target you look at decides what kind of map you build. Small window — you learn grammar. Large window — you learn topic. Same algorithm. Different lens. Different map.
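Put together, the whole recipe fits in a few lines. A minimal sketch using the gensim library, one popular implementation; the two-sentence corpus is a stand-in for billions of words.

```python
from gensim.models import Word2Vec

corpus = [  # stand-in corpus; real training uses billions of words
    ["she", "deposited", "the", "check", "at", "the", "bank"],
    ["the", "doctor", "works", "at", "the", "hospital"],
]
model = Word2Vec(
    corpus,
    vector_size=300,  # 300 numbers per word
    window=5,         # small window leans grammatical, large window leans topical
    sg=1,             # skip-gram: guess the words around the target
    negative=5,       # negative sampling: a few random fakes per real pair
    min_count=1,
)
print(model.wv["bank"][:5])  # the first few of bank's 300 numbers
```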
The Spotify analogy
Here's the easiest way to think about it. Spotify scores every track. Energy. Danceability. Tempo. Songs with similar profiles get recommended together.
Word2Vec does the same thing. Just with 300 scores instead of a handful. And the wild part — the model invented those 300 categories on its own.
Nobody told it what to look for. Two words with similar profiles? Similar meaning. Two words with completely different profiles? Different meaning. A nutrition label. For words.
The surprise
Once words live in numerical space, something pretty incredible happens. You can do math on them.
Take the word king. Subtract man. Add woman. You land almost exactly on queen.
Try Paris. Subtract France. Add Italy. You land on Rome.
Nobody programmed that. It just fell out of the training.
The model invented directions inside the space. A male-to-female direction. A country-to-capital direction.
And those directions actually mean something. That's the moment everyone realized. There's real structure in there.
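You can check this yourself. A short sketch using gensim's downloader and a published set of pretrained vectors; any pretrained set behaves similarly, though exact neighbors vary.

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # downloads a published vector set on first run

print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> usually 'queen'
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
# -> usually 'rome'
```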
The other side of the same trick
But the same trick that gives you queen also gives you something darker. Take doctor. Subtract man. Add woman. Early embeddings often landed on nurse.
Not because nurses are women. Because the training text paired doctor with male pronouns more often. And nurse with female ones.
The geometry just inherited the pattern. Including the ugly ones.
AI doesn't have opinions. It has a very precise reflection of whatever it was trained on.
Whatever bias is in the source becomes a measurable direction in the math.
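And you can measure it. A sketch that projects occupation words onto a crude he-minus-she direction; the word list is arbitrary, and this single direction is a simplification of how researchers actually quantify bias.

```python
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # same pretrained vectors as above
direction = wv["he"] - wv["she"]          # one crude gender direction

for word in ["doctor", "nurse", "engineer", "teacher"]:
    v = wv[word]
    lean = float(v @ direction / (np.linalg.norm(v) * np.linalg.norm(direction)))
    print(f"{word:10s} {lean:+.3f}")  # positive leans toward he, negative toward she
```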
Time to try it. Four words. Three live in the same neighborhood. One doesn't. Click the outlier. Then see the map the AI would actually draw. You'll feel exactly what the model is doing.
Chapter two. The race to BERT. Five years. Six papers. Each one solved exactly one thing the previous one couldn't. Not one genius. A relay race.
Five years, six papers
The starting line.
2017–2018, the leap
Then everything changed in eighteen months.
The puzzle behind it all
First, a puzzle. Read this sentence.
The trophy didn't fit in the suitcase because it was too big.
What does 'it' refer to? The trophy or the suitcase? You knew instantly.
Your brain didn't read every word equally. It pulled hard on fit. Trophy. Suitcase. It skipped the and because.
That's attention. Deciding which words matter more when you figure out what another word means.
Researchers had been bolting attention onto old models for years. The 2017 paper asked an obvious question. What if attention isn't an add-on? What if it's the whole engine?
The engine change
A Google team publishes a paper. With a cocky title.
Attention is all you need.
And it changes the entire field.
Before the Transformer, AI read like you read with one finger. Left to right. One word at a time. Trying to remember what came earlier. Slow.
The Transformer throws all of that out. Every word looks at every other word. All at once. In parallel. And it decides for itself which words matter.
Two wins. Speed, because every word is processed in parallel. And context, because no word is ever too far away to matter.
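The core of the engine fits in a dozen lines of numpy. A toy sketch of scaled dot-product attention; in a real Transformer, Q, K, and V are learned projections, and this runs across many heads and layers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every word scores every other word at once
    weights = softmax(scores)                # each row: how hard to pull on every word
    return weights @ V                       # a weighted mix of all the words' vectors

# Five words, eight numbers each. Real models learn the Q, K, V projections.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
print(attention(X, X, X).shape)  # (5, 8): one updated vector per word
```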
The moment
October 2018. A Google team — Devlin, Chang, Lee, Toutanova — publishes a paper. It becomes the most-cited NLP paper of the decade.
Here's what BERT did differently. Instead of predict the next word — which is what GPT was doing — BERT trained on fill in the blank.
Hide 15% of the words. Ask the model to guess them. That one change makes the model bidirectional.
To guess a blank, you need to see the words before AND after. Full context. Always.
On 11 benchmarks, BERT set new state-of-the-art the same day it released.
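You can play BERT's training game yourself. A minimal example with the Hugging Face transformers library; it downloads a pretrained BERT the first time it runs.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for guess in fill("She deposited the check at the [MASK]."):
    print(guess["token_str"], round(guess["score"], 3))  # BERT's top guesses, with probabilities
```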
Two things BERT changed forever
Two things BERT changed forever.
BERT-base was built as 12 layers, each with 12 attention heads and 64 numbers per head. Multiply heads by head size: 12 × 64 = 768. That's the size of every word's vector inside the model. BERT was so dominant that 768 became the default for the whole industry.
Pretraining BERT cost a fortune. But you only had to do it once. Anyone with a laptop and a few hundred examples could fine-tune it for a specific job in an hour.
Before BERT, training a serious language model needed Google-scale resources. After BERT, a small team could ship something competitive. That's the moment AI stopped being a giant-tech-company sport.
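Here's what that hour of fine-tuning looks like, compressed. A sketch with the transformers Trainer; the two-example dataset is a placeholder for your few hundred labeled examples.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # a fresh classification head on top of pretrained BERT

data = Dataset.from_dict({"text": ["loved it", "hated it"], "label": [1, 0]})
data = data.map(lambda row: tok(row["text"], truncation=True,
                                padding="max_length", max_length=32))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1),
    train_dataset=data,
).train()
```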
The through-line
Look at the through-line.
Watch the bank example.
She deposited the check at the bank. They sat on the muddy bank watching boats.
Word2Vec gives bank the same address both times. A blurry average.
BERT reads the surrounding words. Gives each one a different number. Same word. Different sentence. Different number.
The map became dynamic.
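You can watch the map move. A sketch that pulls bank's vector out of BERT for both sentences and compares them; bank happens to be a single token in this tokenizer.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # one 768-number vector per token
    position = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[position]

money = bank_vector("She deposited the check at the bank.")
river = bank_vector("They sat on the muddy bank watching boats.")
print(torch.cosine_similarity(money, river, dim=0))  # well below 1.0: same word, different vectors
```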
Six cards. Wrong order. Put them in sequence. You don't even need to know the dates. Each step is a direct answer to the flaw in the one before.
What you just noticed
You just derived the order without ever seeing a date. Each step in the race fixed exactly one thing the previous step couldn't do.
That's what progress actually looks like in a research field. Not a flash of genius from one person. A relay race. Where the baton is the unsolved problem.
2019–today
BERT didn't end the story. It opened a door. Once researchers saw what was possible, the question became — what if we go bigger? Change the goal? Feed it more?
Then OpenAI flips direction. Instead of fill-in-the-blank, they double down on predict-the-next-word. And scale it. Massively. GPT-3 has 175 billion internal numbers. Suddenly the model can write essays. Answer questions. Generate code. Hold a conversation. The chatbot era starts here.
2020–today
Almost every breakthrough since 2018 has followed the same recipe. Transformer architecture. Massive pretraining. Targeted fine-tuning. The architectural innovations slowed down. The scale, the data, the feedback loops became the difference.
Chapter three. Giving AI a library. Pretraining teaches a model a lot. But what about information that's new? Or specific to your company? Or too recent to be in the training data? There's a fix. Here's how it works.
The problem RAG solves
A pretrained model is frozen in time. Ask it about something that happened last week. Or your company's policies.
It will invent a confident-sounding answer. And it won't be true.
In 2020, a research team at Facebook AI (now Meta), led by Patrick Lewis, published the fix. Retrieval-augmented generation. RAG.
The idea in one sentence. Before the model answers, go look something up in a trusted source. Hand the model the relevant paragraph. Then ask the question.
The model isn't being asked to remember. It's being asked to read and summarize. From a known, reliable source.
Hallucination drops. Answers stay current. And the model can work with your private data — even though it never saw it during training.
How it actually works
How it actually works. Four steps. One: chop your trusted documents into chunks and turn each chunk into an embedding. Two: turn the incoming question into an embedding and find the closest chunks. Three: paste those chunks into the prompt. Four: have the model answer from what it just read.
The answer is grounded in real documents. Documents you can point to. And verify.
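Here are the four steps as a runnable toy. The embed function is a crude stand-in for a real embedding model, and the finished prompt would go to an LLM.

```python
import numpy as np

def embed(text, dim=64):
    # Crude stand-in for a real embedding model: hash words into a vector.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

chunks = [                                            # step 1: chunk and embed the documents
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Support is available by email around the clock.",
]
chunk_vecs = np.stack([embed(c) for c in chunks])

question = "How long do refunds take?"
scores = chunk_vecs @ embed(question)                 # step 2: find the closest chunks
best = chunks[int(scores.argmax())]                   # should be the refunds chunk

prompt = (f"Answer using only this source:\n{best}\n\n"  # step 3: paste it into the prompt
          f"Question: {question}")                       # step 4: the LLM answers from what it read
print(prompt)
```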
You are the retrieval step. Read the question. Click the chunks you think are relevant. Then hit Generate Answer. You'll see what the LLM says — with your picks. And without.
What you just learned
RAG is not magic. It's just look it up before you answer. Hallucinations drop dramatically.
But only if the right chunks are retrieved. Garbage chunks in. Garbage answer out.
Building the knowledge base matters. Tuning retrieval matters. As much as the model itself.
If your RAG system gives bad answers, the first place to look isn't the model. It's whether the right paragraph made it into the prompt.
Chapter four. What AI still can't do. BERT solved context. RAG solved knowledge. But there are still hard problems left. And every AI failure you've ever seen traces back to one of them.
Challenge #1 — The confident liar
Challenge one. The confident liar.
The model was trained to produce text that sounds like the text it learned from. Good text is confident. So the model produces confident text. Even when it's guessing.
It has no internal alarm for I don't know this.
A 2023 survey catalogued dozens of ways this goes wrong.
What helps? RAG. Training the model to express uncertainty. Tool use — letting it call out to a calculator or a database to verify facts instead of inventing them.
But hallucination isn't solved. It's the most visible open problem in the field.
Challenge #2 — The appearance of thinking
Challenge two. Reasoning. Try this on yourself first.
A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?
Most people instantly say 10 cents. The correct answer is 5 cents. 5¢ ball + $1.05 bat = $1.10.
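Write it as algebra and the trap disappears. Call the ball x. Then x + (x + 1.00) = 1.10, so 2x = 0.10 and x = 0.05. A two-line check:

```python
ball = 0.05
bat = ball + 1.00                     # exactly one dollar more than the ball
assert abs(ball + bat - 1.10) < 1e-9  # and the total is $1.10
```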
A lot of AI models fail this exact problem. The same way humans do. But for a different reason.
Humans grab the intuitive-but-wrong answer. Models fail because they're doing something that LOOKS like reasoning. But isn't quite.
They've read a lot of well-reasoned text. So they know what good reasoning looks like. But recognizing the SHAPE of correct reasoning is not the same as actually doing it.
They can write a beautiful argument. With perfect structure. And land on the wrong answer.
Challenge #3 — Memory
Challenge three. Memory.
Even inside a single conversation, the model has a limit. It's called the context window.
Think of it like a desk. You can spread only so many papers on it. Once it's full, something falls off the edge when you add something new.
Bigger desks have helped. But the desk is still finite.
A 2023 Stanford paper showed something strange. Information buried in the middle of a long context gets used far less reliably than information at the start or end.
True long-term memory — the kind where the model remembers last Tuesday's conversation and connects it to something six months ago — doesn't exist yet. Every conversation starts from zero.
Challenge #4 — The correlation trap
Challenge four. The correlation trap. AI models learned everything from text.
Text is full of things that appear together. But appearing together is not the same as causing each other. Wet streets and rain show up in the same paragraphs. Constantly. The model learns they're tightly linked. But which causes which?
Rain causes wet streets. Wet streets don't cause rain. If you want to predict whether it rained, wet streets are a perfectly fine clue.
But if you want to keep the streets dry? Mopping them won't work while the rain keeps falling. You'd have to stop the rain.
The direction of the arrow matters. AI is great at these things go together. Much weaker at this one causes the other.
When the decision turns on direction, don't trust the model alone.
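A tiny simulation makes the trap concrete. The probabilities here are made up; rain is the only common cause.

```python
import random

random.seed(0)
days = []
for _ in range(10_000):
    rain = random.random() < 0.3
    wet = rain or random.random() < 0.05  # rain causes wet streets; rarely, something else does
    days.append((rain, wet))

p_wet_given_rain = sum(w for r, w in days if r) / sum(r for r, _ in days)
p_rain_given_wet = sum(r for r, w in days if w) / sum(w for _, w in days)
print(f"P(wet | rain) = {p_wet_given_rain:.2f}")  # 1.00
print(f"P(rain | wet) = {p_rain_given_wet:.2f}")  # also high -- the counts look symmetric
# No amount of counting tells you that drying the street won't stop the rain.
```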
Challenge #5 — The overconfident student
Challenge five. The overconfident student.
Imagine a student. When they don't know an answer, they pick the most plausible-sounding one. And state it with full confidence.
When they do know, same exact confidence. You can't tell which is which.
That's an AI model out of the box.
The capital of France is Paris sounds the same whether the model knows it cold. Or is making it up.
This is different from hallucination. Hallucination is the model saying something wrong. Calibration is whether its tone matches its actual reliability.
Train your ear. Ask the model how confident it is. Ask for sources. Ask what it's NOT sure about.
A good assistant will tell you. A bad one keeps sounding sure. Never trust confidence as a proxy for correctness.
Challenge #6 — The absent body
Challenge six. The absent body.
You know what cold means. Because you've been cold. A language model has only read about cold. The word points to other words. Not to experience.
This shows up as weird errors. Models suggesting physically impossible things. Missing that an action would cause obvious harm. Writing grammatically perfect sentences about scenarios that make no sense.
Humans also know a massive body of facts nobody wrote down. Because they were too obvious to mention.
You can't fit a swimming pool in your pocket. A candle won't relight underwater.
That unwritten knowledge isn't in the training data. Pure language models will always be a few steps removed from physical reality.
Five AI-written paragraphs about a fictional firm. Two are fully accurate. Two contain fabricated facts. One has an outdated fact. Mark each one. Then hit Grade me. See which one tricked you.
What you just learned
Hallucinations are caught by process. Not by reading more carefully. You can be the smartest reader in the room.
You will still miss fabricated facts. Especially if the writing is fluent. Especially if the structure looks right.
Every AI-generated fact that matters needs a verification step. Not a vibe check. A look-it-up.
That's what RAG tries to automate. And what your team has to do by hand wherever that automation isn't in place.
Chapter five. How Claude Code thinks. We've covered how AI reads language in general. Now the specific one. The one you're using every day. How does it actually work, end to end, when you ask it to do something?
The short definition
Anthropic calls Claude Code an agentic coding tool that reads your codebase, edits files, runs commands, and integrates with your development tools.
The word that matters is agentic.
A regular chatbot answers your question. An agent takes your intent. Breaks it into steps. Executes them. Using tools. Reading files. Running commands. And loops until the goal is met.
You say what you want. In plain English. Claude Code searches your codebase. Plans an approach. Reads the relevant files. Writes or edits code. Runs tests. Reports back. If anything fails, it tries again.
How it knows what it knows
Three things make Claude Code different.
First, tools. It can read files. Edit. Search. Run shell commands. Work with git branches. Through the Model Context Protocol, it plugs into Google Drive. Jira. Slack. Anything you build.
Second, memory. A file called CLAUDE.md at the root of your project. It gets read at the start of every session. Use it for coding standards. Architectural decisions. Team conventions.
Third, sub-agents. For bigger tasks, Claude Code can spawn other agents. A lead agent assigns work to specialists. Collects their output. Synthesizes.
Claude Code isn't an LLM that writes code. It's a Claude model. Plus a toolbox. Plus a memory file. Plus a permission system. Running in a loop until the task is complete. The intelligence is in the orchestration.
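In spirit, the loop looks like this. A hypothetical sketch; the Action type, the tool names, and the scripted stand-in for the model are all inventions for illustration, not Anthropic's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)
    summary: str = ""

def agent_loop(goal, tools, pick_action, max_steps=20):
    history = [f"goal: {goal}"]
    for _ in range(max_steps):
        action = pick_action(history)               # the model decides the next step
        if action.name == "done":
            return action.summary                   # report back to the user
        result = tools[action.name](**action.args)  # search, read, edit, run tests...
        history.append(f"{action.name}: {result}")  # the result feeds the next decision
    return "stopped: step budget exhausted"

# A toy run: one tool, and a scripted stand-in for the model.
tools = {"run_tests": lambda: "2 passed, 0 failed"}
script = iter([Action("run_tests"), Action("done", summary="Tests pass.")])
print(agent_loop("fix the failing test", tools, lambda history: next(script)))
```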
You are Claude Code. A real task. At each step, four tools to choose from. Pick the one Claude Code would actually use. Wrong picks get a nudge. Right picks move you forward. You'll feel two things become obvious fast. Searching beats reading everything. Verification is non-negotiable.
What to walk out with
Six things to remember.
Meaning doesn't live in words. It lives in the relationships between words.
Every word is a list of hundreds of numbers — a coordinate in meaning-space.
BERT was the moment AI could read a word in the context of a whole sentence.
Pretrained models are frozen. RAG pipes in fresh, private, or specific information.
Don't trust tone. Confidence is not a proxy for correctness.
Claude Code is a model plus tools plus memory, running in a loop. The CLAUDE.md you write is its grounding.
One last thing
One frame for everything that's still hard.
The early problems of AI were about understanding language as language. Those got mostly solved.
The hard problems left over are all about the gap between language and reality.
BERT taught models to read. The next decade is teaching them to know.
That's a fundamentally harder problem. And it's why your judgment — what to trust, when to verify, where to push back — is the part of this work that doesn't get automated away.
Bibliography
Every claim in this book is sourced. Sources 1 through 6: Firth, Mikolov, Pennington, Vaswani, Peters, Radford. The papers that built modern AI for language. Open access. Linked. Verifiable.
Bibliography, continued
Sources 7 through 12: Devlin on BERT. Lewis on RAG. Huang on hallucination. Liu on long contexts. Plus Anthropic's Claude Code documentation.
Every footnote in the book points back here.
A note on the receipts
Every specific claim in this book was verified against its primary source. Paper titles. Authors. Years. Benchmark numbers. Training corpus sizes.
Quotes are direct from the papers cited. If something looks wrong, the burden is on this book. Not on you.