This is a story about something nobody thought was possible. For 70 years, computers couldn't understand a single word. Then, in the last decade, that changed. What you're holding is the story of how it happened. And what it means for you.
What's inside
Five chapters. Short ones. Each one builds on the last.
You can flip through in order. Or jump to whatever's pulling you in. About 30 minutes of reading.
With a few things to try along the way.
Before we start
Computers only understand one thing. Numbers. So how do you teach a computer what a word means?
It's been impossible. For 70 years. Until we figured something out.
Meaning doesn't live inside a word. Meaning lives in the relationships between words. That's the whole secret.
Every AI you've ever used is built on that one idea.
The idea that started it all
Back in 1957, long before any machine could act on it, a linguist named J.R. Firth said something.
You shall know a word by the company it keeps.
Think about that. You've never heard the word scalpel before. But you see it next to surgeons. Operating rooms. Incisions.
You know exactly what it means. You didn't need a dictionary. You learned from the company the word kept.
That's the entire idea modern AI is built on.
Chapter one. Words become numbers. Every word the AI knows is stored as a list of numbers. A coordinate. On a map. Words with similar meanings land close together. Here's how we got there.
The wall, 1960s–2012
Firth had the idea. But nobody could turn it into something a computer could actually use. The wall lasted half a century.
Here's the first thing they tried. Take a 50,000-word dictionary. Number every word.
Cat is 4,823. Feline is 12,047. Helicopter is some other number.
Done? No. Those are just three random numbers.
Cat and feline mean almost the same thing. But the numbers can't tell. The numbers don't mean anything.
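Here's that dead end in a few lines of Python. The cat and feline IDs are the ones above; the helicopter ID is invented for the sketch.

```python
ids = {"cat": 4823, "feline": 12047, "helicopter": 55}  # helicopter's ID is made up here

# The IDs are labels, not meanings. Nothing about them encodes similarity:
print(abs(ids["cat"] - ids["feline"]))      # 7224 -- a meaningless gap between near-synonyms
print(abs(ids["cat"] - ids["helicopter"]))  # 4768 -- "closer", yet the words are unrelated
```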
The wall, continued
Second attempt. Use Firth's idea directly. For every pair of words in a giant pile of text, count how often they appear near each other.
Doctor with hospital? Constantly. Doctor with kettle? Almost never. It worked. Sort of.
The problem was the size. A 100,000-word vocabulary needs a 100,000 by 100,000 table. 10 billion cells.
Almost all of them empty. Useful information was buried in oceans of nothing. The field needed something completely different.
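If you want to feel the problem, here's a toy version of the counting approach. Two sentences instead of a giant pile of text, but the same shape:

```python
from collections import Counter

sentences = [
    ["the", "doctor", "works", "at", "the", "hospital"],
    ["the", "kettle", "whistled", "on", "the", "stove"],
]
window = 4
counts = Counter()
for tokens in sentences:
    for i, word in enumerate(tokens):
        for neighbor in tokens[i + 1 : i + 1 + window]:
            counts[(word, neighbor)] += 1  # one cell of the giant table

vocab = {w for s in sentences for w in s}
print(f"{len(counts)} non-zero cells out of {len(vocab) ** 2} possible")
```

Even at this size, most cells stay empty. Scale the vocabulary to 100,000 words and the emptiness swallows everything.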
The hidden chapter, 2003–2011
Quietly. In academic papers nobody outside the field read. The cracks started years before the world was paying attention.
A researcher named Yoshua Bengio. He trained a tiny neural network to predict the next word in a sentence. While playing that game, the network quietly invented a list of numbers for every word.
Words with similar meanings ended up with similar numbers. That's the real birth of word embeddings.
Five years later, two more researchers, Collobert and Weston, proved you could train these vectors once and reuse them for dozens of different tasks. Pretrain. Fine-tune. The whole modern recipe. Already on the table.
What changed when the world finally noticed wasn't the idea. It was the scale. That's the page you're about to flip to.
The first breakthrough
2013. A small team at Google. Led by a researcher named Tomas Mikolov.
They took Bengio's approach. Stripped it down for speed. And ran it on billions of words.
The trick was almost embarrassingly simple. Teach a program a game. Show it a word. Ask it to guess the words around it. Have it play the game billions of times.
And here's the punchline. Throw away the game. Keep the addresses the program built while learning. Those are the embeddings.
They called it Word2Vec. And it changed everything.
Two clever tricks behind Word2Vec
Two clever tricks made this actually run.
Predicting the right word out of 50,000 was brutally slow. So they reframed the game. For each correct word, throw in a few random fakes. Just learn to tell them apart. Training got hundreds of times faster.
How many words around the target you look at decides what kind of map you build. Small window — you learn grammar. Large window — you learn topic. Same algorithm. Different lens. Different map.
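Put together, the whole recipe fits in a few lines. A minimal sketch using the gensim library, one popular implementation; the two-sentence corpus is a stand-in for billions of words.

```python
from gensim.models import Word2Vec

corpus = [  # stand-in corpus; real training uses billions of words
    ["she", "deposited", "the", "check", "at", "the", "bank"],
    ["the", "doctor", "works", "at", "the", "hospital"],
]
model = Word2Vec(
    corpus,
    vector_size=300,  # 300 numbers per word
    window=5,         # small window leans grammatical, large window leans topical
    sg=1,             # skip-gram: guess the words around the target
    negative=5,       # negative sampling: a few random fakes per real pair
    min_count=1,
)
print(model.wv["bank"][:5])  # the first few of bank's 300 numbers
```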
The Spotify analogy
Here's the easiest way to think about it. Spotify scores every track. Energy. Danceability. Tempo. Songs with similar profiles get recommended together.
Word2Vec does the same thing. Just with 300 scores instead of a handful. And the wild part — the model invented those 300 categories on its own.
Nobody told it what to look for. Two words with similar profiles? Similar meaning. Two words with completely different profiles? Different meaning. A nutrition label. For words.
The surprise
Once words live in numerical space, something pretty incredible happens. You can do math on them.
Take the word king. Subtract man. Add woman. You land almost exactly on queen.
Try Paris. Subtract France. Add Italy. You land on Rome.
Nobody programmed that. It just fell out of the training.
The model invented directions inside the space. A male-to-female direction. A country-to-capital direction.
And those directions actually mean something. That's the moment everyone realized. There's real structure in there.
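You can check this yourself. A short sketch using gensim's downloader and a published set of pretrained vectors; any pretrained set behaves similarly, though exact neighbors vary.

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # downloads a published vector set on first run

print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> usually 'queen'
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
# -> usually 'rome'
```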
The other side of the same trick
But the same trick that gives you queen also gives you something darker. Take doctor. Subtract man. Add woman. Early embeddings often landed on nurse.
Not because nurses are women. Because the training text paired doctor with male pronouns more often. And nurse with female ones.
The geometry just inherited the pattern. Including the ugly ones.
AI doesn't have opinions. It has a very precise reflection of whatever it was trained on.
Whatever bias is in the source becomes a measurable direction in the math.
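And you can measure it. A sketch that projects occupation words onto a crude he-minus-she direction; the word list is arbitrary, and this single direction is a simplification of how researchers actually quantify bias.

```python
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # same pretrained vectors as above
direction = wv["he"] - wv["she"]          # one crude gender direction

for word in ["doctor", "nurse", "engineer", "teacher"]:
    v = wv[word]
    lean = float(v @ direction / (np.linalg.norm(v) * np.linalg.norm(direction)))
    print(f"{word:10s} {lean:+.3f}")  # positive leans toward he, negative toward she
```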
Time to try it. Four words. Three live in the same neighborhood. One doesn't. Click the outlier. Then see the map the AI would actually draw. You'll feel exactly what the model is doing.
Chapter two. The race to BERT. Five years. Six papers. Each one solved exactly one thing the previous one couldn't. Not one genius. A relay race.
Five years, six papers
The starting line.
2017–2018, the leap
Then everything changed in eighteen months.
The puzzle behind it all
First, a puzzle. Read this sentence.
The trophy didn't fit in the suitcase because it was too big.
What does 'it' refer to? The trophy or the suitcase? You knew instantly.
Your brain didn't read every word equally. It pulled hard on fit. Trophy. Suitcase. It skipped the and because.
That's attention. Deciding which words matter more when you figure out what another word means.
Researchers had been bolting attention onto old models for years. The 2017 paper asked an obvious question. What if attention isn't an add-on? What if it's the whole engine?
The engine change
A Google team publishes a paper. With a cocky title.
Attention is all you need.
And it changes the entire field.
Before the Transformer, AI read like you read with one finger. Left to right. One word at a time. Trying to remember what came earlier. Slow.
The Transformer throws all of that out. Every word looks at every other word. All at once. In parallel. And it decides for itself which words matter.
Two wins. Speed, because every word is processed in parallel. And context, because no word is ever too far away to matter.
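The core of the engine fits in a dozen lines of numpy. A toy sketch of scaled dot-product attention; in a real Transformer, Q, K, and V are learned projections, and this runs across many heads and layers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every word scores every other word at once
    weights = softmax(scores)                # each row: how hard to pull on every word
    return weights @ V                       # a weighted mix of all the words' vectors

# Five words, eight numbers each. Real models learn the Q, K, V projections.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
print(attention(X, X, X).shape)  # (5, 8): one updated vector per word
```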
The moment
October 2018. A Google team — Devlin, Chang, Lee, Toutanova — publishes a paper. It becomes the most-cited NLP paper of the decade.
Here's what BERT did differently. Instead of predict the next word — which is what GPT was doing — BERT trained on fill in the blank.
Hide 15% of the words. Ask the model to guess them. That one change makes the model bidirectional.
To guess a blank, you need to see the words before AND after. Full context. Always.
On 11 benchmarks, BERT set new state-of-the-art the same day it released.
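You can play BERT's training game yourself. A minimal example with the Hugging Face transformers library; it downloads a pretrained BERT the first time it runs.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for guess in fill("She deposited the check at the [MASK]."):
    print(guess["token_str"], round(guess["score"], 3))  # BERT's top guesses, with probabilities
```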
Two things BERT changed forever
Two things BERT changed forever.
BERT-base was built as 12 layers, each with 12 attention heads and 64 numbers per head. Multiply heads by head size: 12 × 64 = 768. That's the size of every word's vector inside the model. BERT was so dominant that 768 became the default for the whole industry.
Pretraining BERT cost a fortune. But you only had to do it once. Anyone with a laptop and a few hundred examples could fine-tune it for a specific job in an hour.
Before BERT, training a serious language model needed Google-scale resources. After BERT, a small team could ship something competitive. That's the moment AI stopped being a giant-tech-company sport.
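Here's what that hour of fine-tuning looks like, compressed. A sketch with the transformers Trainer; the two-example dataset is a placeholder for your few hundred labeled examples.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # a fresh classification head on top of pretrained BERT

data = Dataset.from_dict({"text": ["loved it", "hated it"], "label": [1, 0]})
data = data.map(lambda row: tok(row["text"], truncation=True,
                                padding="max_length", max_length=32))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1),
    train_dataset=data,
).train()
```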
The through-line
Look at the through-line.
Watch the bank example.
She deposited the check at the bank. They sat on the muddy bank watching boats.
Word2Vec gives bank the same address both times. A blurry average.
BERT reads the surrounding words. Gives each one a different number. Same word. Different sentence. Different number.
The map became dynamic.
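You can watch the map move. A sketch that pulls bank's vector out of BERT for both sentences and compares them; bank happens to be a single token in this tokenizer.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # one 768-number vector per token
    position = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[position]

money = bank_vector("She deposited the check at the bank.")
river = bank_vector("They sat on the muddy bank watching boats.")
print(torch.cosine_similarity(money, river, dim=0))  # well below 1.0: same word, different vectors
```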
Six cards. Wrong order. Put them in sequence. You don't even need to know the dates. Each step is a direct answer to the flaw in the one before.
What you just noticed
You just derived the order without ever seeing a date. Each step in the race fixed exactly one thing the previous step couldn't do.
That's what progress actually looks like in a research field. Not a flash of genius from one person. A relay race. Where the baton is the unsolved problem.
2019–today
BERT didn't end the story. It opened a door. Once researchers saw what was possible, the question became — what if we go bigger? Change the goal? Feed it more?
Then OpenAI flips direction. Instead of fill-in-the-blank, they double down on predict-the-next-word. And scale it. Massively. GPT-3 has 175 billion internal numbers. Suddenly the model can write essays. Answer questions. Generate code. Hold a conversation. The chatbot era starts here.
2020–today
Almost every breakthrough since 2018 has followed the same recipe. Transformer architecture. Massive pretraining. Targeted fine-tuning. The architectural innovations slowed down. The scale, the data, the feedback loops became the difference.
Chapter three. Giving AI a library. Pretraining teaches a model a lot. But what about information that's new? Or specific to your company? Or too recent to be in the training data? There's a fix. Here's how it works.
The problem RAG solves
A pretrained model is frozen in time. Ask it about something that happened last week. Or your company's policies.
It will invent a confident-sounding answer. And it won't be true.
In 2020, a research team at Facebook AI (now Meta), led by Patrick Lewis, published the fix. Retrieval-augmented generation. RAG.
The idea in one sentence. Before the model answers, go look something up in a trusted source. Hand the model the relevant paragraph. Then ask the question.
The model isn't being asked to remember. It's being asked to read and summarize. From a known, reliable source.
Hallucination drops. Answers stay current. And the model can work with your private data — even though it never saw it during training.
How it actually works
How it actually works. Four steps. One: chop your trusted documents into chunks and turn each chunk into an embedding. Two: turn the incoming question into an embedding and find the closest chunks. Three: paste those chunks into the prompt. Four: have the model answer from what it just read.
The answer is grounded in real documents. Documents you can point to. And verify.
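Here are the four steps as a runnable toy. The embed function is a crude stand-in for a real embedding model, and the finished prompt would go to an LLM.

```python
import numpy as np

def embed(text, dim=64):
    # Crude stand-in for a real embedding model: hash words into a vector.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

chunks = [                                            # step 1: chunk and embed the documents
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Support is available by email around the clock.",
]
chunk_vecs = np.stack([embed(c) for c in chunks])

question = "How long do refunds take?"
scores = chunk_vecs @ embed(question)                 # step 2: find the closest chunks
best = chunks[int(scores.argmax())]                   # should be the refunds chunk

prompt = (f"Answer using only this source:\n{best}\n\n"  # step 3: paste it into the prompt
          f"Question: {question}")                       # step 4: the LLM answers from what it read
print(prompt)
```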
You are the retrieval step. Read the question. Click the chunks you think are relevant. Then hit Generate Answer. You'll see what the LLM says — with your picks. And without.
What you just learned
RAG is not magic. It's just look it up before you answer. Hallucinations drop dramatically.
But only if the right chunks are retrieved. Garbage chunks in. Garbage answer out.
Building the knowledge base matters. Tuning retrieval matters. As much as the model itself.
If your RAG system gives bad answers, the first place to look isn't the model. It's whether the right paragraph made it into the prompt.
Chapter four. What AI still can't do. BERT solved context. RAG solved knowledge. But there are still hard problems left. And every AI failure you've ever seen traces back to one of them.
Challenge #1 — The confident liar
Challenge one. The confident liar.
The model was trained to produce text that sounds like the text it learned from. Good text is confident. So the model produces confident text. Even when it's guessing.
It has no internal alarm for I don't know this.
A 2023 survey catalogued dozens of ways this goes wrong.
What helps? RAG. Training the model to express uncertainty. Tool use — letting it call out to a calculator or a database to verify facts instead of inventing them.
But hallucination isn't solved. It's the most visible open problem in the field.
Challenge #2 — The appearance of thinking
Challenge two. Reasoning. Try this on yourself first.
A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?
Most people instantly say 10 cents. The correct answer is 5 cents. 5¢ ball + $1.05 bat = $1.10.
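Write it as algebra and the trap disappears. Call the ball x. Then x + (x + 1.00) = 1.10, so 2x = 0.10 and x = 0.05. A two-line check:

```python
ball = 0.05
bat = ball + 1.00                     # exactly one dollar more than the ball
assert abs(ball + bat - 1.10) < 1e-9  # and the total is $1.10
```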
A lot of AI models fail this exact problem. The same way humans do. But for a different reason.
Humans grab the intuitive-but-wrong answer. Models fail because they're doing something that LOOKS like reasoning. But isn't quite.
They've read a lot of well-reasoned text. So they know what good reasoning looks like. But recognizing the SHAPE of correct reasoning is not the same as actually doing it.
They can write a beautiful argument. With perfect structure. And land on the wrong answer.
Challenge #3 — Memory
Challenge three. Memory.
Even inside a single conversation, the model has a limit. It's called the context window.
Think of it like a desk. You can spread only so many papers on it. Once it's full, something falls off the edge when you add something new.
Bigger desks have helped. But the desk is still finite.
A 2023 Stanford paper showed something strange. Information buried in the middle of a long context gets used far less reliably than information at the start or end.
True long-term memory — the kind where the model remembers last Tuesday's conversation and connects it to something six months ago — doesn't exist yet. Every conversation starts from zero.
Challenge #4 — The correlation trap
Challenge four. The correlation trap. AI models learned everything from text.
Text is full of things that appear together. But appearing together is not the same as causing each other. Wet streets and rain show up in the same paragraphs. Constantly. The model learns they're tightly linked. But which causes which?
Rain causes wet streets. Wet streets don't cause rain. If you want to predict whether it rained, wet streets are a perfectly fine clue.
But if you want to keep the streets dry? Mopping them won't work while the rain keeps falling. You'd have to stop the rain.
The direction of the arrow matters. AI is great at these things go together. Much weaker at this one causes the other.
When the decision turns on direction, don't trust the model alone.
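A tiny simulation makes the trap concrete. The probabilities here are made up; rain is the only common cause.

```python
import random

random.seed(0)
days = []
for _ in range(10_000):
    rain = random.random() < 0.3
    wet = rain or random.random() < 0.05  # rain causes wet streets; rarely, something else does
    days.append((rain, wet))

p_wet_given_rain = sum(w for r, w in days if r) / sum(r for r, _ in days)
p_rain_given_wet = sum(r for r, w in days if w) / sum(w for _, w in days)
print(f"P(wet | rain) = {p_wet_given_rain:.2f}")  # 1.00
print(f"P(rain | wet) = {p_rain_given_wet:.2f}")  # also high -- the counts look symmetric
# No amount of counting tells you that drying the street won't stop the rain.
```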
Challenge #5 — The overconfident student
Challenge five. The overconfident student.
Imagine a student. When they don't know an answer, they pick the most plausible-sounding one. And state it with full confidence.
When they do know, same exact confidence. You can't tell which is which.
That's an AI model out of the box.
The capital of France is Paris sounds the same whether the model knows it cold. Or is making it up.
This is different from hallucination. Hallucination is the model saying something wrong. Calibration is whether its tone matches its actual reliability.
Train your ear. Ask the model how confident it is. Ask for sources. Ask what it's NOT sure about.
A good assistant will tell you. A bad one keeps sounding sure. Never trust confidence as a proxy for correctness.
Challenge #6 — The absent body
Challenge six. The absent body.
You know what cold means. Because you've been cold. A language model has only read about cold. The word points to other words. Not to experience.
This shows up as weird errors. Models suggesting physically impossible things. Missing that an action would cause obvious harm. Writing grammatically perfect sentences about scenarios that make no sense.
Humans also know a massive body of facts nobody wrote down. Because they were too obvious to mention.
You can't fit a swimming pool in your pocket. A candle won't relight underwater.
That unwritten knowledge isn't in the training data. Pure language models will always be a few steps removed from physical reality.
Five AI-written paragraphs about a fictional firm. Two are fully accurate. Two contain fabricated facts. One has an outdated fact. Mark each one. Then hit Grade me. See which one tricked you.
What you just learned
Hallucinations are caught by process. Not by reading more carefully. You can be the smartest reader in the room.
You will still miss fabricated facts. Especially if the writing is fluent. Especially if the structure looks right.
Every AI-generated fact that matters needs a verification step. Not a vibe check. A look-it-up.
That's what RAG tries to automate. And what your team has to do by hand wherever that automation isn't in place.
Chapter five. How Claude Code thinks. We've covered how AI reads language in general. Now the specific one. The one you're using every day. How does it actually work, end to end, when you ask it to do something?
The short definition
Anthropic calls Claude Code an agentic coding tool that reads your codebase, edits files, runs commands, and integrates with your development tools.
The word that matters is agentic.
A regular chatbot answers your question. An agent takes your intent. Breaks it into steps. Executes them. Using tools. Reading files. Running commands. And loops until the goal is met.
You say what you want. In plain English. Claude Code searches your codebase. Plans an approach. Reads the relevant files. Writes or edits code. Runs tests. Reports back. If anything fails, it tries again.
How it knows what it knows
Three things make Claude Code different.
First, tools. It can read files. Edit. Search. Run shell commands. Work with git branches. Through the Model Context Protocol, it plugs into Google Drive. Jira. Slack. Anything you build.
Second, memory. A file called CLAUDE.md at the root of your project. It gets read at the start of every session. Use it for coding standards. Architectural decisions. Team conventions.
Third, sub-agents. For bigger tasks, Claude Code can spawn other agents. A lead agent assigns work to specialists. Collects their output. Synthesizes.
Claude Code isn't an LLM that writes code. It's a Claude model. Plus a toolbox. Plus a memory file. Plus a permission system. Running in a loop until the task is complete. The intelligence is in the orchestration.
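In spirit, the loop looks like this. A hypothetical sketch; the Action type, the tool names, and the scripted stand-in for the model are all inventions for illustration, not Anthropic's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)
    summary: str = ""

def agent_loop(goal, tools, pick_action, max_steps=20):
    history = [f"goal: {goal}"]
    for _ in range(max_steps):
        action = pick_action(history)               # the model decides the next step
        if action.name == "done":
            return action.summary                   # report back to the user
        result = tools[action.name](**action.args)  # search, read, edit, run tests...
        history.append(f"{action.name}: {result}")  # the result feeds the next decision
    return "stopped: step budget exhausted"

# A toy run: one tool, and a scripted stand-in for the model.
tools = {"run_tests": lambda: "2 passed, 0 failed"}
script = iter([Action("run_tests"), Action("done", summary="Tests pass.")])
print(agent_loop("fix the failing test", tools, lambda history: next(script)))
```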
You are Claude Code. A real task. At each step, four tools to choose from. Pick the one Claude Code would actually use. Wrong picks get a nudge. Right picks move you forward. You'll feel two things become obvious fast. Searching beats reading everything. Verification is non-negotiable.
What to walk out with
Six things to remember.
Meaning doesn't live in words. It lives in the relationships between words.
Every word is a list of hundreds of numbers — a coordinate in meaning-space.
BERT was the moment AI could read a word in the context of a whole sentence.
Pretrained models are frozen. RAG pipes in fresh, private, or specific information.
Don't trust tone. Confidence is not a proxy for correctness.
Claude Code is a model plus tools plus memory, running in a loop. The CLAUDE.md you write is its grounding.
One last thing
One frame for everything that's still hard.
The early problems of AI were about understanding language as language. Those got mostly solved.
The hard problems left over are all about the gap between language and reality.
BERT taught models to read. The next decade is teaching them to know.
That's a fundamentally harder problem. And it's why your judgment — what to trust, when to verify, where to push back — is the part of this work that doesn't get automated away.
Bibliography
Every claim in this book is sourced. Sources 1 through 6: Firth, Mikolov, Pennington, Vaswani, Peters, Radford. The papers that built modern AI for language. Open access. Linked. Verifiable.
Bibliography, continued
Sources 7 through 12: Devlin on BERT. Lewis on RAG. Huang on hallucination. Liu on long contexts. Plus Anthropic's Claude Code documentation.
Every footnote in the book points back here.
A note on the receipts
Every specific claim in this book was verified against its primary source. Paper titles. Authors. Years. Benchmark numbers. Training corpus sizes.
Quotes are direct from the papers cited. If something looks wrong, the burden is on this book. Not on you.