Foundations

How LLMs Actually Work

No math. No PhD. Just interactive demos that show you what's happening under the hood when you talk to AI.

1. What's a Token?

AI doesn't read words — it reads tokens. A token is usually a piece of a word, a whole word, or punctuation. Type anything below and watch it get split into tokens in real-time.

Tokens: 0 Characters: 0 Ratio: 0 chars/token Est. cost: $0.000

Tokens will appear here...

2. Context Window = Short-Term Memory

An LLM can only "remember" a fixed amount of text at once — its context window. Everything you send (your prompt, the conversation history, system instructions) has to fit inside it. When it fills up, older stuff gets forgotten.

Here's what that looks like in practice. When you use an AI coding tool like Claude Code, the context window isn't just holding your message — it's packed with system prompts, memory files, tool definitions, and the entire conversation history. Your actual working space is whatever's left over.

The Context Window Problem — how system prompt, instructions, memory, tools, and conversation history fill a 200K token window

0 tokens used Claude 3.5 — 200K tokens

Claude (Anthropic)

200,000 tokens — fits ~150,000 words (~300 pages)

GPT-4o (OpenAI)

128,000 tokens — fits ~96,000 words (~190 pages)

Gemini 1.5 (Google)

1,000,000 tokens — fits ~750,000 words (~1,500 pages)

GPT-3 (2020)

4,096 tokens — fits ~3,000 words (~6 pages)

So what happens when you hit the limit? The AI doesn't just stop — it compresses. It summarizes the conversation to free up space, but that means details get lost: exact error messages, intermediate steps, nuanced reasoning. The bigger your context window, the longer you can go before this kicks in.

Context Compression — what happens when the context window fills up: auto-compact triggers, LLM summarizes, and some detail is lost

Click a document type above to see how much of the context window it fills. Then switch models to compare.

3. Attention: How AI Connects the Dots

When generating each word, the model doesn't treat all input equally. It pays attention to the most relevant parts. Click any word below to see what the model would focus on when generating from that position.

Click a word to see which other words the model pays attention to. Brighter = stronger attention.

4. What Happens When You Hit Send

From the moment you press Enter to the moment you see a response — here's every step, in order.

Your text gets tokenized

Your message is split into tokens — pieces of words the model can process. "I love pizza" becomes something like ["I", " love", " pizza"]. This is the same process you saw in the tokenizer above.

Tokens become numbers (embeddings)

Each token gets converted into a long list of numbers (a vector) that captures its meaning. "King" and "queen" end up as nearby coordinates in this number space. It's how the model understands that words relate to each other.

Attention layers process context

The model runs your tokens through dozens of attention layers. Each layer figures out how tokens relate to each other — which words modify which, what "it" refers to, how the sentence structure works. This is the expensive part.

Model predicts the next token

After processing, the model outputs a probability for every possible next token. It might say: "the" (32%), "a" (18%), "my" (12%), "their" (8%)... It picks one (influenced by temperature) and that's the first token of the response.

Repeat — one token at a time

The predicted token gets added to the sequence, and the whole process runs again to predict the next token. This is why you see AI "typing" word by word — it's literally generating one token at a time. A 500-word response = ~375 prediction cycles.

Tokens decode back into text

The generated token IDs get converted back into readable text and streamed to your screen. The model has no memory of this conversation unless you send the whole history again next time — it's stateless.

5. Temperature = Creativity Dial

Every AI model has a temperature setting (0.0 to 1.0) that controls randomness. At low temperature, the model always picks the most likely next word — safe, consistent, boring. At high temperature, it takes bigger creative risks — sometimes brilliant, sometimes nonsense. Move the slider and watch the same prompt produce wildly different outputs.

Prompt
"Write a one-sentence tagline for a new productivity app."

Focused Creative 0.5

0.0 — Always picks safest word 1.0 — Rolls the dice

Run 1

Run 2

Run 3

Key Takeaways

LLMs are autocomplete on steroids

They predict the next token based on everything before it. That's the entire trick — it just works shockingly well at scale.

Context windows are short-term memory

The model forgets everything between conversations. If you want it to "remember," you have to send the context every time (or use RAG — that's the next lesson).

Structure your prompts because attention is literal

The model physically "attends" to different parts of your input. Clear structure (headers, XML tags, examples) gives it better anchors to focus on.

Hallucinations are confidence without knowledge

The model always picks the most probable next token. It has no way to say "I don't know" — it will always generate something, even if it's wrong.