How LLMs Actually Work — For Non-ML People
Understand tokens, context windows, attention, and temperature without a single equation.
You use AI tools every day. You type a prompt, and code comes back. But what's actually happening between your prompt and the response? You don't need a PhD in machine learning to understand this — but having a working mental model of what's happening under the hood will make you dramatically better at using these tools.
Think of it this way: you don't need to understand combustion engineering to drive a car, but knowing that the engine needs fuel, air, and spark helps you understand why the car won't start on a cold morning. Same principle here.
What Is a Large Language Model?
A Large Language Model (LLM) is, at its core, a prediction engine. Given a sequence of text, it predicts the most likely next piece of text. That's it. Every impressive thing you've seen an LLM do — writing code, answering questions, translating languages — is a consequence of this one capability applied at massive scale.
The "Large" in LLM refers to the size of the model — billions of parameters trained on enormous amounts of text from the internet, books, code repositories, and other sources. The "Language" part means it operates on text. The "Model" part means it's a mathematical function that takes input and produces output.
When you type "Write a React component that displays a list of users," the model isn't "understanding" your request the way a human would. It's generating text that is statistically likely to follow your prompt, based on patterns learned from billions of examples of similar text. The result looks like understanding because the patterns are that good.
Tokens — The Atomic Unit
LLMs don't process text character by character or word by word. They process tokens — chunks of text averaging roughly four characters, or about three-quarters of an English word.
"Hello, world!"
→ Tokens: ["Hello", ",", " world", "!"]
→ 4 tokens
"function calculateTotal(items) {"
→ Tokens: ["function", " calculate", "Total", "(", "items", ")", " {"]
→ 7 tokens
Common words are usually one token. Longer or unusual words get split into multiple tokens. Code tends to use more tokens than plain English because of special characters and unusual variable names.
Why tokens matter to you:
- Pricing is based on tokens. When you see "$3 per million input tokens," that's the unit of cost.
- Context windows are measured in tokens. A 200K token context window means the model can process about 150,000 words at once.
- Speed is affected by token count. More tokens in = more processing time. More tokens out = longer generation time.
When your AI tool seems slow or expensive, tokens are usually the explanation.
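You can get a feel for token-based cost with a quick back-of-the-envelope calculation. The sketch below uses the common rule of thumb of roughly four characters per token — a rough approximation, not a real tokenizer — and example prices; actual counts come from the model's own tokenizer, and actual prices vary by provider and model.

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count using the ~4 characters per token rule of thumb."""
    return max(1, round(len(text) / 4))

def estimate_cost(input_tokens: int, output_tokens: int,
                  in_price_per_m: float = 3.0,
                  out_price_per_m: float = 15.0) -> float:
    """Dollar cost given per-million-token prices (illustrative figures only)."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

prompt = "Write a React component that displays a list of users"
print(estimate_tokens(prompt))                 # rough count, not exact
print(f"${estimate_cost(50_000, 5_000):.4f}")  # a 50K-in / 5K-out request
```

Notice that output tokens are priced higher than input tokens in this example — that asymmetry is typical, which is one reason verbose responses cost more than long prompts.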
Context Windows — The Model's Working Memory
The context window is the total amount of text the model can consider at once — your prompt, the system instructions, any files it's reading, and its own response, all combined.
Context window (e.g., 200K tokens):
┌──────────────────────────────────────┐
│ System prompt (instructions) ~500 │
│ CLAUDE.md content ~2,000 │
│ Files you asked it to read ~50,000 │
│ Your conversation history ~10,000 │
│ Current prompt ~200 │
│ Model's response ~5,000 │
│ [Remaining space] ~132,300 │
└──────────────────────────────────────┘
Why this matters:
Everything the model knows about your current task has to fit in this window. If your project has 500 files and the model can only fit 50 of them in context, it's working with 10% of the picture. Tools like Claude Code are smart about choosing which files to read, but the context window is always a constraint.
When the model seems to "forget" something you told it earlier in a long conversation, it's because that information has scrolled out of the context window — or the model is paying less attention to older context.
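The bookkeeping a coding tool has to do before each request can be sketched in a few lines. The component names and token counts below mirror the diagram above and are illustrative, not real measurements from any tool.

```python
# Context budget sketch: everything must fit inside the window.
CONTEXT_WINDOW = 200_000  # example size in tokens

components = {
    "system prompt": 500,
    "CLAUDE.md content": 2_000,
    "files read": 50_000,
    "conversation history": 10_000,
    "current prompt": 200,
    "reserved for response": 5_000,
}

used = sum(components.values())
remaining = CONTEXT_WINDOW - used
print(f"used {used:,} tokens, {remaining:,} remaining")
# If `remaining` goes negative, something has to be dropped --
# usually the oldest conversation turns or the least relevant files.
```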
Attention — How Models Focus
Inside the model, a mechanism called attention determines which parts of the input are most relevant to generating each part of the output.
When the model is writing the third line of a function, the attention mechanism is focusing heavily on:
- The function signature (what parameters are available)
- The first two lines (what's already been done)
- Your prompt (what you asked for)
- Similar patterns from training (how this kind of code usually looks)
And paying less attention to:
- An unrelated file you loaded earlier
- A conversation from 20 messages ago
- System instructions that don't apply to the current task
This is why specificity in prompts matters. The more relevant context you provide close to your request, the more attention the model pays to it. A prompt that says "modify the function on line 42 of auth.ts" gives the model a clear focus point. A prompt that says "fix the thing I mentioned earlier" forces the model to search through the entire context for the reference.
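The weighting idea can be illustrated with a toy example. Real models compute relevance scores from learned query and key vectors; the scores below are invented to mirror the lists above. What matters is the shape of the mechanism: a softmax turns raw scores into weights that sum to 1, so attention given to one chunk is attention taken from the others.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

chunks = ["function signature", "previous two lines", "your prompt",
          "unrelated file", "message from 20 turns ago"]
scores = [4.0, 3.5, 3.0, 0.5, 0.2]  # invented: higher = more relevant

for chunk, weight in zip(chunks, softmax(scores)):
    print(f"{chunk:28s} {weight:.0%}")
```

Run this and the unrelated file and stale message get only a few percent of the total weight — a crude but useful picture of why irrelevant context dilutes focus.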
Temperature — Creativity vs Consistency
Temperature is a setting that controls how predictable the model's output is. Think of it as a dial between "always give the safest answer" and "take creative risks."
Temperature 0.0: Always picks the most likely next token
→ "The function returns true" (predictable, repetitive)
Temperature 0.5: Usually picks likely tokens, sometimes surprises
→ "The function validates the input and returns true" (balanced)
Temperature 1.0: More willing to pick less likely tokens
→ "The function gracefully handles edge cases before returning" (creative)For code generation, lower temperatures produce more predictable, conventional code. Higher temperatures produce more varied solutions but also more mistakes.
Most AI coding tools set the temperature for you. Claude Code and similar tools typically use lower temperatures for code (where correctness matters) and slightly higher temperatures for explanations (where variety is welcome).
When this matters to you: If you're getting repetitive outputs, the temperature might be too low. If you're getting wildly inconsistent outputs, it might be too high. Most tools don't expose this setting directly, but understanding it helps you understand the model's behavior.
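Mechanically, temperature is simple: divide the model's raw scores (logits) by the temperature before converting them to probabilities. Low temperature sharpens the distribution toward the top token; high temperature flattens it. The tokens and logits in this sketch are invented for illustration.

```python
import math
import random

def sample(logits, temperature, rng):
    """Sample one token index, applying temperature to the logits."""
    if temperature == 0:  # greedy: always take the single most likely token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs)[0]

tokens = ["returns", "validates", "gracefully"]
logits = [2.0, 1.0, 0.2]  # invented scores favoring "returns"
rng = random.Random(0)

for t in (0.0, 0.5, 1.0):
    picks = [tokens[sample(logits, t, rng)] for _ in range(1000)]
    print(f"T={t}: picked 'returns' {picks.count('returns') / 10:.0f}% of the time")
```

At temperature 0 the top token wins every time; as the temperature rises, the runner-up tokens get picked more and more often — exactly the consistency-versus-variety trade-off described above.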
How Code Generation Actually Works
When you prompt an AI to write code, here's what happens:
- Tokenize your prompt, system instructions, and any context
- Process all tokens through the model's layers (attention, transformations)
- Generate the response one token at a time, each token influenced by everything before it
- Each token choice is based on probability — which token is most likely given all the context
This is why the model can write syntactically correct code — it's seen billions of examples and the patterns for "what comes after function in JavaScript" are deeply encoded. It's also why it can make logical mistakes — it's predicting likely text, not reasoning through the problem the way a human does.
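The loop above can be sketched end to end. Here `predict_next` is a tiny hand-written lookup table standing in for the real model — the table entries and tokens are invented so the loop actually runs — but the generation logic is the genuine shape: pick the most likely next token, append it, and feed the whole sequence back in.

```python
def predict_next(tokens):
    """Stand-in for the model: return (token, probability) candidates.
    The table below is invented for illustration."""
    table = {
        ("function",): [("greet", 0.9), ("main", 0.1)],
        ("function", "greet"): [("(", 1.0)],
        ("function", "greet", "("): [(")", 1.0)],
        ("function", "greet", "(", ")"): [("{", 1.0)],
    }
    return table.get(tuple(tokens), [("<end>", 1.0)])

def generate(prompt_tokens, max_tokens=10):
    """Autoregressive loop: each new token sees everything before it."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        candidates = predict_next(tokens)
        nxt = max(candidates, key=lambda c: c[1])[0]  # most likely token
        if nxt == "<end>":
            break
        tokens.append(nxt)  # output becomes input for the next step
    return tokens

print(" ".join(generate(["function"])))  # → function greet ( ) {
```

The key property to notice: the model never plans the whole function in advance. Each token is chosen given everything generated so far, which is why an early wrong token can derail everything that follows.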
The Practical Implications
Understanding these mechanics changes how you use AI tools:
Give relevant context. The model's output quality is directly proportional to the quality of context in its window. More relevant context = better output.
Be specific. Specific prompts activate specific attention patterns. Vague prompts activate generic patterns.
Shorter conversations can be better. As conversations get long, older context gets less attention. For complex tasks, starting a fresh conversation with well-organized context can outperform continuing a long one.
Token costs add up. Every file the model reads costs tokens. Be intentional about what you ask it to read.
Try this now
- Compare a vague prompt and a specific prompt on the same coding task and note the quality difference.
- Trim one bloated prompt down to only the context that actually matters.
- Start a fresh conversation for one complex task and compare the result to continuing a long thread.
Prompt to give your agent
"Explain this model behavior in practical coding terms:
- why the output changed between runs
- why context seems to fade in a long conversation
- how I should structure prompts for better focus
Use tokens, context windows, attention, and temperature in the explanation, but keep it tied to real development workflows."
What you must review yourself
- Whether you are giving the model relevant context instead of maximum context
- Whether long chats should be reset and reorganized rather than endlessly extended
- Whether prompt quality, not model quality, is the first thing to inspect when outputs drift
- Whether token usage is worth the cost and latency for the task
Common Mistakes to Avoid
- Assuming the model "understands." Useful output does not mean human-style reasoning.
- Ignoring context window limits. More context can reduce focus if it is not relevant.
- Treating token count as irrelevant. Tokens cost money and slow interactions.
- Blaming the model when the prompt is the problem. Prompt quality often explains output quality.
- Thinking more output is automatically better. Constraint often improves usefulness.
Key takeaways
- LLMs are token predictors, not tiny humans in a box
- Attention and context windows explain why prompt structure matters
- Temperature affects consistency and creativity even when you cannot tune it directly
- Better mental models lead to better prompt discipline and lower waste
What's Next
Now that you know how the models work, let's compare the companies building them. OpenAI, Anthropic, and open-source projects each take different approaches — with different strengths, weaknesses, and pricing that affect which one you should use for different tasks.