
How Does an LLM Know It's Done?

Posted on: May 6, 2026

It’s a bit of a trick question: an LLM doesn’t actually “know” it’s finished in the way a human feels a sense of closure. There’s no internal sigh of relief, no mental checkmark. Instead, it’s a purely mathematical process where the model predicts that the most likely next bit of data is a special, invisible command to shut up.

Here’s the breakdown.

The “Stop” Sign: The EOS Token

The primary way an LLM finishes a task is by generating a specific piece of data called an End of Sequence (EOS) token.

Think of it like the “The End” slide at the end of a movie. It isn’t a word like “apple” or “blue” — it’s a special marker, often represented in code as <|endoftext|> or </s>. It lives in the model’s vocabulary alongside real words, but it has no visual representation.
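You can poke at this directly. With the Hugging Face transformers library, for example, GPT-2's stop marker is just another vocabulary entry with its own ID:

```python
from transformers import AutoTokenizer

# GPT-2's vocabulary includes the EOS marker as an ordinary entry.
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.eos_token)     # <|endoftext|>
print(tok.eos_token_id)  # 50256 -- just another row in the vocabulary
```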

When the model is generating text, it looks at everything it has written so far and calculates the probability of every possible next token in its vocabulary. Word after word, it keeps picking the most likely continuation. Eventually, the probability of the EOS token climbs above the probability of any actual word. Once the model picks that stop token, the software running the LLM sees it, stops requesting more tokens, and returns the response to you.
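As a toy illustration (the vocabulary and numbers here are invented, not from a real model), the stop marker competes in the exact same probability distribution as ordinary words:

```python
import torch

# Tiny made-up vocabulary: EOS competes with real words for "most likely next token".
vocab = ["apple", "blue", "the", "<|endoftext|>"]
logits = torch.tensor([1.2, 0.4, 0.9, 3.1])  # invented raw scores from the model

probs = torch.softmax(logits, dim=-1)
for token, p in zip(vocab, probs.tolist()):
    print(f"{token:>15}: {p:.2f}")
# Here "<|endoftext|>" wins (~0.75), so generation would stop.
```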

```mermaid
flowchart LR
    A[User Prompt] --> B[Generate Next Token]
    B --> C{"Is it EOS?"}
    C -->|Yes| D["Stop & Return Response"]
    C -->|No| B
```

That’s it. No consciousness. Just probabilities. The loop above runs for every single token — sometimes hundreds of times per response — until the EOS check finally flips to “Yes.”
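Here is that loop as a minimal greedy-decoding sketch against the Hugging Face transformers API (real inference engines add sampling, batching, and KV caching; the 256-step cap is our own safety net, not part of the model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok.encode("The quick brown fox", return_tensors="pt")
for _ in range(256):                               # cap mirrors the max-token guardrail below
    logits = model(ids).logits[:, -1, :]           # a score for every token in the vocabulary
    next_id = logits.argmax(dim=-1, keepdim=True)  # greedy: pick the most likely token
    if next_id.item() == tok.eos_token_id:         # the "Yes" branch in the chart above
        break
    ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```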

How It Learns When to Stop

A raw base model — one that hasn’t been fine-tuned — is actually quite bad at stopping. If you asked it a question, it might answer, then start rambling, then write its own follow-up questions, then drift into a recipe for lasagna. It has no concept of “the task is complete.”

We teach it through two main methods, applied one after the other:

flowchart LR A["Raw Base Model
bad at stopping"] --> B["Supervised Fine-Tuning
(SFT)
learn pattern"] B --> C["Reinforcement Learning
(RLHF)
humans rank"] C --> D["Trained Model
knows when to stop"]

1. Supervised Fine-Tuning (SFT)

During fine-tuning, engineers feed the model thousands of exemplary question-and-answer pairs. Every single one of those answers ends with an EOS token.

The model starts to notice a pattern: “After I have provided a logical conclusion to the user’s request, the very next thing should be the stop sign.” Over many iterations, the probability of outputting EOS after a complete response gets dialled up. It becomes a learned behaviour, not an explicit rule.
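As a sketch of what that looks like in data preparation (the chat template and helper below are hypothetical; real pipelines use model-specific formats), the stop token is simply appended to every target answer:

```python
def build_sft_example(tokenizer, question, answer):
    # Hypothetical chat template; real datasets use model-specific formats.
    text = f"User: {question}\nAssistant: {answer}"
    input_ids = tokenizer.encode(text) + [tokenizer.eos_token_id]
    # Standard causal-LM setup: the model learns to predict every token,
    # including the final EOS that marks "the answer is complete".
    return {"input_ids": input_ids, "labels": list(input_ids)}
```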

2. Reinforcement Learning from Human Feedback (RLHF)

This is where humans rank the model’s outputs. If a model stops too early — leaving a sentence hanging — or rambles on for three extra paragraphs of nonsense, humans give it a low score.

Through thousands of these comparisons, the model “learns” that high-scoring responses are the ones that provide the necessary information and then immediately trigger the EOS token. No fluff. No cliffhangers. Just the right amount.
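Concretely, the reward model behind this ranking is trained on comparison pairs like the following (a hypothetical example; field names vary by framework), where the cleanly terminated answer is the winner:

```python
# One hypothetical preference pair for reward-model training.
preference_pair = {
    "prompt": "Summarise the meeting in one sentence.",
    # Complete answer that stops cleanly -- ranked higher by labellers:
    "chosen": "The team agreed to ship the beta on Friday.<|endoftext|>",
    # Rambling answer that never wraps up -- ranked lower:
    "rejected": "The team agreed to ship the beta on Friday. Also, speaking "
                "of Fridays, here is my favourite lasagna recipe...",
}
```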

Safety Nets: Why It Sometimes Cuts Off

Sometimes the model gets stuck. Maybe it falls into a loop, repeating the same phrase. Maybe the task is genuinely too long. In those cases, the EOS token never surfaces, and the system needs a hard stop.

```mermaid
flowchart TD
    A[Generate Token] --> B{"EOS found?"}
    B -->|Yes| C[Graceful Stop]
    B -->|No| D{"Hit max token limit?"}
    D -->|Yes| E[Hard Cutoff]
    D -->|No| F{"Matches stop sequence?"}
    F -->|Yes| G[Immediate Stop]
    F -->|No| A
```

Two common guardrails at work (both show up as request parameters in the sketch after this list):

  • Max Tokens — a pre-set limit like 4,096 tokens. If the model hits this ceiling, the system kills the generation regardless of whether the response was “done.” This is why longer outputs sometimes get cut off mid-sentence.

  • Stop Sequences — developers can configure the system to halt if certain strings appear. For example: “If you ever type the word User:, stop immediately.” This prevents the AI from hallucinating a conversation with itself, where it starts playing both sides of the dialogue.
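Both guardrails are exposed as ordinary request parameters in most inference APIs. For example, with the OpenAI Python SDK (the model name is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Explain EOS tokens briefly."}],
    max_tokens=200,       # hard ceiling: generation is cut off at this point
    stop=["User:"],       # halt immediately if this string is generated
)

print(response.choices[0].message.content)
print(response.choices[0].finish_reason)  # "stop" (graceful) vs. "length" (cut off)
```

A `finish_reason` of `"length"` is exactly the mid-sentence cutoff described above; `"stop"` means the model reached EOS or a stop sequence on its own.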

So What’s Actually Happening?

Every time you see an LLM finish a response and stop, here’s what just happened under the hood:

  1. The model generated text one token at a time, each time recalculating probabilities based on everything it had written so far.
  2. After the last real word, the probability distribution shifted: the EOS token became the most likely next item.
  3. The inference engine saw EOS, closed the stream, and returned the response to you.

No decision. No awareness. Just a probability curve that finally tilted toward silence.