Karpathy's MicroGPT: A 200-Line GPT That Actually Works
Andrej Karpathy just published microgpt.py — a single 200-line Python file with zero external dependencies that implements a complete GPT from scratch. No PyTorch, no Hugging Face, no CUDA. Just raw Python.
The Hacker News thread hit 960+ points inside a few hours. The comments are a goldmine: developers who've been using LLMs for two years are finally seeing what's actually happening inside the black box.
This post walks through what Karpathy built, why it matters for working developers, and what it reveals about when you should roll your own versus when you should reach for an API.
What MicroGPT Actually Is
This isn't a toy demo. MicroGPT contains the full algorithmic content of a GPT in one file:
- Dataset loading — reads a corpus of documents (32K names by default)
- Character-level tokenizer — maps unique characters to integer token IDs
- Autograd engine from scratch — computes gradients without PyTorch via a Value class
- GPT-2-style architecture — multi-head self-attention, feed-forward layers, positional embeddings
- Adam optimizer — hand-implemented, no library dependency
- Training loop + inference loop — everything needed to train and sample
The whole thing fits in 200 lines. It generates plausible new names (kamon, vialan, keylen, alerin) by learning statistical patterns from the training set. It's the conceptual heart of GPT-4, just without the trillion parameters and RLHF.
Karpathy describes it as "the culmination of a decade-long obsession to simplify LLMs to their bare essentials" — following his earlier micrograd, makemore, and nanoGPT projects.
The Key Components, Explained
1. Dataset and Tokenizer
MicroGPT starts with a corpus of documents — in this case, 32,000 names. Each name is a document. The tokenizer is the simplest possible: every unique character becomes a token ID.
uchars = sorted(set(''.join(docs))) # unique chars become token IDs 0..n-1
BOS = len(uchars) # special Beginning of Sequence token
vocab_size = len(uchars) + 1 # 26 letters + 1 BOS = 27 tokens
Production tokenizers (like OpenAI's tiktoken) operate on sub-word chunks for efficiency — "token" becomes ["tok", "en"] rather than individual characters. But character-level tokenization is sufficient for understanding the algorithm.
Each training document gets wrapped in BOS tokens: [BOS, e, m, m, a, BOS]. The model learns that BOS marks document boundaries.
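A minimal sketch of this tokenizer on a toy corpus makes the mechanics concrete. The helper names here (stoi, itos, encode, decode) are illustrative, not necessarily the names microgpt uses:

```python
# Toy corpus: each name is one document
docs = ["emma", "olivia", "ava"]

# Build the vocabulary: every unique character becomes a token ID
uchars = sorted(set(''.join(docs)))
stoi = {ch: i for i, ch in enumerate(uchars)}   # char -> token ID
itos = {i: ch for ch, i in stoi.items()}        # token ID -> char
BOS = len(uchars)                               # special Beginning-of-Sequence token

def encode(doc):
    """Wrap a document in BOS tokens: "emma" -> [BOS, e, m, m, a, BOS]."""
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    """Drop BOS markers and map token IDs back to characters."""
    return ''.join(itos[t] for t in tokens if t != BOS)

tokens = encode("emma")
print(decode(tokens))  # round-trips back to "emma"
```

Round-tripping a name through encode and decode is a quick sanity check that the vocabulary and the BOS handling agree.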
2. Autograd: Backprop Without PyTorch
This is where MicroGPT gets impressive. The Value class implements automatic differentiation from scratch — the same computation graph traversal that PyTorch does, just without GPU optimization or memory efficiency:
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
Every operation records its children. When you call backward() on the final loss, the gradients flow back through the computation graph — this is what backpropagation actually is at the code level.
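To see gradients actually flow, here is a self-contained micrograd-style sketch: the same Value class extended with __mul__ and a backward() driver that topologically sorts the graph and applies the chain rule in reverse. This is a simplification for illustration, not microgpt's exact code:

```python
class Value:
    """Minimal autograd node (sketch in the style of micrograd)."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # d(a+b)/da = 1
            other.grad += out.grad      # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a       # loss = 2*3 + 2 = 8.0
loss.backward()
print(a.grad)          # d(loss)/da = b + 1 = 4.0
print(b.grad)          # d(loss)/db = a = 2.0
```

Twenty-odd lines reproduce the essence of loss.backward() in PyTorch: record the graph on the forward pass, walk it once in reverse on the backward pass.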
3. The Transformer Architecture
MicroGPT implements the core transformer blocks: token embeddings, positional encodings, multi-head self-attention, residual connections, and layer normalization. The attention mechanism is the key insight:
# Attention: query, key, value projections
# Q @ K^T scaled by sqrt(d_k) -> softmax -> @ V
# This is the "pay attention to relevant tokens" operation
Each attention head learns to look at different aspects of the context. Multiple heads in parallel allow the model to attend to different types of relationships simultaneously — syntactic patterns, semantic similarity, positional proximity.
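The pseudocode comments above translate directly into a dependency-free sketch of a single scaled dot-product attention head over plain Python lists. For brevity this omits the causal mask that a GPT applies, and the helper names (softmax, attention) are mine:

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating, for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors (sketch)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Q @ K^T scaled by sqrt(d_k): how relevant is each position to this query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of value vectors: softmax(scores) @ V
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example: 2 positions, 2-dimensional vectors
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Each output row is a convex combination of the value vectors, weighted by how well the query matches each key — that's the whole "pay attention to relevant tokens" operation.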
4. The Training Loop
The training loop is conceptually three steps: forward pass (compute predictions), backward pass (compute gradients), optimizer step (nudge parameters in the direction that reduces loss):
for step in range(max_steps):
    # sample a random batch of training sequences
    x, y = get_batch(train_data, batch_size, context_length)
    # forward pass: compute logits and loss
    logits, loss = model(x, y)
    # backward pass + parameter update (zero grads first so they don't accumulate)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This is the same loop that trained GPT-4 — just with vastly more data, parameters, and compute.
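The hand-implemented Adam optimizer mentioned earlier is the textbook algorithm; here is a sketch of a single Adam update over flat lists of scalar parameters, with standard hyperparameters. The function name and signature are illustrative, not microgpt's exact code:

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (textbook algorithm, sketched over scalar parameters)."""
    for i in range(len(params)):
        m[i] = beta1 * m[i] + (1 - beta1) * grads[i]       # first moment (momentum)
        v[i] = beta2 * v[i] + (1 - beta2) * grads[i] ** 2  # second moment (RMS)
        m_hat = m[i] / (1 - beta1 ** t)                    # bias correction
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params, m, v

# Sanity check: minimize f(p) = p^2, whose gradient is 2p
params, m, v = [5.0], [0.0], [0.0]
for t in range(1, 2001):
    grads = [2 * params[0]]
    params, m, v = adam_step(params, grads, m, v, t, lr=0.05)
# params[0] should now sit near the minimum at 0
```

The bias-correction terms are why Adam takes sensibly sized steps even in the first few iterations, when the moment estimates are still warming up.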
What This Teaches You (Even If You Never Run It)
Reading MicroGPT clarifies several things that are commonly misunderstood:
Models don't "understand" anything — they learn statistical patterns that predict which token comes next. Your conversation with ChatGPT is, from the model's perspective, just a document to complete. The "intelligence" emerges from the scale of patterns learned across billions of documents.
Context length is a hard constraint — The model only sees the last context_length tokens. There's no persistent memory, no retrieval, no understanding of what came before the context window. Everything you've read about RAG and long-context models is engineering around this fundamental limitation.
Temperature controls creativity vs. accuracy — During inference, the logits (raw scores) get divided by temperature before softmax. Lower temperature (0.1) sharpens the distribution toward the most likely token; higher temperature (1.5) flattens it, so sampling explores less likely tokens. This is why temperature=0 gives you consistent answers while high temperatures produce more varied — and sometimes incoherent — output.
The parameters ARE the model — There's no separate "knowledge database." Everything the model knows is encoded in the values of the weight matrices. This is why fine-tuning works: you're updating those values to encode new patterns.
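The temperature mechanic described above fits in a few lines. This is a sketch of the standard logits-scaling trick, not microgpt's exact sampling code; the function name is mine:

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Scale logits by temperature, softmax, then sample a token ID (sketch)."""
    if temperature == 0:
        # Greedy decoding: always pick the highest-scoring token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # stabilize the softmax
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
print(sample_token(logits, temperature=0))   # greedy: always token 0
# Low temperature sharpens the distribution toward token 0;
# high temperature flattens it toward uniform.
```

Dividing by a small temperature exaggerates the gaps between logits before the softmax; dividing by a large one shrinks them — that single division is the entire "creativity knob."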
The Gap Between MicroGPT and Production LLMs
MicroGPT demonstrates the algorithm. It doesn't demonstrate what makes GPT-4 or Claude actually useful in production. That gap is enormous:
| Dimension | MicroGPT | Production LLM |
|---|---|---|
| Parameters | ~50K | 7B–1.8T |
| Training data | 32K names | Trillions of tokens |
| Training compute | CPU, minutes | Thousands of H100s, months |
| Alignment | None | RLHF, Constitutional AI, etc. |
| Inference speed | Slow (pure Python) | Optimized CUDA kernels, quantization |
| Context length | 32 tokens | 128K–1M tokens |
| Deployment cost | Runs on laptop | Hundreds of GPU-hours per day |
The algorithm is the same. The scale is not. This is the central insight: understanding MicroGPT teaches you how LLMs work, but it doesn't get you closer to deploying one in production.
When to Run Your Own vs. Use an API
After reading Karpathy's post, some developers will want to train their own GPT. Most of the time, that's the wrong move. Here's the practical decision framework:
Train your own when:
- You need domain-specific capabilities that general models lack (protein folding, chip design, specialized code)
- You have proprietary data you can't send to an external API
- You need 100% control over the model's behavior and can't use fine-tuning via API
- You're doing academic research on model architecture
Use an API when:
- You're building an application (95%+ of developers)
- You want state-of-the-art performance without infrastructure overhead
- You need to ship fast — training even a small model takes weeks of iteration
- Your use case doesn't require a model trained from scratch
For most production workloads — text generation, image creation, audio synthesis, video generation — the right move is an API. You get immediate access to frontier models, pay only for what you use, and don't manage GPU infrastructure.
Running Frontier Models via ModelsLab API
If MicroGPT taught you how the algorithm works and now you want to put it to use, ModelsLab's API gives you access to 200+ AI models — LLMs, image generation, video, audio — behind a unified API.
Here's how simple it is to call a frontier LLM via the API:
import requests

API_KEY = "your-modelslab-api-key"

response = requests.post(
    "https://modelslab.com/api/v6/llm/chat",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "qwen3.5-72b",  # or claude-3-5-sonnet, gpt-4o, llama-3-405b, etc.
        "messages": [
            {"role": "user", "content": "Explain autograd in 3 sentences"}
        ],
        "temperature": 0.7,
        "max_tokens": 512
    }
)
print(response.json()["choices"][0]["message"]["content"])
The same API surface works for image generation (Flux.1, SDXL), video generation (Kling, Seedance), and audio (Whisper, TTS). You don't manage the GPU cluster, model weights, or serving infrastructure — ModelsLab handles all of that.
The Right Mental Model
MicroGPT is one of the best things Karpathy has ever published, and he's published a lot. Run it, read it, understand the autograd engine and why attention works the way it does.
Then close the file and build with APIs.
Understanding the math makes you a better AI engineer — you'll know why hallucinations happen, why context length matters, why fine-tuning works for some use cases and not others. You won't waste months trying to train models that production APIs already solve better.
The 200 lines in MicroGPT contain the full concept. Everything else is execution — and that's what infrastructure providers like ModelsLab are for.
Try the ModelsLab LLM API: Start with the documentation, get a free API key, and run your first request in under 5 minutes.