Karpathy's MicroGPT: A 200-Line GPT That Actually Works
Andrej Karpathy just published microgpt.py — a single 200-line Python file with zero external dependencies that implements a complete GPT from scratch. No PyTorch, no Hugging Face, no CUDA. Just raw Python.
The Hacker News thread hit 960+ points inside a few hours. The comments are a goldmine: developers who've been using LLMs for two years are finally seeing what's actually happening inside the black box.
This post walks through what Karpathy built, why it matters for working developers, and what it reveals about when you should roll your own versus when you should reach for an API.
What MicroGPT Actually Is
This isn't a toy demo. MicroGPT contains the full algorithmic content of a GPT in one file:
- Dataset loading — reads a corpus of documents (32K names by default)
- Character-level tokenizer — maps unique characters to integer token IDs
- Autograd engine from scratch — computes gradients without PyTorch via a Value class
- GPT-2-style architecture — multi-head self-attention, feed-forward layers, positional embeddings
- Adam optimizer — hand-implemented, no library dependency
- Training loop + inference loop — everything needed to train and sample
The whole thing fits in 200 lines. It generates plausible new names (kamon, vialan, keylen, alerin) by learning statistical patterns from the training set. It's the conceptual heart of GPT-4, just without the trillion parameters and RLHF.
Karpathy describes it as "the culmination of a decade-long obsession to simplify LLMs to their bare essentials" — following his earlier micrograd, makemore, and nanoGPT projects.
The Key Components, Explained
1. Dataset and Tokenizer
MicroGPT starts with a corpus of documents — in this case, 32,000 names. Each name is a document. The tokenizer is the simplest possible: every unique character becomes a token ID.
uchars = sorted(set(''.join(docs))) # unique chars become token IDs 0..n-1
BOS = len(uchars) # special Beginning of Sequence token
vocab_size = len(uchars) + 1 # 26 letters + 1 BOS = 27 tokens
Production tokenizers (like OpenAI's tiktoken) operate on sub-word chunks for efficiency — "token" becomes ["tok", "en"] rather than individual characters. But character-level tokenization is sufficient for understanding the algorithm.
Each training document gets wrapped in BOS tokens: [BOS, e, m, m, a, BOS]. The model learns that BOS marks document boundaries.
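A minimal sketch of this tokenizer on a toy corpus makes the mechanics concrete. The helper names here (stoi, itos, encode, decode) are illustrative, not necessarily the names microgpt uses:

```python
# Toy corpus: each name is one document
docs = ["emma", "olivia", "ava"]

# Build the vocabulary: every unique character becomes a token ID
uchars = sorted(set(''.join(docs)))
stoi = {ch: i for i, ch in enumerate(uchars)}   # char -> token ID
itos = {i: ch for ch, i in stoi.items()}        # token ID -> char
BOS = len(uchars)                               # special Beginning-of-Sequence token

def encode(doc):
    """Wrap a document in BOS tokens: "emma" -> [BOS, e, m, m, a, BOS]."""
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    """Drop BOS markers and map token IDs back to characters."""
    return ''.join(itos[t] for t in tokens if t != BOS)

tokens = encode("emma")
print(decode(tokens))  # round-trips back to "emma"
```

Round-tripping a name through encode and decode is a quick sanity check that the vocabulary and the BOS handling agree.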
2. Autograd: Backprop Without PyTorch
This is where MicroGPT gets impressive. The Value class implements automatic differentiation from scratch — the same computation graph traversal that PyTorch does, just without GPU optimization or memory efficiency:
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
Every operation records its children. When you call backward() on the final loss, the gradients flow back through the computation graph — this is what backpropagation actually is at the code level.
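To see gradients actually flow, here is a self-contained micrograd-style sketch: the same Value class extended with __mul__ and a backward() driver that topologically sorts the graph and applies the chain rule in reverse. This is a simplification for illustration, not microgpt's exact code:

```python
class Value:
    """Minimal autograd node (sketch in the style of micrograd)."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # d(a+b)/da = 1
            other.grad += out.grad      # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a       # loss = 2*3 + 2 = 8.0
loss.backward()
print(a.grad)          # d(loss)/da = b + 1 = 4.0
print(b.grad)          # d(loss)/db = a = 2.0
```

Twenty-odd lines reproduce the essence of loss.backward() in PyTorch: record the graph on the forward pass, walk it once in reverse on the backward pass.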
3. The Transformer Architecture
MicroGPT implements the core transformer blocks: token embeddings, positional encodings, multi-head self-attention, residual connections, and layer normalization. The attention mechanism is the key insight:
# Attention: query, key, value projections
# Q @ K^T scaled by sqrt(d_k) -> softmax -> @ V
# This is the "pay attention to relevant tokens" operation
Each attention head learns to look at different aspects of the context. Multiple heads in parallel allow the model to attend to different types of relationships simultaneously — syntactic patterns, semantic similarity, positional proximity.
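The pseudocode comments above translate directly into a dependency-free sketch of a single scaled dot-product attention head over plain Python lists. For brevity this omits the causal mask that a GPT applies, and the helper names (softmax, attention) are mine:

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating, for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors (sketch)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Q @ K^T scaled by sqrt(d_k): how relevant is each position to this query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of value vectors: softmax(scores) @ V
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example: 2 positions, 2-dimensional vectors
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Each output row is a convex combination of the value vectors, weighted by how well the query matches each key — that's the whole "pay attention to relevant tokens" operation.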
4. The Training Loop
The training loop is conceptually three steps: forward pass (compute predictions), backward pass (compute gradients), optimizer step (nudge parameters in the direction that reduces loss):
for step in range(max_steps):
    # sample a random batch of training sequences
    x, y = get_batch(train_data, batch_size, context_length)
    # forward pass: compute logits and loss
    logits, loss = model(x, y)
    # backward pass + parameter update (zero grads first so they don't accumulate)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This is the same loop that trained GPT-4 — just with vastly more data, parameters, and compute.
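The hand-implemented Adam optimizer mentioned earlier is the textbook algorithm; here is a sketch of a single Adam update over flat lists of scalar parameters, with standard hyperparameters. The function name and signature are illustrative, not microgpt's exact code:

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (textbook algorithm, sketched over scalar parameters)."""
    for i in range(len(params)):
        m[i] = beta1 * m[i] + (1 - beta1) * grads[i]       # first moment (momentum)
        v[i] = beta2 * v[i] + (1 - beta2) * grads[i] ** 2  # second moment (RMS)
        m_hat = m[i] / (1 - beta1 ** t)                    # bias correction
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params, m, v

# Sanity check: minimize f(p) = p^2, whose gradient is 2p
params, m, v = [5.0], [0.0], [0.0]
for t in range(1, 2001):
    grads = [2 * params[0]]
    params, m, v = adam_step(params, grads, m, v, t, lr=0.05)
# params[0] should now sit near the minimum at 0
```

The bias-correction terms are why Adam takes sensibly sized steps even in the first few iterations, when the moment estimates are still warming up.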
What This Teaches You (Even If You Never Run It)
Reading MicroGPT clarifies several things that are commonly misunderstood:
Models don't "understand" anything — they learn statistical patterns that predict which token comes next. Your conversation with ChatGPT is, from the model's perspective, just a document to complete. The "intelligence" emerges from the scale of patterns learned across billions of documents.
Context length is a hard constraint — The model only sees the last context_length tokens. There's no persistent memory, no retrieval, no understanding of what came before the context window. Everything you've read about RAG and long-context models is engineering around this fundamental limitation.
Temperature controls creativity vs. accuracy — During inference, the logits (raw scores) get divided by temperature before softmax. Lower temperature (0.1) sharpens the distribution toward the most likely token; higher temperature (1.5) flattens it, so sampling explores less likely tokens. This is why temperature=0 gives you consistent answers while high temperatures produce more varied — and sometimes incoherent — output.
The parameters ARE the model — There's no separate "knowledge database." Everything the model knows is encoded in the values of the weight matrices. This is why fine-tuning works: you're updating those values to encode new patterns.
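The temperature mechanic described above fits in a few lines. This is a sketch of the standard logits-scaling trick, not microgpt's exact sampling code; the function name is mine:

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Scale logits by temperature, softmax, then sample a token ID (sketch)."""
    if temperature == 0:
        # Greedy decoding: always pick the highest-scoring token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # stabilize the softmax
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
print(sample_token(logits, temperature=0))   # greedy: always token 0
# Low temperature sharpens the distribution toward token 0;
# high temperature flattens it toward uniform.
```

Dividing by a small temperature exaggerates the gaps between logits before the softmax; dividing by a large one shrinks them — that single division is the entire "creativity knob."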
The Gap Between MicroGPT and Production LLMs
MicroGPT demonstrates the algorithm. It doesn't demonstrate what makes GPT-4 or Claude actually useful in production. That gap is enormous:
| Dimension | MicroGPT | Production LLM |
|---|---|---|
| Parameters | ~50K | 7B–1.8T |
| Training data | 32K names | Trillions of tokens |
| Training compute | CPU, minutes | Thousands of H100s, months |
| Alignment | None | RLHF, Constitutional AI, etc. |
| Inference speed | Slow (pure Python) | Optimized CUDA kernels, quantization |
| Context length | 32 tokens | 128K–1M tokens |
| Deployment cost | Runs on laptop | Hundreds of GPU-hours per day |
The algorithm is the same. The scale is not. This is the central insight: understanding MicroGPT teaches you how LLMs work, but it doesn't get you closer to deploying one in production.
When to Run Your Own vs. Use an API
After reading Karpathy's post, some developers will want to train their own GPT. Most of the time, that's the wrong move. Here's the practical decision framework:
Train your own when:
- You need domain-specific capabilities that general models lack (protein folding, chip design, specialized code)
- You have proprietary data you can't send to an external API
- You need 100% control over the model's behavior and can't use fine-tuning via API
- You're doing academic research on model architecture
Use an API when:
- You're building an application (95%+ of developers)
- You want state-of-the-art performance without infrastructure overhead
- You need to ship fast — training even a small model takes weeks of iteration
- Your use case doesn't require a model trained from scratch
For most production workloads — text generation, image creation, audio synthesis, video generation — the right move is an API. You get immediate access to frontier models, pay only for what you use, and don't manage GPU infrastructure.
Running Frontier Models via ModelsLab API
If MicroGPT taught you how the algorithm works and now you want to put it to use, ModelsLab's API gives you access to 200+ AI models — LLMs, image generation, video, audio — behind a unified API.
Here's how simple it is to call a frontier LLM via the API:
import requests

API_KEY = "your-modelslab-api-key"

response = requests.post(
    "https://modelslab.com/api/v6/llm/chat",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "qwen3.5-72b",  # or claude-3-5-sonnet, gpt-4o, llama-3-405b, etc.
        "messages": [
            {"role": "user", "content": "Explain autograd in 3 sentences"}
        ],
        "temperature": 0.7,
        "max_tokens": 512
    }
)
print(response.json()["choices"][0]["message"]["content"])
The same API surface works for image generation (Flux.1, SDXL), video generation (Kling, Seedance), and audio (Whisper, TTS). You don't manage the GPU cluster, model weights, or serving infrastructure — ModelsLab handles all of that.
The Right Mental Model
MicroGPT is one of the best things Karpathy has ever published, and he's published a lot. Run it, read it, understand the autograd engine and why attention works the way it does.
Then close the file and build with APIs.
Understanding the math makes you a better AI engineer — you'll know why hallucinations happen, why context length matters, why fine-tuning works for some use cases and not others. You won't waste months trying to train models that production APIs already solve better.
The 200 lines in MicroGPT contain the full concept. Everything else is execution — and that's what infrastructure providers like ModelsLab are for.
Try the ModelsLab LLM API: Start with the documentation, get a free API key, and run your first request in under 5 minutes.