Karpathy's MicroGPT: A 200-Line GPT Explained for Developers

Adhik Joshi · 7 min read · LLM
Karpathy's MicroGPT: A 200-Line GPT That Actually Works

Andrej Karpathy just published microgpt.py — a single 200-line Python file with zero external dependencies that implements a complete GPT from scratch. No PyTorch, no Hugging Face, no CUDA. Just raw Python.

The Hacker News thread hit 960+ points within hours. The comments are a goldmine: developers who have been using LLMs for two years finally seeing what actually happens inside the black box.

This post walks through what Karpathy built, why it matters for working developers, and what it reveals about when you should roll your own versus when you should reach for an API.

What MicroGPT Actually Is

This isn't a toy demo. MicroGPT contains the full algorithmic content of a GPT in one file:

  • Dataset loading — reads a corpus of documents (32K names by default)
  • Character-level tokenizer — maps unique characters to integer token IDs
  • Autograd engine from scratch — computes gradients without PyTorch via a Value class
  • GPT-2-style architecture — multi-head self-attention, feed-forward layers, positional embeddings
  • Adam optimizer — hand-implemented, no library dependency
  • Training loop + inference loop — everything needed to train and sample

The whole thing fits in 200 lines. It generates plausible new names (kamon, vialan, keylen, alerin) by learning statistical patterns from the training set. It's the conceptual heart of GPT-4, just without the trillion parameters and RLHF.

Karpathy describes it as "the culmination of a decade-long obsession to simplify LLMs to their bare essentials" — following his earlier micrograd, makemore, and nanoGPT projects.

The Key Components, Explained

1. Dataset and Tokenizer

MicroGPT starts with a corpus of documents — in this case, 32,000 names. Each name is a document. The tokenizer is the simplest possible: every unique character becomes a token ID.

uchars = sorted(set(''.join(docs)))  # unique chars become token IDs 0..n-1
BOS = len(uchars)                     # special Beginning of Sequence token
vocab_size = len(uchars) + 1          # 26 letters + 1 BOS = 27 tokens

Production tokenizers (like OpenAI's tiktoken) operate on sub-word chunks for efficiency — "token" becomes ["tok", "en"] rather than individual characters. But character-level tokenization is sufficient for understanding the algorithm.

Each training document gets wrapped in BOS tokens: [BOS, e, m, m, a, BOS]. The model learns that BOS marks document boundaries.
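The full pipeline, from corpus to token IDs and back, fits in a few lines. This is an illustrative sketch; the helper names `encode` and `decode` are mine, not from microgpt.py:

```python
docs = ["emma", "olivia", "ava"]  # tiny corpus: each name is a document

# build the vocabulary: every unique character becomes a token ID
uchars = sorted(set(''.join(docs)))
BOS = len(uchars)                 # special Beginning of Sequence token
stoi = {ch: i for i, ch in enumerate(uchars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(doc):
    # wrap each document in BOS tokens to mark its boundaries
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    # drop BOS markers and map IDs back to characters
    return ''.join(itos[t] for t in tokens if t != BOS)

tokens = encode("emma")
print(tokens)           # BOS id, one id per character, BOS id
print(decode(tokens))   # "emma"
```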

2. Autograd: Backprop Without PyTorch

This is where MicroGPT gets impressive. The Value class implements automatic differentiation from scratch — the same computation graph traversal that PyTorch does, just without GPU optimization or memory efficiency:

class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
    
    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

Every operation records its children. When you call backward() on the final loss, the gradients flow back through the computation graph — this is what backpropagation actually is at the code level.
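To see the whole mechanism end to end, here is a minimal, runnable version of the same idea, with a `backward()` driver in the style of Karpathy's micrograd (the topological sort is the standard approach, not a verbatim excerpt from microgpt.py):

```python
class Value:
    """Minimal autograd node: a scalar plus its gradient and local backward rule."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # d(a+b)/da = 1
            other.grad += out.grad      # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # topologically sort the graph, then apply the chain rule in reverse
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x, y = Value(2.0), Value(3.0)
loss = x * y + x          # d(loss)/dx = y + 1 = 4, d(loss)/dy = x = 2
loss.backward()
print(x.grad, y.grad)     # 4.0 2.0
```

Note the `+=` in every backward rule: a node can appear multiple times in an expression (as `x` does here), so gradients from each use must accumulate.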

3. The Transformer Architecture

MicroGPT implements the core transformer blocks: token embeddings, positional encodings, multi-head self-attention, residual connections, and layer normalization. The attention mechanism is the key insight:

# Attention: query, key, value projections
# Q @ K^T scaled by sqrt(d_k) -> softmax -> @ V
# This is the "pay attention to relevant tokens" operation

Each attention head learns to look at different aspects of the context. Multiple heads in parallel allow the model to attend to different types of relationships simultaneously — syntactic patterns, semantic similarity, positional proximity.
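A single attention head can be sketched in dependency-free Python, matching the Q @ K^T / sqrt(d_k) -> softmax -> @ V recipe above. This is an illustration, not microgpt.py's code, and it omits the causal mask a real GPT applies so that tokens only attend to earlier positions:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head; each row is a token's vector."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # scores: how strongly this query attends to each key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # output: attention-weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# three tokens, 2-dim head: each row is one token's projection
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(attention(Q, K, V))  # each output row is a blend of the value rows
```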

4. The Training Loop

The training loop is three lines conceptually: forward pass (compute predictions), backward pass (compute gradients), optimizer step (nudge parameters in the direction that reduces loss):

for step in range(max_steps):
    # sample a random batch of training sequences
    x, y = get_batch(train_data, batch_size, context_length)
    # forward pass: compute logits and loss
    logits, loss = model(x, y)
    # backward pass: clear stale gradients, then backpropagate
    optimizer.zero_grad()
    loss.backward()
    # parameter update: nudge weights against the gradient
    optimizer.step()

This is the same loop that trained GPT-4 — just with vastly more data, parameters, and compute.

What This Teaches You (Even If You Never Run It)

Reading MicroGPT clarifies several things that are commonly misunderstood:

Models don't "understand" anything — they learn statistical patterns that predict which token comes next. Your conversation with ChatGPT is, from the model's perspective, just a document to complete. The "intelligence" emerges from the scale of patterns learned across billions of documents.

Context length is a hard constraint — The model only sees the last context_length tokens. There's no persistent memory, no retrieval, no understanding of what came before the context window. Everything you've read about RAG and long-context models is engineering around this fundamental limitation.
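The window is literally one slice in the inference loop; a sketch of the idea:

```python
context_length = 8
tokens = list(range(20))  # pretend token IDs generated so far

# at each inference step, the model conditions only on the last
# context_length tokens; everything earlier is simply invisible to it
window = tokens[-context_length:]
print(window)  # the last 8 token IDs, nothing older
```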

Temperature controls creativity vs. accuracy — During inference, the logits (raw scores) get divided by temperature before the softmax. Lower temperature (0.1) concentrates probability mass on the most likely token, so output is nearly deterministic. Higher temperature (1.5) flattens the distribution, so sampling explores less likely tokens. This is why temperature=0 gives you consistent answers and high temperatures give you more varied, and more error-prone, output.
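The effect is easy to demonstrate with a temperature-scaled softmax sampler (an illustrative sketch, not code from microgpt.py):

```python
import math, random

def sample_token(logits, temperature=1.0):
    """Sample a token ID from raw logits after temperature scaling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs)[0]

logits = [2.0, 1.0, 0.1]
# low temperature: almost always picks the top-scoring token
low = [sample_token(logits, 0.1) for _ in range(1000)]
# high temperature: spreads probability mass across all tokens
high = [sample_token(logits, 1.5) for _ in range(1000)]
print(low.count(0) / 1000, high.count(0) / 1000)
```

At temperature 0.1 the top token wins essentially every draw; at 1.5 it wins only a bit more than half the time, with the rest going to the alternatives.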

The parameters ARE the model — There's no separate "knowledge database." Everything the model knows is encoded in the values of the weight matrices. This is why fine-tuning works: you're updating those values to encode new patterns.

The Gap Between MicroGPT and Production LLMs

MicroGPT demonstrates the algorithm. It doesn't demonstrate what makes GPT-4 or Claude actually useful in production. That gap is enormous:

| Dimension | MicroGPT | Production LLM |
|---|---|---|
| Parameters | ~50K | 7B–1.8T |
| Training data | 32K names | Trillions of tokens |
| Training compute | CPU, minutes | Thousands of H100s, months |
| Alignment | None | RLHF, Constitutional AI, etc. |
| Inference speed | Slow (pure Python) | Optimized CUDA kernels, quantization |
| Context length | 32 tokens | 128K–1M tokens |
| Deployment cost | Runs on a laptop | Hundreds of GPU-hours per day |

The algorithm is the same. The scale is not. This is the central insight: understanding MicroGPT teaches you how LLMs work, but it doesn't get you closer to deploying one in production.

When to Run Your Own vs. Use an API

After reading Karpathy's post, some developers will want to train their own GPT. Most of the time, that's the wrong move. Here's the practical decision framework:

Train your own when:

  • You need domain-specific capabilities that general models lack (protein folding, chip design, specialized code)
  • You have proprietary data you can't send to an external API
  • You need 100% control over the model's behavior and can't use fine-tuning via API
  • You're doing academic research on model architecture

Use an API when:

  • You're building an application (95%+ of developers)
  • You want state-of-the-art performance without infrastructure overhead
  • You need to ship fast — training even a small model takes weeks of iteration
  • Your use case doesn't require a model trained from scratch

For most production workloads — text generation, image creation, audio synthesis, video generation — the right move is an API. You get immediate access to frontier models, pay only for what you use, and don't manage GPU infrastructure.

Running Frontier Models via ModelsLab API

If MicroGPT taught you how the algorithm works and now you want to put it to use, ModelsLab's API gives you access to 200+ AI models — LLMs, image generation, video, audio — in a unified API.

Here's how simple it is to call a frontier LLM via the API:

import requests

API_KEY = "your-modelslab-api-key"

response = requests.post(
    "https://modelslab.com/api/v6/llm/chat",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "qwen3.5-72b",  # or claude-3-5-sonnet, gpt-4o, llama-3-405b, etc.
        "messages": [
            {"role": "user", "content": "Explain autograd in 3 sentences"}
        ],
        "temperature": 0.7,
        "max_tokens": 512
    }
)
print(response.json()["choices"][0]["message"]["content"])

The same API surface works for image generation (Flux.1, SDXL), video generation (Kling, Seedance), and audio (Whisper, TTS). You don't manage the GPU cluster, model weights, or serving infrastructure — ModelsLab handles all of that.

The Right Mental Model

MicroGPT is one of the best things Karpathy has ever published, and he's published a lot. Run it, read it, understand the autograd engine and why attention works the way it does.

Then close the file and build with APIs.

Understanding the math makes you a better AI engineer — you'll know why hallucinations happen, why context length matters, why fine-tuning works for some use cases and not others. You won't waste months trying to train models that production APIs already solve better.

The 200 lines in MicroGPT contain the full concept. Everything else is execution — and that's what infrastructure providers like ModelsLab are for.


Try the ModelsLab LLM API: Start with the documentation, get a free API key, and run your first request in under 5 minutes.
