Andrej Karpathy published autoresearch to GitHub in March 2026 with a comment that stuck: "One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun." The project takes that observation literally.
The idea: give an AI agent a real LLM training setup and let it run experiments autonomously overnight. The agent modifies actual Python code, runs 5-minute training loops, checks whether loss improved, keeps or discards each change, and repeats until morning. You wake up to a structured log of 60–80 experiments and, hopefully, a better model.
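The cycle is simple enough to sketch in a few lines of Python. This is a hypothetical sketch, not the repo's actual API — names like propose_change and run_training_cycle are illustrative stand-ins:

```python
# Hypothetical sketch of the overnight loop. `propose_change` (LLM-backed)
# and `run_training_cycle` are illustrative stand-ins, not autoresearch's API.
def overnight_loop(propose_change, run_training_cycle, max_experiments=80):
    best_val_loss = run_training_cycle()      # establish the baseline
    log = []
    for _ in range(max_experiments):
        change = propose_change(log)          # LLM proposes an edit to train.py
        change.apply()
        val_loss = run_training_cycle()       # ~5-minute training run
        kept = val_loss < best_val_loss
        if kept:
            best_val_loss = val_loss          # keep the improvement
        else:
            change.revert()                   # discard the regression
        log.append({"change": str(change), "val_loss": val_loss, "kept": kept})
    return log
```

Everything else in the setup exists to make each pass through this loop as fast and as well-instrumented as possible.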
This post walks through setting up autoresearch on an H100 cloud GPU using GPULab — the infrastructure Karpathy's workflow actually needs to make the cycle times practical.
The Architecture: Chief Scientist + Junior Engineers in tmux
The repo has three files that matter:
- prepare.py — fixed utility code. Downloads training data, trains a BPE tokenizer, handles the dataloader and evaluation harness. The agent never touches this.
- train.py — the file the agent edits. Full GPT model, Muon + AdamW optimizer, training loop. Architecture, hyperparameters, batch size — everything in here is fair game.
- program.md — your instructions to the research org. The agent reads this to understand what to optimize for, what to try, and what constraints to stay within.
Karpathy describes this as "programming the program.md" — you're writing the research strategy in Markdown, not Python. The agents (junior engineers) execute the experiments in tmux sessions while you're offline. You're the chief scientist who sets direction; they run the lab overnight.
This framing is more than a metaphor. The HN thread shows Karpathy responding to questions directly — the architecture genuinely treats program.md as a specification for an autonomous research team.
Why H100 GPUs for This Workflow
The autoresearch loop requires a GPU that completes a training cycle reliably within the 5–7 minute target. That target isn't arbitrary — it's what makes 60–80 experiments feasible in a 12-hour overnight window.
Typical cycle times and overnight experiment counts by GPU:
- A100 80GB: ~8–10 minutes per cycle → 40–45 experiments overnight
- V100 32GB: 15–25 minutes per cycle → 20–30 experiments overnight
- H100 SXM5 80GB: 5–7 minutes per cycle → 70–100 experiments overnight
The difference is real research velocity. Getting 3x more experiments from the same overnight window means faster convergence on what actually works. H100s also have enough VRAM (80GB) to handle nanochat's full model without gradient checkpointing, which adds overhead and slows cycles further on smaller GPUs.
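The arithmetic behind those counts is a simple upper bound — agent overhead (proposing changes, evaluating, logging) eats into it, which is why the real-world figures above are lower. The cycle times used here are midpoints of the ranges above:

```python
# Upper-bound experiment counts per overnight window, ignoring agent overhead.
WINDOW_MINUTES = 12 * 60  # 12-hour overnight window

def experiments_per_night(cycle_minutes: float) -> int:
    return int(WINDOW_MINUTES // cycle_minutes)

for gpu, cycle in [("H100 SXM5", 6), ("A100 80GB", 9), ("V100 32GB", 20)]:
    print(f"{gpu}: at most {experiments_per_night(cycle)} cycles at {cycle} min each")
```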
GPULab provides H100 SXM5 instances on demand. Provisioning takes under 2 minutes, and you can terminate after reviewing your morning results — a 12-hour run typically costs $30–45 depending on current spot rates.
Setup: GPULab H100 + autoresearch
Start a GPULab H100 instance (Ubuntu 22.04, CUDA 12.x image). Once connected via SSH:
# Clone autoresearch
git clone https://github.com/karpathy/autoresearch
cd autoresearch
# Install dependencies (Python 3.10+, PyTorch 2.x)
pip install -r requirements.txt
# One-time data prep: downloads training data, trains BPE tokenizer
python prepare.py
# Verify GPU
python -c "import torch; print(torch.cuda.get_device_name(0))"
# Should output: NVIDIA H100 SXM5 80GB (or similar)
Set your LLM API key — the agent uses an LLM to decide what changes to propose for train.py:
export ANTHROPIC_API_KEY="sk-ant-..."
# or if using OpenAI:
export OPENAI_API_KEY="sk-..."
Writing program.md: Where the Research Actually Happens
The default program.md in the repo is intentionally minimal — a baseline. Writing a strong one is where you actually do research work. Here's an example that gives the agents clear direction:
# Research Goal
Reduce validation loss on the nanochat task below 2.0 within 50 experiments.
# Current Baseline
- Architecture: 6-layer GPT, 6 heads, 256 embed dim
- Optimizer: Muon + AdamW (default settings from train.py)
- Training: ~500 steps per 5-minute cycle
# What to Explore (priority order)
1. Learning rate schedule variants: cosine decay, warmup fraction (try 0.05, 0.10, 0.15)
2. Batch size: try 64, 128, 256 (within VRAM limits)
3. Architecture depth vs width: 8 layers × 4 heads vs 6 layers × 6 heads
4. Dropout: 0.0 vs 0.1 vs 0.2 on attention layers
# Hard Constraints
- Stay within 70GB VRAM (leave buffer for system overhead)
- Each experiment must complete in under 8 minutes
- Do not modify prepare.py
- Revert any change that increases val loss by more than 0.05 from current best
# Evaluation
Primary: validation loss after full training cycle. Lower is better.
Log every experiment: change description, val_loss_before, val_loss_after, decision (keep/revert).
The agents read this, propose a change aligned with your priorities, implement it in train.py, run the training cycle, evaluate, and log the result. Your job as "chief scientist" is writing better program.md iterations based on what the logs reveal.
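The evaluation contract above maps to a small record type — a sketch of how each log entry's keep/revert decision can be derived (names are illustrative, not the repo's):

```python
from dataclasses import dataclass

@dataclass
class ExperimentRecord:
    change: str
    val_loss_before: float
    val_loss_after: float

    @property
    def decision(self) -> str:
        # program.md's primary metric: lower validation loss wins.
        return "keep" if self.val_loss_after < self.val_loss_before else "revert"
```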
Adding ModelsLab Inference API to the Eval Loop
The default evaluation metric is validation loss — fast and automatic. But val loss doesn't always capture generation quality. You can add a generation quality check using ModelsLab's inference API as a secondary evaluation signal.
Add this as an optional evaluation hook — run it only when val loss improves to avoid unnecessary API calls:
import requests

MODELSLAB_KEY = "your-key"  # get from modelslab.com

def generation_quality_check(prompts: list[str]) -> float:
    """
    Quick generation quality check via ModelsLab LLM API.
    Returns the fraction of prompts that produce coherent output (>50 chars).
    Run only when val_loss improves to control API spend.
    """
    scores = []
    for prompt in prompts:
        try:
            resp = requests.post(
                "https://modelslab.com/api/v6/llm/chat",
                json={
                    "key": MODELSLAB_KEY,
                    "prompt": prompt,
                    "max_new_tokens": 150,
                },
                timeout=30,
            )
            output = resp.json().get("message", "")
            scores.append(len(output.strip()) > 50)
        except Exception:
            scores.append(False)
    return sum(scores) / len(scores) if scores else 0.0

# In your eval function, call this only when val_loss improves:
if new_val_loss < best_val_loss:
    quality = generation_quality_check(TEST_PROMPTS)
    log(f"val_loss={new_val_loss:.4f} quality={quality:.2f}")
This gives you two signals: validation loss (fast, from the training loop) and generation coherence (slower, from the inference API). Combining them catches cases where val loss improves but actual output quality doesn't — which happens more than you'd expect.
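One way to combine the two signals is a simple acceptance gate — keep a change only when val loss improves and coherence stays above the threshold. A sketch (tune the threshold to your task):

```python
def accept_change(new_val_loss: float, best_val_loss: float,
                  quality: float, quality_threshold: float = 0.70) -> bool:
    """Keep only changes that improve val loss without degrading generation."""
    return new_val_loss < best_val_loss and quality >= quality_threshold
```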
Running the Overnight Loop
Start the agent inside a tmux session so it keeps running after you disconnect:
# Create a named tmux session
tmux new-session -d -s research
# Launch the agent loop inside tmux
tmux send-keys -t research \
'python -m autoresearch.run --program program.md --max-experiments 80' \
Enter
# Detach — everything continues running
tmux detach
# To check progress from anywhere (or reconnect in the morning):
# SSH back into the instance, then:
tmux attach -t research
With --max-experiments 80 and 7-minute average cycle times, the loop runs for roughly 9 hours. Set your cap based on how long you want the overnight window and your GPU budget.
Morning: Reading the Results
The agent produces a structured experiment log. Here's what a typical entry looks like after a productive overnight run:
Experiment 047 | 03:14 AM
Change: Reduced warmup_fraction from 0.10 to 0.05, increased peak_lr from 3e-4 to 4e-4
val_loss_before: 2.184
val_loss_after: 2.071
Delta: -0.113 ✓ KEPT
Generation quality: 0.82 (above 0.70 threshold)
Experiment 048 | 03:22 AM
Change: Added residual dropout 0.1 to attention layers
val_loss_before: 2.071
val_loss_after: 2.109
Delta: +0.038 — REVERTED
Experiment 049 | 03:30 AM
Change: Increased batch_size from 128 to 256
val_loss_before: 2.071
val_loss_after: 2.063
Delta: -0.008 ✓ KEPT
Most experiments will revert. That's expected — this is what research looks like when you run enough trials. The value is in the 15–20% that improve the baseline. After 80 experiments, you have data on what directions actually work for your specific setup, not just theory.
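Entries in this shape are easy to mine. A small parser for the overnight keep rate — this assumes the exact log format shown above, so adjust the regex if your run's format differs:

```python
import re

# Matches one experiment entry in the log format shown above.
ENTRY_RE = re.compile(
    r"val_loss_before:\s*(?P<before>[\d.]+).*?"
    r"val_loss_after:\s*(?P<after>[\d.]+).*?"
    r"(?P<decision>KEPT|REVERTED)",
    re.DOTALL,
)

def keep_rate(log_text: str) -> float:
    """Fraction of experiments the agent kept overnight."""
    decisions = [m.group("decision") for m in ENTRY_RE.finditer(log_text)]
    return decisions.count("KEPT") / len(decisions) if decisions else 0.0
```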
Practical Notes Before You Run
- Cap experiments explicitly. The loop runs indefinitely without --max-experiments. 80 is a reasonable overnight ceiling before logs get unwieldy.
- Verify CUDA before disconnecting. Run a short test training cycle and confirm GPU utilization is correct before starting the overnight run. A silent CUDA setup issue kills all 80 experiments.
- Budget GPU time. 12 hours × H100 rate = $30–45 depending on current spot pricing. Track this across nights — multiple overnight sessions add up quickly.
- Read logs critically. If val loss doesn't improve after 40 experiments, the problem is usually program.md needing new directions, not more compute. Rewrite the strategy before adding more GPU hours.
- Start with the default program.md. Karpathy's baseline is simple for a reason — get one successful overnight run working before adding complexity to your research instructions.
What This Is (and Isn't)
autoresearch is not a magic ML research accelerator. It's a workflow that removes the most tedious parts of empirical ML research — running experiments one at a time, babysitting training loops, manually logging what you tried. The agent loop handles the execution. You still need to interpret the results and write better strategies.
The "chief scientist + junior engineers" framing is accurate: you're doing higher-level strategic work while the agents handle execution. On H100 cloud GPUs, the execution is fast enough that this workflow becomes practical for overnight sessions rather than multi-day runs.
The autoresearch repo is at github.com/karpathy/autoresearch. GPULab H100 instances are available at gpulab.ai. ModelsLab inference API documentation is at docs.modelslab.com.
