Karpathy autoresearch on H100 GPUs: Overnight ML Experiments 2026

12 min read | LLM


What Is Karpathy's Autoresearch?

On the night of March 7, 2026, Andrej Karpathy pushed a 630-line Python script to GitHub and went to sleep. By morning, his AI agent had run 50 experiments, discovered a better learning rate, committed the proof to git, and moved on to the next hypothesis — all without a single human instruction in between.

The project is called autoresearch, and it reframes how ML research gets done. Instead of a human sitting at a terminal cycling through hyperparameter changes one at a time, an AI agent takes over the execution loop: it modifies code, trains for exactly 5 minutes, checks if validation loss improved, keeps or discards the change, and repeats. You set the research direction before bed. You read the results over coffee.

Karpathy described the dynamic with a line that stuck with the community: "One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun." Autoresearch takes that observation literally — replacing the tedious execution cycle while keeping humans in the strategic seat.

Within a week of release, the repo had crossed 33,000 GitHub stars. By early April 2026, it reached over 66,000 stars with 9,600 forks, making it one of the fastest-growing ML research tools in recent memory.

How the Agent Loop Works: A Nine-Step Ratchet

The autoresearch workflow operates as a ratchet — it only moves forward. Each iteration follows nine steps:

| Step | Action | Details |
|---|---|---|
| 1 | Read research priorities | Agent reads program.md for direction |
| 2 | Examine current state | Reviews train.py and recent results.tsv |
| 3 | Propose hypothesis | Suggests an architecture or optimizer change |
| 4 | Modify code | Edits train.py and commits to a git branch |
| 5 | Train | Runs training for exactly 5 minutes (wall-clock) |
| 6 | Handle failures | Logs crashes automatically and reverts |
| 7 | Evaluate | Measures val_bpb, records in results.tsv |
| 8 | Keep or discard | Keeps commit if val_bpb improved; otherwise git reset HEAD~1 |
| 9 | Repeat | Returns to step 1 |

This produces roughly 8-12 experiments per hour, depending on per-cycle overhead. An overnight run on a single GPU covers 80 to 100 experiments, a volume that would take a human researcher several days of manual cycling.

The system maintains a single lineage rather than a population, using git history as its research memory. The agent reads its own commit log and results.tsv to calibrate future proposals, building on validated directions and avoiding paths that already failed.
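The nine-step loop above can be sketched as a small Python program. Everything here is a toy stand-in: `propose_change` and `train_and_eval` are hypothetical stubs simulating the LLM agent and a 5-minute training run, and the `history` list plays the role of git history plus results.tsv. The keep-or-discard ratchet logic, however, has the same shape as the real loop.

```python
import random

def propose_change(history):
    """Stand-in for the LLM agent: propose a perturbation, informed by history."""
    return {"lr_scale": random.uniform(0.8, 1.25)}

def train_and_eval(change):
    """Stand-in for a 5-minute training run returning val_bpb (lower is better)."""
    # Toy objective: val_bpb is minimized when lr_scale is near 1.1.
    return 0.98 + (change["lr_scale"] - 1.1) ** 2 + random.gauss(0, 0.001)

def ratchet(n_experiments=50, seed=0):
    random.seed(seed)
    best_bpb = train_and_eval({"lr_scale": 1.0})  # baseline measurement
    history = []
    for _ in range(n_experiments):
        change = propose_change(history)          # steps 1-4: propose and edit
        bpb = train_and_eval(change)              # steps 5-7: train and evaluate
        kept = bpb < best_bpb                     # step 8: keep or discard
        if kept:
            best_bpb = bpb                        # the "commit" stays on the branch
        history.append((change, bpb, kept))       # git log / results.tsv analogue
    return best_bpb, history

best, hist = ratchet()
print(f"best val_bpb after {len(hist)} experiments: {best:.4f}")
```

Because a change is kept only when it strictly improves the metric, the best score never moves backward, which is exactly the "ratchet" property the article describes.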

The Three Files That Matter

Autoresearch is deliberately minimal. The entire system revolves around three files:

prepare.py — The Immutable Foundation

This file handles one-time data preparation: downloading the FineWeb-Edu training dataset, training a BPE tokenizer (8,192 tokens), and defining the evaluation harness. The agent never modifies prepare.py. It provides the fixed ground truth against which all experiments are measured.

train.py — The Agent's Sandbox

At 630 lines, train.py contains the full GPT model definition, a Muon + AdamW optimizer configuration, and the complete training loop. Everything is fair game for the agent to modify: architecture depth and width, attention mechanisms, activation functions, learning rate schedules, batch sizes, weight initialization, and optimizer hyperparameters.

The single-file constraint is intentional. As Karpathy noted: keeping everything in one file makes the scope manageable and diffs reviewable. When you wake up to 80 experiments, you want readable commit messages, not sprawling multi-file changes.

program.md — Where Humans Do Research

This is the most underappreciated component. program.md is a Markdown file that encodes your research strategy: what to optimize, what to explore, what constraints to respect, and what to avoid. Karpathy describes authoring this file as "programming the program.md" — you write your research direction in natural language, not Python.

A well-written program.md specifies baselines, experimental priorities, VRAM constraints, and failure-handling rules. It even hardcodes the directive: "NEVER STOP. Once the experiment loop has begun, do NOT pause."

The framing is that you are the chief scientist setting the research agenda. The AI agents are junior engineers running the lab overnight in tmux sessions. Your job is strategic; theirs is execution.

The val_bpb Metric: Why It Enables Everything

The evaluation metric is val_bpb — validation bits per byte. This is not an arbitrary choice. Bits per byte is vocabulary-size-independent, which means the agent can change tokenization strategy or model architecture between runs and still get a fair comparison.

This is what makes the ratchet work. If the metric were vocabulary-dependent (like raw cross-entropy loss), changing the tokenizer between experiments would make comparisons meaningless. With val_bpb, every experiment is evaluated on equal footing regardless of what the agent changed.

Lower val_bpb is better. The agent keeps changes that reduce it and reverts changes that increase it. Simple, unambiguous, and resistant to gaming — critical properties when an autonomous agent is making hundreds of decisions without human oversight.
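Under the hood, bits per byte is just the total validation negative log-likelihood converted from nats to bits and divided by the raw byte count of the text. A minimal sketch of the arithmetic (illustrative numbers, not from an actual run):

```python
import math

def val_bpb(total_nll_nats, total_bytes):
    """Bits per byte: total validation NLL converted from nats to bits,
    normalized by the raw byte count of the text (not the token count)."""
    return total_nll_nats / math.log(2) / total_bytes

# A 1000-byte validation text with ~693.15 nats of total NLL scores 1.0 bpb,
# whether that NLL came from 1000 tokens at ~0.693 nats each (fine tokenizer)
# or 500 tokens at ~1.386 nats each (coarse tokenizer): the byte count, and
# therefore the comparison, is unchanged.
print(round(val_bpb(1000 * 0.693147, 1000), 4))  # 1.0
print(round(val_bpb(500 * 1.386294, 1000), 4))   # 1.0
```

Normalizing by tokens instead of bytes would let the agent "improve" the metric simply by switching to a coarser tokenizer, which is why the byte denominator is what makes the ratchet trustworthy.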

Why H100 GPUs: The Math Behind Research Velocity

The fixed 5-minute training budget per experiment is what makes the overnight workflow practical. But the GPU determines how much useful training happens in those 5 minutes — and the differences are substantial:

| GPU | Cycle Time | Experiments Overnight (12h) | Training Steps per Cycle |
|---|---|---|---|
| V100 32GB | 15-25 min | 20-30 | ~150 steps |
| A100 80GB | 8-10 min | 40-45 | ~350 steps |
| H100 SXM5 80GB | 5-7 min | 70-100 | ~500 steps |
| H200 141GB | 4-6 min | 80-110 | ~550 steps |

On an H100, each 5-minute cycle completes roughly 500 training steps — enough for the model to show meaningful signal about whether a change helped or hurt. On a V100, the same wall-clock budget gets you only ~150 steps, where signal-to-noise is much worse and the agent makes less-informed keep/discard decisions.

The 80GB VRAM on H100s also matters. Autoresearch's default configuration peaks at ~45GB memory usage, leaving comfortable headroom for the agent to experiment with larger batch sizes or wider architectures without hitting out-of-memory failures that waste experiment slots.
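The throughput numbers in the table fall out of simple arithmetic once you account for the non-training part of each cycle. In this sketch the overhead values are illustrative assumptions, not measured figures:

```python
def overnight_experiments(train_minutes, overhead_minutes, window_hours=12):
    """Experiments fitting in an overnight window: each cycle is the fixed
    training budget plus per-cycle agent overhead (proposing the change,
    evaluating, git bookkeeping)."""
    cycle = train_minutes + overhead_minutes
    return int(window_hours * 60) // cycle

# H100: the 5-minute budget plus an assumed ~3 min of overhead -> 8-min cycles
print(overnight_experiments(5, 3))   # 90, inside the table's 70-100 range
# Older hardware stretches the non-training parts of the cycle as well
print(overnight_experiments(5, 20))  # 28, inside the V100's 20-30 range
```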

Getting 3x more experiments from the same overnight window is not just a speed improvement — it is a fundamentally different research velocity that enables convergence on real improvements.

While Karpathy runs custom models on H100 clusters, developers who want to experiment with state-of-the-art AI models without managing GPU infrastructure can access 100,000+ models through ModelsLab's API — covering text generation, image synthesis, video, and audio with simple API calls.

Real-World Results: What the Numbers Show

Autoresearch has produced concrete, measurable improvements across multiple deployments:

Karpathy's Initial Overnight Run

In the first session, the agent ran 83 experiments and kept 15 improvements, driving val_bpb from 1.000 down to 0.975. Over a two-day extended run of ~700 experiments, roughly 20 changes were kept. These stacking improvements cut the time-to-GPT-2-quality by 11% on the nanochat leaderboard — from 2.02 hours to 1.80 hours.

Notable agent discoveries included:

  • A missing scalar multiplier in QKnorm
  • Value Embedding regularization that reduced overfitting
  • Banded attention tuning for improved efficiency
  • AdamW beta adjustments (beta2 from 0.95 to 0.98)
  • Weight decay scheduling modifications

These are not trivial hyperparameter tweaks. Several are architectural insights that human researchers had overlooked.
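The beta2 adjustment is a good example of a small change with a clear mechanism. Using the standard rule of thumb that an exponential moving average with decay beta averages over roughly 1/(1 - beta) steps, moving from 0.95 to 0.98 more than doubles the history window of AdamW's second-moment estimate:

```python
def ema_horizon(beta):
    """Approximate averaging horizon, in steps, of an exponential moving
    average with decay `beta` (the standard 1/(1 - beta) rule of thumb)."""
    return 1.0 / (1.0 - beta)

# AdamW's second-moment estimate before and after the agent's kept change:
print(round(ema_horizon(0.95)))  # 20 steps of history
print(round(ema_horizon(0.98)))  # ~50 steps: a smoother variance estimate,
# which plausibly helps when each experiment is only a ~500-step run
```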

Shopify's Production Deployment

Shopify CEO Tobi Lütke adapted autoresearch for an internal query-expansion model. After 37 experiments in 8 hours, the system achieved a 19% improvement in validation score, and the resulting 0.8B-parameter model outperformed the 1.6B model it was replacing. That is a 2x parameter reduction with better performance, driven entirely by autonomous experimentation.

SkyPilot's Multi-GPU Scaling

The SkyPilot team gave Claude Code access to a Kubernetes cluster with 16 GPUs (13 H100s and 3 H200s). Over 8 hours, the agent submitted approximately 910 experiments — a 9x throughput increase over single-GPU runs. It ran factorial grids of 10-13 experiments per wave, catching interaction effects between parameters that sequential search would miss.
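A factorial wave of the kind described can be generated in a few lines; the parameter names and levels below are hypothetical, chosen only to show the shape of a 12-experiment wave:

```python
from itertools import product

# Hypothetical parameter levels for one wave; the real agent picks its own.
levels = {
    "peak_lr": [3e-4, 4e-4],
    "beta2":   [0.95, 0.98],
    "warmup":  [0.05, 0.10, 0.15],
}

wave = [dict(zip(levels, combo)) for combo in product(*levels.values())]
print(len(wave))  # 2 * 2 * 3 = 12 experiments, launched as one wave
# A full factorial wave exposes interaction effects (e.g. a higher peak_lr
# may only help when paired with a longer warmup), which one-at-a-time
# sequential search cannot see.
```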

The agent even developed its own hardware strategy: it learned to use H200s for validation runs while screening initial ideas on H100s. Final result: val_bpb dropped from 1.003 to 0.974 — a 2.87% improvement over baseline.

Red Hat OpenShift Deployment

Red Hat ran 198 experiments on OpenShift AI with H100 GPUs, resulting in 29 kept improvements and a 2.3% improvement in validation loss with zero human intervention.

Setting Up Autoresearch: A Practical Walkthrough

Prerequisites

  • Single NVIDIA GPU (H100 recommended, A100 minimum for practical results)
  • Python 3.10+
  • The uv package manager
  • An LLM API key (Claude or GPT-4 — the agent uses an LLM to decide what changes to propose)

Installation

```bash
git clone https://github.com/karpathy/autoresearch
cd autoresearch
```

Verify that PyTorch can see the GPU:

```bash
python -c "import torch; print(torch.cuda.get_device_name(0))"
```

Configure the Agent

```bash
# Set your LLM API key for the agent's decision-making
export ANTHROPIC_API_KEY="sk-ant-..."
# or
export OPENAI_API_KEY="sk-..."
```

Write Your program.md

The default program.md is intentionally minimal. A strong research specification includes baselines, priorities, constraints, and evaluation criteria:

```markdown
# Research Goal
Reduce val_bpb below 0.980 within 80 experiments.

# Failure Handling
Revert any change increasing val_bpb by more than 0.05 from current best
```

Launch the Overnight Run

```bash
# Create a tmux session
tmux new-session -d -s research
```

To check on the run later, reattach to the session:

```bash
tmux attach -t research
```

With --max-experiments 80 and 7-minute average cycles, the run completes in roughly 9 hours.

Reading Your Morning Results

The agent logs every experiment to results.tsv. A typical morning review looks like this:

```
Experiment 047 | 03:14 AM
Change: Reduced warmup_fraction from 0.10 to 0.05, increased peak_lr to 4e-4
val_bpb_before: 0.9921  val_bpb_after: 0.9808
Delta: -0.0113  ✓ KEPT

Experiment 049 | 03:30 AM
Change: Increased batch_size from 128 to 256
val_bpb_before: 0.9808  val_bpb_after: 0.9800
Delta: -0.0008  ✓ KEPT
```
Expect most experiments to revert. That is normal — in Karpathy's runs, roughly 15-20% of experiments produced improvements. The value is in the cumulative effect of stacking those improvements over 80+ iterations.
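The morning review itself is easy to script. The sketch below parses a results.tsv-style log with Python's csv module; the column names are assumptions for illustration, so check your actual file's header first:

```python
import csv
import io

# Column names here (change, val_bpb, kept) are assumptions for illustration.
sample = (
    "change\tval_bpb\tkept\n"
    "reduce warmup_fraction to 0.05\t0.9808\tyes\n"
    "increase batch_size to 256\t0.9800\tyes\n"
    "swap activation function\t0.9950\tno\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
kept = [r for r in rows if r["kept"] == "yes"]
best = min(float(r["val_bpb"]) for r in rows)
print(f"{len(kept)}/{len(rows)} kept, best val_bpb {best:.4f}")
# Against a real log: rows = list(csv.DictReader(open("results.tsv"), delimiter="\t"))
```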

Limitations and Trade-offs

Autoresearch is powerful but not magic. Understanding its constraints helps you use it effectively:

The ratchet cannot go backward. The system only accepts improvements. This means it cannot make a short-term sacrifice (increasing loss temporarily) to enable a larger future gain. It gets stuck in local optima.

Five minutes masks slow-converging changes. Some architectural modifications need longer training to show their true value. Changes that converge slowly but ultimately outperform will be discarded because they look worse at the 5-minute checkpoint.

Agents are conservative. Current LLMs, shaped by RLHF training, tend to propose safe, incremental changes rather than bold architectural experiments. The "creativity ceiling" means the agent is better at optimization than innovation.

Single-lineage search. Unlike population-based training, autoresearch maintains one lineage. It cannot explore multiple promising directions simultaneously (unless you scale to multiple GPUs with something like SkyPilot).

The Bigger Picture: What Autoresearch Means for ML Research

Autoresearch represents a shift in the researcher's role from experimenter to experimental designer. You no longer spend time babysitting training loops and manually logging hyperparameter changes. Instead, you invest your time in writing better program.md files — defining what questions to ask, not running the experiments yourself.

This is not just a productivity hack. It changes what kinds of research are economically viable. An individual researcher with a single H100 can now explore a configuration space overnight that previously required a team running experiments during business hours for a week.

The community response confirms the demand. Derivatives have appeared for Apple Silicon (MLX port), agent optimization (LangChain's autoresearch-agents), and domain-specific applications beyond LLM training.

For teams that want to integrate AI model capabilities into their products without building custom training infrastructure, platforms like ModelsLab provide API access to over 100,000 AI models — from text-to-image and video generation to LLM chat — letting developers focus on building applications while the infrastructure complexity is handled at the platform level.

Frequently Asked Questions

What hardware do I need to run autoresearch?

You need a single NVIDIA GPU with at least 20GB VRAM. An H100 SXM5 (80GB) is recommended for the best results — it completes training cycles in 5-7 minutes, enabling 70-100 experiments overnight. An A100 works but runs slower (~8-10 minutes per cycle, ~45 experiments overnight). Community forks exist for Apple Silicon (MLX) and AMD GPUs, but the original project targets NVIDIA CUDA.

Which LLM agent works best with autoresearch?

Autoresearch is compatible with Claude, GPT-4, and Codex-based agents. The agent needs to read code, propose modifications, and interpret numerical results. Claude Code and Cursor are commonly used in practice. The choice of agent affects the quality and creativity of proposed experiments — stronger reasoning models tend to produce more impactful changes.

How much does an overnight autoresearch run cost?

On cloud H100 instances, a 12-hour run typically costs $30-$50 depending on spot pricing. You also incur LLM API costs for the agent's decision-making — roughly 12 API calls per hour for 12 hours. Total cost including both GPU and API usage is usually under $80 for a full overnight session.
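Those figures reduce to simple arithmetic. In this sketch the hourly GPU rate and per-call API cost are illustrative assumptions, not quoted prices:

```python
def overnight_cost(hours=12, gpu_rate=3.0, calls_per_hour=12, cost_per_call=0.15):
    """Back-of-envelope overnight cost. gpu_rate ($/hr for a cloud H100) and
    cost_per_call (LLM API cost per agent decision) are assumptions; real
    spot prices and token usage vary."""
    gpu = hours * gpu_rate
    api = hours * calls_per_hour * cost_per_call
    return gpu, api, gpu + api

gpu, api, total = overnight_cost()
print(f"GPU ${gpu:.2f} + API ${api:.2f} = ${total:.2f}")  # GPU $36.00 + API $21.60 = $57.60
```

Even doubling either assumed rate keeps a full session comfortably under the $80 ceiling quoted above.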

Can I use autoresearch for tasks beyond LLM training?

Yes, but with modifications. The core pattern — modify code, run a fixed-duration experiment, evaluate a scalar metric, keep or discard — applies to any ML task with a clear evaluation metric. The community has adapted it for image classification (MobileNet V3), reinforcement learning reward tuning, and agent prompt optimization. The key requirement is a fast, reliable scalar metric that the agent can optimize against.

How does autoresearch compare to traditional hyperparameter search?

Traditional methods like grid search or Bayesian optimization explore a predefined parameter space. Autoresearch is fundamentally different: the agent can modify arbitrary code, not just tune numbers. It can change activation functions, add regularization layers, restructure attention mechanisms, or modify the optimizer itself. The trade-off is that it is less systematic — it relies on the LLM's judgment rather than exhaustive search — but it can discover structural improvements that parameter sweeps would never find.
