Available now on ModelsLab · Language Model

Inception: Mercury
Reasoning at 1000 Tokens/Sec

Try Inception: Mercury · API Documentation

Build Faster AI Apps

Diffusion Core

Parallel Token Generation

Refines groups of tokens in parallel for a reported 5-10x speedup over autoregressive LLMs.
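To make the idea concrete, here is a toy Python sketch of confidence-based parallel unmasking, a common decoding scheme for masked diffusion language models. It is not Mercury's actual algorithm, which this page does not specify. `toy_denoiser` is a hypothetical stand-in for a trained model: it reads the target sentence and assigns random confidences, so only the scheduling is realistic. Decoding starts from a fully masked sequence and commits several tokens per step.

```python
import random

MASK = "<mask>"


def toy_denoiser(seq, target):
    """Hypothetical stand-in for a trained dLLM: proposes a token and a
    confidence for every masked position at once."""
    proposals = {}
    for i, tok in enumerate(seq):
        if tok == MASK:
            # A real model would predict from context; the toy reads the
            # target and assigns a random confidence score.
            proposals[i] = (target[i], random.random())
    return proposals


def parallel_decode(target, tokens_per_step=3, seed=0):
    """Start fully masked, then commit the most confident proposals each step."""
    random.seed(seed)
    seq = [MASK] * len(target)
    steps = 0
    while MASK in seq:
        proposals = toy_denoiser(seq, target)
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:tokens_per_step]:
            seq[i] = proposals[i][0]
        steps += 1
        print(f"step {steps}: {' '.join(seq)}")
    return seq, steps


if __name__ == "__main__":
    parallel_decode("the cat sat on the mat and purred .".split())
```

With 9 tokens and 3 committed per step, decoding finishes in 3 refinement passes. A left-to-right autoregressive decoder would need 9 sequential steps, one per token.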

Tunable Reasoning

Low to High Effort

Set the reasoning level anywhere from instant to high, trading depth for latency where it matters, such as in voice agents.

128K Context

Native Tool Use

Supports schema-aligned JSON output and tool integration, so it works as a drop-in replacement for an existing LLM.
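Schema-aligned output still deserves a client-side check before your code relies on it. Below is a minimal sketch using only the standard library. It assumes a hypothetical reply shape, `{"libraries": [{"name": ..., "description": ...}]}`, for a request like the top-5-libraries JSON Schema example prompt below. A full validator such as the `jsonschema` package would also work.

```python
import json


def parse_libraries(raw: str) -> list:
    """Parse a model reply and check it has the expected (hypothetical) shape:
    {"libraries": [{"name": str, "description": str}, ...]}
    """
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict) or not isinstance(data.get("libraries"), list):
        raise ValueError("expected an object with a 'libraries' list")
    for item in data["libraries"]:
        if not (
            isinstance(item, dict)
            and isinstance(item.get("name"), str)
            and isinstance(item.get("description"), str)
        ):
            raise ValueError(f"malformed entry: {item!r}")
    return data["libraries"]


reply = '{"libraries": [{"name": "pandas", "description": "Labeled tabular data structures"}]}'
print(parse_libraries(reply)[0]["name"])  # pandas
```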

Examples

See what Inception: Mercury can create

Copy any prompt below and try it yourself in the playground.

Code Review

“Review this Python function for bugs and optimize for speed: def fibonacci(n): if n <= 1: return n return fibonacci(n-1) + fibonacci(n-2)”

JSON Schema

“Generate a schema-aligned JSON response listing top 5 Python libraries for data analysis with descriptions.”

Agent Workflow

“Plan a retrieval-augmented generation workflow using vector search and tool calls for querying customer data.”

Reasoning Chain

“High reasoning: Solve this logic puzzle step by step. Three houses stand in a row; their owners, A, B, and C, drink water, milk, and tea and own a cat, a dog, and a bird.”
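The Code Review prompt above targets a classic problem: the naive recursive `fibonacci` recomputes the same subproblems and runs in exponential time. Below is one fix a reviewer, human or model, might suggest. It is an illustration, not Mercury's actual output.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fibonacci(n: int) -> int:
    """Memoized recursion: each n is computed once, so O(n) calls overall."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


def fibonacci_iter(n: int) -> int:
    """Iterative O(n) time and O(1) space; avoids Python's recursion limit."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print(fibonacci(30), fibonacci_iter(30))  # 832040 832040
```

The memoized version keeps the original's shape. For large `n`, the iterative version is safer because deep recursion can raise `RecursionError`.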

For Developers

A few lines of code.
Inference. Three lines.

ModelsLab handles the infrastructure: fast inference, auto-scaling, and a developer-friendly API. No GPU management needed.

Serverless: scales to zero, scales to millions
Pay per token, no minimums
Python and JavaScript SDKs, plus a REST API

API Documentation

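import requests

# Send a chat completion request to the ModelsLab LLM endpoint.
response = requests.post(
    "https://modelslab.com/api/v7/llm/chat/completions",
    json={
        "key": "YOUR_API_KEY",  # your ModelsLab API key
        "prompt": "Summarize how diffusion language models generate text.",
        "model_id": "",  # fill in Mercury's model ID from the API documentation
    },
    timeout=60,  # avoid hanging indefinitely on network issues
)
response.raise_for_status()  # raise on 4xx/5xx responses
print(response.json())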

FAQ

Common questions about Inception: Mercury

Read the docs

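What is Inception: Mercury?

Inception describes Mercury as the first commercial-scale diffusion large language model (dLLM). Rather than generating text one token at a time, it uses discrete diffusion to generate tokens in parallel. Inception reports that it runs 5-10x faster than speed-optimized models such as GPT-4.1 Nano and Claude 3.5 Haiku.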

How fast is Inception: Mercury?

It generates over 1,000 tokens per second on NVIDIA H100 GPUs, up to 10x faster than speed-optimized autoregressive LLMs. That speed suits latency-sensitive applications such as voice agents and chatbots.

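What makes Inception: Mercury unique?

It uses a coarse-to-fine diffusion process: generation starts from a noisy draft of the whole output, which the model refines over several steps, updating many tokens in parallel at each step. Inception reports that it matches frontier-model quality at lower cost. Reasoning effort is tunable.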

Does Inception: Mercury support tool use?

Yes. It supports native tool use and schema-aligned JSON output, so it can serve as a drop-in replacement in RAG and agentic workflows. Its context window is 128K tokens.
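On the client side, tool use usually means parsing the model's tool call and running a local function. This page does not document the exact tool-call format Mercury returns through ModelsLab. The sketch below therefore assumes an OpenAI-style object, `{"name": ..., "arguments": "<JSON string>"}`. `lookup_customer` and its data are hypothetical.

```python
import json


def lookup_customer(customer_id: str) -> dict:
    """Hypothetical local tool, e.g. a customer-data lookup in a RAG workflow."""
    fake_db = {"c42": {"name": "Ada", "plan": "pro"}}  # illustrative data only
    return fake_db.get(customer_id, {})


# Registry mapping tool names the model may emit to local functions.
TOOLS = {"lookup_customer": lookup_customer}


def dispatch_tool_call(tool_call: dict) -> str:
    """Run one tool call and return a JSON string to send back to the model.

    Assumes an OpenAI-style shape: {"name": ..., "arguments": "<JSON string>"}.
    """
    name = tool_call["name"]
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool {name!r}"})
    args = json.loads(tool_call["arguments"])
    return json.dumps(TOOLS[name](**args))


call = {"name": "lookup_customer", "arguments": '{"customer_id": "c42"}'}
print(dispatch_tool_call(call))  # {"name": "Ada", "plan": "pro"}
```

Returning errors as JSON lets the model see that a call failed and retry, instead of the client raising mid-conversation.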

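Where can I access Inception: Mercury?

It is available on ModelsLab and through Inception's own API endpoint. For the latter, supply your key via INCEPTION_API_KEY. Mercury Coder, the code-focused variant, can be tried in the playground.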

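How does Inception: Mercury perform on benchmarks?

It reportedly sets speed records at over 1,000 tokens/sec, versus about 200 tokens/sec for other models. It performs on par with Claude Haiku 4.5 in reasoning and has lower inference cost.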
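To put the quoted throughputs in perspective, a reply's decode time is its length divided by tokens per second. Here is a quick sketch that assumes a hypothetical 500-token reply and ignores network latency and time to first token:

```python
def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Pure decode time; ignores network latency and time to first token."""
    return num_tokens / tokens_per_sec


reply_tokens = 500  # hypothetical reply length
for label, tps in [("Mercury (reported)", 1000), ("other models (reported)", 200)]:
    print(f"{label}: {generation_seconds(reply_tokens, tps):.1f} s")
# Mercury (reported): 0.5 s
# other models (reported): 2.5 s
```

For a voice agent, that difference separates a near-instant reply from a noticeable pause.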

Ready to create?

Start generating with Inception: Mercury on ModelsLab.

Try Inception: Mercury · API Documentation
