Available now on ModelsLab · Language Model

Inception: Mercury

Reasoning at 1000 Tokens/Sec

Build Faster AI Apps

Diffusion Core

Parallel Token Generation

Refines groups of tokens simultaneously, running 5-10x faster than autoregressive LLMs.

Tunable Reasoning

Low to High Effort

Set reasoning levels from instant to high for optimized latency in voice agents.
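
As a rough illustration of picking a reasoning level per request, the sketch below attaches a level to a request body. The body fields `key`, `prompt`, and `model_id` follow the developer snippet on this page; the field name `reasoning_effort` is a hypothetical placeholder, and only the level names mentioned on this page are used, so check the API docs for the real parameter and accepted values.

```python
# Levels named on this page; the real API may accept others.
REASONING_LEVELS = ("instant", "low", "high")


def build_request(prompt: str, model_id: str, effort: str = "low") -> dict:
    """Return a request body with a reasoning level attached.

    NOTE: "reasoning_effort" is a hypothetical field name used for
    illustration only; consult the API docs for the real parameter.
    """
    if effort not in REASONING_LEVELS:
        raise ValueError(f"effort must be one of {REASONING_LEVELS}, got {effort!r}")
    return {
        "key": "YOUR_API_KEY",
        "prompt": prompt,
        "model_id": model_id,
        "reasoning_effort": effort,  # hypothetical parameter
    }


# A voice agent would likely favour the fastest level to keep turn latency low.
voice_request = build_request("What is the weather like?", "YOUR_MODEL_ID", effort="instant")
```

Validating the level before sending keeps a typo from silently falling back to a slower default.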

128K Context

Native Tool Use

Supports schema-aligned JSON output and tool integration, so it can serve as a drop-in LLM replacement.
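
Even with schema-aligned output, callers usually verify a reply before using it. The following is a minimal sketch of that check using only the standard library; the helper name `parse_structured_reply` and the sample reply are made up for illustration and are not real model output.

```python
import json


def parse_structured_reply(raw: str, required: dict) -> dict:
    """Parse a JSON reply and check that required keys exist with the right types."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"{key} should be of type {expected_type.__name__}")
    return data


# Hand-written stand-in for a model reply; not real model output.
sample_reply = '{"libraries": [{"name": "pandas", "description": "Tabular data analysis"}]}'
parsed = parse_structured_reply(sample_reply, {"libraries": list})
```

For stricter validation, a full JSON Schema validator could replace the key and type check shown here.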

Examples

See what Inception: Mercury can create

Copy any prompt below and try it yourself in the playground.

Code Review

Review this Python function for bugs and optimize for speed:

    def fibonacci(n):
        if n <= 1:
            return n
        return fibonacci(n-1) + fibonacci(n-2)

JSON Schema

Generate a schema-aligned JSON response listing top 5 Python libraries for data analysis with descriptions.

Agent Workflow

Plan a retrieval-augmented generation workflow using vector search and tool calls for querying customer data.

Reasoning Chain

High reasoning: Solve this logic puzzle step-by-step: Three houses in a row, owners A B C drink water milk tea, own cat dog bird.

For Developers

Inference in a few lines of code.

ModelsLab handles the infrastructure: fast inference, auto-scaling, and a developer-friendly API. No GPU management needed.

  • Serverless: scales to zero, scales to millions
  • Pay per token, no minimums
  • Python and JavaScript SDKs, plus REST API
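
import requests

# Send a chat completion request to the ModelsLab endpoint.
response = requests.post(
    "https://modelslab.com/api/v7/llm/chat/completions",
    json={
        "key": "YOUR_API_KEY",  # your ModelsLab API key
        "prompt": "",           # the prompt text to send
        "model_id": "",         # the ID of the model to run
    },
)
print(response.json())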
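
If the `requests` package is not installed, the same call can be sketched with the standard library. The endpoint and body fields are copied from the snippet above; the helper names, the 30-second timeout, and the `YOUR_MODEL_ID` placeholder are assumptions made for illustration.

```python
import json
import urllib.request

API_URL = "https://modelslab.com/api/v7/llm/chat/completions"


def build_chat_request(api_key: str, prompt: str, model_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST request with a JSON body."""
    body = json.dumps({"key": api_key, "prompt": prompt, "model_id": model_id})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON reply; HTTP errors raise an exception."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


chat_request = build_chat_request("YOUR_API_KEY", "Say hello.", "YOUR_MODEL_ID")
# reply = send(chat_request)  # uncomment once a real key and model ID are set
```

Separating request construction from sending makes the payload easy to inspect or log before any network traffic happens.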

FAQ

Common questions about Inception: Mercury

Read the docs

Inception: Mercury is the first diffusion large language model (dLLM). It uses discrete diffusion to generate tokens in parallel, and it runs 5-10x faster than GPT-4.1 Nano or Claude 3.5 Haiku.

Mercury achieves over 1000 tokens per second on NVIDIA H100 GPUs, up to 10x faster than speed-optimized autoregressive LLMs. That throughput enables low-latency voice agents and chatbots.

Mercury employs a coarse-to-fine diffusion process that refines tokens in parallel. It matches frontier performance at lower cost and supports tunable reasoning levels.
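
To give a feel for why parallel refinement needs fewer passes than one-token-at-a-time decoding, here is a deliberately simplified toy. It is not Mercury's algorithm: there is no model, the "prediction" is just a known target sequence, and the fixed number of positions filled per pass is an assumption chosen for illustration.

```python
MASK = "_"


def refine_in_parallel(target: list, tokens_per_pass: int) -> tuple:
    """Fill a fully masked draft, revealing up to `tokens_per_pass` positions per pass.

    Returns the finished draft and the number of passes it took.
    """
    if tokens_per_pass < 1:
        raise ValueError("tokens_per_pass must be at least 1")
    draft = [MASK] * len(target)
    passes = 0
    while MASK in draft:
        masked = [i for i, tok in enumerate(draft) if tok == MASK]
        # Every position chosen in this pass is filled in the same step.
        for i in masked[:tokens_per_pass]:
            draft[i] = target[i]
        passes += 1
    return draft, passes


sentence = "the quick brown fox jumps over the dog".split()  # 8 tokens
_, sequential_passes = refine_in_parallel(sentence, tokens_per_pass=1)  # one token per pass
_, parallel_passes = refine_in_parallel(sentence, tokens_per_pass=4)    # four tokens per pass
```

With eight tokens, the one-per-pass loop takes eight passes and the four-per-pass loop takes two; real speedups depend on the model and hardware rather than on this arithmetic alone.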

Yes. Mercury offers native tool use and schema-aligned JSON output, works as a drop-in replacement in RAG and agentic workflows, and has a 128K context window.

Mercury is available via the Inception API endpoint. Use INCEPTION_API_KEY for integration, and test Mercury Coder in the playground.
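
One common way to handle a key such as INCEPTION_API_KEY is to read it from the environment rather than hard-coding it. The helper below is a small sketch of that pattern; the function name `load_api_key` and the dummy value `demo-key` are illustrative assumptions.

```python
import os


def load_api_key(env=os.environ, name: str = "INCEPTION_API_KEY") -> str:
    """Return the API key from the given mapping, failing loudly if it is absent."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key


# Demonstrated with an explicit mapping so the example runs without a real key.
demo_key = load_api_key({"INCEPTION_API_KEY": "demo-key"})
```

In an application, calling `load_api_key()` with no arguments reads from the process environment.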

Mercury sets records at 1000+ tokens/sec versus 200 for other models, is on par with Claude 4.5 Haiku in reasoning, and has a lower inference cost.
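
The throughput figures above translate into response time with simple division. The sketch below works that out for a 500-token reply, which is an illustrative length rather than a measured one; it covers generation time only and ignores network latency and time to first token.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate `num_tokens` at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


reply_tokens = 500  # illustrative reply length, not a measured value

fast = generation_seconds(reply_tokens, 1000)  # 0.5 seconds at 1000 tokens/sec
slow = generation_seconds(reply_tokens, 200)   # 2.5 seconds at 200 tokens/sec
speedup = slow / fast                          # 5.0
```

For a voice agent, the difference between half a second and two and a half seconds per reply is what makes a conversation feel responsive.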

Ready to create?

Start generating with Inception: Mercury on ModelsLab.