Moonshot AI released Kimi K2.5 on January 26, 2026, and it landed at the top of SWE-Bench Verified with a 76.8% score — making it the most capable open-weight coding model available today. It costs $0.60 per million input tokens. That combination is hard to ignore if you're building LLM-powered applications.
This guide covers what Kimi K2.5 is, how to use it via API, and where it fits in your LLM stack.
What Is Kimi K2.5?
Kimi K2.5 is a 1-trillion-parameter Mixture of Experts (MoE) language model from Moonshot AI. Despite its massive total parameter count, only approximately 32 billion parameters are active for any given request — which is how Moonshot keeps inference costs competitive while delivering frontier-class performance.
Key specifications:
- Architecture: Mixture of Experts (MoE), 1T total / ~32B active
- Context window: 256K tokens
- SWE-Bench Verified: 76.8% (top open-weight coding score)
- Tool calling: Yes — supports multi-step tool calls and agentic workflows
- Open weights: Available on Hugging Face (moonshotai/Kimi-K2.5)
- API access: Available via Moonshot's platform (platform.moonshot.ai) and compatible inference providers
The model supports Interleaved Thinking and multi-step tool calling — the same design used in the K2 Thinking variant. This makes it well-suited for agent pipelines that require sequential reasoning across multiple tool calls.
Kimi K2.5 Pricing
Moonshot AI's official API pricing for Kimi K2.5:
- Input tokens: $0.60 per million
- Output tokens: $2.50–$3.00 per million
For context, GPT-4o runs $2.50/M input and $10/M output. Claude 3.5 Sonnet is $3.00/M input and $15/M output. At $0.60/M input, Kimi K2.5 delivers frontier-level coding performance at roughly 4–5x lower input cost than comparable closed models.
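The price gap is easiest to see per request. The sketch below computes rough per-request costs from the numbers quoted above (the Kimi output price uses the top of the quoted $2.50–$3.00 range; the comparison model names are just dictionary keys, not API model IDs):

```python
# Prices are (input $/M tokens, output $/M tokens), as quoted in this article.
PRICES = {
    "kimi-k2.5": (0.60, 3.00),
    "gpt-4o": (2.50, 10.00),
    "claude-3.5-sonnet": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of a single request."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a 100K-token codebase prompt with a 4K-token reply
for model in PRICES:
    print(f"{model}: ${request_cost(model, 100_000, 4_000):.4f}")
```

For that input-heavy workload, Kimi K2.5 comes out to about $0.07 per request versus $0.29 for GPT-4o and $0.36 for Claude 3.5 Sonnet.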
Kimi K2.5 API Integration
Kimi K2.5 uses an OpenAI-compatible API format, which means any code you've written for GPT-4 or Claude can be adapted with minimal changes. The base URL is https://api.moonshot.ai/v1 and the model ID is moonshot-v1-kimi-k2.
Basic API Call (Python)
```python
import openai

client = openai.OpenAI(
    api_key="your_moonshot_api_key",
    base_url="https://api.moonshot.ai/v1"
)

response = client.chat.completions.create(
    model="moonshot-v1-kimi-k2",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that parses a JSON API response and handles common error cases."
        }
    ],
    max_tokens=2000,
    temperature=0.1
)

print(response.choices[0].message.content)
```
Tool Calling with Kimi K2.5
Kimi K2.5 supports multi-step tool calling natively, which makes it effective for agent workflows. Here's an example of a function-calling setup:
```python
import openai
import json

client = openai.OpenAI(
    api_key="your_moonshot_api_key",
    base_url="https://api.moonshot.ai/v1"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_codebase",
            "description": "Search for a pattern in the project codebase",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query or pattern to find"
                    },
                    "file_type": {
                        "type": "string",
                        "description": "Optional: filter by file extension (e.g., .py, .js)"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="moonshot-v1-kimi-k2",
    messages=[
        {
            "role": "user",
            "content": "Find all places in the codebase where we're making HTTP requests without error handling."
        }
    ],
    tools=tools,
    tool_choice="auto"
)

# Handle tool call response
message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")
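The snippet above only prints the requested call. In a real agent loop you execute the tool locally, then append the result as a `tool`-role message (keyed by `tool_call_id`, per the OpenAI-compatible format) so the model can continue reasoning. A minimal dispatch step might look like this; `search_codebase` here is a hypothetical stand-in for your real search backend:

```python
import json

def search_codebase(query: str, file_type: str = None) -> list:
    """Stand-in local implementation; replace with ripgrep, a code index, etc."""
    fake_index = {
        "requests.get": ["services/http.py", "scripts/fetch.js"],
    }
    hits = fake_index.get(query, [])
    if file_type:
        hits = [h for h in hits if h.endswith(file_type)]
    return hits

TOOL_IMPLS = {"search_codebase": search_codebase}

def run_tool_call(call_id: str, name: str, arguments_json: str) -> dict:
    """Execute one tool call and format the result as a tool-role message."""
    args = json.loads(arguments_json)
    result = TOOL_IMPLS[name](**args)
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}

# Append the assistant message that requested the call, plus this tool message,
# to `messages`, then call the API again; repeat until no further tool_calls.
msg = run_tool_call("call_1", "search_codebase",
                    '{"query": "requests.get", "file_type": ".py"}')
print(msg)
```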
Long Context Usage (256K Tokens)
The 256K context window is particularly useful for large codebase analysis. You can feed entire repositories or lengthy documentation in a single request:
```python
import openai

client = openai.OpenAI(
    api_key="your_moonshot_api_key",
    base_url="https://api.moonshot.ai/v1"
)

# Load a large codebase file
with open("large_codebase.py", "r") as f:
    codebase_content = f.read()

response = client.chat.completions.create(
    model="moonshot-v1-kimi-k2",
    messages=[
        {
            "role": "system",
            "content": "You are a senior code reviewer. Analyze the provided code and identify security vulnerabilities, performance issues, and architectural problems."
        },
        {
            "role": "user",
            "content": f"Review this codebase:\n\n{codebase_content}"
        }
    ],
    max_tokens=4000
)

print(response.choices[0].message.content)
```
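Before sending a large prompt, it's worth sanity-checking that it fits in the window. This sketch uses the common ~4-characters-per-token heuristic, which is an assumption — the real count depends on Kimi's tokenizer, so leave headroom:

```python
CONTEXT_WINDOW = 256_000  # tokens, per Kimi K2.5's spec

def rough_token_count(text: str) -> int:
    """Crude estimate: ~4 characters per token for English text and code."""
    return len(text) // 4

def fits_in_context(prompt: str, reserved_output_tokens: int = 4_000) -> bool:
    """Check that the prompt leaves room for the reply inside the window."""
    return rough_token_count(prompt) + reserved_output_tokens <= CONTEXT_WINDOW

code = "x = 1\n" * 50_000  # ~300K characters of source
print(fits_in_context(code))
```

If the check fails, split the codebase into chunks and review them in separate requests rather than truncating silently.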
Kimi Code CLI
Moonshot AI ships a dedicated CLI for Kimi K2.5 called Kimi Code, designed for terminal-based coding workflows. It's comparable to Claude Code or Cursor's terminal mode but powered by K2.5's agentic capabilities.
```shell
# Install Kimi Code CLI
npm install -g @moonshot/kimi-code

# Set API key
export MOONSHOT_API_KEY=your_api_key

# Start a coding session
kimi-code --project /path/to/your/project

# Single-shot code generation
kimi-code run "Add input validation to all API endpoints in routes/api.js"
```
SWE-Bench Performance: What 76.8% Actually Means
SWE-Bench Verified measures a model's ability to resolve real GitHub issues from open-source projects. A "pass" means the model made code changes that caused the project's test suite to pass. The benchmark is harder than it sounds — you're fixing actual bugs in real codebases, not generating synthetic examples.
At 76.8%, Kimi K2.5:
- Outperforms Claude 3.7 Sonnet (70.3%) on this benchmark
- Surpasses GPT-4o (38.8%) and GPT-4.5 (38.0%)
- Competes with closed frontier models while remaining open-weight
The practical implication: for automated code fixing, pull request generation, and bug triage workflows, K2.5 delivers better results than most closed models at a fraction of the cost.
Running Kimi K2.5 Locally (Self-Hosted)
As an open-weight model, Kimi K2.5 can be self-hosted using the weights published on Hugging Face. The full 1T-parameter model requires significant GPU memory (FP8 quantization is recommended for deployment):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "moonshotai/Kimi-K2.5"

# trust_remote_code may be required if the checkpoint ships custom model code
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Explain the difference between async/await and Promises in JavaScript."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For production self-hosting, NVIDIA recommends running K2.5 on Hopper architecture (H100/H200). Blackwell support is available but requires a separate deployment configuration.
When to Use Kimi K2.5 vs Other LLMs
Choose Kimi K2.5 when:
- Cost efficiency matters: coding tasks that feed large amounts of context (long prompts, whole files, repository dumps) benefit most from the low input price
- Agentic workflows: Multi-step tool calling across code analysis, search, and modification tasks
- Large codebase analysis: The 256K context handles entire repositories in a single pass
- Open-weight requirement: Data privacy constraints or on-premise deployment needs
Consider alternatives when:
- Output quality on non-coding tasks is the priority (multimodal or creative writing)
- You need a model already integrated into a specific platform or IDE
- You're using Anthropic's tool ecosystem where Claude's native integrations matter
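The criteria above can be encoded as a simple routing function. This is a sketch, not a recommendation engine — the non-Kimi model ID is an illustrative placeholder for whatever alternative you run in production:

```python
def pick_model(task_type: str, context_tokens: int, open_weights_required: bool) -> str:
    """Route a request to a model based on the selection criteria above."""
    if open_weights_required:
        return "moonshot-v1-kimi-k2"  # self-hostable open weights
    if task_type in {"code-review", "bug-fix", "agentic-coding"}:
        return "moonshot-v1-kimi-k2"  # top open-weight SWE-Bench score, low input cost
    if context_tokens > 128_000:
        return "moonshot-v1-kimi-k2"  # 256K context window
    if task_type in {"creative-writing", "multimodal"}:
        return "general-purpose-frontier-model"  # placeholder for your alternative
    return "moonshot-v1-kimi-k2"
```

In an A/B setup, the same function doubles as the place to log which model handled which request.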
Accessing Multiple LLMs Including Kimi K2.5 via a Unified API
If you're running LLM-heavy applications that need to switch between models — or A/B test Kimi K2.5 against other frontier models — a unified API layer removes the integration overhead of managing multiple provider SDKs.
ModelsLab's API platform gives developers access to 200+ AI models across image generation, video, audio, and LLM endpoints from a single API key. Whether you're comparing open-weight LLMs like K2.5 and Qwen3.5 against hosted options, or building pipelines that combine LLM reasoning with image/video generation, a single integration point reduces maintenance burden significantly.
The ModelsLab API uses OpenAI-compatible endpoints, so switching models in your existing code is a one-line change.
Summary
Kimi K2.5 is the most cost-effective frontier-class coding LLM available today. At $0.60/M input tokens with 76.8% SWE-Bench Verified performance, it's the practical choice for any developer building automated code review, bug-fixing, or agentic coding pipelines who doesn't want to pay GPT-4 pricing for GPT-4-level results.
The open-weight release means self-hosting is an option for teams with data sovereignty requirements. The OpenAI-compatible API format means integration into existing infrastructure takes minutes, not days.
If you haven't benchmarked it against your current LLM provider, the pricing alone makes it worth testing.
