Available now on ModelsLab · Language Model

Qwen3-VL-235B-A22B-Instruct-FP8

Vision Meets Reasoning

Process Images. Reason Deeply.

Visual Agent

Navigate GUIs Autonomously

Recognizes GUI elements, understands their functions, and invokes tools to complete tasks.

Spatial Reasoning

Ground 2D and 3D

Judges object positions, viewpoints, and occlusions with precise spatial perception.

Video Analysis

Handle Long Videos

Supports a 262K-token context for hours-long videos with second-level event indexing.

Examples

See what Qwen3-VL-235B-A22B-Instruct-FP8 can create

Copy any prompt below and try it yourself in the playground.

GUI Task

Analyze this screenshot of a web app. Identify the login button, describe its position relative to the header, and suggest how to click it using coordinates.

Spatial Query

Examine this architectural blueprint image. Determine the relative positions of rooms, detect any occlusions, and provide 3D grounding estimates.

Video Summary

Process this 5-minute product demo video. Index key events by second, describe spatial changes in objects, and generate a timeline summary.

Document OCR

Extract all text from this scanned technical diagram. Align text with visual elements, reason about diagram logic, and output structured JSON.

For Developers

A few lines of code.
Vision inference. One call.

ModelsLab handles the infrastructure: fast inference, auto-scaling, and a developer-friendly API. No GPU management needed.

  • Serverless: scales to zero, scales to millions
  • Pay per token, no minimums
  • Python and JavaScript SDKs, plus REST API
import requests

# Call the ModelsLab chat completions endpoint
response = requests.post(
    "https://modelslab.com/api/v7/llm/chat/completions",
    json={
        "key": "YOUR_API_KEY",  # your ModelsLab API key
        "prompt": "",           # prompt text
        "model_id": "",         # model identifier
    },
)
print(response.json())
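As a minimal sketch, the call above can be wrapped in a small helper that assembles the payload and surfaces HTTP errors early. The field names (`key`, `prompt`, `model_id`) follow the snippet above; the function names and error handling here are illustrative, not part of the official SDK.

```python
import requests

API_URL = "https://modelslab.com/api/v7/llm/chat/completions"


def build_payload(api_key: str, prompt: str, model_id: str) -> dict:
    """Assemble the JSON body expected by the endpoint above."""
    return {"key": api_key, "prompt": prompt, "model_id": model_id}


def chat(api_key: str, prompt: str, model_id: str, timeout: float = 60.0) -> dict:
    """POST the payload and return the parsed JSON response."""
    response = requests.post(
        API_URL,
        json=build_payload(api_key, prompt, model_id),
        timeout=timeout,
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()
```

Keeping payload construction separate from the network call makes the request body easy to inspect or log before sending.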

FAQ

Common questions about Qwen3-VL-235B-A22B-Instruct-FP8

Read the docs

Qwen3-VL-235B-A22B-Instruct-FP8 is a 235B-parameter mixture-of-experts (MoE) vision-language model with 22B active parameters, quantized to FP8. It excels at visual reasoning, agent tasks, and long-context video understanding, with a context window of up to 262K tokens.
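As a back-of-the-envelope illustration (not an official figure), FP8 stores one byte per parameter, so the raw weights of a 235B-parameter model occupy roughly 235 GB, while only the 22B active parameters participate in each forward pass:

```python
# Rough FP8 memory arithmetic; illustrative only -- ignores KV cache,
# activations, and runtime overhead.
total_params = 235e9   # total MoE parameters
active_params = 22e9   # parameters active per token
bytes_per_param = 1    # FP8 = 8 bits = 1 byte

weight_gb = total_params * bytes_per_param / 1e9
active_gb = active_params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB weights, ~{active_gb:.0f} GB active per token")
```

This is why MoE plus FP8 is attractive: the full expert pool stays large while the per-token compute footprint stays closer to a 22B dense model.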

FP8 quantization provides high throughput for cost-efficient inference. Providers such as DeepInfra deliver output speeds of 11+ tokens/second. Supports vision input with a 16K-token maximum output.

Features DeepStack for fine-grained visual details and Interleaved-MRoPE for video reasoning. Handles GUI navigation, visual coding, and multimodal STEM tasks. Recognizes broad visual categories accurately.

Serves as a strong option for vision-LLM workloads thanks to its MoE efficiency. Competes with top models on coding and math benchmarks. Open weights allow flexible deployment.

Its native 262K-token context, expandable for books and long videos, enables full recall and precise temporal indexing. Maximum output is 16K tokens per response.

Processes hours-long videos with second-level understanding via enhanced video-dynamics comprehension. Robust positional embeddings support long-horizon reasoning, making it well suited to detailed video-analysis tasks.

Ready to create?

Start generating with Qwen3-VL-235B-A22B-Instruct-FP8 on ModelsLab.