Available now on ModelsLab · Video Generation

Omnihuman-1.5: Avatars Speak Your Words

Build Expressive Videos Fast

Audio Sync

Semantic Expression Matching

Characters match audio rhythm, prosody, and semantics with natural gestures.

Full Control

Unrestricted Motion Camera

Text prompts guide camera moves, actions, and multi-character scenes.

One Call

Portrait to Video

The Omnihuman-1.5 API turns a single image plus audio into a 1080p avatar video.

Examples

See what Omnihuman-1.5 can create

Copy any prompt below and try it yourself in the playground.

Cityscape Talk

Professional in urban office, discussing quarterly results with confident gestures, dynamic camera pan from medium shot to close-up, natural lighting

Tech Demo

Engineer at whiteboard explaining AI architecture, enthusiastic expressions, hand waves syncing to audio, steady tracking shot

Product Pitch

Designer presenting sleek gadget prototype, excited tone with product close-ups, smooth camera zoom, modern studio background

Nature Guide

Explorer in forest trail narrating wildlife facts, calm gestures matching audio, wide establishing shot to medium, golden hour light

For Developers

A few lines of code.
Video avatar. One endpoint.

ModelsLab handles the infrastructure: fast inference, auto-scaling, and a developer-friendly API. No GPU management needed.

  • Serverless: scales to zero, scales to millions
  • Pay per second, no minimums
  • Python and JavaScript SDKs, plus REST API
import requests

response = requests.post(
    "https://modelslab.com/api/v7/video-fusion/image-to-video",
    json={
        "key": "YOUR_API_KEY",
        "prompt": "The camera zoomed in. The woman spoke to the camera, and after finishing, she quickly turned around and ran backward.",
        "init_audio": "https://assets.modelslab.ai/generations/7e1221ae-c5a9-4b1a-96cb-3448cc73c6e3.m4a",
        "init_image": "https://assets.modelslab.ai/generations/8931fb55-905f-4ae8-8924-1b4e583ff789.png",
    },
)
print(response.json())
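
Before sending a request like the one above, it can help to validate the inputs locally. The following is a minimal sketch, not part of the ModelsLab SDK: build_payload and _require_http_url are hypothetical helper names, while the endpoint and the field names (key, prompt, init_image, init_audio) are taken from the sample request.

```python
from urllib.parse import urlparse

# Endpoint copied from the sample request above.
ENDPOINT = "https://modelslab.com/api/v7/video-fusion/image-to-video"


def _require_http_url(name: str, value: str) -> str:
    """Return value unchanged if it is an absolute http(s) URL, else raise."""
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"{name} must be an absolute http(s) URL, got {value!r}")
    return value


def build_payload(api_key: str, prompt: str, image_url: str, audio_url: str) -> dict:
    """Assemble the JSON body using the field names from the sample request.

    Hypothetical helper: it only checks inputs locally and does not call the API.
    """
    if not api_key:
        raise ValueError("api_key must not be empty")
    return {
        "key": api_key,
        "prompt": prompt,
        "init_image": _require_http_url("image_url", image_url),
        "init_audio": _require_http_url("audio_url", audio_url),
    }
```

The resulting dictionary can be passed as the json argument of requests.post(ENDPOINT, ...), exactly as in the sample.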

FAQ

Common questions about Omnihuman-1.5

Read the docs

Omnihuman-1.5 generates video from one image, an audio track, and an optional text prompt. It creates expressive animations with semantic audio sync, and it supports humans, animals, and multi-character scenes.

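Send an image URL, an audio URL, and a prompt to the Omnihuman-1.5 endpoint (the sample request above passes them as init_image, init_audio, and prompt). Processing is asynchronous, and the video URL is available once it completes. Use 720p for speed or 1080p for quality.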
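
Because the result arrives asynchronously, a client typically polls until the video is ready. The loop below is a rough sketch of that pattern only. The response fields "status" and "video_url" and the status values "success", "error", and "failed" are assumptions made for illustration, not the documented Omnihuman-1.5 response schema. The caller supplies fetch_status, a function that performs one status request and returns the parsed JSON.

```python
import time
from typing import Callable


def poll_until_ready(
    fetch_status: Callable[[], dict],
    interval_s: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status repeatedly until it reports a finished video.

    The field names and status values below are hypothetical; adjust them
    to whatever the real API response contains.
    """
    for attempt in range(max_attempts):
        result = fetch_status()
        status = result.get("status")
        if status == "success":
            return result["video_url"]
        if status in ("error", "failed"):
            raise RuntimeError(f"generation failed: {result}")
        # Still processing: wait before the next attempt, but not after the last.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"video not ready after {max_attempts} attempts")
```

Passing sleep as a parameter keeps the loop easy to test without real delays. For production use, a webhook avoids polling altogether.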

Compared with basic lip-sync tools, Omnihuman-1.5 excels at full-body motion and audio semantics, and its main strength is expressive control. Check ModelsLab for similar video APIs.

Videos can run up to 60 seconds at 720p or 30 seconds at 1080p. Audio drives lip sync, expressions, and gestures. Multiple languages are supported, including English, Chinese, and Spanish.
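
The duration limits above can be checked before submitting a job. This is a small sketch that assumes the limits apply to the length of the input audio. pick_resolution is a hypothetical helper, and this page does not say how the chosen resolution is passed to the API, so the function only returns a label.

```python
# Limits stated in the FAQ: 60 seconds at 720p, 30 seconds at 1080p.
MAX_DURATION_S = {"720p": 60.0, "1080p": 30.0}


def pick_resolution(audio_duration_s: float, prefer_quality: bool = True) -> str:
    """Return the resolution label that fits the clip length.

    Assumes the limit is measured against the audio duration.
    """
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    if prefer_quality and audio_duration_s <= MAX_DURATION_S["1080p"]:
        return "1080p"
    if audio_duration_s <= MAX_DURATION_S["720p"]:
        return "720p"
    raise ValueError(
        f"{audio_duration_s:.1f}s exceeds the {MAX_DURATION_S['720p']:.0f}s maximum"
    )
```

For example, a 20-second clip resolves to 1080p by default, a 45-second clip falls back to 720p, and anything over 60 seconds is rejected.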

Yes, multi-character scenes are supported. Specify speakers and background reactions in the prompt, and the model generates coherent interactions using shared attention fusion. This is ideal for dialogue scenes.

The API supports webhooks, polling, and batch variations. Use TTL settings for content management. It scales for customer avatars or content tools.

Ready to create?

Start generating with Omnihuman-1.5 on ModelsLab.