Running multiple large language models on your own hardware is increasingly practical, but managing them is another story. You might want a coding model for development, a general-purpose chat model for Q&A, and a small fast model for embeddings -- all accessible through one endpoint without manually stopping and starting servers. That is exactly the problem llama-swap solves.
llama-swap is an open-source proxy server written in Go that sits in front of your local inference servers (llama.cpp, vLLM, TabbyAPI, and others) and hot-swaps between models on demand. When a request arrives, llama-swap reads the model field, starts the right backend if it is not already running, and routes the request. No restarts, no manual intervention.
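The routing idea is easy to picture in a few lines of Python. This is an illustrative sketch of the logic just described, not llama-swap's actual Go implementation; the `route` helper and its data shapes are made up for the example:

```python
import json

def route(models, request_body, running):
    """Pick (and 'start') the backend for the model named in an OpenAI-style request."""
    name = json.loads(request_body)["model"]
    if name not in models:
        raise KeyError(f"unknown model: {name}")
    if name not in running:
        running.add(name)  # in llama-swap this is where the configured cmd gets launched
    return models[name]["cmd"]

# One model definition, nothing running yet
models = {"qwen3-8b": {"cmd": "llama-server --port ${PORT} --model /models/qwen3-8b.gguf"}}
running = set()
cmd = route(models, '{"model": "qwen3-8b", "messages": []}', running)
```

The real proxy also health-checks the backend and forwards the HTTP body, but the decision it makes per request is essentially this lookup.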
With over 3,000 stars on GitHub and an active release cadence (v201 shipped April 2026), it has become a go-to tool for developers running local LLMs in production and development environments alike.
Why llama-swap?
Before diving into setup, it is worth understanding where llama-swap fits relative to other tools.
| Feature | llama-swap | Ollama | LM Studio |
|---|---|---|---|
| Hot-swap between models | Yes, automatic | Limited | No |
| Backend agnostic | Yes (llama.cpp, vLLM, etc.) | Own runner only | Own runner only |
| OpenAI API compatible | Full support | Yes | Yes |
| Anthropic API compatible | Yes | No | No |
| Run multiple models simultaneously | Yes (groups) | Partial | No |
| GUI included | Web UI | No (CLI only) | Full desktop app |
| Dependencies | Zero (single Go binary) | Single binary | Desktop app |
| Custom inference flags | Full control | Limited | Limited |
| Docker/Podman support | Native | Via wrapper | No |
llama-swap is the right choice when you need fine-grained control over how each model is served, want to mix backends (for example, llama.cpp for small models and vLLM for large ones), or need a headless server that runs on boot.
Prerequisites
Before installing llama-swap, make sure you have:
- A machine with enough RAM or VRAM for your target models (8GB minimum for small models, 24GB+ for larger ones)
- An inference server installed -- llama.cpp (llama-server) is the most common choice
- One or more GGUF model files downloaded (from Hugging Face or similar sources)
- Basic familiarity with the command line
Installation
llama-swap offers several installation methods. Choose the one that fits your platform.
Homebrew (macOS and Linux)
```bash
brew tap mostlygeek/llama-swap
brew install llama-swap
```
WinGet (Windows)
```bash
winget install llama-swap
```
Docker
Docker images are published for multiple platforms including CUDA, Vulkan, Intel, and CPU-only variants. These images bundle both llama-swap and llama-server.
```bash
# Pull the image for your platform
docker pull ghcr.io/mostlygeek/llama-swap:cuda    # NVIDIA GPU
docker pull ghcr.io/mostlygeek/llama-swap:vulkan  # AMD/Intel GPU
docker pull ghcr.io/mostlygeek/llama-swap:cpu     # CPU only
```
Pre-built Binaries
Download binaries for Linux, macOS, Windows, and FreeBSD from the GitHub releases page.
Building from Source
Requires Go and Node.js (for the web UI).
```bash
git clone https://github.com/mostlygeek/llama-swap.git
cd llama-swap
make clean all
# Binary will be in build/
```
Your First Configuration
llama-swap uses a single YAML configuration file. Here is the minimum viable setup:
```yaml
models:
  qwen3-8b:
    cmd: llama-server --port ${PORT} --model /models/qwen3-8b-q4_k_m.gguf
```
That is it: three lines of YAML. The ${PORT} macro is assigned automatically by llama-swap, so there are no port conflicts.
Save this as config.yaml and start llama-swap:
```bash
llama-swap --config config.yaml --listen localhost:8080
```
Now you can send requests to http://localhost:8080 using the standard OpenAI API format:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-8b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
Multi-Model Configuration
The real power of llama-swap emerges when you configure multiple models. Here is a practical configuration for a developer workstation:
```yaml
healthCheckTimeout: 120
logLevel: info

macros:
  # Adjust these paths for your system
  llama-server-path: /usr/local/bin/llama-server
  models-dir: /models

models:
  nomic-embed:
    cmd: |
      ${llama-server-path} --port ${PORT}
      --model ${models-dir}/nomic-embed-text-v1.5.Q8_0.gguf
      --embedding
      --ctx-size 2048
    ttl: 300
```
Let us break down the key settings:
- macros: Reusable snippets that reduce duplication across model definitions. Define paths, common flags, and context sizes once.
- cmd: The full command to start the inference server. Supports multi-line YAML for readability. The ${PORT} macro is required for automatic port assignment.
- ttl: Time-to-live in seconds. The model is automatically unloaded after this period of inactivity, freeing up RAM/VRAM for the next model.
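The ttl mechanic is simple to reason about. The helper below is illustrative only (not part of llama-swap); it captures the idle-timeout rule the setting implements:

```python
import time

def should_unload(last_used, ttl, now=None):
    """True once a model has been idle longer than its ttl (in seconds)."""
    now = time.monotonic() if now is None else now
    return (now - last_used) > ttl

# With ttl: 300, a model idle for 400 seconds is due for unloading,
# while one used 100 seconds ago stays resident.
due = should_unload(last_used=0, ttl=300, now=400)
resident = not should_unload(last_used=0, ttl=300, now=100)
```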
Running Multiple Models Simultaneously with Groups
By default, llama-swap runs one model at a time -- loading a new model unloads the previous one. The groups feature changes this, allowing you to run several models concurrently.
This is useful when you need a chat model and an embedding model running at the same time, or when you want a coding assistant alongside a general-purpose model.
```yaml
groups:
  parallel-small:
    swap: false
    members:
      - nomic-embed
      - qwen3-8b-chat
  heavy-models:
    swap: true
    members:
      - deepseek-coder-33b
      - llama3-70b
```
With swap: false, all members of the parallel-small group can run at the same time without unloading each other. The heavy-models group uses swap: true, meaning only one of its members runs at a time.
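To make the semantics concrete, here is a deliberately simplified simulation of the intra-group behavior described above (it ignores cross-group interactions and is not llama-swap code):

```python
def load_model(groups, running, name):
    """Simulate group semantics: loading a member of a swap: true
    group first evicts that group's other running members."""
    for group in groups.values():
        if name in group["members"] and group["swap"]:
            running -= set(group["members"])  # unload siblings
    running.add(name)
    return running

groups = {
    "parallel-small": {"swap": False, "members": ["nomic-embed", "qwen3-8b-chat"]},
    "heavy-models": {"swap": True, "members": ["deepseek-coder-33b", "llama3-70b"]},
}
running = set()
load_model(groups, running, "nomic-embed")
load_model(groups, running, "qwen3-8b-chat")       # both small models stay up
load_model(groups, running, "deepseek-coder-33b")
load_model(groups, running, "llama3-70b")          # evicts deepseek-coder-33b
```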
Using Docker and vLLM Backends
llama-swap is not limited to llama.cpp. You can use any OpenAI-compatible server, including vLLM running in Docker containers:
```yaml
models:
  qwen3-30b-vllm:
    cmdStop: docker stop vllm-qwen
    cmd: |
      docker run --init --rm --name vllm-qwen
      --runtime=nvidia --gpus '"device=0,1"'
      --shm-size=16g
      -v /models:/models
      -p ${PORT}:8000
      vllm/vllm-openai:v0.10.0
      --model /models/Qwen3-30B-AWQ
      --served-model-name qwen3-30b-vllm
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.9
    ttl: 1800
```
The cmdStop field is important for containers -- it tells llama-swap how to gracefully stop the Docker container when swapping models.
Model Aliases
If you use tools or clients that expect specific model names (like gpt-4o-mini), you can create aliases:
```yaml
models:
  qwen3-8b-chat:
    cmd: llama-server --port ${PORT} --model /models/qwen3-8b-q4_k_m.gguf
    aliases:
      - gpt-4o-mini
      - default-chat
```
Now requests to gpt-4o-mini are automatically routed to your local Qwen model.
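The lookup behind aliases amounts to a name-resolution step before routing. A minimal sketch of that idea (the `resolve` function and dict layout are invented for illustration, not llama-swap internals):

```python
def resolve(models, requested):
    """Map a requested model name (or one of its aliases) to the canonical entry."""
    if requested in models:
        return requested
    for name, cfg in models.items():
        if requested in cfg.get("aliases", []):
            return name
    raise KeyError(f"no model or alias named {requested!r}")

models = {"qwen3-8b-chat": {"aliases": ["gpt-4o-mini", "default-chat"]}}
canonical = resolve(models, "gpt-4o-mini")
```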
Securing Your Instance with API Keys
For shared or remote setups, llama-swap supports API key authentication:
```yaml
apiKeys:
  - "sk-your-secret-key-here"
  - "${env.LLAMA_SWAP_API_KEY}"
```
Clients then include the key as a Bearer token, just like the OpenAI API:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer sk-your-secret-key-here" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-8b-chat", "messages": [{"role": "user", "content": "Hello"}]}'
```
Running with Docker (Full Example)
Here is a complete Docker deployment with a custom config and models directory:
```bash
docker run -d --name llama-swap \
  --runtime nvidia \
  -p 8080:8080 \
  -v /path/to/models:/models \
  -v /path/to/config.yaml:/app/config.yaml \
  ghcr.io/mostlygeek/llama-swap:cuda
```
For configuration hot-reload (edit your config without restarting), mount the config directory and use the -watch-config flag:
```bash
docker run -d --name llama-swap \
  --runtime nvidia \
  -p 8080:8080 \
  -v /path/to/models:/models \
  -v /path/to/config-dir:/config \
  ghcr.io/mostlygeek/llama-swap:cuda \
  -config /config/config.yaml -watch-config
```
The Web UI
llama-swap ships with a built-in web interface accessible at http://localhost:8080/ui. From here you can:
- Test models with an interactive playground
- View detailed token generation metrics
- Inspect raw request and response payloads
- Manually load and unload models
- Stream logs in real time
This is invaluable for debugging prompt templates, comparing model outputs, and monitoring resource usage.
Integration with Developer Tools
Because llama-swap exposes a standard OpenAI-compatible API, it works out of the box with virtually any tool that supports the OpenAI API:
- Continue.dev (VS Code / JetBrains): Point the OpenAI provider at http://localhost:8080
- Open WebUI: Set the OpenAI base URL to your llama-swap endpoint
- LangChain / LlamaIndex: Use the OpenAI client with a custom base_url
- Any OpenAI SDK: Set base_url="http://localhost:8080/v1"
```python
from openai import OpenAI

# The api_key can be any placeholder string unless apiKeys is configured
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3-8b-chat",
    messages=[{"role": "user", "content": "Explain hot-swapping in 2 sentences."}],
)
print(response.choices[0].message.content)
```
Cloud Alternative: ModelsLab LLM API
Running local models gives you full control, privacy, and zero per-token costs after the hardware investment. But it also means managing hardware, keeping drivers updated, and handling capacity planning.
If you prefer not to manage local infrastructure, ModelsLab's LLM API provides access to 100,000+ AI models via a single API endpoint. It is a practical alternative when you need burst capacity beyond what your local GPU can handle, access to very large models (70B+ parameters) without the VRAM investment, production-grade uptime and scalability, or quick prototyping before committing to local deployment.
Many teams use both: local models via llama-swap for development and privacy-sensitive workloads, and ModelsLab for production traffic and large-scale inference.
Troubleshooting Common Issues
Model fails to load: Check the healthCheckTimeout setting. Large models can take over 60 seconds to load, especially when offloading layers to GPU. Increase the timeout in your config:
```yaml
healthCheckTimeout: 300
```
Port conflicts: If you run other services on ports 5800+, change the starting port:
```yaml
startPort: 10001
```
Out of memory when swapping: Lower the ttl values so models unload sooner, or use groups to ensure only compatible models run simultaneously.
Docker container does not stop: Always define cmdStop for container-based models so llama-swap can gracefully shut them down.
FAQ
How many models can llama-swap manage?
There is no hard limit on the number of model definitions in your config. The constraint is your hardware -- specifically available RAM and VRAM. With TTL-based unloading, you can define dozens of models and let llama-swap load them on demand. Only active models consume resources.
Does llama-swap work with Apple Silicon Macs?
Yes. Install via Homebrew and use llama.cpp compiled with Metal support. llama-swap itself is a Go binary that runs natively on ARM64 macOS. The Metal backend in llama.cpp handles GPU acceleration on Apple Silicon.
Can I use llama-swap with vLLM or other non-llama.cpp servers?
Absolutely. llama-swap works with any server that exposes an OpenAI-compatible API. This includes vLLM, TabbyAPI, stable-diffusion.cpp, and others. Just point the cmd field to the appropriate startup command.
Is llama-swap suitable for production use?
llama-swap is used in production by many teams, but it is primarily designed for single-server deployments. For multi-node, high-availability setups, consider pairing it with a load balancer or using a managed API service like ModelsLab for the production tier.
How does hot-swapping affect latency?
The first request to a new model incurs a cold-start delay while the inference server loads the model into memory (typically 5-30 seconds depending on model size and storage speed). Subsequent requests are routed directly to the already-running server with no additional overhead. Use the ttl setting to keep frequently-used models loaded longer.
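The cold-versus-warm pattern is easy to demonstrate with a toy stand-in for the backend. Everything here is simulated (the 0.1-second sleep stands in for the real 5-30 second model load); it only illustrates why the first request is slower:

```python
import time

def timed(fn):
    """Return (result, seconds elapsed) for a single call."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

class Backend:
    """Toy stand-in for an inference server behind llama-swap."""
    def __init__(self):
        self.loaded = False

    def infer(self):
        if not self.loaded:
            time.sleep(0.1)  # stand-in for the one-time model load
            self.loaded = True
        return "ok"

backend = Backend()
_, cold = timed(backend.infer)  # pays the load delay
_, warm = timed(backend.infer)  # routed straight to the running server
```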
Conclusion
llama-swap brings a missing layer of orchestration to local LLM workflows. With a single binary and a YAML config file, you get on-demand model loading, automatic swapping, multi-backend support, and a clean OpenAI-compatible API for all your tools. Whether you are switching between a coding model and a chat model throughout your workday or running a multi-model pipeline, llama-swap makes it seamless.
Get started at github.com/mostlygeek/llama-swap -- you can go from zero to serving your first model in under five minutes.
