Running multiple large language models on your own hardware is increasingly practical, but managing them is another story. You might want a coding model for development, a general-purpose chat model for Q&A, and a small fast model for embeddings -- all accessible through one endpoint without manually stopping and starting servers. That is exactly the problem llama-swap solves.
llama-swap is an open-source proxy server written in Go that sits in front of your local inference servers (llama.cpp, vLLM, TabbyAPI, and others) and hot-swaps between models on demand. When a request arrives, llama-swap reads the model field, starts the right backend if it is not already running, and routes the request. No restarts, no manual intervention.
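The routing idea is easy to picture in a few lines of Python. This is an illustrative sketch of the logic just described, not llama-swap's actual Go implementation; the `route` helper and its data shapes are made up for the example:

```python
import json

def route(models, request_body, running):
    """Pick (and 'start') the backend for the model named in an OpenAI-style request."""
    name = json.loads(request_body)["model"]
    if name not in models:
        raise KeyError(f"unknown model: {name}")
    if name not in running:
        running.add(name)  # in llama-swap this is where the configured cmd gets launched
    return models[name]["cmd"]

# One model definition, nothing running yet
models = {"qwen3-8b": {"cmd": "llama-server --port ${PORT} --model /models/qwen3-8b.gguf"}}
running = set()
cmd = route(models, '{"model": "qwen3-8b", "messages": []}', running)
```

The real proxy also health-checks the backend and forwards the HTTP body, but the decision it makes per request is essentially this lookup.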
With over 3,000 stars on GitHub and an active release cadence (v201 shipped April 2026), it has become a go-to tool for developers running local LLMs in production and development environments alike.
Why llama-swap?
Before diving into setup, it is worth understanding where llama-swap fits relative to other tools.
| Feature | llama-swap | Ollama | LM Studio |
|---|---|---|---|
| Hot-swap between models | Yes, automatic | Limited | No |
| Backend agnostic | Yes (llama.cpp, vLLM, etc.) | Own runner only | Own runner only |
| OpenAI API compatible | Full support | Yes | Yes |
| Anthropic API compatible | Yes | No | No |
| Run multiple models simultaneously | Yes (groups) | Partial | No |
| GUI included | Web UI | No (CLI only) | Full desktop app |
| Dependencies | Zero (single Go binary) | Single binary | Desktop app |
| Custom inference flags | Full control | Limited | Limited |
| Docker/Podman support | Native | Via wrapper | No |
llama-swap is the right choice when you need fine-grained control over how each model is served, want to mix backends (for example, llama.cpp for small models and vLLM for large ones), or need a headless server that runs on boot.
Prerequisites
Before installing llama-swap, make sure you have:
- A machine with enough RAM or VRAM for your target models (8GB minimum for small models, 24GB+ for larger ones)
- An inference server installed -- llama.cpp (llama-server) is the most common choice
- One or more GGUF model files downloaded (from Hugging Face or similar sources)
- Basic familiarity with the command line
Installation
llama-swap offers several installation methods. Choose the one that fits your platform.
Homebrew (macOS and Linux)
```bash
brew tap mostlygeek/llama-swap
brew install llama-swap
```
WinGet (Windows)
```bash
winget install llama-swap
```
Docker
Docker images are published for multiple platforms including CUDA, Vulkan, Intel, and CPU-only variants. These images bundle both llama-swap and llama-server.
```bash
# Pull the image for your platform
docker pull ghcr.io/mostlygeek/llama-swap:cuda    # NVIDIA GPU
docker pull ghcr.io/mostlygeek/llama-swap:vulkan  # AMD/Intel GPU
docker pull ghcr.io/mostlygeek/llama-swap:cpu     # CPU only
```
Pre-built Binaries
Download binaries for Linux, macOS, Windows, and FreeBSD from the GitHub releases page.
Building from Source
Requires Go and Node.js (for the web UI).
```bash
git clone https://github.com/mostlygeek/llama-swap.git
cd llama-swap
make clean all
# Binary will be in build/
```
Your First Configuration
llama-swap uses a single YAML configuration file. Here is the minimum viable setup:
```yaml
models:
  qwen3-8b:
    cmd: llama-server --port ${PORT} --model /models/qwen3-8b-q4_k_m.gguf
```
That is it: three lines of YAML. The ${PORT} macro is assigned automatically by llama-swap, so there are no port conflicts.
Save this as config.yaml and start llama-swap:
```bash
llama-swap --config config.yaml --listen localhost:8080
```
Now you can send requests to http://localhost:8080 using the standard OpenAI API format:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-8b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
Multi-Model Configuration
The real power of llama-swap emerges when you configure multiple models. Here is a practical configuration for a developer workstation:
```yaml
healthCheckTimeout: 120
logLevel: info

macros:
  # Adjust these paths for your system
  llama-server-path: /usr/local/bin/llama-server
  models-dir: /models

models:
  nomic-embed:
    cmd: |
      ${llama-server-path} --port ${PORT}
      --model ${models-dir}/nomic-embed-text-v1.5.Q8_0.gguf
      --embedding
      --ctx-size 2048
    ttl: 300
```
Let us break down the key settings:
- macros: Reusable snippets that reduce duplication across model definitions. Define paths, common flags, and context sizes once.
- cmd: The full command to start the inference server. Supports multi-line YAML for readability. The ${PORT} macro is required for automatic port assignment.
- ttl: Time-to-live in seconds. The model is automatically unloaded after this period of inactivity, freeing up RAM/VRAM for the next model.
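The ttl mechanic is simple to reason about. The helper below is illustrative only (not part of llama-swap); it captures the idle-timeout rule the setting implements:

```python
import time

def should_unload(last_used, ttl, now=None):
    """True once a model has been idle longer than its ttl (in seconds)."""
    now = time.monotonic() if now is None else now
    return (now - last_used) > ttl

# With ttl: 300, a model idle for 400 seconds is due for unloading,
# while one used 100 seconds ago stays resident.
due = should_unload(last_used=0, ttl=300, now=400)
resident = not should_unload(last_used=0, ttl=300, now=100)
```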
Running Multiple Models Simultaneously with Groups
By default, llama-swap runs one model at a time -- loading a new model unloads the previous one. The groups feature changes this, allowing you to run several models concurrently.
This is useful when you need a chat model and an embedding model running at the same time, or when you want a coding assistant alongside a general-purpose model.
```yaml
groups:
  parallel-small:
    swap: false
    members:
      - nomic-embed
      - qwen3-8b-chat
  heavy-models:
    swap: true
    members:
      - deepseek-coder-33b
      - llama3-70b
```
With swap: false, all members of the parallel-small group can run at the same time without unloading each other. The heavy-models group uses swap: true, meaning only one of its members runs at a time.
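To make the semantics concrete, here is a deliberately simplified simulation of the intra-group behavior described above (it ignores cross-group interactions and is not llama-swap code):

```python
def load_model(groups, running, name):
    """Simulate group semantics: loading a member of a swap: true
    group first evicts that group's other running members."""
    for group in groups.values():
        if name in group["members"] and group["swap"]:
            running -= set(group["members"])  # unload siblings
    running.add(name)
    return running

groups = {
    "parallel-small": {"swap": False, "members": ["nomic-embed", "qwen3-8b-chat"]},
    "heavy-models": {"swap": True, "members": ["deepseek-coder-33b", "llama3-70b"]},
}
running = set()
load_model(groups, running, "nomic-embed")
load_model(groups, running, "qwen3-8b-chat")       # both small models stay up
load_model(groups, running, "deepseek-coder-33b")
load_model(groups, running, "llama3-70b")          # evicts deepseek-coder-33b
```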
Using Docker and vLLM Backends
llama-swap is not limited to llama.cpp. You can use any OpenAI-compatible server, including vLLM running in Docker containers:
```yaml
models:
  qwen3-30b-vllm:
    cmdStop: docker stop vllm-qwen
    cmd: |
      docker run --init --rm --name vllm-qwen
      --runtime=nvidia --gpus '"device=0,1"'
      --shm-size=16g
      -v /models:/models
      -p ${PORT}:8000
      vllm/vllm-openai:v0.10.0
      --model /models/Qwen3-30B-AWQ
      --served-model-name qwen3-30b-vllm
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.9
    ttl: 1800
```
The cmdStop field is important for containers -- it tells llama-swap how to gracefully stop the Docker container when swapping models.
Model Aliases
If you use tools or clients that expect specific model names (like gpt-4o-mini), you can create aliases:
```yaml
models:
  qwen3-8b-chat:
    cmd: llama-server --port ${PORT} --model /models/qwen3-8b-q4_k_m.gguf
    aliases:
      - gpt-4o-mini
      - default-chat
```
Now requests to gpt-4o-mini are automatically routed to your local Qwen model.
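The lookup behind aliases amounts to a name-resolution step before routing. A minimal sketch of that idea (the `resolve` function and dict layout are invented for illustration, not llama-swap internals):

```python
def resolve(models, requested):
    """Map a requested model name (or one of its aliases) to the canonical entry."""
    if requested in models:
        return requested
    for name, cfg in models.items():
        if requested in cfg.get("aliases", []):
            return name
    raise KeyError(f"no model or alias named {requested!r}")

models = {"qwen3-8b-chat": {"aliases": ["gpt-4o-mini", "default-chat"]}}
canonical = resolve(models, "gpt-4o-mini")
```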
Securing Your Instance with API Keys
For shared or remote setups, llama-swap supports API key authentication:
```yaml
apiKeys:
  - "sk-your-secret-key-here"
  - "${env.LLAMA_SWAP_API_KEY}"
```
Clients then include the key as a Bearer token, just like the OpenAI API:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer sk-your-secret-key-here" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-8b-chat", "messages": [{"role": "user", "content": "Hello"}]}'
```
Running with Docker (Full Example)
Here is a complete Docker deployment with a custom config and models directory:
```bash
docker run -d --name llama-swap \
  --runtime nvidia \
  -p 8080:8080 \
  -v /path/to/models:/models \
  -v /path/to/config.yaml:/app/config.yaml \
  ghcr.io/mostlygeek/llama-swap:cuda
```
For configuration hot-reload (edit your config without restarting), mount the config directory and use the -watch-config flag:
```bash
docker run -d --name llama-swap \
  --runtime nvidia \
  -p 8080:8080 \
  -v /path/to/models:/models \
  -v /path/to/config-dir:/config \
  ghcr.io/mostlygeek/llama-swap:cuda \
  -config /config/config.yaml -watch-config
```
The Web UI
llama-swap ships with a built-in web interface accessible at http://localhost:8080/ui. From here you can:
- Test models with an interactive playground
- View detailed token generation metrics
- Inspect raw request and response payloads
- Manually load and unload models
- Stream logs in real time
This is invaluable for debugging prompt templates, comparing model outputs, and monitoring resource usage.
Integration with Developer Tools
Because llama-swap exposes a standard OpenAI-compatible API, it works out of the box with virtually any tool that supports the OpenAI API:
- Continue.dev (VS Code / JetBrains): Point the OpenAI provider at http://localhost:8080
- Open WebUI: Set the OpenAI base URL to your llama-swap endpoint
- LangChain / LlamaIndex: Use the OpenAI client with a custom base_url
- Any OpenAI SDK: Set base_url="http://localhost:8080/v1"
```python
from openai import OpenAI

# The api_key can be any placeholder string unless apiKeys is configured
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3-8b-chat",
    messages=[{"role": "user", "content": "Explain hot-swapping in 2 sentences."}],
)
print(response.choices[0].message.content)
```
Cloud Alternative: ModelsLab LLM API
Running local models gives you full control, privacy, and zero per-token costs after the hardware investment. But it also means managing hardware, keeping drivers updated, and handling capacity planning.
If you prefer not to manage local infrastructure, ModelsLab's LLM API provides access to 100,000+ AI models via a single API endpoint. It is a practical alternative when you need burst capacity beyond what your local GPU can handle, access to very large models (70B+ parameters) without the VRAM investment, production-grade uptime and scalability, or quick prototyping before committing to local deployment.
Many teams use both: local models via llama-swap for development and privacy-sensitive workloads, and ModelsLab for production traffic and large-scale inference.
Troubleshooting Common Issues
Model fails to load: Check the healthCheckTimeout setting. Large models can take over 60 seconds to load, especially when offloading layers to GPU. Increase the timeout in your config:
```yaml
healthCheckTimeout: 300
```
Port conflicts: If you run other services on ports 5800+, change the starting port:
```yaml
startPort: 10001
```
Out of memory when swapping: Lower the ttl values so models unload sooner, or use groups to ensure only compatible models run simultaneously.
Docker container does not stop: Always define cmdStop for container-based models so llama-swap can gracefully shut them down.
FAQ
How many models can llama-swap manage?
There is no hard limit on the number of model definitions in your config. The constraint is your hardware -- specifically available RAM and VRAM. With TTL-based unloading, you can define dozens of models and let llama-swap load them on demand. Only active models consume resources.
Does llama-swap work with Apple Silicon Macs?
Yes. Install via Homebrew and use llama.cpp compiled with Metal support. llama-swap itself is a Go binary that runs natively on ARM64 macOS. The Metal backend in llama.cpp handles GPU acceleration on Apple Silicon.
Can I use llama-swap with vLLM or other non-llama.cpp servers?
Absolutely. llama-swap works with any server that exposes an OpenAI-compatible API. This includes vLLM, TabbyAPI, stable-diffusion.cpp, and others. Just point the cmd field to the appropriate startup command.
Is llama-swap suitable for production use?
llama-swap is used in production by many teams, but it is primarily designed for single-server deployments. For multi-node, high-availability setups, consider pairing it with a load balancer or using a managed API service like ModelsLab for the production tier.
How does hot-swapping affect latency?
The first request to a new model incurs a cold-start delay while the inference server loads the model into memory (typically 5-30 seconds depending on model size and storage speed). Subsequent requests are routed directly to the already-running server with no additional overhead. Use the ttl setting to keep frequently-used models loaded longer.
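The cold-versus-warm pattern is easy to demonstrate with a toy stand-in for the backend. Everything here is simulated (the 0.1-second sleep stands in for the real 5-30 second model load); it only illustrates why the first request is slower:

```python
import time

def timed(fn):
    """Return (result, seconds elapsed) for a single call."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

class Backend:
    """Toy stand-in for an inference server behind llama-swap."""
    def __init__(self):
        self.loaded = False

    def infer(self):
        if not self.loaded:
            time.sleep(0.1)  # stand-in for the one-time model load
            self.loaded = True
        return "ok"

backend = Backend()
_, cold = timed(backend.infer)  # pays the load delay
_, warm = timed(backend.infer)  # routed straight to the running server
```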
Conclusion
llama-swap brings a missing layer of orchestration to local LLM workflows. With a single binary and a YAML config file, you get on-demand model loading, automatic swapping, multi-backend support, and a clean OpenAI-compatible API for all your tools. Whether you are switching between a coding model and a chat model throughout your workday or running a multi-model pipeline, llama-swap makes it seamless.
Get started at github.com/mostlygeek/llama-swap -- you can go from zero to serving your first model in under five minutes.
