LLM Serving: vLLM vs TGI vs Ollama in Production (2026)
Learning Objectives
- Understand the bottlenecks in LLM inference
- Use vLLM for high-throughput production serving
- Deploy local models with Ollama
- Apply quantization to reduce memory requirements
- Benchmark and optimize inference performance
LLM Inference Fundamentals
The Two Phases of Inference
Prefill (prompt processing): Process all input tokens in parallel. Compute-bound. Fast.
Decode (token generation): Generate one token at a time. Memory bandwidth-bound. Slow.
The decode phase is the primary bottleneck. Most optimization techniques target it.
KV Cache
During decoding, the model recomputes attention keys and values for every previous token on each step — unless you cache them.
The KV cache stores the (Key, Value) pairs computed at earlier decoding steps, so each token's keys and values are computed exactly once. Across a full generation of n tokens, this cuts the redundant K/V computation from O(n²) down to O(n).
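A toy single-head attention loop (NumPy, purely illustrative — real models use multi-head attention and a learned query projection per layer, both skipped here) makes the saving concrete: the cached decoder projects each token's K and V once and appends them, while the naive decoder reprojects every previous token at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

tokens = rng.normal(size=(5, d))  # 5 token embeddings

# Naive decode: recompute K and V for every previous token at each step
naive_out = []
for t in range(1, len(tokens) + 1):
    K, V = tokens[:t] @ Wk, tokens[:t] @ Wv  # O(t) projections per step -> O(n^2) total
    naive_out.append(attend(tokens[t - 1], K, V))

# Cached decode: project each token's K and V once, append to the cache
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
cached_out = []
for t in range(len(tokens)):
    K_cache = np.vstack([K_cache, tokens[t] @ Wk])  # O(1) new projection per step
    V_cache = np.vstack([V_cache, tokens[t] @ Wv])
    cached_out.append(attend(tokens[t], K_cache, V_cache))

assert np.allclose(naive_out, cached_out)  # identical outputs, far less compute
```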
The KV cache grows with sequence length and batch size. A 70B model with 4096-token context uses ~80GB of KV cache memory at full batch.
Memory Layout
GPU Memory = Model Weights + KV Cache + Activations
~ 70GB + ~20GB + ~5GB (70B model, BF16)
Running out of KV cache space is the primary cause of OOM errors during inference.
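The KV cache figures above can be reproduced with a back-of-envelope calculator. The dimensions below are Llama-3.1-70B-like and are assumptions for illustration (80 layers, grouped-query attention with 8 KV heads, head dim 128, BF16):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys AND values, per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len * batch

# Llama-3.1-70B-like dimensions (assumed): 80 layers, 8 KV heads (GQA), head dim 128
per_seq = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=1)
print(f"per 4096-token sequence: {per_seq / 1e9:.2f} GB")  # ~1.3 GB

full_batch = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=60)
print(f"batch of 60 sequences:   {full_batch / 1e9:.1f} GB")  # ~80 GB
```

This is why the ~80GB figure only appears "at full batch": each concurrent 4096-token sequence adds another ~1.3GB.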
vLLM: High-Throughput Serving
vLLM introduces PagedAttention — manages KV cache in fixed-size blocks (like OS memory pages), enabling efficient sharing and dramatically higher throughput.
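A minimal sketch of the block-table idea (not vLLM's actual implementation): each sequence maps logical token positions to fixed-size physical blocks drawn from a shared pool, so memory is claimed one block at a time instead of pre-reserving a full-context contiguous region per request.

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is 16)

class BlockPool:
    """Toy allocator: hands out fixed-size KV-cache blocks from a shared pool."""
    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))

    def alloc(self):
        return self.free.pop()

class Sequence:
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []  # logical block index -> physical block id
        self.n_tokens = 0

    def append_token(self):
        if self.n_tokens % BLOCK_SIZE == 0:  # current block full: grab a new one
            self.block_table.append(self.pool.alloc())
        self.n_tokens += 1

pool = BlockPool(n_blocks=64)
seq = Sequence(pool)
for _ in range(40):  # decode 40 tokens
    seq.append_token()

# 40 tokens occupy ceil(40 / 16) = 3 blocks; pre-allocating a contiguous
# 4096-token region per request would have reserved 256 blocks up front.
print(len(seq.block_table))  # 3
```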
pip install vllm
Start a vLLM Server
# Serve Llama 3.1 8B with OpenAI-compatible API
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--tensor-parallel-size 1 # use 2+ for multi-GPU
# With AWQ quantization (4-bit weights; requires an AWQ-quantized checkpoint)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--quantization awq \
--dtype half
Call the vLLM Server
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Explain attention mechanisms."}],
max_tokens=512,
temperature=0.7,
)
print(response.choices[0].message.content)
# Streaming
stream = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "List 5 Python tips."}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
vLLM Async Python Client
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
import asyncio
engine_args = AsyncEngineArgs(
model="meta-llama/Llama-3.1-8B-Instruct",
max_model_len=4096,
gpu_memory_utilization=0.90,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
async def generate(prompt: str) -> str:
sampling = SamplingParams(temperature=0.7, max_tokens=256)
results = engine.generate(prompt, sampling, request_id="req1")
full_output = ""
async for output in results:
full_output = output.outputs[0].text
return full_output
asyncio.run(generate("What is PagedAttention?"))
Ollama: Zero-Config Local Serving
Ollama is the easiest way to run open-source LLMs locally. Perfect for development.
# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run
ollama pull llama3.2 # 2GB download
ollama run llama3.2 # interactive chat
# Run as server (starts automatically on install)
# Server at http://localhost:11434
Custom Modelfile
Customize system prompts and parameters:
# Create file: Modelfile
FROM llama3.2
SYSTEM """You are a concise coding assistant. Always include working code examples."""
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
ollama create coding-assistant -f Modelfile
ollama run coding-assistant
Ollama Python API
import requests
def ollama_chat(prompt: str, model: str = "llama3.2") -> str:
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": model,
"prompt": prompt,
"stream": False,
},
)
return response.json()["response"]
# Or use the OpenAI-compatible endpoint
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
model="llama3.2",
messages=[{"role": "user", "content": "Explain RLHF in 100 words."}],
)
Quantization
Quantization reduces the precision of model weights, typically from 16-bit floats down to 8-bit or 4-bit integers. Each halving of bit width roughly halves memory, with modest quality loss down to 4 bits.
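To make the trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization. Note this is the simplest possible scheme — AWQ and GGUF use per-group scales and more sophisticated, activation-aware calibration.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=4096).astype(np.float32)  # fake weight tensor

# Symmetric per-tensor INT8: one float scale, weights stored as int8
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale

mem_ratio = weights.nbytes / q.nbytes  # 4x smaller than FP32 (2x vs BF16)
max_err = np.abs(weights - dequant).max()
print(f"memory: {mem_ratio:.0f}x smaller, max abs error: {max_err:.5f}")
assert mem_ratio == 4.0
assert max_err <= scale / 2 + 1e-9  # rounding error bounded by half a quantization step
```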
GGUF Quantization (llama.cpp)
from llama_cpp import Llama
# Q4_K_M: good balance of quality and size
llm = Llama(
model_path="./models/Llama-3.1-8B-Instruct-Q4_K_M.gguf",
n_ctx=4096,
n_gpu_layers=-1, # offload all layers to GPU
n_threads=8,
)
output = llm.create_chat_completion(
messages=[{"role": "user", "content": "What is quantization?"}],
max_tokens=256,
)
print(output["choices"][0]["message"]["content"])
AWQ Quantization (for vLLM)
pip install autoawq
python -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model = AutoAWQForCausalLM.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
model.quantize(tokenizer, quant_config={'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'})
model.save_quantized('./Llama-3.1-8B-Instruct-AWQ')
"
Memory Savings by Quantization Format
| Format | Bits | 8B Model Size | Quality Loss |
|---|---|---|---|
| BF16 | 16 | 16 GB | None (baseline) |
| INT8 | 8 | 8 GB | Minimal |
| Q4_K_M | ~4.5 | 5 GB | Low |
| Q3_K_M | ~3.5 | 4 GB | Moderate |
| Q2_K | ~2.5 | 3 GB | Noticeable |
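The table's sizes follow roughly from parameter count × bits per weight. A quick calculator (illustrative: it ignores quantization scales and the few higher-precision layers real GGUF files keep, so actual file sizes run slightly larger):

```python
def model_size_gb(n_params_billion, bits_per_weight):
    # bytes = params * bits / 8; ignores scales, zero-points, and runtime overhead
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for fmt, bits in [("BF16", 16), ("INT8", 8), ("Q4_K_M", 4.5), ("Q3_K_M", 3.5), ("Q2_K", 2.5)]:
    print(f"{fmt:7s} {model_size_gb(8, bits):5.1f} GB")  # 8B-parameter model
```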
Benchmarking Inference
import time
import statistics
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")
def benchmark_latency(
prompt: str,
model: str,
n_runs: int = 20,
max_tokens: int = 100,
) -> dict:
latencies = []
tokens_per_sec = []
for _ in range(n_runs):
start = time.perf_counter()
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
)
elapsed = time.perf_counter() - start
total_tokens = response.usage.completion_tokens
latencies.append(elapsed)
tokens_per_sec.append(total_tokens / elapsed)
return {
"p50_latency_ms": round(statistics.median(latencies) * 1000, 1),
"p95_latency_ms": round(sorted(latencies)[int(0.95 * n_runs)] * 1000, 1),
"avg_tokens_per_sec": round(statistics.mean(tokens_per_sec), 1),
"n_runs": n_runs,
}
results = benchmark_latency(
prompt="Explain the transformer architecture in 3 sentences.",
model="meta-llama/Llama-3.1-8B-Instruct",
n_runs=20,
)
print(results)
# {'p50_latency_ms': 1420.3, 'p95_latency_ms': 1680.1, 'avg_tokens_per_sec': 67.2, 'n_runs': 20}
Performance Optimization Tips
| Optimization | Impact | When to Use |
|---|---|---|
| Quantization (INT4) | 3–4× memory reduction vs BF16 | Always, unless quality is critical |
| Continuous batching | 10–50× throughput | High-concurrency serving (vLLM does this automatically) |
| Prompt caching | 5–10× latency reduction | Long, repeated system prompts |
| Speculative decoding | 2–3× latency reduction | High-volume serving with a draft model |
| Tensor parallelism | Linear GPU scaling | Multi-GPU setups |
| Flash Attention 2 | 2–4× memory reduction | All transformers (automatic in vLLM) |
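The continuous-batching row can be illustrated with a toy step-count simulation. The idealized assumption here is that a batched decode step costs about the same as a batch-of-1 step — roughly true in practice because decode is memory-bandwidth-bound, which is exactly why batching pays off so dramatically.

```python
def total_decode_steps(token_budgets, batch_size):
    """Count decode steps needed to finish all requests.

    Toy model of continuous batching: each step produces one token for every
    active sequence, and a waiting request is admitted the moment a slot frees
    up (idealized: ignores prefill cost and per-step batch overhead).
    """
    waiting = list(token_budgets)
    active = []
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:  # admit new requests
            active.append(waiting.pop())
        active = [t - 1 for t in active if t > 1]    # finished sequences leave
        steps += 1
    return steps

jobs = [100] * 16  # 16 concurrent requests, 100 generated tokens each
sequential = total_decode_steps(jobs, batch_size=1)   # one request at a time
batched = total_decode_steps(jobs, batch_size=16)     # continuous batching
print(sequential, batched)  # 1600 100
```

Under this idealized model, batching 16 requests gives a 16× throughput gain; real gains vary with request mix, prefill cost, and memory headroom.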
Troubleshooting
Out of memory (CUDA OOM)
- Reduce --max-model-len (shorter context window)
- Lower --gpu-memory-utilization (default 0.90 → try 0.80)
- Use quantization (AWQ or GGUF Q4)
- Use tensor parallelism across multiple GPUs
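Combined into a single launch command (a sketch — the quantization flag assumes you point vLLM at an AWQ-quantized checkpoint, and tensor parallelism assumes two GPUs are available):

```shell
# Tighter memory budget: shorter context, lower utilization cap,
# 4-bit AWQ weights, and 2-way tensor parallelism
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.80 \
  --quantization awq \
  --tensor-parallel-size 2
```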
Slow throughput (< 20 tokens/sec on 8B model)
- Enable Flash Attention (automatic in recent vLLM)
- Increase batch size for offline processing
- Check GPU utilization — should be > 80%
Responses are truncated
- Increase --max-model-len (but this increases memory)
- Increase max_tokens in your API call
Key Takeaways
- LLM inference has two phases: prefill (parallel prompt processing, compute-bound) and decode (sequential token generation, memory-bandwidth-bound)
- The KV cache stores attention keys and values to avoid recomputation — it is the main source of OOM errors during inference
- vLLM uses PagedAttention to manage KV cache efficiently, enabling 10–50× higher throughput than naive serving through continuous batching
- Ollama is the fastest path to local development — zero configuration, OpenAI-compatible API, supports macOS Metal and NVIDIA CUDA
- For CPU-only or local development inference, llama.cpp and Ollama are the right tools; for high-concurrency GPU production, vLLM is the standard
- Quantization (INT4/AWQ) cuts memory requirements by 2–4× with minimal quality loss — always quantize for production serving
- Continuous batching (automatic in vLLM) processes multiple requests simultaneously, dramatically improving GPU utilization and throughput
- Speculative decoding and prompt caching can further reduce latency by 2–10× for applications with repeated system prompts or high volume
FAQ
vLLM versus llama.cpp — which should I use? Use vLLM for production serving with high concurrency on GPU — it handles batching, KV cache management, and scaling automatically. Use llama.cpp (or Ollama) for local development, CPU inference, or single-user scenarios where a full server is unnecessary.
What GPU do I need for 8B models? At INT4 quantization: an 8B model (~5GB) fits on an RTX 3080 (10GB) or any card with 8GB+ VRAM. For full BF16 precision (16GB): the RTX 4090 (24GB) is the minimum practical single-GPU option.
How do I reduce CUDA out-of-memory errors in vLLM?
Reduce --max-model-len to shorten the maximum sequence length (the biggest KV cache consumer). Lower --gpu-memory-utilization from 0.90 to 0.80. Use quantization (AWQ or GGUF Q4) to reduce model weight memory. Enable tensor parallelism across multiple GPUs if available.
What is PagedAttention and why does it matter? PagedAttention manages the KV cache in fixed-size blocks (like OS virtual memory pages), allowing efficient memory sharing across requests and eliminating memory fragmentation. This enables much higher concurrency than pre-allocating a large contiguous KV cache block per request.
How do I serve a model with streaming output?
Both vLLM and Ollama support streaming via the OpenAI-compatible API. Set stream=True in your API call and iterate over the response chunks. Each chunk contains a delta with the next token(s). The code examples above show both streaming and non-streaming patterns.
What is the throughput difference between vLLM and Ollama? For single-user interactive use, the difference is small. For concurrent requests (10+ simultaneous users), vLLM can be 10–50× more efficient due to continuous batching and PagedAttention. Ollama is not designed for multi-user production serving.
When should I use a custom Modelfile in Ollama? Use a Modelfile when you want to embed a specific system prompt, set default temperature and context length, or create a named local model from a custom GGUF file. This lets you version your model configuration alongside your application code.