LLM Serving: vLLM vs TGI vs Ollama in Production (2026)

Learning Objectives

Understand the bottlenecks in LLM inference
Use vLLM for high-throughput production serving
Deploy local models with Ollama
Apply quantization to reduce memory requirements
Benchmark and optimize inference performance

LLM Inference Fundamentals

The Two Phases of Inference

Prefill (prompt processing): Process all input tokens in parallel. Compute-bound. Fast.

Decode (token generation): Generate one token at a time. Memory bandwidth-bound. Slow.

The decode phase is the primary bottleneck. Most optimization techniques target it.
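A rough upper bound shows why decode is memory-bandwidth-bound: each generated token has to stream the model weights from GPU memory once, so single-sequence decode speed is at most bandwidth divided by weight size. The sketch below uses that simplified model, which ignores KV cache reads and kernel overhead. The function name and the 1000 GB/s bandwidth figure are hypothetical examples, not measurements.

```python
def max_decode_tokens_per_sec(
    n_params: float,
    bytes_per_param: float,
    mem_bandwidth_gb_s: float,
) -> float:
    """Upper bound on single-sequence decode speed.

    Assumes every weight is read once per generated token and that
    memory bandwidth is the only limit (a deliberate simplification).
    """
    weight_gb = n_params * bytes_per_param / 1e9
    return mem_bandwidth_gb_s / weight_gb


# Hypothetical GPU with 1000 GB/s of memory bandwidth.
bf16 = max_decode_tokens_per_sec(8e9, 2.0, 1000.0)  # 8B model, BF16 weights
int4 = max_decode_tokens_per_sec(8e9, 0.5, 1000.0)  # 8B model, 4-bit weights
print(f"BF16: {bf16:.1f} tok/s, 4-bit: {int4:.1f} tok/s")
```

Under these assumptions, smaller weights raise the ceiling directly, which is one reason quantization and batching (which reuses each weight read across several sequences) are common decode optimizations.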

KV Cache

During decoding, the model recomputes attention keys and values for every previous token on each step — unless you cache them.

The KV cache stores (Key, Value) pairs from previous decoding steps, so each step only has to compute keys and values for the newest token. This reduces redundant computation from O(n²) to O(n) per step.
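The effect of caching can be illustrated with a toy single-head attention loop in plain Python. This is a sketch under simplifying assumptions: `project` is a hypothetical stand-in for learned projection matrices, and the "tokens" are random vectors. It decodes six steps with and without a cache, counts how many tokens get projected, and checks that both paths produce the same outputs.

```python
import math
import random


def project(x, scale):
    # Stand-in for a learned projection matrix (hypothetical: it just scales x).
    return [scale * xi for xi in x]


def attend(query, keys, values):
    """Single-head softmax attention of one query over all keys/values."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * value[d] for w, value in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


random.seed(0)
tokens = [[random.random() for _ in range(4)] for _ in range(6)]

# Without a cache: re-project every earlier token at every decode step.
projections_no_cache = 0
outputs_no_cache = []
for t, x in enumerate(tokens):
    history = tokens[: t + 1]
    keys = [project(h, 0.5) for h in history]
    values = [project(h, 2.0) for h in history]
    projections_no_cache += len(history)
    outputs_no_cache.append(attend(project(x, 1.0), keys, values))

# With a cache: project only the newest token and append it.
projections_cached = 0
k_cache, v_cache, outputs_cached = [], [], []
for x in tokens:
    k_cache.append(project(x, 0.5))
    v_cache.append(project(x, 2.0))
    projections_cached += 1
    outputs_cached.append(attend(project(x, 1.0), k_cache, v_cache))

print(projections_no_cache, projections_cached)  # 21 6
```

The uncached path projects 1 + 2 + ... + 6 = 21 tokens in total, while the cached path projects 6, and the gap widens quadratically as the sequence grows.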

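The KV cache grows linearly with both sequence length and batch size. For a 70B model with a 4096-token context, it can reach ~80GB once enough sequences are batched together (for example, 64 concurrent sequences on a Llama-style model with grouped-query attention in BF16).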
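The scaling can be made concrete with a small calculator. This is a sketch that assumes a Llama-style 70B configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and 2-byte BF16 elements. Other architectures will give different numbers, and the function name is illustrative.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int = 2,  # BF16/FP16
) -> int:
    """KV cache size for a batch of equal-length sequences."""
    # Leading factor of 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


GIB = 1024 ** 3

# Assumed Llama-style 70B shape: 80 layers, 8 KV heads, head_dim 128.
one_seq = kv_cache_bytes(80, 8, 128, seq_len=4096, batch_size=1)
batch_64 = kv_cache_bytes(80, 8, 128, seq_len=4096, batch_size=64)

print(f"1 sequence:   {one_seq / GIB:.2f} GiB")   # 1.25 GiB
print(f"64 sequences: {batch_64 / GIB:.2f} GiB")  # 80.00 GiB
```

Because the result is linear in both `seq_len` and `batch_size`, halving the maximum context length roughly doubles the number of sequences that fit in the same cache budget.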

Memory Layout

plaintext

GPU Memory = Model Weights + KV Cache + Activations
           ~ 140GB         +  ~20GB   +  ~5GB     (70B model, BF16, sharded across GPUs)

Running out of KV cache space is the primary cause of OOM errors during inference.
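The same budget can be turned around to estimate how many tokens of KV cache fit on a given GPU. The sketch below is a back-of-the-envelope helper, not how any serving engine actually plans memory. The function name, the 1 GB activation allowance, and the 128 KiB-per-token figure (an 8B Llama-style model with 32 layers, 8 KV heads, head dimension 128, BF16) are all assumptions for illustration.

```python
def kv_token_budget(
    gpu_mem_gb: float,
    utilization: float,
    weights_gb: float,
    activations_gb: float,
    kv_kib_per_token: float,
) -> int:
    """Estimate how many tokens of KV cache fit after weights and activations."""
    kv_budget_gb = gpu_mem_gb * utilization - weights_gb - activations_gb
    if kv_budget_gb <= 0:
        return 0  # the model does not fit at all under these assumptions
    return int(kv_budget_gb * 1e9 // (kv_kib_per_token * 1024))


# Assumed setup: 24 GB GPU, 90% usable, 16 GB of BF16 weights for an 8B model,
# 1 GB reserved for activations, 128 KiB of KV cache per token.
tokens = kv_token_budget(24, 0.90, 16, 1, 128)
print(f"Room for about {tokens} cached tokens")
print(f"That is about {tokens // 4096} concurrent 4096-token sequences")
```

Running the numbers this way before deployment gives a quick sense of whether an OOM is likely at the target concurrency.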

vLLM: High-Throughput Serving

vLLM introduces PagedAttention, which manages the KV cache in fixed-size blocks (like OS memory pages). This enables efficient memory sharing across requests and dramatically higher throughput.
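The block-based idea can be sketched with a toy allocator. This is purely illustrative and is not vLLM's implementation: the class name, method names, and block size of 16 tokens are assumptions. Each sequence gets a block table mapping its logical positions to physical blocks, and a new block is claimed only when the previous one fills up.

```python
class PagedKVAllocator:
    """Toy block-table allocator (illustrative only, not vLLM's code)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id: str) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # Claim a new block only when the sequence has none yet or the last is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        # Finished sequences return their blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    allocator.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    allocator.append_token("b")  # 5 tokens need 1 block

print(len(allocator.block_tables["a"]), len(allocator.free_blocks))  # 2 5
allocator.free("a")
print(len(allocator.free_blocks))  # 7
```

Because memory is claimed one block at a time, at most one partially filled block per sequence is wasted, instead of a full pre-allocated maximum-length region.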

Bash

pip install vllm

Start a vLLM Server

Bash

# Serve Llama 3.1 8B with OpenAI-compatible API
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --port 8000 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --tensor-parallel-size 1   # use 2+ for multi-GPU

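# With AWQ quantization (4-bit weights, roughly a quarter of the BF16 weight memory).
# vLLM needs an AWQ-quantized checkpoint here, such as the one created in the AWQ section.
vllm serve ./Llama-3.1-8B-Instruct-AWQ \
    --quantization awq \
    --dtype half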

Call the vLLM Server

Python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain attention mechanisms."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "List 5 Python tips."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

vLLM Async Python Client

Python

import asyncio
import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=4096,
    gpu_memory_utilization=0.90,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

async def generate(prompt: str) -> str:
    sampling = SamplingParams(temperature=0.7, max_tokens=256)
    # Each in-flight request needs a unique ID, so generate one per call.
    results = engine.generate(prompt, sampling, request_id=str(uuid.uuid4()))

    full_output = ""
    async for output in results:
        # Keep the latest output; the last one holds the finished text.
        full_output = output.outputs[0].text

    return full_output

print(asyncio.run(generate("What is PagedAttention?")))

Ollama: Zero-Config Local Serving

Ollama is the easiest way to run open-source LLMs locally. Perfect for development.

Bash

# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run
ollama pull llama3.2           # 2GB download
ollama run llama3.2            # interactive chat

# Run as server (starts automatically on install)
# Server at http://localhost:11434

Custom Modelfile

Customize system prompts and parameters:

modelfile

# Create file: Modelfile
FROM llama3.2

SYSTEM """You are a concise coding assistant. Always include working code examples."""

PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

Bash

ollama create coding-assistant -f Modelfile
ollama run coding-assistant
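If the Modelfile should live alongside application code, one option is to generate it from a small config. The helper below is a hypothetical sketch (`render_modelfile` is not part of Ollama). It only emits the `FROM`, `SYSTEM`, and `PARAMETER` directives used in the example above and does no escaping of the system prompt.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict) -> str:
    """Render a minimal Ollama Modelfile as text (hypothetical helper)."""
    lines = [
        f"FROM {base_model}",
        "",
        f'SYSTEM """{system_prompt}"""',
        "",
    ]
    # One PARAMETER line per setting, in the order given.
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    base_model="llama3.2",
    system_prompt="You are a concise coding assistant. Always include working code examples.",
    parameters={"temperature": 0.3, "top_p": 0.9, "num_ctx": 4096},
)
print(modelfile_text)
```

The returned string can be written to a file named `Modelfile` and passed to `ollama create` exactly as shown above.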

Ollama Python API

Python

import requests

def ollama_chat(prompt: str, model: str = "llama3.2") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["response"]


# Or use the OpenAI-compatible endpoint
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain RLHF in 100 words."}],
)
print(response.choices[0].message.content)

Quantization

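Quantization reduces weight precision from 16-bit floats (BF16/FP16) to 8-bit or 4-bit integers. Weight memory scales roughly linearly with bit width: 8-bit halves the footprint and 4-bit cuts it to about a quarter, usually with modest quality loss.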
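The core idea can be shown with a toy symmetric 8-bit quantizer in plain Python. This is a minimal sketch using one scale for the whole tensor. Production formats such as AWQ or the GGUF k-quants use per-group scales and other refinements that are not modeled here, and the weight values are made up for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to signed 8-bit integers."""
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"scale={scale:.5f}, max rounding error={max_error:.5f}")
```

The rounding error per weight is bounded by half the scale, so fewer bits (a larger scale) means coarser steps and more quality loss.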

GGUF Quantization (llama.cpp)

Python

from llama_cpp import Llama

# Q4_K_M: good balance of quality and size
llm = Llama(
    model_path="./models/Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers to GPU
    n_threads=8,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is quantization?"}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])

AWQ Quantization (for vLLM)

Bash

pip install autoawq

python -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
model.quantize(tokenizer, quant_config={'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'})
model.save_quantized('./Llama-3.1-8B-Instruct-AWQ')
"

Memory Savings by Quantization Format

Format	Bits	8B Model Size	Quality Loss
BF16	16	16 GB	None (baseline)
INT8	8	8 GB	Minimal
Q4_K_M	~4.5	5 GB	Low
Q3_K_M	~3.5	4 GB	Moderate
Q2_K	~2.5	3 GB	Noticeable
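The sizes in the table follow approximately from parameters × bits per weight ÷ 8. The sketch below reproduces that arithmetic for an 8B model using the bit widths listed above. It counts weights only, so real files come out somewhat larger (the table rounds up), for example because some tensors are typically kept at higher precision.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (weights only, no file overhead)."""
    return n_params * bits_per_weight / 8 / 1e9


formats = [("BF16", 16), ("INT8", 8), ("Q4_K_M", 4.5), ("Q3_K_M", 3.5), ("Q2_K", 2.5)]
for name, bits in formats:
    print(f"{name:7s} {weight_size_gb(8e9, bits):5.1f} GB")
```

The same function gives a quick estimate for other model sizes, for instance `weight_size_gb(70e9, 16)` for a 70B model in BF16.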

Benchmarking Inference

Python

import time
import statistics
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

def benchmark_latency(
    prompt: str,
    model: str,
    n_runs: int = 20,
    max_tokens: int = 100,
) -> dict:
    latencies = []
    tokens_per_sec = []

    for _ in range(n_runs):
        start = time.perf_counter()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        elapsed = time.perf_counter() - start

        completion_tokens = response.usage.completion_tokens
        latencies.append(elapsed)
        tokens_per_sec.append(completion_tokens / elapsed)

    return {
        "p50_latency_ms": round(statistics.median(latencies) * 1000, 1),
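        # quantiles() interpolates the 95th percentile; indexing sorted()[int(0.95 * n)] returns the max for n=20
        "p95_latency_ms": round(statistics.quantiles(latencies, n=100)[94] * 1000, 1),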
        "avg_tokens_per_sec": round(statistics.mean(tokens_per_sec), 1),
        "n_runs": n_runs,
    }


results = benchmark_latency(
    prompt="Explain the transformer architecture in 3 sentences.",
    model="meta-llama/Llama-3.1-8B-Instruct",
    n_runs=20,
)
print(results)
# {'p50_latency_ms': 1420.3, 'p95_latency_ms': 1680.1, 'avg_tokens_per_sec': 67.2, 'n_runs': 20}
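End-to-end latency mixes the prefill and decode phases together. With a streaming request, the two can be separated by recording a timestamp (for example with `time.perf_counter()`) as each chunk arrives. The helper below is a sketch that works on such a list of timestamps: the function name is hypothetical, it assumes roughly one token per chunk, and the example uses synthetic timestamps rather than real measurements.

```python
def summarize_stream_timing(request_start: float, chunk_times: list) -> dict:
    """Split streaming latency into time-to-first-token and decode speed.

    request_start and chunk_times are timestamps in seconds.
    """
    if not chunk_times:
        raise ValueError("no chunks received")
    ttft = chunk_times[0] - request_start  # dominated by prefill
    gaps = [later - earlier for earlier, later in zip(chunk_times, chunk_times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft_ms": round(ttft * 1000, 1),
        "mean_inter_token_ms": round(mean_gap * 1000, 1),
        "decode_tokens_per_sec": round(1 / mean_gap, 1) if mean_gap > 0 else None,
    }


# Synthetic timestamps: first chunk after 200 ms, then one chunk every 20 ms.
timing = summarize_stream_timing(0.0, [0.20, 0.22, 0.24, 0.26])
print(timing)
```

A long time-to-first-token points at prefill (long prompts, queueing), while a large inter-token gap points at the decode phase.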

Performance Optimization Tips

Optimization	Impact	When to Use
Quantization (INT4)	~3–4× weight memory reduction vs BF16	Always, unless quality is critical
Continuous batching	10–50× throughput	High-concurrency serving (vLLM does this automatically)
Prompt caching	5–10× latency reduction	Long, repeated system prompts
Speculative decoding	2–3× latency reduction	High-volume serving with a draft model
Tensor parallelism	Linear GPU scaling	Multi-GPU setups
Flash Attention 2	2–4× memory reduction	All transformers (automatic in vLLM)
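The benefit of continuous batching can be illustrated with a toy step counter. This is a simplified simulation, not a model of vLLM's scheduler: it assumes each request needs a known number of decode steps, every step costs the same, and prefill is ignored. Static batching waits for the longest request in each batch, while continuous batching refills a slot as soon as a request finishes.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i : i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Drop requests that just produced their final token.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Made-up workload: two long requests mixed with short ones, two slots.
lengths = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_batching_steps(lengths, batch_size=2))      # 18
print(continuous_batching_steps(lengths, batch_size=2))  # 11
```

The gap grows with the variance in output lengths, since static batches spend more steps with idle slots waiting on their longest member.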

Troubleshooting

Out of memory (CUDA OOM)

Reduce --max-model-len (shorter context window)
Lower --gpu-memory-utilization (default 0.90 → try 0.80)
Use quantization (AWQ or GGUF Q4)
Use tensor parallelism across multiple GPUs

Slow throughput (< 20 tokens/sec on 8B model)

Enable Flash Attention (automatic in recent vLLM)
Increase batch size for offline processing
Check GPU utilization — should be > 80%

Responses are truncated

Increase --max-model-len (but this increases memory)
Increase max_tokens in your API call

Key Takeaways

LLM inference has two phases: prefill (parallel token processing, compute-bound) and decode (sequential token generation, memory-bandwidth-bound)
The KV cache stores attention keys and values to avoid recomputation — it is the main source of OOM errors during inference
vLLM uses PagedAttention to manage KV cache efficiently, enabling 10–50× higher throughput than naive serving through continuous batching
Ollama is the fastest path to local development — zero configuration, OpenAI-compatible API, supports macOS Metal and NVIDIA CUDA
For CPU-only or local development inference, llama.cpp and Ollama are the right tools; for high-concurrency GPU production, vLLM is the standard
Quantization (INT4/AWQ) cuts memory requirements by 2–4× with minimal quality loss, so quantize for production serving unless output quality is critical
Continuous batching (automatic in vLLM) processes multiple requests simultaneously, dramatically improving GPU utilization and throughput
Speculative decoding and prompt caching can further reduce latency by 2–10× for applications with repeated system prompts or high volume

FAQ

vLLM versus llama.cpp — which should I use? Use vLLM for production serving with high concurrency on GPU — it handles batching, KV cache management, and scaling automatically. Use llama.cpp (or Ollama) for local development, CPU inference, or single-user scenarios where a full server is unnecessary.

What GPU do I need for 8B models? At INT4 quantization: an 8B model (~5GB) fits on an RTX 3080 (10GB) or any card with 8GB+ VRAM. For full BF16 precision (16GB): the RTX 4090 (24GB) is the minimum practical single-GPU option.

How do I reduce CUDA out-of-memory errors in vLLM? Reduce --max-model-len to shorten the maximum sequence length (the biggest KV cache consumer). Lower --gpu-memory-utilization from 0.90 to 0.80. Use quantization (AWQ or GGUF Q4) to reduce model weight memory. Enable tensor parallelism across multiple GPUs if available.

What is PagedAttention and why does it matter? PagedAttention manages the KV cache in fixed-size blocks (like OS virtual memory pages), allowing efficient memory sharing across requests and eliminating memory fragmentation. This enables much higher concurrency than pre-allocating a large contiguous KV cache block per request.

How do I serve a model with streaming output? Both vLLM and Ollama support streaming via the OpenAI-compatible API. Set stream=True in your API call and iterate over the response chunks. Each chunk contains a delta with the next token(s). The code examples above show both streaming and non-streaming patterns.

What is the throughput difference between vLLM and Ollama? For single-user interactive use, the difference is small. For concurrent requests (10+ simultaneous users), vLLM can be 10–50× more efficient due to continuous batching and PagedAttention. Ollama is not designed for multi-user production serving.

When should I use a custom Modelfile in Ollama? Use a Modelfile when you want to embed a specific system prompt, set default temperature and context length, or create a named local model from a custom GGUF file. This lets you version your model configuration alongside your application code.
