Open Source LLMs: Pick the GGUF Model for Your GPU (2026)
Learning Objectives
- Know the major open-source LLM families and their strengths
- Run LLMs locally with Ollama (zero-config)
- Use
llama.cppfor quantized inference on CPU - Load models via Hugging Face for custom pipelines
- Choose the right model for your hardware and use case
Why Open Source LLMs?
| Reason | Details |
|---|---|
| Privacy | Data never leaves your machine |
| Cost | No per-token API charges at inference time |
| Control | Fine-tune, customize, modify weights freely |
| No rate limits | Run as many requests as your hardware allows |
| Offline use | Works without internet |
The trade-off: you manage infrastructure, and the largest open models (70B+) still lag behind frontier models like GPT-4o on complex reasoning.
Major Model Families (2025)
Llama 3.x (Meta)
- Sizes: 1B, 3B, 8B, 70B, 405B
- Llama 3.2 3B/1B: multimodal vision models (compact)
- Llama 3.1 8B: best-in-class small model
- Llama 3.1 70B: near-GPT-4 quality
- License: Meta Llama 3 License (free for most commercial use)
Mistral / Mixtral (Mistral AI)
- Mistral 7B: punches above its weight class; excellent instruction following
- Mixtral 8×7B: Mixture of Experts (MoE) — 47B total params, 13B active per token
- Mistral Nemo 12B: strong multilingual support
- License: Apache 2.0
Gemma 2 (Google)
- Sizes: 2B, 9B, 27B
- Excellent for research and fine-tuning
- Strong on coding and reasoning for its size
- License: Custom Gemma License (free for research and commercial use)
Phi-3 / Phi-4 (Microsoft)
- Phi-3 Mini (3.8B), Phi-3 Small (7B), Phi-3 Medium (14B)
- Phi-4 (14B): state-of-the-art at 14B size for reasoning
- Optimized for efficiency — excellent on CPU/mobile
- License: MIT
Qwen2.5 (Alibaba)
- Sizes: 0.5B to 72B
- Coder variant excels at code generation
- Strong multilingual (especially Chinese)
- License: Apache 2.0
Choosing a Model for Your Hardware
| GPU VRAM | Recommended Model | Notes |
|---|---|---|
| 4–8 GB | Phi-4 (14B Q4), Gemma 2 2B | Use 4-bit quantization |
| 8–16 GB | Llama 3.1 8B, Mistral 7B | Comfortable at 4-bit |
| 16–24 GB | Llama 3.1 8B (full), Mixtral 8×7B Q4 | Good quality |
| 40–80 GB | Llama 3.1 70B Q4 | Near-frontier quality |
| CPU only | Phi-3 Mini, Gemma 2 2B, Llama 3.2 1B | Slow but works |
Method 1: Ollama (Easiest)
Ollama is a tool that makes running LLMs locally as simple as ollama run llama3.
Install
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from https://ollama.comPull and Run a Model
# Pull a model (downloaded once)
ollama pull llama3.2 # 3B — fast, 2GB download
ollama pull mistral # 7B — great quality
ollama pull phi4 # 14B — excellent reasoning
# Interactive chat
ollama run llama3.2
# >>> Hello! What is RAG in AI?Use the Ollama API
Ollama runs a local server at http://localhost:11434 with an OpenAI-compatible API.
from openai import OpenAI
# Point the client at your local Ollama server
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama", # required but ignored
)
response = client.chat.completions.create(
model="llama3.2",
messages=[
{"role": "user", "content": "Explain transformer attention in 100 words."}
],
)
print(response.choices[0].message.content)This means any code written for the OpenAI API works with Ollama by changing base_url.
Manage Models
ollama list # list downloaded models
ollama rm mistral # delete a model
ollama show llama3.2 # model info (params, context, etc.)Method 2: llama.cpp (CPU + Quantized Inference)
llama.cpp is a C++ implementation of LLaMA inference, optimized for CPU. It supports GGUF quantized model files.
Install (Python bindings)
pip install llama-cpp-pythonFor GPU acceleration on macOS (Metal):
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-pythonDownload a GGUF Model
Find quantized models on Hugging Face — search for "GGUF" + model name. Example: bartowski/Llama-3.2-3B-Instruct-GGUF.
pip install huggingface_hub
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF \
--include "Llama-3.2-3B-Instruct-Q4_K_M.gguf" \
--local-dir ./modelsRun Inference
from llama_cpp import Llama
llm = Llama(
model_path="./models/Llama-3.2-3B-Instruct-Q4_K_M.gguf",
n_ctx=4096, # context window
n_gpu_layers=-1, # -1 = offload all layers to GPU (if available)
)
output = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are the key differences between SQL and NoSQL?"},
],
max_tokens=512,
temperature=0.7,
)
print(output["choices"][0]["message"]["content"])Quantization Formats (GGUF)
| Format | Quality | Size Reduction |
|---|---|---|
| Q8_0 | Excellent | 45% |
| Q5_K_M | Very Good | 55% |
| Q4_K_M | Good (recommended) | 60% |
| Q3_K_M | Acceptable | 70% |
| Q2_K | Moderate | 75% |
Q4_K_M is the sweet spot for most use cases — minimal quality loss, fits in less VRAM.
Method 3: Hugging Face Transformers
For custom pipelines, fine-tuning, and research.
pip install transformers accelerate torchfrom transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "google/gemma-2-2b-it" # instruction-tuned Gemma 2 2B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{"role": "user", "content": "Explain gradient descent in simple terms."}
]
input_ids = tokenizer.apply_chat_template(
messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
outputs = model.generate(
input_ids,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
)
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)Running a Local API Server
Expose your local model as an OpenAI-compatible API:
# With vLLM (high-performance, GPU)
pip install vllm
vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8000
# With LiteLLM proxy (wraps any model)
pip install litellm
litellm --model ollama/llama3.2 --port 8000Then connect any OpenAI-compatible client to http://localhost:8000.
Troubleshooting
Model is very slow on CPU
- Use a smaller model (1B–3B) or a more aggressively quantized file (Q4 or Q3)
- Increase
n_threadsin llama_cpp:Llama(model_path=..., n_threads=8)
Out of memory
- Use a smaller quantization (Q4_K_M instead of Q8_0)
- Reduce context window:
n_ctx=2048
Gibberish output
- Ensure you're applying the correct chat template for the model
- Check that you're using the instruction-tuned variant (look for
-Instructor-itin the model name)
Key Takeaways
- Open source LLMs give you privacy, zero per-token cost, and full fine-tuning control at the cost of infrastructure management
- Llama 3.1 70B is the best all-around open source model; Phi-4 and Qwen2.5 punch above their size class for reasoning and coding
- VRAM is the primary hardware constraint — use the VRAM table to match model size and quantization level to your GPU
- Ollama is the fastest path to running models locally — zero config, OpenAI-compatible API, works on Mac/Linux/Windows
- Q4_K_M quantization is the recommended default — it cuts model size by 60% with minimal quality loss
- llama.cpp gives you CPU inference and fine-grained quantization control via GGUF format
- Hugging Face Transformers is the right choice for custom pipelines, fine-tuning, and research workflows
- Always use the instruct-tuned variant (suffix
-Instructor-it) for chat and instruction-following tasks
FAQ
Are open source LLMs as good as GPT-4? For many tasks: yes. Llama 3.1 70B is competitive with GPT-4 on coding and general reasoning. On complex multi-step reasoning and long-context synthesis, frontier proprietary models still hold an edge.
Can I use these models for commercial projects? Most open models (Llama 3, Mistral, Gemma, Phi) allow commercial use. Always check the specific license. Meta's Llama license has restrictions for large companies (over 700M MAU). Mistral and Qwen are Apache 2.0 — the most permissive option.
What is the difference between base and instruct models? Base models predict the next token (raw language modeling). Instruct models are fine-tuned to follow instructions and engage in conversation. Always use instruct models for chat applications and task completion.
Which model should I start with if I have an RTX 4090? Start with Llama 3.1 8B Q4_K_M for general tasks (fits comfortably in 5GB VRAM, leaving plenty of headroom). For coding, try Qwen2.5 Coder 7B. For reasoning-heavy tasks, Phi-4 14B Q4_K_M fits in around 9GB.
How do I run a model on a Mac without a discrete GPU? Use Ollama — it supports Apple Silicon Metal acceleration natively. An M2/M3 Mac with 16GB unified memory runs 7B–8B models well. An M3 Max with 64GB can run 70B models at useful speeds.
What does the Q4_K_M suffix mean? Q4 is the quantization bit depth (4 bits per weight). K indicates the K-quant algorithm, which uses mixed precision across layers. M means medium quality within that bit depth. Q4_K_M is the community-standard recommended quantization level.
Can I use Ollama with LangChain or other frameworks?
Yes. Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. Any LangChain, LlamaIndex, or OpenAI SDK code works with Ollama by changing the base_url parameter.