Ollama Guide: Run LLMs Locally
Run Llama, Mistral, Gemma, and more on your own hardware — free, private, offline, no API costs. This guide covers installation, the best models to run, Python integration, and a browser UI.
Why Run LLMs Locally?
Privacy
Your data never leaves your machine. Essential for sensitive documents, legal text, medical records, or proprietary code.
No cost
Zero API fees. Run as many queries as you want with no rate limits or pay-per-token billing.
Offline
Works without internet. Useful for air-gapped environments or unreliable connections.
Customization
Fine-tune models locally, swap them instantly, and integrate with any tool without API restrictions.
Installing Ollama
macOS / Linux
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
Windows
Download the installer from ollama.com/download
Then pull and run a model:
```shell
ollama run llama3.1:8b
```
Best Models to Run Locally
| Model | RAM Required | Best For |
|---|---|---|
| llama3.2:3b | 2GB | Fast responses on low-end hardware |
| llama3.1:8b | 5GB | Good balance of quality and speed |
| mistral:7b | 4GB | Coding and chat |
| qwen2.5-coder | 4GB | Code generation |
| gemma2:9b | 6GB | High-quality general chat |
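As a quick illustration of matching the table to your machine, here is a tiny helper (purely illustrative; the model names and RAM thresholds simply mirror the table above):

```python
# Illustrative helper: pick the most capable model from the table
# that still fits in a given amount of free RAM.
MODELS_BY_RAM_GB = [
    ("llama3.2:3b", 2),
    ("mistral:7b", 4),
    ("qwen2.5-coder", 4),
    ("llama3.1:8b", 5),
    ("gemma2:9b", 6),
]

def pick_model(free_ram_gb: float) -> str:
    """Return the largest model that fits; fall back to the smallest."""
    fitting = [name for name, ram in MODELS_BY_RAM_GB if ram <= free_ram_gb]
    return fitting[-1] if fitting else MODELS_BY_RAM_GB[0][0]

print(pick_model(8))  # → gemma2:9b
print(pick_model(3))  # → llama3.2:3b
```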
Using Ollama with Python
```shell
pip install ollama
```
```python
import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[
        {"role": "user", "content": "Explain transformers in 2 sentences."},
    ],
)
print(response["message"]["content"])
```
Ollama + Open WebUI
Open WebUI gives you a ChatGPT-style browser interface over your local Ollama models. Install with Docker:
```shell
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```
Then visit http://localhost:3000 for a full chat UI with model switching, history, and file uploads.
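Under the hood, Open WebUI and the Python client both talk to the same local HTTP API that Ollama serves on port 11434. A minimal sketch of building a request for its /api/chat endpoint with only the standard library (the endpoint and fields follow Ollama's REST API; the POST only succeeds while the Ollama server is running, so just the payload construction runs here):

```python
import json

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local endpoint

def build_chat_payload(model: str, prompt: str) -> bytes:
    """Build the JSON body for a non-streaming /api/chat request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one complete response instead of chunks
    }
    return json.dumps(body).encode("utf-8")

payload = build_chat_payload("llama3.1:8b", "Say hello in one word.")
print(json.loads(payload)["model"])  # → llama3.1:8b
# With the server running, POST it with urllib.request, e.g.:
# req = urllib.request.Request(OLLAMA_CHAT_URL, data=payload,
#                              headers={"Content-Type": "application/json"})
# print(json.load(urllib.request.urlopen(req))["message"]["content"])
```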
Frequently Asked Questions
What hardware do I need for Ollama?
Any modern Mac, Windows PC, or Linux machine with at least 8GB RAM can run 7B models. 16GB RAM is comfortable for 13B models. An Apple Silicon Mac (M1/M2/M3) with unified memory is the best consumer hardware for local LLMs — the GPU and CPU share the same memory pool, making 7B models fast without a discrete GPU.
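These numbers follow a rough rule of thumb for the 4-bit quantized builds Ollama ships by default: on the order of 0.5–0.75 GB of RAM per billion parameters, plus some overhead for context. A back-of-the-envelope sketch (the 0.6 GB/billion figure and 1 GB overhead are illustrative assumptions, not measurements):

```python
def estimate_ram_gb(params_billion: float,
                    gb_per_billion: float = 0.6,  # ~4-bit quantization (assumed)
                    overhead_gb: float = 1.0) -> float:  # context/KV cache (assumed)
    """Very rough RAM estimate for a quantized local model."""
    return round(params_billion * gb_per_billion + overhead_gb, 1)

for size in (3, 7, 13):
    print(f"{size}B model: ~{estimate_ram_gb(size)} GB")
```

This is only a sanity check for sizing hardware; actual usage varies with quantization level and context length.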
Is Ollama free?
Yes, completely free and open-source. There are no API costs, no tokens to buy, and no rate limits. The only cost is your electricity and the initial model download (2–8GB per model). Models are stored locally and can be used offline.
Can I use Ollama with LangChain?
Yes. LangChain has first-class Ollama support via the ChatOllama and OllamaEmbeddings classes. This lets you build RAG pipelines and agents that run entirely locally — no API keys or costs required. See our LangChain tutorial for a full example.
What's the best model to run with Ollama?
For general use: llama3.1:8b (the best quality/speed balance on most hardware). For coding: qwen2.5-coder or mistral:7b. For low-end hardware (8GB RAM): llama3.2:3b. With 16GB+ RAM: gemma2:9b, or a larger model such as qwen2.5:14b for the best quality.
Build with local LLMs + LangChain
Combine Ollama with LangChain to build fully local RAG pipelines and agents — no API costs, complete privacy.
LangChain Tutorial →