This is the fastest path from nothing to a working local LLM. No Python environment, no CUDA wrestling — Ollama handles all of it.
1. Install Ollama
Download the installer from ollama.com (Windows, macOS, Linux). On Linux it’s one line:
curl -fsSL https://ollama.com/install.sh | sh
The installer detects your GPU automatically: NVIDIA via CUDA, AMD via ROCm, and Apple Silicon via Metal.
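To confirm the install worked before pulling a model, it can help to check that the background server is answering. The sketch below assumes the default address `http://localhost:11434` and uses the `/api/version` endpoint; the helper name `ollama_version` is made up for this example, and it uses only the Python standard library.

```python
import json
import urllib.request


def ollama_version(base_url="http://localhost:11434", timeout=2.0):
    """Return the server's version string, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections, HTTP errors, and timeouts;
        # ValueError covers a body that is not valid JSON.
        return None
```

Calling `ollama_version()` should return a version string once the server is running, and `None` if nothing is listening on that port.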
2. Pick a model that fits
The biggest beginner mistake is pulling a model too large for your VRAM. Match the model to your card:
| VRAM | Start with |
|---|---|
| 4 GB | llama3.2:3b |
| 8 GB | qwen2.5:7b |
| 12–16 GB | qwen2.5:14b |
| 24 GB | qwen2.5:32b |
Then pull and run it:
ollama run qwen2.5:7b
The first run downloads the model (a 7B Q4 is roughly 4–5 GB). After that, it loads from disk in seconds and you’re chatting in your terminal.
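The download size follows from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below is a rough estimator, not how Ollama itself decides anything. The bits-per-weight figure (a "Q4" file averages somewhat more than 4 bits once scaling data is included) and the `overhead_gb` allowance for context and buffers are assumptions to adjust, and both function names are hypothetical.

```python
def estimate_weights_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion, bits_per_weight, vram_gb, overhead_gb=1.5):
    """Rough check: weights plus a fixed allowance for context and buffers.

    overhead_gb is a placeholder guess; real usage grows with context length.
    """
    return estimate_weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb
```

For example, assuming about 7.6 billion parameters at about 4.8 bits per weight, `estimate_weights_gb(7.6, 4.8)` gives roughly 4.56 GB, which lands inside the 4–5 GB range quoted above.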
3. Verify it’s on the GPU
If replies feel slow, check whether the model actually loaded into VRAM:
ollama ps
Look at the PROCESSOR column: 100% GPU is what you want. If it shows a CPU share instead (a CPU/GPU split or 100% CPU), the model didn’t fully fit in VRAM. Switch to a smaller model or a more aggressive quantization; explicit tags in the style of qwen2.5:7b-instruct-q4_K_M let you choose one.
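The same check can be scripted. The sketch below reads the server's `/api/ps` endpoint and compares each model's `size_vram` to its total `size`; those field names reflect my understanding of the API response and are worth verifying against your installed version. The function names are hypothetical and only the standard library is used.

```python
import json
import urllib.request


def gpu_fraction(model):
    """Fraction of a loaded model's memory that sits in VRAM (0.0 to 1.0)."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return model.get("size_vram", 0) / size


def loaded_models(base_url="http://localhost:11434", timeout=5.0):
    """Fetch the list of currently loaded models from the local server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=timeout) as resp:
        return json.load(resp).get("models", [])


def report(models):
    """Build one summary line per model, ready for printing."""
    return [f"{m.get('name', '?')}: {gpu_fraction(m):.0%} GPU" for m in models]
```

With a model loaded, `print("\n".join(report(loaded_models())))` prints one line per model; anything below 100% means part of the model is running on the CPU.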
4. Use it from code
Ollama exposes a local API on port 11434, so any script can talk to it:
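import requests

# Ask the local Ollama server for a single, non-streamed completion.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:7b",
        "prompt": "Explain VRAM in one paragraph.",
        "stream": False,  # return one JSON object instead of a stream of chunks
    },
)
r.raise_for_status()  # fail loudly if the server returns an HTTP error
print(r.json()["response"])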
That’s the foundation for everything else on this site: once a local model answers on localhost, you can wire it into editors, scripts, voice assistants — anything.
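The request above sets `"stream": False` and waits for the whole answer. With streaming on, the server sends newline-delimited JSON objects, each carrying a piece of text in `response` and a `done` flag on the last one. The sketch below shows one way to consume that; the function names are hypothetical, and it uses the standard library rather than `requests` to stay dependency-free.

```python
import json
import urllib.request


def stream_pieces(lines):
    """Yield the text piece from each newline-delimited JSON chunk."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # the final chunk marks the end of the answer


def generate_streaming(prompt, model="qwen2.5:7b", base_url="http://localhost:11434"):
    """POST to /api/generate with streaming on and yield text as it arrives."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The response object iterates over the body line by line.
        yield from stream_pieces(resp)
```

A loop such as `for piece in generate_streaming("Explain VRAM in one paragraph."): print(piece, end="", flush=True)` prints the answer as it is generated instead of all at once.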
Where to go next
The natural next step is giving your model documents to work with (local RAG) or a web interface. Both build directly on the setup you just finished.