Running Local Models — Ollama and LM Studio
Run AI models on your own machine — privacy benefits, performance tradeoffs, setup guides, and when local makes sense.
Every time you use Claude or ChatGPT, your code travels over the internet to someone else's servers. For most work, that's fine. But there are legitimate reasons to run AI models on your own machine: privacy requirements, offline access, cost control, or simply the principle that your code shouldn't leave your hardware.
Local models are the self-hosted option. They run on your machine, use your GPU (or CPU), and never send a byte of data to any external server. The tradeoff is capability — local models are smaller and less powerful than the cloud giants. But for many tasks, they're more than good enough.
Ollama — The Docker of Local AI
Ollama is the simplest way to run local models. Think of it as Docker for AI models — you pull a model, run it, and interact with it through a simple API.
Installing Ollama
# macOS (Homebrew)
brew install ollama
# Or download from ollama.com
# Available for macOS, Linux, and Windows
Running Your First Model
# Pull and run a model
ollama run llama3.1:8b
# It downloads the model (~4.7GB) and starts an interactive chat
>>> Write a TypeScript function that validates an email address
That's it. No configuration, no GPU drivers to install, no Python environments to manage. Ollama handles everything.
Available Models
# List popular coding-capable models
ollama list
# Pull specific models
ollama pull llama3.1:8b # Good balance for most machines
ollama pull llama3.1:70b # Powerful, needs 40GB+ RAM
ollama pull codellama:13b # Specialized for code
ollama pull qwen2.5-coder:7b # Strong coding model, efficient
ollama pull deepseek-coder-v2 # Excellent for code
ollama pull mistral:7b # Fast, general-purpose
Using Ollama as an API
Ollama exposes an API that's compatible with the OpenAI format, which means many tools work with it out of the box:
# Ollama runs a local API server on port 11434
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Write a React useState hook for a counter",
"stream": false
}'
Any tool that supports a custom OpenAI-compatible endpoint can use Ollama. This includes some configurations of Cursor, Continue.dev, and other coding assistants.
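The same call the curl command makes can be scripted with nothing but Python's standard library. This is a minimal sketch: the helper names (`generate_payload`, `ask_ollama`) are our own, and it assumes Ollama is serving on its default port 11434.

```python
import json
import urllib.request

def generate_payload(model: str, prompt: str, stream: bool = False) -> bytes:
    # Build the JSON body for Ollama's /api/generate endpoint,
    # matching the curl example above.
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

def ask_ollama(prompt: str, model: str = "llama3.1:8b") -> str:
    # POST to the local Ollama server (requires `ollama serve` running).
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=generate_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # For non-streamed requests, the generated text arrives in
        # the "response" field of a single JSON object.
        return json.loads(resp.read())["response"]

# Usage (with a model pulled and the server running):
# ask_ollama("Write a React useState hook for a counter")
```

Because the endpoint speaks plain JSON over HTTP, the same pattern works from any language with an HTTP client.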
LM Studio — The GUI Approach
LM Studio provides a graphical interface for running local models. If Ollama is the command-line approach, LM Studio is the desktop application approach.
What LM Studio Offers
- Model browser — Search and download models from Hugging Face directly in the app
- Chat interface — Talk to models through a familiar chat UI
- Local server — Expose models as an OpenAI-compatible API
- Model comparison — Run multiple models side-by-side to compare output
- Settings control — Adjust temperature, context length, and other parameters through the UI
When to Choose LM Studio Over Ollama
| Need | Ollama | LM Studio |
|------|--------|-----------|
| Quick command-line usage | Better | Works |
| Visual model management | No GUI | Better |
| API integration | Both work | Both work |
| Model comparison | Manual | Built-in |
| Browsing available models | Ollama.com | In-app browser |
| Automation/scripting | Better | Less suited |
Both tools are free and run the same underlying models. Choose based on whether you prefer terminal or GUI workflows.
Hardware Requirements
Local models need serious hardware — especially RAM. Here's what you need:
The RAM Rule of Thumb
A model's RAM requirement is roughly 1.2x its file size. A 4GB model needs about 5GB of free RAM. A 40GB model needs about 48GB.
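The rule is easy to sanity-check in code. A quick sketch (the helper name is our own, not part of any tool):

```python
def estimated_ram_gb(model_file_gb: float) -> float:
    # Rule of thumb: a model needs roughly 1.2x its file size in free RAM.
    return round(model_file_gb * 1.2, 1)

print(estimated_ram_gb(4))   # a 4 GB model needs ~4.8 GB free RAM
print(estimated_ram_gb(40))  # a 40 GB model needs ~48 GB free RAM
```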
| Model Size | RAM Needed | GPU VRAM | Good For |
|------------|-----------|----------|----------|
| 1B-3B params | 2-4 GB | 2-4 GB | Simple completions, fast responses |
| 7B-8B params | 5-8 GB | 6-8 GB | General coding, explanations |
| 13B-14B params | 10-16 GB | 10-16 GB | Better quality, reasonable speed |
| 30B-34B params | 20-32 GB | 24-32 GB | Near-cloud quality for many tasks |
| 70B params | 40-48 GB | 40-48 GB | Closest to cloud quality |
Apple Silicon Advantage
If you have a Mac with Apple Silicon (M1/M2/M3/M4), local models run well because the unified memory architecture lets the GPU access all system RAM. A MacBook Pro with 36GB of unified memory can run 30B parameter models comfortably.
MacBook Air M2 (16GB): Can run 7B-8B models well
MacBook Pro M3 (36GB): Can run 13B-30B models
Mac Studio M2 Ultra (192GB): Can run 70B+ models
NVIDIA GPUs
On Windows/Linux, NVIDIA GPUs with CUDA support provide the best performance. VRAM is the bottleneck:
RTX 4060 (8GB VRAM): 7B-8B models
RTX 4070 (12GB VRAM): 13B models
RTX 4090 (24GB VRAM): 30B models
2x RTX 4090 (48GB): 70B models
CPU-Only
You can run local models on CPU alone, but expect significantly slower generation. An 8B model that generates 30 tokens/second on GPU might generate 3-5 tokens/second on CPU. Usable for simple tasks, painful for long responses.
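The gap compounds with response length, as a back-of-envelope calculation shows (the speeds are the illustrative figures from above, not benchmarks):

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    # Time to generate a response at a given sustained generation speed.
    return tokens / tokens_per_second

# A ~500-token answer (a few paragraphs of code plus explanation):
print(generation_seconds(500, 30))  # GPU at 30 tok/s: ~17 seconds
print(generation_seconds(500, 4))   # CPU at 4 tok/s: over two minutes
```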
Local vs Cloud — The Tradeoffs
| Factor | Local | Cloud |
|--------|-------|-------|
| Privacy | Complete — code never leaves your machine | Your code goes to the provider |
| Speed | Depends on hardware (often slower) | Consistently fast |
| Quality | Good for smaller models, gap for complex tasks | Best available models |
| Cost | Free after hardware | Per-token pricing |
| Offline access | Works without internet | Requires internet |
| Setup | 5 minutes for basic, more for optimization | Sign up and get an API key |
| Maintenance | You manage updates and hardware | Provider manages everything |
When Local Wins
Proprietary code. If your company policy or client contract prohibits sending code to third-party services, local models are the only option.
Offline development. Traveling, poor internet, or air-gapped environments. Local models work anywhere your laptop does.
High volume, simple tasks. If you're making thousands of simple API calls (formatting, classification, simple generation), local models eliminate per-call costs.
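A rough break-even sketch makes the point concrete. Every number below is a placeholder, not a real quote; substitute your provider's current pricing and your actual volumes.

```python
def monthly_cloud_cost(calls: int, tokens_per_call: int,
                       usd_per_million_tokens: float) -> float:
    # Total monthly spend at a flat per-token price (hypothetical figures).
    return calls * tokens_per_call * usd_per_million_tokens / 1_000_000

# 100k simple calls a month, ~500 tokens each, at a placeholder $1/M tokens:
print(monthly_cloud_cost(100_000, 500, 1.0))  # recurring cost; local is $0 after hardware
```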
Learning and experimentation. Running models locally helps you understand how they work. You can experiment with different models, settings, and approaches without worrying about costs.
When Cloud Wins
Complex coding tasks. For refactoring large files, architectural design, complex debugging, and multi-step reasoning, cloud models (Claude Opus, GPT-4o) are significantly more capable than anything that runs locally.
Interactive coding agents. Claude Code's capabilities come from both the model and the tooling around it. Local models don't have equivalent agent infrastructure yet.
Time-sensitive work. If the task is important and you need the best possible result, use the best available model — which is currently in the cloud.
Setting Up Ollama for Coding
Here's a practical setup for coding with Ollama:
# Install Ollama
brew install ollama
# Pull a good coding model
ollama pull qwen2.5-coder:7b
# Test it with a coding task
ollama run qwen2.5-coder:7b "Write a TypeScript function that debounces a callback with a configurable delay"
# Start the API server (for tool integration)
ollama serve
Using With Continue.dev (VS Code)
Continue.dev is a VS Code extension that provides AI coding assistance and supports local models:
// .continue/config.json
{
"models": [
{
"title": "Local Qwen Coder",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
]
}
This gives you AI coding assistance in VS Code powered entirely by a local model. The quality won't match Claude or GPT-4o for complex tasks, but it's private and free.
Try this now
- Decide whether you have a real reason to go local: privacy, offline work, or cost.
- Check your hardware before downloading a model that will be miserable to run.
- Test one local model on a simple coding task and compare the result honestly to your preferred cloud model.
Prompt to give your agent
"Design a local AI coding setup for me. Hardware: [CPU/GPU/RAM] Operating system: [OS] Constraints: [privacy, offline use, budget] Typical tasks: [simple coding help, reviews, refactors, chat]
Recommend:
- the best local model size I can realistically run
- whether Ollama or LM Studio is the better starting point
- what tool integration I should use
- which tasks should still go to cloud models if policy allows"
What you must review yourself
- Whether the privacy or offline requirement is real enough to justify the quality tradeoff
- Whether the model actually fits your hardware without constant swapping and frustration
- Whether the local UX gap is acceptable for the way you like to work
- Whether you are expecting cloud-level reasoning from a smaller local model
Common Mistakes to Avoid
- Expecting cloud-quality results. Local models are valuable, but they are not magic.
- Running models too large for your hardware. Bad fit equals bad experience.
- Not trying quantized models. Smaller practical models often beat idealized bigger ones you cannot run well.
- Ignoring the UX gap. Model quality and tool quality are separate tradeoffs.
- Using local when cloud is better. Principle should not override task fit without a reason.
Key takeaways
- Local models trade top-end capability for privacy, control, and offline access
- Hardware fit matters as much as model choice
- Ollama and LM Studio lower the barrier, but they do not erase the tradeoffs
- A hybrid workflow often beats a purely local or purely cloud stance
What's Next
You have now completed the ai-tools path end to end. The next step is to apply these choices deliberately:
- choose the right tool for the task instead of defaulting blindly
- set the right provider, budget, and context strategy before scale makes mistakes expensive
- treat prompts, context files, and review discipline as part of the engineering workflow
That is how AI assistance becomes leverage instead of chaos.