Every time you use a cloud AI service, your data travels to someone else's server. For personal projects, research, sensitive work, or just preference — this matters.
Here's a complete local AI stack that runs on your machine. Nothing phones home.
The Core: Ollama
Ollama is the foundation. It's a tool for running open-weight models locally with a dead-simple interface:
```shell
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run a model
ollama run llama3.2
ollama run qwen2.5-coder:32b
ollama pull nomic-embed-text   # embedding model — pull it; it has no chat mode

# Expose an API (OpenAI-compatible)
ollama serve   # runs on http://localhost:11434
```
The OpenAI-compatible API means any tool that speaks the OpenAI API also works with Ollama: point `OPENAI_BASE_URL` at `http://localhost:11434/v1` and your existing code runs against local models.
For Non-Technical Users: LM Studio
If you want a GUI for model management and chat:
LM Studio is a desktop app that downloads and runs models with a ChatGPT-style interface. No terminal required. Also exposes a local API.
The difference from Ollama: LM Studio is for interactive use. Ollama is for programmatic use and running as a service.
Local Embeddings
For RAG and semantic search, you need an embedding model. The best local options:
- `nomic-embed-text` — best balance of quality and speed. Pull via Ollama: `ollama pull nomic-embed-text`
- `mxbai-embed-large` — higher quality for more complex retrieval tasks; a larger model.
- `all-MiniLM-L6-v2` — tiny and fast, good for high-volume use cases where compute is limited.
Using local embeddings via Python:
```python
import ollama

def embed(text: str) -> list[float]:
    """Return the embedding vector for a string, computed locally."""
    response = ollama.embeddings(
        model="nomic-embed-text",
        prompt=text,
    )
    return response["embedding"]
```
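Embedding vectors become useful when you compare them; cosine similarity is the standard metric, and it needs nothing beyond the standard library. A minimal sketch (the example vectors are synthetic — with real data you would compare two outputs of `embed`):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With real embeddings you would compare texts, e.g.:
#   cosine_similarity(embed("cats"), embed("kittens"))   # high
#   cosine_similarity(embed("cats"), embed("tax law"))   # low
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```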
Local Vector Database
ChromaDB runs in-process with zero infrastructure:
```python
import chromadb

client = chromadb.PersistentClient(path="./local-db")
collection = client.get_or_create_collection("my-docs")

# Add documents (without explicit embeddings, Chroma embeds
# them with its built-in default model)
collection.add(
    documents=["This is a document about AI"],
    ids=["doc-1"],
)

# Query
results = collection.query(
    query_texts=["What documents are about AI?"],
    n_results=3,
)
```
The PersistentClient stores data on disk — it persists between sessions. No server, no Docker, just a directory.
For larger scale (millions of documents), Qdrant can also be run locally via Docker with persistent storage.
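One way to stand up a persistent local Qdrant (the volume path and container name are illustrative):

```shell
# Run Qdrant locally; data survives container restarts via the mounted volume
docker run -d --name qdrant \
  -p 6333:6333 \
  -v "$(pwd)/qdrant-data:/qdrant/storage" \
  qdrant/qdrant
```

For smaller workloads, the Python client (`pip install qdrant-client`) also offers an embedded local mode (`QdrantClient(path=...)`) that skips Docker entirely.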
Putting It Together: A Local RAG System
```python
import ollama
import chromadb
from pathlib import Path

# Setup
client = chromadb.PersistentClient(path="./knowledge-base")
collection = client.get_or_create_collection("docs")

def index_file(filepath: str):
    """Add a file to the local knowledge base."""
    text = Path(filepath).read_text()

    # Chunk the text (simple split — use a proper chunker in production)
    chunks = [text[i:i + 500] for i in range(0, len(text), 500)]

    embeddings = [
        ollama.embeddings(model="nomic-embed-text", prompt=chunk)["embedding"]
        for chunk in chunks
    ]

    collection.add(
        documents=chunks,
        embeddings=embeddings,
        ids=[f"{filepath}-{i}" for i in range(len(chunks))],
    )

def ask(question: str) -> str:
    """Query the knowledge base and generate an answer."""
    q_embedding = ollama.embeddings(
        model="nomic-embed-text", prompt=question
    )["embedding"]

    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=3,
    )
    context = "\n\n".join(results["documents"][0])

    response = ollama.chat(
        model="llama3.2",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response["message"]["content"]
```
This entire stack — indexing, embedding, retrieval, generation — runs locally. No API keys, no internet required after the initial model download.
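About that "use a proper chunker" caveat: the fixed-width split above can cut words and sentences in half, which hurts retrieval. A slightly better approach packs whole paragraphs up to a size limit. This is still a sketch, not a production chunker:

```python
def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk if adding this paragraph would overflow
        if current and len(current) + 2 + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    # Paragraphs longer than max_chars still become single oversized chunks;
    # a real chunker would split those further (e.g., on sentences).
    return chunks
```

Swap this in for the list comprehension in `index_file` and chunk boundaries will respect paragraph structure.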
Hardware Considerations
Honest requirements (assuming the 4-bit quantized builds most local runtimes serve by default):
- 7B models: 8GB RAM minimum, 16GB comfortable
- 13B models: 16GB RAM minimum
- 32B models: 32GB RAM; Apple Silicon does especially well here thanks to its memory bandwidth
- 70B models: 64GB+ RAM or a dedicated GPU with 40GB+ VRAM
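These figures follow from simple arithmetic: model weights take roughly parameter count × bytes per parameter, and 4-bit quantization means about half a byte per parameter, plus overhead for the KV cache and runtime. A back-of-envelope sketch (the 20% overhead factor is an assumption, not a measurement):

```python
def approx_model_ram_gb(
    params_billions: float,
    bits_per_param: float = 4.0,   # typical local quantization
    overhead: float = 1.2,         # assumed ~20% for KV cache and runtime
) -> float:
    """Rough RAM estimate for running a quantized model locally."""
    weight_bytes = params_billions * 1e9 * (bits_per_param / 8)
    return weight_bytes * overhead / 1e9

for size in (7, 13, 32, 70):
    print(f"{size}B @ 4-bit: ~{approx_model_ram_gb(size):.0f} GB")
```

The minimums in the list above add headroom for the OS and context length on top of these raw figures.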
Apple Silicon (M2/M3/M4) is the best consumer hardware for local AI. The unified memory architecture means you can run 32B models on a MacBook Pro without a discrete GPU.
Why This Matters Beyond Privacy
Local AI tools have a cost structure that's different from cloud APIs: high upfront cost (hardware), near-zero marginal cost (electricity aside). If you're running thousands of embeddings or processing large corpora, local is dramatically cheaper at scale. Where the break-even point falls depends on your hardware and workload, and heavy users can reach it fast, especially on hardware they already own.
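A sketch of that break-even arithmetic (both numbers below are hypothetical, not quoted prices):

```python
def breakeven_tokens(hardware_cost_usd: float, cloud_price_per_mtok: float) -> float:
    """Tokens to process before dedicated hardware pays for itself,
    treating local marginal cost as zero (electricity is small)."""
    return hardware_cost_usd / cloud_price_per_mtok * 1e6

# Hypothetical: a $2,000 machine vs. a cloud model at $2 per million tokens
print(f"{breakeven_tokens(2000, 2.0):.0f} tokens")  # 1000000000 tokens
```

A billion tokens sounds like a lot, but a large corpus processed repeatedly (re-indexing, batch summarization, agent loops) gets there; and if the machine was bought anyway, the marginal comparison is even more lopsided.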