Every time you use a cloud AI service, your data travels to someone else's server. For personal projects, research, sensitive work, or just preference — this matters.
Here's a complete local AI stack that runs on your machine. Nothing phones home.
The Core: Ollama
Ollama is the foundation. It's a tool for running open-weight models locally with a dead-simple interface:
```shell
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run a model
ollama run llama3.2
ollama run qwen2.5-coder:32b
ollama pull nomic-embed-text   # embedding model — pull it; it has no chat mode

# Expose an API (OpenAI-compatible)
ollama serve   # runs on http://localhost:11434
```
The OpenAI-compatible API means any tool that speaks the OpenAI API also works with Ollama: point `OPENAI_BASE_URL` at `http://localhost:11434/v1` and your existing code runs against local models.
For Non-Technical Users: LM Studio
If you want a GUI for model management and chat:
LM Studio is a desktop app that downloads and runs models with a ChatGPT-style interface. No terminal required. Also exposes a local API.
The difference from Ollama: LM Studio is for interactive use. Ollama is for programmatic use and running as a service.
Local Embeddings
For RAG and semantic search, you need an embedding model. The best local options:
- `nomic-embed-text` — best balance of quality and speed. Pull via Ollama: `ollama pull nomic-embed-text`
- `mxbai-embed-large` — higher quality for more complex retrieval tasks; a larger model.
- `all-MiniLM-L6-v2` — tiny and fast, good for high-volume use cases where compute is limited.
Using local embeddings via Python:
```python
import ollama

def embed(text: str) -> list[float]:
    """Return the embedding vector for a string, computed locally."""
    response = ollama.embeddings(
        model="nomic-embed-text",
        prompt=text,
    )
    return response["embedding"]
```
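Embedding vectors become useful when you compare them; cosine similarity is the standard metric, and it needs nothing beyond the standard library. A minimal sketch (the example vectors are synthetic — with real data you would compare two outputs of `embed`):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With real embeddings you would compare texts, e.g.:
#   cosine_similarity(embed("cats"), embed("kittens"))   # high
#   cosine_similarity(embed("cats"), embed("tax law"))   # low
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```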
Local Vector Database
ChromaDB runs in-process with zero infrastructure:
```python
import chromadb

client = chromadb.PersistentClient(path="./local-db")
collection = client.get_or_create_collection("my-docs")

# Add documents (without explicit embeddings, Chroma embeds
# them with its built-in default model)
collection.add(
    documents=["This is a document about AI"],
    ids=["doc-1"],
)

# Query
results = collection.query(
    query_texts=["What documents are about AI?"],
    n_results=3,
)
```
The PersistentClient stores data on disk — it persists between sessions. No server, no Docker, just a directory.
For larger scale (millions of documents), Qdrant can also be run locally via Docker with persistent storage.
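One way to stand up a persistent local Qdrant (the volume path and container name are illustrative):

```shell
# Run Qdrant locally; data survives container restarts via the mounted volume
docker run -d --name qdrant \
  -p 6333:6333 \
  -v "$(pwd)/qdrant-data:/qdrant/storage" \
  qdrant/qdrant
```

For smaller workloads, the Python client (`pip install qdrant-client`) also offers an embedded local mode (`QdrantClient(path=...)`) that skips Docker entirely.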
Putting It Together: A Local RAG System
```python
import ollama
import chromadb
from pathlib import Path

# Setup
client = chromadb.PersistentClient(path="./knowledge-base")
collection = client.get_or_create_collection("docs")

def index_file(filepath: str):
    """Add a file to the local knowledge base."""
    text = Path(filepath).read_text()

    # Chunk the text (simple split — use a proper chunker in production)
    chunks = [text[i:i + 500] for i in range(0, len(text), 500)]

    embeddings = [
        ollama.embeddings(model="nomic-embed-text", prompt=chunk)["embedding"]
        for chunk in chunks
    ]

    collection.add(
        documents=chunks,
        embeddings=embeddings,
        ids=[f"{filepath}-{i}" for i in range(len(chunks))],
    )

def ask(question: str) -> str:
    """Query the knowledge base and generate an answer."""
    q_embedding = ollama.embeddings(
        model="nomic-embed-text", prompt=question
    )["embedding"]

    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=3,
    )
    context = "\n\n".join(results["documents"][0])

    response = ollama.chat(
        model="llama3.2",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response["message"]["content"]
```
This entire stack — indexing, embedding, retrieval, generation — runs locally. No API keys, no internet required after the initial model download.
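About that "use a proper chunker" caveat: the fixed-width split above can cut words and sentences in half, which hurts retrieval. A slightly better approach packs whole paragraphs up to a size limit. This is still a sketch, not a production chunker:

```python
def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk if adding this paragraph would overflow
        if current and len(current) + 2 + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    # Paragraphs longer than max_chars still become single oversized chunks;
    # a real chunker would split those further (e.g., on sentences).
    return chunks
```

Swap this in for the list comprehension in `index_file` and chunk boundaries will respect paragraph structure.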
Hardware Considerations
Honest requirements (assuming the 4-bit quantized builds most local runtimes serve by default):
- 7B models: 8GB RAM minimum, 16GB comfortable
- 13B models: 16GB RAM minimum
- 32B models: 32GB RAM; Apple Silicon does especially well here thanks to its memory bandwidth
- 70B models: 64GB+ RAM or a dedicated GPU with 40GB+ VRAM
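These figures follow from simple arithmetic: model weights take roughly parameter count × bytes per parameter, and 4-bit quantization means about half a byte per parameter, plus overhead for the KV cache and runtime. A back-of-envelope sketch (the 20% overhead factor is an assumption, not a measurement):

```python
def approx_model_ram_gb(
    params_billions: float,
    bits_per_param: float = 4.0,   # typical local quantization
    overhead: float = 1.2,         # assumed ~20% for KV cache and runtime
) -> float:
    """Rough RAM estimate for running a quantized model locally."""
    weight_bytes = params_billions * 1e9 * (bits_per_param / 8)
    return weight_bytes * overhead / 1e9

for size in (7, 13, 32, 70):
    print(f"{size}B @ 4-bit: ~{approx_model_ram_gb(size):.0f} GB")
```

The minimums in the list above add headroom for the OS and context length on top of these raw figures.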
Apple Silicon (M2/M3/M4) is the best consumer hardware for local AI. The unified memory architecture means you can run 32B models on a MacBook Pro without a discrete GPU.
Why This Matters Beyond Privacy
Local AI tools have a cost structure that's different from cloud APIs: high upfront cost (hardware), near-zero marginal cost (electricity aside). If you're running thousands of embeddings or processing large corpora, local is dramatically cheaper at scale. Where the break-even point falls depends on your hardware and workload, and heavy users can reach it fast, especially on hardware they already own.
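A sketch of that break-even arithmetic (both numbers below are hypothetical, not quoted prices):

```python
def breakeven_tokens(hardware_cost_usd: float, cloud_price_per_mtok: float) -> float:
    """Tokens to process before dedicated hardware pays for itself,
    treating local marginal cost as zero (electricity is small)."""
    return hardware_cost_usd / cloud_price_per_mtok * 1e6

# Hypothetical: a $2,000 machine vs. a cloud model at $2 per million tokens
print(f"{breakeven_tokens(2000, 2.0):.0f} tokens")  # 1000000000 tokens
```

A billion tokens sounds like a lot, but a large corpus processed repeatedly (re-indexing, batch summarization, agent loops) gets there; and if the machine was bought anyway, the marginal comparison is even more lopsided.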