GitHub is full of AI repos collecting stars from people who will never actually use them. This list is different — these are repos I've actually shipped code with or learned something concrete from.
1. BerriAI/litellm
LiteLLM gives you a unified interface to 100+ LLM providers. One SDK, one API contract, swap models without changing code.
from litellm import completion
# Works for OpenAI, Anthropic, Cohere, local models, all of them
response = completion(model="anthropic/claude-opus-4-5", messages=[...])
response = completion(model="ollama/llama3", messages=[...])
The killer feature for production: a proxy server that lets you add rate limiting, load balancing, and cost tracking without touching application code.
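To make that concrete, here's a minimal proxy config sketch in litellm's config format — the model aliases and env-var names here are placeholders, not prescriptions:

```yaml
model_list:
  - model_name: fast-model            # alias your application requests
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: fast-model            # same alias, second provider -> load balancing
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
```

Start it with `litellm --config config.yaml` and point your app at the proxy's OpenAI-compatible endpoint. Swapping or load-balancing providers becomes a config change, not a deploy.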
2. chroma-core/chroma
The simplest vector database to get started with. Runs in-process (no server required), scales to a hosted version when you need it, and has a clean Python API.
import chromadb
client = chromadb.Client()
collection = client.create_collection("my-docs")
collection.add(documents=["doc 1", "doc 2"], ids=["1", "2"])
results = collection.query(query_texts=["relevant question"], n_results=2)
For RAG prototypes and small to medium production use cases, Chroma is often the right choice. You can graduate to Qdrant or Pinecone later without rearchitecting.
3. simonw/llm
Simon Willison's llm CLI tool. Run LLMs from the command line, log every prompt and response to SQLite, chain operations with pipes.
llm "Summarize this" < article.txt
cat code.py | llm "What does this do?"
llm logs list # See everything you've ever run
The logging alone is worth installing. Every prompt stored locally — useful for debugging, auditing, and cost estimation.
4. microsoft/promptflow
Promptflow is a framework for building, testing, and evaluating LLM applications. The useful parts are the evaluation tools — you can define evaluation metrics and run them against your prompts systematically.
Less useful for simple apps. Very useful if you're iterating on prompt quality and want a structured way to measure whether changes are improvements.
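The underlying idea is simple enough to sketch in plain Python. This is the general pattern promptflow gives structure to, not promptflow's actual API — `run_prompt` is a hypothetical stand-in for a real LLM call:

```python
# Generic prompt-evaluation loop: score a prompt variant against labeled cases.
def exact_match(output: str, expected: str) -> float:
    """1.0 if the model output matches the expected answer, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def evaluate(run_prompt, cases, metric=exact_match):
    """Run every case through the model and return the mean metric score."""
    scores = [metric(run_prompt(c["input"]), c["expected"]) for c in cases]
    return sum(scores) / len(scores)

cases = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

# A fake "model" that only gets arithmetic right, to show the loop in action:
fake_model = lambda q: "4" if q == "2+2" else "London"
score = evaluate(fake_model, cases)  # 0.5: one of two cases passes
```

Once a loop like this exists, "is the new prompt better?" becomes a number you can compare across runs instead of a vibe.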
5. guidance-ai/guidance
Guidance gives you structural control over LLM output — constrained generation, guaranteed JSON schemas, conditional logic in prompts.
from guidance import models, select
llm = models.Anthropic("claude-haiku-4-5")
lm = llm + "Is this review positive or negative? " + select(["positive", "negative"])
When you need deterministic output structure and regular prompting isn't reliable enough, Guidance is the tool.
6. run-llama/llama_index
LlamaIndex is mature, well-documented, and has excellent support for the retrieval patterns that actually matter in production. The data connectors alone — PDF, Notion, Slack, Google Docs, 100+ others — save significant time.
Use it when your RAG pipeline pulls from diverse sources and you don't want to write 15 custom parsers.
7. instructor-ai/instructor
Instructor patches the Anthropic and OpenAI SDKs to reliably return structured Pydantic models from LLM calls. No more writing JSON validation code for model outputs.
import instructor
from anthropic import Anthropic
from pydantic import BaseModel
client = instructor.from_anthropic(Anthropic())
class UserProfile(BaseModel):
    name: str
    age: int
    skills: list[str]

user = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,  # required by the Anthropic API
    response_model=UserProfile,
    messages=[{"role": "user", "content": "Extract: John, 28, knows Python and Rust"}],
)
# user is a UserProfile, not a string
Instructor handles retries, validation errors, and re-prompting automatically. If you're parsing structured data from LLMs, you need this.
The Meta-Pattern
What these repos have in common: they wrap LLMs at the right abstraction level. They don't try to make LLMs do things they're bad at — they handle the infrastructure so you focus on what matters.
Star them all, but actually use them. Most AI developer productivity comes from good tooling, not better prompting.