# LLM strategy

DevRecall does not run an LLM proxy. There is no DevRecall API key and no inference infrastructure. There are two modes, both end-to-end:
| Mode | How it works | Privacy | Cost |
|---|---|---|---|
| Local | DevRecall calls localhost:11434 (Ollama) | Maximum — nothing leaves device | Zero |
| BYOK | DevRecall calls OpenAI / Anthropic / OpenAI-compatible directly | Data → provider, never via DevRecall | You pay provider |
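Concretely, the two modes differ only in where the request goes. A minimal sketch (not DevRecall's actual code; the function name and return shape are illustrative, while the endpoint paths and auth headers follow Ollama's, OpenAI's, and Anthropic's public APIs):

```python
def request_target(mode: str, provider: str = "", api_key: str = "") -> dict:
    """Return the base URL and headers a chat request would use in each mode."""
    if mode == "local":
        # Local mode: talk to the Ollama server on this machine. No auth.
        return {"url": "http://localhost:11434/api/chat", "headers": {}}
    if mode == "byok" and provider == "openai":
        # BYOK: the request goes straight to the provider with your key.
        return {
            "url": "https://api.openai.com/v1/chat/completions",
            "headers": {"Authorization": f"Bearer {api_key}"},
        }
    if mode == "byok" and provider == "anthropic":
        return {
            "url": "https://api.anthropic.com/v1/messages",
            "headers": {"x-api-key": api_key},
        }
    raise ValueError(f"unknown mode/provider: {mode}/{provider}")
```

In neither branch does a DevRecall-operated host appear, which is the whole point of the table above.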
## Why no proxy

A proxy would mean:
- Your Slack messages and commit diffs pass through DevRecall infrastructure
- We’d be on the hook for SOC2 / DPA / DLP for every enterprise eval
- Inference cost scales linearly with users
- One outage and nobody gets standups
End-to-end (local or BYOK) is the only model consistent with “on-device by design.”
## Local: Ollama

The default model is gemma4. Install Ollama, pull the model, and DevRecall picks it up automatically:
```sh
ollama pull gemma4
```

If you want a different model, set it in ~/.devrecall/config.json:
```json
{ "llm": { "provider": "ollama", "model": "gemma4" } }
```

Any chat-capable model that Ollama can serve will work. Larger models give richer brag docs and chat answers; smaller models are faster for daily standups. Pick what fits your machine.
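The lookup logic is simple enough to sketch. Assuming the config shape shown above, a hypothetical loader (not DevRecall's actual code) would merge on-disk values over the documented defaults:

```python
import json
from pathlib import Path

# Documented defaults: local Ollama with gemma4.
DEFAULTS = {"provider": "ollama", "model": "gemma4"}

def load_llm_config(path: Path = Path.home() / ".devrecall" / "config.json") -> dict:
    """Read the "llm" block from config.json; missing file or keys fall back
    to the defaults above. Hypothetical helper, illustrative only."""
    try:
        llm = json.loads(path.read_text()).get("llm", {})
    except FileNotFoundError:
        llm = {}  # no config file at all: run entirely on defaults
    return {**DEFAULTS, **llm}
```

With no config file present this returns the gemma4-on-Ollama default, which matches the "picks it up automatically" behavior described above.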
DevRecall does not bundle or auto-install Ollama. It’s a separate project with its own update cycle.
## BYOK: OpenAI / Anthropic

```json
{ "llm": { "provider": "anthropic", "model": "claude-sonnet-4-6" } }
```

```sh
# Prompts for key, stores it in the OS keychain
devrecall auth anthropic
```

Supported providers and their defaults when model is omitted:
| Provider | Default model | Other examples |
|---|---|---|
| OpenAI | gpt-5.4-mini | any chat-completions model |
| Anthropic | claude-sonnet-4-6 | any Claude messages-API model |
| OpenAI-compatible | (no default — set it) | Groq, Together, self-hosted vLLM (custom base URL) |
API keys live in the OS keychain. They’re never in config.json or
shell history.
## Per-task model routing

You don't have to pick one. Route different tasks to different providers:
```json
{
  "llm": {
    "provider": "ollama",
    "model": "gemma4",
    "models": {
      "standup": "gemma4",
      "chat": "gemma4",
      "brag": "claude-sonnet-4-6"
    }
  }
}
```

Daily standups stay free on local Ollama; the quarterly brag doc flips to Claude for output quality.
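The routing rule sketched as code (hypothetical helper name; the behavior follows the description above: a task-specific entry in `models` wins, and the top-level `model` is the fallback for anything unlisted):

```python
def model_for_task(llm_config: dict, task: str) -> str:
    """Pick the model for a task: per-task override first, then the default."""
    return llm_config.get("models", {}).get(task, llm_config["model"])
```

With the config above, `model_for_task(cfg, "brag")` resolves to claude-sonnet-4-6 while `model_for_task(cfg, "standup")` stays on gemma4.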
## Fallback chain

When the primary provider is down or rate-limited, DevRecall falls through:
```
Primary BYOK (e.g., Anthropic)
  ↓ failure
Secondary BYOK (e.g., OpenAI)
  ↓ failure or not configured
Local Ollama (if running)
  ↓ not running
Template-based output (no LLM; structured but not synthesized)
```

Configured in config.json:
```json
{
  "llm": {
    "provider": "anthropic",
    "fallback": [
      { "provider": "openai", "model": "gpt-5.4-mini" },
      { "provider": "ollama", "model": "gemma4" }
    ]
  }
}
```

The implicit final step (template) means DevRecall always works, even if every LLM you configured is unreachable.
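The chain itself reduces to a small loop. A hedged sketch under the assumption that each configured provider is a callable and the template renderer is the hard-coded last resort (all names are illustrative, not DevRecall's API):

```python
def generate(prompt, providers, template_render):
    """Walk the fallback chain: first provider that answers wins;
    the template renderer always succeeds, so output is guaranteed."""
    for call in providers:
        try:
            return call(prompt)
        except Exception:
            # Provider down, rate-limited past its retries, or misconfigured:
            # move on to the next link in the chain.
            continue
    return template_render(prompt)
```

Because `template_render` sits outside the loop, the function cannot fail to produce output, which is exactly the "always works" guarantee stated above.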
## Embeddings

Embeddings power chat and search. By default, DevRecall uses all-MiniLM-L6-v2 via ONNX: small (80 MB), fast, runs on CPU, bundled with the binary. Nothing to install, nothing leaves your machine.
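To make the embeddings' role concrete, here is a toy ranking step. Illustrative only: the real model produces high-dimensional vectors, and these 2-dimensional ones just stand in to show the cosine-similarity search that chat and search run over your indexed activity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_match(query_vec, index):
    """index: list of (text, vector) pairs. Return the best-matching text."""
    return max(index, key=lambda item: cosine(query_vec, item[1]))[0]
```

Swapping the embedding model (as below) changes the vectors, not this ranking step.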
If you want better recall on chat, you can switch to OpenAI embeddings (uses your BYOK key):
```json
{ "llm": { "models": { "embed": "text-embedding-3-small" } } }
```

## Rate limits and errors

DevRecall handles 429, 401, 5xx, and quota exhaustion explicitly:
- 429 with Retry-After → exponential backoff, up to 3 retries
- Quota exhausted → stop retrying, fall through the chain
- 401 → prompt to re-enter the key
- 5xx → retry with backoff, then fall through
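Those rules map onto a retry loop like the following. A hypothetical sketch: `do_request` stands in for the actual provider call, quota-exhaustion detection is omitted for brevity, and none of these names are DevRecall's real API.

```python
import time

MAX_RETRIES = 3

def call_with_retries(do_request, sleep=time.sleep):
    """do_request() -> (status, body, retry_after_seconds).
    Returns the body on success, raises on 401, and returns None to signal
    'fall through the fallback chain'."""
    for attempt in range(MAX_RETRIES + 1):
        status, body, retry_after = do_request()
        if status == 200:
            return body
        if status == 401:
            # Bad key: retrying is pointless; the user must re-auth.
            raise PermissionError("key rejected; re-run `devrecall auth`")
        if status == 429 and attempt < MAX_RETRIES:
            # Honor Retry-After when the provider sends it, else back off.
            sleep(retry_after or 2 ** attempt)
            continue
        if 500 <= status < 600 and attempt < MAX_RETRIES:
            sleep(2 ** attempt)
            continue
        return None  # retries exhausted: caller falls through the chain
```

Returning None rather than raising is what lets the fallback chain above treat "this provider is unusable right now" as a normal, recoverable state.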
User-facing output is plain language:
```
⚠ Anthropic rate limit reached. Retrying in 30s…
⚠ Still rate limited. Falling back to local Ollama (gemma4)…
```