
LLM strategy

DevRecall does not run an LLM proxy: there is no DevRecall API key and no DevRecall inference infrastructure. There are two modes, both end-to-end:

| Mode | How it works | Privacy | Cost |
| --- | --- | --- | --- |
| Local | DevRecall calls `localhost:11434` (Ollama) | Maximum — nothing leaves the device | Zero |
| BYOK | DevRecall calls OpenAI / Anthropic / OpenAI-compatible directly | Data → provider, never via DevRecall | You pay the provider |

A proxy would mean:

  • Your Slack messages and commit diffs pass through DevRecall infrastructure
  • We’d be on the hook for SOC2 / DPA / DLP for every enterprise eval
  • Inference cost scales linearly with users
  • One outage and nobody gets standups

End-to-end (local or BYOK) is the only model consistent with “on-device by design.”

The default model is gemma4. Install Ollama, pull the model, and DevRecall picks it up automatically:

```sh
ollama pull gemma4
```

If you want a different model, set it in ~/.devrecall/config.json:

```json
{ "llm": { "provider": "ollama", "model": "gemma4" } }
```

Any chat-capable model that Ollama can serve will work. Larger models give richer brag docs and chat answers; smaller models are faster for daily standups. Pick what fits your machine.
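
The documented defaults (local Ollama, gemma4) apply whenever config.json omits a field. A sketch of how that merge could work — the real binary may load config differently:

```python
import json

# Documented defaults: local Ollama with gemma4.
DEFAULTS = {"provider": "ollama", "model": "gemma4"}

def load_llm_config(path: str) -> dict:
    """Overlay the user's llm section from config.json onto the defaults.
    Illustrative sketch, not DevRecall's actual loader."""
    try:
        with open(path) as f:
            user = json.load(f).get("llm", {})
    except FileNotFoundError:
        user = {}  # no config file: stick with the defaults
    return {**DEFAULTS, **user}
```

So a config.json that only sets `"model"` still gets `"provider": "ollama"` for free.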

DevRecall does not bundle or auto-install Ollama. It’s a separate project with its own update cycle.

To use a cloud provider instead (BYOK), set it in config.json and store your API key:

```json
{ "llm": { "provider": "anthropic", "model": "claude-sonnet-4-6" } }
```

```sh
# Prompts for key, stores it in the OS keychain
devrecall auth anthropic
```

Supported providers and their defaults when model is omitted:

| Provider | Default model | Other examples |
| --- | --- | --- |
| OpenAI | gpt-5.4-mini | any chat-completions model |
| Anthropic | claude-sonnet-4-6 | any Claude messages-API model |
| OpenAI-compatible | (no default — set it) | Groq, Together, self-hosted vLLM (custom base URL) |

API keys live in the OS keychain. They’re never in config.json or shell history.

You don’t have to pick one. Route different tasks to different providers:

```json
{
  "llm": {
    "provider": "ollama",
    "model": "gemma4",
    "models": {
      "standup": "gemma4",
      "chat": "gemma4",
      "brag": "claude-sonnet-4-6"
    }
  }
}
```

Daily standups stay free on local Ollama; the quarterly brag doc flips to Claude for output quality.
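
The lookup behind per-task routing can be sketched in a couple of lines — a task-specific entry in `llm.models` wins, otherwise the top-level `model` applies (illustrative, not DevRecall's code):

```python
def model_for_task(llm_cfg: dict, task: str) -> str:
    # Per-task override from llm.models, else the top-level model.
    return llm_cfg.get("models", {}).get(task, llm_cfg["model"])

cfg = {"provider": "ollama", "model": "gemma4",
       "models": {"brag": "claude-sonnet-4-6"}}
model_for_task(cfg, "standup")  # gemma4: no override, falls back to the default
model_for_task(cfg, "brag")     # claude-sonnet-4-6: explicit override
```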

When the primary provider is down or rate-limited, DevRecall falls through:

```
Primary BYOK (e.g., Anthropic)
  ↓ failure
Secondary BYOK (e.g., OpenAI)
  ↓ failure or not configured
Local Ollama (if running)
  ↓ not running
Template-based output (no LLM, structured but not synthesized)
```

Configured in config.json:

```json
{
  "llm": {
    "provider": "anthropic",
    "fallback": [
      { "provider": "openai", "model": "gpt-5.4-mini" },
      { "provider": "ollama", "model": "gemma4" }
    ]
  }
}
```

The implicit final step (template) means DevRecall always works — even if every LLM you configured is unreachable.
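
The chain above reduces to a simple loop with a guaranteed last resort. A sketch of the documented behavior, not DevRecall's implementation:

```python
def generate(prompt, providers, render_template):
    """Walk the fallback chain: the first provider that answers wins.
    If every call fails, fall back to the template renderer, which
    never needs a network or an LLM."""
    for call in providers:
        try:
            return call(prompt)
        except Exception:
            continue  # outage, rate limit, or quota: try the next provider
    return render_template(prompt)  # implicit final step: always available
```

Because the template step is local and unconditional, the loop cannot come back empty-handed.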

Embeddings power chat / search. By default, DevRecall uses all-MiniLM-L6-v2 via ONNX — small (80 MB), fast, runs on CPU, bundled with the binary. Nothing to install, nothing leaves your machine.
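
Whichever model produces the vectors, retrieval works the same way: embed the query and rank stored items by cosine similarity. A minimal self-contained sketch (the two-dimensional toy vectors are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_match(query_vec, items):
    """items: list of (text, vector) pairs; return the closest text."""
    return max(items, key=lambda it: cosine(query_vec, it[1]))[0]
```

Swapping the embedding model changes the quality of the vectors, not the shape of this search.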

If you want better recall on chat, you can switch to OpenAI embeddings (uses your BYOK key):

```json
{
  "llm": {
    "models": { "embed": "text-embedding-3-small" }
  }
}
```

DevRecall handles 429, 401, 5xx, and quota exhaustion explicitly:

  • 429 with Retry-After → exponential backoff, up to 3 retries
  • Quota exhausted → stop retrying, fall through the chain
  • 401 → prompt to re-enter the key
  • 5xx → retry with backoff, then fall through
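
The retry policy above can be sketched as a small wrapper — honor `Retry-After` when the provider sends it, otherwise back off exponentially, and give up after three retries so the caller can fall through the provider chain. `RetryableError` is a hypothetical exception type for this sketch:

```python
import time

class RetryableError(Exception):
    """A 429 or 5xx; carries Retry-After seconds when the provider sent one."""
    def __init__(self, retry_after=None):
        super().__init__()
        self.retry_after = retry_after

def call_with_backoff(call, max_retries=3, base_delay=1.0, sleep=time.sleep):
    # Sketch of the documented policy, not DevRecall's code.
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RetryableError as e:
            if attempt == max_retries:
                raise  # out of retries: caller falls through the chain
            # Retry-After wins; otherwise delay doubles each attempt.
            sleep(e.retry_after or base_delay * (2 ** attempt))
```

Quota exhaustion and 401 are deliberately not retried this way: retrying cannot fix either, so they skip straight to fallback or a key re-prompt.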

User-facing output is plain language:

```
⚠ Anthropic rate limit reached. Retrying in 30s…
⚠ Still rate limited. Falling back to local Ollama (gemma4)…
```