# LLM strategy

DevRecall does not run an LLM proxy. There is no DevRecall API key and no inference infrastructure. There are two modes, both end-to-end:
| Mode | How it works | Privacy | Cost |
|---|---|---|---|
| Local | DevRecall calls localhost:11434 (Ollama) | Maximum — nothing leaves device | Zero |
| BYOK | DevRecall calls OpenAI / Anthropic / OpenAI-compatible directly | Data → provider, never via DevRecall | You pay provider |
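Concretely, the two modes differ only in where the request goes. A minimal sketch (not DevRecall's actual code; the function name and return shape are illustrative, while the endpoint paths and auth headers follow Ollama's, OpenAI's, and Anthropic's public APIs):

```python
def request_target(mode: str, provider: str = "", api_key: str = "") -> dict:
    """Return the base URL and headers a chat request would use in each mode."""
    if mode == "local":
        # Local mode: talk to the Ollama server on this machine. No auth.
        return {"url": "http://localhost:11434/api/chat", "headers": {}}
    if mode == "byok" and provider == "openai":
        # BYOK: the request goes straight to the provider with your key.
        return {
            "url": "https://api.openai.com/v1/chat/completions",
            "headers": {"Authorization": f"Bearer {api_key}"},
        }
    if mode == "byok" and provider == "anthropic":
        return {
            "url": "https://api.anthropic.com/v1/messages",
            "headers": {"x-api-key": api_key},
        }
    raise ValueError(f"unknown mode/provider: {mode}/{provider}")
```

In neither branch does a DevRecall-operated host appear, which is the whole point of the table above.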
## Why no proxy

A proxy would mean:
- Your Slack messages and commit diffs pass through DevRecall infrastructure
- We’d be on the hook for SOC2 / DPA / DLP for every enterprise eval
- Inference cost scales linearly with users
- One outage and nobody gets standups
End-to-end (local or BYOK) is the only model consistent with “on-device by design.”
## Local: Ollama

The default model is gemma4. Install Ollama, pull the model, and DevRecall picks it up automatically:
```sh
ollama pull gemma4
```

If you want a different model, set it in ~/.devrecall/config.json:
```json
{ "llm": { "provider": "ollama", "model": "gemma4" } }
```

Any chat-capable model that Ollama can serve will work. Larger models give richer brag docs and chat answers; smaller models are faster for daily standups. Pick what fits your machine.
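The lookup logic is simple enough to sketch. Assuming the config shape shown above, a hypothetical loader (not DevRecall's actual code) would merge on-disk values over the documented defaults:

```python
import json
from pathlib import Path

# Documented defaults: local Ollama with gemma4.
DEFAULTS = {"provider": "ollama", "model": "gemma4"}

def load_llm_config(path: Path = Path.home() / ".devrecall" / "config.json") -> dict:
    """Read the "llm" block from config.json; missing file or keys fall back
    to the defaults above. Hypothetical helper, illustrative only."""
    try:
        llm = json.loads(path.read_text()).get("llm", {})
    except FileNotFoundError:
        llm = {}  # no config file at all: run entirely on defaults
    return {**DEFAULTS, **llm}
```

With no config file present this returns the gemma4-on-Ollama default, which matches the "picks it up automatically" behavior described above.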
DevRecall does not bundle or auto-install Ollama. It’s a separate project with its own update cycle.
## BYOK: OpenAI / Anthropic

```json
{ "llm": { "provider": "anthropic", "model": "claude-sonnet-4-6" } }
```

```sh
# Prompts for key, stores it in the OS keychain
devrecall auth anthropic
```

Supported providers and their defaults when model is omitted:
| Provider | Default model | Other examples |
|---|---|---|
| OpenAI | gpt-5.4-mini | any chat-completions model |
| Anthropic | claude-sonnet-4-6 | any Claude messages-API model |
| OpenAI-compatible | (no default — set it) | Groq, Together, self-hosted vLLM (custom base URL) |
API keys live in the OS keychain. They’re never in config.json or
shell history.
## Per-task model routing

You don't have to pick one. Route different tasks to different providers:
```json
{
  "llm": {
    "provider": "ollama",
    "model": "gemma4",
    "models": {
      "standup": "gemma4",
      "chat": "gemma4",
      "brag": "claude-sonnet-4-6"
    }
  }
}
```

Daily standups stay free on local Ollama; the quarterly brag doc flips to Claude for output quality.
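The routing rule sketched as code (hypothetical helper name; the behavior follows the description above: a task-specific entry in `models` wins, and the top-level `model` is the fallback for anything unlisted):

```python
def model_for_task(llm_config: dict, task: str) -> str:
    """Pick the model for a task: per-task override first, then the default."""
    return llm_config.get("models", {}).get(task, llm_config["model"])
```

With the config above, `model_for_task(cfg, "brag")` resolves to claude-sonnet-4-6 while `model_for_task(cfg, "standup")` stays on gemma4.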
## Fallback chain

When the primary provider is down or rate-limited, DevRecall falls through:
```
Primary BYOK (e.g., Anthropic)
  ↓ failure
Secondary BYOK (e.g., OpenAI)
  ↓ failure or not configured
Local Ollama (if running)
  ↓ not running
Template-based output (no LLM; structured but not synthesized)
```

Configured in config.json:
```json
{
  "llm": {
    "provider": "anthropic",
    "fallback": [
      { "provider": "openai", "model": "gpt-5.4-mini" },
      { "provider": "ollama", "model": "gemma4" }
    ]
  }
}
```

The implicit final step (template) means DevRecall always works, even if every LLM you configured is unreachable.
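The chain itself reduces to a small loop. A hedged sketch under the assumption that each configured provider is a callable and the template renderer is the hard-coded last resort (all names are illustrative, not DevRecall's API):

```python
def generate(prompt, providers, template_render):
    """Walk the fallback chain: first provider that answers wins;
    the template renderer always succeeds, so output is guaranteed."""
    for call in providers:
        try:
            return call(prompt)
        except Exception:
            # Provider down, rate-limited past its retries, or misconfigured:
            # move on to the next link in the chain.
            continue
    return template_render(prompt)
```

Because `template_render` sits outside the loop, the function cannot fail to produce output, which is exactly the "always works" guarantee stated above.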
## Embeddings

Embeddings power chat and search. By default, DevRecall uses all-MiniLM-L6-v2 via ONNX: small (80 MB), fast, runs on CPU, bundled with the binary. Nothing to install, nothing leaves your machine.
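To make the embeddings' role concrete, here is a toy ranking step. Illustrative only: the real model produces high-dimensional vectors, and these 2-dimensional ones just stand in to show the cosine-similarity search that chat and search run over your indexed activity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_match(query_vec, index):
    """index: list of (text, vector) pairs. Return the best-matching text."""
    return max(index, key=lambda item: cosine(query_vec, item[1]))[0]
```

Swapping the embedding model (as below) changes the vectors, not this ranking step.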
If you want better recall on chat, you can switch to OpenAI embeddings (uses your BYOK key):
```json
{ "llm": { "models": { "embed": "text-embedding-3-small" } } }
```

## Rate limits and errors

DevRecall handles 429, 401, 5xx, and quota exhaustion explicitly:
- 429 with Retry-After → exponential backoff, up to 3 retries
- Quota exhausted → stop retrying, fall through the chain
- 401 → prompt to re-enter the key
- 5xx → retry with backoff, then fall through
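Those rules map onto a retry loop like the following. A hypothetical sketch: `do_request` stands in for the actual provider call, quota-exhaustion detection is omitted for brevity, and none of these names are DevRecall's real API.

```python
import time

MAX_RETRIES = 3

def call_with_retries(do_request, sleep=time.sleep):
    """do_request() -> (status, body, retry_after_seconds).
    Returns the body on success, raises on 401, and returns None to signal
    'fall through the fallback chain'."""
    for attempt in range(MAX_RETRIES + 1):
        status, body, retry_after = do_request()
        if status == 200:
            return body
        if status == 401:
            # Bad key: retrying is pointless; the user must re-auth.
            raise PermissionError("key rejected; re-run `devrecall auth`")
        if status == 429 and attempt < MAX_RETRIES:
            # Honor Retry-After when the provider sends it, else back off.
            sleep(retry_after or 2 ** attempt)
            continue
        if 500 <= status < 600 and attempt < MAX_RETRIES:
            sleep(2 ** attempt)
            continue
        return None  # retries exhausted: caller falls through the chain
```

Returning None rather than raising is what lets the fallback chain above treat "this provider is unusable right now" as a normal, recoverable state.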
User-facing output is plain language:
```
⚠ Anthropic rate limit reached. Retrying in 30s…
⚠ Still rate limited. Falling back to local Ollama (gemma4)…
```