- Chat model — generates responses in conversations
- Embedding model — powers semantic memory recall
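The split above can be illustrated with a toy sketch of semantic recall: embed texts as vectors, then return the stored memory closest to a query by cosine similarity. The `toy_embed` function below is a stand-in assumption, not a real embedding model; real models return dense vectors of a fixed dimension.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a fixed-dimension
    bag-of-letters vector over a-z. Real embedding models return
    dense vectors of a fixed size (e.g. 768 or 1536 dimensions)."""
    counts = Counter(text.lower())
    return [counts.get(chr(c), 0) for c in range(ord("a"), ord("z") + 1)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recall(query, memories):
    """Return the stored memory most similar to the query."""
    qv = toy_embed(query)
    return max(memories, key=lambda m: cosine(qv, toy_embed(m)))

memories = ["user prefers dark mode", "meeting every Tuesday at noon"]
print(recall("when is the weekly meeting?", memories))
```

The chat model never sees the vectors; it only receives whichever memories recall selects, which is why the two models can come from different providers.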
## Local vs cloud

### Local
llama.cpp and LM Studio run models on your hardware. No API keys, no data leaves your machine. Requires downloading model files and enough RAM/VRAM.
### Cloud
Bedrock, Gemini, and OpenRouter run models on remote infrastructure. Requires an API key. No local hardware requirements beyond the gateway itself.
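Cloud providers expose an HTTP API authenticated with a key or token. As one illustration, OpenRouter follows the OpenAI-compatible chat-completions request shape; the sketch below only builds such a request (nothing is sent), and the base URL and model name are illustrative assumptions:

```python
import json

# Assumed OpenAI-compatible base URL; the model name below is a placeholder.
BASE_URL = "https://openrouter.ai/api/v1"

def build_chat_request(api_key: str, model: str, user_message: str):
    """Build (url, headers, body) for a chat-completions call.
    Actually sending it (e.g. with urllib) is left out."""
    url = f"{BASE_URL}/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",  # API-key auth
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    })
    return url, headers, body

url, headers, body = build_chat_request("sk-...", "some/model", "hello")
```

Local servers such as llama.cpp and LM Studio commonly serve the same request shape, which is what makes swapping between local and cloud providers a matter of changing the base URL and key.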
## Available providers
### Local

| Provider | Chat | Embeddings | Auth |
|---|---|---|---|
| llama.cpp | Yes | Yes | None |
| LM Studio | Yes | Yes | None |
### Cloud
| Provider | Chat | Embeddings | Auth |
|---|---|---|---|
| AWS Bedrock | Yes | Yes (Titan / Nova) | Bearer token |
| Google Gemini | Yes | Yes | API key |
| OpenRouter | Yes | No | API key |
## The two-server pattern
A common local setup runs chat and embeddings on separate servers. This works because chat and embeddings are independent subsystems. Configure them separately in Settings:

- Settings > Chat — provider, base URL, model, API key
- Settings > Memory — embedding provider, base URL, model, dimensions
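As a sketch of the pattern, here is one way to launch two llama.cpp servers, one for chat and one for embeddings. The model filenames and ports are placeholders, and the exact flag names may vary between llama.cpp versions:

```shell
# Chat server (placeholder model file and port)
llama-server -m chat-model.gguf --port 8080 &

# Embedding server: --embedding enables the embeddings endpoint
llama-server -m embedding-model.gguf --embedding --port 8081 &

# Then point Settings > Chat at http://localhost:8080
# and Settings > Memory at http://localhost:8081
```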
## Hot-swapping
Chat provider changes (provider, model, base URL, system prompt) take effect immediately — no gateway restart needed. Embedding provider changes require a restart and may invalidate existing vector memory if the model or dimensions change.

## Choosing a provider
| Priority | Recommended |
|---|---|
| Privacy first, no cloud | llama.cpp or LM Studio |
| Best quality, cost is fine | Bedrock (Claude, Nova) or Gemini |
| Widest model selection | OpenRouter |
| Simple local setup | LM Studio (built-in model browser) |
| Maximum control | llama.cpp (direct llama-server flags) |
