Custom LLMs
The default Bolti experience is to pick an LLM from a curated list of major providers — OpenAI, Gemini, Groq. That covers most use cases. But for some agents you'll want to go further:
- Cost — voice agents talk a lot, and OSS models on dedicated infrastructure are often 5–10× cheaper at scale than GPT-4o-class models.
- Latency — voice is unforgiving. The right OSS model on the right GPU can hit sub-150ms time-to-first-token, which translates to perceptibly faster agents.
- Data control — keep prompts off the major hyperscalers' general APIs.
- Fine-tunes — you have a model trained on your domain (your call transcripts, your product, your tone) and want it in production.
- Capability — large open-weights models (Llama 4 Maverick, Qwen3-235B, DeepSeek V3.1) are now competitive with frontier closed models on a lot of conversational tasks.
This page covers how Bolti supports custom / open-source LLMs.
What's available today
Bolti has Baseten integrated as a first-class LLM provider. When you pick Baseten in the LLM tab of agent settings, you get a curated list of production-ready OSS models running on Baseten's GPU infrastructure:
| Model | Best for | Approximate scale |
|---|---|---|
| DeepSeek-V3.1 | Strong reasoning, JSON / tool-call quality, English + multilingual. Often the best balance of quality and cost for voice. | 671B MoE (37B active) |
| Llama-4-Maverick-17B-128E-Instruct | Lower-latency conversational agent with very long context (1M tokens). Good at instruction following. | 17B active / 400B total MoE |
| Qwen3-235B-A22B | Highest-quality open model in this list. Use when capability matters more than latency. | 235B MoE (22B active) |
These show up in the LLM dropdown alongside OpenAI, Gemini, Groq, and DeepSeek — pick one and the agent uses it. No additional setup required: API access is managed by Bolti.
See Agent Setup → LLM for the full configuration surface.
Why Baseten
A short answer to the obvious question — why this provider, not Together, Fireworks, vLLM-on-RunPod, etc.?
- Latency optimization is the product. Baseten's pitch is sub-second cold starts and aggressive inference optimization (speculative decoding, FP8 weights, custom batching). For voice — where every 100ms of LLM TTFT shows up as audible silence in the call — this matters more than for chat.
- OpenAI-compatible API. Drops cleanly into Bolti's existing LLM plugin system. The same code path that handles OpenAI handles Baseten.
- Scale to zero. Agents that only run during business hours, or sub-accounts with very bursty traffic, don't pay for idle GPUs.
- Multi-region GPU pools. Baseten runs in multiple regions, which helps when data residency matters.
- Production-grade ops. SOC 2 Type II, active monitoring, autoscaling that actually works under load. Not something you'd want to build yourself for the first deployment.
Caveats worth knowing:
- Dedicated GPU pricing. You pay for GPU time, not per-token. For very low-traffic agents (a few calls per day) this is more expensive than calling a shared inference API. The break-even is roughly when you have steady traffic — once a model is "warm," dedicated inference wins on both cost and latency.
- Vendor concentration. You're adding Baseten as an upstream dependency. They have good uptime, but it's another moving piece.
- Model selection still matters. A bad-fit model on the fastest infrastructure is still bad. Test with Preview and real calls before committing to a model.
For most teams this trade is worth it. For very low-volume agents, sticking with shared OpenAI / Groq is cheaper.
Choosing among the Baseten models
A practical decision tree:
| If you want… | Pick |
|---|---|
| The default safe choice for production voice | DeepSeek-V3.1 — best balance of quality, latency, and cost. |
| The fastest possible TTFT with good-enough quality | Llama-4-Maverick — fewest active parameters, smallest TTFT. |
| Best output quality, willing to spend more on latency | Qwen3-235B-A22B — highest reasoning capability of the three. |
| A long-context agent (analysing long call histories, large knowledge bases inline) | Llama-4-Maverick — 1M-token context window. |
Run real calls through both your top choices via Preview and pick on actual conversation quality, not benchmarks. Voice exposes prompt-following weaknesses that text benchmarks hide.
Bringing your own model
Two paths exist today, depending on your needs.
1. Your own deployment on Baseten (recommended for fine-tunes)
If you have a fine-tuned model — Llama / Mistral / Qwen / your own — the cleanest path is to deploy it on Baseten and have Bolti point at it. You get:
- Your weights, your model
- Bolti's existing Baseten integration with no additional code on our side
- The same latency / scale-to-zero / observability story as the curated models above
This is a short engagement-required step today (we add your model id to your workspace's allowed model list). Reach out via hello@bolti.co.in with:
- The Baseten model ID you want to use
- A workspace ID
- Whether you need it scoped to specific agents or available globally in your workspace
2. Self-hosted endpoint (on-prem / strict data control)
For organizations who can't use a third-party hosting provider at all — strict data residency, on-prem mandates, government workloads — Bolti can be configured to point its OpenAI-compatible LLM plugin at any endpoint that speaks the OpenAI Chat Completions protocol:
- vLLM with OpenAI-compatible serving (
vllm serve --served-model-name your-model) - TGI (Text Generation Inference) from HuggingFace
- llama.cpp server for smaller models on CPU/Apple Silicon
- Any commercial OpenAI-compatible API — Together, Fireworks, OpenRouter, Anthropic-via-proxy, etc.
This is part of an On-Prem Deployment engagement — we wire your endpoint URL into the agent runtime configuration, so the OpenAI provider in the LLM dropdown actually points at your endpoint instead of api.openai.com.
Per-agent BYO endpoints (where each agent in the dashboard could point at a different URL) are on the roadmap but not shipped — today the runtime endpoint is set at the deployment level, not per agent.
Latency expectations
Approximate time-to-first-token (TTFT) numbers we've measured on a warm Baseten deployment, ordered by typical voice quality:
| Model | TTFT (warm) | Notes |
|---|---|---|
gpt-4o-mini (OpenAI) | 250–400ms | Bolti default for the wizard. Reliable. |
llama-3.1-8b-instant (Groq) | 80–150ms | Fastest, but smaller model — quality ceiling. |
Llama-4-Maverick (Baseten) | 150–250ms | Sweet spot for latency at frontier-OSS quality. |
DeepSeek-V3.1 (Baseten) | 200–350ms | Slower TTFT but stronger reasoning. |
Qwen3-235B-A22B (Baseten) | 300–500ms | Highest quality, costs you ~150ms in the user's experience. |
gpt-5.1 (OpenAI) | 400–800ms | Capable but you'll feel the silence on every turn. |
Cold starts add up to a few seconds on Baseten-hosted models when traffic is bursty. For 24/7 production agents, set a minimum-replica count via your Baseten deployment so the first call of the day isn't the slowest.
When not to use a custom LLM
To save you a wrong turn — stick with the default OpenAI/Gemini/Groq path if:
- You're still in the wizard / first agent. Get the agent working before optimizing the model.
- Traffic is genuinely low (a few calls a day). The cost / ops overhead isn't worth it.
- Your bottleneck is prompt engineering, not model capability. A better system prompt almost always beats a bigger model.
- You haven't measured the latency yet. Measure first, then optimize.
Related
- Agent Setup → LLM — picking and configuring the LLM in the dashboard
- Understanding Providers — overview of all provider categories
- Call Latencies — where latency comes from in a Bolti call
- On-Prem Deployment — running Bolti in your own infrastructure with self-hosted LLMs
- PII Data Protection — controlling what reaches your LLM provider
To deploy a fine-tuned model or wire up a self-hosted endpoint, reach out via hello@bolti.co.in.