Skip to main content

Custom LLMs

The default Bolti experience is to pick an LLM from a curated list of major providers — OpenAI, Gemini, Groq. That covers most use cases. But for some agents you'll want to go further:

  • Cost — voice agents talk a lot, and OSS models on dedicated infrastructure are often 5–10× cheaper at scale than GPT-4o-class models.
  • Latency — voice is unforgiving. The right OSS model on the right GPU can hit sub-150ms time-to-first-token, which translates to perceptibly faster agents.
  • Data control — keep prompts off the major hyperscalers' general APIs.
  • Fine-tunes — you have a model trained on your domain (your call transcripts, your product, your tone) and want it in production.
  • Capability — large open-weights models (Llama 4 Maverick, Qwen3-235B, DeepSeek V3.1) are now competitive with frontier closed models on a lot of conversational tasks.

This page covers how Bolti supports custom / open-source LLMs.

What's available today

Bolti has Baseten integrated as a first-class LLM provider. When you pick Baseten in the LLM tab of agent settings, you get a curated list of production-ready OSS models running on Baseten's GPU infrastructure:

ModelBest forApproximate scale
DeepSeek-V3.1Strong reasoning, JSON / tool-call quality, English + multilingual. Often the best balance of quality and cost for voice.671B MoE (37B active)
Llama-4-Maverick-17B-128E-InstructLower-latency conversational agent with very long context (1M tokens). Good at instruction following.17B active / 400B total MoE
Qwen3-235B-A22BHighest-quality open model in this list. Use when capability matters more than latency.235B MoE (22B active)

These show up in the LLM dropdown alongside OpenAI, Gemini, Groq, and DeepSeek — pick one and the agent uses it. No additional setup required: API access is managed by Bolti.

See Agent Setup → LLM for the full configuration surface.

Why Baseten

A short answer to the obvious question — why this provider, not Together, Fireworks, vLLM-on-RunPod, etc.?

  • Latency optimization is the product. Baseten's pitch is sub-second cold starts and aggressive inference optimization (speculative decoding, FP8 weights, custom batching). For voice — where every 100ms of LLM TTFT shows up as audible silence in the call — this matters more than for chat.
  • OpenAI-compatible API. Drops cleanly into Bolti's existing LLM plugin system. The same code path that handles OpenAI handles Baseten.
  • Scale to zero. Agents that only run during business hours, or sub-accounts with very bursty traffic, don't pay for idle GPUs.
  • Multi-region GPU pools. Baseten runs in multiple regions, which helps when data residency matters.
  • Production-grade ops. SOC 2 Type II, active monitoring, autoscaling that actually works under load. Not something you'd want to build yourself for the first deployment.

Caveats worth knowing:

  • Dedicated GPU pricing. You pay for GPU time, not per-token. For very low-traffic agents (a few calls per day) this is more expensive than calling a shared inference API. The break-even is roughly when you have steady traffic — once a model is "warm," dedicated inference wins on both cost and latency.
  • Vendor concentration. You're adding Baseten as an upstream dependency. They have good uptime, but it's another moving piece.
  • Model selection still matters. A bad-fit model on the fastest infrastructure is still bad. Test with Preview and real calls before committing to a model.

For most teams this trade is worth it. For very low-volume agents, sticking with shared OpenAI / Groq is cheaper.

Choosing among the Baseten models

A practical decision tree:

If you want…Pick
The default safe choice for production voiceDeepSeek-V3.1 — best balance of quality, latency, and cost.
The fastest possible TTFT with good-enough qualityLlama-4-Maverick — fewest active parameters, smallest TTFT.
Best output quality, willing to spend more on latencyQwen3-235B-A22B — highest reasoning capability of the three.
A long-context agent (analysing long call histories, large knowledge bases inline)Llama-4-Maverick — 1M-token context window.

Run real calls through both your top choices via Preview and pick on actual conversation quality, not benchmarks. Voice exposes prompt-following weaknesses that text benchmarks hide.

Bringing your own model

Two paths exist today, depending on your needs.

If you have a fine-tuned model — Llama / Mistral / Qwen / your own — the cleanest path is to deploy it on Baseten and have Bolti point at it. You get:

  • Your weights, your model
  • Bolti's existing Baseten integration with no additional code on our side
  • The same latency / scale-to-zero / observability story as the curated models above

This is a short engagement-required step today (we add your model id to your workspace's allowed model list). Reach out via hello@bolti.co.in with:

  • The Baseten model ID you want to use
  • A workspace ID
  • Whether you need it scoped to specific agents or available globally in your workspace

2. Self-hosted endpoint (on-prem / strict data control)

For organizations who can't use a third-party hosting provider at all — strict data residency, on-prem mandates, government workloads — Bolti can be configured to point its OpenAI-compatible LLM plugin at any endpoint that speaks the OpenAI Chat Completions protocol:

  • vLLM with OpenAI-compatible serving (vllm serve --served-model-name your-model)
  • TGI (Text Generation Inference) from HuggingFace
  • llama.cpp server for smaller models on CPU/Apple Silicon
  • Any commercial OpenAI-compatible API — Together, Fireworks, OpenRouter, Anthropic-via-proxy, etc.

This is part of an On-Prem Deployment engagement — we wire your endpoint URL into the agent runtime configuration, so the OpenAI provider in the LLM dropdown actually points at your endpoint instead of api.openai.com.

Per-agent BYO endpoints (where each agent in the dashboard could point at a different URL) are on the roadmap but not shipped — today the runtime endpoint is set at the deployment level, not per agent.

Latency expectations

Approximate time-to-first-token (TTFT) numbers we've measured on a warm Baseten deployment, ordered by typical voice quality:

ModelTTFT (warm)Notes
gpt-4o-mini (OpenAI)250–400msBolti default for the wizard. Reliable.
llama-3.1-8b-instant (Groq)80–150msFastest, but smaller model — quality ceiling.
Llama-4-Maverick (Baseten)150–250msSweet spot for latency at frontier-OSS quality.
DeepSeek-V3.1 (Baseten)200–350msSlower TTFT but stronger reasoning.
Qwen3-235B-A22B (Baseten)300–500msHighest quality, costs you ~150ms in the user's experience.
gpt-5.1 (OpenAI)400–800msCapable but you'll feel the silence on every turn.

Cold starts add up to a few seconds on Baseten-hosted models when traffic is bursty. For 24/7 production agents, set a minimum-replica count via your Baseten deployment so the first call of the day isn't the slowest.

When not to use a custom LLM

To save you a wrong turn — stick with the default OpenAI/Gemini/Groq path if:

  • You're still in the wizard / first agent. Get the agent working before optimizing the model.
  • Traffic is genuinely low (a few calls a day). The cost / ops overhead isn't worth it.
  • Your bottleneck is prompt engineering, not model capability. A better system prompt almost always beats a bigger model.
  • You haven't measured the latency yet. Measure first, then optimize.

To deploy a fine-tuned model or wire up a self-hosted endpoint, reach out via hello@bolti.co.in.