Custom LLMs

The default Bolti experience is to pick an LLM from a curated list of major providers — OpenAI, Gemini, Groq. That covers most use cases. But for some agents you'll want to go further:

Cost — voice agents talk a lot, and OSS models on dedicated infrastructure are often 5–10× cheaper at scale than GPT-4o-class models.
Latency — voice is unforgiving. The right OSS model on the right GPU can hit sub-150ms time-to-first-token, which translates to perceptibly faster agents.
Data control — keep prompts off the major hyperscalers' general APIs.
Fine-tunes — you have a model trained on your domain (your call transcripts, your product, your tone) and want it in production.
Capability — large open-weights models (Llama 4 Maverick, Qwen3-235B, DeepSeek V3.1) are now competitive with frontier closed models on a lot of conversational tasks.

This page covers how Bolti supports custom / open-source LLMs.

What's available today

Bolti has Baseten integrated as a first-class LLM provider. When you pick Baseten in the LLM tab of agent settings, you get a curated list of production-ready OSS models running on Baseten's GPU infrastructure:

Model	Best for	Approximate scale
DeepSeek-V3.1	Strong reasoning, JSON / tool-call quality, English + multilingual. Often the best balance of quality and cost for voice.	671B MoE (37B active)
Llama-4-Maverick-17B-128E-Instruct	Lower-latency conversational agent with very long context (1M tokens). Good at instruction following.	17B active / 400B total MoE
Qwen3-235B-A22B	Highest-quality open model in this list. Use when capability matters more than latency.	235B MoE (22B active)

These show up in the LLM dropdown alongside OpenAI, Gemini, Groq, and DeepSeek — pick one and the agent uses it. No additional setup required: API access is managed by Bolti.

See Agent Setup → LLM for the full configuration surface.

Why Baseten

A short answer to the obvious question — why this provider, not Together, Fireworks, vLLM-on-RunPod, etc.?

Latency optimization is the product. Baseten's pitch is sub-second cold starts and aggressive inference optimization (speculative decoding, FP8 weights, custom batching). For voice — where every 100ms of LLM TTFT shows up as audible silence in the call — this matters more than for chat.
OpenAI-compatible API. Drops cleanly into Bolti's existing LLM plugin system. The same code path that handles OpenAI handles Baseten.
Scale to zero. Agents that only run during business hours, or sub-accounts with very bursty traffic, don't pay for idle GPUs.
Multi-region GPU pools. Baseten runs in multiple regions, which helps when data residency matters.
Production-grade ops. SOC 2 Type II, active monitoring, autoscaling that actually works under load. Not something you'd want to build yourself for the first deployment.

Caveats worth knowing:

Dedicated GPU pricing. You pay for GPU time, not per-token. For very low-traffic agents (a few calls per day) this is more expensive than calling a shared inference API. The break-even is roughly when you have steady traffic — once a model is "warm," dedicated inference wins on both cost and latency.
Vendor concentration. You're adding Baseten as an upstream dependency. They have good uptime, but it's another moving piece.
Model selection still matters. A bad-fit model on the fastest infrastructure is still bad. Test with Preview and real calls before committing to a model.

For most teams this trade is worth it. For very low-volume agents, sticking with shared OpenAI / Groq is cheaper.

Choosing among the Baseten models

A practical decision tree:

If you want…	Pick
The default safe choice for production voice	DeepSeek-V3.1 — best balance of quality, latency, and cost.
The fastest possible TTFT with good-enough quality	Llama-4-Maverick — fewest active parameters, smallest TTFT.
Best output quality, willing to spend more on latency	Qwen3-235B-A22B — highest reasoning capability of the three.
A long-context agent (analysing long call histories, large knowledge bases inline)	Llama-4-Maverick — 1M-token context window.

Run real calls through both your top choices via Preview and pick on actual conversation quality, not benchmarks. Voice exposes prompt-following weaknesses that text benchmarks hide.

Bringing your own model

Two paths exist today, depending on your needs.

1. Your own deployment on Baseten (recommended for fine-tunes)

If you have a fine-tuned model — Llama / Mistral / Qwen / your own — the cleanest path is to deploy it on Baseten and have Bolti point at it. You get:

Your weights, your model
Bolti's existing Baseten integration with no additional code on our side
The same latency / scale-to-zero / observability story as the curated models above

This is a short engagement-required step today (we add your model id to your workspace's allowed model list). Reach out via hello@bolti.co.in with:

The Baseten model ID you want to use
A workspace ID
Whether you need it scoped to specific agents or available globally in your workspace

2. Self-hosted endpoint (on-prem / strict data control)

For organizations who can't use a third-party hosting provider at all — strict data residency, on-prem mandates, government workloads — Bolti can be configured to point its OpenAI-compatible LLM plugin at any endpoint that speaks the OpenAI Chat Completions protocol:

vLLM with OpenAI-compatible serving (vllm serve --served-model-name your-model)
TGI (Text Generation Inference) from HuggingFace
llama.cpp server for smaller models on CPU/Apple Silicon
Any commercial OpenAI-compatible API — Together, Fireworks, OpenRouter, Anthropic-via-proxy, etc.

This is part of an On-Prem Deployment engagement — we wire your endpoint URL into the agent runtime configuration, so the OpenAI provider in the LLM dropdown actually points at your endpoint instead of api.openai.com.

Per-agent BYO endpoints (where each agent in the dashboard could point at a different URL) are on the roadmap but not shipped — today the runtime endpoint is set at the deployment level, not per agent.

Latency expectations

Approximate time-to-first-token (TTFT) numbers we've measured on a warm Baseten deployment, ordered by typical voice quality:

Model	TTFT (warm)	Notes
`gpt-4o-mini` (OpenAI)	250–400ms	Bolti default for the wizard. Reliable.
`llama-3.1-8b-instant` (Groq)	80–150ms	Fastest, but smaller model — quality ceiling.
`Llama-4-Maverick` (Baseten)	150–250ms	Sweet spot for latency at frontier-OSS quality.
`DeepSeek-V3.1` (Baseten)	200–350ms	Slower TTFT but stronger reasoning.
`Qwen3-235B-A22B` (Baseten)	300–500ms	Highest quality, costs you ~150ms in the user's experience.
`gpt-5.1` (OpenAI)	400–800ms	Capable but you'll feel the silence on every turn.

Cold starts add up to a few seconds on Baseten-hosted models when traffic is bursty. For 24/7 production agents, set a minimum-replica count via your Baseten deployment so the first call of the day isn't the slowest.

When not to use a custom LLM

To save you a wrong turn — stick with the default OpenAI/Gemini/Groq path if:

You're still in the wizard / first agent. Get the agent working before optimizing the model.
Traffic is genuinely low (a few calls a day). The cost / ops overhead isn't worth it.
Your bottleneck is prompt engineering, not model capability. A better system prompt almost always beats a bigger model.
You haven't measured the latency yet. Measure first, then optimize.

Agent Setup → LLM — picking and configuring the LLM in the dashboard
Understanding Providers — overview of all provider categories
Call Latencies — where latency comes from in a Bolti call
On-Prem Deployment — running Bolti in your own infrastructure with self-hosted LLMs
PII Data Protection — controlling what reaches your LLM provider

To deploy a fine-tuned model or wire up a self-hosted endpoint, reach out via hello@bolti.co.in.

What's available today​

Why Baseten​

Choosing among the Baseten models​

Bringing your own model​

1. Your own deployment on Baseten (recommended for fine-tunes)​

2. Self-hosted endpoint (on-prem / strict data control)​

Latency expectations​

When not to use a custom LLM​

Related​