Every AI voice agent that makes or receives phone calls does it through a SIP trunk. Whether you are building on Vapi, Bland.ai, Retell, or a custom stack, the PSTN connectivity layer is SIP. Here is how to architect it correctly and fix the most common failures.
SIPSymposium is an independent platform not affiliated with or endorsed by any product or company mentioned in this guide.
An AI voice agent that answers or makes phone calls sits in the media path of a SIP call. The architecture has three layers:
The SIP trunk connects layer 1 to layer 2. The AI platform handles the connection between layers 2 and 3. Your choice of SIP trunk provider, codec, and media architecture directly affects both call quality and AI response latency.
| Platform | SIP connectivity model | BYOC support |
|---|---|---|
| Vapi | Twilio or Vonage built-in, or BYOC SIP trunk | Yes — SIP URI termination |
| Bland.ai | Built-in telephony + BYOC | Yes — custom SIP endpoint |
| Retell AI | Built-in Twilio telephony + BYOC | Yes — custom SIP trunk |
| ElevenLabs Conversational | Twilio integration or SDK | Via Twilio BYOC |
| Custom stack | Any SIP trunk + FreeSWITCH/Asterisk | Full control |
BYOC (Bring Your Own Carrier) on AI platforms means you connect your own SIP trunk to the AI platform instead of using its bundled telephony. Benefits include lower per-minute costs, reuse of existing carrier relationships, custom number inventory, and better geographic coverage.
AI voice platforms have specific SIP trunk requirements that differ from traditional PBX deployments:
Twilio Elastic SIP Trunking, Bandwidth, and Telnyx are widely used with AI voice platforms. Bandwidth and Telnyx have lower per-minute rates than Twilio and offer competitive SIP trunking for high-volume AI deployments. VoIP.ms works well for testing and lower-volume deployments.
Perceived conversational latency in AI voice is the time from when the caller stops speaking to when they hear the AI start responding. Target under 1.5 seconds for natural conversation. The SIP/RTP layer contributes to this budget:
| Component | Latency contribution | Optimization |
|---|---|---|
| RTP network (carrier to AI) | 10-50ms | Colocate AI with carrier PoP |
| Audio buffering / ptime | 20-40ms | Use 20ms ptime, avoid buffering |
| Codec transcoding | 0-30ms | Use G.711 natively, no transcoding |
| STT end-of-utterance detection | 100-300ms | Aggressive VAD, streaming STT |
| LLM first token | 200-800ms | Smaller models, streaming output |
| TTS first audio chunk | 50-200ms | Streaming TTS, sentence-level |
The SIP/RTP layer (the first three rows) should contribute under 100ms in total; the AI processing layer (the last three rows) dominates the latency budget. To optimize the SIP layer, place your media server in the same data center region as your AI inference, use G.711 end to end to eliminate transcoding, and minimize audio buffering.
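As a quick sanity check on the budget, the ranges in the table can be summed directly. The sketch below is only arithmetic over the table's figures; the dictionary keys and the `total_range` helper are illustrative names, not part of any platform's API.

```python
# Latency ranges in milliseconds, copied from the table above:
# (best case, worst case) per component.
BUDGET_MS = {
    "rtp_network": (10, 50),
    "audio_buffering": (20, 40),
    "codec_transcoding": (0, 30),
    "stt_end_of_utterance": (100, 300),
    "llm_first_token": (200, 800),
    "tts_first_audio": (50, 200),
}

# The first three rows of the table make up the SIP/RTP layer.
SIP_RTP_COMPONENTS = ("rtp_network", "audio_buffering", "codec_transcoding")


def total_range(components):
    """Sum best-case and worst-case latency over the named components."""
    low = sum(BUDGET_MS[name][0] for name in components)
    high = sum(BUDGET_MS[name][1] for name in components)
    return low, high


sip_low, sip_high = total_range(SIP_RTP_COMPONENTS)
all_low, all_high = total_range(BUDGET_MS)
print(f"SIP/RTP layer: {sip_low}-{sip_high} ms")
print(f"End to end:    {all_low}-{all_high} ms")
```

With the table's figures, the SIP/RTP rows sum to 30-120 ms and the full pipeline to 380-1420 ms. The worst case for the SIP/RTP rows is above the 100ms target, so at least one of those rows has to run near its best case, while the end-to-end worst case lands just inside the 1.5 second target.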
AI voice deployments need monitoring at both the SIP/RTP layer and the AI layer:
AI voice agents connect to phone networks (PSTN) via SIP trunks from carriers like Twilio, Bandwidth, or Telnyx. The carrier routes calls to the AI platform via SIP INVITE. The AI platform terminates the SIP call, receives RTP audio, and feeds it through an STT-LLM-TTS pipeline. The synthesized audio is sent back via RTP to the carrier and ultimately to the caller.
For AI voice agents, use the G.711 codec to avoid transcoding overhead, choose a carrier with media servers geographically close to your AI inference infrastructure, and select a provider with elastic concurrent call capacity. Twilio Elastic SIP Trunking, Bandwidth, and Telnyx are popular choices. For BYOC on platforms like Vapi or Retell, verify that the platform accepts the INVITE Request-URI format your carrier sends.
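One way to confirm the G.711 recommendation on a real call is to inspect the SDP offer in the INVITE. The following is a minimal sketch, assuming you already have the SDP body as text (for example, copied from a packet capture); `audio_offer_summary` and the sample SDP are hypothetical and only look at the `m=audio` line and `a=ptime`. Static RTP payload types 0 and 8 are PCMU and PCMA (G.711 µ-law and A-law).

```python
G711_PAYLOAD_TYPES = ("0", "8")  # static payload types: 0 = PCMU, 8 = PCMA


def audio_offer_summary(sdp: str) -> dict:
    """Summarize codec order and ptime of the audio section of an SDP body."""
    payload_types = []
    ptime = None
    in_audio = False
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            in_audio = line.startswith("m=audio ")
            if in_audio:
                # Format: m=audio <port> <proto> <payload type> ...
                payload_types = line.split()[3:]
        elif in_audio and line.startswith("a=ptime:"):
            ptime = int(line.split(":", 1)[1])
    return {
        "payload_types": payload_types,
        "g711_offered": any(pt in G711_PAYLOAD_TYPES for pt in payload_types),
        "g711_preferred": bool(payload_types)
        and payload_types[0] in G711_PAYLOAD_TYPES,
        "ptime_ms": ptime,
    }


# Illustrative SDP offer (addresses are from the documentation range).
SAMPLE_SDP = """v=0
o=- 0 0 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 40000 RTP/AVP 0 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=ptime:20
"""

print(audio_offer_summary(SAMPLE_SDP))
```

If `g711_preferred` is false or `ptime_ms` is larger than 20, the call may be transcoded or buffered somewhere in the path, which is worth checking with the carrier or platform.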
AI voice calls that drop after about 30 seconds usually do so because RTP keepalives are not configured. During AI processing gaps (silence while the AI is generating a response), no audio is sent, and the carrier or an intermediate device times out the RTP stream. Enable RTP keepalives or comfort noise generation so that packets keep flowing during silence, and set the RTP inactivity timeout to at least 60 seconds on your PBX or media server.
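For a custom stack, one simple way to keep the stream alive is to send G.711 silence frames while the AI is thinking. The sketch below only builds the packets, assuming PCMU (payload type 0) at 8 kHz with 20ms ptime; `silence_packet` and `silence_stream` are hypothetical helpers, and the UDP socket and the 20ms pacing loop are left out. Most media servers (FreeSWITCH, Asterisk) have their own settings for this, which should be preferred where available.

```python
import struct

PCMU_PAYLOAD_TYPE = 0          # static RTP payload type for G.711 mu-law
SAMPLE_RATE_HZ = 8000
PTIME_MS = 20
SAMPLES_PER_PACKET = SAMPLE_RATE_HZ * PTIME_MS // 1000  # 160 samples
ULAW_SILENCE = b"\xff"         # mu-law encoding of a zero-amplitude sample


def silence_packet(seq: int, timestamp: int, ssrc: int) -> bytes:
    """Build one RTP packet carrying 20 ms of mu-law silence."""
    header = struct.pack(
        "!BBHII",
        0x80,                      # version 2, no padding, no extension, no CSRC
        PCMU_PAYLOAD_TYPE,         # marker bit 0, payload type 0
        seq & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )
    return header + ULAW_SILENCE * SAMPLES_PER_PACKET


def silence_stream(ssrc: int, start_seq: int = 0, start_ts: int = 0):
    """Yield consecutive silence packets; the caller sends one every 20 ms."""
    seq, ts = start_seq, start_ts
    while True:
        yield silence_packet(seq, ts, ssrc)
        seq += 1
        ts += SAMPLES_PER_PACKET   # timestamp advances by samples, not ms


packets = silence_stream(ssrc=0x1234ABCD)
first = next(packets)
print(len(first))  # 12-byte header + 160-byte payload
```

In a real sender, the sequence number and timestamp should continue from the last speech packet on the same SSRC, so the far end sees one uninterrupted stream.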
Capture RTP from your AI media server and upload to SIPSymposium. The analyzer measures packet loss, jitter, codec negotiation, and RTP timing issues that affect AI voice agent performance.
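To make the two headline metrics concrete, the sketch below computes packet loss from sequence numbers and interarrival jitter using the running estimate defined in RFC 3550. It is an illustration of those standard formulas, not a description of how SIPSymposium's analyzer works. The `rtp_stats` function and its input format are assumptions, and sequence-number and timestamp wraparound are ignored for brevity.

```python
def rtp_stats(packets, clock_rate=8000):
    """Compute loss and RFC 3550 interarrival jitter for one RTP stream.

    packets: iterable of (arrival_seconds, sequence_number, rtp_timestamp)
             tuples in arrival order. Wraparound is not handled.
    """
    jitter = 0.0          # running jitter estimate, in RTP timestamp units
    prev_transit = None
    seqs = []
    for arrival, seq, rtp_ts in packets:
        # Relative transit time: arrival (in clock units) minus RTP timestamp.
        transit = arrival * clock_rate - rtp_ts
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0   # RFC 3550 smoothing factor
        prev_transit = transit
        seqs.append(seq)
    if not seqs:
        return {"expected": 0, "received": 0, "lost": 0,
                "loss_pct": 0.0, "jitter_ms": 0.0}
    expected = max(seqs) - min(seqs) + 1
    received = len(set(seqs))   # duplicates count once
    lost = expected - received
    return {
        "expected": expected,
        "received": received,
        "lost": lost,
        "loss_pct": 100.0 * lost / expected,
        "jitter_ms": 1000.0 * jitter / clock_rate,
    }


# Five packets at 20 ms ptime (160 timestamp units), with sequence 2 missing.
capture = [(i * 0.02, i, i * 160) for i in (0, 1, 3, 4)]
print(rtp_stats(capture))
```

The input tuples can be produced from a packet capture with any tool that exposes arrival time, RTP sequence number, and RTP timestamp per packet.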