AI and VoIP Guide

SIP Trunking for AI Voice Agents

9 min read · Updated April 2026

Every AI voice agent that makes or receives phone calls does it through a SIP trunk. Whether you are building on Vapi, Bland.ai, Retell, or a custom stack, the PSTN connectivity layer is SIP. Here is how to architect it correctly and fix the most common failures.

SIPSymposium is an independent platform not affiliated with or endorsed by any product or company mentioned in this guide.

In this guide

01 How AI agents connect to PSTN via SIP
02 SIP connectivity on AI voice platforms
03 SIP trunk requirements for AI agents
04 Latency architecture for sub-second response
05 Common SIP issues on AI voice platforms
06 Monitoring AI voice call quality

1. How AI voice agents connect to PSTN via SIP

An AI voice agent that answers or makes phone calls sits in the media path of a SIP call. The architecture has three layers:

PSTN layer: A SIP trunk from a carrier (Twilio, Bandwidth, Telnyx, VoIP.ms) provides phone numbers and PSTN routing
SIP media layer: A media server (built into the AI platform or your own FreeSWITCH/Asterisk) terminates the SIP call and feeds audio to the AI pipeline
AI processing layer: STT (speech-to-text) converts incoming audio to text, an LLM generates responses, TTS (text-to-speech) converts responses to audio sent back via RTP

The SIP trunk connects the PSTN layer to the SIP media layer. The AI platform handles the connection between the SIP media layer and the AI processing layer. Your choice of SIP trunk provider, codec, and media architecture directly affects both call quality and AI response latency.

; Typical AI voice agent call flow
Caller dials your number
  -> Carrier (SIP trunk provider)
    -> INVITE to your AI platform SBC/media server
      -> RTP audio to AI media server
        -> Audio frames to STT engine
          -> Transcript to LLM
            -> LLM response to TTS
              -> Audio from TTS back to media server
                -> RTP back to carrier and caller
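
The "RTP audio to AI media server, audio frames to STT engine" step in the flow above comes down to stripping the RTP header and passing the payload on. The following is a minimal Python sketch of that step, assuming plain (unencrypted) RTP as defined in RFC 3550. The names `RtpPacket` and `parse_rtp` are illustrative and do not come from any platform SDK.

```python
import struct
from typing import NamedTuple


class RtpPacket(NamedTuple):
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int
    payload: bytes


def parse_rtp(packet: bytes) -> RtpPacket:
    """Split one RTP packet (RFC 3550) into header fields and audio payload."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte RTP header")
    first, second, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    if first >> 6 != 2:
        raise ValueError("not RTP version 2")
    # Skip any CSRC identifiers (4 bytes each) that follow the fixed header.
    offset = 12 + 4 * (first & 0x0F)
    if first & 0x10:  # X bit: a header extension follows
        if len(packet) < offset + 4:
            raise ValueError("truncated header extension")
        ext_words = struct.unpack("!H", packet[offset + 2:offset + 4])[0]
        offset += 4 + 4 * ext_words
    payload = packet[offset:]
    if first & 0x20 and payload:  # P bit: last byte holds the padding length
        pad = payload[-1]
        if 0 < pad <= len(payload):
            payload = payload[:-pad]
    return RtpPacket(second & 0x7F, seq, ts, ssrc, payload)
```

With G.711 at 20ms ptime, each payload returned here is 160 bytes of audio. A media server would normally reorder and de-jitter packets before handing frames to the STT engine, which this sketch leaves out.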

2. SIP connectivity on major AI voice platforms

Platform	SIP connectivity model	BYOC support
Vapi	Twilio or Vonage built-in, or BYOC SIP trunk	Yes — SIP URI termination
Bland.ai	Built-in telephony + BYOC	Yes — custom SIP endpoint
Retell AI	Built-in Twilio telephony + BYOC	Yes — custom SIP trunk
ElevenLabs Conversational	Twilio integration or SDK	Via Twilio BYOC
Custom stack	Any SIP trunk + FreeSWITCH/Asterisk	Full control

BYOC (Bring Your Own Carrier) on AI platforms means you connect your own SIP trunk to the AI platform instead of using its bundled telephony. The benefits are lower per-minute costs, reuse of existing carrier relationships, a custom number inventory, and better geographic coverage.

3. SIP trunk requirements for AI voice agents

AI voice platforms have specific SIP trunk requirements that differ from traditional PBX deployments:

G.711 codec strongly preferred — no transcoding overhead in the AI audio pipeline. Opus or G.722 only if the AI platform specifically supports them. See VoIP codec comparison for the trade-offs.
Low latency SIP proxies — choose a carrier with media servers geographically close to your AI inference infrastructure
20ms ptime — standard packet timing. Some AI platforms support 10ms for lower latency but check platform documentation
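RFC 2833 DTMF — required for touch-tone IVR interactions (RFC 2833 has been superseded by RFC 4733, which defines the same telephone-event payload). Avoid in-band DTMF with AI pipelines, because the tones then travel inside the audio stream that the STT engine receives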
SRTP support — many AI platforms require encrypted media
High concurrency — AI deployments often need many simultaneous calls. Choose a carrier with elastic capacity
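
Several of the requirements above can be checked directly against the SDP offer your carrier sends. The sketch below, in Python, is one way to do that under simplifying assumptions: it looks at a single audio m= line, treats a missing a=ptime as the usual 20ms default for G.711, and detects SRTP only from the transport profile on the m= line. The function name `check_sdp` and the result keys are hypothetical.

```python
def check_sdp(sdp: str) -> dict:
    """Report which trunk requirements an SDP audio offer appears to meet."""
    lines = [ln.strip() for ln in sdp.splitlines() if ln.strip()]
    media = next((ln for ln in lines if ln.startswith("m=audio ")), "")
    parts = media.split()
    proto = parts[2] if len(parts) > 2 else ""
    payload_types = set(parts[3:])

    # Map dynamic payload types to encoding names, e.g. "101" -> "TELEPHONE-EVENT/8000".
    rtpmaps = {}
    for ln in lines:
        if ln.startswith("a=rtpmap:"):
            pt, _, encoding = ln[len("a=rtpmap:"):].partition(" ")
            rtpmaps[pt] = encoding.upper()

    ptime = next(
        (ln.split(":", 1)[1] for ln in lines if ln.startswith("a=ptime:")), None
    )
    return {
        # Static payload types: 0 = PCMU (mu-law), 8 = PCMA (A-law).
        "g711": bool(payload_types & {"0", "8"}),
        # Assumption: no a=ptime line means the 20 ms default applies.
        "ptime_20ms": ptime in (None, "20"),
        "rfc2833_dtmf": any(
            pt in payload_types and enc.startswith("TELEPHONE-EVENT/")
            for pt, enc in rtpmaps.items()
        ),
        # RTP/SAVP, RTP/SAVPF and UDP/TLS/RTP/SAVP(F) all indicate SRTP.
        "srtp": "SAVP" in proto,
    }
```

Running this against the offer from a captured INVITE shows at a glance whether the trunk is offering G.711, telephone-event DTMF, and encrypted media before you start debugging audio problems.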

Recommended carriers for AI voice

Twilio Elastic SIP Trunking, Bandwidth, and Telnyx are widely used with AI voice platforms. Bandwidth and Telnyx have lower per-minute rates than Twilio and offer competitive SIP trunking for high-volume AI deployments. VoIP.ms works well for testing and lower volume.

4. Latency architecture for sub-second AI response

Perceived conversational latency in AI voice is the time from when the caller stops speaking to when they hear the AI start responding. Target under 1.5 seconds for natural conversation. The SIP/RTP layer contributes to this budget:

Component	Latency contribution	Optimization
RTP network (carrier to AI)	10-50ms	Colocate AI with carrier PoP
Audio buffering / ptime	20-40ms	Use 20ms ptime, avoid buffering
Codec transcoding	0-30ms	Use G.711 natively, no transcoding
STT end-of-utterance detection	100-300ms	Aggressive VAD, streaming STT
LLM first token	200-800ms	Smaller models, streaming output
TTS first audio chunk	50-200ms	Streaming TTS, sentence-level

The SIP/RTP layer (first three rows) should contribute under 100ms total, so the AI processing layer dominates the latency budget. To optimize the SIP layer, place your media server in the same data center region as your AI inference, use G.711 to eliminate transcoding, and minimize buffering.
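
To see how the rows of the table add up, the ranges can be summed directly. This short Python sketch uses only the figures from the table above; the dictionary keys are illustrative labels.

```python
# (low, high) latency contribution in milliseconds, copied from the table.
BUDGET_MS = {
    "rtp_network": (10, 50),
    "audio_buffering": (20, 40),
    "codec_transcoding": (0, 30),
    "stt_end_of_utterance": (100, 300),
    "llm_first_token": (200, 800),
    "tts_first_audio": (50, 200),
}
SIP_RTP_ROWS = ("rtp_network", "audio_buffering", "codec_transcoding")


def total(rows):
    """Sum the best-case and worst-case contributions for the given rows."""
    low = sum(BUDGET_MS[row][0] for row in rows)
    high = sum(BUDGET_MS[row][1] for row in rows)
    return low, high


sip_low, sip_high = total(SIP_RTP_ROWS)
all_low, all_high = total(BUDGET_MS)
print(f"SIP/RTP layer: {sip_low}-{sip_high} ms")
print(f"End to end:    {all_low}-{all_high} ms")
```

The end-to-end range is 380-1420ms, which fits the 1.5 second target only narrowly in the worst case. The SIP/RTP rows sum to 30-120ms, so staying under 100ms means at least one of those three rows has to sit below its worst-case figure.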

5. Common SIP issues on AI voice platforms

Issue 01

Calls connect but AI does not respond

RTP is not reaching the AI media server. Check that firewall rules allow UDP on the RTP port range, and verify that the SDP c= line contains the correct public IP. This is common with BYOC setups where an endpoint behind NAT advertises a private IP in its SDP, so the other side sends RTP to an address it cannot reach.
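
A quick way to confirm this failure is to pull the c= addresses out of the SDP and test them against the private ranges. The Python sketch below uses the standard `ipaddress` module and handles IPv4 only. The function name `private_sdp_addresses` and the exact list of ranges checked are choices made for this example.

```python
import ipaddress

# Ranges that are not reachable across the public internet:
# RFC 1918 private space, carrier-grade NAT, loopback, and link-local.
NON_ROUTABLE = [
    ipaddress.ip_network(net)
    for net in (
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.168.0.0/16",
        "100.64.0.0/10",
        "127.0.0.0/8",
        "169.254.0.0/16",
    )
]


def private_sdp_addresses(sdp: str) -> list:
    """Return every IPv4 address on an SDP c= line that is not publicly routable."""
    found = []
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith("c=IN IP4 "):
            continue
        # "c=IN IP4 10.0.0.5" or "c=IN IP4 224.2.1.1/127" (multicast TTL suffix)
        text = line.split()[2].split("/")[0]
        try:
            address = ipaddress.ip_address(text)
        except ValueError:
            continue  # a hostname rather than a literal address
        if any(address in net for net in NON_ROUTABLE):
            found.append(text)
    return found
```

An empty result means the advertised media addresses are at least publicly routable. A non-empty result points at NAT handling on whichever side generated that SDP.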

Issue 02

AI hears its own voice (echo loop)

The AI TTS audio is being captured by the STT engine, creating a feedback loop. Implement acoustic echo cancellation or mute STT input while TTS is playing. Most AI voice SDKs have barge-in handling for this — check that it is enabled.
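
The "mute STT input while TTS is playing" option can be sketched as a simple gate in front of the STT engine. This Python example is a hypothetical illustration, not code from any AI voice SDK: the class name, method names, and the 200ms tail (extra mute time to cover audio still in flight) are all assumptions.

```python
class HalfDuplexGate:
    """Blocks caller audio from reaching STT while TTS audio is playing."""

    def __init__(self, tail_ms: int = 200) -> None:
        # Assumed extra mute time after the last TTS frame, to cover
        # audio that is still travelling to the caller and back.
        self.tail_ms = tail_ms
        self._muted_until_ms = 0

    def tts_frame_sent(self, now_ms: int, frame_ms: int = 20) -> None:
        """Call each time a TTS frame is written to the RTP stream."""
        self._muted_until_ms = max(
            self._muted_until_ms, now_ms + frame_ms + self.tail_ms
        )

    def forward_to_stt(self, now_ms: int) -> bool:
        """Return True if caller audio arriving now should be sent to STT."""
        return now_ms >= self._muted_until_ms
```

The trade-off is that a gate like this also blocks the caller from interrupting while the AI is speaking. Echo cancellation keeps interruption working, at the cost of more processing in the media path.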

Issue 03

Calls drop after 30-60 seconds

An RTP inactivity timeout is firing during AI processing gaps (while the AI is thinking, no audio is sent). Enable RTP keepalives or comfort noise generation to maintain RTP flow during silence, and set the RTP timeout on your PBX or media server to at least 60 seconds.
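
One way to keep RTP flowing during a processing gap is to send G.711 silence frames at the normal packet rate. The Python sketch below builds such packets for PCMU (mu-law) at 20ms ptime. It is an illustration under those assumptions: the function names are hypothetical, and actually sending the packets on a socket every 20ms is left to the caller.

```python
import struct

PCMU_SILENCE = 0xFF      # mu-law encoding of a zero-amplitude sample
SAMPLES_PER_FRAME = 160  # 20 ms of audio at 8000 Hz


def silence_packet(seq: int, timestamp: int, ssrc: int, payload_type: int = 0) -> bytes:
    """Build one RTP packet (version 2, no padding/extension/CSRC) of PCMU silence."""
    header = struct.pack(
        "!BBHII",
        0x80,                    # V=2, P=0, X=0, CC=0
        payload_type & 0x7F,     # marker bit clear, PT 0 = PCMU
        seq & 0xFFFF,            # 16-bit sequence number wraps
        timestamp & 0xFFFFFFFF,  # 32-bit timestamp wraps
        ssrc & 0xFFFFFFFF,
    )
    return header + bytes([PCMU_SILENCE]) * SAMPLES_PER_FRAME


def silence_stream(ssrc: int, start_seq: int, start_ts: int, frames: int):
    """Yield consecutive silence packets, continuing an existing RTP stream."""
    for i in range(frames):
        yield silence_packet(start_seq + i, start_ts + i * SAMPLES_PER_FRAME, ssrc)
```

The sequence number and timestamp should continue from the last real audio packet, using the same SSRC, so the far end sees one uninterrupted stream. For A-law (PCMA) the silence byte is 0xD5 and the payload type is 8.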

Issue 04

Poor STT accuracy

Audio quality issue in the RTP path. Check for packet loss and jitter between carrier and AI media server — even 1% loss significantly impacts STT accuracy. Verify codec is G.711 and no transcoding is occurring. Check for network congestion on the media path.
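
Packet loss and jitter can be computed from the RTP headers in a capture. The Python sketch below follows the interarrival jitter estimator from RFC 3550 and a simple sequence-number loss count. It assumes one stream with a fixed clock rate (8000 Hz for G.711) and a capture short enough that the 16-bit sequence number does not wrap. The function name and input shape are chosen for this example.

```python
def loss_and_jitter(packets, clock_rate: int = 8000):
    """Return (loss_percent, jitter_ms) for one RTP stream.

    packets: iterable of (sequence, rtp_timestamp, arrival_seconds),
    listed in the order the packets arrived.
    """
    packets = list(packets)
    if not packets:
        return 0.0, 0.0

    # Loss: packets expected from the sequence range vs. distinct packets seen.
    seqs = {seq for seq, _, _ in packets}
    expected = max(seqs) - min(seqs) + 1
    loss_percent = 100.0 * (expected - len(seqs)) / expected

    # Jitter (RFC 3550): running average of the change in transit time,
    # measured in RTP timestamp units, with a gain of 1/16.
    jitter = 0.0
    previous_transit = None
    for _, rtp_ts, arrival in packets:
        transit = arrival * clock_rate - rtp_ts
        if previous_transit is not None:
            jitter += (abs(transit - previous_transit) - jitter) / 16.0
        previous_transit = transit

    return loss_percent, 1000.0 * jitter / clock_rate
```

Feeding this the sequence number, timestamp, and capture time of each packet in a call gives a loss figure you can compare against the 1% level mentioned above.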

6. Monitoring AI voice call quality

AI voice deployments need monitoring at both the SIP/RTP layer and the AI layer:

SIP/RTP layer metrics

RTCP receiver reports — packet loss and jitter per call
SIP 4xx/5xx error rates — trunk health
Call setup time (INVITE to 200 OK) — signaling latency
RTP packet loss rate — media quality

AI layer metrics

STT accuracy rate — word error rate on known test phrases
End-of-utterance detection latency — time from when the caller stops speaking to when the STT engine finalizes the utterance
LLM first token latency — time to first response word
TTS first audio latency — time from text to first audio chunk
Total response latency — end-to-end from caller pause to AI audio start
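
The STT accuracy metric above is usually expressed as word error rate: the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. A minimal Python sketch follows, assuming simple lowercase whitespace tokenization. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Standard dynamic-programming edit distance, one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(
                min(
                    previous[j] + 1,                           # deletion
                    current[j - 1] + 1,                        # insertion
                    previous[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        previous = current
    return previous[-1] / len(ref)
```

For example, `word_error_rate("book a table for two", "book a cable for two")` returns 0.2, one substitution in five reference words. Playing the same known test phrases through the trunk on a schedule lets you track this number over time.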

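# Capture AI voice call PCAP for analysis (adjust the interface and RTP port range to your media server)
tcpdump -i eth0 -w /tmp/ai-call.pcap udp portrange 10000-20000

# Extract RTCP report stats: source SSRC, fraction lost (0-255), interarrival jitter (RTP timestamp units)
tshark -r /tmp/ai-call.pcap -Y rtcp -T fields -e rtcp.ssrc.identifier -e rtcp.ssrc.fraction -e rtcp.ssrc.jitter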
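
The loss and jitter values in RTCP reports are raw protocol fields: fraction lost is an 8-bit value out of 256, and interarrival jitter is in RTP timestamp units. The Python sketch below converts tab-separated rows of SSRC, fraction lost, and jitter, such as the tshark command above is meant to print, into percent and milliseconds. It assumes an 8000 Hz clock (G.711) and that multiple report blocks in one packet appear as comma-separated values. The function name is hypothetical.

```python
def parse_rtcp_rows(text: str, clock_rate: int = 8000) -> list:
    """Convert 'ssrc<TAB>fraction<TAB>jitter' rows into readable stats."""
    reports = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3 or not all(fields):
            continue  # not an RTCP packet with a report block
        ssrcs, fractions, jitters = (field.split(",") for field in fields)
        for ssrc, fraction, jitter in zip(ssrcs, fractions, jitters):
            reports.append(
                {
                    "ssrc": ssrc,
                    # Fraction lost is fixed point: value / 256 of packets lost.
                    "loss_percent": 100.0 * int(fraction) / 256,
                    # Jitter is in timestamp units: divide by the clock rate.
                    "jitter_ms": 1000.0 * int(jitter) / clock_rate,
                }
            )
    return reports
```

A fraction-lost value of 26, for example, corresponds to about 10% loss over the reporting interval, and a jitter value of 80 corresponds to 10ms at 8000 Hz.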

Frequently asked questions

How do AI voice agents connect to phone networks?

AI voice agents connect to phone networks (PSTN) via SIP trunks from carriers like Twilio, Bandwidth, or Telnyx. The carrier routes calls to the AI platform via SIP INVITE. The AI platform terminates the SIP call, receives RTP audio, and feeds it through an STT-LLM-TTS pipeline. The synthesized audio is sent back via RTP to the carrier and ultimately to the caller.

What SIP trunk should I use for AI voice agents?

For AI voice agents, use the G.711 codec to avoid transcoding overhead, choose a carrier with media servers geographically close to your AI inference infrastructure, and select a provider with elastic concurrent call capacity. Twilio Elastic SIP Trunking, Bandwidth, and Telnyx are popular choices. For BYOC on platforms like Vapi or Retell, verify that the platform accepts the Request-URI format your carrier sends in the INVITE.

Why do AI voice calls drop after 30 seconds?

AI voice calls drop after 30 seconds when RTP keepalives are not configured. During AI processing gaps (silence while the AI is generating a response), no audio is sent and the carrier or intermediate device times out the RTP stream. Enable RTP keepalives or comfort noise generation to send continuous low-level audio during silence. Set RTP inactivity timeout to at least 60 seconds on your PBX or media server.

Troubleshooting SIP issues in your AI voice deployment?

Capture RTP from your AI media server and upload to SIPSymposium. The analyzer measures packet loss, jitter, codec negotiation, and RTP timing issues that affect AI voice agent performance.
