AI Voice Agent SIP Integration

9 min read · Updated April 2026

AI voice agents (conversational AI systems that handle phone calls) connect to traditional telephony through SIP. The integration patterns are well established, but the failure modes are specific to real-time AI pipelines. This guide covers how to architect the integration and troubleshoot the common issues.

In this guide

01 AI voice agent architecture
02 Connecting to SIP infrastructure
03 Media handling and codecs
04 Latency optimization
05 Common integration issues
06 Testing and monitoring

1. AI voice agent architecture

An AI voice agent sits in the media path of a SIP call, processing audio in real time. The typical architecture:

SIP trunk or PBX — handles call routing and PSTN connectivity
Media server or SBC — terminates RTP, handles codec transcoding, sends audio to the AI layer
STT (Speech-to-Text) — converts incoming audio to text in real time
LLM inference — generates response based on conversation context
TTS (Text-to-Speech) — converts LLM output to audio
Audio playback — audio sent back via RTP to the caller

End-to-end latency budget: STT ~100-300ms + LLM inference ~200-500ms + TTS ~100-200ms = 400-1000ms total. Callers tolerate up to ~1.5 seconds before perceiving the delay as unnatural. Optimization is critical.
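
The budget arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the stage ranges and the 1.5 second tolerance come from the text, while the names `STAGES_MS`, `budget_range`, and `headroom_ms` are illustrative, not part of any SDK.

```python
# Per-stage latency ranges in milliseconds, taken from the budget above.
STAGES_MS = {
    "stt": (100, 300),
    "llm": (200, 500),
    "tts": (100, 200),
}

# Rough ceiling before callers perceive the pause as unnatural (from the text).
TOLERANCE_MS = 1500


def budget_range(stages):
    """Return (best_case, worst_case) total latency in ms."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


def headroom_ms(stages, tolerance_ms=TOLERANCE_MS):
    """Milliseconds left over in the worst case for everything not in the budget."""
    _, worst = budget_range(stages)
    return tolerance_ms - worst


best, worst = budget_range(STAGES_MS)
# prints: budget: 400-1000 ms, headroom: 500 ms
print(f"budget: {best}-{worst} ms, headroom: {headroom_ms(STAGES_MS)} ms")
```

The headroom figure is what remains for network transit, jitter buffering, and queuing, none of which appear in the three-stage sum.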

Two main integration patterns

Pattern A (SIP endpoint): The AI agent registers to your PBX as a SIP endpoint, like a softphone. Calls are routed to its extension like any other phone. This is the simplest approach and works with any SIP PBX. In practice it is often limited to one call per registration, so handling concurrent calls means using multiple accounts.

Pattern B (SIP trunk / B2BUA): The AI platform acts as a SIP trunk or B2BUA, receiving calls directly from your PBX or SBC. This scales better because a single trunk carries many concurrent calls, and it is required for high-volume deployments.

2. Connecting to SIP infrastructure

Connecting via SIP trunk to Asterisk

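; pjsip.conf: trunk to AI voice platform
; Assumes a transport named transport-udp is defined elsewhere in this file.
[ai-platform-auth]
type=auth
auth_type=userpass
username=ai-agent
; Placeholder value: replace with a real secret
password=strongpassword

[ai-platform-aor]
type=aor
contact=sip:ai.platform.example.com:5060

[ai-platform]
type=endpoint
transport=transport-udp
context=ai-calls
aors=ai-platform-aor            ; Needed so Dial(PJSIP/ai-platform) can resolve a contact
outbound_auth=ai-platform-auth  ; Credentials used when the platform challenges the INVITE
disallow=all
allow=ulaw      ; G.711 for lowest latency
allow=alaw
dtmf_mode=rfc4733
direct_media=no ; Force media through Asterisk for recording/monitoring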

Routing calls to AI agent in dialplan

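; Route inbound calls to AI platform
[ai-calls]
exten => s,1,Answer()
; Codec selection is enforced by allow=ulaw/alaw on the endpoint. The CHANNEL()
; items audioreadformat and audiowriteformat are read-only and cannot be Set().
; sub-ai-monitoring is a Gosub routine that must be defined elsewhere.
same => n,Dial(PJSIP/ai-platform,60,U(sub-ai-monitoring))
same => n,Hangup()

; Route DIDs to AI. This pattern matches any +1 NANP number;
; narrow it to send only specific DIDs to the AI agent.
[from-trunk]
exten => _+1NXXNXXXXXX,1,Goto(ai-calls,s,1)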

WebRTC to AI integration

For web-based AI agents, use WebRTC directly in the browser. The AI platform acts as a WebRTC peer, receiving DTLS-SRTP media. The SIP layer is only needed when bridging to PSTN calls.

3. Media handling and codec selection

Use G.711 (PCMU/PCMA) for AI voice agents — not G.729 or Opus. For why these specific codecs differ in this regard, see VoIP codec comparison; for the latency cost of any codec change in the path, see transcoding in VoIP. Reasons:

G.711 is sample-by-sample companded PCM with no frame lookahead, so encoding and decoding add negligible delay to the latency budget
Telephony-oriented STT models are typically trained on 8 kHz audio, which G.711 matches directly
No transcoding needed if your PBX and trunk both support G.711
Simpler audio pipeline — fewer failure points
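
To illustrate why G.711 decoding costs so little, here is a sketch of μ-law (PCMU) decoding in pure Python. Each byte maps independently to one 16-bit sample, so there is no frame to buffer before decoding starts. The function names are illustrative; production code would normally use a lookup table or the media server's own decoder.

```python
import struct


def ulaw_to_linear(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear PCM sample."""
    u = ~byte & 0xFF               # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_frame(payload: bytes) -> bytes:
    """Decode an RTP PCMU payload to little-endian 16-bit PCM for an STT engine."""
    samples = [ulaw_to_linear(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)


# A 20 ms PCMU frame is 160 bytes; 0xFF is mu-law silence.
pcm = decode_frame(b"\xff" * 160)
print(len(pcm), "bytes of 16-bit PCM")
```

A 160-byte frame becomes 320 bytes of linear PCM at 8 kHz, which is the form most STT engines accept for telephony audio.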

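For better STT accuracy, use G.722 (16 kHz wideband) if your SIP trunk and STT engine both support it. Its wider audio band (roughly 50 Hz to 7 kHz, versus about 300 Hz to 3.4 kHz for narrowband telephony) preserves the high-frequency energy that distinguishes consonants such as "s" and "f", which can improve transcription accuracy.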

DTMF handling

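AI agents frequently need to detect DTMF for IVR-style interactions (press 1 for sales, etc.). Use RFC 4733 telephone-events (the successor to RFC 2833, conventionally negotiated as dynamic payload type 101), never in-band DTMF. The AI audio pipeline may process or filter audio in ways that distort DTMF tones. RFC 4733 carries each digit as a named event in its own RTP packets rather than as audio, so it bypasses the audio pipeline and arrives cleanly.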
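
As a rough sketch of what the AI side has to do with these packets, the following parses the 4-byte telephone-event payload defined in RFC 4733 (the successor to RFC 2833). The payload layout is from the RFC; the `DtmfEvent` type and `parse_telephone_event` function are illustrative names, and extracting the payload from the RTP packet is assumed to happen elsewhere.

```python
import struct
from typing import NamedTuple

# Event codes 0-15 map to the sixteen DTMF keys.
EVENT_TO_DIGIT = "0123456789*#ABCD"


class DtmfEvent(NamedTuple):
    digit: str
    end: bool       # E bit: set on the packets that close the event
    volume: int     # power level in -dBm0 (0-63)
    duration: int   # in RTP timestamp units (8000 per second for 8 kHz)


def parse_telephone_event(payload: bytes) -> DtmfEvent:
    """Parse the RTP payload of a telephone-event packet."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(EVENT_TO_DIGIT):
        raise ValueError(f"not a DTMF event code: {event}")
    return DtmfEvent(
        digit=EVENT_TO_DIGIT[event],
        end=bool(flags & 0x80),
        volume=flags & 0x3F,
        duration=duration,
    )


print(parse_telephone_event(bytes([1, 0x8A, 0x03, 0x20])))
```

Senders usually repeat the final packet of an event, so a real handler should report a digit once per RTP timestamp rather than once per packet with the end bit set.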

4. Latency optimization

Perceived latency in AI voice agents comes from four sources:

Component	Typical Latency	Optimization
STT (end-of-utterance detection)	100-500ms	Streaming STT, aggressive VAD
LLM inference (first token)	200-800ms	Smaller models, streaming output
TTS (first audio chunk)	50-200ms	Streaming TTS, sentence-level chunking
Network (RTP)	10-50ms	Colocate AI with SIP infrastructure
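
The VAD entry in the table can be illustrated with a simple energy-based endpointer. This is a sketch under stated assumptions, not how any particular STT product works: the RMS threshold and the hangover length are made-up starting values, and the `Endpointer` class is hypothetical. A shorter hangover is what "aggressive" means here: the utterance is declared finished sooner, at the risk of cutting off a caller who pauses mid-sentence.

```python
import struct


def frame_rms(pcm16: bytes) -> float:
    """Root-mean-square level of a little-endian 16-bit PCM frame."""
    count = len(pcm16) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm16[: count * 2])
    return (sum(s * s for s in samples) / count) ** 0.5


class Endpointer:
    """Signals end-of-utterance after speech is followed by enough quiet frames."""

    def __init__(self, threshold: float = 500.0, hangover_frames: int = 15):
        self.threshold = threshold              # assumed RMS level separating speech from noise
        self.hangover_frames = hangover_frames  # 15 frames of 20 ms is 300 ms of quiet
        self._in_speech = False
        self._quiet = 0

    def push(self, pcm16: bytes) -> bool:
        """Feed one frame; return True exactly once when the utterance ends."""
        if frame_rms(pcm16) >= self.threshold:
            self._in_speech = True
            self._quiet = 0
            return False
        if self._in_speech:
            self._quiet += 1
            if self._quiet >= self.hangover_frames:
                self._in_speech = False
                self._quiet = 0
                return True
        return False
```

Feeding it decoded 20 ms frames, the first `True` return is the moment to finalize the STT result and hand the text to the LLM.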

Key optimizations: Use streaming STT with VAD-based end-of-utterance detection rather than waiting for a long fixed silence timeout. Stream the LLM output and start TTS on the first complete sentence, not after the full response. Colocate your AI inference with your SIP infrastructure to minimize RTP round-trip time.
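
Starting TTS on the first sentence amounts to re-chunking the LLM token stream at sentence boundaries. A minimal sketch, assuming the LLM client yields plain text fragments; `sentences_from_tokens` is an illustrative name and the boundary rule is deliberately naive.

```python
import re
from typing import Iterable, Iterator

# Sentence boundary: whitespace that follows ., ! or ?
# Naive on purpose: abbreviations such as "Dr. Smith" will be split too early.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text fragments."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            sentence = buffer[: match.start()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever is left when the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


stream = ["Sure", ", I can", " help.", " What", " is your", " account number?"]
for sentence in sentences_from_tokens(stream):
    print(sentence)  # each sentence would be handed to TTS here
```

The first sentence is released as soon as the fragment after it arrives, so TTS can begin while the LLM is still generating the rest of the reply.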

5. Common integration issues

Issue 01

Echo on AI agent calls

The AI is hearing its own TTS output, returned through the caller's microphone, and feeding it back into STT, creating a loop. Fix: implement acoustic echo cancellation (AEC) before STT, or mute the STT input while TTS is playing. Muting is simpler but prevents the caller from interrupting, so prefer AEC where barge-in matters. Many AI voice SDKs include barge-in handling that covers this case.
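
The mute-while-speaking option can be sketched as a small gate in front of the STT input. This is an illustration only: `SttGate` is a hypothetical class, the 200 ms tail is an assumed value to let residual echo decay, and the caller supplies timestamps in milliseconds. It trades away barge-in, since caller speech during playback is dropped.

```python
class SttGate:
    """Decides whether caller audio should be forwarded to STT."""

    def __init__(self, tail_ms: int = 200):
        self.tail_ms = tail_ms       # assumed echo decay time after playback stops
        self._tts_active = False
        self._muted_until_ms = 0

    def tts_started(self) -> None:
        self._tts_active = True

    def tts_stopped(self, now_ms: int) -> None:
        self._tts_active = False
        self._muted_until_ms = now_ms + self.tail_ms

    def should_forward(self, now_ms: int) -> bool:
        """True when the frame received at now_ms may be sent to STT."""
        return not self._tts_active and now_ms >= self._muted_until_ms


gate = SttGate(tail_ms=200)
gate.tts_started()
print(gate.should_forward(100))   # False: TTS is playing
gate.tts_stopped(500)
print(gate.should_forward(600))   # False: still inside the echo tail
print(gate.should_forward(700))   # True: tail has elapsed
```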

Issue 02

STT not transcribing correctly

Audio quality is too low, or there is a codec mismatch. Verify that the audio reaching the STT engine is linear PCM at the sample rate the engine expects: 8 kHz when decoded from G.711, 16 kHz from G.722. Check for transcoding in the path that degrades quality. Background noise on the caller's end reduces STT accuracy, so apply noise suppression before STT if possible.

Issue 03

Calls drop after AI response starts playing

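The AI platform sends RTP with incorrect timestamps or sequence numbers, or stops sending RTP while it waits on STT or the LLM, and the PBX RTP inactivity timeout fires. Ensure RTP from the AI platform uses sequence numbers that increment by one per packet and timestamps that advance at the codec clock rate (160 per 20 ms packet for G.711 at 8 kHz). Have the AI platform keep sending RTP (silence or comfort noise) during processing gaps; on Asterisk, also review the endpoint's rtp_timeout and rtp_keepalive settings.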
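
A sketch of correct sequence and timestamp handling on the sending side, assuming G.711 at 8 kHz with 20 ms packets. The 12-byte header layout follows RFC 3550; the `RtpPacketizer` class is illustrative and omits everything else a real stack needs (pacing, RTCP, SRTP). RFC 3550 recommends random initial sequence and timestamp values; fixed defaults are used here only to keep the example readable.

```python
import struct


class RtpPacketizer:
    """Builds RTP packets with consistent sequence numbers and timestamps."""

    def __init__(self, ssrc: int, payload_type: int = 0, clock_rate: int = 8000,
                 ptime_ms: int = 20, seq: int = 0, timestamp: int = 0):
        self.ssrc = ssrc
        self.payload_type = payload_type          # 0 is the static type for PCMU
        self.samples_per_packet = clock_rate * ptime_ms // 1000  # 160 for G.711
        self.seq = seq
        self.timestamp = timestamp

    def next_packet(self, payload: bytes, marker: bool = False) -> bytes:
        header = struct.pack(
            "!BBHII",
            0x80,                                  # version 2, no padding/extension/CSRC
            (0x80 if marker else 0) | (self.payload_type & 0x7F),
            self.seq,
            self.timestamp,
            self.ssrc,
        )
        # Advance by exactly one packet; both counters wrap around.
        self.seq = (self.seq + 1) & 0xFFFF
        self.timestamp = (self.timestamp + self.samples_per_packet) & 0xFFFFFFFF
        return header + payload


SILENCE_PCMU = b"\xff" * 160                       # 20 ms of mu-law silence
rtp = RtpPacketizer(ssrc=0x1234)
packet = rtp.next_packet(SILENCE_PCMU, marker=True)
print(len(packet), "bytes; next seq", rtp.seq, "next timestamp", rtp.timestamp)
```

Sending `SILENCE_PCMU` through the same packetizer every 20 ms while the agent is thinking keeps the stream continuous, so the far end's inactivity timer never starts.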

Issue 04

High latency — callers complain about unnatural pauses

End-to-end latency exceeds 1.5 seconds. Profile each component: STT, LLM first-token time, TTS first chunk time. Implement streaming at each stage. Consider using a faster/smaller LLM for initial response with a larger model for follow-up if needed.
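
Profiling each component can start with a small timing helper wrapped around each stage of a conversational turn. A minimal sketch: `TurnProfiler` is a hypothetical helper, and the stage bodies below are stand-ins for real STT, LLM, and TTS calls. For streaming stages, end the timed block when the first token or first audio chunk arrives rather than when the stage completes.

```python
import time
from contextlib import contextmanager


class TurnProfiler:
    """Records how long each named stage of one conversational turn takes."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock          # injectable so the profiler can be tested
        self.stages_ms = {}

    @contextmanager
    def stage(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            self.stages_ms[name] = (self._clock() - start) * 1000.0

    def total_ms(self) -> float:
        return sum(self.stages_ms.values())

    def slowest(self) -> str:
        return max(self.stages_ms, key=self.stages_ms.get)


profiler = TurnProfiler()
with profiler.stage("stt"):
    time.sleep(0.01)                 # stand-in for waiting on the STT final result
with profiler.stage("llm_first_token"):
    time.sleep(0.02)                 # stand-in for waiting on the first LLM token
with profiler.stage("tts_first_chunk"):
    time.sleep(0.005)                # stand-in for waiting on the first audio chunk
print(profiler.slowest(), round(profiler.total_ms()), "ms total")
```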

6. Testing and monitoring

Load testing: AI voice agents have unusual concurrency characteristics. LLM inference capacity does not scale per call the way telephony media handling does, so response times can degrade sharply once requests begin to queue. Test with realistic concurrent call loads to find the breaking point. Monitor GPU utilization and LLM queue depth, not just CPU and network.

Call quality monitoring: Capture RTCP reports from AI agent calls to track packet loss, jitter, and round-trip time, and estimate MOS from them. STT accuracy degradation often tracks audio quality metrics: a drop in MOS tends to coincide with reduced transcription accuracy.
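
A MOS figure derived from RTCP is an estimate, not a measurement. The sketch below uses a widely circulated simplification of the E-model: the R-factor heuristic (the latency, jitter, and loss weights) is an approximation and an assumption here, while the final R-to-MOS conversion is the standard ITU-T G.107 mapping. Treat the result as a trend indicator rather than a calibrated score.

```python
def estimate_mos(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    """Rough MOS estimate from one-way latency, jitter, and packet loss percentage."""
    # Heuristic: jitter counts double, plus a small allowance for codec delay.
    effective_latency = latency_ms + 2.0 * jitter_ms + 10.0

    # Heuristic R-factor: latency hurts more steeply beyond about 160 ms.
    if effective_latency < 160.0:
        r = 93.2 - effective_latency / 40.0
    else:
        r = 93.2 - (effective_latency - 120.0) / 10.0

    # Heuristic: each percent of packet loss costs 2.5 R points.
    r -= 2.5 * loss_pct
    r = max(0.0, min(100.0, r))

    # Standard E-model conversion from R-factor to MOS.
    return 1.0 + 0.035 * r + 7.0e-6 * r * (r - 60.0) * (100.0 - r)


good = estimate_mos(latency_ms=20, jitter_ms=2, loss_pct=0.0)
bad = estimate_mos(latency_ms=300, jitter_ms=40, loss_pct=5.0)
print(f"clean path: {good:.2f}, impaired path: {bad:.2f}")
```

Logging this estimate alongside STT confidence scores per call makes it easier to see when transcription problems are really network problems.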

SIP trace analysis: AI voice agent calls follow standard SIP dialogs. When calls fail or audio breaks down, the root cause appears in the SIP signaling and RTP streams — same as any VoIP call. Capture and analyze traces the same way you would for traditional telephony.

Frequently asked questions

How do AI voice agents connect to SIP?

AI voice agents connect to SIP infrastructure either as SIP endpoints (registering to a PBX like a softphone) or as SIP trunks/B2BUAs receiving calls directly. The agent receives RTP audio, processes it through STT, LLM inference, and TTS, then sends synthesized audio back via RTP. G.711 is the recommended codec for lowest latency.

What codec should I use for AI voice agents over SIP?

Use G.711 (PCMU/PCMA) for AI voice agent SIP integration. G.711 is simple companded PCM with no frame lookahead, so it adds negligible encoding/decoding delay to your latency budget; telephony-oriented STT models are typically trained on 8 kHz audio, which matches G.711; and no transcoding is needed when the PBX and trunk both support it. For higher STT accuracy, use G.722 (16 kHz wideband) if your SIP trunk supports it.

How do I reduce latency in AI voice agent calls?

Reduce AI voice agent latency by using streaming STT with VAD-based end-of-utterance detection, starting TTS before the full LLM response is generated (stream at sentence level), colocating AI inference infrastructure with your SIP servers, and using G.711 to eliminate codec transcoding delay. Target total latency under 1.5 seconds for natural conversation.

Troubleshooting an AI voice agent SIP integration?

Paste your SIP trace into SIPSymposium. The analyzer checks RTP timing for AI agent calls, identifies codec mismatches, detects call drops from RTP timeout, and verifies DTMF signaling.
