AI voice agents, conversational AI systems that handle phone calls, connect to traditional telephony through SIP. The integration patterns are well established, but the failure modes are unique. Here's how to architect it correctly and troubleshoot the common issues.
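An AI voice agent sits in the media path of a SIP call, processing audio in real time. In the typical architecture, the agent receives RTP audio from the PBX or SBC, runs it through STT, LLM inference, and TTS, and sends the synthesized audio back over RTP.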
End-to-end latency budget: STT ~100-300 ms + LLM inference ~200-500 ms + TTS ~100-200 ms = 400-1000 ms total. Callers tolerate up to ~1.5 seconds before perceiving the delay as unnatural, so even the worst case leaves only about 500 ms for network transit and buffering. Every stage needs to be optimized.
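The budget arithmetic above can be sketched in a few lines of Python. The stage names, the `STAGES` table, and the helper functions are illustrative assumptions; the numbers are the ranges quoted in the text, not measurements.

```python
from typing import Dict, Tuple

# Per-stage latency ranges in milliseconds (best case, worst case),
# copied from the budget in the text. Replace with your own measurements.
STAGES: Dict[str, Tuple[int, int]] = {
    "stt": (100, 300),
    "llm": (200, 500),
    "tts": (100, 200),
}

# Delay callers tolerate before the pause feels unnatural (~1.5 s per the text).
TOLERANCE_MS = 1500


def total_budget(stages: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Return (best_case_ms, worst_case_ms) summed across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


def headroom_ms(stages: Dict[str, Tuple[int, int]],
                tolerance_ms: int = TOLERANCE_MS) -> int:
    """Milliseconds left for network and buffering in the worst case."""
    _, worst = total_budget(stages)
    return tolerance_ms - worst


if __name__ == "__main__":
    best, worst = total_budget(STAGES)
    print(f"pipeline: {best}-{worst} ms, headroom: {headroom_ms(STAGES)} ms")
```

Swapping in measured per-stage numbers from your own deployment shows quickly which stage is consuming the headroom.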
Pattern A (SIP endpoint): The AI agent registers as a SIP endpoint (softphone) to your PBX. Calls are routed to the extension like any other phone. This is the simplest approach and works with any SIP PBX. Concurrency is the limitation: in many setups each registration handles one call at a time, so handling more calls means registering multiple accounts.
Pattern B (SIP trunk / B2BUA): The AI platform acts as a SIP trunk or B2BUA, receiving calls directly from your PBX or SBC. This is more scalable because a single trunk handles multiple concurrent calls, and it is required for high-volume deployments.
For web-based AI agents, use WebRTC directly in the browser. The AI platform acts as a WebRTC peer, receiving DTLS-SRTP media. The SIP layer is only needed when bridging to PSTN calls.
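Use G.711 (PCMU/PCMA) for AI voice agents, not G.729 or Opus. For why these specific codecs differ in this regard, see VoIP codec comparison; for the latency cost of any codec change in the path, see transcoding in VoIP. Reasons: G.711 is simple companded PCM, so it adds negligible encoding/decoding delay to the latency budget; STT systems are typically trained on 8 kHz audio, which matches G.711; and no transcoding is needed when the trunk already carries G.711.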
For better STT accuracy, use G.722 (16 kHz wideband) if your SIP trunk supports it. The wider audio bandwidth (up to about 7 kHz, versus about 3.4 kHz for G.711) preserves more of the high-frequency energy in consonant sounds, which improves transcription accuracy.
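Most STT engines expect linear PCM rather than the 8-bit companded samples carried in a PCMU payload, so the agent usually expands G.711 before transcription. Below is a minimal sketch of the standard mu-law expansion in pure Python; the function names are hypothetical, and a production pipeline would more likely rely on its media library's decoder.

```python
import struct

# Bias added during mu-law encoding (G.711); removed again on decode.
BIAS = 0x84


def ulaw_byte_to_pcm16(u: int) -> int:
    """Expand one 8-bit mu-law code word to a signed linear sample."""
    u = ~u & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07    # 3-bit segment number
    mantissa = u & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


# Precompute all 256 values so per-packet decoding is a table lookup.
_ULAW_TABLE = [ulaw_byte_to_pcm16(b) for b in range(256)]


def decode_pcmu(payload: bytes) -> bytes:
    """Convert a PCMU RTP payload to 16-bit little-endian linear PCM."""
    samples = [_ULAW_TABLE[b] for b in payload]
    return struct.pack("<%dh" % len(samples), *samples)


if __name__ == "__main__":
    # A 20 ms PCMU frame at 8 kHz is 160 bytes and decodes to 320 bytes.
    frame = bytes([0xFF] * 160)   # 0xFF is mu-law silence
    print(len(decode_pcmu(frame)))
```

The lookup table keeps the per-packet cost trivial, which is consistent with G.711 contributing almost nothing to the latency budget.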
AI agents frequently need to detect DTMF for IVR-style interactions (press 1 for sales, etc.). Use RFC 2833 telephone-events (since superseded by RFC 4733; the payload type is negotiated in SDP and is 101 by convention), never in-band DTMF. The AI audio pipeline may process or filter audio in ways that distort DTMF tones. RFC 2833 events travel as dedicated RTP packets rather than as audio tones, so they pass through cleanly.
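A telephone-event payload is only four bytes, so decoding it on the agent side is straightforward. The sketch below follows the payload layout in RFC 4733 (event code, end bit, volume, duration); the `TelephoneEvent` type and function name are hypothetical, and it assumes you have already stripped the RTP header and matched the negotiated payload type.

```python
import struct
from typing import NamedTuple

# Event codes 0-15 map to the DTMF keys in this order (RFC 4733).
EVENT_TO_DIGIT = "0123456789*#ABCD"


class TelephoneEvent(NamedTuple):
    digit: str      # DTMF key that was pressed
    end: bool       # True on the final packet(s) of the event
    volume: int     # tone level, 0-63 (expressed in -dBm0)
    duration: int   # duration so far, in RTP timestamp units


def parse_telephone_event(payload: bytes) -> TelephoneEvent:
    """Decode the 4-byte telephone-event payload of an RTP packet."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(EVENT_TO_DIGIT):
        raise ValueError(f"not a DTMF event code: {event}")
    return TelephoneEvent(
        digit=EVENT_TO_DIGIT[event],
        end=bool(flags & 0x80),    # E bit
        volume=flags & 0x3F,       # low 6 bits; the R bit (0x40) is reserved
        duration=duration,
    )


if __name__ == "__main__":
    # Digit "1", end bit set, volume 10, duration 800 (100 ms at 8 kHz).
    print(parse_telephone_event(bytes([0x01, 0x8A, 0x03, 0x20])))
```

Senders typically repeat the end packet, so an agent should deduplicate on the RTP timestamp rather than treat every packet with the end bit as a new key press.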
Perceived latency in AI voice agents comes from four sources:
| Component | Typical Latency | Optimization |
|---|---|---|
| STT (end-of-utterance detection) | 100-500ms | Streaming STT, aggressive VAD |
| LLM inference (first token) | 200-800ms | Smaller models, streaming output |
| TTS (first audio chunk) | 50-200ms | Streaming TTS, sentence-level chunking |
| Network (RTP) | 10-50ms | Colocate AI with SIP infrastructure |
Key optimizations: Use streaming STT that detects end-of-utterance with VAD rather than waiting for a fixed silence timeout. Use streaming LLM output and start TTS on the first complete sentence, not after the full response. Colocate your AI inference with your SIP infrastructure to minimize RTP round-trip time.
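Sentence-level chunking can be sketched as a small generator that buffers streamed LLM tokens and releases each sentence as soon as it is complete, so TTS can start speaking while the model is still generating. The function name and the punctuation-based splitting rule are assumptions for illustration; a real deployment would need smarter handling of abbreviations such as "Dr." and of languages with different punctuation.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace. Requiring the
# whitespace avoids splitting inside numbers such as "3.5".
_SENTENCE_END = re.compile(r"([.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of text fragments."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[:match.end(1)].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever remains once the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    stream = ["Sure", ", I can", " help.", " What", " is your", " account number?"]
    for sentence in sentences_from_tokens(stream):
        print(sentence)  # each sentence would be handed to TTS here
```

With this approach the first audio can begin after the first sentence is generated instead of after the whole response, which is where most of the perceived latency saving comes from.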
Load testing: AI voice agents have unique concurrency characteristics. In traditional telephony each additional call adds a small, predictable media load, whereas every active AI call competes for shared LLM inference capacity. Test with realistic concurrent call loads to find the breaking point. Monitor GPU utilization and LLM queue depth, not just CPU and network.
Call quality monitoring: Capture RTCP reports from AI agent calls to estimate MOS from the measured packet loss, jitter, and round-trip time. STT accuracy often tracks audio quality: a drop in MOS tends to coincide with reduced transcription accuracy.
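As a rough illustration of turning RTCP-style metrics into a MOS estimate, the sketch below uses a widely circulated simplification for the R-factor, followed by the R-to-MOS conversion from the ITU-T G.107 E-model. The R-factor heuristic is an assumption and not the full E-model, and the function names are hypothetical; treat the output as a trend indicator rather than a calibrated score.

```python
def r_factor(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    """Approximate R-factor from one-way latency, jitter, and packet loss.

    Simplified heuristic (assumption): jitter is weighted double and a
    fixed 10 ms is added for codec delay, then loss is penalized linearly.
    """
    effective = latency_ms + 2 * jitter_ms + 10
    if effective < 160:
        r = 93.2 - effective / 40
    else:
        r = 93.2 - (effective - 120) / 10
    r -= 2.5 * loss_pct
    return max(0.0, min(100.0, r))


def mos_from_r(r: float) -> float:
    """Convert an R-factor to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6
    return max(1.0, mos)  # the polynomial dips slightly below 1 for tiny R


if __name__ == "__main__":
    clean = mos_from_r(r_factor(latency_ms=20, jitter_ms=5, loss_pct=0))
    lossy = mos_from_r(r_factor(latency_ms=20, jitter_ms=5, loss_pct=5))
    print(f"clean: {clean:.2f}, with 5% loss: {lossy:.2f}")
```

Logging this estimate next to per-call STT confidence makes it easier to tell whether a transcription problem originates in the audio path or in the model.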
SIP trace analysis: AI voice agent calls follow standard SIP dialogs. When calls fail or audio breaks down, the root cause appears in the SIP signaling and RTP streams — same as any VoIP call. Capture and analyze traces the same way you would for traditional telephony.
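When working from a packet capture, much of the RTP analysis starts with the fixed 12-byte RTP header. The sketch below parses that header per RFC 3550 and counts missing packets from gaps in the sequence numbers. The names are hypothetical, and the loss counter is deliberately simple: it ignores late or duplicate packets, so a reordered packet is still counted as lost.

```python
import struct
from typing import Iterable, NamedTuple


class RtpHeader(NamedTuple):
    version: int
    payload_type: int
    marker: bool
    sequence: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the fixed 12-byte RTP header (RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        payload_type=b1 & 0x7F,
        marker=bool(b1 & 0x80),
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


def count_lost(sequences: Iterable[int]) -> int:
    """Count missing packets from gaps in 16-bit sequence numbers."""
    lost = 0
    prev = None
    for seq in sequences:
        if prev is None:
            prev = seq
            continue
        gap = (seq - prev) & 0xFFFF   # handles wraparound at 65535 -> 0
        if gap == 0 or gap >= 0x8000:
            continue                  # duplicate or late packet: ignore
        lost += gap - 1
        prev = seq
    return lost


if __name__ == "__main__":
    raw = bytes([0x80, 0x00, 0x00, 0x01, 0, 0, 0, 160, 0xDE, 0xAD, 0xBE, 0xEF])
    print(parse_rtp_header(raw))
    print(count_lost([65534, 65535, 0, 3]))
```

A long run of missing sequence numbers, or a stream that simply stops while the SIP dialog is still up, is the usual signature of the RTP timeouts that drop AI agent calls.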
AI voice agents connect to SIP infrastructure either as SIP endpoints (registering to a PBX like a softphone) or as SIP trunks/B2BUAs receiving calls directly. The agent receives RTP audio, processes it through STT, LLM inference, and TTS, then sends synthesized audio back via RTP. G.711 is the recommended codec for lowest latency.
Use G.711 (PCMU/PCMA) for AI voice agent SIP integration. G.711 is simple companded PCM, so it adds negligible encoding/decoding delay to your latency budget; STT systems are typically trained on 8 kHz audio, which matches G.711; and no transcoding is needed. For higher STT accuracy, use G.722 (16 kHz wideband) if your SIP trunk supports it.
Reduce AI voice agent latency by using streaming STT with VAD-based end-of-utterance detection, starting TTS before the full LLM response is generated (stream at sentence level), colocating AI inference infrastructure with your SIP servers, and using G.711 to eliminate codec transcoding delay. Target total latency under 1.5 seconds for natural conversation.
Paste your SIP trace into SIPSymposium. The analyzer checks RTP timing for AI agent calls, identifies codec mismatches, detects call drops from RTP timeout, and verifies DTMF signaling.