Why Navan Should Use LiveKit

A comprehensive proposal covering the LiveKit open-source framework, recommended architecture, and the advantages of LiveKit Cloud for hosting your voice AI infrastructure

Current State

Navan Voice Agent Current Architecture

Twilio SIP Trunk + Azure-hosted Voice Agent with OpenAI LLM

[Architecture diagram] End User (PSTN / Mobile) → Twilio SIP Trunk → WebSockets → Voice Agent on Azure (Navan Agent, Azure Container Apps) → WebSockets → OpenAI (GPT-4 / Realtime API)
1. Inbound Call: User dials in via PSTN; Twilio receives the call and converts it to SIP

2. SIP Trunking: The Twilio SIP Trunk bridges audio to Azure over WebSockets

3. Agent Processing: The Voice Agent on Azure handles STT, TTS, and conversation flow

4. LLM Response: OpenAI generates responses, which are streamed back to the user

Disadvantages of This Architecture

1. Limited to OpenAI Realtime API Capabilities

Every new feature request becomes an architecture crisis

Building on an Incomplete and Unstable Foundation

OpenAI's Realtime API handles only real-time voice exchange; everything else requires workarounds

PM wants to add screen sharing to support calls? Can't do it.

Product wants an AI avatar for branding? Impossible.

Customer success needs conversation recording with custom metadata? Not supported.

Sales wants multi-agent handoffs for specialized expertise? Rewrite everything.

You're constantly telling stakeholders 'no' because the API only supports voice-in, voice-out. Each workaround means duct-taping external services together, creating technical debt and maintenance nightmares.

Each Workaround Creates More Problems
  • Separate recording infrastructure (Recall, MeetKay, custom builds)
  • Bolt-on analytics that can't see inside the conversation flow
  • Hacked interruption handling that feels janky to users
  • Zero ability to inject visual context or tools mid-conversation
  • Manual session management code that's fragile and hard to test
2. Building Your Own Framework for Voice Agents vs. Leveraging Open Source

Without an open-source framework, you receive no ongoing updates to voice features: better turn detection, improved barge-in handling, or support for new models and plugins

Building Voice AI Without an Open-Source Framework

Every feature you need, you must build and maintain yourself

  • DIY: STT → LLM → TTS Pipeline
  • DIY: Turn Detection
  • DIY: Barge-in Handling
  • DIY: LLM Orchestration
  • DIY: Multi-modal (Voice/Video/Text)
  • DIY: Tool Use & Function Calling
  • DIY: Multi-agent Handoff
  • DIY: Provider Integrations
The Result
  • Months of engineering time to build core features
  • No community contributions or shared improvements
  • Every new AI model requires custom integration
  • Security patches are your responsibility