When we started building Enhanced Rentals, we assumed the chat bot was the hard part. A guest sends a message on Airbnb, a webhook arrives, you call OpenAI, you send a reply. Straightforward enough.
The voice hotline was a different problem entirely. A guest calls a real phone number at 2am. Within two seconds they need to hear a voice, feel like somebody is listening, and get an answer to “what's the door code?” — using data that lives in a Property Management System on the other side of an API — before they give up and leave a one-star review.
This is a full engineering walkthrough of how we built that system. Not a sanitised tutorial, but what we actually shipped: the architecture decisions, the latency problems, and the moments we had to throw out our first approach and start over.
What you need to build this
Before picking any specific technology, it helps to understand the distinct problems you need to solve. A production AI voice hotline has eight independent concerns — each with its own failure modes and provider tradeoffs.
Phone number provisioning and PSTN termination
The problem: You need a real phone number that guests can call from any phone, anywhere. This means a carrier relationship — someone who holds numbers in each country, handles regulatory compliance per market, and provides a SIP trunk to bridge the call into your infrastructure.
Our choice: Twilio handles number purchasing, porting, regulatory bundles per country, and SIP. The tradeoff is cost — Twilio is not the cheapest carrier. The benefit is a single API for PL, AE, US, DE without separate carrier agreements in each market.
Real-time voice session management
The problem: Once the call is bridged into your system, you need a layer that handles WebRTC/SIP participants, audio routing between the guest and your AI agent, voice activity detection (knowing when someone has stopped speaking), and the ability to add a human to an ongoing call without dropping it.
Our choice: LiveKit is an open-source WebRTC server built for exactly this. It handles multi-participant rooms, VAD, agent dispatch, and browser-based joining. The alternative (building on raw WebRTC) would have taken months and given us less reliability.
Speech-to-text (STT)
The problem: The agent needs to convert guest audio to text in real time, with low latency and good accuracy across accents, languages, and background noise (guests often call from noisy environments — airports, lobbies, streets).
Our choice: LiveKit's STT pipeline integrates with multiple providers. Configuration per tenant — language hints, vocabulary biases — was critical for non-English deployments, particularly Polish, where standard models performed poorly on proper nouns and addresses.
Language model inference
The problem: You need a model that can reason about property-specific context, handle multi-turn conversation state, classify intent reliably, and do it all fast enough that the guest isn't left waiting. The same model is used for both voice (latency-critical) and chat (quality-critical).
Our choice: OpenAI's API. Consistent output format for structured classification, good instruction following for template-based context injection, and widely supported by tooling. For voice we use a smaller, faster model; for chat we can afford slightly more latency in exchange for quality.
Text-to-speech (TTS)
The problem: AI-generated text needs to be converted to audio and streamed back to the guest in real time. The voice needs to sound natural, support streaming (not wait for a full sentence before starting playback), and handle the operator's language correctly.
Our choice: We evaluated several providers. Key requirements: streaming support, Polish language quality, latency under 200ms to first audio byte. Most providers failed on at least one of these. We also added a normalisation preprocessing step for diacritics and proper nouns before synthesis.
Property Management System (PMS) integration
The problem: The AI is only useful if it knows who the guest is, what property they're staying at, when check-in is, and what the door code is. This data lives in the PMS — not in your system. Every answer the AI gives about a specific guest's stay depends on a live query to the PMS.
Our choice: Hostaway. It's the leading PMS for short-term rental operators who use channel managers (Airbnb, Booking.com, Vrbo). It exposes a REST API for listings and reservations, and fires webhooks for inbound guest messages and reservation events — exactly what the pipeline needs.
Data synchronisation layer — PMS mirroring with background workers
The problem: You cannot call Hostaway directly on every AI inference. A single guest message triggers STT, intent classification, context assembly, and LLM inference — all within a latency budget of under two seconds. A live Hostaway API call adds 300–800ms and introduces an external failure point. If Hostaway is slow or rate-limiting, your AI goes silent.
Our choice: We mirror all PMS data into our own database: listings, reservations, custom fields, conversation history. Background workers run continuously via a message queue — syncing new reservations, processing webhook events, updating listing data on a schedule. Real-time webhooks from Hostaway are enqueued immediately and processed async. The AI always reads from local data, never from an external API mid-call.
Observability — making a distributed pipeline debuggable
The problem: A voice call touches six services before the guest hears a response: Twilio, your webhook handler, the PMS sync layer, bot selection, LLM inference, and TTS. When something goes wrong — wrong answer, unexpected silence, missed escalation — you need to know exactly where and why. Without instrumentation, debugging is guesswork.
Our choice: Every step in the pipeline writes structured logs with a shared trace ID that follows the request from the inbound webhook through to the final reply. Transcripts store not just the conversation but the action code classification, which bot was selected, which template fired, and which reservation context was injected — so you can replay the exact state the AI saw when it gave a specific answer. Every background worker run records status, duration, and error payload. A failed sync is immediately visible, not silently dropped.
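As a concrete illustration, here is a minimal sketch of the shared-trace-ID pattern: every pipeline step logs through the same trace object, so one query on the trace ID reconstructs the whole request. The `createTrace` helper and field names are illustrative, not our actual logging API.

```typescript
import { randomUUID } from "node:crypto";

// Illustrative structured-log entry; field names are not our real schema.
interface LogEntry {
  traceId: string;
  step: string;
  [field: string]: unknown;
}

// Every pipeline step logs through the same trace object, so one query on
// the trace ID reconstructs the whole request end to end.
function createTrace(sink: (entry: LogEntry) => void, traceId: string = randomUUID()) {
  return {
    traceId,
    log(step: string, fields: Record<string, unknown> = {}) {
      sink({ traceId, step, ts: Date.now(), ...fields });
    },
  };
}
```

In practice the sink is a structured-log transport rather than an in-memory array, but the shape of the idea is the same: the trace ID is created once at the inbound webhook and threaded through every subsequent step.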
On top of these eight components you'll need an operator configuration layer, a multi-tenant data model, an escalation pipeline, and an admin interface. The rest of this article covers how those fit together.
The stack
| Layer | Technology | Role |
|---|---|---|
| Phone | Twilio | Number provisioning, SIP termination (PL, AE, US, DE) |
| Voice infra | LiveKit | Real-time call sessions, agent dispatch, WebRTC |
| LLM | OpenAI | Chat reasoning and voice intent classification |
| PMS | Hostaway | Reservations, listings, door codes, messaging webhooks |
| Backend | NestJS (microservices) | API, automation, and background worker services |
| Voice agent | Node.js service | LiveKit agent, STT/TTS pipeline, transcript storage |
| Database | PostgreSQL | Multi-tenant, one schema per operator |
1. Getting a phone number to ring something
The entry point is a Twilio phone number. When a guest calls, Twilio executes a TwiML webhook — an XML response that tells Twilio what to do with the call.
For a simple IVR you'd return a <Say> verb and be done with it. For a live AI agent, you return a <Connect> verb pointing at a LiveKit SIP endpoint. Twilio bridges the PSTN call into a LiveKit room.
```xml
<!-- TwiML response for an AI agent call -->
<Response>
  <Connect>
    <Stream url="wss://your-livekit-sip.example.com/sip" />
  </Connect>
</Response>
```

The TwiML webhook is served by our automation service. When the request arrives, we know the called Twilio number — and from that we look up which hotline and tenant it belongs to, which tells us which AI agent configuration to load.
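A minimal sketch of that lookup, assuming a simple table keyed by the called number (the `HotlineConfig` shape, number, and SIP URL below are illustrative):

```typescript
// Illustrative lookup table; in production this is a database query.
interface HotlineConfig {
  tenantId: string;
  hotlineId: string;
  sipEndpoint: string; // LiveKit SIP ingress for this tenant's hotline
}

const hotlinesByNumber = new Map<string, HotlineConfig>([
  [
    "+48221234567",
    { tenantId: "warsaw-ops", hotlineId: "pl-main", sipEndpoint: "wss://sip.example.com/warsaw" },
  ],
]);

// Build the TwiML that bridges the PSTN call into the tenant's LiveKit room.
function twimlFor(calledNumber: string): string {
  const hotline = hotlinesByNumber.get(calledNumber);
  if (!hotline) {
    // Unknown number: fail politely instead of leaving dead air.
    return "<Response><Say>This number is not in service.</Say></Response>";
  }
  return [
    "<Response>",
    "  <Connect>",
    `    <Stream url="${hotline.sipEndpoint}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```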
2. LiveKit — rooms, agents, and dispatch
LiveKit is the real-time layer. Every inbound call creates a LiveKit room. The Twilio SIP bridge joins as a participant. Our voice agent service joins the same room as another participant and starts talking.
Agent dispatch works via LiveKit's agent dispatch API. When a room is created, LiveKit fires a webhook to our voice agent service, which instantiates a session and connects it to the room. The agent session:
- Listens to the guest audio stream via LiveKit's STT pipeline
- Maintains conversation state — transcript, context, and reservation data
- Generates responses via OpenAI
- Synthesises audio via a TTS provider and streams it back into the room
Voice call flow:

1. Guest calls the phone number; Twilio receives the PSTN call.
2. The TwiML webhook fires to the automation service.
3. A LiveKit room is created; Twilio joins as a participant.
4. The caller is looked up in Hostaway; reservation and listing context is loaded.
5. The voice agent joins the room; the session is initialised with reservation context.
6. The guest speaks; the STT pipeline transcribes it, storing a per-turn transcript.
7. OpenAI inference: the routing bot classifies intent into an action code.
8. TTS synthesis: audio is streamed back to the guest.
9. On escalation: a Slack alert fires, the agent goes silent, and a human joins via browser with the full transcript visible, no phone needed.
The latency problem
The single hardest engineering problem in voice AI is end-to-end latency. The pipeline is:
```
Guest finishes speaking
 → STT (speech-to-text)   ~300–600ms
 → OpenAI inference       ~400–800ms
 → TTS (text-to-speech)   ~200–400ms
 → Audio plays to guest
─────────────────────────────────────
Total per turn: ~1.0–1.8s
```

Two seconds of silence after you finish speaking does not feel like an intelligent AI. It feels broken. Our early prototype had exactly this problem.
What actually helped:
- Streaming TTS — begin synthesising and playing audio before the full LLM response is complete. As soon as the first sentence token arrives, start generating audio for it.
- Interrupt handling — if the guest speaks while the agent is talking, stop immediately. LiveKit's VAD (voice activity detection) handles the detection; we wired it to cancel the outgoing audio stream.
- Short voice prompts — voice inference is not the place for 2,000-token context dumps. We trimmed to under 400 tokens per session: just the reservation data, property rules, and escalation instruction.
- Smaller models for voice — inference latency scales with model size. For structured Q&A about a property, a smaller, faster model gives acceptable quality at meaningfully lower latency than a flagship model.
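To make the streaming-TTS point concrete, here is a simplified sketch of sentence-level chunking: LLM tokens are buffered and flushed to synthesis at each sentence boundary instead of waiting for the full response. `synthesize` stands in for the real TTS call; the boundary detection is deliberately naive.

```typescript
// Buffer LLM tokens and flush complete sentences to TTS immediately.
// `synthesize` is a stand-in for the real streaming TTS call.
function createSentenceStreamer(synthesize: (sentence: string) => void) {
  let buffer = "";
  return {
    // Called for every token chunk arriving from the LLM stream.
    push(token: string) {
      buffer += token;
      let match: RegExpMatchArray | null;
      // Flush each complete sentence: text up to ., ! or ? plus trailing space.
      while ((match = buffer.match(/^(.*?[.!?])(\s+|$)/s))) {
        synthesize(match[1].trim());
        buffer = buffer.slice(match[0].length);
      }
    },
    // Called when the LLM stream ends: flush any remaining partial sentence.
    flush() {
      if (buffer.trim()) synthesize(buffer.trim());
      buffer = "";
    },
  };
}
```

With this in place, the guest hears the first sentence while the rest of the reply is still being generated.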
3. Injecting PMS context into every call
An AI that answers “what's the door code?” with “I don't have access to that information” is useless. The whole point is that it has the information — because it's connected to Hostaway.
When a call comes in, we know the called Twilio number, which tells us the hotline and tenant. But we don't automatically know which guest is calling. Our lookup chain:
- Look up the caller's number against active reservations in Hostaway.
- If matched, load the full reservation and listing data from the tenant's PMS.
- If no match, the agent introduces itself and asks for the guest's name to find their booking.
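A sketch of the phone-number match in the first step, with a simplified reservation shape standing in for the mirrored Hostaway data:

```typescript
// Simplified reservation shape; real data comes from the mirrored Hostaway tables.
interface Reservation {
  id: number;
  guestPhone: string;
  status: string;        // e.g. "new", "modified", "cancelled"
  departureDate: string; // ISO date
}

// Strip everything but digits so "+48 601-234-567" and "48601234567" compare equal.
function normalizePhone(raw: string): string {
  return raw.replace(/\D/g, "");
}

function findReservationByCaller(
  callerNumber: string,
  reservations: Reservation[],
  now: Date,
): Reservation | undefined {
  const caller = normalizePhone(callerNumber);
  return reservations.find(
    (r) =>
      normalizePhone(r.guestPhone) === caller &&
      r.status !== "cancelled" &&
      new Date(r.departureDate) >= now, // only current or upcoming stays
  );
}
```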
The reservation context injected into every agent session:
```js
{
  guest_name: "Jan Kowalski",
  arrival_date: "2026-04-14",
  departure_date: "2026-04-17",
  listing_name: "Apartament Mokotów",
  checkInTimeStart: "15:00",
  checkOutTime: "11:00",
  door_code: "4821", // time-gated — see canAccess below
  wifi_password: "GuestWifi2026",
  address: "ul. Puławska 42, Warsaw",
  houseRules: "No smoking. Quiet hours after 22:00.",
  specialInstruction: "Ring bell 2B on arrival."
}
```

Critically, some data is time-gated. A guest calling three weeks before check-in should not receive the door code. We compute a canAccess boolean based on reservation status and time-to-arrival. The agent prompt conditionally exposes access instructions only when appropriate — the same flag used across both the voice and chat pipelines.
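The time gate itself is simple to sketch. The 24-hour window and status values below are illustrative; the real rule is computed from reservation status and time-to-arrival as described above:

```typescript
// Illustrative: door codes unlock 24h before check-in and lock again at check-out.
interface StayInfo {
  status: string;        // e.g. "new", "modified", "cancelled"
  arrivalDate: string;   // ISO datetime of check-in
  departureDate: string; // ISO datetime of check-out
}

const ACCESS_WINDOW_HOURS = 24;

function canAccess(stay: StayInfo, now: Date): boolean {
  if (stay.status === "cancelled") return false;
  const arrival = new Date(stay.arrivalDate).getTime();
  const departure = new Date(stay.departureDate).getTime();
  const windowOpens = arrival - ACCESS_WINDOW_HOURS * 60 * 60 * 1000;
  // True from 24 hours before check-in until check-out.
  return now.getTime() >= windowOpens && now.getTime() <= departure;
}
```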
4. The bot selection layer
The platform supports multiple operators, multiple properties, and potentially different AI behaviour per property. A Dubai luxury villa answers questions differently from a Warsaw city apartment. We use a two-bot model:
Routing Bot
Classifies the guest message into a structured action code. Returns JSON: { actionCode, score, reasoning }
Informational Bot
Generates a freeform contextual reply when the action requires dynamic reasoning rather than a fixed template.
Example action codes the routing bot can emit: ACCESS_INSTRUCTIONS WIFI_QUESTION PARKING_QUESTION ESCALATE NO_MESSAGE
Bots are matched to reservations using sift — a MongoDB-style query engine evaluated against the live reservation object. An operator can scope a bot to a specific property, language, or reservation type:
```js
// Only handle reservations at listing 1234 whose status is "modified"
{ "listingMapId": { "$in": [1234] }, "status": "modified" }

// Only engage for guests staying more than 7 nights
{ "numberOfNights": { "$gte": 7 } }
```

5. Escalation handling
The AI cannot handle everything. A burst pipe, a safety concern, a payment dispute — these need a human. The system needs to recognise when to stop trying.
When the routing bot emits ESCALATE:
- A Slack notification fires immediately to the operator's escalation channel — guest name, property, message content, and a direct link to the support case.
- The AI stops generating replies.
- `handling_mode = 'Human'` is set on the active case.
- The bot stays silent until the case is resolved — even if the guest sends follow-up messages.
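In code, the ESCALATE branch reduces to two effects: notify and silence. This sketch uses an injected notifier rather than the real Slack client, and the case and event shapes are illustrative:

```typescript
// Illustrative shapes; the real notifier posts to a per-operator Slack
// webhook with a direct link to the support case.
interface SupportCase {
  id: string;
  handlingMode: "AI" | "Human";
}

interface EscalationEvent {
  guestName: string;
  property: string;
  message: string;
  caseUrl: string;
}

function escalate(
  supportCase: SupportCase,
  event: EscalationEvent,
  notifySlack: (text: string) => void,
): SupportCase {
  // 1. Alert the operator's escalation channel with enough context to act.
  notifySlack(
    `Escalation: ${event.guestName} at ${event.property}\n` +
      `"${event.message}"\n${event.caseUrl}`,
  );
  // 2. Silence the bot: once handlingMode is "Human", no further AI replies
  //    are generated for this case until it is resolved.
  return { ...supportCase, handlingMode: "Human" };
}
```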
For voice calls, we added browser call joining. When an operator clicks Join Call in the admin panel, they are connected directly into the LiveKit room via WebRTC — no phone number needed. The full live transcript is visible on screen.
6. Transcripts
Every call produces a transcript, stored per turn:
```js
[
  { speaker: "guest", text: "Hi, what time can I check in?", ts: 1744123400 },
  { speaker: "agent", text: "Check-in at Apartament Mokotów is from 3pm.", ts: 1744123403 },
  { speaker: "guest", text: "And the door code?", ts: 1744123410 },
  { speaker: "agent", text: "I can share that closer to your arrival.", ts: 1744123413 }
]
```

Transcripts serve two purposes: operator review after an escalation, and debugging when the AI gives a wrong answer. Retained for 90 days.
Building reliable low-latency transcription required significant tuning of LiveKit's STT pipeline. Default settings produced real-time word-level transcripts but hallucinated badly on proper nouns — guest names, property names, Polish street addresses. We added vocabulary hints and per-tenant language settings to address this.
7. The webhook pipeline for chat
The voice hotline gets the glamour, but chat handles the majority of guest interactions. The architecture has the same PMS context injection problem, handled differently.
When a guest messages on Airbnb, Booking.com, or WhatsApp, Hostaway fires a webhook to our automation service:
```
POST /webhook/hostaway

{
  "event": "message.received",
  "data": {
    "reservationId": 51923847,
    "listingMapId": 1042,
    "body": "Hi, what time is check-in tomorrow?"
  }
}
```

We don't reply immediately. Guest messages often arrive in bursts. Replying to each one individually looks robotic and wastes API calls. We use a debounce queue keyed by reservation ID:
| Reservation type | Debounce window | Reason |
|---|---|---|
| Inquiry (browsing) | 5s | Responsiveness matters — they're comparing options |
| Confirmed (staying) | 120s | 2-minute delay is acceptable; reduces noise |
After the debounce window, we re-fetch the full conversation thread fresh from Hostaway, run bot selection, build the system prompt with live reservation context, call OpenAI, and send the reply via the Hostaway messaging API.
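The debounce itself can be sketched in a few lines. Each new message resets the reservation's timer; when the timer finally fires, the handler processes the whole thread, which is where the re-fetch from Hostaway happens. The `createDebouncer` helper below is illustrative, not our actual queue implementation:

```typescript
// Per-reservation debounce: each new message resets the timer; when it
// fires, the handler processes the whole thread, not the triggering message.
function createDebouncer(processThread: (reservationId: number) => void) {
  const timers = new Map<number, ReturnType<typeof setTimeout>>();
  return {
    onMessage(reservationId: number, windowMs: number) {
      // A burst of messages keeps pushing the window forward,
      // so the bot replies once, to the final state of the thread.
      const existing = timers.get(reservationId);
      if (existing) clearTimeout(existing);
      timers.set(
        reservationId,
        setTimeout(() => {
          timers.delete(reservationId);
          processThread(reservationId); // re-fetch thread + bot selection happen here
        }, windowMs),
      );
    },
  };
}
```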
Chat message flow:

1. Guest sends a message on Airbnb, Booking.com, or WhatsApp.
2. Hostaway fires a webhook (`event: message.received`).
3. The debounce queue holds it: 5s for an inquiry, 120s for a confirmed reservation.
4. The full conversation is re-fetched fresh from Hostaway, which catches a trailing "never mind".
5. Bot selection: sift conditions are matched against the reservation.
6. Reservation and listing context is loaded: live data from the Hostaway PMS.
7. OpenAI inference: the routing bot produces an action code.
8. The reply is sent to the guest via the Hostaway messaging API.
9. On escalation: a Slack alert fires, the bot is silenced, and a human handles the conversation via Hostaway.
8. What adjusting the knowledge base actually looks like
Operators configure bot knowledge via a system prompt template that combines EJS conditionals with Mustache variable injection. A real example from one of our beta operators:
```
You are {{botName}}, guest assistant for {{listing.name}}.
Check-in: {{listing.checkInTimeStart}}.
Check-out: {{listing.checkOutTime}}.
<% if (canAccess) { %>
Door code: {{listing.door_code}}
WiFi: {{listing.wifi_password}}
<% } else { %>
Access details will be shared 24 hours before arrival.
<% } %>
<% if (listing.petsAllowed) { %>
Pets are welcome. Please clean up after them.
<% } %>
House rules: {{listing.houseRules}}
```

This template is rendered fresh on every message using live data from Hostaway. No deployments, no code changes — the operator edits it in the admin panel.
Early versions used static system prompts. The problem was stale data: a guest asking about check-in two weeks before arrival would get the right answer. A guest asking on arrival day, after a late check-out extension pushed check-in back two hours, would get the wrong one. Dynamic rendering per message fixed this.
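To show the two-pass idea without pulling in the real template engines, here is a dependency-free stand-in: pass one resolves conditionals, pass two injects `{{dotted.path}}` variables. The conditional syntax is deliberately simplified relative to actual EJS, and `renderPrompt` is a hypothetical helper, not our production renderer:

```typescript
type Context = Record<string, unknown>;

// Pass 1: keep or drop conditional blocks. Simplified stand-in syntax:
// <% if (flag) %>…<% else %>…<% endif %>
function resolveConditionals(template: string, ctx: Context): string {
  return template.replace(
    /<% if \((\w+)\) %>([\s\S]*?)(?:<% else %>([\s\S]*?))?<% endif %>/g,
    (_m, flag: string, thenPart: string, elsePart: string = "") =>
      ctx[flag] ? thenPart : elsePart,
  );
}

// Pass 2: substitute {{dotted.path}} variables from the context, Mustache-style.
function injectVariables(template: string, ctx: Context): string {
  return template.replace(/\{\{([\w.]+)\}\}/g, (_m, path: string) => {
    const value = path.split(".").reduce<any>((obj, key) => obj?.[key], ctx);
    return value == null ? "" : String(value);
  });
}

function renderPrompt(template: string, ctx: Context): string {
  return injectVariables(resolveConditionals(template, ctx), ctx);
}
```

Running conditionals before variable injection matters: it means gated values like the door code never even enter the substitution pass when `canAccess` is false.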
Building a no-code editor for the conditionals
The raw template syntax works well for developers. It does not work for a support manager who needs to update house rules on a Saturday afternoon without filing a ticket.
We built a visual instruction editor on top of the template engine. Instead of writing <% if (canAccess) { %>, the operator sees a block labelled “Show only when guest can access the property” with a toggle. Each conditional maps to a named rule in the UI — check-in window, reservation status, pet policy, early check-in upsell accepted — and the editor assembles the underlying template automatically.
Plain text sections (house rules, special instructions, welcome messages) are just free-text fields. The result is a knowledge base that non-technical support staff can update in minutes, while the underlying template engine retains full flexibility for operators who want to go deeper.
Separating the webhook API from the platform API
Early in development, Hostaway webhooks and operator-facing API requests ran through the same service. This created two problems.
First, availability requirements are different. Webhooks from Hostaway must be acknowledged within seconds or Hostaway retries — and eventually stops sending. An operator browsing their settings page has a much higher tolerance for a slow response. Mixing them meant a traffic spike on the operator API could delay webhook processing.
Second, security models are different. Webhook endpoints are authenticated by a shared secret from Hostaway. Operator API endpoints use session tokens from Cognito. Running them through the same auth middleware required awkward carve-outs and made the security model harder to reason about.
The fix was to split them into separate services with separate deployments. The webhook service is stateless, scales horizontally, and has no dependency on the operator session layer. The platform API handles everything operator-facing. They share database access but nothing else.
What we got wrong
Hardcoding bot IDs
Early on, some action code routing referenced specific bots by database ID. One tenant, no problem. The moment we added a second operator, it broke silently — their escalation routing was pointing at the first tenant's bot. Dynamic lookup by action code field, not ID, is the only correct approach in a multi-tenant system.
Not interrupting the agent when a human joins
When an operator clicked Join Call, the AI was often mid-sentence. We had to add interrupt logic that stops the agent the moment a human participant joins the LiveKit room — otherwise the operator joins a call already in progress and can't get a word in.
Overlong voice prompts
Our first voice agent prompt was 1,800 tokens. Fine for chat. For voice, it added ~300ms to every inference call — about 20% of total perceived latency. We stripped it to the essential context (reservation data, property rules, escalation instruction) and kept it under 400 tokens.
Debounce without re-fetch
The original debounce implementation processed the first message received, not the most recent. A guest who sent three messages — with the last being "never mind" — would still get a reply to the first. Re-fetching the full conversation thread after the debounce window is the correct approach.
Polish diacritics in TTS
Polish has characters like ą, ę, ó, ź. Standard TTS models rendered them as gibberish or skipped them entirely. We switched to a TTS provider with explicit Polish language support and added a normalisation step for common proper noun variations before synthesis.
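The normalisation step is essentially a per-tenant pronunciation dictionary applied before synthesis. The entries below are made-up examples, not real phonetic mappings:

```typescript
// Illustrative per-tenant pronunciation dictionary applied before synthesis.
const pronunciationMap: Record<string, string> = {
  "Mokotów": "Mokotuf", // hypothetical respelling the TTS voice reads correctly
  "ul.": "ulica",       // expand the Polish street abbreviation
};

function normalizeForTts(text: string): string {
  let out = text;
  for (const [from, to] of Object.entries(pronunciationMap)) {
    // Escape regex metacharacters in dictionary keys (e.g. the dot in "ul.").
    const escaped = from.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    out = out.replace(new RegExp(escaped, "g"), to);
  }
  return out;
}
```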
UX for humans — and the limits of AI
Once the AI pipeline was working reliably, we ran into a different class of problem. The engineering was solid. The user experience for the humans using it wasn't.
The agent experience
When a guest call escalates, a support agent gets a Slack notification and needs to act fast. In the original flow, they had to open the admin panel, find the support case, read the transcript, and then figure out how to respond — usually by switching to Hostaway in a separate tab.
That's too many steps under pressure. We redesigned the escalation view around what an agent actually needs in the first ten seconds:
- Guest name, property, and the exact message or call segment that triggered escalation — visible immediately, no scrolling
- Full conversation history on the same screen — no switching tabs to Hostaway
- For voice: a single Join Call button that drops them into the LiveKit room via browser — no dialling, no hold music
- For chat: a reply field that sends directly back through the Hostaway messaging channel
- Once a human replies or joins, the AI is automatically silenced — no accidental double-response from the bot
The design principle was: an agent who has never seen this guest before should be able to take over confidently within 15 seconds of clicking the Slack notification.
The realisation: not everything should be AI-handled
As we processed more real reservations, a pattern emerged that changed how we thought about the product entirely.
Some calls should never reach the AI in the first place.
The clearest example: Airbnb Support calls. When Airbnb's own support team calls an operator — about a dispute, a policy violation, a host guarantee claim — failing to answer carries real consequences. Airbnb tracks response rates and can penalise operators who miss these calls. An AI answering on behalf of the operator is worse than no answer at all.
This class of call needs guaranteed human delivery. Not escalation-after-AI-attempt. Immediate forwarding to a real phone, every time, with no AI in the path.
Critical call forwarding uses caller ID pattern matching. Operators configure rules: calls from numbers matching a known Airbnb support range, or from specific area codes, or at specific times of day, are routed directly to a configured phone number without ever entering the AI pipeline. The AI never picks up. The human phone rings immediately.
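A sketch of the rule match, assuming prefix-based caller-ID rules with an optional hour window (both the rule shape and the numbers are illustrative):

```typescript
// Illustrative rule shape: prefix match on caller ID plus an optional hour window.
interface ForwardingRule {
  callerPrefix: string; // e.g. a known support number range, in E.164 form
  forwardTo: string;    // the human phone that should ring
  activeHours?: { from: number; to: number }; // 24h clock, optional
}

// Evaluated before the AI pipeline: the first matching rule wins and the
// call is forwarded without the agent ever picking up.
function matchForwardingRule(
  callerNumber: string,
  rules: ForwardingRule[],
  hourOfDay: number,
): ForwardingRule | undefined {
  return rules.find((rule) => {
    if (!callerNumber.startsWith(rule.callerPrefix)) return false;
    if (rule.activeHours) {
      const { from, to } = rule.activeHours;
      if (hourOfDay < from || hourOfDay >= to) return false;
    }
    return true;
  });
}
```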
How this shaped the product lineup
These realisations — the need for a better agent UX and the hard limits of AI handling — are what drove the shape of our solution set. We didn't start by designing products. We started by building infrastructure, then observed what operators actually needed from it.
AI Hotline
The core pipeline. AI handles the majority of inbound calls using PMS context.
Escalation Handling
When AI cannot resolve a call or message, a human takes over — via browser, with full transcript, without picking up a phone.
Critical Call Forwarding
Calls that must reach a human every time — Airbnb Support, specific numbers, time-sensitive patterns — bypass AI entirely and ring a real phone.
Guest Communication
The same bot pipeline applied to chat channels (Airbnb, Booking.com, WhatsApp) with reservation-aware context and debounce logic.
Each solution grew out of a real operational constraint, not a product roadmap. That grounding — in 25,000 real reservations and two live operators — is why the product works the way it does.
The result
- **25,000+** reservations processed
- **<1.5s** end-to-end call latency
- **8–12%** escalation rate
The system handles calls and messages for STR operators across Poland and the UAE. End-to-end call latency (guest finishes speaking → agent starts responding) is consistently under 1.5 seconds on good network conditions. The escalation rate — messages and calls the AI cannot handle — sits at 8–12% depending on operator and property type. Everything else is resolved without human involvement.
The core insight: the AI itself is not the hard part. Getting the right context to the AI at the right moment — fresh reservation data, correct property details, time-gated access information — is where most of the engineering effort goes. The rest is latency optimisation and making graceful degradation feel intentional rather than broken.
