for most of the last year, “real-time ai” was a category in which the demos looked impressive and the production deployments looked rare. the stack you assembled for live use cases (a voice-activity detector, a streaming transcription model, a large language model, a text-to-speech voice) was four moving parts glued together with a websocket and a hopeful latency budget. it worked. it didn’t scale. it didn’t feel native.
that changed quickly in early may. openai shipped three dedicated streaming audio models across the realtime API surface. thinking machines lab, in the same week, published a position paper arguing that real-time interaction should be a property of the model itself rather than a wrapper layer around it. the first event tells you what you can build today. the second tells you where the field is going.
i’ve spent the past couple of weeks building reference implementations on top of the new stack: two distinct patterns, both bidirectional, both for live conversational settings where translation and transcription have to happen inside the conversation, not after it. this is a write-up of what the technology actually does, the architectural patterns that hold up under load, the engineering decisions that aren’t in any docs yet, and the use cases that are now in reach for any team willing to spend two weeks on a serious prototype.
what just shipped on the realtime stack
three sibling models now live across the realtime API surface, reachable through the transports that matter for live systems: webrtc, websocket, and sip.
gpt-realtime-2 is the voice-agent model with reasoning baked in; openai prices it by audio, text, and image tokens. gpt-realtime-translate is the translation model and is priced by audio duration. gpt-realtime-whisper is designed for low-latency transcript deltas from live audio. (openai realtime conversations, gpt-realtime-2, gpt-realtime-translate, gpt-realtime-whisper, pricing)
a few details turn out to matter more than they look:
- translation and transcription sessions are continuous streams, not request/response. you open the session, you push audio in, you read deltas out. no assistant turn, no response object, no conversation state. that simplifies a lot of failure modes, but it also means there is no clean “the user is done speaking” event; the silence has to tell you.
- webrtc is the recommended transport for browser clients, with ephemeral client secrets minted server-side. the raw api key never crosses into the browser. for server-to-server pipelines (telephony bridges, media workers) websocket remains the right tool.
- one translation session per target language is a hard architectural rule. for a two-person bidirectional flow that means two sessions minimum; for a three-language room it means three. this cascades into cost, lifecycle, and observability decisions you’ll make on day one.
- regulated deployments need a separate privacy and contracting pass. do not treat the realtime transport choice as the compliance answer. check data residency, retention, BAA coverage, audit logging, and whether your workflow requires a qualified human in the loop before you pilot in healthcare or legal settings.
- and the silent gotcha: gpt-realtime-translate does not support custom prompts or glossaries. medication names, legal terms of art, industry shorthand, brand names, product skus, units, and idioms are at the model’s training mercy. the only documented workaround is “test the terminology you care about directly.” plan around this from day zero.
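the one-session-per-target-language rule in the list above is easy to turn into arithmetic. here is a minimal python sketch, assuming a shared room where each participant speaks and listens in a single language. the function name and the input shape are mine for illustration, not part of any openai api.

```python
def translation_sessions_needed(listener_languages: list[str]) -> int:
    """Count translation sessions for one shared room.

    Assumes one session per distinct target language, and that every
    participant speaks the same language they listen in.
    `listener_languages` is a hypothetical input, e.g. ["en", "es"].
    """
    # Normalise so "ES" and "es" are not counted twice.
    targets = {lang.strip().lower() for lang in listener_languages}
    # A monolingual room needs no translation at all.
    return len(targets) if len(targets) > 1 else 0


if __name__ == "__main__":
    # Two-person bidirectional flow: english <-> spanish.
    print(translation_sessions_needed(["en", "es"]))
    # Three-language room with two english listeners.
    print(translation_sessions_needed(["en", "en", "es", "vi"]))
```

the count is a floor, not a forecast: it ignores reconnects and any per-listener sidecars you add in a track-separated design.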
two architectural patterns that hold up under load
the realtime models give you the raw substrate. the patterns that turn that substrate into a working product are not in the docs. two distinct patterns emerged from the prototypes i built; almost any live-conversation application maps cleanly onto one or the other.
pattern 1: single-environment shared input
the room has one microphone, capturing two or more speakers. think a meeting-room speakerphone, a kiosk, an exam-room laptop, a customer-service tablet. you don’t know which person is speaking until after the words come back from the transcription model, and you don’t know what language to translate into until you know who spoke.
flowchart LR
Mic([shared mic]) --> WS[streaming transcription]
Mic --> Buf[pcm ring buffer]
WS --> Deltas[transcript deltas]
Deltas --> Hush{silence hold}
Hush -->|yes| Commit[commit turn]
Commit --> Done[final transcript + item_id]
Done --> Route{translate?}
Route -->|same lang| Store[(persist source)]
Route -->|cross lang| Slice[slice ring buffer]
Slice --> Translate[translation session]
Translate --> Store
Translate --> Audio[translated audio]
Audio --> Swap[swap mic to silent during playback]
the moving parts:
- a continuous transcription session listens to the shared mic.
- a local pcm ring buffer captures the same audio in parallel. this lets you re-extract the exact source audio for a turn without trying to fuzzy-match text back to audio.
- when transcript deltas stop arriving for a configurable silence window (somewhere between 700 and 1200 ms in practice), the client commits the buffer and reads a final transcript with a stable item id.
- a routing layer decides whether translation is needed and in which direction. this is where speaker identity, language detection, and any business logic about who is allowed to be translated to whom belong.
- if translation is needed, a separate session opens, the precise audio slice is streamed in, and translated text and audio come back.
- while translated audio plays, the live mic input is replaced with silence. otherwise the model will transcribe its own output and the room becomes a feedback loop.
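the ring buffer in the list above is the piece that is easiest to get subtly wrong, so here is a rough python sketch of one way to do it. it assumes mono 16-bit pcm and assumes the client keeps its own stream-relative timestamps for when a turn started and ended. the class name, the 24 khz default, and the 60-second capacity are illustrative choices, not values from the docs.

```python
from collections import deque


class PcmRingBuffer:
    """Keep the most recent audio of a stream, addressable by time."""

    def __init__(self, sample_rate: int = 24_000, max_seconds: float = 60.0):
        self.sample_rate = sample_rate
        self.bytes_per_sample = 2  # 16-bit mono (assumption)
        self.max_bytes = int(max_seconds * sample_rate) * self.bytes_per_sample
        # Each entry is (byte offset since stream start, chunk data).
        self._chunks: deque[tuple[int, bytes]] = deque()
        self._total_written = 0
        self._size = 0

    def write(self, chunk: bytes) -> None:
        self._chunks.append((self._total_written, chunk))
        self._total_written += len(chunk)
        self._size += len(chunk)
        # Drop whole chunks from the front once capacity is exceeded.
        while self._size > self.max_bytes and len(self._chunks) > 1:
            _, old = self._chunks.popleft()
            self._size -= len(old)

    def slice_ms(self, start_ms: int, end_ms: int) -> bytes:
        """Return the audio between two stream-relative timestamps (ms).

        Audio that has already been evicted is silently omitted.
        """
        start = self._ms_to_byte(start_ms)
        end = self._ms_to_byte(end_ms)
        out = bytearray()
        for offset, data in self._chunks:
            lo = max(start, offset)
            hi = min(end, offset + len(data))
            if lo < hi:
                out += data[lo - offset : hi - offset]
        return bytes(out)

    def _ms_to_byte(self, ms: int) -> int:
        samples = ms * self.sample_rate // 1000
        return samples * self.bytes_per_sample
```

slicing by timestamp rather than by text is the point: the translation session receives exactly the audio the speaker produced, with no attempt to align transcript words back to samples.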
the trade-off is that you can’t begin translating until a turn has been transcribed and routed. you pay one extra round-trip in latency; in exchange, you never translate in the wrong direction.
pattern 2: track-separated multi-participant
the participants are in different physical or virtual rooms. each one has their own microphone and audio track. think a telehealth visit, a remote sales call, a multilingual standup, a board meeting on zoom or livekit. crucially, you know the speaker identity intrinsically because each track carries it.
sequenceDiagram
participant A as participant a
participant Room as live room
participant B as participant b
participant T as translation session
participant Store as persistence
A->>Room: publish a's mic
B->>Room: publish b's mic
B->>Room: subscribe a's mic
B->>T: webrtc with a's track attached
Note over Room,T: a speaks
Room-->>B: a's audio
B->>T: forward a's audio
T-->>B: translated audio + source/target text
B-->>B: duck a's original on playback
B-->>Store: persist final turn
Note over A,Store: symmetric flow runs in a's client
the loop:
- each participant publishes their own track and subscribes to the others.
- for each remote track, in each listener’s target language, exactly one translation sidecar opens. the remote MediaStreamTrack is attached directly to the openai peer connection. no mixing, no down-sampling, no copying.
- translation runs continuously while the sidecar is open. turn boundaries are inferred from a silence debounce on the transcript deltas, because translation sessions do not emit a clean “turn done” event.
- translated audio plays through a hidden audio sink. the original remote audio is ducked, not muted, while the translation plays. ducking is driven off the audio element’s playback events, not off the api’s data-channel signals, because the data channel races with the actual audio pipeline.
- settings (interpretation on/off, language pairs, ducking level) propagate between clients through the underlying media platform’s data channel.
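the silence debounce in the loop above can be sketched in a few lines of python. this is an illustration under assumptions: the class name, the 900 ms default, and the idea of polling from a timer are mine, and timestamps are passed in explicitly so the logic stays testable without a clock.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TurnDebouncer:
    """Infer turn boundaries from gaps between transcript deltas."""

    silence_ms: int = 900  # hypothetical default; tune per deployment
    _parts: list[str] = field(default_factory=list)
    _last_delta_ms: int | None = None

    def on_delta(self, text: str, now_ms: int) -> None:
        """Record a transcript delta and restart the silence timer."""
        self._parts.append(text)
        self._last_delta_ms = now_ms

    def poll(self, now_ms: int) -> str | None:
        """Call on a timer; returns the turn text once silence has held."""
        if self._last_delta_ms is None:
            return None  # nothing buffered
        if now_ms - self._last_delta_ms < self.silence_ms:
            return None  # speaker may still be mid-sentence
        turn = "".join(self._parts).strip()
        self._parts.clear()
        self._last_delta_ms = None
        return turn or None


if __name__ == "__main__":
    d = TurnDebouncer()
    d.on_delta("me duele ", now_ms=0)
    d.on_delta("el pecho", now_ms=400)
    print(d.poll(now_ms=1000))  # still inside the silence window
    print(d.poll(now_ms=1300))  # silence held, turn is committed
```

a shorter window feels snappier but splits turns on natural pauses; a longer one keeps sentences whole at the cost of a slower persisted transcript.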
this is the cleaner pattern of the two. it’s what the openai cookbook recommends, it’s what livekit, daily, and pion are all positioned for, and it’s what virtually every modern collaboration product can adopt today with relatively little plumbing.
five engineering lessons for anyone building on this stack
these are the five recommendations i would give any team starting on the new realtime stack today.
1. one sidecar per listener per source track per target language. the cookbook says this once, in passing. the implication is that every additional listener language costs you another concurrent webrtc session. a meeting with spanish, mandarin, and vietnamese listeners pays the per-minute translation cost three times over, not once. design your fan-out, your rate limits, and your billing model around this from the beginning.
2. keep sessions warm. do not open one per utterance. sdp negotiation takes 300 to 800 ms in practice. that is enough to swallow the first one or two words of a sentence (“the chest pain is…”). per-utterance sessions are the obvious-looking optimization that fails listeners every time. open the sidecar when interpretation turns on. tear it down when the conversation ends or the source track disappears. nothing in between.
3. duck audio on browser playback events, not on api signals. the audio element is the source of truth for the listener’s ears. the api’s data channel races with the actual playback pipeline. if you duck the original audio when the api says “translated audio is coming,” you will get a window of silence between the duck and the playback. if you duck when the audio element fires its playing event, you duck exactly when the listener hears something new.
4. mint ephemeral secrets, then validate context server-side at mint time. the token-minting endpoint must enforce that this particular caller has the right to translate audio for this particular session, room, and participant. otherwise you have built a token-minting service that anyone with a session cookie can use to translate arbitrary audio at your expense. this is the first thing to get right in production.
5. identity comes from track metadata, not from inference. participants in any live room carry identifiers. those identifiers should encode the role as a prefix. inferring role from “the count of remote participants” or “who joined first” fails the moment someone joins late or rejoins after a network drop. this is one of the silent ways a real-time application breaks for the user with the worst connection.
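lessons 4 and 5 meet at the token-minting endpoint, so here is a rough python sketch of the check that should run before any ephemeral secret is minted. everything named here is hypothetical: the `role:user_id` identity format, the role set, and the `room_members` lookup, which in a real system would come from your media platform rather than from the client. the actual call that mints the secret is deliberately left out.

```python
from dataclasses import dataclass

ROLES = {"clinician", "patient", "interpreter"}  # hypothetical role set


@dataclass(frozen=True)
class MintRequest:
    caller_identity: str  # e.g. "clinician:u_123" (hypothetical format)
    room_id: str
    source_identity: str  # whose track the caller wants translated


def parse_role(identity: str) -> str:
    """Read the role from the identity prefix; never infer it from join order."""
    role, sep, user_id = identity.partition(":")
    if not sep or role not in ROLES or not user_id:
        raise ValueError(f"identity has no valid role prefix: {identity!r}")
    return role


def may_mint(req: MintRequest, room_members: dict[str, set[str]]) -> bool:
    """Decide whether this caller may get a secret for this room and track.

    `room_members` maps room_id -> identities currently in that room.
    """
    members = room_members.get(req.room_id, set())
    try:
        parse_role(req.caller_identity)
        parse_role(req.source_identity)
    except ValueError:
        return False
    return (
        req.caller_identity in members
        and req.source_identity in members
        and req.caller_identity != req.source_identity
    )


if __name__ == "__main__":
    rooms = {"visit-42": {"clinician:u_1", "patient:u_2"}}
    ok = MintRequest("clinician:u_1", "visit-42", "patient:u_2")
    stranger = MintRequest("clinician:u_9", "visit-42", "patient:u_2")
    print(may_mint(ok, rooms), may_mint(stranger, rooms))
```

the check is intentionally boring. the value is in running it server-side at mint time, against membership data the caller cannot forge.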
what this stack actually unlocks
real-time translation and transcription have an obvious headline use case (live language access) and a longer tail of less-obvious ones.
healthcare and clinical communication is the use case i have spent the most time on, and the one where the value-per-deployment is highest. about 29.6 million people in the united states have limited english proficiency (2019-2023 acs five-year). federal law has required meaningful language access from any clinic touching federal money since lau v. nichols in 1974, and the 2024 hhs section 1557 final rule re-codifies it in more detail. interpreter services cost clinics $1.25 to $3.00 per minute over the phone and $1.95 to $3.49 per minute by video, with most encounters running $25 to $90 in pure interpreter cost. medicare does not reimburse them. the cost gap between human-interpreter and ai-assisted interpretation is large, real, and not yet sufficient on its own; accuracy and regulatory acceptance still matter most. but the gap is meaningful enough that pilots are happening.
beyond healthcare:
- contact centers and customer support. live multilingual queues without rostering native speakers in every supported language. agent-assist that listens to the call and surfaces account context, related cases, and suggested responses in the agent’s native language even when the conversation is in another.
- international meetings and cross-border calls. the m&a deal that previously needed a $300/hour simultaneous interpreter on a phone bridge can run a translation sidecar per listener for cents on the dollar. translated transcripts become searchable artifacts of every meeting.
- field operations. warehouses, manufacturing floors, and construction sites with multilingual workforces. real-time translation of safety briefings, equipment instructions, and shift transitions through inexpensive shared devices.
- education and training. lectures, certifications, and corporate training delivered live in multiple languages without dubbing. the same architecture supports captions for hearing-impaired participants on the same session.
- legal proceedings. depositions, witness interviews, client consultations. every word captured, attributed, and persisted as a durable artifact. the regulatory bar is high but the workflow gain is enormous for international litigation.
- live events and broadcast. speaker captions in the audience’s language without a human interpretation booth. mixed-language panels become a much shorter conversation to schedule.
the broader pattern is that any business setting where a conversation is the work product (sales, support, education, healthcare, legal) and where the conversation has historically vanished into someone’s notes is now a setting where the conversation can be a durable, multilingual, queryable surface from the moment it happens.
the real constraints
a few things to know before you commit to a quarterly roadmap.
accuracy is uneven across languages. machine translation accuracy in the latest comparative studies remains strong for spanish and chinese and drops materially for russian, vietnamese, and lower-resource languages (2025 discharge-instructions study). plan a tiered deployment: trust the model where it earns trust, and keep a human in the loop where it doesn’t.
no glossary, no system prompt for translation. medication names, legal terms of art, product names, internal acronyms are all at the model’s mercy. for high-stakes vocabularies, plan a post-translation correction layer or a structured-output companion model the user can review.
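one cheap form of that correction layer is a check that flags turns rather than rewriting them. the python sketch below assumes you have both the source transcript and the translated text for each turn, and a small hand-built glossary of terms that must be rendered a specific way. the glossary entries and the function name are illustrative only.

```python
import re

# Hypothetical glossary: source-language term -> rendering required in target.
GLOSSARY_EN_ES = {
    "metformin": "metformina",
    "blood thinner": "anticoagulante",
}


def missing_terms(source_text: str, translated_text: str,
                  glossary: dict[str, str]) -> list[str]:
    """Return source terms whose required rendering is absent from the translation."""
    src = source_text.lower()
    out = translated_text.lower()
    flagged = []
    for term, required in glossary.items():
        # Whole-word match so a term does not fire inside a longer word.
        in_source = re.search(rf"\b{re.escape(term)}\b", src) is not None
        if in_source and required.lower() not in out:
            flagged.append(term)
    return flagged


if __name__ == "__main__":
    print(missing_terms("stop the blood thinner tonight",
                        "deje el diluyente de sangre esta noche",
                        GLOSSARY_EN_ES))
```

flagged turns can be routed to a human reviewer or highlighted in the transcript. substring matching like this is crude (it ignores inflection and word order), so treat it as a tripwire, not a quality score.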
session caps and latency. realtime sessions are bounded at 60 minutes; long meetings need a graceful reconnect, which the cookbook does not cover. first-translation latency is measured from the listener’s perspective, not the speaker’s. expect 0.8 to 1.5 seconds end-to-end for translated audio in a healthy setup, longer if you stack post-processing.
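since the graceful reconnect is left to you, here is one way to decide when to start a replacement session, sketched in python. the 60-minute cap comes from the paragraph above; the soft and hard margins are made-up numbers, and `in_silence` is assumed to come from whatever turn detection you already run.

```python
def should_rollover(session_age_s: float, in_silence: bool,
                    cap_s: float = 60 * 60,
                    soft_margin_s: float = 180,
                    hard_margin_s: float = 30) -> bool:
    """Decide whether to start a replacement session now.

    Inside the soft margin we wait for a pause between turns; inside the
    hard margin we roll over regardless, to stay under the session cap.
    """
    remaining = cap_s - session_age_s
    if remaining <= hard_margin_s:
        return True
    return remaining <= soft_margin_s and in_silence


if __name__ == "__main__":
    print(should_rollover(3000, in_silence=True))   # plenty of time left
    print(should_rollover(3450, in_silence=False))  # soft margin, mid-turn
    print(should_rollover(3450, in_silence=True))   # soft margin, pause
    print(should_rollover(3580, in_silence=False))  # hard margin
```

the idea is to open the new session before tearing down the old one, so the swap lands in a pause and no listener loses the start of a sentence.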
regulated industries still need a human in the loop. for healthcare and legal use cases the current technology is a force multiplier for a qualified human, not a replacement. the regulatory bars (qualified interpreter under section 1557, certified court interpreter for legal proceedings) are written in human terms and have not been updated to accommodate ai systems. that will change. it has not yet.
where this is heading
thinking machines lab’s recent position paper on interaction models is the cleanest signal of where the field moves next. their argument, in short: real-time interaction should be a property of the model architecture itself, not a software harness wrapped around a chat model. the bits we currently build by hand (the silence detection, the turn debounce, the playback ducking, the input swapping) exist because today’s models still need them. when interaction-model-native architectures ship at scale, the harness shrinks. it doesn’t disappear; it moves under the model. the application surface gets smaller, and the parts that touch the user’s workflow take up more of it.
a more immediate prediction: production-grade deployments will move translation off the end-user’s browser onto controlled media workers. for a prototype, browser sidecars are fine. for production you want centralized audit logging, retries, observability, the option to republish translated audio back into the room for recording, and the ability to swap models without a client release. livekit, daily, and pion all have the right primitives for this. expect this to become the default architecture within twelve to eighteen months.
if you’re building with this
real-time ai is a small enough surface to learn quickly and a big enough lever to be worth getting right. the cookbook covers the basics. the patterns above cover the second hour. the third hour is product-specific, and that’s where most teams should be spending most of their time.
if you’re thinking about a live-translation deployment, a real-time transcription pipeline, or a multilingual conversational product and want to compare notes on architecture, latency budgets, or regulatory posture, i’m happy to talk. drop me a note. and if you’re early enough to want a thinking partner on the build itself, that conversation is open too.
thanks for reading.
/ ansar
selected references
- openai. realtime conversations. https://developers.openai.com/api/docs/guides/realtime-conversations
- openai. realtime translation guide. https://developers.openai.com/api/docs/guides/realtime-translation
- openai. realtime transcription guide. https://developers.openai.com/api/docs/guides/realtime-transcription
- openai. realtime webrtc guide. https://developers.openai.com/api/docs/guides/realtime-webrtc
- openai cookbook. realtime translation guide. https://developers.openai.com/cookbook/examples/voice_solutions/realtime_translation_guide
- thinking machines lab. interaction models: a scalable approach to human-ai cooperation. https://thinkingmachines.ai/blog/interaction-models/
- evaluation of the accuracy and safety of machine translation of patient-specific discharge instructions (2025). https://pmc.ncbi.nlm.nih.gov/articles/PMC12252260/
- us census, american community survey 2019-2023, language use. summarized at https://slator.com/number-non-english-speaking-households-continues-to-rise-united-states/
- hhs. section 1557 final rule, 2024. https://www.federalregister.gov/documents/2024/05/06/2024-08711/nondiscrimination-in-health-programs-and-activities