Building a video-call stack that fails gracefully - from video to audio to record
Capture the consultation as data, not as the call. The discipline behind a telemedicine app that worked when the network didn't.
In April 2020, the question that mattered most was not "is the video pretty?" It was "did the consultation happen?"
Those are different questions, and a telemedicine platform that conflates them gets shipped, gets praised, and silently fails the people who need it most - the patient on a 2G connection in a tier-3 town, the doctor consulting from a home with a flaky ISP. We had to design for them, not for the demo.
This is how the video call stack was built so the doctor-patient consultation happened as a record even when the video itself didn't.
The decision
The consultation was a structured record, not a video stream. The video was a transport mechanism - useful, often the best mechanism - but the record's existence did not depend on the video succeeding.
Concretely, the system always wrote:
- `consultation.started` → at the moment the call connected
- `consultation.audio_only` → if/when video was downgraded to audio
- `consultation.text_fallback` → if/when the call failed and a structured chat was opened
- `consultation.completed` → at the moment the doctor closed the consultation
- `prescription.issued` → with structured fields, signed
- `files.exchanged` → photos, lab reports, etc.
Each was a row in the database. The video stream was a side-effect. If the audio worked but the video crashed, the consultation was complete, the prescription was issued, the record was clean. The patient got their care; the platform's database was honest about what happened.
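The event writer itself was deliberately boring: an append-only insert. A minimal sketch of that discipline, with the store mocked in memory (the `emitConsultationEvent` name follows the article; the store shape is illustrative, not the production schema):

```typescript
type ConsultationEvent = {
  consultationId: string;
  type: string; // e.g. "consultation.started", "consultation.audio_only"
  payload: Record<string, unknown>;
  at: Date;
};

// Append-only log: events are only ever inserted, never updated,
// so the record stays honest about what actually happened.
const eventLog: ConsultationEvent[] = [];

async function emitConsultationEvent(
  consultationId: string,
  type: string,
  payload: Record<string, unknown> = {}
): Promise<void> {
  eventLog.push({ consultationId, type, payload, at: new Date() });
}

// A consultation that degraded to audio still completes cleanly:
async function demo() {
  await emitConsultationEvent("c-1", "consultation.started");
  await emitConsultationEvent("c-1", "consultation.audio_only", { reason: "low_bandwidth" });
  await emitConsultationEvent("c-1", "consultation.completed");
}
```

Because nothing is ever overwritten, "what happened during this consultation" is always a read over the log, never a reconstruction.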
The transport layer: what we bought, what we didn't build
We integrated a managed real-time communications provider (Twilio Programmable Video; Agora or Daily would be the equivalents today). We did not build our own WebRTC signaling or run our own media servers.
This was a deliberate, slightly contrarian decision in 2020. The "we can build it cheaper" voice was loud. We resisted it for three reasons:
- The failure mode if we got it wrong was bad. A doctor's call dropping mid-consultation isn't a UX bug. It's a clinical event. The probability of getting media-server scaling right under load, in a country with massive bandwidth heterogeneity, on a four-week timeline, was not 99%. We bought the 99%.
- The cost wasn't the cost we were quoted. A self-hosted media server has TURN servers, geographic distribution, encoding/decoding, codec selection, NAT traversal, packet loss handling, congestion control. Each one is a project. The provider bundles them.
- Our differentiation was not the video. Every telemedicine app had video. Our differentiation was the consultation as a record. Time spent on media servers was time not spent on the prescription pad and the doctor's queue.
We connected to the provider with this shape:
```typescript
type ConsultationSession = {
  consultationId: string;
  doctorId: string;
  patientId: string;
  roomSid: string; // provider-side room identifier
  status: "scheduled" | "live" | "audio_only" | "completed" | "failed";
  startedAt?: Date;
  endedAt?: Date;
};

async function joinConsultation(consultationId: string, role: "doctor" | "patient") {
  const session = await db.consultations.findById(consultationId);
  if (!session) throw new Error("not found");

  // Generate short-lived token for the provider's SDK; 30-min TTL
  const token = await provider.createAccessToken({
    identity: `${role}:${role === "doctor" ? session.doctorId : session.patientId}`,
    roomName: session.roomSid,
    ttl: 30 * 60,
  });

  // Side-effect: emit the audit event
  await emitConsultationEvent(consultationId, "consultation.joined", { role });

  return { token, roomSid: session.roomSid };
}
```
The token was short-lived (30 minutes). If the call dropped and reconnected, the session reused the same room with a fresh token. The room identity was stable; the credentials weren't.
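That invariant — stable room identity, disposable credentials — can be sketched in a few lines (the token issuer here is an in-memory stand-in for the provider's SDK; names are illustrative):

```typescript
type Session = { roomSid: string };

// Stand-in for the provider's token endpoint.
let tokenCounter = 0;
function createAccessToken(roomName: string, ttlSeconds: number) {
  tokenCounter += 1;
  return {
    token: `tok-${tokenCounter}`, // fresh credential every time
    roomName,                     // stable room identity
    expiresAt: Date.now() + ttlSeconds * 1000,
  };
}

// On reconnect: rejoin the SAME room with a NEW short-lived token.
function rejoin(session: Session) {
  return createAccessToken(session.roomSid, 30 * 60);
}
```

A dropped client that rejoins twice lands in the same room both times, but never reuses a credential — which keeps leaked tokens cheap and reconnects free of room-identity churn.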
Bitrate adaptation and the audio-only fallback
The provider's SDK supported adaptive bitrate. We tuned it more aggressively for our population:
```typescript
const VIDEO_CONSTRAINTS = {
  // start conservative, ramp up if connection allows
  width: { ideal: 480, max: 720 },
  height: { ideal: 360, max: 540 },
  frameRate: { ideal: 15, max: 24 },
};

const AUDIO_CONSTRAINTS = {
  echoCancellation: true,
  noiseSuppression: true,
  // standard voice band; favours clarity over fidelity
  sampleRate: 16000,
};
```
The 480p ceiling was deliberate. Most patients didn't have screens that benefited from higher resolution; the bandwidth saving was real.
We also implemented automatic downgrade-to-audio when the bitrate target couldn't be met. The threshold was empirical:
```typescript
function watchAndDowngrade(track: VideoTrack, session: ConsultationSession) {
  const sub = provider.network.subscribe((stats) => {
    if (stats.targetBitrate < 100_000 && stats.packetLoss > 0.05) {
      // sub-100kbps with >5% packet loss for 10s = downgrade
      if (sustainedFor(stats, 10_000)) {
        track.disable();
        emitConsultationEvent(session.consultationId, "consultation.audio_only", {
          reason: "low_bandwidth",
          bitrate: stats.targetBitrate,
          packetLoss: stats.packetLoss,
        });
        showToast("Switching to audio for better connection");
      }
    }
  });
  return sub;
}
```
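The `sustainedFor` check is doing real work: a single bad stats sample must not trigger a downgrade. One way to implement a helper like it — a sketch with a simplified signature, not the production code — is to track when the bad condition started and only fire once it has held for the full window:

```typescript
// Tracks how long a "bad network" condition has been continuously true.
// Any single good sample resets the window, so a momentary blip never
// triggers a downgrade.
let badSince: number | null = null;

function sustainedBad(isBad: boolean, windowMs: number, now = Date.now()): boolean {
  if (!isBad) {
    badSince = null; // good sample: reset the window
    return false;
  }
  if (badSince === null) badSince = now; // first bad sample: start the clock
  return now - badSince >= windowMs;
}
```

The caller feeds it each stats sample; it returns `true` only once the condition has been continuously bad for the window, which is what makes the 10-second threshold meaningful.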
The toast was important. A user whose video disappears without explanation thinks the platform crashed. A user who reads "Switching to audio for better connection" knows what happened and adjusts. Translating system state directly into the UI was, in itself, an anxiety-reduction feature.
When bandwidth recovered, we did not automatically re-enable video. Reaching for video the moment a network hiccup resolved produced a bad oscillation pattern (video on, video off, video on, every 8 seconds). We required a manual tap to re-enable. The static state was less anxious than the oscillating one.
Reconnection logic
Calls dropped. The pattern was always the same: 4G handover at the patient's end, ISP blip at the doctor's end, 30 seconds of dead air. The question was whether this was a 30-second blip or a permanent failure.
```typescript
function setupReconnection(session: ConsultationSession) {
  let reconnectStart: Date | null = null;
  const MAX_RECONNECT_MS = 60_000; // 60s ceiling, then we give up

  provider.connectionStateChanged.subscribe(async (state) => {
    if (state === "reconnecting" && !reconnectStart) {
      reconnectStart = new Date();
      showToast("Reconnecting…");
      emitConsultationEvent(session.consultationId, "consultation.reconnecting", {});
    }
    if (state === "connected" && reconnectStart) {
      const downMs = Date.now() - reconnectStart.getTime();
      reconnectStart = null;
      showToast("Reconnected");
      emitConsultationEvent(session.consultationId, "consultation.reconnected", {
        downMs,
      });
    }
    if (state === "disconnected") {
      const downMs = reconnectStart
        ? Date.now() - reconnectStart.getTime()
        : MAX_RECONNECT_MS;
      if (downMs >= MAX_RECONNECT_MS) {
        // give up; record what we have; offer text fallback
        await emitConsultationEvent(
          session.consultationId,
          "consultation.text_fallback_offered",
          { downMs }
        );
        navigateTo(`/consultation/${session.consultationId}/text-fallback`);
      }
    }
  });
}
```
The 60-second ceiling was empirical. Beyond that, the call wasn't coming back. We routed the patient and doctor into a structured chat session with the same consultation record - the consultation continued, just on a different transport.
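Because the chat wrote against the same consultation ID, the record never forked when the transport changed. A sketch of that shape, with an in-memory store standing in for the database (names are illustrative):

```typescript
type ChatMessage = {
  consultationId: string; // the SAME record the video call was writing to
  from: "doctor" | "patient";
  text: string;
  at: Date;
};

const messages: ChatMessage[] = [];

// The chat is just another transport: it appends to the existing
// consultation record rather than opening a new one.
function sendFallbackMessage(
  consultationId: string,
  from: "doctor" | "patient",
  text: string
) {
  messages.push({ consultationId, from, text, at: new Date() });
}

function transcript(consultationId: string): string[] {
  return messages
    .filter((m) => m.consultationId === consultationId)
    .map((m) => `${m.from}: ${m.text}`);
}
```

When the doctor later closed the consultation, `consultation.completed` was emitted exactly as it would have been for a video call — the record cannot tell which transport carried the conversation, which is the point.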
The text fallback was the most-criticised feature in early reviews. Patients said "this isn't telemedicine, this is just chat." But for the 5% of consultations where video and audio both failed, it was the difference between care happening and care not happening. We kept it. The metric we cared about was "consultations completed," not "consultations done over video."
The prescription as a structured record
The single decision that made the platform clinically credible was treating the prescription as data, not a PDF.
```sql
create table prescriptions (
  id               uuid primary key,
  consultation_id  uuid not null,
  doctor_id        uuid not null,
  patient_id       uuid not null,
  issued_at        timestamptz not null,
  signature_blob   bytea not null,    -- doctor's signature image
  diagnosis_codes  text[],            -- ICD-10 codes when applicable
  prescribed_drugs jsonb not null,    -- structured: name, dosage, frequency, duration
  advice           text,              -- free text
  follow_up_after  interval,          -- e.g. '7 days'
  pdf_render_url   text               -- cached PDF rendering of the structured fields
);
```
Doctors filled in a structured form. The PDF was a rendering of the structured data, not the data itself. This had three benefits:
- The patient's record was queryable. "What drugs has this patient been prescribed in the last six months?" became a SQL query.
- Dosage validation was possible. The form caught obvious data-entry errors (decimal placement, duration mismatch) before the prescription was issued.
- The PDF was reproducible. If the rendering changed (new clinic logo, new format), every prescription's PDF could be re-rendered without losing the underlying data.
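The dosage checks in the second point were structural, not clinical: they caught data-entry slips, not bad medicine. A sketch of the kind of rule the form ran — field names follow the `prescribed_drugs` JSONB shape above, and the thresholds here are illustrative, not the production values:

```typescript
type PrescribedDrug = {
  name: string;
  dosageMg: number; // per dose
  frequencyPerDay: number;
  durationDays: number;
};

// Structural sanity checks: catch obvious data-entry errors
// (decimal placement, implausible frequency or duration)
// before the prescription is issued.
function validateDrug(d: PrescribedDrug): string[] {
  const errors: string[] = [];
  if (d.dosageMg <= 0 || d.dosageMg > 5000) {
    // e.g. 50000mg instead of 500mg: a decimal-placement slip
    errors.push(`dosage ${d.dosageMg}mg is outside a plausible range`);
  }
  if (d.frequencyPerDay < 1 || d.frequencyPerDay > 24) {
    errors.push(`frequency ${d.frequencyPerDay}/day is implausible`);
  }
  if (d.durationDays < 1 || d.durationDays > 90) {
    errors.push(`duration ${d.durationDays} days needs review`);
  }
  return errors;
}
```

None of this is possible when the prescription is a PDF or an image; it falls out for free once the drugs are rows of structured fields.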
Doctors initially resisted the form ("just let me write it"). Within a week, the speed advantage of structured fields with autocomplete won out.
What surprised me
Audio-only was used more than expected. Doctors kept the audio call going while writing the prescription, even when video was available, because audio was reliable.
The events table outlasted the rest of the stack. When we rebuilt parts of the platform later, the structured-record schema was the only thing that didn't change. Transports came and went; the consultation row was stable.
The text fallback reduced anxiety even when not used. Patients who knew the safety net existed completed video calls they would otherwise have abandoned.
What I'd do differently
Server-side recording from day one. We added it later for clinical-review use cases. Should have started with it. The events table captured the state, but a recording captured the medical content. Both have value.
End-to-end encryption from the start. We started with the provider's transport encryption (TLS, SRTP). Adding E2E later for sensitive consultations took longer than building it in. For a healthcare app, default to E2E.
A standalone audio-only mode. The downgrade-to-audio path was reactive. Some consultations didn't need video at all (follow-ups, prescription refills). An explicit audio-only mode would have skipped the bandwidth dance entirely for those.
If you are building real-time communication where outcomes matter more than aesthetics, the discipline is: capture the outcome as data, not as the call. The call is a transport. The record is the product.