Designing an event-sourced state machine for real-world fleet operations
Why the rental row never had a status column, and what we built instead.
Most engineers reach for a "current state" column on the rental row when they build a fleet platform. Updates land via UPDATE rentals SET status = ?. Six months in, you discover the database thinks a rider has a vehicle that the showroom thinks was returned, the battery operator thinks is in for service, and the rider's app thinks is paused.
The state column drifted. There is no way back to ground truth. Disputes resolve to whoever shouts loudest.
This was the problem on day one of the EV rental platform. We never let it happen because the rental row never had a status column.
The decision
Every rental was modelled as an append-only event log with derived current-state views. No mutable status field anywhere. The state of a rental at any moment was a projection - replay the events, fold them into a state - not a stored value that could drift out of sync with reality.
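The replay-and-fold idea is easiest to see as a pure reducer. A minimal sketch, with illustrative event and state shapes rather than the platform's actual schema:

```typescript
// Illustrative event and state shapes - not the real platform schema.
type RentalEvent =
  | { type: "vehicle_assigned"; vehicleId: string }
  | { type: "battery_swapped"; batteryId: string }
  | { type: "rental_paused" }
  | { type: "rental_resumed" }
  | { type: "rental_returned" };

interface RentalState {
  vehicleId: string | null;
  batteryId: string | null;
  status: "active" | "paused" | "returned";
}

const initial: RentalState = { vehicleId: null, batteryId: null, status: "active" };

// Pure reducer: each event maps the old state to a new state.
function apply(state: RentalState, event: RentalEvent): RentalState {
  switch (event.type) {
    case "vehicle_assigned": return { ...state, vehicleId: event.vehicleId };
    case "battery_swapped":  return { ...state, batteryId: event.batteryId };
    case "rental_paused":    return { ...state, status: "paused" };
    case "rental_resumed":   return { ...state, status: "active" };
    case "rental_returned":  return { ...state, status: "returned" };
  }
}

// Current state is never stored as truth; it is always derivable.
function project(events: RentalEvent[]): RentalState {
  return events.reduce(apply, initial);
}
```

Because the fold is pure, the projection can always be thrown away and rebuilt from the log.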
The events table sat at the centre:
```sql
create table rental_events (
  id               uuid primary key,
  rental_id        uuid not null,
  event_type       text not null,
  payload          jsonb not null,
  reason_code      text,
  actor_id         uuid not null,   -- who emitted the event
  actor_role       text not null,   -- rider | showroom | admin | battery_op
  occurred_at      timestamptz not null,
  recorded_at      timestamptz not null default now(),
  idempotency_key  text not null unique
);

create index on rental_events (rental_id, occurred_at);
```
Three things matter here:
1. `occurred_at` and `recorded_at` are different. A battery operator at a swap point with no signal logs an event at 06:42, the network reconnects at 07:15, and the event lands in the database at 07:15. Both timestamps are preserved. Replays use `occurred_at`; reconciliation uses `recorded_at`.
2. `idempotency_key` is a hard unique. The battery operator's offline-tolerant app retries on reconnect. The same swap event must not double-write. The key is deterministic - `{actor_id}-{rental_id}-{event_type}-{minute_bucket}` - so a retry produces the same key and collides on the unique constraint.
3. `actor_role` is on every row. Disputes ("the showroom said the vehicle was returned") resolve by filtering events by role.
The alternatives we considered
Plain CRUD with a status column. Fastest to build. Drift within weeks. Every dispute becomes a finger-pointing exercise because there is no audit of how the status got that way.
Status column plus an audit log table. Better - at least there is a record. But the status column is still the source of truth, and the audit log is the apology when reality disagrees with it. We would have spent operations time reconciling the status column against the log instead of against reality.
Full event sourcing with a CQRS framework (e.g. Eventuate, Axon). Tempting, but the framework was bigger than the problem. We needed event sourcing on one aggregate (the rental); we did not need it on customers, payments, vehicles. The framework cost outweighed the benefit.
The chosen approach was event sourcing on the rental aggregate, plain CRUD on everything else. Hybrid is fine. Don't event-source the customer profile.
What the projections looked like
The "current state" of a rental was a SQL view that folded the events:
```sql
create view rental_current_state as
select
  rental_id,
  -- last assignment event wins
  (
    select payload->>'vehicle_id'
    from rental_events e2
    where e2.rental_id = e.rental_id
      and e2.event_type = 'vehicle_assigned'
    order by occurred_at desc
    limit 1
  ) as current_vehicle_id,
  -- last battery swap wins
  (
    select payload->>'battery_id'
    from rental_events e3
    where e3.rental_id = e.rental_id
      and e3.event_type = 'battery_swapped'
    order by occurred_at desc
    limit 1
  ) as current_battery_id,
  -- derived status
  case
    when exists (
      select 1 from rental_events x
      where x.rental_id = e.rental_id
        and x.event_type = 'rental_returned'
    ) then 'returned'
    when exists (
      select 1 from rental_events x
      where x.rental_id = e.rental_id
        and x.event_type = 'rental_paused'
        and not exists (
          select 1 from rental_events y
          where y.rental_id = e.rental_id
            and y.event_type = 'rental_resumed'
            and y.occurred_at > x.occurred_at
        )
    ) then 'paused'
    else 'active'
  end as status,
  max(occurred_at) as last_event_at
from rental_events e
group by rental_id;
```
Hot reads went through this view in development. In production, we materialised it - `refresh materialized view concurrently` after every event insert, batched at 1-second granularity. Read latency stayed under 30ms even with millions of events.
The materialisation was a cache, not the source of truth. If it ever drifted (it never did), we could rebuild it by replaying. That property mattered more than the latency.
Idempotent events for the offline app
The battery operator's tablet ran inside a swap point with intermittent backhaul. Operators logged swaps in seconds and walked to the next bay. The app could not pause for the network.
Local-first writes solved the latency. Idempotency keys solved the duplicate problem.
```typescript
// Battery operator app - local write, queued for sync
function logSwap(rentalId: string, oldBatteryId: string, newBatteryId: string) {
  const occurredAt = new Date();
  const minuteBucket = occurredAt.toISOString().slice(0, 16); // YYYY-MM-DDTHH:MM
  const idempotencyKey =
    `battery_op-${operatorId}-${rentalId}-battery_swapped-${minuteBucket}`;

  const event = {
    rentalId,
    eventType: "battery_swapped",
    payload: { oldBatteryId, newBatteryId },
    actorId: operatorId,
    actorRole: "battery_op",
    occurredAt,
    idempotencyKey,
  };

  db.localEvents.put(event);  // commits to local SQLite
  syncQueue.enqueue(event);   // best-effort sync to server
  return event;
}
```
The same operator scanning the same battery twice within 60 seconds collapsed to one event. The minute bucket was the right granularity for our workflow - finer was unnecessary, coarser would have lost legitimate retries.
On the server side, the idempotency_key unique constraint did the rest. Conflicting inserts threw, the sync layer caught the conflict, and the local event was marked synced.
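The server-side behaviour is worth spelling out: a retry is not an error, it is a success that writes nothing. A sketch of that dedupe, with an in-memory set standing in for the Postgres unique constraint (which raises error 23505, `unique_violation`, on a duplicate `idempotency_key`):

```typescript
// In-memory stand-in for the idempotency_key unique constraint.
// The real store is Postgres; a duplicate insert there raises
// error 23505 (unique_violation), which the sync layer treats as success.
class EventStore {
  private seen = new Set<string>();
  events: { idempotencyKey: string; eventType: string }[] = [];

  // "inserted" on first write, "duplicate" on a retry -
  // both mean the event is safely in the log exactly once.
  insert(event: { idempotencyKey: string; eventType: string }): "inserted" | "duplicate" {
    if (this.seen.has(event.idempotencyKey)) return "duplicate";
    this.seen.add(event.idempotencyKey);
    this.events.push(event);
    return "inserted";
  }
}
```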
The race the model dissolves
The original "scary" race: the rider finishes a shift and arrives at a swap point. Simultaneously, the showroom manager updates the rider's vehicle assignment in the back-office because they think the rider returned the bike yesterday.
In a status-column world, this is a nightmare. Two writes hit the same row from different actors with different intents. Whoever lands last wins. The other loses, silently.
In the event log:
```
12:01:30  swap_point:  battery_swapped    { rental_id: R1, vehicle_id: V1, old_battery: B1, new_battery: B2 }
12:01:32  showroom:    vehicle_unassigned { rental_id: R1, vehicle_id: V1, reason: "rider returned yesterday" }
```
Both events land. Both are visible. The projection now shows: current_vehicle_id = null (last event wins on assignment) but the swap event for V1 is still in the log. An operator looking at the rental can see that two contradicting events happened in two seconds, by two different roles, and can adjudicate. The system isn't broken; it has captured a real-world disagreement that needs human resolution.
That's the entire point. Software shouldn't paper over disagreements between actors. It should surface them.
What surprised me
The reason_code column carried more weight than expected. Every event had to ship with a structured reason from a fixed enum (vehicle_in_service, rider_no_show, battery_overheated, etc.). At first this felt like ceremony. After three months, the reason codes were the most-queried field in the database - they powered fleet-wide diagnostics, billing exception reports, and the operating playbook itself.
Materialised view refresh strategy mattered more than I thought. `refresh materialized view concurrently` requires a unique index on the view, and the lock semantics are different from a non-concurrent refresh. We hit subtle issues at scale and ended up with a custom incremental projection updater that subscribed to the event stream and wrote the projection rows in the same transaction as the event insert. Faster, simpler, no refresh dance.
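A sketch of what that incremental updater looks like: instead of refreshing a view, each event is applied directly to its stored projection row. The map below stands in for the projection table, and the shapes are illustrative:

```typescript
// Incremental projection updater, sketched in memory. In production
// the event insert and the projection-row upsert run in one database
// transaction, so the log and the projection cannot diverge.
type ProjEvent =
  | { type: "vehicle_assigned"; vehicleId: string }
  | { type: "vehicle_unassigned" }
  | { type: "rental_returned" };

interface ProjRow {
  currentVehicleId: string | null;
  status: "active" | "returned";
}

class ProjectionUpdater {
  private rows = new Map<string, ProjRow>(); // stand-in for the projection table

  handle(rentalId: string, event: ProjEvent): void {
    const row = this.rows.get(rentalId) ?? { currentVehicleId: null, status: "active" };
    switch (event.type) {
      case "vehicle_assigned":
        row.currentVehicleId = event.vehicleId;
        break;
      case "vehicle_unassigned":
        row.currentVehicleId = null;
        break;
      case "rental_returned":
        row.status = "returned";
        break;
    }
    this.rows.set(rentalId, row);
  }

  get(rentalId: string): ProjRow | undefined {
    return this.rows.get(rentalId);
  }
}
```

Reads hit the projection row directly; the refresh dance disappears because there is nothing to refresh.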
Event types should be additive forever. We renamed an event type once early on. The migration to rewrite historical events lost a day. After that, deprecated event types stayed in the enum forever; new types were added; nothing was renamed.
What I'd do differently
Start the materialised projections inside the application layer. Postgres views and refresh materialized view got us 80% of the way. Going from 80% to 100% would have been smoother with an explicit projection updater that owned its own table from day one. The migration when we eventually built that was painful only because we had inherited the materialised-view tooling first.
Snapshots after a threshold. Replaying every event for a long-running rental is wasteful. Our heaviest fleet renters crossed 1,200 events per rental within 18 months, mostly battery swaps. We added snapshots at the 1,000-event mark, but only after we had to. Should have built snapshot infrastructure into the projection layer from day one.
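The snapshot mechanics are simple once the fold is pure: persist the folded state every N events, and start replays from the latest snapshot instead of event zero. A generic sketch, with illustrative shapes and a hypothetical threshold:

```typescript
// Snapshotting sketch: a snapshot records the folded state up to some
// event index, so a replay only folds the events after it.
// The shapes and the write-side threshold (e.g. every 1,000 events,
// as in the article) are illustrative.
interface Snapshot<S> {
  upToEvent: number; // index of the last event folded into this state
  state: S;
}

function projectWithSnapshot<S, E>(
  events: E[],
  fold: (s: S, e: E) => S,
  initial: S,
  snapshot: Snapshot<S> | null,
): { state: S; replayed: number } {
  const start = snapshot ? snapshot.upToEvent + 1 : 0;
  let state = snapshot ? snapshot.state : initial;
  for (let i = start; i < events.length; i++) {
    state = fold(state, events[i]);
  }
  return { state, replayed: events.length - start };
}
```

For a 1,200-event rental with a snapshot at event 1,000, the replay touches 200 events instead of 1,200, and the result is identical to a full replay by construction.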
A typed event registry. We had loose JSON payloads with runtime validation. Should have had a typed registry per event type - code generation from a schema, mandatory shapes, breaking-change protection. Worth the up-front cost.
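A hand-rolled sketch of what such a registry could look like: one payload shape per event type at compile time, plus a runtime guard at the write path. A real version would likely generate this from a schema or use a validation library; the event names here are illustrative:

```typescript
// Typed event registry sketch: compile-time payload shapes plus
// runtime validators. Event names and shapes are illustrative.
interface EventSchemas {
  battery_swapped: { oldBatteryId: string; newBatteryId: string };
  vehicle_assigned: { vehicleId: string };
  rental_returned: Record<string, never>;
}

type EventType = keyof EventSchemas;

// One validator per type - the registry rejects unknown types and
// malformed payloads before anything reaches the log.
const validators: { [T in EventType]: (p: unknown) => p is EventSchemas[T] } = {
  battery_swapped: (p: unknown): p is EventSchemas["battery_swapped"] =>
    typeof p === "object" && p !== null &&
    typeof (p as any).oldBatteryId === "string" &&
    typeof (p as any).newBatteryId === "string",
  vehicle_assigned: (p: unknown): p is EventSchemas["vehicle_assigned"] =>
    typeof p === "object" && p !== null &&
    typeof (p as any).vehicleId === "string",
  rental_returned: (p: unknown): p is EventSchemas["rental_returned"] =>
    typeof p === "object" && p !== null,
};

function validateEvent(type: string, payload: unknown): boolean {
  const v = (validators as Record<string, (p: unknown) => boolean>)[type];
  return v ? v(payload) : false; // unknown types are rejected, never guessed at
}
```

Rejecting unknown types at the boundary also enforces the additive-forever rule: a renamed event type fails validation instead of silently writing a second spelling into the log.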
If your software sits on top of a physical operation where multiple humans can update the same thing in the same minute: do not give it a status column. Give it events. The status column is the lie the database tells when reality is more complicated than the schema. The events are reality.