The thesis Quorum is built on is uncomfortable and true: the tools a team uses to coordinate an incident often live in the same region as the thing that is failing. When the region goes, the incident response goes with it. You are now coordinating a region outage over a status page that the region outage took down. Quorum is an incident command plane designed to survive a region loss. This post is about how the failover works, what the live demo does and does not prove, and where the survival story currently ends, because a database audience will ask all three and they deserve a straight answer.

What DSQL gives you

A multi-region DSQL cluster in the US set is three regions: two full regions, which for Quorum are us-east-1 and us-east-2, and a log-only witness in us-west-2 that has no cluster endpoint of its own. Both full-region endpoints present a single logical database with strong consistency, and the architecture is designed for 99.999% multi-region availability with no single point of failure and automated failure recovery.

The behavior that matters for an incident tool is stated plainly in the GA announcement: applications can keep reading and writing with strong consistency even when they are unable to connect to a region's cluster endpoint, and the third region acts as a log-only witness with no cluster resource or endpoint. The survivor keeps serving; the witness holds the log so the surviving region keeps commit quorum. Quorum is, in effect, a live demonstration of that reference behavior with an incident-command product wrapped around it.

Quorum's failover layer

AWS's guidance for multi-region DSQL is to put routing in front of the endpoints: either DNS-based routing with Route 53, or application-level routing logic, so traffic redirects automatically when an endpoint becomes unreachable. This is laid out in "Implement multi-Region endpoint routing for Amazon Aurora DSQL".
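To make the application-level option concrete, here is a minimal sketch of endpoint routing in TypeScript. It is an illustration under stated assumptions, not Quorum's actual code: the `EndpointRouter` class, its method names, and the `example.invalid` hostnames are all hypothetical. The idea is only that the application keeps track of which regional endpoints it currently believes are reachable and always picks the first healthy one in preference order.

```typescript
// Hypothetical endpoint descriptor: hostnames below are placeholders, not real cluster endpoints.
interface Endpoint {
  region: string;
  host: string;
}

// Tracks which endpoints are believed reachable and picks one in preference order.
class EndpointRouter {
  private readonly endpoints: Endpoint[];
  private readonly unreachable = new Set<string>();

  constructor(endpoints: Endpoint[]) {
    if (endpoints.length === 0) {
      throw new Error("at least one endpoint is required");
    }
    this.endpoints = endpoints;
  }

  // Call when a connection attempt or health probe against a region fails.
  markUnreachable(region: string): void {
    this.unreachable.add(region);
  }

  // Call when a later probe succeeds, so traffic can return to the preferred region.
  markHealthy(region: string): void {
    this.unreachable.delete(region);
  }

  // Returns the first endpoint, in preference order, that is not marked unreachable.
  current(): Endpoint {
    const healthy = this.endpoints.find((e) => !this.unreachable.has(e.region));
    if (!healthy) {
      throw new Error("no reachable endpoint");
    }
    return healthy;
  }
}

// Only the two full regions are listed: the us-west-2 witness has no endpoint to route to.
const router = new EndpointRouter([
  { region: "us-east-1", host: "east-1.example.invalid" },
  { region: "us-east-2", host: "east-2.example.invalid" },
]);
```

In this sketch the caller asks `router.current()` before each database operation and calls `router.markUnreachable(region)` when a connection fails. How failures are detected (timeouts, probes, retries) is deliberately left out, since the source does not describe it.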
Quorum, a Next.js app on Vercel, does the application-level version: it detects an unreachable region and routes writes and reads to the healthy endpoint.

The piece I am most satisfied with is that the health panel is itself failover-protected. A monitor Lambda re-validates failover on a schedule and writes a status snapshot through DSQL. So the component that tells you about the outage reads from the same database that survives the outage. The status display cannot become a casualty of the thing it is reporting on, which is the failure mode that makes most status pages useless at the exact moment you need them.

Ingestion works the same way. A CloudWatch alarm fires, EventBridge routes it, and an ingest Lambda writes the signal into DSQL as an event. Monitoring events become incidents through a path that does not hinge on a single region's data layer.

The capstone is recursive. Running a failover drill inside the product opens a real sev1 incident, "us-east-1 region impairment," which you then coordinate from the surviving region, in the same war room, on the same database, and resolve when the region restores. The drill exercises the exact flow a real region failure would. Because the event UUID is the idempotency key, the drill is safe to run repeatedly without leaving residue.

What the demo proves, and what it does not

Now the precise part, because precision here is the whole point, and it cuts in my favor before it cuts against.

Here is what the demo proves, and it is all real: the application detects an unreachable region and routes to the survivor, the incident record does not fork under contention, the recursive drill opens and coordinates a real incident from the surviving side, and the health panel keeps reading because it reads through DSQL. Those are measured live, on the click. Recovery point is effectively zero, because strong consistency means a failover loses no committed data.

Here is the boundary.
The chaos toggle simulates a region's endpoint becoming unreachable, which is AWS's own framing of the failure scenario, and it exerc
Surviving the region you run in: failover on Aurora DSQL, and what the demo proves
AUTHOR · Jonathan