The Challenge
A healthcare technology provider needed to meet strict NHS data residency requirements while ensuring 99.99% availability for their patient-facing platform.
The brief was clear enough on paper. An NHS patient management platform, running in a single region, needed multi-region resilience without patient data ever leaving the UK. The client had already had a close call: a partial eu-west-1 outage six months earlier had knocked their portal offline for nearly two hours, and a handful of NHS trusts started asking difficult questions about continuity.
Their existing setup was a single-region deployment in Ireland. It worked, but it was one bad day away from a serious incident. They needed failover capability, they needed it to satisfy NHS data residency rules, and they needed it without doubling their AWS bill.
The Solution
We spent the first week mapping out the architecture and quickly landed on an active-passive design across eu-west-1 (Ireland, primary) and eu-west-2 (London, secondary). Both regions sit within the UK and Ireland, so data residency was covered from the start.
Aurora Global Database was the obvious choice for the data layer: asynchronous cross-region replication with typically sub-second lag, and a secondary cluster that can be promoted quickly when you need to fail over. We put read replicas in the London region too, which cut query latency for users routed there.
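The data-layer setup can be sketched as the boto3-style request parameters for `rds.create_global_cluster` (primary region) and the secondary `rds.create_db_cluster` (London). Cluster identifiers, the ARN, and the engine choice are illustrative assumptions, not the client's actual values; in practice these dicts would be passed to regional boto3 RDS clients.

```python
# Sketch of the Aurora Global Database setup as boto3-style request
# parameters. No AWS calls are made here; the dicts show the shape of
# the two API requests. All names are hypothetical.

def global_cluster_params(source_cluster_arn: str) -> dict:
    """Parameters for rds.create_global_cluster, run in eu-west-1."""
    return {
        "GlobalClusterIdentifier": "patient-platform-global",  # hypothetical
        "SourceDBClusterIdentifier": source_cluster_arn,
    }

def secondary_cluster_params(global_cluster_id: str) -> dict:
    """Parameters for rds.create_db_cluster, run in eu-west-2.

    Joining via GlobalClusterIdentifier makes this the replicated
    secondary; the regional KMS key for at-rest encryption would be
    supplied separately, since KMS keys are region-scoped.
    """
    return {
        "DBClusterIdentifier": "patient-platform-london",  # hypothetical
        "Engine": "aurora-postgresql",  # assumption: Postgres-compatible Aurora
        "GlobalClusterIdentifier": global_cluster_id,
    }

primary_arn = "arn:aws:rds:eu-west-1:123456789012:cluster:patient-platform"
gparams = global_cluster_params(primary_arn)
sparams = secondary_cluster_params(gparams["GlobalClusterIdentifier"])
```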
Then we hit our first snag. At the time, eu-west-2 didn’t support one of the WAF managed rule groups we wanted for NHS-specific threat filtering. We ended up writing custom rules against the OWASP baseline and layering in rate limiting on the ALBs ourselves. It took an extra few days, but the compliance auditor preferred it. She said it showed we understood the threat model rather than just ticking a box with a managed ruleset.
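To make the rate-limiting piece concrete, here is a sketch of one such rule in the shape that `wafv2.create_web_acl` expects: a rate-based statement that blocks any single IP exceeding a request threshold per five-minute window. The rule name, priority, and limit are illustrative, not the client's actual thresholds.

```python
# One custom WAF v2 rule: a per-IP rate limit, in the Rule structure
# that wafv2.create_web_acl / update_web_acl accepts. Values are
# illustrative assumptions.

def rate_limit_rule(limit_per_5_min: int, priority: int) -> dict:
    return {
        "Name": "rate-limit-per-ip",  # hypothetical rule name
        "Priority": priority,
        "Statement": {
            "RateBasedStatement": {
                "Limit": limit_per_5_min,  # requests per 5-minute window, per IP
                "AggregateKeyType": "IP",
            }
        },
        "Action": {"Block": {}},  # block the offending IP while over the limit
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "RateLimitPerIP",
        },
    }

rule = rate_limit_rule(2000, priority=1)
```

The OWASP-baseline rules mentioned above would sit alongside this one in the same web ACL, each with its own priority.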
On the networking side, we locked CloudFront down with geo-restriction to UK-only delivery. Belt and braces, but it meant there was no path for patient data to accidentally egress to a non-compliant region.
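The geo-restriction itself is a small fragment of the CloudFront `DistributionConfig`. A minimal sketch, assuming a whitelist of `GB` only; whether to also whitelist the crown dependencies (`JE`, `GG`, `IM`) is a policy question to confirm with the client, not something this sketch decides.

```python
# The CloudFront geo-restriction fragment, in the DistributionConfig
# shape that cloudfront.update_distribution expects. "GB" is the
# ISO 3166 code for the United Kingdom.

UK_ONLY_RESTRICTIONS = {
    "GeoRestriction": {
        "RestrictionType": "whitelist",  # serve only the listed countries
        "Quantity": 1,                   # must match len(Items)
        "Items": ["GB"],
    }
}
```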
For failover, we wired up Route 53 health checks to trigger database promotion and DNS cutover. Our target was under two minutes. The first failover test went sideways, though. DNS caching on the client’s legacy Java services meant some connections hung on to the old primary endpoint for nearly ten minutes. We fixed it by dropping the TTL on the health-checked records and adding a connection draining step to the runbook. After that, cutover consistently landed under 90 seconds.
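The record pair behind that cutover can be sketched as a Route 53 change batch: a PRIMARY record carrying the health check and a SECONDARY record without one, both with the lowered TTL that came out of the failed test. Domain, IPs, TTL value, and health-check ID here are placeholders, not the client's real configuration.

```python
# Sketch of the failover record pair for
# route53.change_resource_record_sets. All identifiers are placeholders.

def failover_records(name, primary_ip, secondary_ip, health_check_id, ttl=60):
    """Build the change batch for a PRIMARY/SECONDARY failover pair."""
    def record(set_id, role, ip, hc=None):
        r = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,        # short TTL so resolvers pick up cutover quickly
            "ResourceRecords": [{"Value": ip}],
        }
        if hc:
            r["HealthCheckId"] = hc  # only the primary record is health-checked
        return r

    return {
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": record("ireland", "PRIMARY", primary_ip, health_check_id)},
            {"Action": "UPSERT",
             "ResourceRecordSet": record("london", "SECONDARY", secondary_ip)},
        ]
    }

batch = failover_records("portal.example.com.", "203.0.113.10", "203.0.113.20", "hc-1234")
```

A low record TTL helps ordinary resolvers; clients that cache DNS internally (like the legacy Java services above) still need their own cache settings or a connection-drain step, which is why both went into the runbook.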
The rest of the security stack was solid but unglamorous. KMS with customer-managed keys for encryption at rest. TLS 1.3 enforced on every API connection. Secrets Manager rotating credentials every 30 days via Lambda. AWS Config rules watching encryption status, IAM policies, and security groups, with automatic flagging if anything drifted.
We piped CloudTrail logs into a centralised logging stack so the client could pull audit trails during regulatory reviews without scrambling.
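The trail behind that audit pipeline can be sketched as `cloudtrail.create_trail` parameters; the trail and bucket names are placeholders, and the real setup would also wire the bucket policy and downstream log shipping.

```python
# The centralised audit trail, sketched as cloudtrail.create_trail
# parameters. Names are hypothetical; no AWS call is made here.

TRAIL_PARAMS = {
    "Name": "org-audit-trail",             # hypothetical trail name
    "S3BucketName": "central-audit-logs",  # hypothetical central log bucket
    "IsMultiRegionTrail": True,            # capture eu-west-1 and eu-west-2
    "EnableLogFileValidation": True,       # digest files give tamper-evidence
    "IncludeGlobalServiceEvents": True,    # IAM, STS, CloudFront events too
}
```

Log file validation matters for exactly the regulatory-review case above: the digest chain lets an auditor confirm the trail has not been altered after the fact.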
The Results
- 99.99% uptime over the first 12 months, comfortably above the 99.95% NHS SLA
- RTO of 2 minutes, RPO under 30 seconds thanks to continuous replication and the Route 53 failover chain
- Zero compliance findings across 18 months of operation after go-live
- SOC 2 Type II certification wrapped up within 6 months; NHS Digital DSPT requirements passed without remediation items
- Monthly failover tests running automatically with no manual intervention needed
- 18% cost overhead for the secondary region. The client’s CTO called it the cheapest insurance policy they’d ever bought.
- DR testing time cut from 4 hours to 20 minutes using AWS Backup and Lambda orchestration
The best feedback came from the NHS trusts themselves. Procurement teams that had previously flagged resilience concerns started approving the platform without further questions. Patient data stays in the UK, failover is tested every month, and the whole thing runs without anyone having to babysit it. That was the goal from day one.