Services / Disaster Recovery Architecture
Disaster recovery architecture is the technical infrastructure that determines whether your organization can actually restore its systems after a failure, not whether a plan says it can.
Most organizations have some form of backup. Few have ever successfully restored from it under real failure conditions: under time pressure, with the people who are actually available at 3am on a Sunday. The gap between having backup infrastructure and having a validated recovery capability is where the exposure lives. DR architecture is also distinct from business continuity planning.
BCP covers the organizational response: who does what, in what order, communicating with whom. DR architecture is the technical substrate that either makes the BCP executable or makes it fiction.
A BCP that states a 4 hour RTO for a system whose actual recovery time, under realistic conditions, is 14 hours is not a plan; it is a liability. The architecture must be designed, implemented and validated to deliver the RTO the plan commits to.
This engagement designs the architecture, specifies the implementation and validates that it achieves the required performance. We do not implement the infrastructure; your team or an infrastructure partner does that, separately and additionally. We design, specify, test and validate. The separation matters: design errors cost far less to correct on paper than in deployed infrastructure.
Design and validation only. Infrastructure implementation is separate and additional.
Design phase only. Validation follows implementation; the timeline depends on how fast your team or infrastructure partner implements.
What Actually Fails And Why
DR implementations fail in specific, predictable ways, and understanding them before you build is cheaper than discovering them during a real incident.
58% of backups fail on first recovery attempt during real incidents. This is not bad luck; it is the predictable result of specific architectural and operational failures that are well understood and entirely preventable if they are addressed at design time rather than discovered at incident time. Each failure mode below has a specific architectural response, and the response must be in the design from the start. It cannot be retrofitted after a failure reveals it is missing.
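The single most common failure mode, backups that have never actually been restored, has a direct architectural response: scheduled, automated restore verification. A minimal sketch in Python of the verification step, assuming a checksum manifest captured at backup time and a scratch location the restore lands in; the function names and failure labels are illustrative, not any vendor's API:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large dumps never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(manifest: dict[str, str], restore_root: Path) -> list[str]:
    """Compare a completed restore against the checksum manifest captured
    at backup time. Returns a list of failures; an empty list means every
    file in the manifest was restored and matches its original checksum."""
    failures = []
    for rel_path, expected in sorted(manifest.items()):
        restored = restore_root / rel_path
        if not restored.is_file():
            failures.append(f"MISSING {rel_path}")
        elif sha256_of(restored) != expected:
            failures.append(f"CORRUPT {rel_path}")
    return failures
```

Run after every scheduled restore rehearsal; any non-empty result fails the run and pages a human, so a backup that cannot be restored is discovered on a Tuesday afternoon rather than during the incident.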
Recovery Architecture Classification
What each recovery tier actually means and what it costs to implement
Recovery tiers are defined by RTO and RPO targets. The tier appropriate for a given system is determined by the Business Impact Analysis for that system: the financial, regulatory and operational consequence of downtime at each duration. Every system in your environment requires a tier classification, and getting the tier wrong in either direction is expensive: under investment means the RTO cannot be met; over investment means you are paying for capability you do not need. The infrastructure implementation costs below are indicative ranges; actual costs depend on system complexity, data volumes, vendor pricing and existing infrastructure.
| Tier | RTO Target | RPO Target | Architecture Method | Typical Implementation Cost (per system) | Appropriate For |
|---|---|---|---|---|---|
| Tier 0 Hot Standby | < 15 minutes | Zero data loss | Synchronous replication to active standby. Automated failover with no manual steps. Load balancer or DNS health check triggers redirect. | £40,000 to £250,000+ annually in infrastructure. Doubles the compute and storage footprint. Requires low latency network between primary and standby, which typically limits DR site distance. | Payment processing. Clinical systems with direct patient safety impact. Real time trading platforms. Systems where any data loss creates regulatory breach. |
| Tier 1 Warm Cloud | 1 to 4 hours | < 15 minutes | Asynchronous replication to cloud hosted replica. Pre staged recovery infrastructure provisioned but not fully running. Automated provisioning completes on activation. Manual approval step before traffic redirect. | £8,000 to £60,000 annually per system in cloud infrastructure and replication costs. Significant reduction from Tier 0 by accepting a 1 to 4 hour RTO and minimal replication lag. | Core ERP, CRM and operational systems. Primary databases supporting multiple business critical applications. Systems where 1 to 2 hours of downtime causes significant but survivable financial impact. |
| Tier 2 Warm Backup | 4 to 24 hours | < 4 hours | Scheduled incremental backup to cloud or secondary storage. Recovery infrastructure provisioned from IaC templates on activation. Manual recovery execution from documented runbooks. | £2,000 to £20,000 annually per system. Lower infrastructure cost but higher manual effort at recovery time and higher RTO exposure. | Line of business applications with moderate criticality. Supporting systems whose failure affects productivity but not immediate revenue or safety. Systems with natural daily break points that define acceptable RPO. |
| Tier 3 Cold Backup | 24 to 72 hours | < 24 hours | Full backup on defined schedule. No pre staged infrastructure. Recovery requires manual provisioning from scratch. Suitable only for systems whose MTPD allows multi-day downtime. | £500 to £5,000 annually per system in storage costs. Lowest infrastructure cost. Highest recovery time. Most commonly selected inappropriately for systems whose actual MTPD is shorter than the RTO. | Archive and historical data systems. Development and test environments. Low usage internal tools with no revenue or safety dependency. Systems with defined seasonal usage where off season downtime is acceptable. |
Engagement Tiers: Scope, Price and Timeline
Three engagement tiers: the design phase is fixed price; the validation phase is priced after implementation.
Each engagement has two phases: Design (fixed price, delivered by RJV) and Validation (fixed price, scoped and agreed after the implementation is complete). The gap between phases is implementation, which is executed by your team or an infrastructure partner and is outside the scope of this engagement. The validation phase cannot begin until implementation is sufficiently complete to test. In practice this means the end to end timeline from engagement start to validated recovery capability depends substantially on how quickly implementation proceeds, which is outside our control.
Bilateral Obligations
What both parties commit to and what happens when either fails.
These obligations are in the contract before work begins. The DR architecture engagement has specific dependencies that make bilateral obligations especially critical: the design depends on accurate information about the current state; the validation depends on implementation being complete and correct; and the programme spans two separately contracted phases with a client controlled implementation gap between them. Both phases require active participation from your team at defined points.
Questions to ask before signing anything
Start with a recovery assessment, and bring your last test results, however old, or the fact that there are none.
A 90 minute session in which we review your current DR infrastructure, your last test results if any exist, and the RTOs your BCP or regulatory framework commits you to. We assess the gap between the committed RTOs and the actual recovery capability the current infrastructure can deliver.
At the end of the session, you know whether your DR capability is adequate, marginal or fundamentally inadequate for your risk and regulatory position. Most organizations find the assessment uncomfortable.
The gap between what the BCP says the DR capability delivers and what it actually delivers is, in most cases, larger than anyone has previously acknowledged explicitly. That acknowledgement, unpleasant as it is, is the necessary starting point for any genuine improvement.