2.3 High Availability

System uptime

99% availability => up to 3.65 days of downtime per year
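
The arithmetic behind that figure can be sketched in a few lines of Python. The function name `downtime_per_year` and the 365-day year are assumptions made for illustration; they are not part of the notes above.

```python
def downtime_per_year(availability_percent: float, days_per_year: float = 365.0) -> float:
    """Return the maximum downtime, in days per year, implied by an uptime percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    # The unavailable fraction of the year is (100 - availability) / 100.
    return (1.0 - availability_percent / 100.0) * days_per_year


if __name__ == "__main__":
    # Illustrative targets only; pick whatever levels matter for your system.
    for target in (99.0, 99.9, 99.99):
        days = downtime_per_year(target)
        print(f"{target}% -> {days:.4f} days ({days * 24 * 60:.1f} minutes) per year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why the gap between 99% and 99.99% matters in practice.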

Request Success Ratio

99% availability => 1 failed request out of 100

Principles Behind High Availability

Build redundancy to eliminate single points of failure
- regions, availability zones, fallback, data replication, high availability pair, etc.
Switch from one server to another without losing data
- DNS, load balancing, reverse proxy, API gateway, peer discovery, service discovery, etc.
Protect the system from atypical client behavior
- Load shedding, rate limiting, shuffle sharding, cell-based architecture, etc.
Protect the system from failures and performance degradation of its dependencies
- Timeouts, circuit breaker, bulkhead, retries, idempotency, etc.
Detect failures as they occur
- Monitoring, etc.
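
As one illustration of protecting a system from a failing dependency, here is a minimal circuit breaker sketch. The class names, the default thresholds, and the injectable `clock` parameter are all hypothetical choices made for this example. It is not thread-safe and is not taken from any particular library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Stops calling a dependency after repeated failures, then retries later."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None     # None means the circuit is closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a broken dependency.
                raise CircuitOpenError("dependency calls are suspended")
            # Timeout elapsed: half-open, so let this trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(lambda: fetch_profile(user_id))`, where `fetch_profile` stands in for any remote call, and treat `CircuitOpenError` as a signal to use a fallback. In practice a breaker is usually combined with timeouts and bounded retries rather than used alone.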

Processes Behind High Availability

Change management
- all code and configuration changes are reviewed and approved
QA
- regularly run tests to validate that newly introduced changes meet functional and non-functional requirements
Deployment
- deploy changes to a production environment frequently, quickly, safely; automated rollback
Capacity planning
- monitor system utilization and add resources to meet growing demand
Disaster recovery
- recover the system quickly in the event of a disaster; regularly test failover to the disaster recovery environment
Root cause analysis
- establish the root cause of the failure and identify preventive measures
Operational readiness review
- evaluate the system's operational state and identify gaps in operations; define actions to remediate risks
Game day
- simulate a failure or event and test system and team responses
Team culture
- Good team culture promotes process discipline

Service-level Objective (SLO)

a target level of availability, e.g. 99.99%
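
To make the target concrete, the sketch below checks a request-based SLO against observed traffic and reports how many more failures the target tolerates (often called the error budget). The names `ErrorBudget` and `error_budget` and the request counts in the usage note are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    allowed_failures: float    # failures the SLO tolerates for this traffic volume
    remaining_failures: float  # negative once the SLO has been missed
    slo_met: bool


def error_budget(slo_percent: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compare observed failures with the number a request-based SLO allows."""
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("slo_percent must be in (0, 100]")
    if total_requests < 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")
    # A 99.99% target leaves 0.01% of requests free to fail.
    allowed = total_requests * (100.0 - slo_percent) / 100.0
    remaining = allowed - failed_requests
    return ErrorBudget(allowed, remaining, slo_met=failed_requests <= allowed)
```

For example, `error_budget(99.99, 1_000_000, 40)` reports roughly 100 allowed failures with about 60 remaining, while 150 failures over the same traffic would mean the target was missed.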
