2.3 High Availability
System uptime
- 99% availability => offline 3.65 days a year
Request Success Ratio
- 99% availability => 1 failed request out of 100
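The arithmetic behind both measures can be sketched in a few lines. The function names below and the 365-day year are assumptions made for this illustration, not part of any standard library.

```python
def downtime_days_per_year(availability_pct: float, days_per_year: float = 365.0) -> float:
    """Expected days offline per year for a given uptime percentage."""
    return (100.0 - availability_pct) / 100.0 * days_per_year


def expected_failed_requests(availability_pct: float, total_requests: int) -> float:
    """Expected number of failed requests for a given request success ratio."""
    return (100.0 - availability_pct) / 100.0 * total_requests


if __name__ == "__main__":
    # Compare a few common availability targets.
    for pct in (99.0, 99.9, 99.99):
        hours = downtime_days_per_year(pct) * 24
        failed = expected_failed_requests(pct, 1_000_000)
        print(f"{pct}%: about {hours:.2f} hours offline per year, "
              f"about {failed:.0f} failed requests per million")
```

Each additional "nine" cuts the allowed downtime and the allowed failures by a factor of ten, which is why small-looking changes to the target can be expensive to achieve.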
Principles Behind High Availability
- Build redundancy to eliminate single points of failure
  - Regions, availability zones, fallback, data replication, high availability pair, etc.
- Switch from one server to another without losing data
  - DNS, load balancing, reverse proxy, API gateway, peer discovery, service discovery, etc.
- Protect the system from atypical client behavior
  - Load shedding, rate limiting, shuffle sharding, cell-based architecture, etc.
- Protect the system from failures and performance degradation of its dependencies
  - Timeouts, circuit breaker, bulkhead, retries, idempotency, etc.
- Detect failures as they occur
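To make one of the dependency-protection techniques listed above concrete, here is a minimal, single-threaded sketch of a circuit breaker. The class name, the thresholds, and the injectable `clock` parameter are all assumptions chosen for this example. Production libraries typically add thread safety, metrics, and a stricter half-open state.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self._clock = clock                         # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None     # None means the circuit is closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a broken dependency.
                raise CircuitOpenError("dependency calls are suspended")
            # Cool-down elapsed: half-open, so let this trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(lambda: fetch_profile(user_id))`, where `fetch_profile` is a hypothetical function, and treat `CircuitOpenError` as a signal to serve a fallback. Failing fast like this keeps threads and connections from piling up behind a dependency that is already struggling.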
Processes Behind High Availability
- Change management
  - all code and configuration changes are reviewed and approved
- QA
  - regularly run tests to validate that newly introduced changes meet functional and non-functional requirements
- Deployment
  - deploy changes to a production environment frequently, quickly, and safely, with automated rollback
- Capacity planning
  - monitor system utilization and add resources to meet growing demand
- Disaster recovery
  - recover the system quickly in the event of a disaster; regularly test failover to disaster recovery
- Root cause analysis
  - establish the root cause of the failure and identify preventive measures
- Operational readiness review
  - evaluate the system's operational state and identify gaps in operations; define actions to remediate risks
- Game day
  - simulate a failure or event and test system and team responses
- Team culture
  - good team culture promotes process discipline
Service-level Objective (SLO)
- a target level of availability, e.g., 99.99%
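One common way to work with an SLO is to turn it into an "error budget": the number of requests that may fail before the target is missed. The sketch below assumes the SLO is measured as a request success ratio. The function names are hypothetical, and the SLO is passed as a string so that `Decimal` can avoid floating-point rounding.

```python
from decimal import Decimal


def error_budget(slo_pct: str, total_requests: int) -> int:
    """Number of requests allowed to fail while still meeting the SLO."""
    allowed_fraction = (Decimal(100) - Decimal(slo_pct)) / Decimal(100)
    return int(total_requests * allowed_fraction)  # round down to whole requests


def meets_slo(slo_pct: str, total_requests: int, failed_requests: int) -> bool:
    """True if the observed failures fit within the error budget."""
    return failed_requests <= error_budget(slo_pct, total_requests)


if __name__ == "__main__":
    budget = error_budget("99.99", 1_000_000)
    print(f"A 99.99% SLO over 1,000,000 requests allows {budget} failures")
    print("80 failures within budget:", meets_slo("99.99", 1_000_000, 80))
    print("150 failures within budget:", meets_slo("99.99", 1_000_000, 150))
```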