2.3 High Availability

System uptime

  • 99% availability => up to 3.65 days of downtime per year
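
The arithmetic behind this figure can be sketched as follows. This is a minimal illustration, assuming a 365-day year; the function name `downtime_per_year` is hypothetical and not part of the original notes.

```python
def downtime_per_year(availability_percent: float, days_per_year: float = 365.0) -> float:
    """Return the maximum downtime, in days per year, for an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    # The unavailable fraction of the year is (100 - availability) / 100.
    return (100.0 - availability_percent) / 100.0 * days_per_year


if __name__ == "__main__":
    # Compare a few common targets ("two nines", "three nines", "four nines").
    for target in (99.0, 99.9, 99.99):
        days = downtime_per_year(target)
        print(f"{target}% -> {days:.4f} days ({days * 24 * 60:.1f} minutes) per year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why higher targets become expensive quickly.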

Request Success Ratio

  • 99% availability => 1 failed request out of 100
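
Measured this way, availability is the share of requests that succeed rather than the share of time the system is up. A minimal sketch, assuming request counts come from logs or metrics; the function name `request_success_ratio` is hypothetical:

```python
def request_success_ratio(successful: int, total: int) -> float:
    """Return availability as the fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successful <= total:
        raise ValueError("successful must be between 0 and total")
    return successful / total


if __name__ == "__main__":
    # 1 failed request out of 100 corresponds to 99% availability.
    ratio = request_success_ratio(successful=99, total=100)
    print(f"availability: {ratio:.2%}")
```

This definition weights busy periods more heavily than quiet ones, so an outage at peak traffic costs more availability than an outage of the same length at night.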

Principles Behind High Availability

  • Build redundancy to eliminate single points of failure
    • regions, availability zones, fallback, data replication, high availability pair, etc.
  • Switch from one server to another without losing data
    • DNS, load balancing, reverse proxy, API gateway, peer discovery, service discovery, etc.
  • Protect the system from atypical client behavior
    • Load shedding, rate limiting, shuffle sharding, cell-based architecture, etc.
  • Protect the system from failures and performance degradation of its dependencies
    • Timeouts, circuit breaker, bulkhead, retries, idempotency, etc.
  • Detect failures as they occur
    • Monitoring, etc.
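
As one illustration of protecting a system from a failing dependency, here is a minimal circuit breaker sketch. It is not a production implementation: the class names, the default thresholds, and the injectable `clock` parameter are all assumptions made for this example, and the code is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None     # None means the circuit is closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a dependency known to be unhealthy.
                raise CircuitOpenError("dependency calls are temporarily blocked")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(lambda: fetch_profile(user_id))`, where `fetch_profile` stands in for any remote call. In practice a breaker like this is usually combined with timeouts on the wrapped call and with retries that are safe only when the operation is idempotent.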

Processes Behind High Availability

  • Change management
    • all code and configuration changes are reviewed and approved
  • QA
    • regularly run tests to validate that newly introduced changes meet functional and non-functional requirements
  • Deployment
    • deploy changes to production frequently, quickly, and safely; roll back automatically on failure
  • Capacity planning
    • monitor system utilization and add resources to meet growing demand
  • Disaster recovery
    • recover the system quickly in the event of a disaster; regularly test disaster recovery failover
  • Root cause analysis
    • establish the root cause of the failure and identify preventive measures
  • Operational readiness review
    • evaluate the system's operational state and identify gaps in operations; define actions to remediate risks
  • Game day
    • simulate a failure or event and test how the system and the team respond
  • Team culture
    • Good team culture promotes process discipline

Service-level Objective (SLO)

  • a target level of availability, e.g. 99.99%
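
An SLO implies a tolerance for failure, often called an error budget in site reliability practice (a term not used in these notes). The sketch below assumes a request-based SLO measured over a fixed window; the function names `error_budget` and `budget_remaining` are hypothetical.

```python
def error_budget(slo_percent: float, total_requests: int) -> float:
    """Return how many requests may fail in the window without missing the SLO."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    return (100.0 - slo_percent) / 100.0 * total_requests


def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative once the SLO is missed)."""
    budget = error_budget(slo_percent, total_requests)
    if budget == 0:
        raise ValueError("a 100% SLO or an empty window leaves no error budget")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # With a 99.99% SLO, about 100 of every 1,000,000 requests may fail.
    print(f"budget: {error_budget(99.99, 1_000_000):.0f} failed requests")
    print(f"remaining: {budget_remaining(99.99, 1_000_000, 25):.0%}")
```

Tracking the remaining budget gives a team a concrete signal for when to slow down risky changes and when there is room to ship faster.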