Monitoring Trifecta - CloudWatch

Basics

  • "eyes and ears" of your AWS env
  • the "central nervous system" of the cloud
    • collects signals (data), makes sense of them (analysis), and can trigger reflexes (automation) when something goes wrong
  • a collection of tools

Types of Data

  • Metrics: numerical data like "what is the CPU usage of my server right now?"
  • Logs: text-based records
  • Events: changes in the AWS env that can trigger automated responses

Goals (problems it solves):

  • Data Silos: gives a single location to view the status of the entire cloud env
  • Reactive Fixing: enables proactive monitoring; when something happens, it reports it
  • Manual Scaling: it can tell AWS to automatically add more servers to handle the load
  • Hidden Errors: provides deep visibility by letting you search through logs

Why is it needed:

  • Operational Health: know something's wrong the moment it happens
  • Cost Optimization: identify zombie resources (idle or unused resources that still cost money)
  • Security & Compliance: monitor logs and spot unusual activity (e.g. an IP address logs in 1k times in a minute, so block it automatically)
  • Troubleshooting (MTTR): reduces the Mean Time to Resolution. When an error occurs, you can correlate a spike in a metric (like latency) with a specific log entry to find the root cause faster

Components

  • Dashboards: visual graphs of your metrics
  • Alarms: watch a metric and act when it crosses a threshold, e.g. if CPU > 80% for 5 mins, send an alert
  • Logs Insights: query language to search through logs
  • Synthetics: scripts that "ping" your website to check for availability and broken links
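
The alarm example in the list above (CPU > 80% for 5 mins, send alert) can be sketched as the parameters of a CloudWatch PutMetricAlarm request. This is a minimal sketch, not a definitive setup: the instance ID, the SNS topic ARN, and the helper name cpu_alarm_params are hypothetical, and the final boto3 call is left as a comment because it assumes boto3 is installed and credentials are configured.

```python
def cpu_alarm_params(instance_id, sns_topic_arn, threshold=80.0, minutes=5):
    """Build keyword arguments for a CloudWatch PutMetricAlarm request."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        # One evaluation period of `minutes` minutes, expressed in seconds.
        "Period": minutes * 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Alarm action: publish to an SNS topic (hypothetical ARN below).
        "AlarmActions": [sns_topic_arn],
    }


params = cpu_alarm_params(
    "i-0123456789abcdef0",                              # hypothetical instance
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",    # hypothetical topic
)

# Assumption: boto3 is available and AWS credentials are configured.
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain dict makes the threshold and period easy to review before anything is sent to AWS.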

CloudWatch Logs

  • can send logs to
    • Amazon S3
    • Kinesis Data Streams
    • Kinesis Data Firehose
    • AWS Lambda
    • OpenSearch
  • Sources include:
    • SDK, CloudWatch Logs Agent, CloudWatch Unified Agent
    • Elastic Beanstalk: collection of logs from the application
    • ECS: collection from containers
    • AWS Lambda: collection from function logs
    • VPC Flow Logs: VPC-specific logs
    • API Gateway
    • CloudTrail, based on a filter
    • Route 53: logs DNS queries

CloudWatch Logs Insights

  • Search and analyze log data
  • e.g. find a specific IP inside a log, or count occurrences of "ERROR" in your logs
  • Provides a purpose-built query language
  • Can query multiple log groups in different AWS accounts
  • a query engine, not a real-time engine
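
The "count occurrences of ERROR" example above might look like the following Logs Insights query, wrapped in the parameters of a StartQuery request. This is a sketch under assumptions: the log group name and the helper insights_query_params are hypothetical, and the boto3 calls are shown only as comments.

```python
import time

# Count log lines containing "ERROR", bucketed into 5-minute bins.
ERROR_COUNT_QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /ERROR/ "
    "| stats count(*) as error_count by bin(5m)"
)


def insights_query_params(log_group, query, lookback_seconds=3600, now=None):
    """Build keyword arguments for a CloudWatch Logs StartQuery request."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - lookback_seconds,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


params = insights_query_params("/aws/lambda/demo", ERROR_COUNT_QUERY)

# Assumption: boto3 is available and AWS credentials are configured.
#   import boto3
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**params)["queryId"]
#   results = logs.get_query_results(queryId=query_id)  # poll until complete
```

The start-then-poll flow in the comments matches the note above: Insights runs queries over stored logs rather than streaming results in real time.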

CloudWatch Logs Subscriptions

  • get real-time log events from CloudWatch Logs for processing and analysis
  • Send to Kinesis Data Streams, Kinesis Data Firehose, or Lambda
  • Subscription Filter: filters which log events are delivered to your destination
  • Cross-Account Subscription: send log events to resources in a different AWS account (Kinesis Data Streams, Kinesis Data Firehose)
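
When the subscription destination is Lambda, the function receives the log events as a base64-encoded, gzip-compressed JSON document under event["awslogs"]["data"]. The sketch below decodes that payload and counts ERROR lines; the handler logic, the log group name, and the make_sample_event helper (used only for local testing) are illustrative assumptions, not part of the original notes.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload CloudWatch Logs delivers to a Lambda subscription."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    """Hypothetical Lambda handler: count delivered log events containing ERROR."""
    payload = decode_subscription_event(event)
    errors = [e["message"] for e in payload["logEvents"] if "ERROR" in e["message"]]
    return {"logGroup": payload["logGroup"], "errorCount": len(errors)}


def make_sample_event(messages, log_group="/demo/app"):
    """Build a fake subscription event for local testing (not an AWS API)."""
    payload = {
        "messageType": "DATA_MESSAGE",
        "logGroup": log_group,
        "logStream": "demo-stream",
        "logEvents": [
            {"id": str(i), "timestamp": 0, "message": m}
            for i, m in enumerate(messages)
        ],
    }
    raw = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(raw).decode("ascii")}}


result = handler(make_sample_event(["ok", "ERROR boom", "ERROR again"]), None)
```

The subscription filter decides which events reach the function at all, so the in-handler check here is only a second, finer filter.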

CloudWatch Logs Agent & Unified Agent

  • CloudWatch Logs Agent
    • old version of the agent
    • can only send to CloudWatch Logs
  • CloudWatch Unified Agent
    • collects additional system-level metrics such as RAM, processes, etc
    • collects logs to send to CloudWatch Logs
    • Centralized configuration using SSM Parameter Store
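
A Unified Agent configuration that collects a memory metric and one log file could be sketched as below. The key names follow the agent's JSON configuration format as I understand it and should be checked against the agent documentation; the file path, log group name, parameter name, and the helper build_agent_config are hypothetical.

```python
import json


def build_agent_config(log_file, log_group):
    """Build a minimal CloudWatch Unified Agent configuration as a dict."""
    return {
        "metrics": {
            "metrics_collected": {
                # RAM usage is not reported by EC2 itself, so the agent collects it.
                "mem": {"measurement": ["mem_used_percent"]},
            }
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_file,
                            "log_group_name": log_group,
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }


config = build_agent_config("/var/log/app.log", "demo-app")
config_json = json.dumps(config, indent=2)

# Assumption: boto3 is available; storing the JSON in SSM Parameter Store
# lets many instances share one configuration.
#   import boto3
#   boto3.client("ssm").put_parameter(
#       Name="AmazonCloudWatch-demo-config",  # hypothetical parameter name
#       Value=config_json, Type="String", Overwrite=True)
```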

CloudWatch Alarms

  • Targets:
    • stop, terminate, reboot, or recover an EC2 instance
    • trigger auto scaling action
    • send notification to SNS
  • Composite Alarms
    • used to monitor the states of multiple other alarms
    • combine them with AND and OR conditions
    • e.g. if alarm A and alarm B are in the ALARM state at the same time, trigger an action
  • to test alarms and notifications, set the alarm state to ALARM using the CLI:
    • aws cloudwatch set-alarm-state --alarm-name "myalarm" --state-value ALARM --state-reason "testing purposes"
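
The AND / OR conditions of a composite alarm are written as a rule expression over the child alarms. The sketch below builds such an expression; the alarm names and the helper composite_rule are hypothetical, and the boto3 call is shown only as a comment.

```python
def composite_rule(alarm_names, operator="AND"):
    """Join child alarms into a composite alarm rule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    # Each child contributes ALARM("name"), true while that alarm is in ALARM.
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


rule = composite_rule(["high-cpu", "high-latency"])

# Assumption: boto3 is available and both child alarms already exist.
#   import boto3
#   boto3.client("cloudwatch").put_composite_alarm(
#       AlarmName="cpu-and-latency",  # hypothetical name
#       AlarmRule=rule,
#       AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"])
```

With AND, the composite alarm fires only when both children are in ALARM, which helps cut noise from a single metric spiking on its own.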

Difference between AWS Config and CloudWatch?

  • AWS Config monitors configuration compliance, while CloudWatch monitors performance
  • CloudWatch is best for real-time monitoring and auto-scaling, while AWS Config is best for tracking configuration changes over time

The "Trifecta"

  • CloudWatch: tells you what is happening (high CPU)
  • AWS Config: tells you what changed in the setup (someone changed the instance to a smaller one)
  • CloudTrail: tells you who did it (user "Admin_Bob" made the change)