Monitoring Trifecta - CloudWatch
Basics
- "eyes and ears" of your AWS env
- the "central nervous system" of the cloud
- collects signals (data), makes sense of them (analysis), and can trigger reflexes (automation) when something goes wrong
- a collection of tools
Types of Data
- Metrics: numerical data like "what is the CPU usage of my server right now?"
- Logs: text-based records
- Events: changes in the AWS env that can trigger automated responses
Goals:
- Data Silos: single location to view the status of entire cloud env
- Reactive Fixing: enables proactive monitoring; it reports a problem as soon as it happens
- Manual Scaling: it can tell AWS to automatically add more servers to handle the load
- Hidden Errors: provides deep visibility, search through logs
Why is it needed:
- Operational Health: know something's wrong the moment it happens
- Cost Optimization: identify zombie resources
- Security & Compliance: monitor logs to spot unusual activity (e.g., an IP address appears in the logs 1k times in a minute, so block it automatically)
- Troubleshooting (MTTR): reduces the Mean Time to Resolution. When an error occurs, you can correlate a spike in a metric (like latency) with a specific log entry to find the root cause faster
Components
- Dashboards: visual graphs of your metrics
- Alarms: if CPU > 80% for 5 mins, send alert
- Logs Insights: query language to search through logs
- Synthetics: scripts that "ping" your website to check for availability and broken links
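To make the alarm rule above ("if CPU > 80% for 5 mins, send alert") concrete, here is a minimal sketch in plain Python. It assumes one datapoint per minute. The helper `alarm_state` and the sample readings are hypothetical; this only illustrates the idea and is not how CloudWatch evaluates alarms internally.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float, periods: int) -> str:
    """Return "ALARM" if the last `periods` datapoints all exceed `threshold`.

    Hypothetical helper that mimics a "CPU > 80% for 5 mins" rule with
    one datapoint per minute.
    """
    if len(datapoints) < periods:
        # Not enough readings yet to decide either way.
        return "INSUFFICIENT_DATA"
    recent = datapoints[-periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# One CPU reading per minute (made-up values).
cpu_percent = [42.0, 85.5, 91.0, 88.2, 95.1, 83.7]
print(alarm_state(cpu_percent, threshold=80.0, periods=5))  # prints "ALARM"
```

The three return values mirror the alarm states CloudWatch uses: OK, ALARM, and INSUFFICIENT_DATA.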
CloudWatch Logs
- can send logs to:
  - Amazon S3
  - Kinesis Data Streams
  - Kinesis Data Firehose
  - AWS Lambda
  - OpenSearch
- Sources include: SDK, CloudWatch Logs Agent, CloudWatch Unified Agent
- Elastic Beanstalk: collection of logs from application
- ECS: collection from containers
- AWS Lambda: collection from function logs
- VPC Flow Logs: VPC specific logs
- API Gateway
- CloudTrail: based on a filter
- Route53: Log DNS queries
CloudWatch Logs Insights
- Search and analyze log data
- e.g., find a specific IP inside a log, or count occurrences of "ERROR" in your logs
- Provides a purpose-built query language
- Can query multiple log groups across different AWS accounts
- a query engine, not a real-time engine
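As a rough illustration of the "count occurrences of ERROR" example, the sketch below holds a Logs Insights query in a Python string and builds the request parameters for it. The log group name `/my/app/logs` and the helper `build_query_request` are made up for this example. The actual AWS call is left as a comment because it needs boto3 and credentials.

```python
from datetime import datetime, timedelta, timezone

# Logs Insights query: keep lines containing "ERROR", then count them.
ERROR_COUNT_QUERY = "filter @message like /ERROR/ | stats count(*) as error_count"


def build_query_request(log_group: str, minutes: int, now: datetime) -> dict:
    """Build keyword arguments for a query over the last `minutes` minutes."""
    start = now - timedelta(minutes=minutes)
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(now.timestamp()),
        "queryString": ERROR_COUNT_QUERY,
    }


# "/my/app/logs" is a hypothetical log group name.
request = build_query_request("/my/app/logs", 60, datetime.now(timezone.utc))
print(request["queryString"])
# With boto3 installed and credentials configured, this could be passed on as:
#   boto3.client("logs").start_query(**request)
```

Because Logs Insights is a query engine rather than a real-time one, the query runs over a fixed time window and the results are fetched afterwards.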
CloudWatch Logs Subscriptions
- get real-time log events from CloudWatch Logs for processing and analysis
- Send to Kinesis Data Streams, Kinesis Data Firehose, or Lambda
- Subscription Filter: filter which log events are delivered to your destination
- Cross-Account Subscription: send log events to resources in a different AWS account (KDS, KDF)
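When a subscription delivers log events to Lambda, the payload arrives base64-encoded and gzip-compressed under `event["awslogs"]["data"]`. The sketch below shows how a handler might unpack it. The function names and the sample event are hypothetical and only meant for trying the decoding locally.

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> list:
    """Return the log messages carried by a CloudWatch Logs subscription event."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    return [log_event["message"] for log_event in payload["logEvents"]]


def make_sample_event(messages: list) -> dict:
    """Build a fake event in the same shape, for local testing only."""
    payload = {
        "logGroup": "/my/app/logs",  # hypothetical log group name
        "logEvents": [
            {"id": str(i), "timestamp": 0, "message": m}
            for i, m in enumerate(messages)
        ],
    }
    data = base64.b64encode(gzip.compress(json.dumps(payload).encode("utf-8")))
    return {"awslogs": {"data": data.decode("ascii")}}


sample = make_sample_event(["ERROR db timeout", "INFO ok"])
print(decode_subscription_event(sample))  # prints ['ERROR db timeout', 'INFO ok']
```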
CloudWatch Logs Agent & Unified Agent
- CloudWatch Logs Agent
  - old version of the agent
  - can only send to CloudWatch Logs
- CloudWatch Unified Agent
  - collect additional system-level metrics such as RAM, processes, etc
  - collect logs to send to CloudWatch Logs
  - Centralized configuration using SSM Parameter Store
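For a sense of what a Unified Agent configuration looks like, here is a small sketch that builds one as a Python dict and prints it as JSON. It collects one memory metric and one log file. The file path and log group name are invented for illustration, and a real configuration usually has more sections.

```python
import json

# Minimal agent configuration sketch: one RAM metric plus one log file.
agent_config = {
    "metrics": {
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/myapp/app.log",  # hypothetical path
                        "log_group_name": "/my/app/logs",  # hypothetical name
                        "log_stream_name": "{instance_id}",
                    }
                ]
            }
        }
    },
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```

The resulting JSON is the kind of document that could be stored once in SSM Parameter Store and shared by many instances.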
CloudWatch Alarms
- Targets:
  - stop, terminate, reboot, or recover an EC2 instance
  - trigger auto scaling action
  - send notification to SNS
- Composite Alarms
  - used to monitor the states of multiple other alarms
  - AND and OR conditions
  - e.g., if alarm A and alarm B are both in ALARM at the same time, trigger an action
- to test alarms and notifications, set the alarm state to ALARM using the CLI:
  aws cloudwatch set-alarm-state --alarm-name "myalarm" --state-value ALARM --state-reason "testing purposes"
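The AND/OR logic of composite alarms described above can be sketched in a few lines of Python. The alarm names `high-cpu` and `high-latency` and both helper functions are hypothetical. `alarm_rule` only renders a rule expression in the `ALARM("name") AND ALARM("name")` form that composite alarms use; it does not call AWS.

```python
def composite_state(states: dict, required: list, mode: str = "AND") -> str:
    """Combine child alarm states: ALARM if all (AND) or any (OR) are in ALARM."""
    in_alarm = [states[name] == "ALARM" for name in required]
    fired = all(in_alarm) if mode == "AND" else any(in_alarm)
    return "ALARM" if fired else "OK"


def alarm_rule(names: list, mode: str = "AND") -> str:
    """Render a rule expression such as ALARM("a") AND ALARM("b")."""
    return f" {mode} ".join(f'ALARM("{name}")' for name in names)


# Hypothetical child alarms and their current states.
states = {"high-cpu": "ALARM", "high-latency": "OK"}
children = ["high-cpu", "high-latency"]

print(alarm_rule(children))                    # prints ALARM("high-cpu") AND ALARM("high-latency")
print(composite_state(states, children, "AND"))  # prints OK
print(composite_state(states, children, "OR"))   # prints ALARM
```

With AND, the composite alarm stays quiet until both children fire, which helps cut down on noisy notifications.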
### Diff between AWS Config and AWS CloudWatch?
- AWS Config monitors **Compliance** while CloudWatch monitors **Performance**
- CloudWatch is best for real-time monitoring and auto scaling, while AWS Config is best for tracking configuration changes over time
### The "Trifecta"
- **CloudWatch**: tells you **what** is happening (high CPU)
- **AWS Config**: tells you **what changed** in the setup (someone changed the instance to a smaller one)
- **CloudTrail**: tells you **who** did it (user "Admin_Bob" made the change)