When More Alerts Make You Less Safe
There's a paradox at the center of security monitoring: adding more alerts can make your security posture worse. When engineers receive more alerts than they can meaningfully review, they start ignoring them. Alerts that once triggered investigation become background noise. The monitoring system that was supposed to make the team more responsive instead erodes their trust in alerts entirely — and they miss the real incidents buried in the noise.
This isn't a hypothetical. According to security operations surveys, the average SOC analyst spends a significant portion of their day on alerts that turn out to be false positives or low-priority events. For small teams, the problem is more acute: you don't have dedicated analysts, so alert fatigue competes directly with feature work and other engineering priorities.
This guide covers how to diagnose alert fatigue in your AWS environment and rebuild your monitoring configuration so that every alert that fires demands — and receives — attention.
Diagnosing Your Current Alert State
Before you can fix alert fatigue, you need to understand its scope. For each alert in your system, answer these questions:
- When did this alert last fire?
- What was done when it fired?
- What percentage of firings led to actual action?
- What percentage were false positives or low-priority events?
If you can't answer these questions — if your alerts fire into Slack channels that no one checks, or email inboxes that are ignored — you already have alert fatigue. The alerts are firing but not being acted on.
A practical audit: look at your last 30 days of CloudWatch alarms in ALARM state. For each one, check your incident tickets, Slack history, and runbook completions. How many were actually investigated? How many were dismissed or ignored?
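One way to turn this audit into numbers is sketched below in Python. It assumes you have already exported the 30 days of firings into simple records by hand or by script. The `Firing` type, its field names, and the sample alarm names are hypothetical, not a CloudWatch export format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Firing:
    """One transition into ALARM state (hypothetical record shape)."""
    alarm_name: str
    investigated: bool   # did anyone look at it?
    action_taken: bool   # did the investigation lead to a change or escalation?


def action_rates(firings):
    """Return {alarm_name: (total_firings, fraction_that_led_to_action)}."""
    totals = defaultdict(int)
    actioned = defaultdict(int)
    for f in firings:
        totals[f.alarm_name] += 1
        if f.investigated and f.action_taken:
            actioned[f.alarm_name] += 1
    return {name: (n, actioned[name] / n) for name, n in totals.items()}


# Illustrative data only; real input would come from tickets and chat history.
sample = [
    Firing("lambda-error-rate", investigated=True, action_taken=False),
    Firing("lambda-error-rate", investigated=False, action_taken=False),
    Firing("lambda-error-rate", investigated=False, action_taken=False),
    Firing("root-login", investigated=True, action_taken=True),
]

rates = action_rates(sample)
# Print the least-actioned alarms first: these are the tuning candidates.
for name, (count, rate) in sorted(rates.items(), key=lambda kv: kv[1][1]):
    print(f"{name}: {count} firings, {rate:.0%} led to action")
```

Alarms at the top of this list fire often and rarely lead to action, which makes them the first candidates for tuning or removal.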
The Alert Fatigue Taxonomy
Alert fatigue has several root causes, each with a different fix:
True Positives That Don't Need Action
An alert fires, the condition is real, but it doesn't require any action. Example: a CloudWatch alarm fires every time traffic spikes on Monday morning. The spike is expected; there's no action to take. Fix: eliminate these alerts or move them to a dashboard-only visualization.
Noisy Baselines
Thresholds set without understanding normal behavior. Your Lambda error rate alarm fires three times a day during normal operations because normal operations include a 2% error rate, and your threshold is 1%. Fix: establish baselines (see our proactive monitoring guide) and set thresholds based on actual behavior.
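As a rough illustration of setting a threshold from observed behavior, the Python sketch below takes a high percentile of historical values and adds headroom. The sample data, the 99th percentile, and the 1.2 headroom factor are all assumptions chosen for the example, not recommendations.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_threshold(history, pct=99, headroom=1.2):
    """Suggest a static threshold: a high percentile of normal behavior plus headroom.

    `pct` and `headroom` are illustrative defaults, not recommendations.
    """
    return percentile(history, pct) * headroom


# Hypothetical hourly Lambda error rates (percent) observed during normal operations.
normal_error_rates = [1.8, 2.1, 1.9, 2.0, 2.3, 1.7, 2.2, 2.0, 1.9, 2.4]

threshold = suggest_threshold(normal_error_rates)
print(f"Suggested alarm threshold: {threshold:.2f}% (a 1% threshold would fire constantly)")
```

With a baseline that hovers around 2%, a threshold derived this way sits above normal behavior instead of inside it.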
Low-Severity Events Paging High
A configuration finding that should be a weekly review item is sending pages to PagerDuty. Fix: audit alert severity assignments. Not every alert should page someone at 2am.
Duplicate Alerts
GuardDuty fires an alert. Security Hub fires an alert about the same finding. Your Vigilare dashboard fires an alert. Three notifications for the same event. Fix: consolidate alert sources to avoid duplication.
Irreversible Context
An alert fires, but by the time someone investigates, the condition no longer exists (an auto-scaling event that briefly triggered a CPU alarm, for example). No investigation is possible, no action is needed, but time has been spent. Fix: add evaluation periods and minimum duration requirements to alarms.
The Signal-to-Noise Framework
For each alert in your system, categorize it on two dimensions:
- Signal quality: How often does this alert indicate a real problem? (High/Medium/Low)
- Response urgency: When it does indicate a real problem, how urgent is the response? (Immediate/Hours/Daily)
| | High Signal | Low Signal |
|---|---|---|
| Immediate Urgency | Page: PagerDuty | Fix first: improve signal |
| Hours Urgency | Notify: Slack/email | Remove or tune |
| Daily Urgency | Dashboard/digest | Remove |
Anything in the "Remove or tune" and "Remove" categories should be addressed immediately. Anything in "Fix first: improve signal" needs either better detection logic or removal.
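The matrix can also be written down as a small lookup so that every alert gets an explicit handling decision. This Python sketch covers only the two signal columns shown in the table; the enum names and handling labels are made up for illustration.

```python
from enum import Enum


class Signal(Enum):
    HIGH = "high"
    LOW = "low"


class Urgency(Enum):
    IMMEDIATE = "immediate"
    HOURS = "hours"
    DAILY = "daily"


# Mirrors the table: (signal quality, response urgency) -> handling.
ROUTING = {
    (Signal.HIGH, Urgency.IMMEDIATE): "page",
    (Signal.HIGH, Urgency.HOURS): "notify",
    (Signal.HIGH, Urgency.DAILY): "digest",
    (Signal.LOW, Urgency.IMMEDIATE): "improve-signal",
    (Signal.LOW, Urgency.HOURS): "remove-or-tune",
    (Signal.LOW, Urgency.DAILY): "remove",
}


def route(signal, urgency):
    """Return the handling decision for an alert's category."""
    return ROUTING[(signal, urgency)]


print(route(Signal.HIGH, Urgency.IMMEDIATE))
print(route(Signal.LOW, Urgency.HOURS))
```

Keeping the mapping in one place makes it harder for a low-signal alert to end up paging someone by accident.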
Tuning GuardDuty Findings
GuardDuty is one of the most common sources of alert fatigue because it flags a broad range of activity, including expected behavior that looks suspicious in isolation. Tuning approaches:
Trusted IP Lists
If your monitoring tools, CI/CD pipelines, or VPN exit nodes regularly appear in GuardDuty findings, add them to a trusted IP list. GuardDuty does not generate VPC flow log or CloudTrail findings for publicly routable addresses on an active trusted IP list; DNS-based findings are not affected:
aws guardduty create-ip-set --detector-id $DETECTOR_ID --name "TrustedIPs" --format TXT --location s3://my-bucket/trusted-ips.txt --activate
Suppression Rules
For specific finding types that are always false positives in your environment, create suppression rules, which automatically archive new findings that match a filter. For example, if a pentest firm regularly scans your infrastructure, suppress findings from their IP range. Suppression rules have no built-in schedule, so create the rule when a testing window opens and delete it when the window closes.
Finding Type Severity Calibration
Not all HIGH-severity GuardDuty findings require the same response. Evaluate your finding history and customize which finding types trigger pages vs. daily review. See our GuardDuty findings guide for finding type descriptions.
CloudWatch Alarm Best Practices for Noise Reduction
Use Evaluation Periods Appropriately
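Don't alert on a single data point. For most metrics, requiring the threshold to be breached in several recent periods filters out transient spikes. With evaluation_periods = 3 and datapoints_to_alarm = 2, the alarm fires when any 2 of the last 3 periods breach; set both to 3 if you want three consecutive breaches:
resource "aws_cloudwatch_metric_alarm" "error_rate" {
  # alarm_name, metric, comparison_operator, and threshold arguments omitted
  period              = 300 # Each data point covers 5 minutes
  evaluation_periods  = 3   # Evaluate the 3 most recent periods
  datapoints_to_alarm = 2   # Alarm when at least 2 of those 3 breach
}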
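To see how `evaluation_periods` and `datapoints_to_alarm` change alarm behavior, the Python sketch below replays a made-up metric series through a simplified "M out of N" evaluation. It is a model for intuition only: it ignores missing data and the INSUFFICIENT_DATA state, and the series and threshold are hypothetical.

```python
def alarm_states(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """For each period, True if at least `datapoints_to_alarm` of the last
    `evaluation_periods` values exceed `threshold` (simplified model)."""
    states = []
    for i in range(len(datapoints)):
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaches = sum(1 for v in window if v > threshold)
        states.append(breaches >= datapoints_to_alarm)
    return states


def alarm_transitions(states):
    """Count how many times the alarm enters the ALARM state."""
    return sum(1 for prev, cur in zip([False] + states, states) if cur and not prev)


# Hypothetical error counts per 5-minute period:
# one transient spike, then a sustained problem.
errors = [0, 9, 0, 0, 0, 8, 9, 7, 0, 0]

single_point = alarm_states(errors, threshold=5, evaluation_periods=1, datapoints_to_alarm=1)
two_of_three = alarm_states(errors, threshold=5, evaluation_periods=3, datapoints_to_alarm=2)

print("single data point:", alarm_transitions(single_point), "alarm(s)")
print("2 out of 3:       ", alarm_transitions(two_of_three), "alarm(s)")
```

In this series the single-data-point configuration alarms on the transient spike as well as the sustained problem, while the 2-out-of-3 configuration alarms only on the sustained problem, one period later than the single-point alarm would have.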
Use Anomaly Detection Instead of Static Thresholds
Static thresholds need manual tuning every time traffic patterns change. CloudWatch Anomaly Detection automatically adapts to your metric's normal behavior. See our proactive monitoring guide for setup details.
Composite Alarms for Complex Conditions
CloudWatch composite alarms combine multiple alarms with AND/OR logic. Instead of paging when CPU is high, page when CPU is high AND error rate is elevated AND there's no known maintenance window. This dramatically reduces false positives for conditions that are frequently coincidental.
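Composite alarms are defined by a rule expression over child alarm states, using functions such as `ALARM("name")` combined with AND, OR, and NOT. The Python helper below is a small sketch that builds such an expression as a string. The alarm names are hypothetical, and it assumes the maintenance window is itself modeled as an alarm that sits in ALARM state during planned work.

```python
def composite_rule(all_in_alarm, none_in_alarm=()):
    """Build a CloudWatch composite alarm rule expression.

    all_in_alarm:  child alarm names that must all be in ALARM state.
    none_in_alarm: child alarm names that must NOT be in ALARM state
                   (for example, a maintenance-window alarm).
    """
    terms = [f'ALARM("{name}")' for name in all_in_alarm]
    terms += [f'NOT ALARM("{name}")' for name in none_in_alarm]
    if not terms:
        raise ValueError("at least one child alarm is required")
    return " AND ".join(terms)


rule = composite_rule(
    all_in_alarm=["cpu-high", "error-rate-high"],  # hypothetical alarm names
    none_in_alarm=["maintenance-window"],          # hypothetical alarm name
)
print(rule)
```

The resulting string is what you would supply as the alarm rule when creating the composite alarm, for example through the `alarm_rule` argument of Terraform's `aws_cloudwatch_composite_alarm` resource.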
The Alert Ownership Model
Alert fatigue is often an organizational problem as much as a technical one. When alerts go to a shared Slack channel with no owner, everyone assumes someone else will handle them. Effective alert ownership means:
- Each alert has a designated primary owner (a team, a person)
- Acknowledgment time is tracked — alerts that aren't acknowledged within SLA are escalated
- Regular reviews (monthly) of alert handling: who responded, what they did, whether it was necessary
- Ownership includes responsibility for tuning — the owner of an alert is responsible for fixing it if it's noisy
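The acknowledgment-tracking item above could be checked with something as simple as the Python sketch below. The `Alert` record, the team names, and the 15-minute SLA are assumptions for illustration; a paging tool would normally handle escalation for you.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Alert:
    """Hypothetical alert record with an owner and acknowledgment time."""
    name: str
    owner: str
    fired_at: datetime
    acknowledged_at: Optional[datetime] = None


def needs_escalation(alert, now, sla=timedelta(minutes=15)):
    """True if the alert is still unacknowledged after the SLA has elapsed."""
    return alert.acknowledged_at is None and now - alert.fired_at > sla


# Fixed, illustrative timestamp so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
alerts = [
    Alert("root-login", "security-team", fired_at=now - timedelta(minutes=40)),
    Alert("error-rate-high", "payments-team", fired_at=now - timedelta(minutes=5)),
    Alert("cpu-high", "platform-team", fired_at=now - timedelta(minutes=30),
          acknowledged_at=now - timedelta(minutes=25)),
]

overdue = [a.name for a in alerts if needs_escalation(a, now)]
print(overdue)
```

Only alerts that are both unacknowledged and past the SLA are escalated; a recently fired alert or one that was acknowledged in time is left with its owner.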
Consolidating Alert Sources
Most AWS environments have too many alert sources producing duplicates. A rationalization approach:
- Single source of truth for security findings: Route all security findings through Security Hub. This deduplicates GuardDuty, Config, Inspector, and Macie findings and provides a consistent severity model.
- Single notification channel per tier: All Tier 1 alerts go to PagerDuty. All Tier 2 alerts go to #security-alerts Slack. All Tier 3 alerts go to a weekly email digest. No exceptions.
- Use Vigilare for multi-account aggregation: Rather than monitoring each account separately and receiving per-account alerts, Vigilare aggregates findings and alerts only on meaningful changes.
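The deduplication idea in the first item above can be illustrated with a short Python sketch. It treats notifications that share the same resource and finding type as one event; that key, the dictionary fields, and the instance IDs are assumptions for this example, not how Security Hub itself correlates findings.

```python
def deduplicate(notifications):
    """Collapse notifications about the same event into one entry.

    Two notifications count as duplicates when they share the same
    (resource, finding_type) pair; this key is an assumption of the sketch.
    """
    merged = {}
    for n in notifications:
        key = (n["resource"], n["finding_type"])
        entry = merged.setdefault(
            key,
            {"resource": n["resource"], "finding_type": n["finding_type"], "sources": []},
        )
        if n["source"] not in entry["sources"]:
            entry["sources"].append(n["source"])
    return list(merged.values())


# Hypothetical notifications: three tools reporting the same event, plus one other.
incoming = [
    {"source": "GuardDuty", "resource": "i-0abc123", "finding_type": "UnauthorizedAccess:EC2/SSHBruteForce"},
    {"source": "Security Hub", "resource": "i-0abc123", "finding_type": "UnauthorizedAccess:EC2/SSHBruteForce"},
    {"source": "Vigilare", "resource": "i-0abc123", "finding_type": "UnauthorizedAccess:EC2/SSHBruteForce"},
    {"source": "GuardDuty", "resource": "i-0def456", "finding_type": "Recon:EC2/PortProbeUnprotectedPort"},
]

events = deduplicate(incoming)
for e in events:
    print(e["resource"], e["finding_type"], "reported by", ", ".join(e["sources"]))
```

Four notifications become two events, and the list of reporting sources is kept so no context is lost.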
Closing the Feedback Loop
The most important practice for fighting alert fatigue is tracking what happens to every alert. For each alert that fires:
- Was it a true positive or false positive?
- What action was taken?
- How long did investigation take?
- Should the alert be tuned?
This data, collected even informally in a spreadsheet, lets you identify your noisiest alert sources and prioritize tuning effort. The goal is continuous improvement: every false positive is a signal to improve your detection logic. See our guide on incident response for how alert handling fits into incident management.
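If the spreadsheet is exported as CSV, a few lines of Python are enough to rank alerts by the investigation time they waste. The column names (`alert`, `outcome`, `minutes_spent`) and the rows are hypothetical; adapt them to whatever your team actually records.

```python
import csv
import io
from collections import Counter

# Hypothetical spreadsheet export; the column names are assumptions.
LOG_CSV = """alert,outcome,minutes_spent
lambda-error-rate,false_positive,10
lambda-error-rate,false_positive,15
root-login,true_positive,45
cpu-high,false_positive,5
lambda-error-rate,true_positive,30
"""


def wasted_minutes(csv_text):
    """Total investigation minutes spent on false positives, per alert."""
    wasted = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["outcome"] == "false_positive":
            wasted[row["alert"]] += int(row["minutes_spent"])
    return wasted


# Highest cost first: these alerts are the best tuning targets.
ranking = wasted_minutes(LOG_CSV).most_common()
for alert, minutes in ranking:
    print(f"{alert}: {minutes} min spent on false positives")
```

Ranking by time spent rather than by raw count keeps the focus on the alerts that cost the team the most.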
FAQ
How do I explain to my organization that fewer alerts is better security?
Frame it around response rates, not alert volumes. If 90% of alerts are ignored, you effectively have no monitoring. Ten alerts that all get investigated provide more security value than 1,000 alerts that all get ignored. Security posture is about detection AND response, not detection alone.
What's a healthy alert-to-action ratio?
There's no universal standard, but a useful target is that at least 50% of alerts that fire should result in some action (investigation, tuning, or escalation). If fewer than half of your alerts lead to action, your signal quality is too low and needs tuning.
Should I ever accept a known noisy alert rather than fixing it?
Occasionally, yes — if the cost of fixing the noise (refactoring the detection logic, changing infrastructure) exceeds the cost of the noise (a few minutes of investigation per week). But this should be a deliberate decision, documented, with a plan to fix it eventually. Never accept noise indefinitely as a default.
Written by Viktor B.
Co-founder & CEO