Lambda architectures replace persistent server processes with thousands of short-lived function invocations. This model creates observability challenges that aren't present in traditional server-based deployments: a single "request" might trigger dozens of function invocations across multiple services; errors in one function propagate in ways that are hard to trace without distributed tracing; cold starts affect latency in ways that don't appear in average metrics.
Effective serverless monitoring requires instrumenting at multiple layers: Lambda-native metrics for function health, distributed tracing for end-to-end visibility, and application-level metrics emitted from function code for business logic observability.
Lambda Native CloudWatch Metrics
Lambda automatically publishes metrics to CloudWatch under the AWS/Lambda namespace without any configuration. The key metrics to monitor:
Invocations: Total count of function calls. Use this for traffic baselines and to detect unexpected usage spikes or drops. A sudden drop to zero invocations on a normally busy function is as notable as a spike: it could indicate that an upstream trigger has stopped working.
Errors: Count of invocations that resulted in function execution errors (exceptions, memory exceeded, timeout). This doesn't include throttles (separate metric) or initialization errors. Monitor error rate as a percentage of invocations rather than absolute count, and set alarms at 1-5% error rate for production functions.
Throttles: Count of invocation requests rejected because the concurrency limit was reached. Any non-zero throttle count on a production function warrants investigation: synchronous callers receive an error (HTTP 429) rather than a delayed response, while throttled asynchronous events are queued and retried by Lambda. Create an alarm for throttles > 0 on business-critical functions.
Duration: Function execution time in milliseconds. CloudWatch supports percentile statistics, so track p50, p95, and p99 rather than the average alone; p95 and p99 describe the tail latency that users actually experience. Invocations that time out count as errors and leave a timeout message in CloudWatch Logs. If p99 duration is approaching the configured timeout, increase the timeout or optimize the function.
ConcurrentExecutions: Simultaneous in-progress invocations. Compare to your reserved and account concurrency limits. Sustained high concurrency approaching limits suggests you need to request a quota increase or optimize function performance to reduce duration.
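The error-rate guidance above can be expressed as a CloudWatch metric math alarm. The following is a minimal sketch, not a definitive setup: the helper name `build_error_rate_alarm`, the alarm naming scheme, the 5-minute period, and the function name in the usage note are all assumptions made for illustration. It only builds the keyword arguments for CloudWatch's `put_metric_alarm`, so it can be inspected without AWS credentials.

```python
def build_error_rate_alarm(function_name: str, threshold_percent: float = 1.0) -> dict:
    """Return keyword arguments for CloudWatch put_metric_alarm (illustrative helper)."""
    dimensions = [{"Name": "FunctionName", "Value": function_name}]

    def lambda_metric(metric_id: str, metric_name: str) -> dict:
        # Raw AWS/Lambda metric summed over 5-minute periods (period is an assumption).
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{function_name}-error-rate",  # hypothetical naming scheme
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_percent,
        "EvaluationPeriods": 1,
        # Periods with no invocations produce no data; do not alarm on them.
        "TreatMissingData": "notBreaching",
        "Metrics": [
            lambda_metric("errors", "Errors"),
            lambda_metric("invocations", "Invocations"),
            {
                # Only this expression is evaluated against the threshold.
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
                "ReturnData": True,
            },
        ],
    }
```

With boto3 installed, the result could be applied with something like `boto3.client("cloudwatch").put_metric_alarm(**build_error_rate_alarm("checkout-handler", 2.0))`, where `checkout-handler` is a placeholder function name.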
AWS X-Ray for Distributed Tracing
X-Ray provides distributed tracing across Lambda functions and AWS services. When enabled, X-Ray traces each request through all participating components — API Gateway, Lambda, DynamoDB, SQS, external HTTP calls — and visualizes the trace as a service map and timeline. This makes it immediately visible which component is responsible for latency or which downstream service call is failing.
Enable active tracing in the Lambda function configuration; the function's execution role needs the xray:PutTraceSegments and xray:PutTelemetryRecords permissions. Instrument your function code with the X-Ray SDK to create custom subsegments for database calls, external HTTP calls, and other operations you want to trace separately. Without custom subsegments, X-Ray traces only show the function invocation itself; with instrumentation, you see the breakdown of time spent in each operation within the function.
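X-Ray sampling controls cost. The default sampling rule records the first request each second plus 5% of any additional requests. Sampling decisions are made when a request starts, so a rule cannot select traces by their eventual duration; rules match on request attributes such as service name, HTTP method, and URL path. For latency-sensitive functions, create custom sampling rules with a higher rate (up to 100%) for the services or paths you care about, even if overall sampling is lower.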
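To get a feel for what "first request per second plus 5% of additional requests" means for trace volume, the arithmetic can be sketched as below. The function name and its parameters are illustrative assumptions, and the result is an expected average, not a guarantee of what X-Ray records in any given second.

```python
def expected_sampled_per_second(
    requests_per_second: float,
    reservoir: int = 1,
    fixed_rate: float = 0.05,
) -> float:
    """Estimate traces recorded per second under a reservoir-plus-rate rule."""
    # The reservoir covers the first N requests each second.
    from_reservoir = min(requests_per_second, reservoir)
    # Requests beyond the reservoir are sampled at the fixed rate.
    beyond_reservoir = max(requests_per_second - reservoir, 0)
    return from_reservoir + beyond_reservoir * fixed_rate
```

Under these assumptions, a function receiving 100 requests per second would record roughly 6 traces per second, while a function receiving one request per second or fewer would have every request traced.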
Structured Logging for Application Observability
Lambda function stdout and stderr go to CloudWatch Logs. Unstructured log output (free-text log lines) is hard to query and correlate across invocations. Structured logging, meaning JSON-formatted log output with consistent fields, enables CloudWatch Logs Insights queries that surface application-level metrics.
Emit a structured log line at the end of each invocation with: request ID, function duration, outcome (success/failure), any relevant business metrics (orders processed, records updated), and error details if failed. CloudWatch Logs Insights can then query these logs to calculate success rates, throughput, and error patterns without requiring custom metric instrumentation for every dimension you care about.
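A minimal sketch of such an end-of-invocation log line, using only the Python standard library, might look like the following. The field names (`request_id`, `duration_ms`, `outcome`, `orders_processed`) and the `orders` key in the event are assumptions chosen for illustration; only `context.aws_request_id` comes from the Lambda runtime.

```python
import json
import time


def invocation_summary(request_id, started_at, outcome, business_metrics=None, error=None):
    """Build one JSON log line describing a finished invocation."""
    record = {
        "request_id": request_id,
        "duration_ms": round((time.monotonic() - started_at) * 1000, 2),
        "outcome": outcome,
    }
    # Business metrics become top-level fields so they are easy to query.
    record.update(business_metrics or {})
    if error is not None:
        record["error_type"] = type(error).__name__
        record["error_message"] = str(error)
    return json.dumps(record)


def handler(event, context):
    started_at = time.monotonic()
    try:
        # Hypothetical business logic: count the orders in the event.
        processed = len(event.get("orders", []))
    except Exception as exc:
        print(invocation_summary(context.aws_request_id, started_at, "failure", error=exc))
        raise
    print(invocation_summary(
        context.aws_request_id, started_at, "success",
        {"orders_processed": processed},
    ))
    return {"processed": processed}
```

Because every line carries the same fields, a query can filter on `outcome` and aggregate `orders_processed` or `duration_ms` without parsing free text.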
Powertools for AWS Lambda (also known as Lambda Powertools; available for Python, TypeScript, Java, and .NET) provides structured logger, metrics, and tracer utilities that implement AWS best practices for Lambda observability. It significantly reduces the boilerplate for implementing structured logging and custom metrics. Once the Lambda context is injected, the structured logger includes function name, cold start indicator, and request ID in every log line.
Cold Start Detection and Management
Cold starts occur when Lambda needs to initialize a new execution environment: downloading code, initializing the runtime, and running initialization code outside the handler function. Cold starts add roughly 100 ms to 5,000 ms of latency on the affected invocation, depending on runtime (JVM is typically slowest; Python and Node.js are faster), code size, and how much work the initialization code does. VPC attachment was historically a major cold start cost because of per-environment ENI setup; since Lambda moved ENI creation to function configuration time in 2019, its effect on cold starts is much smaller.
Lambda Powertools records whether each invocation was a cold start in logs and metrics. Track cold start rate as a percentage of invocations and monitor cold start duration separately from warm invocation duration. If your P99 latency is dominated by cold starts, consider provisioned concurrency for latency-sensitive functions, code optimization to reduce initialization time, or architecture changes that reduce invocation frequency (batching events before invoking Lambda).
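Without Powertools, cold starts can be detected with a module-level flag, because module scope runs once per execution environment while the handler runs on every invocation. The sketch below assumes a `cold_start` log field and a hypothetical `cold_start_rate` helper for summarizing flags collected from logs.

```python
import json

# Module scope executes once per execution environment, i.e. on a cold start.
_is_cold_start = True


def handler(event, context):
    global _is_cold_start
    cold_start = _is_cold_start
    _is_cold_start = False  # every later invocation in this environment is warm
    print(json.dumps({"cold_start": cold_start}))
    return {"cold_start": cold_start}


def cold_start_rate(flags):
    """Percentage of invocations that were cold starts."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0
```

Logging the flag on every invocation makes it possible to split duration percentiles into cold and warm populations when querying the logs.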
Alerting for Serverless Applications
Serverless alerting has to account for the invocation-based execution model, where traffic volume can vary widely from one period to the next. Configure alarms on:
- Error rate above threshold (not absolute error count — rate accounts for traffic variability)
- Throttle count > 0 for critical functions
- P99 duration above threshold for user-facing functions
- Invocation count anomaly (unusual spikes or unexpected zero counts)
- Dead letter queue depth > 0 (failed async invocations reaching the DLQ)
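The dead letter queue alarm is particularly important for event-driven architectures. Asynchronous invocations (from SNS, EventBridge, or S3) that still fail after Lambda's retries are sent to the function's DLQ or on-failure destination, if one is configured. SQS event sources work differently: messages that repeatedly fail processing go to the DLQ defined in the source queue's redrive policy. In both cases the messages need investigation. A growing DLQ depth is an early signal of systematic failures that might not generate obvious user-facing errors.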
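As a sketch of the DLQ depth alarm, assuming the DLQ is an SQS queue, the helper below builds arguments for CloudWatch's `put_metric_alarm` on the queue's `ApproximateNumberOfMessagesVisible` metric. The helper name, the alarm naming scheme, and the optional SNS topic parameter are illustrative assumptions.

```python
def build_dlq_depth_alarm(queue_name: str, sns_topic_arn: str = "") -> dict:
    """Return keyword arguments for put_metric_alarm on an SQS DLQ (illustrative)."""
    alarm = {
        "AlarmName": f"{queue_name}-not-empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        # Any message in the DLQ is worth a look, so alarm above zero.
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
    if sns_topic_arn:
        # Notify an SNS topic when the alarm fires.
        alarm["AlarmActions"] = [sns_topic_arn]
    return alarm
```

The returned dictionary could be passed to a CloudWatch client as `put_metric_alarm(**build_dlq_depth_alarm("orders-dlq"))`, where `orders-dlq` is a placeholder queue name.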
Related Reading
- Lambda concurrency limits — understanding the capacity constraints that monitoring reveals
- Lambda security — security monitoring alongside operational monitoring
- Lambda cost monitoring — using monitoring data for cost optimization
- CloudTrail alerting — API-level events that complement function-level monitoring
FAQ
How do I correlate logs across multiple Lambda functions in a request chain?
Propagate a correlation ID through all function invocations in a request chain. The first function in the chain creates a correlation ID or reuses one it received; each subsequent function receives it in the event payload or HTTP headers and includes it in its own log output. CloudWatch Logs Insights can then query for all log lines with the same correlation ID to reconstruct the full request journey. X-Ray trace IDs serve the same purpose for distributed traces.
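One way to sketch this propagation in Python is shown below. The header name `x-correlation-id`, the `correlation_id` payload key, and all three helper names are assumptions for illustration; use whatever convention your services already share.

```python
import json
import uuid

CORRELATION_HEADER = "x-correlation-id"  # hypothetical header name


def get_correlation_id(event):
    """Reuse an incoming correlation ID, or create one at the start of a chain."""
    headers = event.get("headers") or {}
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == CORRELATION_HEADER and value:
            return value
    return event.get("correlation_id") or str(uuid.uuid4())


def log(correlation_id, message, **fields):
    """Emit a JSON log line that always carries the correlation ID."""
    line = json.dumps({"correlation_id": correlation_id, "message": message, **fields})
    print(line)
    return line


def build_downstream_payload(correlation_id, body):
    """Attach the correlation ID to the event sent to the next function."""
    return {"correlation_id": correlation_id, "body": body}
```

Each function in the chain calls `get_correlation_id` on its incoming event, logs with that ID, and passes it on through `build_downstream_payload`, so a single filter on `correlation_id` returns the log lines from every function involved.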
Are Lambda function logs retained indefinitely?
No, not unless you leave the default in place. By default, CloudWatch log groups for Lambda functions never expire, which means storage costs grow indefinitely. Configure log retention periods on all Lambda log groups to match your operational and compliance requirements. 30-90 days covers most operational needs; archive older logs to S3 with lower-cost storage classes if longer retention is required for compliance.
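A sketch of applying a default retention period across Lambda log groups follows. It assumes a CloudWatch Logs client is passed in (for example `boto3.client("logs")`) and that Lambda log groups use the `/aws/lambda/` prefix; the helper name and the 30-day default are illustrative choices. Log groups that already have a retention setting are left untouched.

```python
def apply_default_retention(logs_client, retention_days=30, prefix="/aws/lambda/"):
    """Set a retention period on log groups that currently never expire."""
    updated = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate(logGroupNamePrefix=prefix):
        for group in page["logGroups"]:
            # Groups with no retention policy omit the retentionInDays key.
            if "retentionInDays" not in group:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=retention_days,
                )
                updated.append(group["logGroupName"])
    return updated
```

Passing the client in keeps the function easy to exercise against a stub before running it against a real account. Note that CloudWatch Logs accepts only specific retention values (30, 60, and 90 days are among them).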
What's the best way to monitor Lambda functions across hundreds of functions in a large application?
For large Lambda fleets, per-function dashboards don't scale. Use the multi-function view in CloudWatch Lambda Insights together with CloudWatch ServiceLens for fleet-level views that aggregate across functions. Tag Lambda functions with application and environment tags, then build dashboards filtered by tag to see application-level health. Lambda Insights also provides enhanced function-level metrics (memory usage, initialization duration, CPU time) that complement the standard Lambda CloudWatch metrics.
Written by Vigilare Engineering
Platform Team