Introduction to System Health Monitoring for Lambda Functions:

System health monitoring always comes up when moving from a traditional, server-based architecture to a serverless architecture on Amazon Web Services (AWS). How can you achieve accurate, in-depth monitoring of your systems in a serverless world?

Within AWS, CloudWatch Metrics and Alarms give operations and development teams a way to monitor their architecture that not only lets them address issues as they come up, but also alerts them preemptively when things start to go sideways.

Recently I have been working extensively with Lambda functions, so in this post I will go over some of the basics of metrics and alarms that anyone can use to monitor the health of their Lambda functions, along with how to use CloudFormation to easily deploy these alarms for any number of functions.

If you are more familiar with Terraform than CloudFormation, we also have a similar blog post with Terraform as the focus here.

Which metrics to monitor for Lambda Functions and why:

Given the ephemeral nature of Lambda functions, traditional monitoring of metrics such as service uptime and memory footprint cannot be used as an accurate assessment of a service’s health. This is because Lambda functions are only considered “alive” and running when they are invoked. The infrastructure of a Lambda function is abstracted and code execution happens in an ephemeral runtime container. Traditional monitoring assumes that you have access to infrastructure details and have something durable that tools can “observe”.

This means that rather than monitoring metrics like uptime, we need to monitor metrics like the number of throttled requests a function has, or the number of errors relative to the total number of invocations of the function. Memory utilization is still important, but its limit is defined in a function's configuration. So instead of focusing on memory directly, other metrics become important, such as function duration: when a Lambda's invocation time creeps up, it can signal underlying problems with memory or CPU utilization.

Another aspect of Lambda health to keep in mind when designing alarms and their thresholds is that we are sometimes allowed a little more leniency than is normal in a traditional architecture. This is because Lambdas have features such as built-in retries for event-driven logic, which can, in certain cases, resolve intermittent issues without intervention.
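
To make that concrete, the number of built-in retries for asynchronous (event-driven) invocations is itself configurable per function. Below is a minimal sketch, separate from the alarm templates later in this post, of where that knob lives in CloudFormation; the MyFunction resource it references is hypothetical.

MyFunctionAsyncConfig:
    Type: 'AWS::Lambda::EventInvokeConfig'
    Properties:
      FunctionName: !Ref MyFunction # Hypothetical Lambda function resource defined elsewhere in the template.
      Qualifier: '$LATEST'
      MaximumRetryAttempts: 2 # Asynchronous invocations are retried up to twice before the event is handed to a dead letter queue or failure destination.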

Some of my suggested basic Lambda metrics to watch, and why, are given below; see the AWS documentation for a full list of Lambda metrics and their descriptions.

  • Concurrency Limit: This can be either ConcurrentExecutions on its own or a combination of ConcurrentExecutions and UnreservedConcurrentExecutions, depending on whether some of your functions have reserved concurrency. These are region-wide, account-level metrics that show how close you are to the concurrency limit at which your Lambda functions start to be throttled. They are important to track for systems that process many simultaneous events, to make sure your processes do not get throttled.

  • Errors: This is fairly self-explanatory; any error in a Lambda function can cause problems for other functions or processes in your architecture, so you should be alerted when any of them pop up. Given Lambda's built-in retries, there may be a higher threshold for when immediate action is necessary, but all errors should be tracked and investigated.

  • DeadLetterErrors: This metric usually implies data loss. It increments when failed events cannot be written to their dead letter queue, the safe backup location from which they could be debugged and rerun. An alarm is usually needed so teams can address these issues immediately.

  • Duration: Lambda functions should be small and fast. Tracking function duration, and alarming when it reaches an unacceptable threshold, lets teams debug and find the cause of edge cases or anomalies that could disrupt normal function flow (a sketch of such an alarm follows this list).
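
As a taste of what an alarm on one of these metrics can look like, here is a minimal sketch of a Duration alarm written in the same style as the templates shown later in this post. It assumes the same FunctionName and CriticalSnsTopicName parameters described below, and the 3000 millisecond threshold is a placeholder you would tune per function.

LambdaCriticalDurationAlarm:
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      ActionsEnabled: true
      AlarmDescription: !Sub 'Lambda Critical Duration Alarm for ${FunctionName}'
      AlarmName: !Sub '${FunctionName}-Lambda-Critical-Duration-Alarm'
      AlarmActions:
        - !Sub 'arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${CriticalSnsTopicName}'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      EvaluationPeriods: 1
      MetricName: Duration
      Namespace: AWS/Lambda
      Statistic: Maximum
      Threshold: 3000 # Milliseconds; placeholder value, tune per function.
      DatapointsToAlarm: 1
      Dimensions:
        - Name: FunctionName
          Value: !Sub '${FunctionName}'
      Period: 300
      TreatMissingData: missing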

Once you’ve become familiar with all of the different metrics that are exposed and what you want to monitor, you can define common monitoring criteria as code for teams to deploy alongside their functions across your organization.

CloudFormation and CloudWatch Alarms:

CloudFormation is AWS's infrastructure-as-code service; it gives you the power to use simple JSON and YAML templates to model and provision all the resources your applications need, across regions and accounts, in an automated and secure manner.

CloudWatch Alarms allow you to watch a single CloudWatch metric or the result of a math expression based on CloudWatch metrics. These alarms can perform one or more actions based on the value of the metric or expression that the alarm is tracking, relative to a user-defined threshold over a specified number of time periods. For our purposes, we will focus only on sending a notification to a specified Amazon SNS topic to alert certain teams or developers when the alarms are triggered.

You can use CloudWatch Alarms and CloudFormation together to build modular, reusable alarm templates that let you easily deploy monitoring for all the metrics you deem important to your system's health. You can even set up two versions of each alarm for every metric, one as a warning and one as a critical alert, if you would like different teams to be alerted when the alarms fire.

To start your CloudFormation template you must first become familiar with the properties CloudFormation allows you to configure for CloudWatch Alarms. This information can be found in the AWS documentation here. Below I've provided two code snippets for basic CloudWatch Alarm resources that will send a message to a user-specified SNS topic: one when a Lambda function fails to write to its dead letter queue, and one when a Lambda's errors rise above 25 percent of its total invocations.

Utilizing CloudFormation parameters, identified below as CriticalSnsTopicName and FunctionName, we can make these templates modular so they can be reused for any function and any SNS topic.
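
As a rough sketch of the supporting pieces such a template needs, the Parameters section below declares those two parameters, and a Conditions section defines the DeadLetterAlarms condition that gates the first alarm. The HasDeadLetterQueue parameter is an assumed name, used here so the dead letter alarm can be skipped for functions without a dead letter queue.

Parameters:
  FunctionName:
    Type: String
    Description: Name of the Lambda function to monitor.
  CriticalSnsTopicName:
    Type: String
    Description: Name of the SNS topic that receives critical alerts.
  HasDeadLetterQueue: # Assumed parameter name; not part of the original alarm snippets.
    Type: String
    AllowedValues: ['true', 'false']
    Default: 'true'

Conditions:
  DeadLetterAlarms: !Equals [!Ref HasDeadLetterQueue, 'true']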

LambdaCriticalDeadLetterAlarm:
    Condition: DeadLetterAlarms
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      ActionsEnabled: true
      AlarmDescription: !Sub 'Lambda Critical DeadLetterQueue Alarm for ${FunctionName}' #CloudFormation parameter FunctionName allows you to reuse alarm template for multiple functions.
      AlarmName: !Sub '${FunctionName}-Lambda-Critical-DeadLetterQueueAlarm'
      AlarmActions:
        - !Sub 'arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${CriticalSnsTopicName}'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      EvaluationPeriods: 1
      MetricName: DeadLetterErrors
      Namespace: AWS/Lambda
      Statistic: Maximum
      Threshold: 1
      DatapointsToAlarm: 1
      Dimensions:
        - Name: FunctionName
          Value: !Sub '${FunctionName}'
      Period: 300
      TreatMissingData: missing

The template above uses CloudFormation parameters (FunctionName and CriticalSnsTopicName) along with CloudFormation pseudo parameters (AWS::Region and AWS::AccountId) to create a CloudWatch Alarm that monitors a function's DeadLetterErrors metric over a single evaluation period of 300 seconds, firing when one datapoint is at or above the threshold of one DeadLetterError. For this alarm we can use the Maximum statistic because we just want to know whether any DeadLetterError occurred during the evaluation period.

Below is another example of a CloudWatch Alarm; this one uses metric math to compute the percentage of errors relative to the number of invocations for a specific function. Because the expression multiplies the error-to-invocation ratio by 100, the threshold is 25, so the alarm fires when errors divided by invocations is greater than or equal to 25%. For the metrics in this alarm we take the Sum as the stat instead of the Maximum, because we care about the total number of errors and invocations over the time period rather than any single data point.

LambdaCriticalErrorAlarm:
    Type: 'AWS::CloudWatch::Alarm'
    Properties:
      ActionsEnabled: true
      AlarmDescription: !Sub 'Lambda Critical Error Alarm for ${FunctionName}'
      AlarmName: !Sub '${FunctionName}-Lambda-Critical-Error-Alarm'
      AlarmActions:
        - !Sub 'arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${CriticalSnsTopicName}'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      EvaluationPeriods: 1
      Threshold: 25
      DatapointsToAlarm: 1
      Metrics:
        - Id: !Sub "errorPercentage_${FunctionName}"
          Expression: !Sub "errors_${FunctionName} / requests_${FunctionName} * 100"
          Label: !Sub "${FunctionName}-ErrorPercentage"
          ReturnData: true
        - Id: !Sub "errors_${FunctionName}"
          MetricStat:
            Metric:
              Namespace: "AWS/Lambda"
              MetricName: "Errors"
              Dimensions:
                - Name: FunctionName
                  Value: !Sub '${FunctionName}'
                - Name: Resource
                  Value: !Sub '${FunctionName}'
            Period: 300
            Stat: Sum
          ReturnData: false
        - Id: !Sub "requests_${FunctionName}"
          MetricStat:
            Metric:
              Namespace: "AWS/Lambda"
              MetricName: "Invocations"
              Dimensions:
                - Name: FunctionName
                  Value: !Sub '${FunctionName}'
            Period: 300
            Stat: Sum
          ReturnData: false
      TreatMissingData: missing
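
To reuse these modular alarms for any number of functions, one option is to package them into their own template and instantiate it as a nested stack once per function, passing different parameter values each time. Below is a minimal sketch of that approach; the stack resource name, template URL, function name, and topic name are all hypothetical.

AlarmsForOrderProcessor:
    Type: 'AWS::CloudFormation::Stack'
    Properties:
      TemplateURL: 'https://s3.amazonaws.com/my-template-bucket/lambda-alarms.yaml' # Hypothetical S3 location of the packaged alarm template.
      Parameters:
        FunctionName: 'OrderProcessor' # Hypothetical function to monitor.
        CriticalSnsTopicName: 'critical-alerts' # Hypothetical SNS topic name.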

For just about every service available on AWS, CloudWatch exposes metrics that can be used to understand service and system health. I hope this short intro to Lambda metric monitoring with CloudFormation and CloudWatch Alarms gives you a look at what can be done to monitor your architecture and systems. If you have any questions or concerns, feel free to comment below or reach out to me on LinkedIn!