How to run headless Chrome with Puppeteer on AWS Lambda

Developers can automate a wide range of tasks by manipulating a browser environment via an API. This allows you to achieve things like generating PDF files, screenshotting webpages, or running health checks on a website - all from code. Additionally, you can run UI tests, automate form submissions and diagnose performance problems with it. A popular package for running a browser programmatically is Headless Chromium. You can configure this with minimal code whether you are loading a website or periodically fetching content.
 

You can run a headless browser on a local machine or a remote server, but AWS Lambda is a better choice for many typical browser automation tasks. Lambda functions can run on a schedule or start in response to an event. Lambda can also scale out for load testing, eliminating the need to manage a fleet of instances.
 

This post shows how to configure a browser automation task and deploy it on Lambda. The AWS Serverless Application Model (AWS SAM) is used to simplify the deployment of cloud resources. The code in this post can be downloaded from the accompanying GitHub repository. Refer to the README file for instructions on how to deploy to an AWS account.
 

Overview

The example application uses a Lambda function to save a screenshot of a webpage to an S3 bucket every 15 minutes. Here is the architecture:
 

  1. An Amazon EventBridge rule invokes the Lambda function using a schedule expression.
  2. The Lambda function uses Chromium to load the target webpage. Once the page is loaded and rendered, it takes a screenshot.
  3. The screenshot is saved to an Amazon S3 bucket.

Here's how it works
 

In this example, Node.js controls the Chromium browser through an npm package called Puppeteer. This is shown in a snippet of the Lambda function:
 

The code is written using JavaScript async/await syntax, which avoids callbacks and keeps the code flow sequential. Once the browser object is defined, Puppeteer directs Chromium to fetch the webpage. After the page loads and the DOM renders, Chromium captures a screenshot, which is stored in a buffer variable. Finally, the image is written to S3:
 

const s3result = await s3
  .upload({
    Bucket: process.env.S3_BUCKET,
    Key: `${Date.now()}.png`,
    Body: buffer,
    ContentType: 'image/png',
    ACL: 'public-read'
  })
  .promise()
console.log('S3 image URL:', s3result.Location)

 

 

The S3 upload method is provided by the AWS SDK for JavaScript. The bucket and key are set, and the access control list (ACL) is set to public-read to control object visibility. Lastly, the public URL of the object is logged.
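As a rough illustration of the upload parameters described above, the sketch below builds the same object that is passed to s3.upload(). The helper name buildUploadParams and the injectable now argument are assumptions made for this example and are not part of the original function; injecting the timestamp only makes the object key predictable when testing.

```javascript
// Hypothetical helper (not part of the original function): builds the
// parameter object for s3.upload(). `now` defaults to the current time,
// which yields the timestamp-based file name used in this post.
function buildUploadParams(bucket, buffer, now = Date.now()) {
  return {
    Bucket: bucket,
    Key: `${now}.png`,
    Body: buffer,
    ContentType: 'image/png',
    ACL: 'public-read'
  };
}

// Example with a placeholder bucket name and a fixed timestamp.
const params = buildUploadParams(
  'example-bucket',
  Buffer.from('placeholder image bytes'),
  1700000000000
);
console.log(params.Key); // 1700000000000.png
```

In the Lambda function, the returned object could be passed directly to s3.upload(...).promise().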

The Puppeteer package for Lambda

With Lambda, you can package dependencies with your code in a zip file. A deployment package can be up to 50 MB zipped and up to 250 MB unzipped. For larger packages, the container image packaging format for Lambda functions allows images up to 10 GB.
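To make these limits concrete, here is a small sketch that picks a packaging format from a package's size. The function name suggestPackaging is hypothetical, and comparing the unzipped size against the container limit is a simplification for illustration only; the numbers are the ones quoted in this post.

```javascript
// Limits as quoted in this post, expressed in bytes.
const MB = 1024 * 1024;
const LIMITS = {
  zipped: 50 * MB,
  unzipped: 250 * MB,
  containerImage: 10 * 1024 * MB
};

// Hypothetical helper: suggests a packaging format for a given package size.
function suggestPackaging(zippedBytes, unzippedBytes) {
  if (zippedBytes <= LIMITS.zipped && unzippedBytes <= LIMITS.unzipped) {
    return 'zip';
  }
  // Simplification: treat the unzipped size as the approximate image size.
  if (unzippedBytes <= LIMITS.containerImage) {
    return 'container image';
  }
  return 'too large';
}

console.log(suggestPackaging(40 * MB, 180 * MB));  // zip
console.log(suggestPackaging(120 * MB, 600 * MB)); // container image
```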
 

Tools such as AWS SAM or the Serverless Framework create the package on your development machine, zip it, and upload it to Lambda. When the function is executed, these files are unzipped in the Lambda execution environment. You can streamline your deployments by using these tools.
 

Whenever you use a dependency containing a binary, you must make sure that the binary is packaged in a way that is compatible with Lambda's operating system. The Puppeteer package includes an entire Chromium browser. Because the browser relies on binaries, npm by default installs the binary that matches the operating system of your local development machine. Lambda, however, requires the binary for Amazon Linux 2, which is Lambda's underlying operating system.
 

Community members have converted many popular packages into Lambda layers to help with this. When you create or update a Lambda function, you can attach these layers instead of bundling the binary in your deployment package. For packages like Chromium, this makes it easier to run one binary in development and another in production. You can use layers to simplify your development process.
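One way to picture "one binary in development and another in production" is a small environment check. The sketch below is an assumption-laden illustration, not the layer's own logic: the helper names are hypothetical, and it relies only on AWS_LAMBDA_FUNCTION_NAME, an environment variable that Lambda sets inside its execution environment.

```javascript
// Hypothetical helper: true when the code runs inside a Lambda
// execution environment, where AWS_LAMBDA_FUNCTION_NAME is set.
function isRunningInLambda(env = process.env) {
  return Boolean(env.AWS_LAMBDA_FUNCTION_NAME);
}

// Hypothetical helper: describes which Chromium binary would be used.
function chooseChromiumSource(env = process.env) {
  return isRunningInLambda(env)
    ? 'layer binary (Amazon Linux 2)'
    : 'local binary installed by npm';
}

console.log(chooseChromiumSource({ AWS_LAMBDA_FUNCTION_NAME: 'SnapshotFunction' }));
// layer binary (Amazon Linux 2)
console.log(chooseChromiumSource({}));
// local binary installed by npm
```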

A community-maintained Lambda layer

A developer has published a Chromium binary for AWS Lambda on GitHub. You can use it with Puppeteer in your Lambda function by including the libraries in package.json. Alternatively, an existing Lambda layer bundles both. The GitHub repository for that layer is updated frequently with the chrome-aws-lambda package, and you can include the layer directly in your Lambda functions.
 

A layer is available only in the AWS Regions where it has been published. The maintainers of this public repository have published the layer to 16 Regions, and the layer ARNs are provided in the README file. These ARNs are public and can be used by Lambda functions in any AWS account within those Regions.
 

Community members have bundled many popular libraries into Lambda layers, including commonly used utilities such as GeoIP, MySQL, OpenSSL, Pandas, scikit-learn, and many others. To use a layer in a Lambda function, the layer ARN must be in a supported Region and the function must use a compatible runtime.

 

For example: https://github.com/shelfio/chrome-aws-lambda-layer

 

How to use the AWS SAM template

You could define this Lambda function directly in the AWS Management Console. AWS SAM, on the other hand, allows you to define infrastructure as code. This avoids the human error that comes from clicking around the console and makes it faster to create repeatable deployments.
 

All AWS resources used by the application are defined in the AWS SAM template. The first thing it does is declare an S3 bucket:
 

  S3Bucket:
    Type: AWS::S3::Bucket

 

The code location and Lambda function are defined in the following section of the template. The memory is set to 4096 MB because the entire browser is running within the function. A 15 second timeout is configured to ensure that the function ends if the target webpage is unresponsive.
 

  SnapshotFunction:
    Type: AWS::Serverless::Function
    Properties:
      Description: Invoked by EventBridge scheduled rule
      CodeUri: src
      Handler: app.handler
      Runtime: nodejs12.x
      Timeout: 15
      MemorySize: 4096

 

The function uses the publicly available Chromium layer. At deployment time, the Region code is substituted into the layer ARN. The resulting ARN is valid as long as you deploy to one of the 16 Regions where the layer is published:
 

      Layers:
        - !Sub 'arn:aws:lambda:${AWS::Region}:764866452798:layer:chrome-aws-lambda:22'

 


    Available regions
    ap-northeast-1: arn:aws:lambda:ap-northeast-1:764866452798:layer:chrome-aws-lambda:31
    ap-northeast-2: arn:aws:lambda:ap-northeast-2:764866452798:layer:chrome-aws-lambda:31
    ap-south-1: arn:aws:lambda:ap-south-1:764866452798:layer:chrome-aws-lambda:31
    ap-southeast-1: arn:aws:lambda:ap-southeast-1:764866452798:layer:chrome-aws-lambda:31
    ap-southeast-2: arn:aws:lambda:ap-southeast-2:764866452798:layer:chrome-aws-lambda:31
    ca-central-1: arn:aws:lambda:ca-central-1:764866452798:layer:chrome-aws-lambda:31
    eu-north-1: arn:aws:lambda:eu-north-1:764866452798:layer:chrome-aws-lambda:31
    eu-central-1: arn:aws:lambda:eu-central-1:764866452798:layer:chrome-aws-lambda:31
    eu-west-1: arn:aws:lambda:eu-west-1:764866452798:layer:chrome-aws-lambda:31
    eu-west-2: arn:aws:lambda:eu-west-2:764866452798:layer:chrome-aws-lambda:31
    eu-west-3: arn:aws:lambda:eu-west-3:764866452798:layer:chrome-aws-lambda:31
    sa-east-1: arn:aws:lambda:sa-east-1:764866452798:layer:chrome-aws-lambda:31
    us-east-1: arn:aws:lambda:us-east-1:764866452798:layer:chrome-aws-lambda:31
    us-east-2: arn:aws:lambda:us-east-2:764866452798:layer:chrome-aws-lambda:31
    us-west-1: arn:aws:lambda:us-west-1:764866452798:layer:chrome-aws-lambda:31
    us-west-2: arn:aws:lambda:us-west-2:764866452798:layer:chrome-aws-lambda:31
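The substitution that the template performs can be mimicked in a few lines of JavaScript. This is only a sketch: the function name chromeLayerArn is hypothetical, the Region set is copied from the list above, and the version number is passed in by the caller because the published version may differ between Regions and over time.

```javascript
// The 16 Regions listed above where the layer is published.
const LAYER_REGIONS = new Set([
  'ap-northeast-1', 'ap-northeast-2', 'ap-south-1', 'ap-southeast-1',
  'ap-southeast-2', 'ca-central-1', 'eu-north-1', 'eu-central-1',
  'eu-west-1', 'eu-west-2', 'eu-west-3', 'sa-east-1',
  'us-east-1', 'us-east-2', 'us-west-1', 'us-west-2'
]);

// Hypothetical helper: builds the layer ARN for a Region, or throws
// if the layer is not published there.
function chromeLayerArn(region, version) {
  if (!LAYER_REGIONS.has(region)) {
    throw new Error(`chrome-aws-lambda layer is not published in ${region}`);
  }
  return `arn:aws:lambda:${region}:764866452798:layer:chrome-aws-lambda:${version}`;
}

console.log(chromeLayerArn('us-east-1', 31));
// arn:aws:lambda:us-east-1:764866452798:layer:chrome-aws-lambda:31
```

Failing early for an unsupported Region is a design choice for this sketch; without a check like this, the problem would only surface when the deployment tries to resolve the ARN.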

 

Environment variables specify the URL of the target website and the name of the bucket in which the image will be stored. Finally, since the function only writes data to S3, it uses an AWS SAM policy template to grant write permissions. This follows the principle of least privilege:
 

      Environment:
        Variables:
          TARGET_URL: 'https://serverlessland.com'
          S3_BUCKET: !Ref S3Bucket
      Policies:
        - S3WritePolicy:
            BucketName: !Ref S3Bucket

Invocation of Lambda functions is triggered by events. EventBridge manages the interval at which the function runs. The template configures the function to run every 15 minutes using a schedule expression:
 

      Events:
        CheckWebsiteScheduledEvent:
          Type: Schedule
          Properties:
            Schedule: rate(15 minutes)
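To get a feel for what a rate expression implies, the sketch below converts one into a number of invocations per day. The helper name invocationsPerDay is hypothetical, and it handles only rate expressions with minute, hour, or day units; cron expressions are deliberately out of scope.

```javascript
// Minutes represented by each unit accepted in a rate expression.
const MINUTES_PER_UNIT = {
  minute: 1, minutes: 1,
  hour: 60, hours: 60,
  day: 1440, days: 1440
};

// Hypothetical helper: how many times per day a rate(...) schedule fires.
function invocationsPerDay(expression) {
  const match = /^rate\((\d+) (minute|minutes|hour|hours|day|days)\)$/.exec(expression);
  if (!match || Number(match[1]) === 0) {
    throw new Error(`Unsupported schedule expression: ${expression}`);
  }
  const intervalMinutes = Number(match[1]) * MINUTES_PER_UNIT[match[2]];
  return 1440 / intervalMinutes;
}

console.log(invocationsPerDay('rate(15 minutes)')); // 96
```

At 96 invocations per day, each writing one screenshot, the bucket gains 96 objects daily, which is worth keeping in mind when estimating storage.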

Run the same deployment again whenever you change the Lambda function or a resource in this template. The AWS SAM CLI detects the differences between versions and deploys the new code and resources automatically.

 

All in one 

 

# serverless.yml

service: lambdaScreenshot

custom:
  # change this name to something unique
  s3Bucket: screenshot-files

provider:
  name: aws
  region: us-east-1
  versionFunctions: false
  # here we put the layers we want to use
  layers:
    # Google Chrome for AWS Lambda as a layer
    # Make sure you use the latest version depending on the region
    # https://github.com/shelfio/chrome-aws-lambda-layer
    - arn:aws:lambda:${self:provider.region}:764866452798:layer:chrome-aws-lambda:10
  # function parameters
  runtime: nodejs12.x
  memorySize: 2048 # recommended
  timeout: 30
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:PutObject
        - s3:PutObjectAcl
      Resource: arn:aws:s3:::${self:custom.s3Bucket}/*

functions:
  capture:
    handler: src/capture.handler
    environment:
      S3_REGION: ${self:provider.region}
      S3_BUCKET: ${self:custom.s3Bucket}

resources:
  Resources:
    # Bucket where the screenshots are stored
    screenshotsBucket:
      Type: AWS::S3::Bucket
      DeletionPolicy: Delete
      Properties:
        BucketName: ${self:custom.s3Bucket}
        AccessControl: Private
    # Grant public read-only access to the bucket
    screenshotsBucketPolicy:
      Type: AWS::S3::BucketPolicy
      Properties:
        PolicyDocument:
          Statement:
            - Effect: Allow
              Action:
                - s3:GetObject
              Principal: "*"
              Resource: arn:aws:s3:::${self:custom.s3Bucket}/*
        Bucket:
          Ref: screenshotsBucket

Creating the function

// src/capture.js

// this module will be provided by the layer
const chromeLambda = require("chrome-aws-lambda");

// aws-sdk (v2) is preinstalled in the Lambda Node.js runtimes through nodejs16.x,
// including the nodejs12.x runtime used here
const S3Client = require("aws-sdk/clients/s3");

// create an S3 client
const s3 = new S3Client({ region: process.env.S3_REGION });


// The function to run
exports.handler = async (event) => {

  // launch a headless browser
  const browser = await chromeLambda.puppeteer.launch({
    args: chromeLambda.args,
    defaultViewport: chromeLambda.defaultViewport,
    executablePath: await chromeLambda.executablePath
  });

  // Open a page and navigate to the url
  const page = await browser.newPage();
  await page.goto(event.url);

  // take a screenshot, then close the browser to free its resources
  const buffer = await page.screenshot();
  await browser.close();

  // upload the image using the current timestamp as filename
  const result = await s3
    .upload({
      Bucket: process.env.S3_BUCKET,
      Key: `${Date.now()}.png`,
      Body: buffer,
      ContentType: "image/png",
      ACL: "public-read"
    })
    .promise();

  // return the uploaded image url
  return { url: result.Location };
};


The final step is to write the Lambda function.

The code above spins up a headless Chrome instance, navigates to a page, and takes a screenshot using Puppeteer. The screenshot is then uploaded to S3 for storage. The URL of the screenshot file is returned at the end.
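The two configurations in this post supply the target URL differently: the AWS SAM template sets a TARGET_URL environment variable, while the function above reads event.url. A minimal sketch of one way to accept both is shown below. The helper name resolveTargetUrl and the preference for event.url over the environment variable are assumptions for illustration, not part of the original code.

```javascript
// Hypothetical helper: picks the page to capture from the invocation
// event, falling back to the TARGET_URL environment variable, and
// rejects anything that is not an http(s) URL.
function resolveTargetUrl(event = {}, env = process.env) {
  const candidate = event.url || env.TARGET_URL;
  if (!candidate) {
    throw new Error('No target URL: set event.url or TARGET_URL');
  }
  const parsed = new URL(candidate); // throws on malformed input
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return parsed.href;
}

console.log(resolveTargetUrl({ url: 'https://example.com' }, {}));
// https://example.com/
console.log(resolveTargetUrl({}, { TARGET_URL: 'https://serverlessland.com' }));
// https://serverlessland.com/
```

Inside the handler, the result could be passed to page.goto() in place of event.url.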
 

Function testing

Go to the Lambda console after deploying the example application. Open SnapshotFunction deployed by AWS SAM. The function is invoked automatically every 15 minutes, but you can trigger it manually by choosing Test:
 

The Log output contains information about the function duration and the URL of the image stored in the S3 bucket. To view the screenshot, navigate to this URL in a browser:

 

After the function has been deployed for a few hours, the scheduled event will have invoked it multiple times. Amazon CloudWatch metrics can be used to monitor its performance. Select Monitoring to see the number of invocations, average duration, and errors for the Lambda function:
 

See the date-stamped objects created by each Lambda invocation by opening the S3 bucket created by the AWS SAM deployment:

In AWS Lambda, what are the timeout limits?

Three major components make up a Lambda Serverless application. Your serverless application can experience timeouts due to each of these components:
 

  • Event source - usually Amazon API Gateway
     
  • Lambda function – limited by AWS Lambda service limits
     
  • Services – third party applications, DynamoDB, S3, and other resources that the Lambda function integrates with
     

Each component's timeout limits and important considerations are summarized in the following table.

Serverless Component | Max Timeout | Comments
API Gateway | 50 milliseconds – 29 seconds | Configurable
Lambda Function | 900 seconds (15 minutes) | Also limited to 1,000 concurrent executions. If not handled, can lead to throttling issues.
DynamoDB Streams | 40,000 write capacity units per table |
S3 | No timeout by default, can be configured to 5-10 seconds | Unlimited objects per bucket
Downstream Applications | Check your applications |

If a Puppeteer task may run for longer than about 30 seconds, invoke the Lambda function directly (for example, on a schedule) rather than through API Gateway, which times out after 29 seconds.
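The limits in the table combine simply: a caller waits no longer than the shortest timeout in the chain. The sketch below expresses that with the two figures from the table; the function name effectiveTimeoutSeconds is hypothetical, and other components (such as downstream services) are ignored for brevity.

```javascript
// Maximum timeouts taken from the table above, in seconds.
const API_GATEWAY_MAX_SECONDS = 29;
const LAMBDA_MAX_SECONDS = 900;

// Hypothetical helper: the longest a request can run given the function's
// configured timeout and whether it is invoked through API Gateway.
function effectiveTimeoutSeconds(lambdaTimeoutSeconds, viaApiGateway) {
  const lambdaLimit = Math.min(lambdaTimeoutSeconds, LAMBDA_MAX_SECONDS);
  return viaApiGateway
    ? Math.min(lambdaLimit, API_GATEWAY_MAX_SECONDS)
    : lambdaLimit;
}

console.log(effectiveTimeoutSeconds(900, true)); // 29
console.log(effectiveTimeoutSeconds(30, false)); // 30
```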
 

Conclusion

The ability to programmatically control a web browser allows you to automate many useful tasks. You can minimize infrastructure overhead and simplify scaling by using Lambda for many of these. This blog post uses an example application that takes periodic screenshots of a webpage with a headless browser.
 

Lambda layers can simplify the deployment of commonly used libraries or packages with operating-system-specific binaries. Publicly maintained layers exist for many libraries, and you can include them in your Lambda functions. With infrastructure as code tools like AWS SAM, you can define your code and layers together in YAML, so you can automate deployments and accelerate development.

 

Comparison with puppeteer-docker.com

With puppeteer-docker.com, you can easily run your code package in a Puppeteer-ready environment. You do not need to configure AWS Lambda as described above. Compared with AWS Lambda, puppeteer-docker.com has some benefits:

Zero config - Puppeteer is ready to use

High concurrency - AWS Lambda limits the number of concurrent executions

Cost friendly - Our price is around 50% of AWS Lambda

Proxy ready - We support rotating proxies


It is a good idea to try puppeteer-docker.com before you start a project on AWS Lambda. If you need any help, please let us know and we will be glad to assist.