Headless Chrome: A Basic Puppeteer Tutorial
Using a headless browser is a somewhat non-traditional, but very effective, way to perform web scraping and web automation.
What is headless Chrome and what’s the benefit?
Think about modern web design: frameworks such as AngularJS and React make it easy to build rich, interactive UIs, but the content they render with JavaScript is often not present in the raw HTML the server returns.
JavaScript is executed by the browser, so to extract content from dynamic web pages you often need a headless browser to render them first. Once such a script has run and scraped the desired information from the website, you can filter and process the data you've obtained.
"Headless" simply means the browser runs without a graphical interface, so there is no window and no mouse or touch input to rely on. You drive it from the command line or from a script and get the full utility of the browser engine, just without the GUI.
Learn how to drive headless Google Chrome with Puppeteer
There are many tools for web scraping, with different degrees of efficiency. Puppeteer and Intoli's Remote Browser are newer additions to the field that both have pros and cons.
Puppeteer stands out from other tools because it controls headless Chrome or Chromium directly through the DevTools Protocol. It is maintained by the Chrome DevTools team and an open-source community of developers. With headless browsing and Puppeteer, you can quickly jump into the code and automate web scraping for your business goals.
How to install Puppeteer on Ubuntu
You need Node.js version 8+ installed on your machine. You can install it quickly by running the following commands in your terminal:
curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
sudo apt-get install -y nodejs
The examples in this tutorial assume Ubuntu. Headless Chromium also depends on a number of system libraries; if your machine doesn't already have them, the following packages will help:
sudo apt-get install -yq --no-install-recommends libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 libnss3
Start a new JavaScript project with Puppeteer and Headless Chrome
I recommend installing the Puppeteer package with npm. It is a Node library for controlling headless Chrome over the DevTools Protocol, which makes it well suited to automating browser testing and scraping.
npm i puppeteer --save
Note that installing the package also downloads a bundled version of Chromium that is guaranteed to work with that Puppeteer release, so there is nothing extra to download or install.
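If you would rather use an existing Chrome or Chromium installation instead of the bundled one, Puppeteer can be pointed at it through the executablePath launch option. A minimal sketch, where the path is only an example and depends on your system:

const puppeteer = require('puppeteer');

// Sketch: launch a system-installed Chromium instead of the bundled one.
// The executablePath below is an example; adjust it for your machine.
async function launchSystemChromium() {
    return puppeteer.launch({
        executablePath: '/usr/bin/chromium-browser',
    });
}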
We are now ready to start writing code.
An intro to scraping the web with the Puppeteer API
Basic example: Launch browser -> Take screenshot
const puppeteer = require('puppeteer');
Next, let's read the URL we want to visit from the command-line arguments:
const url = process.argv[2];
if (!url) {
    throw new Error("Please provide a URL as the first argument");
}
Since Puppeteer is a promise-based library, we define an async function and put all of our Puppeteer code inside it so that we can use await. The difference between promise-based libraries like Puppeteer and callback-based approaches will become clearer as we go through this article.
async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({path: 'screenshot.png'});
    await browser.close();
}
run();
How Puppeteer, Chromium and Headless Mode Work
const puppeteer = require('puppeteer');
const url = process.argv[2];
if (!url) {
    throw new Error("Please provide a URL as the first argument");
}
async function run () {
    // Launch a headless browser instance and open a new page
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Navigate to the URL passed on the command line and save a screenshot
    await page.goto(url);
    await page.screenshot({path: 'screenshot.png'});

    await browser.close();
}
run();
Save the script as screenshot.js in your project directory and start it with the following command:
node screenshot.js https://github.com
Under the hood, the script launches a headless browser, opens a new page, navigates to the URL given as a command-line argument, saves a screenshot once the page has loaded, and then closes the browser. Now that we have covered the basics, let's move on to something slightly more challenging.
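Before we do, one quick aside: page.screenshot() accepts several options. Here is a minimal sketch of a variant that sets an explicit viewport and captures the full scrollable page; the dimensions are arbitrary, and these lines would go inside the async run() function above:

// Sketch: full-page screenshot at a fixed viewport size (values are arbitrary)
await page.setViewport({width: 1280, height: 800});
await page.screenshot({path: 'screenshot.png', fullPage: true});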
Puppeteer script optimization
There is still room to optimize our Puppeteer scripts. By default, Puppeteer launches Chromium with a fresh, temporary browser profile, so every asset (CSS, images, scripts) is downloaded from scratch on each run. The simplest optimization is to reuse a persistent user data directory between runs, so that cacheable assets only need to be downloaded once:
const browser = await puppeteer.launch({
    userDataDir: './data',
});
We should be able to see a nice bump in performance, since most of the CSS and images will be cached in the data directory upon the first request, so Chrome won't have to download them again and again.
Even with caching, those assets are still loaded and used when the page renders. Since we are scraping news articles from Y Combinator's Hacker News, we don't need any visuals, including the images; the only thing we are interested in is the bare HTML output, so let's try to block every request that isn't strictly necessary.
The good thing about Puppeteer here is that it has built-in support for request interception, which makes this fairly easy: we can provide an interceptor for every request and cancel the ones we are not interested in. The interceptor can be defined in the following way:
await page.setRequestInterception(true);
page.on('request', (request) => {
    if (request.resourceType() === 'document') {
        request.continue();
    } else {
        request.abort();
    }
});
You can see that we have full control over the requests that are initiated. Based on the resourceType, we can write custom logic to allow or abort specific requests. We also have access to other data, such as the request URL via request.url(), so if we wish, we can block only specific URLs.
We will only allow requests with the resource type "document" to pass our filter, which means that we will block all images, CSS, and everything else other than the original HTML response.
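For instance, a sketch of a URL-based filter might look like this, where the analytics script URL is purely a hypothetical example (request interception must already be enabled, as above):

page.on('request', (request) => {
    // Hypothetical example: drop one specific tracking script, allow everything else
    if (request.url().endsWith('/analytics.js')) {
        request.abort();
    } else {
        request.continue();
    }
});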
The final code is as follows:
const puppeteer = require('puppeteer');

function run (pagesToScrape) {
    return new Promise(async (resolve, reject) => {
        try {
            if (!pagesToScrape) {
                pagesToScrape = 1;
            }
            const browser = await puppeteer.launch();
            const page = await browser.newPage();

            // Block everything except the main HTML document
            await page.setRequestInterception(true);
            page.on('request', (request) => {
                if (request.resourceType() === 'document') {
                    request.continue();
                } else {
                    request.abort();
                }
            });

            await page.goto("https://news.ycombinator.com/");
            let currentPage = 1;
            let urls = [];
            while (currentPage <= pagesToScrape) {
                await page.waitForSelector('a.storylink');

                // Collect the URL and title of every story link on the page
                let newUrls = await page.evaluate(() => {
                    let results = [];
                    let items = document.querySelectorAll('a.storylink');
                    items.forEach((item) => {
                        results.push({
                            url: item.getAttribute('href'),
                            text: item.innerText,
                        });
                    });
                    return results;
                });
                urls = urls.concat(newUrls);

                // Click through to the next page of results, if any
                if (currentPage < pagesToScrape) {
                    await Promise.all([
                        page.waitForSelector('a.morelink'),
                        page.click('a.morelink'),
                        page.waitForSelector('a.storylink')
                    ]);
                }
                currentPage++;
            }
            await browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    });
}

run(5).then(console.log).catch(console.error);
Rate limits keep you safe
There is no doubt that headless browsers are very powerful tools. There are a lot of things they can do in terms of web automation, and Puppeteer makes it even easier to accomplish these things. In spite of all the possibilities that are out there, it is important to comply with the website's terms of service in order to ensure we do not abuse the system.
I am not going to go into too much detail about this aspect in this Puppeteer tutorial because it is more architecture-related. In this case, though, adding a sleep command to a Puppeteer script is the most basic way to slow it down:
await page.waitFor(5000);
This statement forces your script to sleep for five seconds (5,000 milliseconds). You can place it anywhere before browser.close().
It is possible to control your Puppeteer usage in a number of other ways, just as you would limit your use of any third-party service. One common approach is a queue with a limited number of workers: every time you want to use Puppeteer, you push a new task onto the queue, but only a fixed number of workers pull tasks from it at any given time. This is also a standard way to deal with third-party API rate limits when scraping with Puppeteer, as sketched below.
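Here is a minimal sketch of that idea in plain Node.js, with no external dependencies. The concurrency limit of two workers and the screenshot task are illustrative assumptions, not a recommendation:

const puppeteer = require('puppeteer');

// Sketch: a tiny task queue that never runs more than `concurrency` tasks at once
function createQueue(concurrency) {
    const tasks = [];
    let active = 0;

    function next() {
        if (active >= concurrency || tasks.length === 0) {
            return;
        }
        active++;
        const {task, resolve, reject} = tasks.shift();
        task().then(resolve, reject).then(() => {
            active--;
            next();
        });
    }

    return {
        push(task) {
            return new Promise((resolve, reject) => {
                tasks.push({task, resolve, reject});
                next();
            });
        },
    };
}

// Usage sketch: at most two pages are processed at any given time
const queue = createQueue(2);
['https://example.com', 'https://news.ycombinator.com'].forEach((url, index) => {
    queue.push(async () => {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto(url);
        await page.screenshot({path: `page-${index}.png`});
        await browser.close();
    }).catch(console.error);
});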
Puppeteer's Place in a Fast-Moving World
The purpose of this Puppeteer tutorial is to demonstrate its basic functionality as a web scraping tool. However, it has a much broader range of applications, such as headless browser testing, PDF generation, and performance monitoring, among others.
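As a quick taste of those other use cases, here is a rough sketch of PDF generation with page.pdf(); the target URL and output filename are just placeholders:

const puppeteer = require('puppeteer');

// Sketch: render a page to PDF (page.pdf() works in headless mode).
// The URL and output path are placeholders.
async function printToPdf() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://news.ycombinator.com/', {waitUntil: 'networkidle2'});
    await page.pdf({path: 'hn.pdf', format: 'A4'});
    await browser.close();
}

printToPdf().catch(console.error);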
The development of web technologies is accelerating. Some websites are so dependent on JavaScript rendering that it is nearly impossible to scrape them or automate them using simple HTTP requests. Thanks to projects such as Puppeteer and the awesome teams behind them, headless browsers are becoming increasingly available to handle all of our automation needs.