How to Avoid Getting Blocked When Scraping Websites

A web scraper is an application that extracts data from a website over HTTP, either directly or through a browser. Scraping (or crawling) a website means downloading its HTML and parsing it to extract the data you need. Web scraping is generally not illegal unless you try to access non-public data, such as login credentials, that is not reachable by the public. Small websites usually won't mind, but large sites, and Google in particular, may ignore your requests or block your IP address outright. This article covers best practices for avoiding an IP block when scraping data from the web.


An overview


"What's the point of using the API if you can do this with the API?"


Not every website offers an API, and even when one exists it rarely exposes every bit of information. Scraping is often the only way to get at a website's data.

Web scraping can be used for a variety of purposes:

  • Price monitoring for e-commerce
  • News aggregation
  • Lead generation
  • Monitoring search engine result pages (SEO)
  • Aggregating bank account data
  • Building datasets that are otherwise unavailable to individuals and researchers

Scraping is a problem because most websites do not want to be scraped. The one crawler they all welcome is Google's; beyond that, they want to serve content only to real human users.

Websites block scrapers because they look robotic, so your goal is to avoid being recognized as a robot. There are two ways to seem human: use the tools humans use, and imitate human behavior.

The rest of this article explains how websites detect and block scrapers, and how to get around those defenses.



What is a headless browser, and why use one?

Whenever you open your browser and visit a webpage, you are asking an HTTP server for content. Classic command-line tools such as cURL can retrieve content from an HTTP server just as well. The only difference between a headless browser and an ordinary one is that nothing is drawn on screen: the browser runs entirely in the background, which is why it is described as lacking a GUI, or "head". Like a normal browser, a headless browser performs whatever tasks your program asks of it: clicking links, navigating pages, downloading and uploading documents, and so on. In a conventional browser every step is presented to the user through the GUI, whereas in a headless browser the steps run sequentially in the background, and you keep track of them through a console or command line.


Google can tell you are not human when you run cURL, for example by looking at the headers. Every HTTP request includes headers carrying information about the request, among them the infamous "User-Agent" header, which describes exactly which client is making the request. By default, the User-Agent header tells Google you are using cURL. Wikipedia has a good page about HTTP headers if you want to learn more. As an experiment, compare the headers cURL sends with the ones your browser sends, using any service that echoes a request's headers back to you.
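To make a scripted request look less like cURL, you can supply browser-like headers yourself. The sketch below (header values are illustrative, not exhaustive, and the target URL is a placeholder) builds such a header set for use with Node's built-in fetch:

```javascript
// Sketch: build a browser-like header set instead of the default
// "User-Agent: curl/8.x" that cURL (or a bare HTTP library) would send.
function browserLikeHeaders() {
    return {
        'User-Agent':
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
            '(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
        'Accept':
            'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
    };
}

// Usage with Node 18+'s built-in fetch (URL is a placeholder):
// const res = await fetch('https://example.com', { headers: browserLikeHeaders() });
```

As the article notes below, faked headers alone are not enough against sites that verify clients with JavaScript, but they are the first thing anti-bot systems check.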


The content a website returns can change depending on the client requesting it. In some cases the content is rendered by JavaScript rather than delivered as static HTML, and scraping such sites requires a headless browser. Browser-automation tools such as Selenium and Puppeteer can drive and scrape these dynamic sites; it takes more effort, but it is the most reliable method. Headers themselves are easy to fake: cURL lets you supply a real browser's User-Agent, and in practice you would set several headers so that cURL, or any HTTP library, produces a request indistinguishable from a browser's. What cURL and plain HTTP libraries cannot do is execute JavaScript, and that is exactly how a website checks whether you are using a real browser.



Some websites embed a JavaScript snippet in the page that must execute before the real content is unlocked. In a real browser you never notice; with a plain HTTP client you receive only the challenge page. Node.js makes it easy to execute JavaScript outside a browser, so you could run the snippet yourself, but this workaround is far from bulletproof: the web has evolved, and sites have many other tricks to check whether your browser is authentic.


Executing extracted JavaScript snippets with Node.js is brittle and labor-intensive, and cURL plus pseudo-JS execution becomes useless against complicated check systems or large single-page applications. The best way to look like a real browser is to use one.

Headless browsers behave like real browsers because they are real browsers, except that you drive them programmatically. Headless Chrome, which runs Chrome without its user interface, is the most popular; drivers wrap its functionality in an easy-to-use API. The best-known solutions are Selenium, Puppeteer, and Playwright. There are now ways to detect these as well, so even headless browsers won't always do the trick; this arms race has been running for a long time. These tools are easy to run on your local machine but much harder to scale. Among our other services, we provide scraping tools that offer smooth scalability and natural browsing behavior.



Using Rotating Proxies

Website owners can spot your footprint in their server logs when repetitive requests arrive from the same IP address. Rotating proxies help you avoid this.

A rotating proxy service assigns each request a fresh IP address drawn from a pool of proxies. To avoid detection, route your requests through these proxies and rotate the IP regularly; all you need is a script that picks an address from the pool for each request. Rotating IP addresses makes it look like different humans in different parts of the world are accessing the data, rather than a single bot.
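The rotation script can be as simple as a round-robin iterator over the pool. In this sketch the proxy addresses are placeholders; a real pool would come from your proxy provider:

```javascript
// Sketch: round-robin rotation over a proxy pool.
// Each call returns the next proxy, wrapping around at the end.
function makeProxyRotator(proxies) {
    let i = 0;
    return () => proxies[i++ % proxies.length];
}

const nextProxy = makeProxyRotator([
    'http://10.0.0.1:8080',   // placeholder addresses
    'http://10.0.0.2:8080',
    'http://10.0.0.3:8080',
]);

// Before each request, pass nextProxy() to your HTTP client's
// proxy/agent option so consecutive requests use different IPs.
```

Random selection works too; round-robin just guarantees the pool is used evenly.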


Free proxy servers tend to disappear quickly, and most anti-scraping tools already blacklist them because of how heavily they are overused. If you rely on free proxies, automate the process of checking and replacing them so your scraping isn't disrupted when they die.

Paid proxy services are far more reliable.

For your convenience, we have rotating proxies built into our system, and because we are an all-in-one solution provider, the price is lower than most dedicated proxy companies.


Scraping web pages slowly

An automated scraper pulls data at inhuman speeds, which anti-scraping plugins detect easily. Adding random delays and occasional human-like actions makes the scraper much harder to distinguish from a person. Sending too many requests too quickly can even crash a smaller website, so cap your request rate both to be polite and to keep your IP address from being blocked for overloading the server.
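A minimal way to add such pauses is a randomized delay between requests. The bounds below (1 to 5 seconds) are illustrative, and `fetchPage` stands in for whatever scraping call you use:

```javascript
// Sketch: add a random, human-like pause between requests.
function randomDelayMs(minMs = 1000, maxMs = 5000) {
    return minMs + Math.floor(Math.random() * (maxMs - minMs));
}

// Promise-based sleep, awaitable inside an async scraping loop.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage inside a scraping loop:
// for (const url of urls) {
//     await fetchPage(url);          // your scraping call (placeholder)
//     await sleep(randomDelayMs());  // wait 1-5 s before the next request
// }
```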

A site's robots.txt can also tell you how long to wait between two requests: it often contains a Crawl-delay field that specifies exactly how many seconds should elapse between consecutive requests. Respecting it helps you avoid being flagged as a crawler.
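Reading the Crawl-delay value out of a robots.txt body takes only a small parser. This is a minimal sketch that handles the common `Crawl-delay: N` form and ignores case:

```javascript
// Sketch: extract the Crawl-delay directive (in seconds) from a
// robots.txt body. Returns null when the field is absent.
function parseCrawlDelay(robotsTxt) {
    const match = robotsTxt.match(/^\s*crawl-delay\s*:\s*(\d+(?:\.\d+)?)/im);
    return match ? Number(match[1]) : null;
}

// Example robots.txt fragment:
const robots = [
    'User-agent: *',
    'Crawl-delay: 10',
    'Disallow: /private/',
].join('\n');

console.log(parseCrawlDelay(robots)); // → 10
```

Note that Crawl-delay is a de facto convention rather than part of the core robots.txt standard, so treat a missing value as "use your own conservative delay".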


Make sure you are using the correct User-Agent string

The User-Agent string tells servers and peers which application and OS version is making the request. Some sites block user agents that do not belong to a major browser, and others will not serve content correctly if the string is malformed. To solve this, use the fake-useragent library (Python) or maintain your own list of user agents.


You can also fetch the most popular user agents from a public API.

Examples of popular user agents (latest Windows versions, by browser):
Edge on Windows
  • Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36 Edg/103.0.1264.44
Internet Explorer on Windows
  • Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko
Chrome on Windows
  • Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36
  • Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36
  • Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36
Firefox on Windows
  • Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0
Vivaldi on Windows
  • Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36 Vivaldi/4.3
  • Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36 Vivaldi/4.3
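Rotating through a list like the one above is straightforward: pick a user agent at random for each request. A minimal sketch (the pool is abbreviated to a few entries):

```javascript
// Sketch: rotate User-Agent strings by picking one at random per request.
// (In Python, the fake-useragent package does the same job.)
const userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36',
];

function randomUserAgent(pool = userAgents) {
    return pool[Math.floor(Math.random() * pool.length)];
}

// Attach it to each outgoing request, e.g.:
// fetch(url, { headers: { 'User-Agent': randomUserAgent() } });
```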



Blocking browser fingerprinting

Browser fingerprinting is more intrusive than cookie-based tracking. By analyzing your computer's hardware, software, add-ons, and preferences, a company builds a digital fingerprint of you from settings such as your screen size, the fonts installed on your machine, and even the web browser you use.



Commonly configured laptops, PCs, and smartphones are harder to fingerprint; the more unique add-ons, fonts, and settings you have, the easier you are to single out, and companies build your fingerprint from exactly that unique combination of information. Our service blocks known fingerprinting techniques, so you can keep using your favorite extensions, themes, and customizations without being tracked.


How do you hide your canvas fingerprint in Puppeteer? The snippet below replaces the native HTMLCanvasElement.prototype.toDataURL function before any page script runs, so every other script on the page sees the patched version. When it detects the website painting an image 220px wide and 30px tall, a common fingerprinting pattern, it returns a fake fingerprint; for every other canvas it calls the original toDataURL, so no other functionality is affected.


const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // Patch toDataURL before any page script runs
    await page.evaluateOnNewDocument(() => {
        const originalFunction = HTMLCanvasElement.prototype.toDataURL;
        HTMLCanvasElement.prototype.toDataURL = function (type) {
            if (type === 'image/png' && this.width === 220 && this.height === 30) {
                // this is likely a fingerprinting attempt: return a fake fingerprint
                return 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAANwAAAAeCAAAAABiES/iAAACeElEQVRYw+2YzUtUURjGf47OmDPh5AyFomUiEeEmyghXtWsh4dcswlYV2KYWfZh/QRBUVLhTCCJXEgmKUCIkFhJREARBkbkyKBlTRmUC82lxZ7z3TjM4whwXwz2ry3vO87znx33Pey4XFfHAg/PgPDgPzoPz4Dy4rFIKscSkAfmnsUY+iTfXFhxue4Zm4QpfaKbg8k+EsZNsGG6iNVzRMrkZeRPmjp6eCgcae5f+3wJIgtWLldG+DUnfzoail1etaVsEa1f2lUqw2hPd3T7nCrkMtlkQ24YDwP8+FZkI+gY3uq2cTcu54GIA/dJCDUAnSE4RdAESdALUxZ0hl4E5OMs49iE528E5a+cj5YFhDVI3vLA2c4K+zLXpvR37tNRDs3STg1OJqXqQSwS14wlJUD+VeHWAW86Qy8BwQ5Ek/WK/JBgqC72UTvJakmY5lAvurTRPSDrMmKRRcIvgeUo2KmmEI86Qy8DwmVu/ezQIBCSBLzwjKZhujv5cZZmUNkAq57ekRXCLYDG12pre5Qy5DAzDXbPfIOB/JqmCzNafCZd+dMA5RfZxdsBlNTAMF+FJfD2eSvSI0iGpmXe5GnbG3qyyHAO3yCZxlGV2uBLWDcJVMZKc7UrnfIBvQI+pHpxbS34ZaNkK7gYN0yvTDSCXyCZxNJTscFFe/DUH1w3QvpnzPiUPdTXfsvxZDdBGmeQU2SQd9lWQHS5m9J6Ln4/suZCwc96D25qM1formq5/3ApOX1uDkZ7P7JXkENkkK5eqQm3flRtuvitSYgCucKOf0zv01bazcG3Tyz8GKukvSjjrlB3/U5Rw42dqAo29yypKOO8figeX1/gH+zX9JqfOeUwAAAAASUVORK5CYII=';
            }
            // otherwise, just use the original function
            return originalFunction.apply(this, arguments);
        };
    });

    await page.goto('');

    await browser.close();
})();



We hope you now have a better understanding of the obstacles web scrapers face and how to counter them.

By leveraging and combining all of these techniques, puppeteer-docker's web scraping API can handle thousands of requests per second without being blocked. If you don't want to spend too much time setting everything up, give puppeteer-docker a try; the first 1,000 API calls are free.

In addition, we have recently published a guide about the best puppeteer docker on the market, so be sure to check it out!