Tips and Tricks for Web Scraping with Puppeteer

The Google Chrome team made waves last year when it released Puppeteer, a Node.js API for running headless Chrome instances. It is a marked improvement in both speed and stability over existing solutions like PhantomJS and Selenium, and was named one of the ten best web scraping tools of 2018. It has its own warts, though, and getting Puppeteer to run smoothly for large web scraping jobs brings its own complexities (at Scraper API, we use Puppeteer to scrape and render JavaScript from millions of web pages each month). Here are a few lessons we’ve learned.

Using Browserless

Running headless Chrome instances on the same server as your application code is generally a bad idea, because CPU and RAM usage can be unpredictable. To keep a spike in CPU usage from taking down your application server as well, run headless Chrome on its own server. Luckily, this is easy with the Browserless library. Here are the settings we use in production:

https://medium.com/media/3119226feb8bc28ccdcdf025fd7e7743/href

These settings time out Chrome sessions after 5 minutes (to prevent stray sessions from running indefinitely and eventually crashing your server) and allow up to 5 sessions at any given time. Five concurrent sessions seems to be a sweet spot that runs comfortably on a $5 DigitalOcean VPS.
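On the application side, the counterpart to a session timeout is making sure every session is released when the work is done. The following is a minimal sketch, not our production code: `withRemoteBrowser` is a hypothetical helper name, and the endpoint `ws://localhost:3000` in the usage comment is an assumption about where your Browserless instance listens. It relies only on `puppeteer.connect` and `browser.close`, and takes the Puppeteer module as an argument so it works with either `puppeteer` or `puppeteer-core`.

```javascript
// Sketch: run a task against a remote headless Chrome and always end the
// session afterwards, so stray sessions do not accumulate on the server.
// `withRemoteBrowser` is a hypothetical helper, not part of Puppeteer.
async function withRemoteBrowser(puppeteer, browserWSEndpoint, task) {
  // puppeteer.connect attaches to an already running browser over WebSocket.
  const browser = await puppeteer.connect({ browserWSEndpoint });
  try {
    return await task(browser);
  } finally {
    // Runs whether the task succeeded or threw.
    await browser.close();
  }
}

// Usage (assumed endpoint; adjust host and port to your own deployment):
// const puppeteer = require('puppeteer');
// withRemoteBrowser(puppeteer, 'ws://localhost:3000', async (browser) => {
//   const page = await browser.newPage();
//   await page.goto('https://example.com');
//   return page.title();
// });
```

The `finally` block is the point of the sketch: a scrape that throws halfway through still gives its session slot back instead of waiting for the server-side timeout.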

Browser Settings

There are a few browser-level Puppeteer settings you should know about to speed up your browser instances:

https://medium.com/media/b3480fd748663d10090cf3f8f497a21a/href

Because the Puppeteer library is still quite young and very actively developed, some of these flags may already be on by default by the time you read this. They are sensible defaults that we found in GitHub issues while debugging errors, and they should keep you from running into the same cross-platform, hard-to-debug memory errors that we ran into.
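For readers who cannot see the embedded settings, here is a rough sketch of what browser-level options passed to `puppeteer.launch` can look like. The flags are real Chromium switches, but this particular selection is an assumption for illustration and is not necessarily the list we use. `buildLaunchOptions` is a hypothetical helper, and the optional `proxyServer` value is a placeholder.

```javascript
// Sketch: build an options object for puppeteer.launch().
// The flag selection is illustrative; check each one against your platform.
function buildLaunchOptions({ proxyServer } = {}) {
  const args = [
    '--no-sandbox',             // commonly needed when Chrome runs as root in a container
    '--disable-setuid-sandbox', // companion flag to --no-sandbox
    '--disable-dev-shm-usage',  // avoid crashes caused by a small /dev/shm
    '--disable-gpu',            // no GPU is available on a typical server
  ];
  if (proxyServer) {
    // Chromium accepts one proxy for the whole browser, not one per tab.
    args.push(`--proxy-server=${proxyServer}`);
  }
  return { headless: true, args };
}

// Usage:
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch(buildLaunchOptions());
```

Keeping the options in one function makes it easy to drop a flag once a newer Puppeteer release turns it on by default.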

Page Settings

Scraping a web page requires creating a new Page (Puppeteer’s term for a browser tab), navigating to the target URL, and returning the HTML. Here are the Page-level settings we use:

https://medium.com/media/26b810cc8af2533f748152b6b6154db4/href

There are a few things to notice here. Puppeteer has a waitUntil option that lets you define when a page counts as finished loading. ‘networkidle2’ means that there have been no more than 2 open network connections for at least 500 ms. This is a good setting because some websites (e.g. those using websockets) always keep connections open, so with ‘networkidle0’ the navigation would time out every time. See the Puppeteer documentation for the full list of waitUntil values. We then wait an additional 3 seconds to let the last two requests finish, and return the HTML after checking that the response status code is not an error.
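The flow described above can be sketched roughly as follows. This is a reconstruction under assumptions rather than the embedded code itself: `fetchHtml`, `timeoutMs` and `settleMs` are hypothetical names, the 30 second navigation timeout is an arbitrary choice, and the status check is done before the extra wait so that failed pages return early.

```javascript
// Small helper: resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch: open a tab, navigate, let late requests settle, return the HTML.
// `browser` is a Puppeteer Browser; the option names are hypothetical.
async function fetchHtml(browser, url, { timeoutMs = 30000, settleMs = 3000 } = {}) {
  const page = await browser.newPage();
  try {
    // Navigation counts as finished once at most 2 connections remain open.
    const response = await page.goto(url, { waitUntil: 'networkidle2', timeout: timeoutMs });
    // page.goto can resolve to null, so guard before reading the status.
    if (!response || response.status() >= 400) {
      const status = response ? response.status() : 'no response';
      throw new Error(`Request for ${url} failed (${status})`);
    }
    // Give the last in-flight requests a moment to complete.
    await sleep(settleMs);
    return await page.content();
  } finally {
    // Always close the tab so pages do not leak memory.
    await page.close();
  }
}
```

Closing the page in `finally` matters at scale: a tab left open after an error keeps consuming memory until the whole browser session ends.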

When scraping at scale, you may not want to download every file on each web page, especially larger files like images. You can intercept requests by enabling setRequestInterception on the Page and then abort the requests you don’t need to make. The Puppeteer documentation lists the available resource types. You can block any domain or subdomain just by adding it to the skippedResources list.
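A minimal sketch of this kind of request filtering is shown below. The contents of `blockedResourceTypes` and `skippedResources` are example values chosen for illustration, and `shouldBlock` and `blockUnneededRequests` are hypothetical helper names. The Puppeteer calls used are `page.setRequestInterception`, the `request` event, and the request methods `url`, `resourceType`, `abort` and `continue`.

```javascript
// Example values only: tune both lists to the sites you scrape.
const blockedResourceTypes = new Set(['image', 'media', 'font']);
const skippedResources = ['google-analytics.com', 'doubleclick.net'];

// Decide whether a request should be dropped, by type or by (sub)domain.
function shouldBlock(url, resourceType) {
  if (blockedResourceTypes.has(resourceType)) return true;
  let hostname;
  try {
    hostname = new URL(url).hostname;
  } catch (err) {
    return false; // not a parseable URL, so let it through
  }
  // Match the listed domain itself and any of its subdomains.
  return skippedResources.some(
    (domain) => hostname === domain || hostname.endsWith(`.${domain}`)
  );
}

// Attach the filter to a Puppeteer Page before navigating.
async function blockUnneededRequests(page) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (shouldBlock(request.url(), request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });
}
```

Matching on the parsed hostname rather than on a substring of the URL avoids blocking an unrelated page that merely mentions one of the listed domains in its path or query string.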

Using Proxies

When scraping a large number of pages on a single website, it may be necessary to use a proxy service. One common issue with Puppeteer is that proxies can only be set at the Browser level, not the Page level, so every Page (browser tab) must use the same proxy. To use a different proxy for each page, you will need the proxy-chain module. Because Puppeteer/Chromium have some issues with stripping headers, it is safest to identify each page by its User-Agent header, which is reliably set on every request. Simply set up your proxy server to read the User-Agent from the request and use a different upstream proxy for each User-Agent. Here is a sample proxy server:

https://medium.com/media/51b014c05e552c933ef45e6f40aa8b74/href

You can connect to this proxy server by following the example in the Browser Settings section above. This will allow you to set a different proxy server for each new Page based on the Page’s User-Agent, and will also allow you to connect to proxies that require password authentication (which Puppeteer does not currently support).
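The routing logic at the heart of such a proxy server can be sketched as a small function. This is an illustration under assumptions, not the sample server itself: `makePrepareRequestFunction` is a hypothetical helper, and the User-Agent strings and upstream proxy URLs in the usage comment are placeholders. The returned function has the shape that proxy-chain expects for its `prepareRequestFunction` option.

```javascript
// Sketch: choose an upstream proxy based on the request's User-Agent.
// `proxyByUserAgent` is a Map from User-Agent string to upstream proxy URL.
function makePrepareRequestFunction(proxyByUserAgent, fallbackProxyUrl = null) {
  return ({ request }) => {
    // Node lower-cases incoming header names.
    const userAgent = request.headers['user-agent'];
    const upstreamProxyUrl = proxyByUserAgent.get(userAgent) || fallbackProxyUrl;
    return {
      requestAuthentication: false, // the local proxy itself asks for no credentials
      upstreamProxyUrl,             // null tells proxy-chain to connect directly
    };
  };
}

// Usage (placeholder values):
// const ProxyChain = require('proxy-chain');
// const routes = new Map([
//   ['scraper-tab-1', 'http://user:password@proxy-one.example:8000'],
//   ['scraper-tab-2', 'http://user:password@proxy-two.example:8000'],
// ]);
// const server = new ProxyChain.Server({
//   port: 8000,
//   prepareRequestFunction: makePrepareRequestFunction(routes),
// });
// server.listen(() => console.log('Proxy listening on port 8000'));
```

On the Puppeteer side, each tab would call `page.setUserAgent` with one of the mapped strings before navigating. Credentials live in the upstream proxy URL, so the browser only ever talks to the unauthenticated local proxy.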

Hopefully this helps some of you avoid the painful edge cases we’ve encountered with Puppeteer. Happy scraping!


Tips and Tricks for Web Scraping with Puppeteer was originally published in Hacker Noon on Medium, where people are continuing the conversation by highlighting and responding to this story.
