Tips and Tricks for Web Scraping with Puppeteer
Running headless Chrome instances on the same server as your application code is generally a bad idea, as CPU and RAM usage can be unpredictable. To prevent a spike in Chrome's resource usage from taking down your application server as well, it is best to run headless Chrome on its own server. Luckily, this is easy to do with Browserless. Here are the settings we use in production:
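The original snippet isn't shown here, so as a sketch: Browserless is typically run as a Docker container, and its `CONNECTION_TIMEOUT` and `MAX_CONCURRENT_SESSIONS` environment variables map onto the limits described below. Treat the exact values and port as assumptions to adapt:

```shell
# Run Browserless on its own server using the browserless/chrome Docker image.
# CONNECTION_TIMEOUT is in milliseconds (5 minutes here) and
# MAX_CONCURRENT_SESSIONS caps simultaneous Chrome sessions.
docker run -d -p 3000:3000 \
  -e "CONNECTION_TIMEOUT=300000" \
  -e "MAX_CONCURRENT_SESSIONS=5" \
  browserless/chrome
```

Your application then attaches to this remote Chrome with `puppeteer.connect({ browserWSEndpoint: 'ws://your-server:3000' })` instead of `puppeteer.launch()`.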
These settings time out Chrome sessions after 5 minutes (to prevent stray sessions from running indefinitely and eventually crashing your server) and allow up to 5 sessions at any given time. 5 concurrent sessions seems to be a sweet spot that runs comfortably on a $5 DigitalOcean VPS.
There are a few browser-level Puppeteer settings you should know about to speed up your browser instances:
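As a sketch of the kind of flags in question (these are real Chromium switches, but treat the exact set as an assumption; some may be unnecessary on your platform or Puppeteer version):

```javascript
// Browser-level launch options for scraping. Each flag below is a real
// Chromium command-line switch commonly recommended for headless scraping;
// trim the list to what your environment actually needs.
const launchOptions = {
  args: [
    '--no-sandbox',            // required in most containerized environments
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage', // avoids /dev/shm exhaustion inside Docker
    '--disable-gpu',           // no use for a GPU when scraping headlessly
    '--no-first-run',
    '--no-zygote',
  ],
};

// Usage (where puppeteer is installed):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch(launchOptions);
```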
Because the Puppeteer library is still quite young and very actively developed, some of these flags may already be on by default by the time you read this. These are basically sensible defaults that we found in GitHub issues like this and this while debugging errors, and they should ensure that you don't run into the same cross-platform and hard-to-debug memory errors that we ran into.
Scraping a web page requires creating a new Page (this is what Puppeteer calls creating a new browser tab), navigating to the correct page, and returning the HTML. Here are the Page-level settings we are using.
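Since the original snippet isn't reproduced here, a minimal sketch of such a routine (the helper name `getPageHtml` and the timeout values are our own choices; `page` is a Puppeteer `Page` from `browser.newPage()`):

```javascript
// Navigate a Puppeteer Page to a URL and return its HTML.
// Assumes `page` is an already-created Puppeteer Page instance.
async function getPageHtml(page, url) {
  const response = await page.goto(url, {
    waitUntil: 'networkidle2', // "loaded" once ≤ 2 connections remain open
    timeout: 30000,            // fail navigations that hang
  });

  // Give the last couple of in-flight requests a chance to finish.
  await new Promise((resolve) => setTimeout(resolve, 3000));

  // Bail out on HTTP error statuses instead of returning an error page.
  if (!response || response.status() >= 400) {
    throw new Error(`Bad response for ${url}`);
  }
  return page.content();
}

// Usage (where puppeteer is installed):
// const page = await browser.newPage();
// const html = await getPageHtml(page, 'https://example.com');
```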
There are a few things to notice here. Puppeteer has a waitUntil option that allows you to define when a page is considered finished loading. 'networkidle2' means the page is considered loaded once no more than 2 network connections have been open for at least 500 ms. This is a good setting because some websites (e.g. websites using websockets) always keep connections open, so with 'networkidle0' your navigations would time out every time. Here is the full documentation for waitUntil. We then wait an additional 3 seconds to let the last two requests finish, and return the HTML (after checking that the response status code is not an error).
When scraping at scale, you may not want to download every file on each web page, especially larger files like images. You can intercept requests with the setRequestInterception command and block requests you don't need to make. You can see the documentation for Puppeteer resource types here. You can also block any domain or subdomain just by adding it to the skippedResources list.
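As a sketch of this technique (the `skippedResources` entries and blocked resource types are illustrative assumptions; the resource-type names themselves come from Puppeteer's API):

```javascript
// Domains/subdomains to skip entirely (placeholder examples).
const skippedResources = ['google-analytics.com', 'doubleclick.net'];

// Puppeteer resource types we never want to download while scraping HTML.
const blockedResourceTypes = ['image', 'media', 'font', 'stylesheet'];

// Pure predicate: should this request be aborted?
function shouldBlock(url, resourceType) {
  return (
    blockedResourceTypes.includes(resourceType) ||
    skippedResources.some((domain) => url.includes(domain))
  );
}

// Wiring it into a Puppeteer Page (where puppeteer is installed):
// await page.setRequestInterception(true);
// page.on('request', (request) => {
//   if (shouldBlock(request.url(), request.resourceType())) request.abort();
//   else request.continue();
// });
```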
When scraping a large number of pages on a single website, it may be necessary to use a proxy service. One common limitation of Puppeteer is that proxies can only be set at the Browser level, not the Page level, so every Page (browser tab) must use the same proxy. To use a different proxy with each page, you will need the proxy-chain module. Because Puppeteer/Chromium have some issues with stripping headers, it is safest to rely on the User-Agent header, which is reliably set on each request. Simply set up your proxy server to read the User-Agent from the request and route each User-Agent through a different proxy. Here is a sample proxy server.
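Since the sample isn't reproduced here, a sketch of the idea (the User-Agent values and upstream proxy URLs are placeholders; `Server` and `prepareRequestFunction` are proxy-chain's real API):

```javascript
// Map each User-Agent value to a different upstream proxy (placeholders).
const proxyByUserAgent = {
  'scraper-1': 'http://user:pass@proxy-1.example.com:8000',
  'scraper-2': 'http://user:pass@proxy-2.example.com:8000',
};

// Pick an upstream proxy based on the incoming request's User-Agent header.
function selectUpstreamProxy(request) {
  const userAgent = request.headers['user-agent'] || '';
  return { upstreamProxyUrl: proxyByUserAgent[userAgent] || null };
}

// Wiring this into proxy-chain's Server (where proxy-chain is installed):
// const { Server } = require('proxy-chain');
// const server = new Server({
//   port: 8000,
//   prepareRequestFunction: ({ request }) => selectUpstreamProxy(request),
// });
// server.listen(() => console.log(`Proxy listening on port ${server.port}`));
```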
You can connect to this proxy server by following the example in the Browser Settings section above. This will allow you to set a different proxy for each new Page based on that Page's User-Agent, and will also allow you to connect to proxies that require password authentication (which Puppeteer does not currently support).
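Sketching how the two halves fit together (the `--proxy-server` switch and `page.setUserAgent` are real Puppeteer/Chromium APIs; the port and User-Agent scheme are assumptions matching the proxy sketch's placeholders):

```javascript
// Route every Page through the local forwarding proxy; the proxy then picks
// the real upstream proxy based on each Page's User-Agent.
const launchOptions = {
  args: ['--proxy-server=http://localhost:8000'],
};

// Give each logical session a distinct User-Agent (placeholder scheme).
function userAgentForSession(sessionId) {
  return `scraper-${sessionId}`;
}

// Usage (where puppeteer is installed):
// const browser = await puppeteer.launch(launchOptions);
// const page = await browser.newPage();
// await page.setUserAgent(userAgentForSession(1)); // proxy maps this UA to a proxy
```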
Hopefully this helps some of you avoid the painful edge cases we’ve encountered with Puppeteer. Happy scraping!