Web scraping has always been a difficult problem to solve. Tools such as Selenium and Beautiful Soup have been very helpful, but with the release of Google's V8 engine and NodeJS, things have become even more developer-friendly.
Until recently, the popular JavaScript tools for scraping and screenshot generation were CasperJS, PhantomJS, Cheerio, etc. Puppeteer is the most recent addition to this list.
Puppeteer is a NodeJS library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to run Chrome in non-headless (full desktop) mode.
In other words, Puppeteer gives you programmatic remote control of Google Chrome and Chromium for content scraping, screenshot generation, HTML-to-PDF generation, automated testing, and lots more.
In this post we are going to learn how to install and configure Puppeteer on Ubuntu 18.04 LTS by following a few easy steps. We assume you don’t have NodeJS installed on your system; if you already do, skip the Install NodeJS step.
Update your system
sudo apt-get update
Install dependencies
sudo apt install -y gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget
Even though Puppeteer does not display a GUI in headless mode, the Chromium instance it uses still requires some of the libraries needed to draw a GUI and connect to the X11 server. One of these is the shared library libX11-xcb.so.1. If Chromium fails to start because that library is missing, installing the libx11-xcb1 package fixes it on most Debian-based systems.
However, as is so often the case with missing shared libraries, once you install the one that is missing, at least one other library turns out to be missing after that. That’s why we install the large number of libraries listed above.
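If Chromium still refuses to start after Puppeteer has been installed (see the Install puppeteer step), the remaining missing libraries can be listed in one go rather than discovered one error at a time. The following is a minimal sketch, not part of the original setup: the helper name missing_libs is made up for illustration, and the two search locations for the bundled Chromium binary are assumptions that depend on the Puppeteer version.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print every shared library that a binary needs
# but the dynamic linker cannot resolve.
missing_libs() {
  local binary="$1"
  # ldd reports unresolved dependencies as "libfoo.so.1 => not found".
  ldd "$binary" 2>/dev/null | awk '/not found/ { print $1 }'
}

# Assumption: Puppeteer keeps its Chromium download either inside
# node_modules/puppeteer or under ~/.cache/puppeteer, depending on version.
chrome_bin="$(find node_modules/puppeteer "$HOME/.cache/puppeteer" \
  -type f -name chrome 2>/dev/null | head -n 1)"

if [ -n "$chrome_bin" ]; then
  missing_libs "$chrome_bin"
else
  echo "Chromium binary not found; run this from the project directory after installing puppeteer" >&2
fi
```

An empty result means the linker can resolve every dependency. Each library name it does print can then be matched to a package and installed.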
Install NodeJS
sudo apt install curl
# install node 10.x repository to the system
curl -sL https://deb.nodesource.com/setup_10.x | sudo -E bash -
# download & install nodejs 10.x along with npm
sudo apt install nodejs
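Before moving on, it can be worth confirming that both node and npm ended up on the PATH. A minimal sketch of such a check follows. The helper name check_tool is hypothetical, and it only relies on the standard --version flag that both tools provide.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print "<tool>: <version>" if the tool is on the PATH,
# otherwise return a non-zero status.
check_tool() {
  command -v "$1" >/dev/null 2>&1 || return 1
  printf '%s: %s\n' "$1" "$("$1" --version | head -n 1)"
}

for tool in node npm; do
  check_tool "$tool" || echo "$tool is not installed" >&2
done
```

If either tool is reported as not installed, rerun the NodeJS installation commands above before installing Puppeteer.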
Install puppeteer
mkdir your-project
cd your-project
npm install puppeteer
The above commands create a project directory and install Puppeteer into it. Once the installation is done, let’s write our first program to generate a screenshot of a given page.