Basic Web Scraping with Puppeteer

Google’s Puppeteer Node library is a fantastic tool that provides an API for controlling a headless version of Chrome. It has many different uses, but recently I have been enjoying it for web scraping. I have previously written web scrapers in Python and Ruby, but I found the experience with Puppeteer much more satisfying, as it allows the use of native DOM APIs and JavaScript, which feel much better suited to web scraping.

I wrote a script that scrapes the list of premium bond high value winners from the National Savings and Investments website and allows them to be filtered by winner location. This article walks through the script and highlights some features of Puppeteer that are important for web scraping.

Opening a Web Page with Puppeteer

const puppeteer = require('puppeteer');

const PRIZE_CHECKER = 'https://www.nsandi.com/prize-checker';

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(PRIZE_CHECKER);
})();

By default the browser is launched headless, but it is possible to watch the interaction with the page by setting headless to false, as above.
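If you do want to watch the scrape, combining headless: false with the slowMo launch option makes the interactions easier to follow by slowing each operation down (slowMo is not used in the final script below):

const browser = await puppeteer.launch({
  headless: false, // show the browser window
  slowMo: 50       // slow each Puppeteer operation down by 50ms
});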

Interacting with the Page

Elements on the page can be targeted via CSS selectors and interacted with; for example, mouse and keyboard events can be simulated.

await page.click('#some-element');
await page.type('#some-other-element', 'input text');
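A few other interaction helpers often come in handy (the selectors here are placeholders):

await page.hover('#some-element');            // move the mouse over an element
await page.focus('#some-other-element');      // focus an input
await page.keyboard.press('Enter');           // simulate a key press
await page.waitForSelector('#late-element');  // wait for an element to appear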

As the list of high value winners is hidden within an accordion and the table is paginated, I use simulated mouse clicks to open the accordion and traverse the table.

await page.click('#high-value-winners'); // open the table of winners

let moreResults = true;
const winners = [];

while (moreResults) {
  const newWinners = await page.evaluate(() => {
    // extract and return currently displayed winners
  });

  winners.push(...newWinners);

  // if the button to navigate to the next page of winners
  // is not disabled then click it to open the next page
  try {
    await page.click('#table-prizewinner_next:not(.disabled)');
  } catch (error) {
    moreResults = false;
  }
}
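The try/catch works because page.click throws if nothing on the page matches the selector. If you prefer an explicit check, page.$ returns null when no element matches, so the end of the loop could equally be written like this (an alternative sketch, not what the final script uses):

const nextButton = await page.$('#table-prizewinner_next:not(.disabled)');

if (nextButton) {
  await nextButton.click(); // open the next page of winners
} else {
  moreResults = false;      // no enabled next button, so this is the last page
}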

Extracting Data from the Page

To extract data from the page, DOM methods such as querySelectorAll and array methods such as map and filter are your friends.

page.evaluate allows you to inject a function to be run in the page’s context, letting you use the methods above to extract the data.
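For example, a trivial use of page.evaluate to pull a single value out of the page:

const title = await page.evaluate(() => document.title);

Note that the return value is passed back from the browser context to Node, so it must be serialisable.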

Here is the function I pass to page.evaluate:

() => {
  // collect every row in the winners table
  const rowNodeList = document.querySelectorAll('#table-prizewinner tr');
  const rowArray = Array.from(rowNodeList);

  // skip the header row, then turn each data row into an object
  return rowArray.slice(1).map(tr => {
    const dataNodeList = tr.querySelectorAll('td');
    const dataArray = Array.from(dataNodeList);
    const [ prizeValue, winningBond, holding, area, bondValue, purchased ] = dataArray.map(td => td.textContent);

    return {
      prizeValue,
      winningBond,
      holding,
      area,
      bondValue,
      purchased
    };
  });
}

For each row in the winners’ table (skipping the header row), it extracts the data cells, reads their text content and packs the values into a JavaScript object.

Determining Relevant Data using Regexes

Regexes are another powerful tool for extracting data during scraping. Using capture groups in particular can be an effective way to parse the content of a web page.
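As a standalone illustration (not part of the scraper), a capture group can pull the amount out of a prize value string:

const match = '£1,000,000'.match(/£([\d,]+)/);
const amount = match[1]; // '1,000,000'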

For this particular scraper I use a regex to test whether a high value winner is located in an area of interest. I first create an array of area patterns from the command line arguments.

const AREA_PATTERNS = process.argv.slice(2).map(location => new RegExp(location, 'i'));

I can then test the winner objects I created earlier against the patterns to filter out irrelevant winners.

const areaWinners = winners.filter(winner => {
  return AREA_PATTERNS.some(pattern => pattern.test(winner.area));
});
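Since the patterns are case-insensitive and unanchored, an argument like london will match an area value of London regardless of case, and would also match as a substring of a longer area name.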

The Finished Script

const puppeteer = require('puppeteer');

const PRIZE_CHECKER = 'https://www.nsandi.com/prize-checker';
const AREA_PATTERNS = process.argv.slice(2).map(location => new RegExp(location, 'i'));

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(PRIZE_CHECKER);
  await page.click('#high-value-winners'); // open the table of winners

  let moreResults = true;

  const winners = [];

  while (moreResults) {
    const newWinners = await page.evaluate(() => {
      // collect every row in the winners table
      const rowNodeList = document.querySelectorAll('#table-prizewinner tr');
      const rowArray = Array.from(rowNodeList);

      // skip the header row, then turn each data row into an object
      return rowArray.slice(1).map(tr => {
        const dataNodeList = tr.querySelectorAll('td');
        const dataArray = Array.from(dataNodeList);
        const [ prizeValue, winningBond, holding, area, bondValue, purchased ] = dataArray.map(td => td.textContent);

        return {
          prizeValue,
          winningBond,
          holding,
          area,
          bondValue,
          purchased
        };
      });
    });

    winners.push(...newWinners);

    // if the button to navigate to the next page of winners
    // is not disabled then click it to open the next page
    try {
      await page.click('#table-prizewinner_next:not(.disabled)');
    } catch (error) {
      moreResults = false;
    }
  }

  const areaWinners = filterWinners(winners);

  outputWinners(areaWinners);

  await browser.close();
})();

function filterWinners(winners) {
  return winners.filter(winner => {
    return AREA_PATTERNS.some(pattern => pattern.test(winner.area));
  });
}

function outputWinners(winners) {
  if (winners.length) {
    console.log(JSON.stringify(winners, null, 2));
  } else {
    console.log('No winners in specified locations');
  }
}

Running the Script

If I were interested in high value winners in Scotland and London, I could run the script as follows:

node index.js Scotland London
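Any matching winners are printed as formatted JSON; the output takes this shape (the values here are placeholders, not real results):

[
  {
    "prizeValue": "£100,000",
    "winningBond": "...",
    "holding": "...",
    "area": "London",
    "bondValue": "...",
    "purchased": "..."
  }
]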