Screen scraping with x-ray

Screen scraping is, I think, something all web developers do, but we rarely talk about it. Probably because it's some combination of boring and dirty: most of the time the code works for exactly one web page (or a set of pages with the same structure), and in all likelihood it only works in the short term. Nevertheless, I figured I'd write a little about x-ray, the latest screen scraping tool I've been using.

I'm coming at this having used a few different screen scraping tools over the years. Mostly Python-based (urllib, Requests, Scrapy), but also Node libraries like jsdom and cheerio. And of course I've tried regexes to parse HTML, but that went nowhere.

Back to x-ray. It's on npm, so installing it couldn't be easier. The repo's readme showcases some simple examples. The selector API examples are nice, but in practice I find myself using the (url, scope, selector) form to pull what I want from a page.
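In that form, the call looks something like this (a minimal sketch based on the readme; the URL and selectors here are just placeholders):

```javascript
// x(url, scope, selector) returns a function that takes a callback
// (it can also stream or write straight to a file).
var Xray = require('x-ray');
var x = Xray();

// Pull the text of every list-item link on a (hypothetical) page.
x('https://example.com', 'ul li', ['a'])(function (err, titles) {
  if (err) throw err;
  console.log(titles); // one entry per li matched by the scope
});
```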

Let's look at an example. Earlier today, I saw something on Twitter about reservoirs in California, so I started looking around. Like so many other Internet journeys, I ended up on Wikipedia, specifically the List of largest reservoirs of California. Everything about the reservoirs is in a table, so I figured it was a good time to throw the page at x-ray and see how long it would take me to get a GeoJSON file with a point for each reservoir. If you want to jump ahead, go look at the gist of the final script and the resulting GeoJSON.

I started by poking around the DOM in Chrome. I settled on 'table.wikitable.sortable tr' as the selector for my scope and then used an object to specify which pieces I wanted from each row:

[{
  name: 'td:nth-child(1)',
  county: 'td:nth-child(2)',
  coordinates: 'td:nth-child(3) span.geo-nondefault span span.geo',
  volumeAcreFeet: 'td:nth-child(4)',
  volumeKm: 'td:nth-child(5)',
  outflow: 'td:nth-child(6)',
  dam: 'td:nth-child(7)',
  image: 'td:nth-child(8) img@src'
}]

Each property value is a selector that's applied to every element matching the scope passed as the second argument to x-ray. For each element that matches the scope, the result contains one object whose property values are the results of the corresponding selectors.
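Wired together, the whole fetch might look like this (a sketch, reusing the scope and selector object above; the URL is the Wikipedia article mentioned earlier):

```javascript
var Xray = require('x-ray');
var x = Xray();

var url = 'https://en.wikipedia.org/wiki/List_of_largest_reservoirs_of_California';

x(url, 'table.wikitable.sortable tr', [{
  name: 'td:nth-child(1)',
  county: 'td:nth-child(2)',
  coordinates: 'td:nth-child(3) span.geo-nondefault span span.geo',
  volumeAcreFeet: 'td:nth-child(4)',
  volumeKm: 'td:nth-child(5)',
  outflow: 'td:nth-child(6)',
  dam: 'td:nth-child(7)',
  image: 'td:nth-child(8) img@src'
}])(function (err, rows) {
  if (err) throw err;
  // rows is an array of plain objects, one per <tr> matched by the scope
  console.log(rows.length + ' rows scraped');
});
```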

Once x-ray fetches the page and processes it, the callback gets everything as an array of objects. Which is great! But there's still a little more to do before the script can spit out a GeoJSON file:

  1. Convert each coordinate string into a pair of numbers.
  2. Make a point feature from each object.
  3. Make a feature collection of all the points.
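Step one might look like this (a sketch that assumes the geo span's text is a "lat; lng" pair, which is how Wikipedia's geo microformat renders coordinates):

```javascript
// Parse a "lat; lng" string (e.g. "40.71861; -122.41944") from
// Wikipedia's geo microformat into a [lng, lat] pair of numbers,
// which is the coordinate order GeoJSON expects.
function parseCoordinates(geoText) {
  var parts = geoText.split(';').map(function (s) {
    return parseFloat(s.trim());
  });
  var lat = parts[0];
  var lng = parts[1];
  return [lng, lat];
}
```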

Number one is easy enough, and two and three are easy as well with turf's point and featurecollection modules. Once we've got a GeoJSON structure, write it to a file and call it a day. If you missed it before, there's a gist of the code for this (all 40ish lines of it) and the GeoJSON is in a gist as well.