Web Scraping: Unleash your Internet Viking

Speaker: Andrew Collier

Type: Tutorial

Room: Tugela Room

Time: Oct 04 (Wed), 14:00

Duration: 4:00

Often the data you want is available somewhere on the internet. It might all be on one page (if you're lucky!) or distributed across many pages (possibly hundreds or thousands of pages!).

But you want those data consolidated locally. Not on a server in some distant land, but right here on your hardware. And in a convenient format. CSV or JSON, perhaps? Certainly not HTML!

What would Ragnar do? He'd go out, grab those data and bring them home.

The contemporary Internet Viking uses Web Scraping techniques to systematically extract information from web pages. This tutorial will demonstrate the process of web scraping. This is the battle plan:

  • Sharpening the Axe: Understanding the structure of an HTML document.
  • Preparing the Longships: Using the DOM to select HTML elements.
  • Doing Battle: Manual extraction of data from an HTML document.
  • Stashing the Treasure: Storing data as CSV or JSON.
  • The Journey Home: Automated scraping with Scrapy.
  • Triumphant Return: Driving a browser using Selenium.

The first two components will be fairly brief, covering this material at a high level. We'll dig much deeper into the latter topics.
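
To give a flavour of the "Doing Battle" and "Stashing the Treasure" steps, here is a minimal sketch using only the Python standard library. It is not the tutorial's own material: the HTML snippet, the `TableParser` class and the `scrape_table` and `to_csv` helpers are all hypothetical names invented for illustration, and the sample table stands in for a page you would normally download first.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Illustrative HTML only (an assumption): a stand-in for a downloaded page.
HTML = """
<html><body>
<table id="raids">
  <tr><th>Target</th><th>Year</th></tr>
  <tr><td>Lindisfarne</td><td>793</td></tr>
  <tr><td>Paris</td><td>845</td></tr>
</table>
</body></html>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def to_csv(records):
    """Serialise a list of dicts to CSV text, with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = scrape_table(HTML)
csv_text = to_csv(records)
json_text = json.dumps(records, indent=2)
```

Writing `csv_text` or `json_text` to a file completes the haul. For real pages, a dedicated parsing library or a framework such as Scrapy would usually replace the hand-rolled parser; the standard library is used here only to keep the sketch self-contained.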

By the end of the tutorial you should be able to easily (and confidently) pillage and plunder large swathes of the internet.

Come along and make Ragnar proud. Tyr! Odin owns you all!

This tutorial will be suitable for Vikings with low to moderate levels of Python experience. We'll be working from a VirtualBox image to ensure that everybody has the same infrastructure and (hopefully) avoid most technical issues.