
Scraping websites with Python

Monday, 6 September 2010 · python

How to extract data from sites without an API with just a few lines of Python and no regular expressions.


“Screen scraping” is a term I first bumped into over a decade ago (now I feel ancient and I’m only 26). In the context of the internet, it means visiting an external website, extracting data from it, and turning that into something you can use in your own application or website.

Nowadays we might try to find an API that lets us get just the data we need in a nice reusable format (XML, JSON, etc.), but there are still far too many big websites that don’t make their data available that way. We need to turn back to the old ways and “scrape” the content out.

I’m choosing Python because it rocks. It’s easier than English to understand and it has some of the best libraries in existence. If you don’t have Python, get it… Once you have it, we just need one library: BeautifulSoup. If you don’t have that, download and install it into your Python path. If you’re on Ubuntu, that’s just sudo apt-get install python-beautifulsoup.

Then load up the Python console (at a command line: python) and fire off a couple of imports we’ll need:

import urllib2
from BeautifulSoup import BeautifulSoup

The data I’m going to focus on getting is the latest threads from the Ubuntu StackExchange support site. They do have an API, but the questions page is relatively clean markup, so it makes a good example to learn from.

Let’s download the page using urllib2 and keep its source in a variable:

url = 'http://ubuntu.stackexchange.com/questions?sort=newest'
source = urllib2.urlopen(url)
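
One aside: some servers refuse requests from urllib2’s default user agent. If that happens to you, urllib2.Request lets you send your own headers; a minimal sketch (the user agent string here is just an example, not something this page requires):

request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
source = urllib2.urlopen(request)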

If you just type source.read() in your Python console, you’ll see a colossal mess of HTML. One caveat: urlopen returns a file-like object, so read() consumes it. If you do read it here, keep the string and pass that to BeautifulSoup later, rather than the spent response object. Either way, we now need to find where the bit we want sits within the document.

I turned to Firebug in Firefox. Using that, you can see that all the thread titles are in <h3> tags, each containing an <a> tag.

In the olden days I might have crafted a stodgy regular expression to look for the pattern but HTML just isn’t a good candidate for using regular expressions. HTML can be wildly variable and you end up making a complete hash of things… And one day the format changes the slightest bit and you’re left having to start from scratch. Even using an XML parser can spell disaster if the developer of the website hasn’t sanitised everything.

Today we fall back on a lovely little library called BeautifulSoup. This parses the data using a mix of XML parsing, regexes and plain string magic. If there’s viable HTML in there, it’ll parse it and hand it to you in a form you can actually deal with. We’ll start by turning the source into a BeautifulSoup object:

soup = BeautifulSoup(source)
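
If you want to sanity-check what the parser made of the page, soup.prettify() re-indents the whole tree as a string; printing a slice of it is an easy way to eyeball the structure:

print soup.prettify()[:500]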

The most useful function in BeautifulSoup has to be .findAll(...). It returns a list of all the elements matching your call. You can pass it a plain string, like we’re about to, or even a regex if you have really strange requirements. Let’s get all the <h3> tags:

h3s = soup.findAll('h3')
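
findAll isn’t limited to bare tag names either; you can filter on attributes, or hand it a compiled regex. A quick sketch (the class name here is hypothetical, not taken from the real page):

import re

answered = soup.findAll('div', {'class': 'answered'})    # hypothetical class name
questions = soup.findAll('a', href=re.compile('^/questions/'))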

That findAll('h3') call leaves us with a list of soup objects that still have link objects inside them. Assuming we also want to display the URL, let’s scoot through them and print the title and link for each thread. As the first child element of each <h3> here is an <a> link, it’s represented by .contents[0]:

for h3 in h3s:
    title = h3.contents[0].string
    link = h3.contents[0]['href']
    print '%s\n%s\n' % (title, link)

This gives us the following:

Change font used in the tab-bar of TreeStyleTab
/questions/3732/change-font-used-in-the-tab-bar-of-treestyletab

From the time Ubuntu font is released, how long will it take for major websites(facebook, yahoo, google etc.) to support it?
/questions/3727/from-the-time-ubuntu-font-is-released-how-long-will-it-take-for-major-websitesf
...
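
One word of warning about .contents[0]: if the markup ever grows whitespace or another node ahead of the <a>, indexing by position breaks. h3.find('a') jumps straight to the first <a> wherever it sits, so a more defensive sketch of the same loop would be:

for h3 in h3s:
    a = h3.find('a')
    if a is not None:
        print '%s\n%s\n' % (a.string, a['href'])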

If we just wanted the first five, we’d change the for-loop definition to:

for h3 in h3s[:5]:
    # ...

Remember that indexes in Python are zero-based (the first item has index 0). [:5] is shorthand for [0:5]. If we wanted the second through sixth items, we’d use [1:6]. The last five items would be [-5:] (the -5 means “five back from the end” and nothing after the colon means “to the end”). List slicing is its own topic ;)
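
To make those concrete:

h3s[:5]     # first five items, same as h3s[0:5]
h3s[1:6]    # second through sixth items
h3s[-5:]    # last five items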

Of course, if you want to use those links on another website, you’d need to prefix them with http://ubuntu.stackexchange.com (see the sketch below). But there you go. Quick. Simple. Screen scraping: done. And not a regex in sight.
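
Python can even do that prefixing for you: urlparse.urljoin from the standard library resolves each relative link against the page we fetched. A minimal sketch, reusing the url variable from earlier:

import urlparse

for h3 in h3s[:5]:
    link = h3.contents[0]['href']
    print urlparse.urljoin(url, link)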