Inspired by Tomasz Tunguz’s post on the underrated art of web crawling, I decided to crawl Svbtle, Dustin Curtis’s blog network/“magazine of the future” thingy. I chose Svbtle because, despite some of its shortcomings, it’s generally an interesting source of content.
I’ll be using Scrapy for the first time. Scrapy, as I discovered, is a pretty awesome crawling package for Python – it definitely embraces a ‘batteries included’ philosophy, and I’ll get into some of the details and tricky areas throughout the post.
1. What do I want to crawl?
Svbtle posts, like regular blog posts, aren’t exactly tricky to parse: they’ve got an author, a title, content, yadda yadda. I figured that I’d grab all of the discrete info that I could, to make my analysis a little more flexible down the road. I envisioned my end result as a .csv with a row for each post I could find – since I didn’t really set out with a definite analytical goal, I decided to err on the side of data:
- word count
- when it was published
- kudos (kudos, for the uninitiated, are Svbtle’s take on Likes/+1s/Retweets).
Plugging this into Scrapy is pretty easy: Scrapy takes an OOP-inspired approach to crawling, so defining what I want to find is as simple as creating a class for it:
    from scrapy.item import Item, Field

    class Article(Item):
        author = Field()
        url = Field()
        kudos = Field()
        word_count = Field()
        title = Field()
        published = Field()
Now, to actually populate these fields, we extend Scrapy’s CrawlSpider and, amongst a few other things, add a parse_article method that takes a Response object and spits out an Article. Scrapy’s default way of handling this is using HtmlXPathSelector, which is basically jQuery-style selection, except the selectors are XPath expressions with some weird syntax. For example:
- Grabbing all the links on a page is `//a`.
- Grabbing all the divs with a class of ‘foo’ is `//div[@class='foo']`.
- Grabbing all the second paragraphs of a given div is `//div/p[2]`.
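The three selections in the list above can be tried without Scrapy at all. This is only a sketch: it uses the standard library’s ElementTree, which supports a subset of XPath and wants a leading `.` on each path, and the little `PAGE` document is invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed document made up for this example.
PAGE = """
<html><body>
  <div class="foo"><p>first</p><p>second</p><a href="/one">one</a></div>
  <div class="bar"><p>alpha</p><p>beta</p><a href="/two">two</a></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# All the links on the page.
links = [a.get("href") for a in root.findall(".//a")]

# All the divs with a class of 'foo'.
foo_divs = root.findall(".//div[@class='foo']")

# The second paragraph of every div.
second_paragraphs = [p.text for p in root.findall(".//div/p[2]")]

print(links)
print(len(foo_divs))
print(second_paragraphs)
```

Real-world HTML is rarely well-formed XML, which is a good part of why a forgiving selector like Scrapy’s is worth having.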
Dustin did a lovely job of minimizing spaghetti code (at least outwardly) in Svbtle, and as a result writing selectors for all of our fields is easy:
    def parse_article(self, response):
        x = HtmlXPathSelector(response)
        article = Article()
        article['url'] = response.url
        # The blog's base URL (scheme + domain) stands in for the author.
        article['author'] = "/".join(response.url.split('/')[:3])
        # extract() returns a list, so take the first match each time.
        article['kudos'] = x.select("//div[@class='num']/text()").extract()[0]
        # Rough count: this splits the article's raw HTML, markup included.
        words = x.select("//article[@class='post']").extract()[0]
        article['word_count'] = len(words.split())
        title = x.select("//title/text()").extract()[0]
        article['title'] = title.split('|')[0].strip()
        article['published'] = x.select("//time/text()").extract()[0]
        return article
(Note: even if select() only yields one result, extract() still hands it back wrapped in a list.)
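Because results always come back wrapped in a sequence, a selector that matches nothing gives you an empty one, and indexing into it blows up. A small guard is one way to cope. `first` and `clean_title` below are hypothetical helpers of my own, not part of Scrapy, and the sample title is made up.

```python
def first(results, default=""):
    """Return the first extracted value, or `default` when nothing matched."""
    return results[0] if results else default


def clean_title(raw_title):
    """Keep the part of a '<post> | <blog>' page title before the first pipe."""
    return raw_title.split("|")[0].strip()


# Stand-ins for what extract() might return on a post page and a non-post page.
print(clean_title(first(["An Example Post | Some Blog"])))
print(repr(first([])))
```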
So, if we get the HTML for a Svbtle post, we can scrape all the good stuff and save it as an Article object! Yay! What’s next?
2. Where do I want to crawl?
Crawlers, in a way, work just like us: they need a jumping-off point (or root url) and they start madly clicking on hyperlinks from there. Ideally, our root url would be a link to every single Svbtle blog out there – unfortunately, no such page exists (Svbtle used to have a list, but it vanished a while back.)
Scrapy isn’t wonderful enough to automagically find every blog using the Svbtle backend – since blogs are hosted on individual domains (for example, Dustin’s blog is on dcurt.is, not dcurtis.svbtle.com), it’s going to be tricky to magically get a list of authors/blogs. For starters, let’s just use the main feed, which gives us twenty or so recent posts:
    start_urls = ["http://www.svbtle.com/feed"]

(start_urls is an attribute of CrawlSpider – Scrapy’s docs describe it as a list of URLs, so even a single URL goes in a list.)
Lastly, we give the spider a rule – the Scrapy documentation has a better explanation of rules than I ever could, but they’re essentially regular expressions that denote whether or not the spider ‘clicks’ on a link. These rules are generally based on the href of the link, but you can also use XPaths, like we did with parsing a minute ago, to specify which DOM elements to look at.
To keep things simple at first, we’re going to accept any link we find within the body of an article (crawling veterans will probably twitch a little bit at that) and point it towards our custom parsing method:
    rules = [
        Rule(
            SgmlLinkExtractor(
                allow=r'.*',
                restrict_xpaths="//article"
            ),
            callback='parse_article',
            follow=True
        )
    ]
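To make the rule a little less abstract, here’s a rough stand-in for the check a link extractor performs on each href. It’s plain Python with `re`, not Scrapy’s actual code; `should_follow`, the example URLs, and the tighter pattern are all made up for illustration.

```python
import re


def should_follow(url, allow=(r".*",), deny=()):
    """Return True if `url` matches any allow pattern and no deny pattern."""
    if any(re.search(pattern, url) for pattern in deny):
        return False
    return any(re.search(pattern, url) for pattern in allow)


# With allow=r'.*' every link passes, which is what the rule above asks for.
print(should_follow("http://example.com/some-story"))

# A tighter, hypothetical pattern only follows links on a single blog.
one_blog = (r"^http://dcurt\.is/",)
print(should_follow("http://example.com/some-story", allow=one_blog))
print(should_follow("http://dcurt.is/some-post", allow=one_blog))
```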
3. Firing up the web
And that’s all we have to do! That’s like, twenty lines of Python, and Scrapy handles all the gory details. We fire it up with the CLI:
scrapy runspider svbtle_scraper.py
And to output everything into a nice li’l CSV:
scrapy runspider svbtle_scraper.py -o scraped.csv -t csv
And now we wait!
4. Oh god what have I done
There’s a reason that the ‘web’ is such an apt metaphor for the internet: while the average web page only has around 7.5 links, every page you land on has its own 7.5 links, so the number of reachable pages multiplies with each hop thanks to the powers of exponential growth. (Quick napkin math: ten hops deep, 7.5 ** 10 ≈ 563 million.)
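The napkin math is easy to replay. This sketch assumes every page has exactly 7.5 links that all lead somewhere new, so treat it as an upper bound on how fast an unrestricted crawl fans out; `pages_at_depth` is just a name I picked for illustration.

```python
AVERAGE_LINKS = 7.5  # the per-page figure quoted above


def pages_at_depth(depth, links_per_page=AVERAGE_LINKS):
    """Pages reachable at exactly `depth` clicks from the root, assuming no overlap."""
    return links_per_page ** depth


for depth in (1, 3, 5, 10):
    print("depth %2d: about %d pages" % (depth, pages_at_depth(depth)))
```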
To give you an anecdotal proof of this: I run my poor, naive spider for the first time and, as one might expect from such a SV-centric set of bloggers, one of the posts I crawl has a TechCrunch quote. And, of course, the quote is cited: so my poor, naive spider suddenly starts to crawl TechCrunch, looking furiously for something that resembles a Svbtle post – and, after ten minutes and a few hundred requests, I’m down the rabbithole of a massive stack of hyperlinks.
I furiously mash Ctrl+C. This approach isn’t going to work – and besides, even if I play whack-a-mole by excluding bigger domains (by adding another rule), this isn’t going to be a feasible solution.
I took a step back. I knew what I needed: a definitive list of Svbtle-backed sites. If I pass those all as a massive rule, I can force the spider to stay only on those sites, like an overprotective mother too nervous to let their kid on AOL without parental controls (that metaphor might betray my age to the audience, but whatever).
But how do I get such a list?
5. Doing things the dumb (and easy) way
I did some poking around and found a Twitter stream that corresponds with Svbtle’s home page – and instead of 20 posts, it has 875! It’s probably not every author ever, but it’s much more than I had before. Now how do I grab all of the sites quickly?
The elegant way: Use Twitter’s API and grab all of the tweets in JSON (dealing with OAuth and the hellish v1.1 API along the way), then parse the tweets for the URLs. Hope I don’t hit the usage cap along the way!
My way: Keep scrolling down until I load all 875 tweets and then save the HTML file and parse it with XPath.
You know what? My way took ninety seconds and six lines of code. I stand by it.
    from scrapy.http import TextResponse
    from scrapy.selector import HtmlXPathSelector

    def get_svbtle_blogs():
        with open('svbtlefeed.html') as f:
            plainhtml = f.read()
        # Gotta create a fake 'Response' object so Scrapy's utils can scan it:
        # for some reason, they can't handle basic `str`s.
        fake_response = TextResponse('http://www.svbtle.com')
        fake_response = fake_response.replace(body=plainhtml)
        x = HtmlXPathSelector(fake_response)
        all_links = x.select("//a[@class='twitter-timeline-link']/@title").extract()
        # Keep only scheme + domain (e.g. 'http://dcurt.is') and de-duplicate.
        base_links = ["/".join(link.split('/')[:3]) for link in all_links]
        return list(set(base_links))
And we can plug this guy back into our extension of CrawlSpider – we’ll use it to both set a list of root urls AND restrict where our spider can go:
    start_urls = get_svbtle_blogs()
    rules = [
        Rule(
            SgmlLinkExtractor(
                allow=get_svbtle_blogs(),
            ),
            callback='parse_article',
            follow=True
        )
    ]
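One caveat worth sketching: `allow` treats each entry as a regular expression, so a base URL such as `http://dcurt.is` is matched as a pattern rather than compared as a domain. If I wanted a stricter check, comparing host names would look something like this. It’s stdlib-only Python 3, `allowed_hosts` and `stays_on_svbtle` are hypothetical names rather than Scrapy functions, and the second base link is invented.

```python
from urllib.parse import urlparse


def allowed_hosts(base_links):
    """Collect the lower-cased host name of every base URL."""
    return {urlparse(link).netloc.lower() for link in base_links}


def stays_on_svbtle(url, hosts):
    """True only when `url` lives on one of the known hosts."""
    return urlparse(url).netloc.lower() in hosts


# Example base links; the second one is made up for illustration.
hosts = allowed_hosts(["http://dcurt.is", "http://blog.example.com"])

print(stays_on_svbtle("http://dcurt.is/some-post", hosts))
print(stays_on_svbtle("http://techcrunch.com/some-article", hosts))
```

Scrapy’s link extractors also take an `allow_domains` argument, which may well be the tidier way to get the same effect inside the rule itself.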
6. Closing thoughts
This pretty much worked perfectly. It took me around fifteen minutes to successfully scrape fifteen hundredish posts; while there’s still some cleanup I need to do (Scrapy interpreted pagination in a relatively funky manner), I couldn’t be more pleased with the results. I highly recommend Scrapy: while you’ll have to do some tinkering on your own (the tutorial and documentation are a little out of date), it makes life much, much easier than trying to roll your own with Requests and BeautifulSoup.