Justin Duke

Crawling Tumblr with multiprocessing

Your mission, should you choose to accept it

Let’s say you want to crawl Tumblr.

I’m choosing Tumblr here because it’s a relatively interesting example: unlike Twitter, which is super transparent about its social graph, or Facebook, whose social graph is fairly dense across geographic spectra 1, not much is known about the Tumblr social graph. Plus, getting follower and following lists programmatically is impossible for anyone besides yourself, barring OAuth (which doesn’t exactly jibe with the whole crawling thing).

However, Tumblr does expose one thing: post lists (this makes sense, since all usernames expose them via username.tumblr.com anyway). Unless a post is original content 2, it carries a source_url field, which is sometimes an external link but usually points back to someone else’s tumblr. So it’s not a complete guarantee, but it’s a pretty strong proxy.

So let’s plot out a general plan of attack:

  1. Start with two lists: crawlable, which is a list of all the tumblrs we can check for posts; and crawled, which is just a list of all tumblrs we’ve come across.
  2. Populate crawlable with a single root node to check for posts first – in this case, my tumblr should serve nicely.
  3. Pop the first item in crawlable – so, to start, jmduke – and grab all 3 of its posts.
  4. For each post we grab, extract the name of the tumblr (if it exists). If we have already looked up that tumblr before, ignore it. Otherwise, add it to crawlable and crawled.
  5. If, at the end of this, crawlable is empty: yay! You’re done! Otherwise, go back to step 3.

Tumblr actually maintains PyTumblr, a first-party API client library. Using it, we can easily write something like this:

import logging

import pytumblr

# Register an app at https://www.tumblr.com/oauth/apps to get a consumer key.
client = pytumblr.TumblrRestClient("YOUR_CONSUMER_KEY")

crawlable_names = set()
crawled_names = set()


def process_username(username):

    # Grab posts from tumblr.
    response = client.posts(username)

    # Sometimes the usernames are bad; let's just ignore them.
    if 'posts' not in response:
        logging.warning("Posts not found for name: {}".format(username))
        return

    posts = response['posts']
    for post in posts:

        # Only grab reblogs.
        if 'source_url' not in post:
            continue
        url = post['source_url']

        # Don't go circular.
        if username in url:
            continue

        # Sources can be external, so ignore those.
        if 'tumblr' not in url or "www.tumblr.com" in url:
            continue

        # Should look like "http://{username}.tumblr.com/{otherstuff}".
        # (Regex is probably a better way to do this.)
        try:
            new_name = url.split(".tumblr.com")[0]
            if "https" in new_name:
                new_name = new_name.split("https://")[1]
            elif "http" in new_name:
                new_name = new_name.split("http://")[1]
            if "www" == new_name[0:3]:
                new_name = new_name.split(".")[1]
        except IndexError:
            logging.warning("Can't find username in url: {}".format(url))
            continue

        if new_name not in crawled_names:
            # Add it to crawled_names immediately to make sure other threads don't try it.
            crawlable_names.add(new_name)
            crawled_names.add(new_name)
            logging.info("Found new username: {}".format(new_name))
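As the comment in the snippet admits, a regex makes the username extraction less fiddly. A sketch (the helper name is mine, not from the original script):

```python
import re

# Should match "http(s)://{username}.tumblr.com/{otherstuff}",
# with an optional leading "www.".
TUMBLR_URL = re.compile(r"https?://(?:www\.)?([^./]+)\.tumblr\.com")

def extract_username(url):
    match = TUMBLR_URL.search(url)
    if not match or match.group(1) == "www":
        return None  # external link, or the bare www.tumblr.com domain
    return match.group(1)

print(extract_username("http://jmduke.tumblr.com/post/123"))  # jmduke
print(extract_username("https://example.com/whatever"))       # None
```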

Nothing too crazy, right? And all we need is some scaffolding to continuously feed in usernames from crawlable_names:

if __name__ == "__main__":
    crawlable_names.add("jmduke")
    crawled_names.add("jmduke")  # the seed counts as crawled, too
    while len(crawled_names) < 5000:
        username = crawlable_names.pop()
        process_username(username)

    with open("tumblr_usernames.csv", "w", encoding="utf-8") as outfile:
        outfile.write("\n".join(crawled_names))

As you can see, I decided to stop at 5000 just to see how long it would take.

I run this script with the lovely cProfile module and discover:

[jmduke:~]$ python -m cProfile -s cumtime crawl_tumblr.py
81769 function calls (81279 primitive calls) in 423.316 seconds

Oof, that’s slow. The bottleneck here is clearly the API call, seeing as we’re not doing much of anything else: cProfile says we spend around 925 ms in api_helper.py per call, which is a total bummer.

No matter how we slice it, grabbing the post is gonna take around a second, and if we extrapolate from the small run we can only get around twelve usernames per second. That’s… unfortunate, especially if we want to build a substantial dataset of at least a million or so nodes.
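Back-of-the-envelope, with the numbers from the profile above, that rate makes a million-node dataset a day-long job:

```python
names, seconds = 5000, 423.316

rate = names / seconds                        # usernames per second
hours_per_million = 1_000_000 / rate / 3600   # hours to hit a million nodes

print(round(rate, 1))               # 11.8
print(round(hours_per_million, 1))  # 23.5
```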

So: what’s the solution?

Multithreading is fucking awesome

I’ve never really played around with multithreading in Python. I’ve heard grave mumblings about the GIL and a general lack of support for concurrency, but nothing super positive, and I’ve never had a real need for it. But this seemed like as good an opportunity as any to play around with it, right?

After all, each API call is fairly autonomous: once we have a couple dozen crawlable names, we should be able to spin up threads and make each call individually: even if we’re not speeding up the actual response time, we’re firing off enough requests to make up for it.

But wait. Don’t you need to worry about thread safety? How do you keep track of the crawled names across threads? Won’t this be crazy complicated?

Turns out – nah. There’s a bunch of scaffolding you need to build, but nothing too crazy – and all of it is first party.

First, we convert the set of crawlable names to a Queue, which is a synchronized queue class that fits our use case perfectly:

from queue import Queue
crawlable_names = Queue()

Queues behave very much like you’d expect; the only differences are the names of the methods we need:

  • instead of calling crawlable_names.add(username), we call crawlable_names.put(username)
  • instead of calling len(crawlable_names), we call crawlable_names.qsize()
  • Most importantly: instead of calling crawlable_names.pop(), we call crawlable_names.get()
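To make the mapping concrete, a tiny sketch:

```python
from queue import Queue

names = Queue()
names.put("jmduke")   # was: names.add("jmduke")

print(names.qsize())  # was: len(names) -> 1
print(names.get())    # was: names.pop() -> jmduke
```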

Next, we have to make a worker method. To avoid diving too deep into threading jargon: this method is basically designed to be run over and over by a thread until the thread terminates. So ours looks like this:

def worker():
    while True:
        username = crawlable_names.get()
        process_username(username)
        crawlable_names.task_done()

The only weird-ish thing here is the task_done() call, which is our way of telling the Queue that we’ve finished messing around with the last item we got and we’re ready to move on. The other missing piece is actually creating the threads:
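(Aside: task_done() pairs with Queue.join(), which blocks until every item that’s been put() has had a matching task_done() – handy if you’d rather wait for the queue to drain than sleep in a loop. A toy sketch with a doubling worker standing in for process_username:)

```python
from queue import Queue
from threading import Thread

q = Queue()
results = []

def worker():
    while True:
        item = q.get()
        results.append(item * 2)  # stand-in for process_username
        q.task_done()

for _ in range(4):
    Thread(target=worker, daemon=True).start()

for i in range(10):
    q.put(i)

q.join()  # returns once every item has been marked task_done()
print(sorted(results))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```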

from threading import Thread

def spin_threads():
    thread_count = 50
    for i in range(thread_count):
        thread = Thread(target=worker)
        thread.daemon = True  # so Ctrl+C in the main thread can actually kill the program
        thread.start()

Nothing too crazy here, either. We wrap it all together by rewriting the main invocation:

if __name__ == "__main__":
    crawlable_names.put("jmduke")
    spin_threads()
    try:
        # active_count() includes the main thread, so > 1 means workers are alive.
        while active_count() > 1:
            time.sleep(0.1)
    except KeyboardInterrupt:
        pass

Aaaand we give it a whirl with cProfile again:

[jmduke:~]$ python -m cProfile -s cumtime crawl_tumblr_multithreaded.py
81769 function calls (81279 primitive calls) in 15.423 seconds

Welp. I was legitimately not expecting it to be that effective: it doesn’t scale perfectly linearly, due to the overhead of managing the threads, but a 50x increase in thread count brings with it a 27.4x increase in throughput (423.3 seconds down to 15.4). Pretty great.

Other cool jank to try

So, here is the final script – plus some convenience to let you actually terminate it with Ctrl+C. Run it, DDOS Tumblr, whatever.

I think there’s inherent value in being able to summon millions of usernames (whether for benign or nefarious ends), but there are a number of things you could do with this meager code:

  • You can easily rip out the whole Tumblr part of it altogether and replace it with your network of choice. Twitter comes to mind, due to how easy it is to crawl.
  • Right now, all it’s doing is printing out usernames, as opposed to actually constructing a graph of connections – which would be infinitely cooler.
  • At some point, there has to be diminishing returns on the number of threads we spin up, right? Since you can actually import cProfile within a script, it would be super-neat to figure that out programmatically.
  • If we were really trying to be speed demons: the bottleneck here is gonna be the number of threads on a single machine, since all of the queue synchronization happens in-process. We could plug this into a queuing service – like SQS – to spread the work across a bunch of random hosts and scale that way as well.
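On that profiling bullet: cProfile can indeed be driven from inside a script rather than from the command line, which is what would make a sweep over thread counts possible. A minimal sketch (the crawl stand-in here is fake):

```python
import cProfile
import io
import pstats
import time

def crawl():
    time.sleep(0.05)  # stand-in for the real crawl

profiler = cProfile.Profile()
profiler.enable()
crawl()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
print("sleep" in out.getvalue())  # True: the sleep call shows up in the stats
```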

(For my part: I’m going to try to grab around two million connections and see if I can get some Dunbar’s number insights from them. But more about that in 2015.)

Hope you liked this! Shoot me any questions you have in the comments.


  1. If you’re in Seattle, you’ll be Facebook friends with lots of other people in Seattle, etc. etc. [return]
  2. Spoiler alert: this is very rare. [return]
  3. I say all, but in reality I’m only grabbing the first twenty or so, as Tumblr’s API paginates the posts. This is due solely to laziness on my part. [return]
© 2017 Justin Duke • All rights reserved • I hope you have a nice day.