I’m a sucker for Reddit, warts and all – communities like /r/python and /r/nba (and, yes, /r/aww) keep me coming back even though I’ve sworn off the default subreddits for quite some time. One of their revenue streams comes from “Reddit Gold”, which you can buy for yourself or a fellow redditor for $4 – basically a kudos to reward a particularly insightful or hilarious comment, like below:

Truncated, but you get the point.

I thought it would be somewhat entertaining to see which subreddits are most generous with their gold.

(Plus, it’d give me a good reason to bust out my new scraping project, panther, which you can see here.)

Reddit exposes a [browsable feed of all gilded comments](http://www.reddit.com/r/all/gilded), and I thought it’d be easy enough to just crawl through that. Panther’s main function, prowl, takes three arguments – a starting url, a CSS selector (or list of selectors) from which to extract content, and a CSS selector (or list of selectors) from which to generate new links to scrape. You pass it all of this and it returns a generator, which we can iterate through to build a histogram:

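import panther

url = "http://www.reddit.com/r/all/gilded"
gilded_selector = ".gilded.comment .subreddit"
next_page_selector = ".nextprev a[rel='nofollow next']"
subreddits = {}

# Each iteration yields the matches scraped from one page.
generator = panther.prowl(url, gilded_selector, next_page_selector, delay=3)
for results in generator:
    for subreddit in results:
        # Tally one gilded comment for this subreddit.
        subreddits[subreddit.text] = subreddits.get(subreddit.text, 0) + 1
print subreddits
print sum(subreddits.values())  # total gilded comments seen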

And after a few minutes, we get:

{'kpop': 1, 'Smite': 1, 'AskVet': 1, 'golf': 1, 'MapPorn': 3, 'softwareswap': 1, 'whatsthisbug': 1, 'pettyrevenge': 1, 'excel': 2, 'CCW': 1, 'Fitness': 1, 'learndota2': 1, 'Poetry': 1, 'dataisugly': 1, 'quotes': 1, 'wsgy': 5, 'dataisbeautiful': 3, 'nihilism': 1, 'Random_Acts_Of_Amazon': 17, 'relationships': 5, 'GlobalOffensive': 2, 'learnpython': 1, 'freediving': 1, 'unixporn': 2, 'Cardinals': 1, 'harrypotter': 1, 'blog': 14, 'picrequests': 6, 'whowouldwin': 2, 'woahdude': 3, 'WTF': 13, 'bourbon': 1, 'soccer': 4, 'sweden': 1, 'AnimalsBeingDerps': 1, 'MensRights': 1, 'Reds': 1, 'exmormon': 7, 'NewOrleans': 1, 'photoshopbattles': 3, 'Guildwars2': 1, 'freegold': 23, # truncated because you get the point, it's a big dictionary}

> 962

Hm. That seems like an awfully small total – fewer than one thousand? In disbelief, I reran it, printing the urls as they were traversed. The last one was http://www.reddit.com/r/all/gilded?count=950&after=t1_ch8yajz [1] – which, sure enough, didn’t have a ‘next’ link. Ugh – while it gives a rough idea of the distribution of gold, it’s hardly an adequate sample.
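To make that cutoff concrete, here is a minimal sketch of a crawl loop that follows ‘next’ links until a page has none. This is not panther’s actual implementation: `crawl`, `get_page`, and the `FAKE_PAGES` data are hypothetical stand-ins, so the example runs without touching the network.

```python
# Hypothetical stand-in for a paginated listing: each URL maps to
# (items found on that page, URL of the next page or None).
FAKE_PAGES = {
    "/gilded": (["python", "nba"], "/gilded?count=25"),
    "/gilded?count=25": (["aww", "python"], "/gilded?count=50"),
    "/gilded?count=50": (["nba"], None),  # no 'next' link: the crawl ends here
}


def get_page(url):
    """Pretend to fetch a page and return (items, next_url)."""
    return FAKE_PAGES[url]


def crawl(start_url, fetch):
    """Yield each page's items, following 'next' links until none remain."""
    url = start_url
    seen = set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against pagination loops
        items, url = fetch(url)
        yield items


pages = list(crawl("/gilded", get_page))
total = sum(len(items) for items in pages)
```

However large the underlying data set is, a loop like this can only ever see as many pages as the listing is willing to link to, which is why the total stalls just short of a thousand.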

I decided to approach the problem from a different angle. Instead of grabbing them from r/all/, we can view the gilded comments for a specific subreddit and crawl from there. So, first, we need a list of popular subreddits. We’ll get this using panther’s find() method, which takes a url and a CSS selector (or list thereof) and returns matches (basically prowl() without the crawling):

# Grab the top 125 subreddits.
url = "http://www.redditlist.com/"
links = panther.find(url, "#yw2 td:nth-child(2) a")
urls  = map(lambda a: a.get('href') + "gilded", links)
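The concatenation above assumes every href ends in a trailing slash. A slightly more defensive sketch, using a hypothetical helper named `gilded_url` and made-up hrefs rather than real scraped values:

```python
def gilded_url(href):
    """Append 'gilded' to a subreddit URL, with or without a trailing slash."""
    return href.rstrip("/") + "/gilded"


# Made-up example hrefs, not real scraped values.
hrefs = ["http://www.reddit.com/r/python/", "http://www.reddit.com/r/nba"]
gilded_urls = [gilded_url(h) for h in hrefs]
```

Either form of href ends up as `.../r/<name>/gilded`, so the subreddit name always sits in the second-to-last path segment.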

And now we throw that list of jumping-off points into panther:

generator = panther.prowl(urls, gilded_selector, next_page_selector, delay=3)
subreddits = {}
for results in generator:
    for subreddit in results:
        subreddits[subreddit.text] = subreddits.get(subreddit.text, 0) + 1

This – after more than a few minutes – has the desired effect!
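As an aside, the manual `get(..., 0) + 1` tally can also be written with `collections.Counter` from the standard library. The sketch below uses made-up page data in place of the generator; in the real loop you would feed it each match’s `.text` instead.

```python
from collections import Counter

# Stand-in for what the generator yields: one list of subreddit names per page.
pages = [["python", "nba", "python"], ["aww", "nba", "python"]]

counts = Counter()
for names in pages:
    counts.update(names)  # adds one per occurrence of each name

# The two most frequent names, as (name, count) pairs.
top = counts.most_common(2)
```

`most_common` also saves the separate sorting step when all you want is the leaders.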

from nvd3 import discreteBarChart
import operator

chart = discreteBarChart(name="Reddit Gold", height=300, width=800)
# Sort (subreddit, count) pairs by count, ascending.
sorted_subreddits = sorted(subreddits.iteritems(), key=operator.itemgetter(1))
chart.add_serie(name="gold", y=[a[1] for a in sorted_subreddits], x=[a[0] for a in sorted_subreddits])

It’s worth noting that we ran into the same weird thousand-comment limit. Still, we get a very interesting distribution curve across the various subreddits. However, we don’t get a good idea of relative generosity: a 600,000-subscriber subreddit with 700 recorded gildings [2] is hardly as impressive as a subreddit with only 300 gildings but 50,000 subscribers.
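To put numbers on that comparison, here is the arithmetic for the two hypothetical subreddits, expressed as gildings per thousand subscribers (`per_thousand` is just an illustrative helper):

```python
# The two hypothetical subreddits from the paragraph above.
big = {"subscribers": 600000, "gildings": 700}
small = {"subscribers": 50000, "gildings": 300}


def per_thousand(sub):
    """Gildings per 1,000 subscribers."""
    return 1000.0 * sub["gildings"] / sub["subscribers"]


big_rate = per_thousand(big)      # a bit over 1 per thousand
small_rate = per_thousand(small)  # 6 per thousand
```

Per subscriber, the smaller subreddit comes out roughly five times as generous, despite having fewer than half the gildings.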

Let’s solve this by grabbing the subscriber counts:

subscribers = {}
for subreddit in urls:
    name = subreddit.split("/")[-2] # whee, magic
    subscribers[name] = panther.find(subreddit, ".subscribers .number")[0].text
# Gildings per subscriber; subreddits with no recorded gildings count as zero.
normalized = {x: subreddits.get(x, 0) / float(subscribers[x].replace(",","")) for x in subscribers}
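The normalization leans on two small conversions: turning a scraped string like "600,000" into a number, and dividing without blowing up on a missing or zero count. A minimal sketch of both, with hypothetical helper names and made-up numbers:

```python
def parse_count(text):
    """Turn a scraped count such as '600,000' into an int."""
    return int(text.replace(",", "").strip())


def gold_per_subscriber(gildings, subscriber_text):
    """Gildings divided by subscribers; 0.0 if the subscriber count is zero."""
    subs = parse_count(subscriber_text)
    return float(gildings) / subs if subs else 0.0


# Made-up numbers for illustration only.
gold = {"python": 300, "nba": 700}
subscriber_text = {"python": "50,000", "nba": "600,000", "aww": "1,000,000"}

# A subreddit with no recorded gildings gets a rate of zero instead of a KeyError.
rates = {
    name: gold_per_subscriber(gold.get(name, 0), text)
    for name, text in subscriber_text.items()
}
```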

Because small numbers like 0.000804813960057776 are pretty lame to read, we’ll normalize the top value to one:

top = max(normalized.values())
normalized = {x: normalized[x] / top for x in normalized}
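The same rescaling, sketched as a small standalone function that also copes with an empty or all-zero input. The name `scale_to_top` and the sample rates are made up for illustration:

```python
def scale_to_top(values):
    """Rescale a dict of non-negative numbers so the largest becomes 1.0."""
    top = max(values.values()) if values else 0
    if not top:
        # Nothing to scale against: return zeros rather than divide by zero.
        return dict.fromkeys(values, 0.0)
    return {name: value / float(top) for name, value in values.items()}


# Made-up rates for illustration only.
rates = {"a": 0.008, "b": 0.004, "c": 0.0004}
scaled = scale_to_top(rates)
```

After scaling, every value reads as a fraction of the most generous subreddit’s rate, which is much easier on the eyes than a string of leading zeros.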

And then plotting it the same way as above, we get the following chart:

… And /r/circlejerk is the top subreddit. Of course.

Honorary shoutouts also go to /r/nba, /r/relationships, and /r/nfl [3], and jeers go to /r/books, /r/history, and /r/Art for being some of the stingiest donors. [4]

Hope you enjoyed this quick little post! Even if it was just a reason for me to take my little library for a spin and dig up a little information about some of the communities, it was a lot of fun. I’m honestly surprised this is an avenue Reddit hasn’t gone down before – considering the success of some of the friendly competitions they’ve thrown in the past, having a ‘war of donations’ of sorts would probably light a fire under a lot of these people.

  1. Pretty sure this is a transient URL, so you’re gonna have to take my word for it.
  2. Is that a word? Let’s pretend that’s a word.
  3. It stands to reason that people are most generous to strangers when they’re thinking about sports and love. But mainly sports.
  4. I’d make a joke about them not having the cash to spare, being liberal arts subreddits, but /r/business only barely edged them out.