Reddit API Tutorial Part 2: Getting all the submissions to any given subreddit

What’s up? Today we’re going to look at how to retrieve the stories (the official term for submissions or self posts) from any given subreddit. What we’re going to do is pretty simple: customize a URL with the proper subreddit and read the JSON object returned. It’s going to be a pretty short one. I’m going to attach the login code I’ve written along with the code we look at today, so you can copy and paste it into your IDE and start playing with it right away. Just make sure you’ve got all the required modules installed: ‘requests’, plus ‘json’ and ‘pprint’ from the standard library.

First we create a new function, called `subredditInfo` (feel free to choose better names) that takes the following arguments: ‘client’, which is simply a ‘requests.session’ object with the modhash saved in a cookie. We looked at how to make that session instance in an earlier post, so we won’t get into that again.

Next argument is ‘limit’, which limits the number of stories reddit sends back to you. I believe the default is 25, but I haven’t checked. The reddit API says that you can only fetch 100 links at a time, so to get more than that you’ll need to familiarize yourself with the ‘before’ and ‘after’ parameters in the docs.

The ‘sr’ argument is the subreddit we’re getting the stories from. The capitalization doesn’t seem to matter. Don’t forget that you can combine several subreddits with ‘+’, so you could do “sr=’devblogs+loony’” if you wanted two at a time.
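To make the ‘+’ trick concrete, here’s a rough sketch (written for Python 3) of a helper that joins a list of subreddit names into the same URL pattern the function uses. The name ‘build_listing_url’ is made up for this example and isn’t part of the tutorial’s code.

```python
def build_listing_url(subreddits, sorting=''):
    """Build a listing URL for one or more subreddits (hypothetical helper)."""
    # ['devblogs', 'loony'] becomes 'devblogs+loony'
    sr = '+'.join(subreddits)
    return 'http://www.reddit.com/r/{sr}/{top}.json'.format(sr=sr, top=sorting)


print(build_listing_url(['devblogs', 'loony'], sorting='new'))
```

With a single name and no sorting, the same helper gives back the default URL used in this post.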

The ‘sorting’ argument is the way you’d like the API to sort the stories it sends back. You can choose ‘new’, ‘top’ or ‘hot’.

‘return_json’ is so our function knows whether to return the straight json response, or the list of stories. It just depends on what you want to use it for.

Finally, ‘**kwargs’ is for any other parameters you’d want to set, such as the aforementioned ‘before’ and ‘after’. Basically how those work is that you specify whether you want the stories before, or after, the id of the story you pass in. Say there were three stories, ‘a’, ‘b’, and ‘c’. If you wanted ‘a’, you’d enter ‘before’ = ‘b’ as a parameter, and you’d get ‘a’. Likewise, if you pass the argument ‘after’ = ‘b’, you’d get the ‘c’ story. I believe the id has to be the story’s full name, which is the id prefixed with ‘t3_’ for a link. It’s confusing, but if you play around with it, you’ll get the hang of it.
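If you want more than 100 stories, one way to do it is to keep asking for the next page using the ‘after’ value that comes back in each listing under j['data']['after']. Here’s a minimal sketch of that loop, written for Python 3. The names ‘fetch_all’, ‘fetch_page’ and ‘fake_fetch’ are made up for this example, and the two fake pages stand in for real responses so it runs without touching the network.

```python
def fetch_all(fetch_page, max_pages=10):
    """Collect stories across pages by following each listing's 'after' value.

    fetch_page is any callable that takes an 'after' value (None for the
    first page) and returns the decoded listing dict.
    """
    stories = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        children = listing['data']['children']
        stories.extend(children)
        # The listing says where the next page starts, or None at the end.
        after = listing['data'].get('after')
        if not after or not children:
            break
    return stories


# Stand-in for the network: two invented pages keyed by the 'after' value.
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'id': 'a'}}, {'data': {'id': 'b'}}],
                    'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'id': 'c'}}], 'after': None}},
}


def fake_fetch(after):
    return FAKE_PAGES[after]


all_stories = fetch_all(fake_fetch)
print([story['data']['id'] for story in all_stories])
```

With the function from this post, ‘fetch_page’ could be something like ‘lambda after: subredditInfo(client, limit=100, return_json=True, after=after)’. I haven’t run that against the live site here, so treat it as a starting point.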

You can also pass in the time frame you’d like to get the stories from as the ‘t’ parameter: ‘hour’, ‘day’, ‘week’, ‘month’, ‘year’, or ‘all’ for all-time.

#----------------------------------------------------------------------
def subredditInfo(client, limit=25, sr='tankorsmash',
                  sorting='', return_json=False, **kwargs):

    """retrieves X (max 100) amount of stories in a subreddit\n
    'sorting' is whether or not the sorting of the reddit should be customized or not,
    if it is: Allowed passing params/queries such as t=hour, week, month, year or all"""
 

Here we set the parameters we’d like to send along with the URL. We first create a dict with the string ‘limit’ as a key and the ‘limit’ argument as its value. Then we update the dict (that is, merge in a second dict, overwriting any matching keys) with the keyword arguments passed in when the function was called.

    #query to send
    parameters = {'limit': limit,}
    parameters.update(kwargs)
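To see what those two lines produce, here’s a small standalone sketch for Python 3. The wrapper ‘build_parameters’ is made up for this example, and ‘urlencode’ from the standard library is only used to show roughly what the query string looks like; requests does its own encoding when you pass ‘params’.

```python
from urllib.parse import urlencode


def build_parameters(limit=25, **kwargs):
    """Same dict-building steps as in subredditInfo (hypothetical wrapper)."""
    parameters = {'limit': limit}
    # Merge in whatever extra keyword arguments the caller supplied.
    parameters.update(kwargs)
    return parameters


params = build_parameters(limit=50, t='week', after='t3_1bdhcd')
print(params)
print(urlencode(params))
```

Anything you pass as a keyword argument, such as ‘t’ or ‘after’, ends up in the query string next to ‘limit’.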

Then we build the url, filling in the proper subreddit and sorting method on the fly. By default the url is ‘http://www.reddit.com/r/tankorsmash/.json’, but you’re going to want to change that for your own purposes fairly quickly. Then we call the ‘get’ method of the ‘client’, which is simply a ‘requests.session’ instance, to make an HTTP request to the URL. After that, we take the HTTP response and turn its JSON body into a Python dict. Here I use the response’s built-in ‘json’ method (older versions of requests exposed it as an attribute, ‘r.json’), but you could also use the ‘json’ module for the exact same thing.

    url = r'http://www.reddit.com/r/{sr}/{top}.json'.format(sr=sr, top=sorting)
    r = client.get(url,params=parameters)
    print 'sent URL is', r.url
    j = r.json()  # older versions of requests used an attribute: j = r.json
    #j = json.loads(r.text) ## manual alternative

Here, we either return the raw json dict or return a list of stories, so it’s easier to iterate over and manipulate. There are too many parts of the json response to go over here, but the most important ones are the ‘title’, ‘url’, ‘permalink’, ‘id’ and ‘author’. Each item in the list at j['data']['children'] is a dict with its own ‘data’ key, which holds the key/value pairs you’re looking for. Say you were looking for the title of the first item in the returned json, you’d get it like this: ‘j['data']['children'][0]['data']['title']’. If you ever get a TypeError saying that list indices must be integers, it’s probably because you’re trying to put a key in where the list expects an index (an integer).

    #return raw json
    if return_json:
        return j

    #or list of stories
    else:
        stories = []
        for story in j['data']['children']:
            #print story['data']['title']
            stories.append(story)

        return stories
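If you’d like to poke at that structure without hitting reddit, here’s a rough Python 3 sketch that works on a made-up response. Every value in ‘sample_text’ is invented, and the helper ‘summarize’ is a name I made up for this example. It pulls the title of the first story, shows the error you get when a key is used on the list, and then flattens each story down to the handful of fields mentioned above.

```python
import json

# Made-up response text shaped like a reddit listing; every value is invented.
sample_text = '''
{"kind": "Listing",
 "data": {"children": [
     {"kind": "t3",
      "data": {"title": "Example story", "id": "abc123", "author": "someone",
               "url": "http://example.com/",
               "permalink": "/r/example/comments/abc123/example_story/",
               "score": 10}}
 ]}}
'''
j = json.loads(sample_text)

# Title of the first story: list index first, then the inner 'data' dict.
print(j['data']['children'][0]['data']['title'])

# Using a key where the list expects an integer index raises a TypeError.
try:
    j['data']['children']['title']
except TypeError as error:
    print('TypeError:', error)


def summarize(stories, fields=('title', 'url', 'permalink', 'id', 'author')):
    """Flatten each story wrapper into a dict holding only the chosen fields."""
    return [{field: story['data'].get(field) for field in fields}
            for story in stories]


rows = summarize(j['data']['children'])
for row in rows:
    print(row['id'], row['title'], row['author'])
```

The list returned by ‘subredditInfo’ has the same shape as j['data']['children'], so you should be able to pass it straight to ‘summarize’.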

And there you have it. Some of those paragraphs got a little long, but what can you do! Here’s the total code, just make sure you enter your own username and password.

import json
import requests
from pprint import pprint as pp2

#import os
#print os.getcwd()

#----------------------------------------------------------------------
def login(username, password):
    """logs into reddit, saves cookie"""

    print 'begin log in'
    #username and password
    UP = {'user': username, 'passwd': password, 'api_type': 'json',}
    headers = {'user-agent': '/u/TankorSmash\'s API python bot', }
    #POST with user/pwd
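    #newer versions of requests don't accept keyword arguments here,
    #so the headers are set on the session afterwards
    client = requests.session()
    client.headers.update(headers)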

    r = client.post('http://www.reddit.com/api/login', data=UP)

    #print r.text
    #print r.cookies

    #gets and saves the modhash
    j = json.loads(r.text)

    client.modhash = j['json']['data']['modhash']
    print '{USER}\'s modhash is: {mh}'.format(USER=username, mh=client.modhash)
    client.user = username
    def name():

        return '{}\'s client'.format(username)

    #pp2(j)

    return client

#----------------------------------------------------------------------
def subredditInfo(client, limit=25, sr='tankorsmash',
                  sorting='', return_json=False, **kwargs):
    """retrieves X (max 100) amount of stories in a subreddit\n
    'sorting' is whether or not the sorting of the reddit should be customized or not,
    if it is: Allowed passing params/queries such as t=hour, week, month, year or all"""

    #query to send
    parameters = {'limit': limit,}
    #parameters= defaults.copy()
    parameters.update(kwargs)

    url = r'http://www.reddit.com/r/{sr}/{top}.json'.format(sr=sr, top=sorting)
    r = client.get(url,params=parameters)
    print 'sent URL is', r.url
    j = json.loads(r.text)

    #return raw json
    if return_json:
        return j

    #or list of stories
    else:
        stories = []
        for story in j['data']['children']:
            #print story['data']['title']
            stories.append(story)

        return stories

client = login('USERNAME', 'PASSWORD')

j = subredditInfo(client, limit=1)

pp2(j)

6 Responses to Reddit API Tutorial Part 2: Getting all the submissions to any given subreddit

  1. ugc_commando says:

    did you mean for the headers variable to be part of the UP dict variable that is passed during login with the post method?

  2. karl says:

    I’m trying to get all the submisions of a subreddit, but i need to get around the 100 limit. I dont know how to use the other query parameters, such as the mentioned before/after.

    I tried this parameters = {‘limit’: limit,’before’:’1bdhcd’,} where 1bdhcd is the id of a submission.
    but that didnt seem to work. can you give me some further insights?

    also, how do I convert a date such as 1364788123 into something readable?

    • Tankor Smash says:

      For the date, you’ll have to use the module datetime or time to convert the seconds into a timestamp: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(1347517370)) Have you tried the full name for the submission? I believe its t3_ for a link, so in your case, it’d be t3_1bdhcd? Let me know if that works, and if it doesn’t I’ll take a closer look.

  3. Buck Wallander says:

    Great straight-forward tutorial, thank you for writing this up.

    I was having a problem receiving a proper JSON response after running the script. After looking over the Requests docs found that I had to make simple change of ‘r.json’ to ‘r.json()’ for it to return properly, in case anybody else runs into their code spitting out something like:

    bound method Response.json of

    • Tankor Smash says:

      Ah yes, requests got updated and changed it from an attribute to a method, thanks for the heads up!
