Imgur API part 2: Downloading a Gallery

This is the second installment of my Imgur API series: ‘How to download entire Imgur Galleries’. Check out part 1 here in case you missed how to log into the API and upload an image!

I’m sort of cheating here, because we don’t actually need to use the API at all, if we don’t want to. That is because we’re only going to deal with the galleries that are built from images submitted to reddit. This means that after you’re done with this tutorial, you’ll be able to point the script at a given subreddit’s name and grab all the images that have been submitted to /r/aww or /r/wallpapers.

I will be writing another tutorial soon for galleries and albums unrelated to reddit.com, as well as for grabbing gallery information such as the title and other descriptive details, but that’s less related to the actual downloading of the gallery, which is what we’re interested in today!

Please note that I’m working with Python 2.7 on Windows 7 64-bit, so you might have to modify the code slightly to accommodate your platform or OS.

Anyways, hit the jump to get started!

Edit: reddit user easttntoppedtree caught that the gallery maxes out at 56 images, so you’ll have to add /page/PAGENUMBERHERE.json to the end of the URL to get the next 56 images, like so: http://imgur.com/r/scarlettjohansson/top/page/1.json, while keeping in mind that 0 (zero) is a valid page number.
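
For example, here’s a minimal sketch of paging through a gallery with that URL pattern; the all_images name and the empty-page stopping condition are my own assumptions, and the 'gallery' key is the one used later in this post:

import requests
import json

#a rough sketch of paging through a gallery, 56 images at a time
#(assumes an empty or missing 'gallery' key marks the last page)
all_images = []
page = 0
while True:
    url = 'http://imgur.com/r/scarlettjohansson/top/page/{p}.json'.format(p=page)
    j = json.loads(requests.get(url).text)
    if not j.get('gallery'):
        break
    all_images.extend(j['gallery'])
    page += 1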

Tinypaste link for the full working code, as seen at the bottom of the page

First things first: the imports. We are going to import the usual networking suspects, requests and json, as well as the ever useful pprint module. New to us this time is the datetime module, which is a handy way to handle time-related things in Python; we’ll just be using it to grab today’s date and time. Finally, there’s the os module, which Python uses to interact with the operating system. You can do helpful things like detect whether a folder exists and create one, or check which path Python is currently running in. It’s a very powerful module, but again, we’ll only be using it briefly, to create a folder to hold our images.

import requests
import json
from pprint import pprint
import datetime
import os
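
To give you a quick taste of those os calls (the ‘images’ folder name here is just an example of mine):

#a quick look at the os module, using the imports above;
#the 'images' folder name is only an example
print os.getcwd()                 #the path Python is currently running in
if not os.path.exists('images'):  #does this folder exist yet?
    os.mkdir('images')            #if not, create it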

Here is where we’ll have our constants for the script, such as how many images we want to download, and which subreddit’s gallery we will take the images from.

##Set constants for script
DL_LIMIT = 5
SUBREDDIT = 'scarlettjohansson'

Here, we use the requests and json modules to make a GET request to the URL, customized to fit the SUBREDDIT value we set above. Once we get the response from the site, we transform the raw JSON object into a Python dictionary, which is something we can manipulate a lot more effectively within Python. Make sure that you are using the text attribute of the response, instead of the content or raw response data.

##Download and load the JSON information for the Gallery
#get json object from imgur gallery. can be appended with /month or /week for
# more recent entries
r = requests.get(r'http://imgur.com/r/{sr}/top.json'.format(sr=SUBREDDIT))
#creates a python dict from the JSON object
j = json.loads(r.text)

This is just for your own use, to see exactly what the response was. You can use it to determine whether imgur is over capacity or the URL was set incorrectly. For now, I’ve commented it out, since everything should be working fine.

#prints the dict, if necessary. Used for debug mainly
#pprint(j)
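
If you’d rather check programmatically than eyeball the dump, here’s a minimal sketch of a status check (my own addition, not part of the original script):

#optional sanity check before parsing the response
if r.status_code != 200:
    print 'Bad response from imgur:', r.status_code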

Now, we extract the list of images from the JSON dict we just created. You can check out the layout of the dictionary by uncommenting the `pprint` line above.

#get the list of images from j['gallery']
image_list = j['gallery']

Some more flavour text, so we can confirm the number of images in the gallery. It counts the objects in the list using the built-in len function.

#print the number of images found
print len(image_list), 'images found in the gallery'

More debugging options, here for you to examine the content of the first image in the list we just created, which is found at index 0 because, as you know, lists begin at index 0 instead of 1.

#debugging, examine the first image in the gallery, confirm no errors
pprint(image_list[0])

Now, we want to create a folder to hold all the images we are going to be downloading in a minute. I like putting them in timestamped folders, but you can easily change it to be named after the subreddit, or anything else.

Here, we use the `datetime` module to fetch the current time, in a format specific to `datetime`.

#get the time object for today
folder = datetime.datetime.today()

That datetime object isn’t directly usable as a folder name, so we turn it into a printable string by running the built-in str function on it, which does exactly what we want.

#turn it into a printable string
string_folder = str(folder)

Then, since some characters cannot be used in a folder name, we need to remove them. We use the string’s replace method to swap the colon character for a folder-friendly one, the period.

#replace some illegal chars
legal_folder = string_folder.replace(':', '.')

Now, we use the mkdir function from the `os` module to create a folder using the legal string we just created. Remember that unless you specify otherwise, the folder will be created in the same location the script is running from.

#create the folder using the name legal_folder
os.mkdir(legal_folder)
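
As an aside, `datetime` can also produce a folder-safe name in one step with `strftime`, which would replace the three steps above; the format string is an arbitrary choice of mine:

#alternative: build a folder-safe timestamp in one step
#(this particular format string is just one possible choice)
legal_folder = datetime.datetime.today().strftime('%Y-%m-%d %H.%M.%S')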

Next, we need to extract each image’s name and file type. So we create an empty list, into which we’ll put 2-item tuples containing the name and extension of each file, which we’ll use for downloading and saving the images.

#list of pairs containing the image name and file extension
image_pairs = []

At each index of the image list, we’ll find a dict filled with miscellaneous information about the image, such as its size and how many times it was downloaded. All we’re interested in, though, are the hash and ext keys and their values. So for every image dictionary in the list, we take the values associated with hash and ext, and append them as a pair to the new list for later.

#extract image and file extension from dict
for image in image_list:
  #get the raw image name
  img_name = image['hash']
  #get the image extension(jpg, gif etc)
  img_ext = image['ext']
  #append pair to list
  image_pairs.append((img_name, img_ext))
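
If you prefer, the whole extraction also fits in a single list comprehension, which is equivalent to the loop above:

#equivalent one-liner using a list comprehension
image_pairs = [(image['hash'], image['ext']) for image in image_list]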

Next, we need to download the images from the website. We do that by substituting each image’s name and ext into the URL template below. But we don’t want to surpass our pre-set download limit; in case there’s a bandwidth cap, we don’t want to hammer Imgur’s servers, so we need to keep track of the number of images we grab.

So first, we set a temporary variable to keep track of the number of images we’ve grabbed.

#current image number, for looping limits
current = 0

Then we start a loop that only downloads while current is less than the DL_LIMIT we set at the beginning of the file.

#run download loop, until DL_LIMIT is reached
for name, ext in image_pairs:
  #so long as we haven't hit the download limit:
  if current < DL_LIMIT:

Then, we fill the URL template with the name and extension of the image on the site.

        #this is the image URL location
        url = r'http://imgur.com/{name}{ext}'.format(name=name, ext=ext)
        #print the image we are currently downloading
        print 'Current image being downloaded:', url

Next, we have to download the actual image, instead of the JSON that is referencing it. We do that by once again using the requests module to create a GET request to the URL we’ve filled in and then saving the response.

        #download the image data
        response = requests.get(url)
        #set the file location
        path = r'./{fldr}/{name}{ext}'.format(fldr=legal_folder,
                                              name=name,
                                              ext=ext)

Then we create a file object at the path location, opened in ‘write binary’ mode instead of the default ‘read’. We need to make sure we are writing, for one thing, but also writing in binary mode, since we’re putting binary data into the file rather than strings (think 0s and 1s instead of ‘abc’s). This is the same reason we use the response.content attribute instead of response.text.

        #open the file object in write binary mode
        fp = open(path, 'wb')
        #perform the write operation
        fp.write(response.content)

To finish off the for loop, we close the file object we opened to write the image to disk, as well as increment the current image count, so we can make sure we don’t download too many images.

        #close the file
        fp.close()
        #advance the image count
        current += 1
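
As a side note, slicing the list up front and using a `with` block (which closes the file for you, and works fine in Python 2.7) would let you drop the counter and the explicit close; here’s a sketch of that variant, though the completion message below would then need its count from min(DL_LIMIT, len(image_pairs)):

#alternative: slice the list instead of counting images,
#and let 'with' close the file automatically
for name, ext in image_pairs[:DL_LIMIT]:
    url = r'http://imgur.com/{name}{ext}'.format(name=name, ext=ext)
    response = requests.get(url)
    path = r'./{fldr}/{name}{ext}'.format(fldr=legal_folder, name=name, ext=ext)
    with open(path, 'wb') as fp:
        fp.write(response.content)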

Finally, we close off with some flavour text, just to let the user know the script ran successfully.

#print off a completion string
print 'Finished downloading {cnt} images to {fldr}!'.format(cnt=current,
                                                            fldr=legal_folder)

As usual, here’s the full code, which you should be able to run on your PC just fine.

import requests
import json
from pprint import pprint
import datetime
import os

##Set constants for script

DL_LIMIT = 5
SUBREDDIT = 'scarlettjohansson'

##Download and load the JSON information for the Gallery

#get json object from imgur gallery. can be appended with /month or /week for
# more recent entries
r = requests.get(r'http://imgur.com/r/{sr}/top.json'.format(sr=SUBREDDIT))
#creates a python dict from the JSON object
j = json.loads(r.text)

#prints the dict, if necessary. Used for debug mainly
#pprint(j)

#get the list of images from j['gallery']
image_list = j['gallery']

#print the number of images found
print len(image_list), 'images found in the gallery'

#debugging, examine the first image in the gallery, confirm no errors
pprint(image_list[0])

## Create a dynamically named folder

#get the time object for today
folder = datetime.datetime.today()
#turn it into a printable string
string_folder = str(folder)
#replace some illegal chars
legal_folder = string_folder.replace(':', '.')
#create the folder using the name legal_folder
os.mkdir(legal_folder)

## Extract image info from the gallery

#list of pairs containing the image name and file extension
image_pairs = []
#extract image and file extension from dict
for image in image_list:
    #get the raw image name
    img_name = image['hash']
    #get the image extension(jpg, gif etc)
    img_ext = image['ext']
    #append pair to list
    image_pairs.append((img_name, img_ext))

## Download images from imgur.com

#current image number, for looping limits
current = 0
#run download loop, until DL_LIMIT is reached
for name, ext in image_pairs:
    #so long as we haven't hit the download limit:
    if current < DL_LIMIT:
        #this is the image URL location
        url = r'http://imgur.com/{name}{ext}'.format(name=name, ext=ext)
        #print the image we are currently downloading
        print 'Current image being downloaded:', url

        #download the image data
        response = requests.get(url)
        #set the file location
        path = r'./{fldr}/{name}{ext}'.format(fldr=legal_folder,
                                              name=name,
                                              ext=ext)
        #open the file object in write binary mode
        fp = open(path, 'wb')
        #perform the write operation
        fp.write(response.content)
        #close the file
        fp.close()
        #advance the image count
        current += 1

#print off a completion string
print 'Finished downloading {cnt} images to {fldr}!'.format(cnt=current,
                                                            fldr=legal_folder) 

10 thoughts on “Imgur API part 2: Downloading a Gallery”

  1. When i copy the script on the bottom i get this

    Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>> ================================ RESTART ================================
    >>>

    Traceback (most recent call last):
    File "C:\Users\BlackThought\Desktop\SJ.py", line 1, in
    import requests
    ImportError: No module named requests
    >>>

  2. I get an error at:
    image_list = j['gallery']
    'gallery' apparently doesn’t form part of the j data structure, which is made up of: status, data, success.
    Is there an alternative way to get the number of images in the gallery?

    Thanks! and awesome code by the way!

    1. Thanks!

      It looks like Imgur might’ve changed their JSON layout with their API3 changes, so I’m not exactly sure right off the bat. It looks like you’d have to register an API key and make a call to the http://api.imgur.com/models/album URL and parse the `images_count` response, but that’s specifically for Albums, and not Subreddit Galleries. Actually no, here http://api.imgur.com/endpoints/gallery it looks like if you made the API request to the subreddit gallery, it’d just act the same way as it would for Albums.

      So make a GET request to https://api.imgur.com/3/gallery/r/scarlettjohannson and then look for the `image_count` attribute. I’m not sure if that’s the total images or just the images on the page, though. Please let me know if you work it out! Sorry I couldn’t help more.

      1. Imgur API 3 took a while to figure out, and I am WAAAAY rusty at Python, so digging down was a little confusing for me; maybe someone can elaborate better?

        The following is a down and dirty way to get a list of image links from an imgur subreddit gallery, I am sure it is nearly identical for a regular gallery as well.

        header = {"Content-Type": "text", "Authorization": "Client-ID " + CLIENT_ID}
        r = requests.get('https://api.imgur.com/3/gallery/r/pics/top.json', headers=header)

        j = json.loads(r.text)
        for image in j[u'data']:
            print(image['link'])

  3. Hello! I am trying to grab images from a subreddit using your script. I am not experienced in Python; when I run the script it opens a command window, but nothing happens, and after a second or two it closes. I installed pip and requests. Any advice? Thanks

    1. Try running it in a command prompt, or having a raw_input() (or input() if you’re on Python 3) at the end of the file so that the script waits for you before closing.

      There’s a good chance it’s an exception that’s getting thrown and not caught, so make sure you’re not running into any trouble like that.

      Consider using IPython’s ipdb for debugging!

  4. is there any reason you are setting the DL_LIMIT = 5
    instead of letting it download the entire list of images?
    the API v3 returns 10 for just top.json and 56 for top/page#.json

    1. Sorry dude, I dunno how I missed this. I honestly can’t remember why I did that, it was likely just a preferential choice to save on space or something. Did you end up getting this to work?

    2. It will always return the number of “top” results (in this case 10) if you specify “top.json”. You need to check whether there are any results for top/page#.json.

      Try something like this:
      pageNum = 0
      r = requests.get(r'http://imgur.com/r/{sr}/top/page/'.format(sr=SUBREDDIT) + str(pageNum) + '.json')
      j = json.loads(r.text)  #creates a python dict from the JSON object
      data = j['data']  #set data, to see if it contains any information
      print data
      while data:
          r = requests.get(r'http://imgur.com/r/{sr}/top/page/'.format(sr=SUBREDDIT) + str(pageNum) + '.json')
          j = json.loads(r.text)  #creates a python dict from the JSON object
          #prints the dict, if necessary. Used for debug mainly
          if debug:
              pprint(j)
          data = j['data']  #set data, to see if it contains any information
          downloadImages(data)
          pageNum += 1

      This will check whether your “data” list has any contents before you try to download the images with “downloadImages()”, using the code from the example.
