This is the second installment of my Imgur API series: ‘How to Download Entire Imgur Galleries’. Check out part 1 here in case you missed how to log into the API and upload an image!
I’m sort of cheating here, because we don’t actually need to use the API at all, if we don’t want to. That’s because we’re only going to deal with the galleries built from images submitted to reddit. This means that after you’re done with this tutorial, you’ll be able to point the script at a given subreddit’s name and grab all the images that have been submitted to /r/aww or /r/wallpapers.
I will be writing another tutorial soon for galleries and albums unrelated to reddit.com as well as grabbing gallery information, such as the title and other descriptive things like that, but that’s less related to the actual downloading of the gallery, which is what we’re interested in today!
Please note that I’m working with Python 2.7 on Windows 7 64-bit, so you might have to modify the code slightly to accommodate your platform or OS.
Anyways, hit the jump to get started!
Edit: May 29th 2012: reddit user easttntoppedtree caught that the gallery maxes out at 56 images, so you’ll have to add /page/PAGENUMBERHERE.json to the end of the URL to get the next 56 images, like so: http://imgur.com/r/scarlettjohansson/top/page/1.json (keeping in mind that 0 (zero) is a valid page number).
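To make that pagination concrete, here’s a minimal sketch of building the page URLs; the helper name `page_url` and the subreddit used are just examples, not part of the original script:

```python
# Sketch: build the paginated gallery URLs (remember: page 0 is valid).
def page_url(subreddit, page):
    """Return the JSON URL for one page of a subreddit gallery."""
    return 'http://imgur.com/r/{sr}/top/page/{pg}.json'.format(sr=subreddit, pg=page)

# The first three pages, 56 images each:
urls = [page_url('scarlettjohansson', p) for p in range(3)]
```

Each URL can then be fed into the same `requests.get` / `json.loads` steps shown below.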
Tinypaste link for the full working code, as seen at the bottom of the page
First things first: the imports. We are going to import the usual networking suspects, requests and json, as well as the ever useful pprint module. New to us this time is the datetime module, which is a handy way to handle time-related things in Python; we’ll just be using it to grab today’s date and time. Finally, there’s the os module, which Python uses to interact with your computer’s operating system. You can do helpful things with it, like detect whether a folder exists, create one, or check which path Python is currently running in. It’s a very powerful module; again, we’ll only be briefly using it to create a folder to hold our images.
import requests
import json
from pprint import pprint
import datetime
import os
Here is where we’ll have our constants for the script, such as how many images we want to download, and which subreddit’s gallery we will take the images from.
##Set constants for script
DL_LIMIT = 5
SUBREDDIT = 'scarlettjohansson'
Here, we use the requests and json modules to make a GET request to the URL, customized to fit the SUBREDDIT value that we set above. Once we get the response from the site, we transform the raw JSON object into a Python dictionary, which is something we can manipulate a lot more effectively within Python. Make sure that you are using the text attribute of the response, instead of the content or raw response data.
##Download and load the JSON information for the Gallery
#get json object from imgur gallery. can be appended with /month or /week for
# more recent entries
r = requests.get(r'http://imgur.com/r/{sr}/top.json'.format(sr=SUBREDDIT))
#creates a python dict from the JSON object
j = json.loads(r.text)
This next line is just for your own use, to see exactly what the response was. You can use it to determine whether imgur is over capacity or the URL was set incorrectly. I’ve commented it out for now, since everything should be working fine.
#prints the dict, if necessary. Used for debug mainly
#pprint(j)
Now, we extract the list of images from the JSON dict we just created. You can check out the layout of that dictionary by uncommenting the `pprint` line above.
#get the list of images from j['gallery']
image_list = j['gallery']
Some more flavour text, so we can confirm the number of images in the gallery. It counts the objects in the list using the `len` builtin function.
#print the number of images found
print len(image_list), 'images found in the gallery'
More debugging options here, letting you examine the content of the first image in the list we just created, found at index 0 (as you know, lists begin at index 0 instead of 1).
#debugging, examine the first image in the gallery, confirm no errors
pprint(image_list[0])
Now, we want to create a folder in which we can fit all the images we are going to be downloading in a minute. I like putting them in timestamped folders, but you can easily change it to be called the name of the subreddit, or anything else.
Here, we use the `datetime` module to fetch the current time, in a format specific to `datetime`
#get the time object for today
folder = datetime.datetime.today()
That object needs to be turned into a printable string we can use to name our folder, so we run the `str` builtin function on it, which does exactly what we want.
#turn it into a printable string
string_folder = str(folder)
Then, since some characters cannot be used in a folder name, we need to remove them. We use the string’s `replace` method to swap out the colon character for a folder-friendly one, the period.
#replace some illegal chars
legal_folder = string_folder.replace(':', '.')
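If you want to be thorough, Windows forbids more characters than just the colon. Here’s a small sketch that generalizes the replacement; the `legalize` helper and the character set are my own additions, not part of the original script:

```python
# Sketch: replace every Windows-illegal filename character, not just ':'.
# This is the commonly cited Windows set; adjust it for your platform.
ILLEGAL = '<>:"/\\|?*'

def legalize(name):
    """Return name with each illegal character replaced by a period."""
    for ch in ILLEGAL:
        name = name.replace(ch, '.')
    return name
```

For a timestamp string, this behaves the same as the single `replace` above, but it also survives odder inputs.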
Now, we use the mkdir function from the `os` module to create a folder using a legal string we just created. Remember that unless you specify otherwise, the folder will be created in the same location the script is running.
#create the folder using the name legal_folder
os.mkdir(str(legal_folder))
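One caveat: `os.mkdir` raises an OSError if the folder already exists, which can happen if you rerun the script within the same second or switch to a fixed folder name. A small sketch of a guard (the `ensure_folder` name is mine, not from the original script):

```python
import os

# Sketch: only create the folder when it doesn't already exist, so a
# re-run of the script doesn't die with OSError.
def ensure_folder(path):
    if not os.path.isdir(path):
        os.mkdir(path)
    return path
```

Calling it twice on the same path is harmless, unlike a bare `os.mkdir`.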
Next, we need to extract each image’s name and type. So we create an empty list, into which we’ll put 2-item tuples containing the name and extension of each file, which we’ll use for downloading and saving the images.
#list of pairs containing the image name and file extension
image_pairs = []
At each index of the list of images we created, we’ll find a dict filled with miscellaneous information about the image, such as its size and how many times it was downloaded. All we’re interested in, though, are the hash and ext keys. So for every image dictionary in the list, we take the values associated with the hash and ext keys and append them both, as a pair, to the new list for later.
#extract image and file extension from dict
for image in image_list:
    #get the raw image name
    img_name = image['hash']
    #get the image extension(jpg, gif etc)
    img_ext = image['ext']
    #append pair to list
    image_pairs.append((img_name, img_ext))
Next, we need to download the images from the website. We do that by substituting the name and ext of each image into the URL template below; but we don’t want to surpass our pre-set download limit. You don’t want to hammer Imgur’s servers in case there’s a bandwidth limit, so we need to keep track of the number of images we grab.
So first, we set a temporary variable to keep track of the number of images we’ve grabbed.
#current image number, for looping limits
current = 0
Then we start a loop that stops when current is equal to or greater than the DL_LIMIT we set at the beginning of the file.
#run download loop, until DL_LIMIT is reached
for name, ext in image_pairs:
    #so long as we haven't hit the download limit:
    if current < DL_LIMIT:
Then, we fill the URL template with the name and extension of the image on the site.
        #this is the image URL location
        url = r'http://imgur.com/{name}{ext}'.format(name=name, ext=ext)
        #print the image we are currently downloading
        print 'Current image being downloaded:', url
Next, we have to download the actual image, instead of the JSON that is referencing it. We do that by once again using the requests module to create a GET request to the URL we’ve filled in and then saving the response.
        #download the image data
        response = requests.get(url)
        #set the file location
        path = r'./{fldr}/{name}{ext}'.format(fldr=legal_folder, name=name, ext=ext)
Then we create a file object at the path location, opened in ‘write binary’ mode instead of the default ‘read’. We need to make sure we are writing, for one thing, but also that we are writing binary data to the file, rather than strings (think 0s and 1s instead of ‘abc’s). This is the same reason we use the response.content attribute instead of response.text.
        #open the file object in write binary mode
        fp = open(path, 'wb')
        #perform the write operation
        fp.write(response.content)
To finish off the for loop we close the file object we opened to write the image to disk, as well as increase the current image count, so we can make sure we don’t download too many images.
        #close the file
        fp.close()
        #advance the image count
        current += 1
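As an aside, the open/write/close steps can also be expressed with a `with` block, which guarantees the file gets closed even if the write raises an exception. A small sketch; the `save_image` helper name is my own, not part of the original script:

```python
# Sketch: save binary image data to disk using a context manager, so
# the file is closed automatically, even on error.
def save_image(path, data):
    with open(path, 'wb') as fp:
        fp.write(data)
```

You would call it as `save_image(path, response.content)` in place of the three explicit file-handling lines.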
Finally, we close off with some flavour text, just to let the user know we’ve successfully run the script.
#print off a completion string
print 'Finished downloading {cnt} images to {fldr}!'.format(cnt=current, fldr=legal_folder)
As usual, here’s the full code, which you should be able to run on your PC just fine.
import requests
import json
from pprint import pprint
import datetime
import os

##Set constants for script
DL_LIMIT = 5

##Download and load the JSON information for the Gallery
#get json object from imgur gallery. can be appended with /month or /week for
# more recent entries
r = requests.get(r'http://imgur.com/r/scarlettjohansson/top.json')
#creates a python dict from the JSON object
j = json.loads(r.text)

#prints the dict, if necessary. Used for debug mainly
#pprint(j)

#get the list of images from j['gallery']
image_list = j['gallery']

#print the number of images found
print len(image_list), 'images found in the gallery'

#debugging, examine the first image in the gallery, confirm no errors
pprint(image_list[0])

## Create a dynamically named folder
#get the time object for today
folder = datetime.datetime.today()
#turn it into a printable string
string_folder = str(folder)
#replace some illegal chars
legal_folder = string_folder.replace(':', '.')
#create the folder using the name legal_folder
os.mkdir(str(legal_folder))

## Extract image info from the gallery
#list of pairs containing the image name and file extension
image_pairs = []
#extract image and file extension from dict
for image in image_list:
    #get the raw image name
    img_name = image['hash']
    #get the image extension(jpg, gif etc)
    img_ext = image['ext']
    #append pair to list
    image_pairs.append((img_name, img_ext))

## Download images from imgur.com
#current image number, for looping limits
current = 0
#run download loop, until DL_LIMIT is reached
for name, ext in image_pairs:
    #so long as we haven't hit the download limit:
    if current < DL_LIMIT:
        #this is the image URL location
        url = r'http://imgur.com/{name}{ext}'.format(name=name, ext=ext)
        #print the image we are currently downloading
        print 'Current image being downloaded:', url
        #download the image data
        response = requests.get(url)
        #set the file location
        path = r'./{fldr}/{name}{ext}'.format(fldr=legal_folder, name=name, ext=ext)
        #open the file object in write binary mode
        fp = open(path, 'wb')
        #perform the write operation
        fp.write(response.content)
        #close the file
        fp.close()
        #advance the image count
        current += 1

#print off a completion string
print 'Finished downloading {cnt} images to {fldr}!'.format(cnt=current, fldr=legal_folder)
When I copy the script at the bottom I get this:
Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
Type “copyright”, “credits” or “license()” for more information.
>>> ================================ RESTART ================================
>>>
Traceback (most recent call last):
File “C:\Users\BlackThought\Desktop\SJ.py”, line 1, in
import requests
ImportError: No module named requests
>>>
Hey Yessir,
You’re getting an ImportError, which means that your Python isn’t able to find the requests module on your PC. Is it possible that you haven’t installed it? Just in case, take a look here: http://docs.python-requests.org/en/latest/user/install/
I get an error at:
image_list = j[‘gallery’]
‘gallery’ apparently doesn’t form part of the j data structure, which is made up of: status, data, success.
Is there an alternative way to get the number of images in the gallery?
Thanks! and awesome code by the way!
Thanks!
It looks like Imgur might’ve changed their JSON layout with their API3 changes, so I’m not exactly sure right off the bat. It looks like you’d have to register for an API key and make a call to the ‘http://api.imgur.com/models/album’ URL and parse the `images_count` response, but that’s specifically for Albums, and not Subreddit Galleries. Actually no, here http://api.imgur.com/endpoints/gallery it looks like if you made the API request to the subreddit gallery, it’d act the same way as it would for Albums.
So make a GET request to https://api.imgur.com/3/gallery/r/scarlettjohannson and then look for the ‘image_count’ attribute. I’m not sure if that’s the total images or just the images on the page though. Please let me know if you work it out! Sorry I couldn’t help more.
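A quick sketch of what pulling that count out of the response might look like. The sample payload below is made up to match the status/data/success layout mentioned above, and since I’m not sure whether the field is ‘images_count’ or ‘image_count’, the helper checks both:

```python
import json

# Sketch: extract a count field from an API3-style response.
# 'sample' is a fabricated payload for illustration only.
sample = '{"data": {"images_count": 56}, "success": true, "status": 200}'

def get_image_count(raw):
    """Return the image count from a raw JSON response, trying both
    field names the API docs mention."""
    data = json.loads(raw)['data']
    return data.get('images_count', data.get('image_count'))
```

Again, I can’t confirm whether that number is the gallery total or just the current page.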
Imgur API 3 took a while to figure out, and I am WAAAAY rusty at Python, so digging down was a little confusing for me; maybe someone can elaborate better?
The following is a down and dirty way to get a list of image links from an imgur subreddit gallery, I am sure it is nearly identical for a regular gallery as well.
header= {"Content-Type": "text", "Authorization": "Client-ID " + CLIENT_ID}
r = requests.get('https://api.imgur.com/3/gallery/r/pics/top.json', headers=header)
j = json.loads(r.text)
for image in j[u'data']:
    print(image['link'])
Hello! I am trying to grab images from a subreddit using your script. I am not experienced in Python; when I run the script it opens a command window but nothing happens, and after a second or two it closes. I installed pip and requests. Any advice? Thanks
Try running it in a command prompt, or having a raw_input() (or input() if you’re on Python 3) at the end of the file so that the script waits for you before closing.
There’s a good chance it’s an exception that’s getting thrown and not caught, so make sure you’re not running into any trouble like that.
Consider using iPython’s ipdb for debugging!
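To illustrate the uncaught-exception point, here’s a small sketch of a wrapper that prints the traceback instead of letting the console window vanish; the `run_safely` name is just an example, not part of the tutorial script:

```python
import traceback

# Sketch: run a function, print the full traceback on failure instead
# of letting the console window close, and report success/failure.
def run_safely(fn):
    try:
        fn()
        return True
    except Exception:
        traceback.print_exc()
        return False
```

Wrap the body of the script in a function and pass it to `run_safely`, then add a `raw_input()` after the call so the window stays open.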
is there any reason you are setting the DL_LIMIT = 5
instead of letting it download the entire list of images?
the API v3 returns 10 for just top.json and 56 for top/page#.json
Sorry dude, I dunno how I missed this. I honestly can’t remember why I did that; it was likely just a preferential choice to save on space or something. Did you end up getting this to work?
It will always return the number of “top” results (in this case 10) if you specify “top.json”. You need to check whether there are any results for top/page#.json.
Try something like this:
pageNum = 0
r = requests.get(r'http://imgur.com/r/{sr}/top/page/'.format(sr=SUBREDDIT) + str(pageNum) + '.json')
j = json.loads(r.text) #creates a python dict from the JSON object
data = j['data'] # Sets data to see if it contains any information
print data
while data:
    r = requests.get(r'http://imgur.com/r/{sr}/top/page/'.format(sr=SUBREDDIT) + str(pageNum) + '.json')
    j = json.loads(r.text) #creates a python dict from the JSON object
    #prints the dict, if necessary. Used for debug mainly
    if debug:
        pprint(j)
    data = j['data'] # Sets data to see if it contains any information
    downloadImages(data)
    pageNum += 1
This will check to see if your “data” list has any contents before you try to download the images using the “downloadImages()” using the code from the example.