How to scrape an ImageBam gallery for images with 30 lines of Python

Right off the bat, I want to show you the results of this scraping to give you a bit of motivation. Thanks to requests and BeautifulSoup, pulling every image URL out of a gallery is trivially easy. Enough talking, let’s get down to the code! As usual, I’ll include the full source code at the bottom of the post.

Import the required modules

#make the required imports
import bs4
import requests

I’m not actually sure if it matters at all, but I like to think that keeping the cookies across requests makes our traffic look that much less like someone scraping their images. Of course, someone making 90 calls in a minute isn’t normal either, but hey, what can you do.

#create a session object that'll allow us to keep the cookies across requests and not raise
# too much suspicion
s = requests.session()
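If the request rate worries you, one option is to pause before each call. The helper below is only a sketch: `polite_get` and its `delay` parameter are names I made up for illustration, and it works with any object that has a `get` method, such as the session `s` created above.

```python
import time


def polite_get(session, url, delay=1.0):
    """Wait `delay` seconds, then fetch `url` with the given session.

    `polite_get` and `delay` are hypothetical names, not part of requests.
    """
    time.sleep(delay)
    return session.get(url)


# usage with the session created above (not run here, it needs network access):
# r = polite_get(s, 'http://www.imagebam.com/image/c6b0cf70777810/')
```

A one second pause is an arbitrary choice; pick whatever feels reasonable for the site.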

These two variables hold the pieces we’ll use to build a full URL. The base of the URL is the domain, and it’s not going to change, so we can prepend it to the limb URL, which will change depending on what the next button’s URL tells us.

#the base URL we'll prepend to the 'limb' URL we'll find in the 'next' buttons
base_url = r'http://www.imagebam.com'
limb_url = r'/image/c6b0cf70777810/'
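As an alternative to gluing the two strings together by hand, the standard library can join them. This is a small sketch using `urljoin` from Python 3's `urllib.parse` (in Python 2 it lives in the `urlparse` module); the rest of this post sticks with plain concatenation.

```python
from urllib.parse import urljoin

base_url = 'http://www.imagebam.com'
limb_url = '/image/c6b0cf70777810/'

# urljoin resolves the limb against the base, like a browser resolves a link
url = urljoin(base_url, limb_url)
print(url)
```

The benefit is that `urljoin` also copes with a 'next' link that turns out to be a full URL rather than a path, which plain concatenation would mangle.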

Here we set up an empty list to store the image URLs we find. You can turn the list into a set later if you’re worried about duplicate URLs, since a set cannot hold the same element twice. The done boolean gets flipped once we’re finished finding URLs.

#a list to hold all the image URLs we'll find
image_links = []

# a boolean to tell our loop when to stop
done = False
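A set throws away the order the links were found in. If you want to drop duplicates but keep the order, here is a minimal sketch; `unique_links` is just a name I picked, and it relies on dictionaries keeping insertion order, which holds from Python 3.7 on.

```python
def unique_links(links):
    """Return the links with duplicates removed, first occurrence kept."""
    # dict keys are unique and remember the order they were inserted in
    return list(dict.fromkeys(links))
```

You would call it once at the end, for example `image_links = unique_links(image_links)`.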

Below, we’re going to loop until we’ve set the done boolean to true, so we can scrape as many image URLs as we can from the gallery. You can see how we simply add two strings together to make a new URL. You could use str.format instead, but for joining two strings once per page the difference is negligible either way. We use the session object we created earlier to make a GET request to the built URL.

#keep looping until we're 'done', which means we couldn't find a 'next' button on the page
while not done:

    #reset this flag on every page; the loop over the links sets it to True
    next_found = False

    #combine base and limb URLs to create a full URL we can use
    url = base_url + limb_url
    #make the GET request to the full URL we've just built
    r = s.get(url)

Now that we’ve got a response object from the s.get() call, we’re going to turn the response’s content, which is HTML, into a BeautifulSoup object so that we can search through it and pull out the elements we’d like. In our case, those are the elements with the class ‘buttonblue’.

    #create a BeautifulSoup object that we can parse for data
    #(naming the parser explicitly avoids bs4's "no parser specified" warning)
    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    #find all the elements with 'buttonblue' as a class
    link_elems = soup.findAll(attrs={'class': 'buttonblue'})

This big old set of code loops over each element returned by the soup.findAll() call and looks at each element’s text to see if it contains either ‘save’ or ‘next’. If it’s ‘save’, we append the link’s href to our big list of URLs. If it’s ‘next’, we know it’s the next button, so we store its href as the URL to GET next.

    #loop over the list of elements
    for link in link_elems:

        #if 'save' is in the text area for the link, we'll add it to our list
        if 'save' in link.text:
            image_links.append(link['href'])
            print('saving this link: ' + link['href'])
        #or if 'next' is in the text, we'll treat that as the next url to parse
        elif 'next' in link.text:
            limb_url = link['href']
            print('found ' + limb_url)
            #remember that this page had a next button; next_found was reset to
            #False before this loop, so a later 'save' link can't undo it
            next_found = True
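To make the sorting logic easier to try out without hitting the site, it can be pulled into a plain function. This is only a sketch: `sort_links` is a hypothetical helper that takes (text, href) pairs instead of BeautifulSoup elements and returns the save links plus the next URL, or None when there is no next button.

```python
def sort_links(links):
    """Split (text, href) pairs into save hrefs and the next page's href."""
    save_hrefs = []
    next_href = None
    for text, href in links:
        if 'save' in text:
            save_hrefs.append(href)
        elif 'next' in text:
            # never reset back to None, so link order on the page doesn't matter
            next_href = href
    return save_hrefs, next_href
```

Inside the loop you could feed it the real elements with `sort_links((link.text, link['href']) for link in link_elems)`.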

If we haven’t found a next button on this page, we can safely assume that we’ve reached the end of the gallery. We then print out the list of links, which you can upload to Imgur or just download right to your PC.

    #if we haven't found a next button, we can assume we're at the end of a gallery
    if not next_found:
        print("no next button, so we're done with this gallery")
        done = True

#print the list of image links
print(image_links)
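One thing the loop above does not guard against is a 'next' button that leads back to a page we have already seen. Here is a sketch of one way to handle that, assuming a hypothetical `get_next` callable that returns the next limb URL for a page, or None at the end; neither `walk_gallery` nor `get_next` is part of the script in this post.

```python
def walk_gallery(start, get_next):
    """Follow 'next' links from `start`, stopping at the end or on a repeat."""
    seen = set()
    order = []
    current = start
    while current is not None and current not in seen:
        seen.add(current)
        order.append(current)
        # get_next stands in for "fetch the page and read its next button"
        current = get_next(current)
    return order
```

The set of seen URLs is what stops the walk if the gallery wraps around to its first image.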

…And we’re done! Simple as that. Now you’ve got a big list of valid URLs that you can upload to Imgur or any other content hosting service for your perusal. Don’t forget to check out my Imgur tutorials on how to upload images through their API using Python!
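If you would rather save the images to disk, here is a rough sketch using only the Python 3 standard library. It assumes each collected link points straight at an image file, which I haven't verified for every gallery; `filename_for` and `download_all` are names made up for this example.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_for(url):
    """Use the last path segment of the URL as the file name."""
    name = os.path.basename(urlparse(url).path)
    # fall back to a generic name if the URL ends in a slash
    return name or 'image'


def download_all(links, folder='.'):
    """Download every URL in `links` into `folder` (needs network access)."""
    for url in links:
        urlretrieve(url, os.path.join(folder, filename_for(url)))
```

Calling `download_all(image_links)` after the loop finishes would drop the files into the current directory.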

Below is the full source code

#make the required imports
import bs4
import requests

#create a session object that'll allow us to keep the cookies across requests and not raise
# too much suspicion
s = requests.session()

#the base URL we'll prepend to the 'limb' URL we'll find in the 'next' buttons
base_url = r'http://www.imagebam.com'
limb_url = r'/image/c6b0cf70777810/'

#a list to hold all the image URLs we'll find
image_links = []

# a boolean to tell our loop when to stop
done = False

#keep looping until we're 'done', which means we couldn't find a 'next' button on the page
while not done:

    #reset this flag on every page; the loop over the links sets it to True
    next_found = False

    #combine base and limb URLs to create a full URL we can use
    url = base_url + limb_url
    #make the GET request to the full URL we've just built
    r = s.get(url)

    #create a BeautifulSoup object that we can parse for data
    #(naming the parser explicitly avoids bs4's "no parser specified" warning)
    soup = bs4.BeautifulSoup(r.content, 'html.parser')
    #find all the elements with 'buttonblue' as a class
    link_elems = soup.findAll(attrs={'class': 'buttonblue'})

    #loop over the list of elements
    for link in link_elems:

        #if 'save' is in the text area for the link, we'll add it to our list
        if 'save' in link.text:
            image_links.append(link['href'])
            print('saving this link: ' + link['href'])
        #or if 'next' is in the text, we'll treat that as the next url to parse
        elif 'next' in link.text:
            limb_url = link['href']
            print('found ' + limb_url)
            #remember that this page had a next button; next_found was reset to
            #False before this loop, so a later 'save' link can't undo it
            next_found = True

    #if we haven't found a next button, we can assume we're at the end of a gallery
    if not next_found:
        print("no next button, so we're done with this gallery")
        done = True

#print the list of image links
print(image_links)

2 thoughts on “How to scrape an ImageBam gallery for images with 30 lines of Python”

daGrevis says:

March 16, 2013 at 6:15 pm

God, Python is so beautiful!

Tankor Smash says:

March 25, 2013 at 12:28 pm

I know right! Plus, on a shallow level, the WPSyntax has a really nice color scheme that makes comments and code look pretty nice too.

Tankor Smash's Blog
