How to delete spam comments on your WordPress blog, with Python via the WordPress API

In this tutorial we’ll cover how to delete all those spam comments that keep getting posted to your
blog, using the WordPress API via its fantastic Python wrapper, python-wordpress-xmlrpc. Before you start, make sure you’ve enabled the XML-RPC API on your blog.

The script pulls all the comments that are in the pending
queue, and if a comment contains any word from a list of likely spam words,
like ‘ambien’ or ‘oakley’, it deletes that comment. Here’s the full PasteBin of the script, and as usual, it’s also included in full at the end of this post.
Nothing too tricky, so let’s get started!

So first up, the imports. Unlike most of my other tutorials here, we’re only using the
wordpress_xmlrpc library and then the builtin Counter class.

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import *
from wordpress_xmlrpc.methods.users import *
from wordpress_xmlrpc.methods.comments import *
from wordpress_xmlrpc.methods.pages import *

from collections import Counter

This is a list of words I’ve built based on the official WordPress spam list,
and then augmented using the top_20_common_words function I’ve added to the
bottom of this post. As I mentioned earlier, any comment that contains any of
these words in the comment content, author, author email or author URL strings will be
removed.

DANGER_WORDS = ["glasses", "longchamp", "oakleys", "oakley", "-online", "4u", "adipex", "advicer", "baccarrat", "blackjack", "bllogspot", "booker", "byob", "car-rental-e-site", "car-rentals-e-site", "carisoprodol", "casino", "casinos", "chatroom", "cialis", "coolcoolhu", "coolhu", "credit-card-debt", "credit-report-4u", "cwas", "cyclen", "cyclobenzaprine", "dating-e-site", "day-trading", "debt-consolidation", "debt-consolidation-consultant", "discreetordering", "duty-free", "dutyfree", "equityloans", "fioricet", "flowers-leading-site", "freenet-shopping", "freenet", "gambling-", "hair-loss", "health-insurancedeals-4u", "homeequityloans", "homefinance", "holdem", "holdempoker", "holdemsoftware", "holdemtexasturbowilson", "hotel-dealse-site", "hotele-site", "hotelse-site", "incest", "insurance-quotesdeals-4u", "insurancedeals-4u", "jrcreations", "levitra", "macinstruct", "mortgage-4-u", "mortgagequotes", "online-gambling", "onlinegambling-4u", "ottawavalleyag", "ownsthis", "palm-texas-holdem-game", "paxil", "penis", "pharmacy", "phentermine", "poker-chip", "poze", "pussy", "rental-car-e-site", "ringtones", "roulette", "shemale", "shoes", "slot-machine", "texas-holdem", "thorcarlson", "top-site", "top-e-site", "tramadol", "trim-spa", "ultram", "valeofglamorganconservatives", "viagra", "vioxx", "xanax", "zolus", ]
DRUG_WORDS = ["ambien</a>", "vuitton", "ambien", "viagra", "cialis", "drug", "hydrocodone", "klonopin", "pill", "withdrawal", "ativan", 'valium', 'clomid', 'rel="nofollow">buy', 'valium</a>', 'xanax', 'marcjacobs', 'watches', 'discount', 'tadalafil', 'premature', 'ejaculation']
RISKY_WORDS = ["sex", "free", "online"]

This function calls the WordPress API to get the comments back from either
every post on the blog, or just the one specified by the post_id. It only asks
for comments with the ‘hold’ status, which is the pending queue, and for at
most 2000 of them. By default, I want to get the comments from the entire
blog, but you might not want that if you’re just testing it out.

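def get_all_comments_per_post(client, post_id=""):
    """
    returns a list of the pending comments, either for the whole blog
    or for just the post with the given post_id
    """
    # 'post_id' is the filter key GetComments reads; leave it empty for every post
    data = {'post_id': post_id,
            'number': 2000,
            'status': 'hold'}

    resp = client.call(GetComments(data))

    return resp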
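If your pending queue ever grows past the 2000 comment cap, one option is to ask for the comments a page at a time. Here’s a minimal sketch of that idea. Note that fetch_all and fake_fetch_page are names I’ve made up for illustration, and the fake function just slices a list so the example runs without a blog. In real use, fetch_page would presumably wrap client.call(GetComments(...)) with ‘number’ and ‘offset’ keys in the filter dict, but treat the ‘offset’ key as an assumption and check the library docs first.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every item by asking for one page at a time.

    fetch_page is a hypothetical callable taking (offset, number) and
    returning a list; it stands in for the real API call.
    """
    items = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        items.extend(page)
        # a short (or empty) page means we've reached the end
        if len(page) < page_size:
            break
        offset += page_size
    return items


# stand-in for the blog: 250 fake pending comment ids
FAKE_QUEUE = list(range(250))


def fake_fetch_page(offset, number):
    return FAKE_QUEUE[offset:offset + number]


all_items = fetch_all(fake_fetch_page, page_size=100)
print(len(all_items))
```

Collect everything first and delete afterwards. If you delete comments between pages, the queue shrinks and the offsets no longer line up.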

This is a function that takes the comment_id of the comment you’d like to delete,
and then uses the client to make the DeleteComment API call to actually delete
it on the remote server. Pretty simple.

def delete_comment(client, comment_id):
    # DeleteComment takes the comment's id directly
    resp = client.call(DeleteComment(comment_id))
    return resp
    

This is the function where most of the work happens. It takes a list of the
words we want to make sure the comments don’t contain, like ‘ambien’ and
‘viagra’, and then compares the comment content, author, author email and
author URL against them. If they do contain those words, delete ‘em.

You’ll see that the author_email and author_url are surrounded by brackets.
That’s because they’re strings, and adding a string to a list with += adds each
character separately, while adding one list to another adds its items whole.
Remember that “any string”.split() returns a list of strings.
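If the bracket business isn’t clear, here’s a tiny example of the difference you can run on its own. The strings are just made-up sample values.

```python
words = []
words += "spam comment".split()   # split() gives a list, so two items are added
words += ["me@example.com"]       # the brackets add the whole string as one item

chars = []
chars += "abc"                    # no brackets: each character is added separately

print(words)
print(chars)
```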

def delete_comments_containing(client, list_of_words):
    print 'searching for comments'
    comments = get_all_comments_per_post(client)

    print 'search complete'
    print 'count:', len(comments)
    for comment in comments:
        #list of words we're going to look for the spam words in:
        # the comment content, author, author email and author url
        words_to_match = []
        words_to_match += comment.content.split()
        words_to_match += comment.author.split()
        words_to_match += [comment.author_email]
        words_to_match += [comment.author_url]

        #for each word to match for spam words, if any of them are in the dangerous
        # words, delete the comment, and break out of the for-loop, so we don't try to delete
        # the same comment a few times. This also saves a small amount of time per comment
        # so you're not checking the same comment a ton of times
        for word in words_to_match:
            word = word.lower()
            #match the exact spam word, or a spam word hiding inside a longer
            # string, like a url or an email address
            if word in list_of_words or \
               any(l_word in word for l_word in list_of_words):
                print "DANGER FROM", comment.author.encode(errors="replace"), "DELETING with ID", comment.id
                delete_comment(client, comment.id)
                break
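Since deleting is the scary part, it can help to try the matching step by itself before pointing anything at your blog. Below is a rough sketch of that check as a plain function that never talks to WordPress. The name find_spam_word, the short SAMPLE_WORDS list and the sample comments are all made up for this example. It reports a hit when a spam word appears anywhere inside one of the comment’s words.

```python
def find_spam_word(text_fields, spam_words):
    """Return the first spam word found in any of the text fields, or None."""
    for field in text_fields:
        for word in field.lower().split():
            for spam_word in spam_words:
                # substring check, so 'oakley' also catches 'oakleys'
                if spam_word in word:
                    return spam_word
    return None


SAMPLE_WORDS = ["ambien", "oakley"]

clean = ["Great post, thanks!", "Alice", "alice@example.com"]
spammy = ["Buy cheap OAKLEYS here", "Bob", "bob@example.com"]

print(find_spam_word(clean, SAMPLE_WORDS))
print(find_spam_word(spammy, SAMPLE_WORDS))
```

Printing what would be matched, and reading through it, is a cheap way to spot a word in your list that catches legitimate comments.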
            

This is a function that you can use yourself to grab the most common words in
the comments on your blog, to help determine which ones are most likely to be
spam. There are the obvious ones like viagra and ambien, but also less obvious
ones like glasses. Be careful you don’t cast too wide a net, so you don’t
delete the wrong comment. Luckily, in WordPress, you can always retrieve
the comments you’ve marked as deleted.

def top_20_common_words(comments):
    """
    bonus method for getting the most common words in the comments you pass
    into it, which is useful when you're trying to build a list of all the spammy
    words to auto parse out of your blog

    returns the 20 most common words that are longer than 3 characters
    """

    all_words = []
    for comment in comments:
        for word in comment.content.split():
            all_words.append(word)

    word_counter = Counter(all_words)

    return [word[0] for word in word_counter.most_common() if len(word[0]) >= 4][:20]
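One thing the function above doesn’t do is tidy the words up, so ‘Oakley’, ‘oakley’ and ‘oakley!’ all get counted separately. Here’s a variant you could try, assuming you want lowercase words with the punctuation stripped off the ends. The name common_words and the sample comments are mine, and it takes plain strings instead of comment objects so it runs without a blog.

```python
from collections import Counter
import string


def common_words(texts, top_n=20, min_length=4):
    """Return the top_n most common words of at least min_length characters."""
    counter = Counter()
    for text in texts:
        for word in text.lower().split():
            # drop punctuation stuck to either end, like 'glasses!'
            word = word.strip(string.punctuation)
            if len(word) >= min_length:
                counter[word] += 1
    return [word for word, count in counter.most_common(top_n)]


sample = ["Cheap Oakley glasses!", "oakley outlet, cheap glasses", "Nice post"]
print(common_words(sample, top_n=3))
```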

Finally, there’s the last bit of code that kicks off the whole shebang. Note
that you’ll need to have the XML-RPC API enabled in your WordPress settings, and
you’ll need to make sure that the username and password match yours.

wp = Client(r'http://www.YOURBLOGURL.com/xmlrpc.php', 'ADMIN_USERNAME', 'ADMIN_PASSWORD')
resp = wp.call(GetCommentStatusList())
print "All Possible Comment Statuses:", resp

wordlist = DANGER_WORDS + DRUG_WORDS
wordlist = [word.lower() for word in wordlist]

print 'Searching for Comments...'
delete_comments_containing(wp, wordlist)
print 'Search Complete!'

Here’s the script in its entirety:

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import *
from wordpress_xmlrpc.methods.users import *
from wordpress_xmlrpc.methods.comments import *
from wordpress_xmlrpc.methods.pages import *

from collections import Counter

DANGER_WORDS = ["glasses", "longchamp", "oakleys", "oakley", "-online", "4u", "adipex", "advicer", "baccarrat", "blackjack", "bllogspot", "booker", "byob", "car-rental-e-site", "car-rentals-e-site", "carisoprodol", "casino", "casinos", "chatroom", "cialis", "coolcoolhu", "coolhu", "credit-card-debt", "credit-report-4u", "cwas", "cyclen", "cyclobenzaprine", "dating-e-site", "day-trading", "debt-consolidation", "debt-consolidation-consultant", "discreetordering", "duty-free", "dutyfree", "equityloans", "fioricet", "flowers-leading-site", "freenet-shopping", "freenet", "gambling-", "hair-loss", "health-insurancedeals-4u", "homeequityloans", "homefinance", "holdem", "holdempoker", "holdemsoftware", "holdemtexasturbowilson", "hotel-dealse-site", "hotele-site", "hotelse-site", "incest", "insurance-quotesdeals-4u", "insurancedeals-4u", "jrcreations", "levitra", "macinstruct", "mortgage-4-u", "mortgagequotes", "online-gambling", "onlinegambling-4u", "ottawavalleyag", "ownsthis", "palm-texas-holdem-game", "paxil", "penis", "pharmacy", "phentermine", "poker-chip", "poze", "pussy", "rental-car-e-site", "ringtones", "roulette", "shemale", "shoes", "slot-machine", "texas-holdem", "thorcarlson", "top-site", "top-e-site", "tramadol", "trim-spa", "ultram", "valeofglamorganconservatives", "viagra", "vioxx", "xanax", "zolus", ]
DRUG_WORDS = ["ambien</a>", "vuitton", "ambien", "viagra", "cialis", "drug", "hydrocodone", "klonopin", "pill", "withdrawal", "ativan", 'valium', 'clomid', 'rel="nofollow">buy', 'valium</a>', 'xanax', 'marcjacobs', 'watches', 'discount', 'tadalafil', 'premature', 'ejaculation']
RISKY_WORDS = ["sex", "free", "online"]

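def get_all_comments_per_post(client, post_id=""):
    """
    returns a list of the pending comments, either for the whole blog
    or for just the post with the given post_id
    """
    # 'post_id' is the filter key GetComments reads; leave it empty for every post
    data = {'post_id': post_id,
            'number': 2000,
            'status': 'hold'}

    resp = client.call(GetComments(data))

    return resp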

def delete_comment(client, comment_id):
    # DeleteComment takes the comment's id directly
    resp = client.call(DeleteComment(comment_id))
    return resp

def delete_comments_containing(client, list_of_words):
    print 'searching for comments'
    comments = get_all_comments_per_post(client)

    print 'search complete'
    print 'count:', len(comments)
    for comment in comments:
        words = []

        words += comment.content.split()
        words += comment.author.split()
        words += [comment.author_email]
        words += [comment.author_url]
        for word in words:
            word = word.lower()
            #match the exact spam word, or a spam word hiding inside a longer
            # string, like a url or an email address
            if word in list_of_words or \
               any(l_word in word for l_word in list_of_words):
                print "DANGER FROM", comment.author.encode(errors="replace"), "DELETING with ID", comment.id
                delete_comment(client, comment.id)
                break

def top_20_common_words(comments):
    """
    bonus method for getting the most common words in the comments you pass
    into it, which is useful when you're trying to build a list of all the spammy
    words to auto parse out of your blog

    returns the 20 most common words that are longer than 3 characters
    """

    all_words = []
    for comment in comments:
        for word in comment.content.split():
            all_words.append(word)

    word_counter = Counter(all_words)

    return [word[0] for word in word_counter.most_common() if len(word[0]) >= 4][:20]

wp = Client(r'http://www.YOURBLOGURL.com/xmlrpc.php', 'ADMIN_USERNAME', 'ADMIN_PASSWORD')
resp = wp.call(GetCommentStatusList())
print "All Possible Comment Statuses:", resp

wordlist = DANGER_WORDS + DRUG_WORDS
wordlist = [word.lower() for word in wordlist]

print 'Searching for Comments...'
delete_comments_containing(wp, wordlist)
print 'Search Complete!'

reddit API Part 1: Logging In

Welcome to the first part of my reddit API tutorial for Python 2.7! In this short tutorial we will just focus on signing in to reddit’s API so we can interact with it later.

Hopefully you’ve read the introduction on the modules we’ll be using found here, so if you’re a beginner, you won’t be that lost.

Before we start, I am just going to give you a brief overview of what we are going to do: create a Python dict that has your reddit account name and password in it, so that we can send it to the API with our request. Then, armed with the modhash that we receive back from the API, we can move on to interacting with reddit, which we’ll check out in the next part of this tutorial!
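To give a rough idea of the shape of things, here’s a small sketch that builds the login dict and pulls a modhash out of a response. Everything specific in it is an assumption on my part: the field names ‘user’, ‘passwd’ and ‘api_type’, the layout of the response, and the helper names build_login_payload and extract_modhash are for illustration only, and the sample response is made up. Nothing here touches the network.

```python
def build_login_payload(username, password):
    # assumed field names for the login form; check reddit's API docs
    return {'user': username, 'passwd': password, 'api_type': 'json'}


def extract_modhash(response_dict):
    # assumed response layout: {'json': {'data': {'modhash': ...}}}
    return response_dict['json']['data']['modhash']


payload = build_login_payload('YOUR_USERNAME', 'YOUR_PASSWORD')

# made-up response, standing in for what the API would send back
sample_response = {'json': {'errors': [], 'data': {'modhash': 'abc123'}}}

print(sorted(payload.keys()))
print(extract_modhash(sample_response))
```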

Tinypaste of the entire code as seen at the bottom of the page

Hit the jump for how to log in to reddit’s API. As usual, the full code will be shown at the end of this page.


Imgur API part 2: Downloading a Gallery

This is the second installment of my Imgur API tutorial: ‘How to download entire Imgur galleries’. Check out part 1 here in case you missed how to log into the API and upload an image!

I’m actually sort of cheating here, because we don’t need to use the API at all if we don’t want to. That’s because we’re only going to deal with the galleries that are built from the images submitted to reddit. This means that after you’re done with this tutorial, you’ll be able to point the script at a given subreddit’s name, and grab all the images that have been submitted to, say, /r/aww or /r/wallpapers.

I will be writing another tutorial soon for galleries and albums unrelated to reddit.com as well as grabbing gallery information, such as the title and other descriptive things like that, but that’s less related to the actual downloading of the gallery, which is what we’re interested in today!

Please note that I’m working with Python 2.7 on Windows 7 64-bit, so you might have to modify the code slightly to accommodate for your platform, or OS.

Anyways, hit the jump to get started!

Edit: reddit user easttntoppedtree caught that it maxes out at 56 images, so you’ll have to add /page/PAGENUMBERHERE.json to the end of the URL to get the next 56 images, like so: ‘http://imgur.com/r/scarlettjohansson/top/page/1.json’, keeping in mind that 0 (zero) is a valid page number.
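Building those page URLs by hand gets old fast, so here’s a quick sketch of a helper that follows the URL pattern from the edit above. The function name gallery_page_url and the sort parameter are my own additions for illustration. It only builds the strings and doesn’t download anything.

```python
def gallery_page_url(subreddit, page, sort='top'):
    """Build the .json url for one page of a subreddit's Imgur gallery."""
    return 'http://imgur.com/r/{0}/{1}/page/{2}.json'.format(subreddit, sort, page)


# pages start at 0, so this gives the first three pages
urls = [gallery_page_url('aww', page) for page in range(3)]

for url in urls:
    print(url)
```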

Tinypaste link for the full working code, as seen at the bottom of the page


Imgur API part 1: getting an Anonymous Key and uploading an image

This first part will focus on creating an anonymous API (application programming interface) key for your script to use, and then uploading an image to imgur.com. For this, and any following tutorials, I will be using Windows 7 x64, with Python 2.7.

You will need to create an anonymous API key at http://imgur.com/register/api_anon. It’s really simple: just enter the name of your app, so the dudes over at Imgur know what your intentions are, then your personal information, and finally the reCAPTCHA, to verify you’re not a bot. The next page will give you your developer API key; this is important. This key will allow you to interact with the Imgur API. Since this is the limited Anonymous API, you will only have access to basic functions, like uploading images from your computer or from another website, and getting gallery and image information. Luckily, that’s enough for our purposes here.

The documentation for the Imgur API that we’ll be using today is found here: http://api.imgur.com/resources_anon, and the requests module documentation is here: http://docs.python-requests.org/en/latest/user/quickstart/. Now, open up your favourite text editor for coding. I like using Wing IDE Professional, but I’m sure Notepad++ or any other one would work just as nicely. This is where the fun part begins. Hit the jump!

Tinypaste of the full working code!


Python API Basics

I have a set of imports that I like to have at the top of every API client I write. Some are built in, but some are not. I’ll just quickly go over them before you get started on the fun stuff. This post is just to list off the modules you might need. If you’re on a 64-bit OS, you might need to take a look here for compatible packages; I know I did. For the entirety of the tutorials, I’ll be using Windows 7 x64, with Python 2.7, so maybe you’ll have a different experience with Python 3, and on a different OS.*

First off, the requests module is essential to everything we’ll be doing within the following tutorials. It handles all the hard-to-understand stuff, like POST, PUT, and GET requests, as well as cookies, so we don’t have to worry about it much at all; an ideal Python module.

Secondly, pprint is a nice thing to have, since we’ll be dealing a lot with dictionaries, and the regular print statement doesn’t print them in the most readable of formats. Instead of printing everything on a single line, pprint wraps long lines and puts items on separate lines where doing so helps with legibility. For example, a dictionary would normally be printed like this:

dictionary = {'a': 1, 'b': 2, 'c': 3}

but with pprint it’d look something more like this:

dictionary = {'a' : 1,
              'b' : 2,
              'c' : 3}

which is amazing when you have huge dictionaries that you need to visually parse.
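If you’d like to see the difference for yourself, here’s a small example. The profile dict is made-up sample data, and I’m using pformat (pprint’s sibling that returns the string instead of printing it) with a narrow width so the wrapping kicks in.

```python
from pprint import pformat

# made-up sample data, just big enough to not fit on one line
profile = {
    'name': 'example_user',
    'links': ['http://example.com/first-link', 'http://example.com/second-link'],
    'karma': {'comment': 10, 'link': 20},
    'is_gold': False,
}

one_line = repr(profile)               # what a plain print would show
wrapped = pformat(profile, width=60)   # what pprint would show

print(one_line)
print(wrapped)
```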

Thirdly, the json module is another essential one, as the APIs will often feed us JSON data. JSON objects look a lot like Python dictionaries, written in JavaScript’s notation. We’ll take the response text from the requests module, and feed it into the json.loads function, and it’ll return a native Python dict that we can manipulate like it’s not even a big deal. It’s pretty great.
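Here’s json.loads by itself on a hand-written string, so you can see what comes out the other end without making a request. The JSON text is a made-up sample.

```python
import json

# made-up JSON text, like an API might send back
raw = '{"name": "example", "tags": ["python", "api"], "active": true, "score": null}'

data = json.loads(raw)   # JSON text in, Python dict out

print(data['tags'][0])               # JSON arrays become lists
print(data['active'], data['score']) # true becomes True, null becomes None

# and json.dumps goes the other way, from a dict back to JSON text
back = json.dumps(data, sort_keys=True)
print(back)
```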

Here’s an example of all three modules working together:

#imports
from pprint import pprint
import requests
import json

r = requests.get(r'http://www.reddit.com/user/tankorsmash/about/.json')
#print r.text  #raw text response as a string

j = json.loads(r.text)  #turn the json response into a python dict
#print j  #now it's a python dict

pprint(j)  #here's the final response, printed out in a nice and readable format

There you have it, the basic Python modules I’ll be using throughout the next few tutorials.

*thanks to EuphoriaForAll for the Python Version inclusion here