How to delete spam comments on your WordPress blog with Python, via the WordPress API

In this tutorial we’ll cover how to delete all those spam comments that keep getting posted to your
blog, using the WordPress API via its fantastic Python wrapper, python-wordpress-xmlrpc. Before you start, make sure you’ve enabled the XML-RPC API on your blog.

The script pulls all the comments sitting in the pending
queue, and if a comment contains any word from a list of likely spam words,
like ‘ambien’ or ‘oakley’, it deletes that comment. Here’s the full PasteBin of the script, and as usual, it’s also included in full at the end of this post.
Nothing too tricky, let’s get started!

So first up, the imports. Unlike most of my other tutorials here, we’re only using the
wordpress_xmlrpc library and the Counter class from the standard library’s collections module.

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import *
from wordpress_xmlrpc.methods.users import *
from wordpress_xmlrpc.methods.comments import *
from wordpress_xmlrpc.methods.pages import *

from collections import Counter

This is a list of words I’ve built based off the official WordPress spam list,
and then augmented using the top_20_common_words function I’ve added to the
bottom of this post. As I mentioned earlier, any comment that contains any of
these words in its content, author name, author email or author URL will be
removed.

DANGER_WORDS = ["glasses", "longchamp", "oakleys", "oakley", "-online", "4u", "adipex", "advicer", "baccarrat", "blackjack", "bllogspot", "booker", "byob", "car-rental-e-site", "car-rentals-e-site", "carisoprodol", "casino", "casinos", "chatroom", "cialis", "coolcoolhu", "coolhu", "credit-card-debt", "credit-report-4u", "cwas", "cyclen", "cyclobenzaprine", "dating-e-site", "day-trading", "debt-consolidation", "debt-consolidation-consultant", "discreetordering", "duty-free", "dutyfree", "equityloans", "fioricet", "flowers-leading-site", "freenet-shopping", "freenet", "gambling-", "hair-loss", "health-insurancedeals-4u", "homeequityloans", "homefinance", "holdem", "holdempoker", "holdemsoftware", "holdemtexasturbowilson", "hotel-dealse-site", "hotele-site", "hotelse-site", "incest", "insurance-quotesdeals-4u", "insurancedeals-4u", "jrcreations", "levitra", "macinstruct", "mortgage-4-u", "mortgagequotes", "online-gambling", "onlinegambling-4u", "ottawavalleyag", "ownsthis", "palm-texas-holdem-game", "paxil", "penis", "pharmacy", "phentermine", "poker-chip", "poze", "pussy", "rental-car-e-site", "ringtones", "roulette", "shemale", "shoes", "slot-machine", "texas-holdem", "thorcarlson", "top-site", "top-e-site", "tramadol", "trim-spa", "ultram", "valeofglamorganconservatives", "viagra", "vioxx", "xanax", "zolus", ]
DRUG_WORDS = ["ambien</a>", "vuitton", "ambien", "viagra", "cialis", "drug", "hydrocodone", "klonopin", "pill", "withdrawal", "ativan", 'valium', 'clomid', 'rel="nofollow">buy', 'valium</a>', 'xanax', 'marcjacobs', 'watches', 'discount', 'tadalafil', 'premature', 'ejaculation']
RISKY_WORDS = ["sex", "free", "online"]

This function calls the WordPress API to fetch the pending comments, either
from every post on the blog or just from the post specified by post_id. By
default it fetches comments from the entire blog, but you may want to limit
it to a single post while you’re testing it out.

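def get_all_comments_per_post(client, post_id=""):
    """
    returns a list of the pending ('hold') comments, at most 2000 of them
    """
    data = {'number': 2000,
            'status': 'hold'}
    # the filter key for a single post is 'post_id'; leave it out to search the whole blog
    if post_id:
        data['post_id'] = post_id

    resp = client.call(GetComments(data))

    return resp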
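
The hard-coded limit of 2000 means anything beyond that stays in the queue until the next run. Here’s a minimal sketch of paging through the queue instead, assuming the API accepts an 'offset' key alongside 'number' in the same filter dict. fetch_page is a hypothetical stand-in for whatever wraps client.call(GetComments(...)), and the fake queue is only there so the sketch runs without a blog:

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every item by requesting successive pages until one comes back short.

    fetch_page is a hypothetical callable taking (offset, number) and returning a list.
    """
    items = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        items.extend(page)
        # a page shorter than page_size means we've reached the end of the queue
        if len(page) < page_size:
            return items
        offset += page_size


# Stand-in data: 250 fake pending comment IDs instead of a real API call.
fake_queue = list(range(250))


def fake_fetch(offset, number):
    return fake_queue[offset:offset + number]


all_ids = fetch_all(fake_fetch, page_size=100)
```

If you go this route, collect everything first and delete afterwards; deleting while you’re still paging shifts the offsets underneath you.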

This function takes the comment_id of the comment you’d like to delete,
and then uses the client to make the DeleteComment API call that actually deletes
it on the remote server. Pretty simple.

def delete_comment(client, comment_id):
    # DeleteComment takes the comment ID itself, not a dict
    resp = client.call(DeleteComment(comment_id))
    return resp
    

This is the function where most of the work happens. It takes the list of
words we want to make sure the comments don’t contain, like ‘ambien’ and
‘viagra’, and compares each comment’s content, author, author email and
author URL against them. If a comment contains any of those words, delete it.

You’ll see that the author_email and author_url are surrounded by brackets.
That’s because they’re strings, and using += to add a bare string to a
list adds each character as a separate item, whereas adding one list to
another adds its items whole. Remember that “any
string”.split() returns a list of strings.
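
To see the difference for yourself, here’s a quick sketch with made-up values (the email address is just a placeholder):

```python
words = []
words += "cheap oakley glasses".split()  # split() gives a list, so three whole words are added
words += ["spammer@example.com"]         # wrapped in a list: added as a single item
words += "abc"                           # bare string: added one character at a time

# words is now:
# ['cheap', 'oakley', 'glasses', 'spammer@example.com', 'a', 'b', 'c']
```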

def delete_comments_containing(client, list_of_words):
    print 'searching for comments'
    # use the client that was passed in rather than the global wp
    comments = get_all_comments_per_post(client)

    print 'search complete'
    print 'count:', len(comments)
    for comment in comments:
        #list of words we're going to look for the spam words in:
        # the comment content, author, author email and author url
        words_to_match = []
        words_to_match += comment.content.split()
        words_to_match += comment.author.split()
        words_to_match += [comment.author_email]
        words_to_match += [comment.author_url]

        #if any of these words contains one of the spam words, delete the comment
        # and break out of the for-loop, so we don't try to delete the same
        # comment a few times or keep checking it after it's gone
        for word in words_to_match:
            # list_of_words is lowercase already, so lowercase the word as well;
            # the substring test also covers exact matches
            if any(l_word in word.lower() for l_word in list_of_words):
                print "DANGER FROM", comment.author.encode(errors="replace"), "DELETING with ID", comment.id
                delete_comment(client, comment.id)
                break
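
If you’d rather try the matching on some sample text before letting it loose on your blog, the check can be pulled out into its own function. This is only a sketch, assuming a comment counts as spam when a listed word shows up anywhere inside one of its words; find_spam_word is a hypothetical helper of mine, not part of the script or the library:

```python
def find_spam_word(fields, spam_words):
    """Return the first spam word found inside any word of the given fields, else None."""
    for field in fields:
        if not field:  # skip a missing or empty author email / URL
            continue
        for token in field.lower().split():
            for spam_word in spam_words:
                if spam_word in token:
                    return spam_word
    return None


spam_words = ["ambien", "oakley"]
hit = find_spam_word(["Buy AMBIEN now", "Bob", None, "http://example.com"], spam_words)
miss = find_spam_word(["Great post, thanks!", "Alice", "alice@example.com", ""], spam_words)
# hit is 'ambien', miss is None
```

Keep in mind that substring matching is greedy: a short entry like "pill" will also flag an innocent "pillow", so it’s worth printing what would be deleted before actually deleting it.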
            

You can use this function yourself to grab the most popular words in
the comments on your blog and work out which ones are the most likely to be
spam. There are the obvious ones like viagra and ambien, but also less
obvious ones like glasses. Be careful not to cast too wide a net, or you’ll
end up deleting legitimate comments. Luckily, WordPress normally moves deleted
comments to the Trash, so you can restore them from there until it’s emptied.

def top_20_common_words(comments):
    """
    bonus method for getting the most common words in the comments you pass
    into it, which is useful when you're trying to build a list of all the spammy
    words to auto parse out of your blog

    returns the 20 most common words that are at least 4 characters long
    """

    all_words = []
    for comment in comments:
        for word in comment.content.split():
            all_words.append(word)

    word_counter = Counter(all_words)

    return [word[0] for word in word_counter.most_common() if len(word[0]) >= 4][:20]
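
One thing to watch for is that the function above counts ‘Oakley’, ‘oakley’ and ‘oakley,’ as three different words. Here’s a rough variation that lowercases each word and strips the punctuation around it before counting. The names top_words and FakeComment are made up for this sketch, and the sample comments stand in for the real comment objects, which only need a content attribute:

```python
from collections import Counter, namedtuple
import string

# Stand-in for a real comment object; only the content attribute is used.
FakeComment = namedtuple("FakeComment", "content")


def top_words(comments, limit=20, min_length=4):
    """Return the most common normalized words that are at least min_length long."""
    counter = Counter()
    for comment in comments:
        for word in comment.content.lower().split():
            word = word.strip(string.punctuation)  # drop leading/trailing punctuation
            if len(word) >= min_length:
                counter[word] += 1
    return [word for word, _count in counter.most_common(limit)]


sample = [
    FakeComment("Cheap Oakley glasses!"),
    FakeComment("oakley, oakley, OAKLEY sale"),
    FakeComment("Nice glasses"),
]
top = top_words(sample, limit=2)
# top is ['oakley', 'glasses']
```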

Finally, there’s the last bit of code that kicks off the whole shebang. Note
that you’ll need to have the XML-RPC API enabled in your WordPress settings, and
you’ll need to make sure that the username and password match yours.

wp = Client(r'http://www.YOURBLOGURL.com/xmlrpc.php', 'ADMIN_USERNAME', 'ADMIN_PASSWORD')
resp = wp.call(GetCommentStatusList())
print "All Possible Comment Statuses:", resp

wordlist = DANGER_WORDS + DRUG_WORDS
wordlist = [word.lower() for word in wordlist]

print 'Searching for Comments...'
delete_comments_containing(wp, wordlist)
print 'Search Complete!'
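
The two lists overlap a little (viagra, cialis and xanax are in both), so the combined wordlist has a few duplicates. That’s harmless, but if you want a clean list, here’s a small sketch that merges any number of lists into one lowercase list without repeats. merge_word_lists is a hypothetical helper, not something the script above uses:

```python
def merge_word_lists(*lists):
    """Merge word lists into one lowercase list, keeping the first occurrence of each word."""
    seen = set()
    merged = []
    for words in lists:
        for word in words:
            word = word.lower()
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


merged = merge_word_lists(["Viagra", "casino"], ["ambien", "viagra"])
# merged is ['viagra', 'casino', 'ambien']
```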

Here’s the script in its entirety:

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import *
from wordpress_xmlrpc.methods.users import *
from wordpress_xmlrpc.methods.comments import *
from wordpress_xmlrpc.methods.pages import *

from collections import Counter

DANGER_WORDS = ["glasses", "longchamp", "oakleys", "oakley", "-online", "4u", "adipex", "advicer", "baccarrat", "blackjack", "bllogspot", "booker", "byob", "car-rental-e-site", "car-rentals-e-site", "carisoprodol", "casino", "casinos", "chatroom", "cialis", "coolcoolhu", "coolhu", "credit-card-debt", "credit-report-4u", "cwas", "cyclen", "cyclobenzaprine", "dating-e-site", "day-trading", "debt-consolidation", "debt-consolidation-consultant", "discreetordering", "duty-free", "dutyfree", "equityloans", "fioricet", "flowers-leading-site", "freenet-shopping", "freenet", "gambling-", "hair-loss", "health-insurancedeals-4u", "homeequityloans", "homefinance", "holdem", "holdempoker", "holdemsoftware", "holdemtexasturbowilson", "hotel-dealse-site", "hotele-site", "hotelse-site", "incest", "insurance-quotesdeals-4u", "insurancedeals-4u", "jrcreations", "levitra", "macinstruct", "mortgage-4-u", "mortgagequotes", "online-gambling", "onlinegambling-4u", "ottawavalleyag", "ownsthis", "palm-texas-holdem-game", "paxil", "penis", "pharmacy", "phentermine", "poker-chip", "poze", "pussy", "rental-car-e-site", "ringtones", "roulette", "shemale", "shoes", "slot-machine", "texas-holdem", "thorcarlson", "top-site", "top-e-site", "tramadol", "trim-spa", "ultram", "valeofglamorganconservatives", "viagra", "vioxx", "xanax", "zolus", ]
DRUG_WORDS = ["ambien</a>", "vuitton", "ambien", "viagra", "cialis", "drug", "hydrocodone", "klonopin", "pill", "withdrawal", "ativan", 'valium', 'clomid', 'rel="nofollow">buy', 'valium</a>', 'xanax', 'marcjacobs', 'watches', 'discount', 'tadalafil', 'premature', 'ejaculation']
RISKY_WORDS = ["sex", "free", "online"]

def get_all_comments_per_post(client, post_id=""):
    """
    returns a list of the pending ('hold') comments, at most 2000 of them
    """
    data = {'number': 2000,
            'status': 'hold'}
    # the filter key for a single post is 'post_id'; leave it out to search the whole blog
    if post_id:
        data['post_id'] = post_id

    resp = client.call(GetComments(data))

    return resp

def delete_comment(client, comment_id):
    # DeleteComment takes the comment ID itself, not a dict
    resp = client.call(DeleteComment(comment_id))
    return resp

def delete_comments_containing(client, list_of_words):
    print 'searching for comments'
    # use the client that was passed in rather than the global wp
    comments = get_all_comments_per_post(client)

    print 'search complete'
    print 'count:', len(comments)
    for comment in comments:
        words = []

        words += comment.content.split()
        words += comment.author.split()
        words += [comment.author_email]
        words += [comment.author_url]
        for word in words:
            # delete when any spam word appears inside this word (covers exact matches too)
            if any(l_word in word.lower() for l_word in list_of_words):
                print "DANGER FROM", comment.author.encode(errors="replace"), "DELETING with ID", comment.id
                delete_comment(client, comment.id)
                break

def top_20_common_words(comments):
    """
    bonus method for getting the most common words in the comments you pass
    into it, which is useful when you're trying to build a list of all the spammy
    words to auto parse out of your blog

    returns all words longer than 3 characters
    """

    all_words = []
    for comment in comments:
        for word in comment.content.split():
            all_words.append(word)

    word_counter = Counter(all_words)

    return [word[0] for word in word_counter.most_common() if len(word[0]) >= 4][:20]

wp = Client(r'http://www.YOURBLOGURL.com/xmlrpc.php', 'ADMIN_USERNAME', 'ADMIN_PASSWORD')
resp = wp.call(GetCommentStatusList())
print "All Possible Comment Statuses:", resp

wordlist = DANGER_WORDS + DRUG_WORDS
wordlist = [word.lower() for word in wordlist]

print 'Searching for Comments...'
delete_comments_containing(wp, wordlist)
print 'Search Complete!'

Imgur API part 3: OAuth2

It’s been a while since I wrote the last one, but a user on reddit asked how to get OAuth working with Imgur, and I couldn’t resist giving it one more shot. Thanks to my recent experience with the Netflix API, even though it was in C# rather than in Python, I was able to wrangle a quick and easy script for getting started.

Here’s the PasteBin with the complete code, though it’s shown at the bottom of the page, as usual. (more…)

How to scrape an ImageBam gallery for images with 30 lines of Python

Right off the bat, I want to show you the results of this scraping, to give you a bit of motivation. Anyways, thanks to requests and BeautifulSoup, this is made trivially easy. Enough talking, let’s get down to the code! Don’t forget that as usual I’ll include the full source code at the bottom of the post.

(more…)

Reddit API Tutorial Part 2: Getting all the submissions to any given subreddit

What’s up? Today we’re going to look at how to retrieve the stories (official term for submissions or selfposts) from any given subreddit. What we’re going to do is pretty simple, essentially just customizing a url with the proper subreddit and reading the JSON object returned. It’s going to be a pretty short one. I’m going to attach the login code I’ve written along with the code we’ve looked at today so that you can just copy and paste it into your IDE and start playing with it right away. Just make sure you’ve got all the required modules installed, mentioned here. Hit the jump to get started!

Tinypaste Link for entire code

(more…)

reddit API Part 1: Logging In

Welcome to the first part of my reddit API tutorial for Python 2.7! In this short tutorial we will just focus on signing in to reddit’s API so we can interact with it later.

Hopefully you’ve read the introduction on the modules we’ll be using found here, so if you’re a beginner, you won’t be that lost.

Before we start, I am just going to give you a brief overview of what we are going to do: create a python DICT that has your reddit account name and password in it, so that we can send it to the API with our request. Then, armed with our modhash that we received from the API, we can move on to interacting with reddit, which we’ll check out in the next part of this tutorial!

Tinypaste of the entire code as seen at the bottom of the page

Hit the jump for how to log in to reddit’s API. As usual, the full code will be shown at the end of this page.

(more…)