How to delete spam comments on your WordPress blog with Python, via the WordPress API

In this tutorial we’ll cover how to delete all those spam comments that get posted to your
blog, using the WordPress API via its fantastic Python wrapper, python-wordpress-xmlrpc.
Before you start, make sure you’ve enabled the XML-RPC API on your blog.

The script pulls all the comments that are in the pending queue. If a comment
contains any word from a list of likely spam words, like ‘ambien’ or ‘oakley’,
the script takes care of it by deleting it. Here’s the full PasteBin of the
script, and as usual, it’s also included in full at the end of this post.
Nothing too tricky, let’s get started!

So first up, the imports. Unlike most of my other tutorials here, we’re only using the
wordpress_xmlrpc library and the built-in Counter class from collections.

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import *
from wordpress_xmlrpc.methods.users import *
from wordpress_xmlrpc.methods.comments import *
from wordpress_xmlrpc.methods.pages import *

from collections import Counter

This is a list of words I’ve built based off the official WordPress spam list,
and then augmented using the top_20_common_words function I’ve added to the
bottom of this post. As I mentioned earlier, any comment that contains any of
these words in the comment content, author, author email or author URL strings
will be removed.

DANGER_WORDS = ["glasses", "longchamp", "oakleys", "oakley", "-online", "4u", "adipex", "advicer", "baccarrat", "blackjack", "bllogspot", "booker", "byob", "car-rental-e-site", "car-rentals-e-site", "carisoprodol", "casino", "casinos", "chatroom", "cialis", "coolcoolhu", "coolhu", "credit-card-debt", "credit-report-4u", "cwas", "cyclen", "cyclobenzaprine", "dating-e-site", "day-trading", "debt-consolidation", "debt-consolidation-consultant", "discreetordering", "duty-free", "dutyfree", "equityloans", "fioricet", "flowers-leading-site", "freenet-shopping", "freenet", "gambling-", "hair-loss", "health-insurancedeals-4u", "homeequityloans", "homefinance", "holdem", "holdempoker", "holdemsoftware", "holdemtexasturbowilson", "hotel-dealse-site", "hotele-site", "hotelse-site", "incest", "insurance-quotesdeals-4u", "insurancedeals-4u", "jrcreations", "levitra", "macinstruct", "mortgage-4-u", "mortgagequotes", "online-gambling", "onlinegambling-4u", "ottawavalleyag", "ownsthis", "palm-texas-holdem-game", "paxil", "penis", "pharmacy", "phentermine", "poker-chip", "poze", "pussy", "rental-car-e-site", "ringtones", "roulette", "shemale", "shoes", "slot-machine", "texas-holdem", "thorcarlson", "top-site", "top-e-site", "tramadol", "trim-spa", "ultram", "valeofglamorganconservatives", "viagra", "vioxx", "xanax", "zolus", ]
DRUG_WORDS = ["ambien", "vuitton", "ambien", "viagra", "cialis", "drug", "hydrocodone", "klonopin", "pill", "withdrawal", "ativan", 'valium', 'clomid', 'rel="nofollow">buy', 'valium', 'xanax', 'marcjacobs', 'watches', 'discount', 'tadalafil', 'premature', 'ejaculation']
RISKY_WORDS = ["sex", "free", "online"]

This function calls the WordPress API to fetch the pending comments, either
from every post on the blog or just from the one specified by post_id. By
default, I want the comments from the entire blog, but you might want to limit
it to a single post if you’re just testing it out.

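def get_all_comments_per_post(client, post_id=""):
    """
    returns a list of up to 2000 pending ('hold') comments, either for the
    whole blog or only for the given post_id
    """
    # the filter keys follow the wp.getComments call: post_id, number, status
    data = {'post_id': post_id,
            'number': 2000,
            'status': 'hold'}

    resp = client.call(GetComments(data))

    return resp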
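
The call above asks for at most 2000 comments in a single request. If your pending queue is larger than that, one option is to page through it using the `offset` key that the `wp.getComments` filter accepts alongside `number`. The following is a minimal Python 3 sketch of that idea. `fetch_all_pages` and `fetch_page` are hypothetical names, and `fake_fetch_page` only simulates a server so the example runs without a blog.

```python
def fetch_all_pages(fetch_page, page_size=100):
    """Collect every item by calling fetch_page(number, offset) until a short page comes back."""
    items = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        items.extend(page)
        if len(page) < page_size:
            # A short (or empty) page means there is nothing left to fetch.
            break
        offset += page_size
    return items


# Stand-in for the real server: 250 pending comment ids.
FAKE_QUEUE = list(range(250))

def fake_fetch_page(number, offset):
    return FAKE_QUEUE[offset:offset + number]

all_ids = fetch_all_pages(fake_fetch_page, page_size=100)
print(len(all_ids))
```

With the real library, `fetch_page` could probably wrap something like `client.call(GetComments({'status': 'hold', 'number': number, 'offset': offset}))`, but treat that as an assumption to check against your own blog.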

This function takes the comment_id of the comment you’d like to delete,
and then uses the client to make the DeleteComment API call to actually delete
it on the remote server. Pretty simple.

def delete_comment(client, comment_id):
    # DeleteComment takes the comment id directly; return the server's answer
    return client.call(DeleteComment(comment_id))
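
Since bulk deletion is awkward to undo, it may be worth running the script once without touching anything. Below is a minimal Python 3 sketch of a dry-run wrapper. `delete_comment_safely` and its `dry_run` flag are hypothetical names, not part of python-wordpress-xmlrpc, and `delete_func` stands in for whatever actually performs the deletion.

```python
def delete_comment_safely(delete_func, comment_id, dry_run=True):
    """Call delete_func(comment_id) unless dry_run is set; return True if a delete was attempted."""
    if dry_run:
        print("would delete comment", comment_id)
        return False
    delete_func(comment_id)
    return True


# Stand-in for the real deletion call, so the example runs offline.
deleted_ids = []

delete_comment_safely(deleted_ids.append, "41")                 # only prints
delete_comment_safely(deleted_ids.append, "42", dry_run=False)  # really "deletes"
print(deleted_ids)
```

In the script, `delete_func` could be something like `lambda cid: delete_comment(client, cid)`.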
    

This is the function where most of the work happens. It takes a list of the
words we want to make sure the comments don’t contain, like ‘ambien’ and
‘viagra’, and compares the comment content, author, author email and author URL
against them. If a comment does contain any of those words, delete it.

You’ll see that the author_email and author_url are surrounded by brackets.
That’s because they’re strings, and you can’t simply add a string to a
list, but you can add one list to another. Remember that “any
string”.split() returns a list of strings.

def delete_comments_containing(client, list_of_words):
    print('searching for comments')
    # use the client that was passed in rather than the global wp
    comments = get_all_comments_per_post(client)

    print('search complete')
    print('count: %d' % len(comments))
    for comment in comments:
        #list of words we're going to look for the spam words in:
        # the comment content, author, author email, and author url
        words_to_match = []
        words_to_match += comment.content.split()
        words_to_match += comment.author.split()
        words_to_match += [comment.author_email]
        words_to_match += [comment.author_url]

        #for each word to match for spam words, if any of the dangerous words
        # appears inside it, delete the comment, and break out of the for-loop, so
        # we don't try to delete the same comment a few times. This also saves a
        # small amount of time per comment
        for word in words_to_match:
            lowered = word.lower()
            # check that a spam word is inside the comment word, not the other way
            # round, so short words and empty strings (e.g. no author url) never match
            if any(l_word in lowered for l_word in list_of_words):
                print("DANGER FROM %r DELETING with ID %s" % (comment.author, comment.id))
                delete_comment(client, comment.id)
                break
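
To see the matching step in isolation, here is a minimal Python 3 sketch that runs without a blog. `FakeComment` is a hypothetical stand-in with the same four attributes the loop reads, and `find_spam_word` is an illustrative helper name. It assumes the intent is to flag a comment when a spam word appears inside any of its words, ignoring case.

```python
from collections import namedtuple

# Hypothetical stand-in for the comment objects the library returns.
FakeComment = namedtuple("FakeComment", "content author author_email author_url")

def find_spam_word(comment, spam_words):
    """Return the first spam word found inside the comment's fields, or None."""
    candidates = comment.content.split() + comment.author.split()
    # The email and URL are single strings, so wrap them in a list before adding.
    candidates += [comment.author_email, comment.author_url]
    for candidate in candidates:
        lowered = candidate.lower()
        for spam_word in spam_words:
            if spam_word in lowered:
                return spam_word
    return None


spam_words = ["oakley", "ambien"]
spam = FakeComment("Cheap OAKLEYS here", "bob", "bob@example.com", "")
ham = FakeComment("Great post, thanks!", "alice", "alice@example.com", "")

print(find_spam_word(spam, spam_words))  # oakley
print(find_spam_word(ham, spam_words))   # None
```

Returning the matched word rather than just True makes it easier to see which entry in the list is catching legitimate comments.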
            

This is a function that you can use yourself to grab the most popular words in
the comments on your blog, to help determine which ones are the most likely to
be spam. There are the obvious ones like viagra and ambien, but also less
obvious ones like glasses. Be careful you don’t cast too wide a net, so you
don’t delete the wrong comment. Luckily, in WordPress, you can always retrieve
the comments you’ve marked as deleted.

def top_20_common_words(comments):
    """
    bonus method for getting the most common words in the comments you pass
    into it, which is useful when you're trying to build a list of all the spammy
    words to auto parse out of your blog

    returns the 20 most common words that are longer than 3 characters
    """

    all_words = []
    for comment in comments:
        for word in comment.content.split():
            all_words.append(word)

    word_counter = Counter(all_words)

    return [word[0] for word in word_counter.most_common() if len(word[0]) >= 4][:20]
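
As a quick way to try this idea without a blog, here is a small Python 3 sketch of a variant that also lowercases each word and strips surrounding punctuation, so that ‘Oakley,’ and ‘oakley’ are counted together. `common_words` and `FakeComment` are hypothetical names used only for this example, and the sample comments are made up.

```python
from collections import Counter, namedtuple
import string

# Hypothetical stand-in: only the .content attribute is needed here.
FakeComment = namedtuple("FakeComment", "content")

def common_words(comments, limit=20, min_length=4):
    """Return up to `limit` of the most frequent words with at least `min_length` characters."""
    counter = Counter()
    for comment in comments:
        for word in comment.content.split():
            # Normalise so that "Oakley," and "oakley" are counted together.
            cleaned = word.lower().strip(string.punctuation)
            if len(cleaned) >= min_length:
                counter[cleaned] += 1
    return [word for word, _count in counter.most_common(limit)]


comments = [
    FakeComment("Buy cheap Oakley glasses"),
    FakeComment("cheap oakley, best deals"),
    FakeComment("Oakley for you"),
]
print(common_words(comments, limit=2))  # ['oakley', 'cheap']
```

Anything this turns up is only a candidate. It is still worth reading the list by hand before adding words to the spam lists.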

Finally, there’s the last bit of code that kicks off the whole shebang. Note
that you’ll need to have the XML-RPC API enabled in your WordPress settings, and
you’ll need to make sure that the username and password match yours.

wp = Client(r'http://www.YOURBLOGURL.com/xmlrpc.php', 'ADMIN_USERNAME', 'ADMIN_PASSWORD')
resp = wp.call(GetCommentStatusList())
print("All Possible Comment Statuses: %s" % (resp,))

wordlist = DANGER_WORDS + DRUG_WORDS
wordlist = [word.lower() for word in wordlist]

print('Searching for Comments...')
delete_comments_containing(wp, wordlist)
print('Search Complete!')
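
Hard-coding an admin password in a script makes it easy to leak, so one option is to read the connection settings from environment variables instead. The sketch below (Python 3) assumes three variable names, `WP_XMLRPC_URL`, `WP_USERNAME` and `WP_PASSWORD`, which are made up for this example. `load_wp_settings` is likewise a hypothetical helper.

```python
import os

def load_wp_settings(environ=os.environ):
    """Read the XML-RPC URL, username and password from environment variables."""
    # These variable names are assumptions for this example, not a convention.
    names = ("WP_XMLRPC_URL", "WP_USERNAME", "WP_PASSWORD")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return tuple(environ[name] for name in names)


# Demonstration with a plain dict instead of the real environment.
demo_env = {
    "WP_XMLRPC_URL": "http://www.YOURBLOGURL.com/xmlrpc.php",
    "WP_USERNAME": "ADMIN_USERNAME",
    "WP_PASSWORD": "ADMIN_PASSWORD",
}
url, username, password = load_wp_settings(demo_env)
print(url, username)
```

The three values can then be passed straight to `Client(url, username, password)`.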

Here’s the script in its entirety:

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import *
from wordpress_xmlrpc.methods.users import *
from wordpress_xmlrpc.methods.comments import *
from wordpress_xmlrpc.methods.pages import *

from collections import Counter

DANGER_WORDS = ["glasses", "longchamp", "oakleys", "oakley", "-online", "4u", "adipex", "advicer", "baccarrat", "blackjack", "bllogspot", "booker", "byob", "car-rental-e-site", "car-rentals-e-site", "carisoprodol", "casino", "casinos", "chatroom", "cialis", "coolcoolhu", "coolhu", "credit-card-debt", "credit-report-4u", "cwas", "cyclen", "cyclobenzaprine", "dating-e-site", "day-trading", "debt-consolidation", "debt-consolidation-consultant", "discreetordering", "duty-free", "dutyfree", "equityloans", "fioricet", "flowers-leading-site", "freenet-shopping", "freenet", "gambling-", "hair-loss", "health-insurancedeals-4u", "homeequityloans", "homefinance", "holdem", "holdempoker", "holdemsoftware", "holdemtexasturbowilson", "hotel-dealse-site", "hotele-site", "hotelse-site", "incest", "insurance-quotesdeals-4u", "insurancedeals-4u", "jrcreations", "levitra", "macinstruct", "mortgage-4-u", "mortgagequotes", "online-gambling", "onlinegambling-4u", "ottawavalleyag", "ownsthis", "palm-texas-holdem-game", "paxil", "penis", "pharmacy", "phentermine", "poker-chip", "poze", "pussy", "rental-car-e-site", "ringtones", "roulette", "shemale", "shoes", "slot-machine", "texas-holdem", "thorcarlson", "top-site", "top-e-site", "tramadol", "trim-spa", "ultram", "valeofglamorganconservatives", "viagra", "vioxx", "xanax", "zolus", ]
DRUG_WORDS = ["ambien</a>", "vuitton", "ambien", "viagra", "cialis", "drug", "hydrocodone", "klonopin", "pill", "withdrawal", "ativan", 'valium', 'clomid', 'rel="nofollow">buy', 'valium</a>', 'xanax', 'marcjacobs', 'watches', 'discount', 'tadalafil', 'premature', 'ejaculation']
RISKY_WORDS = ["sex", "free", "online"]

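def get_all_comments_per_post(client, post_id=""):
    """
    returns a list of up to 2000 pending ('hold') comments, either for the
    whole blog or only for the given post_id
    """
    # the filter keys follow the wp.getComments call: post_id, number, status
    data = {'post_id': post_id,
            'number': 2000,
            'status': 'hold'}

    resp = client.call(GetComments(data))

    return resp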

def delete_comment(client, comment_id):
    # DeleteComment takes the comment id directly; return the server's answer
    return client.call(DeleteComment(comment_id))

def delete_comments_containing(client, list_of_words):
    print('searching for comments')
    # use the client that was passed in rather than the global wp
    comments = get_all_comments_per_post(client)

    print('search complete')
    print('count: %d' % len(comments))
    for comment in comments:
        words = []

        words += comment.content.split()
        words += comment.author.split()
        words += [comment.author_email]
        words += [comment.author_url]
        for word in words:
            lowered = word.lower()
            # check that a spam word is inside the comment word, not the other way
            # round, so short words and empty strings (e.g. no author url) never match
            if any(l_word in lowered for l_word in list_of_words):
                print("DANGER FROM %r DELETING with ID %s" % (comment.author, comment.id))
                delete_comment(client, comment.id)
                break

def top_20_common_words(comments):
    """
    bonus method for getting the most common words in the comments you pass
    into it, which is useful when you're trying to build a list of all the spammy
    words to auto parse out of your blog

    returns the 20 most common words that are longer than 3 characters
    """

    all_words = []
    for comment in comments:
        for word in comment.content.split():
            all_words.append(word)

    word_counter = Counter(all_words)

    return [word[0] for word in word_counter.most_common() if len(word[0]) >= 4][:20]

wp = Client(r'http://www.YOURBLOGURL.com/xmlrpc.php', 'ADMIN_USERNAME', 'ADMIN_PASSWORD')
resp = wp.call(GetCommentStatusList())
print("All Possible Comment Statuses: %s" % (resp,))

wordlist = DANGER_WORDS + DRUG_WORDS
wordlist = [word.lower() for word in wordlist]

print('Searching for Comments...')
delete_comments_containing(wp, wordlist)
print('Search Complete!')
