How to delete spam comments on your WordPress blog, with Python via the WordPress API

In this tutorial we’ll be covering how to delete all those spam comments posted to your
blog, using the WordPress API via its fantastic Python wrapper, python-wordpress-xmlrpc. Make sure you’ve enabled the XML-RPC API on your blog before starting.

Here’s how the script works: it pulls all the comments in the pending queue, and if a
comment contains any words from a list of likely spam words, like ‘ambien’ or ‘oakley’,
it takes care of them by deleting them or marking them as spam. Here’s the full PasteBin of the script, and as usual, it’s also included in full at the end of this post.
Nothing too tricky, let’s get started!

So first up, the imports. Unlike most of my other tutorials here, we’re only using the
wordpress_xmlrpc library and the built-in Counter class.

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.comments import GetComments, DeleteComment, GetCommentStatusList

from collections import Counter

This is a list of words I’ve built based on the official WordPress spam list,
then augmented using the top_20_common_words function I’ve added at the
bottom of this post. As I mentioned earlier, any comment that contains any of
these words in its content, author, or author email strings will be
removed.

DANGER_WORDS = ["glasses", "longchamp", "oakleys", "oakley", "-online", "4u", "adipex", "advicer", "baccarrat", "blackjack", "bllogspot", "booker", "byob", "car-rental-e-site", "car-rentals-e-site", "carisoprodol", "casino", "casinos", "chatroom", "cialis", "coolcoolhu", "coolhu", "credit-card-debt", "credit-report-4u", "cwas", "cyclen", "cyclobenzaprine", "dating-e-site", "day-trading", "debt-consolidation", "debt-consolidation-consultant", "discreetordering", "duty-free", "dutyfree", "equityloans", "fioricet", "flowers-leading-site", "freenet-shopping", "freenet", "gambling-", "hair-loss", "health-insurancedeals-4u", "homeequityloans", "homefinance", "holdem", "holdempoker", "holdemsoftware", "holdemtexasturbowilson", "hotel-dealse-site", "hotele-site", "hotelse-site", "incest", "insurance-quotesdeals-4u", "insurancedeals-4u", "jrcreations", "levitra", "macinstruct", "mortgage-4-u", "mortgagequotes", "online-gambling", "onlinegambling-4u", "ottawavalleyag", "ownsthis", "palm-texas-holdem-game", "paxil", "penis", "pharmacy", "phentermine", "poker-chip", "poze", "pussy", "rental-car-e-site", "ringtones", "roulette", "shemale", "shoes", "slot-machine", "texas-holdem", "thorcarlson", "top-site", "top-e-site", "tramadol", "trim-spa", "ultram", "valeofglamorganconservatives", "viagra", "vioxx", "xanax", "zolus", ]
DRUG_WORDS = ["ambien</a>", "vuitton", "ambien", "viagra", "cialis", "drug", "hydrocodone", "klonopin", "pill", "withdrawal", "ativan", 'valium', 'clomid', 'rel="nofollow">buy', 'valium</a>', 'xanax', 'marcjacobs', 'watches', 'discount', 'tadalafil', 'premature', 'ejaculation']
RISKY_WORDS = ["sex", "free", "online"]
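
Before wiring anything up to the API, the core check is simple enough to sketch on its own. This is just an illustration (is_spammy is a helper name made up for this snippet, it’s not part of the script): a piece of text is flagged when any of its words contains one of the spam words as a substring.

```python
# Standalone sketch of the matching logic (is_spammy is a made-up
# helper name for illustration only): flag text when any of its words
# contains one of the spam words as a substring.
def is_spammy(text, spam_words):
    for word in text.split():
        if any(spam in word.lower() for spam in spam_words):
            return True
    return False

print(is_spammy("Buy cheap VIAGRA now", ["viagra", "ambien"]))  # True
print(is_spammy("Great post, thanks!", ["viagra", "ambien"]))   # False
```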

This is a function that calls the WordPress API to get comments back, either
from every post on the blog or just from the one specified by post_id. By
default I want all the comments from the entire blog, but you might not want
that if you’re just testing things out.

def get_all_comments_per_post(client, post_id=""):
    """
    returns a list of WordPressComment objects from the pending queue
    """
    data = {'post_id': post_id,
            'number': 2000,
            'status': 'hold'}

    resp = client.call(GetComments(data))

    return resp

This is a function that takes the comment_id of the comment you’d like to delete,
and then uses the client to make the DeleteComment API call to actually delete
it on the remote server. Pretty simple.

def delete_comment(client, comment_id):
    resp = client.call(DeleteComment(int(comment_id)))
    return resp

This is the function where most of the work happens. It takes a list of the
words we want to make sure the comments don’t contain, like ‘ambien’ and
‘viagra’, and then compares the comment content, author, and author email
against them. If they do contain those words, delete ‘em.

You’ll see that author_email and author_url are surrounded by brackets.
That’s because they’re plain strings, and you can’t simply add a string to a
list, but you can add one list to another. Remember that “any
string”.split() returns a list of strings.
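
A quick demonstration of the difference, since it trips people up: += extends a list with the items of whatever is on the right-hand side, and a bare string counts as a sequence of characters.

```python
# += extends a list with the *items* of the right-hand side.
words = []
words += "any string".split()        # a list of strings, extends cleanly
words += ["spammer@example.com"]     # wrap a lone string in a list
print(words)   # ['any', 'string', 'spammer@example.com']

# Without the brackets, a bare string is treated as a sequence of
# characters:
chars = []
chars += "abc"
print(chars)   # ['a', 'b', 'c']
```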

def delete_comments_containing(client, list_of_words):
    print 'searching for comments'
    comments = get_all_comments_per_post(client)

    print 'search complete'
    print 'count:', len(comments)
    for comment in comments:
        #list of words we're going to look for the spam words in:
        # the comment content, author, and author email
        words_to_match = []
        words_to_match += comment.content.split()
        words_to_match += comment.author.split()
        words_to_match += [comment.author_email]
        words_to_match += [comment.author_url]

        #for each word to match, if it contains any of the dangerous
        # words, delete the comment and break out of the for-loop, so we don't
        # try to delete the same comment a few times. This also saves a small
        # amount of time per comment so you're not checking the same comment
        # a ton of times. Note the substring test is l_word in word, so a
        # comment word like 'buy-viagra-now' still gets caught.
        for word in words_to_match:
            if any(l_word in word.lower() for l_word in list_of_words):
                print "DANGER FROM", comment.author.encode(errors="replace"), "DELETING with ID", comment.id
                delete_comment(client, comment.id)
                break

This is a function that you can use yourself to grab all the popular words in
the comments on your blog, to determine which ones are the most likely to be
spam. There are the obvious ones like viagra and ambien, but then also words
like glasses. Be careful you don’t cast too wide of a net, so as not to
delete the wrong comments. Luckily, in WordPress, you can always retrieve
the comments you’ve deleted from the Trash.

def top_20_common_words(comments):
    """
    bonus method for getting the most common words in the comments you pass
    into it, which is useful when you're trying to build a list of all the spammy
    words to auto parse out of your blog

    returns the 20 most common words that are longer than 3 characters
    """

    all_words = []
    for comment in comments:
        for word in comment.content.split():
            all_words.append(word)

    word_counter = Counter(all_words)

    return [word[0] for word in word_counter.most_common() if len(word[0]) >= 4][:20]

Finally, there’s the last bit of code that kicks off the whole shebang. Note
that you’ll need to have the XML-RPC API enabled in your WordPress settings, and
you’ll need to make sure that the username and password match yours.

wp = Client(r'http://www.YOURBLOGURL.com/xmlrpc.php', 'ADMIN_USERNAME', 'ADMIN_PASSWORD')
resp = wp.call(GetCommentStatusList())
print "All Possible Comment Statuses:", resp

wordlist = DANGER_WORDS + DRUG_WORDS
wordlist = [word.lower() for word in wordlist]

print 'Searching for Comments...'
delete_comments_containing(wp, wordlist)
print 'Search Complete!'

Here’s the script in its entirety:

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.comments import GetComments, DeleteComment, GetCommentStatusList

from collections import Counter

DANGER_WORDS = ["glasses", "longchamp", "oakleys", "oakley", "-online", "4u", "adipex", "advicer", "baccarrat", "blackjack", "bllogspot", "booker", "byob", "car-rental-e-site", "car-rentals-e-site", "carisoprodol", "casino", "casinos", "chatroom", "cialis", "coolcoolhu", "coolhu", "credit-card-debt", "credit-report-4u", "cwas", "cyclen", "cyclobenzaprine", "dating-e-site", "day-trading", "debt-consolidation", "debt-consolidation-consultant", "discreetordering", "duty-free", "dutyfree", "equityloans", "fioricet", "flowers-leading-site", "freenet-shopping", "freenet", "gambling-", "hair-loss", "health-insurancedeals-4u", "homeequityloans", "homefinance", "holdem", "holdempoker", "holdemsoftware", "holdemtexasturbowilson", "hotel-dealse-site", "hotele-site", "hotelse-site", "incest", "insurance-quotesdeals-4u", "insurancedeals-4u", "jrcreations", "levitra", "macinstruct", "mortgage-4-u", "mortgagequotes", "online-gambling", "onlinegambling-4u", "ottawavalleyag", "ownsthis", "palm-texas-holdem-game", "paxil", "penis", "pharmacy", "phentermine", "poker-chip", "poze", "pussy", "rental-car-e-site", "ringtones", "roulette", "shemale", "shoes", "slot-machine", "texas-holdem", "thorcarlson", "top-site", "top-e-site", "tramadol", "trim-spa", "ultram", "valeofglamorganconservatives", "viagra", "vioxx", "xanax", "zolus", ]
DRUG_WORDS = ["ambien</a>", "vuitton", "ambien", "viagra", "cialis", "drug", "hydrocodone", "klonopin", "pill", "withdrawal", "ativan", 'valium', 'clomid', 'rel="nofollow">buy', 'valium</a>', 'xanax', 'marcjacobs', 'watches', 'discount', 'tadalafil', 'premature', 'ejaculation']
RISKY_WORDS = ["sex", "free", "online"]

def get_all_comments_per_post(client, post_id=""):
    """
    returns a list of WordPressComment objects from the pending queue
    """
    data = {'post_id': post_id,
            'number': 2000,
            'status': 'hold'}

    resp = client.call(GetComments(data))

    return resp

def delete_comment(client, comment_id):
    resp = client.call(DeleteComment(int(comment_id)))
    return resp

def delete_comments_containing(client, list_of_words):
    print 'searching for comments'
    comments = get_all_comments_per_post(client)

    print 'search complete'
    print 'count:', len(comments)
    for comment in comments:
        words = []

        words += comment.content.split()
        words += comment.author.split()
        words += [comment.author_email]
        words += [comment.author_url]
        for word in words:
            if any(l_word in word.lower() for l_word in list_of_words):
                print "DANGER FROM", comment.author.encode(errors="replace"), "DELETING with ID", comment.id
                delete_comment(client, comment.id)
                break

def top_20_common_words(comments):
    """
    bonus method for getting the most common words in the comments you pass
    into it, which is useful when you're trying to build a list of all the spammy
    words to auto parse out of your blog

    returns the 20 most common words that are longer than 3 characters
    """

    all_words = []
    for comment in comments:
        for word in comment.content.split():
            all_words.append(word)

    word_counter = Counter(all_words)

    return [word[0] for word in word_counter.most_common() if len(word[0]) >= 4][:20]

wp = Client(r'http://www.YOURBLOGURL.com/xmlrpc.php', 'ADMIN_USERNAME', 'ADMIN_PASSWORD')
resp = wp.call(GetCommentStatusList())
print "All Possible Comment Statuses:", resp

wordlist = DANGER_WORDS + DRUG_WORDS
wordlist = [word.lower() for word in wordlist]

print 'Searching for Comments...'
delete_comments_containing(wp, wordlist)
print 'Search Complete!'

How to compile Libtcod and C++, mirrored from Ysgard.net

Source: http://www.ysgard.net/2011/06/visual-studio-2010-c-express-and-libtcod/

So I was playing with the Doryen Library, aka libtcod, the other day.  What a fantastic little library.  I had just finished a small demo for displaying how influence maps work (more for my own education, as I’m a pretty inexperienced programmer and doing is how I learn) and was pretty pleased with the result.   I’m using Visual Studio Express because I use Windows and happen to like it a bit more than Code::Blocks.

Okay, so I don’t need that debug window anymore.  Go into the linker properties, specify the subsystem as Windows app, compile…. and BAM!

MSVCRTD.lib(crtexew.obj) : error LNK2019: unresolved external symbol _WinMain@16 referenced in function ___tmainCRTStartup
C:\Sigil\VS\InfluenceMap\Debug\InfluenceMap.exe : fatal error LNK1120: 1 unresolved externals

Hmm, okay what’s going on here?  Check the libtcod documentation… no, there’s nothing here about this.  Why is this happening?

The answer to this lies in the way the program is run.  Before, I was compiling the program as a console app.  For this, the standard C++ entry point, main, was sufficient.

int main(int argc, char** argv)

But the moment I specified the program as a Windows application, the rules changed.  To use the Windows api, I need a Windows entry point, which is defined like this:

int APIENTRY WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nCmdShow)

But the samples I’ve seen don’t define this entry point.  They define a standard main().  So what am I missing?  As it turns out, I’m missing SDL!

libtcod is an SDL library, and SDL has its own way of mapping main() to WinMain().  But in order to make this happen, you also need to link in SDL and pull in the SDL.h header.

Here’s how to set up a Visual Studio project with both libtcod and SDL.  You can then write and test your application using the console window, and when the time is right, switch to a native windows app without any issues!

Here’s the exact procedure I followed.

  1. Download and extract the libtcod library for Visual Studio to wherever you place your libraries.  In my case, I extracted it to C:\Sigil\VS\libs, which resulted in the folder C:\Sigil\VS\libs\libtcod-1.5.1.
  2. Download and extract the SDL development libraries for Visual Studio.  You want the latest one available, I used SDL-devel-1.2.14-VC8.zip.  I extracted it to C:\Sigil\VS\libs\SDL-1.2.14.
  3. Open Visual Studio Express.  Create an Empty project for your new libtcod application.
  4. Right click on your project in the Solution Explorer and select Properties.
  5. Go to Configuration Properties -> VC++ Directories.
  6. Click on Include Directories, and from the little drop-down box on the right, select <Edit…>
  7. Click on the little folder icon to add a new folder to your include directories, specify the directory for your libtcod library (for me, C:\Sigil\VS\libs\libtcod-1.5.1\include).
  8. Do the same for SDL (mine was C:\Sigil\VS\libs\SDL-1.2.14\include).
  9. Click OK.
  10. Now do the same for the Library Directories, specifying the folders for your libtcod and SDL lib directories.
  11. Now go to Configuration Properties -> Linker -> Input.  Click on Additional Dependencies, click on the drop-down box, select <Edit…>.  Add the following dependencies, one per line: libtcod-VS.lib, SDL.lib, SDLmain.lib.  Click on OK.
  12. There’s one final step left.  You need to pull in the main SDL header file.  At the top of your main file, add #include <SDL.h>

You’re now set!  Go ahead and develop your application as normal.  Whenever you’re ready (or just want to test it) you can set your linker to produce a native Windows application.  To do this, you need to go into the Properties windows again, and select Configuration Properties -> Linker -> System.  In the right-hand pane, click on SubSystem and select Windows (/SUBSYSTEM:WINDOWS) from the drop-down menu.

Build your application again, and you should now be enjoying some console-free Windows goodness!

Imgur API part 3: OAuth2

It’s been a while since I wrote the last one, but a user on reddit asked how to get OAuth working with Imgur, and I couldn’t resist giving it one more shot. Thanks to my recent experience with the Netflix API, even though it was in C# rather than in Python, I was able to wrangle a quick and easy script for getting started.

Here’s the PasteBin with the complete code, though it’s also shown at the bottom of the page, as usual. (more…)

How to sort 56k complex objects quickly.

This is not something I’ll go in depth on, because the actual implementation
will probably change for your project, and I just want to get this out there
quickly.

So my issue was that I was trying to sort 56,000 objects quickly, for the sake
of pagination, using LINQ’s .Take and .Skip methods. Say I wanted the 30th page
of items with 15 items per page: I’d .Skip the first 29 pages’ worth of items
(29 * 15) and then .Take the next 15 items. It’d look like

db.Table.Skip(435).Take(15).ToList();

but I was running into issues when I tried to do that, because you need to use
OrderBy on the set of objects first in order to be able to skip the first ones,
since they’re not in any reliable order to .Skip in the first place.

db.Table.OrderBy(item => item.ID).Skip(435).Take(15).ToList();

I’m sure with simpler objects, just adding the OrderBy call would be enough
to do the trick, but my objects were a wrapper around 4 or 5 other objects.
Definitely not a good idea, but it’s what I’ve got! So anyway, the issue was
that it was taking over 3 seconds for just 56k objects. I tried not only
OrderBy, but also .Take’ing the first 30 pages and .GetRange’ing the last
page’s worth. That was actually a small improvement initially, since I only
ever tested it on the first page, which meant there were only 30 or so objects
being pulled and dealt with.

But as you might imagine, by the time I got to page 1000 there was naturally a
Timeout exception, as it was trying to deal with 15,000 items in memory. So
for this very last thing I tried, I didn’t really expect it to work for me,
since I was already sorting by ID:

I .Select’ed only the IDs for the objects I wanted to pull, rather than the
entire objects. Simple, right? What that means is that once I was ready to list
only the 15 items I wanted to show, I didn’t have to OrderBy all the objects and
then Skip and Take from that; I only had to deal with simple integers instead,
and then find the associated objects later with a line that looks something like
the following, with a healthy mix of .Where and .Any:

db.Table.Where(item =>
        sorted_item_ids.Any(item_id =>
            item_id == item.deeper_object.item_id))
    .ToList();

In the likely case that that’s not as clear as it could be: it’s looking to
match any objects from db.Table against any of the item_ids from the
sorted_item_ids list, which is an array of ints, and then finally turns the
resulting IQueryable into a List<table_object>.
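
The same pattern is easy to sketch outside of LINQ. Here’s a rough Python version (made-up in-memory data, purely to illustrate the idea): sort and slice only the lightweight IDs, then pull the full objects for just the one page you’re displaying.

```python
# Rough sketch of the pattern with made-up in-memory data: paginate by
# sorting and slicing cheap IDs first, then fetch full objects only
# for the page being displayed.
records = [{'id': i, 'payload': 'x' * 100} for i in range(56000)]

page, per_page = 30, 15
sorted_ids = sorted(r['id'] for r in records)   # cheap: just ints
page_ids = set(sorted_ids[(page - 1) * per_page : page * per_page])

# the equivalent of the .Where(...Any(...)) call: keep only the
# records whose id landed in the current page
page_items = [r for r in records if r['id'] in page_ids]
print(len(page_items))   # 15
```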

And there you have it! I’m not sure this post makes any sense as I wrote it
pretty quickly, so maybe I’ll come back to it later and clean it up. (Fat
chance, right?)

Netflix Catalog January update

Took a few weeks off, first for school, and then for Christmas. Back only a few days, and I’ve learned the basics about Cookies and Filters in ASP.NET MVC. They’re both pretty simple, even more so than I thought. Instead of rewriting a lot of the stuff I read, I’ll just link to the StackOverflow answers that I used directly.

For the cookies, it’s here, where it shows both how to add and delete a cookie (by setting its Expiry to yesterday). I thought that was pretty neat. Basically, all it is is a number that the browser stores, and then the server does something with it.

Which brings me to Filters, with an SO answer in two parts: here and the official docs, which are surprisingly handy this time. Filters just run before and after certain events, like Authorization, Actions, Results and Exceptions. I’ve only played with “OnResultExecuting”, which runs before the ViewResult has finished being processed, so if I change the ViewBag in the filter, the returned view will contain the modified data.

Other than that, I’ve run into a strange issue which I’m really hoping to solve: the catalog index I got from the Netflix API is incomplete! It’s got about 56k titles in there, individual shows included as separate entries (something I still have to sort out), and I can’t find certain titles, like Dexter and The 4400, which are both available for streaming on Netflix.com. I’ve made a post asking for clarification here, but looking at the post, it seems like I’ll need to cross-post into the “Help me” subforum rather than the “API forum”.

My buddy is still working on the Rotten Tomatoes thing, which has taken a few months now. He hasn’t made much progress as far as entering things into the database goes, but then I realized that without the huge /catalog/titles/streaming resource, I’d probably be in the same spot he is. I’ll have to remind him about the OMDB tool, and how we can just pull the RottenTomatoes data from there instead. Then it’ll be a matter of associating the Netflix movie with the RottenTomatoes movie. That’ll be tricky.

Anyway, I wanted to make a post here because it’s been a while, and it doesn’t get any easier to write these. And although I now manage to get about 1000 people to the site a month, I realized 75%+ of them are here for the tutorials I’ve written. I really should keep working on those… If you ever have something you’d like to learn about, please hit me up!

You should follow me on twitter @tankorsmash to hear me complain about the official docs, or how All That Remains is dead