SIGPIPE 13

Blog Spam Filtering Ideas

August 11th, 2009

I have previously detailed how I fight comment spam using a JavaScript challenge.

I host two blogs, a wiki, and a ticket system, all targets for spam, so I have since generalized the system: mod_rewrite redirects every POST without a cookie to a page which uses JavaScript to set this cookie and resubmit the request (which mod_rewrite then no longer catches, because the cookie is set). This means “blocking” spam doesn’t require a plug-in written specifically for the particular web application.
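
The routing decision described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual mod_rewrite configuration; the cookie name `js_challenge` and the path `/challenge.html` are made-up placeholders.

```python
from http.cookies import SimpleCookie

# Both names are hypothetical; the real setup lives in mod_rewrite rules.
CHALLENGE_COOKIE = "js_challenge"
CHALLENGE_PAGE = "/challenge.html"


def route_request(method: str, path: str, cookie_header: str) -> str:
    """Return the path that should handle this request.

    Only POSTs are gated. A POST that carries the challenge cookie goes
    through untouched; one without it is sent to the page whose
    JavaScript sets the cookie and resubmits the request.
    """
    if method.upper() != "POST":
        return path

    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    if CHALLENGE_COOKIE in cookies:
        return path

    return CHALLENGE_PAGE
```

Since the check only looks at the request method and a cookie, it works the same for the blogs, the wiki, and the ticket system without knowing anything about them.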

Despite this JS challenge some spam still gets through, and that’s what this post is about.

Spam Not Caught

Until recently I simply deleted all spam that fooled the JS challenge, though I did check the logs for some of it to look for patterns. For a long time it was hard to find any, since:

  1. The log entries look like those of a human visitor (and it might be one), e.g. loading all images and CSS and generally taking a few minutes from the first hit to the POST.
  2. The comment says “thanks for this article”, “very interesting”, or something along those lines. I.e. just one or two lines of praise (and often slight variations).
  3. Even the URL provided (for author) can look very non-spammy (including the landing page).

Patterns

Given the above, I started to consider other things that could be used for filtering and arrived at the following list:

  1. While the comment has no spammy content, it also has no actual content, so comparing it to legit content (the post + the previous comments) might provide a score for how likely it is to be a valid comment.
  2. The user agent generally includes Windows, yet all the cool kids are on Mac :)
  3. IP is often from Central America, sometimes Eastern Europe.
  4. URL referrer is often Google (searching for “blog” or similar).
  5. The comment is often added to an old post. While legit comments are also added to old posts, those tend to be long.
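
To make a few of these signals concrete, here is a rough Python sketch that combines items 1, 4, and 5 into a simple suspicion count. Everything in it is an assumption for illustration: the stop-word list, the `q` search parameter, the generic search terms, and all thresholds (0.2 overlap, 365 days, 20 words) are guesses that have not been tested against real spam.

```python
import re
from urllib.parse import parse_qs, urlparse

# Illustrative values only; none of these have been tuned on a corpus.
STOPWORDS = {"the", "and", "for", "this", "that", "with", "very",
             "are", "was", "you", "have", "not", "but"}
GENERIC_TERMS = {"blog", "blogs"}


def words(text: str) -> set:
    """Lower-cased words longer than two characters, minus stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in tokens if len(w) > 2 and w not in STOPWORDS}


def overlap(comment: str, context: str) -> float:
    """Fraction of the comment's words that also occur in the context."""
    comment_words = words(comment)
    if not comment_words:
        return 0.0
    return len(comment_words & words(context)) / len(comment_words)


def search_terms(referrer: str) -> set:
    """Words of the `q` query parameter of the referrer, if it has one."""
    query = parse_qs(urlparse(referrer).query)
    return set(" ".join(query.get("q", [])).lower().split())


def suspicion(comment: str, context: str, referrer: str = "",
              post_age_days: int = 0) -> int:
    """Count how many of the three signals fire (0 to 3)."""
    score = 0
    if overlap(comment, context) < 0.2:        # item 1: no shared content
        score += 1
    if search_terms(referrer) & GENERIC_TERMS:  # item 4: generic search
        score += 1
    if post_age_days > 365 and len(comment.split()) < 20:  # item 5
        score += 1
    return score
```

Here `context` would be the text of the post plus the previous comments. A count like this is probably more useful for ranking comments for manual review than for rejecting anything outright.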

As mentioned, so far I have just thrown spam comments away, so I don’t have a corpus to test the above on; take it as anecdotal only.

I plan to archive all spam from this point on so that I can later experiment with filters based on the above list. My other project might not allow me to get around to experimenting with this anytime soon, but perhaps someone else will find inspiration in the above.

[by Allan Odgaard]


7 Responses to “Blog Spam Filtering Ideas”

  1. Brett Says:
    August 11th, 2009 at 21:50

    Why not use something like DISQUS?

  2. Allan Odgaard Says:
    August 11th, 2009 at 22:03

    Where’s the fun in that? ;)

    Though spam isn’t that big a problem (maybe a few comments per week) and I definitely like to control “my” content. For example, I may want to merge this and the main blog, and things like that get harder when you “outsource”.

  3. Brad Fults Says:
    August 12th, 2009 at 00:24

    I like the idea of comparing the deviously spammy comments to longer, known ham comments. You might achieve a threshold where even non-spam, but otherwise vacuous comments are rejected. I wouldn't be sad to miss those.

    Looking at the referring URL also seems promising if they include search terms that you can blacklist.

    The other [perhaps obvious] question: why not use Akismet?

  4. Allan Odgaard Says:
    August 12th, 2009 at 08:46

    Akismet is nice but no silver bullet. It also has a problem with these types of spam, and it can flag false positives (which, for me, makes it a bad choice as the only filter, since I am not interested in manually checking thousands of flagged comments for false positives).

  5. Elio Grieco Says:
    August 18th, 2009 at 15:48

    I have been thinking for a while that the easiest way to combat spam with current technologies would be a three layer approach.

    1. Spell check
    2. Grammar check
    3. Bayesian filter

    If the comment/email has too many spelling or grammar errors then you probably don't want the content anyway. If it is well formed then a Bayesian filter should provide excellent filtering after a training period.

    As always, allowing users to register and thus white list certain posts can help ensure that any false positives likely won't be high value from trusted users.

  6. Jonathan Harlap Says:
    August 28th, 2009 at 09:41

    Elio: The issue with your layered approach is that it assumes that all posters are fluent English writers. Not to be too denigrating, but have you actually seen the quality of writing prevalent amongst programmers? Even many native English speaking programmers I've had the pleasure of working with or teaching seem to have lost the ability to write coherent and grammatically correct sentences ;)

    I think Brad's idea, even if it loses meaningless comments, is great. Obviously, how desirable this is depends on the blogger's goal with his blog – if he aims to build a community of fawning followers, of which Allan seems to have many, then he might not wish to discard the vapid "we love you Allan!" comments :P

  7. Daniel Samuels Says:
    October 13th, 2009 at 09:25

    Try a hidden email field: spambots will fill it in, real people won't. Although if humans are the ones adding the comments, it won't work.
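
The hidden-field idea from the last comment could look roughly like this on the server side. It is a sketch only: the field name `email_confirm` is hypothetical, and the form is assumed to hide that field from human visitors with CSS.

```python
# Hypothetical name of a form field that is hidden with CSS, so a human
# never sees it and submits it empty.
HONEYPOT_FIELD = "email_confirm"


def looks_like_bot(form: dict) -> bool:
    """True when the hidden field came back with a non-blank value."""
    return bool(form.get(HONEYPOT_FIELD, "").strip())
```

As the comment itself points out, this only catches automated form fillers; a person pasting in spam by hand leaves the hidden field empty.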

