Blog Spam Filtering Ideas
I host two blogs, a wiki, and a ticket system, all targets for spam, so I have since generalized the system by using
mod_rewrite due to the cookie being set). This means “blocking” spam doesn’t require a plug-in written specifically for the particular web application.
Despite this JS challenge, some spam still gets through, and that is what this post is about.
Spam Not Caught
Until recently I deleted all spam that fooled the JS challenge. I did look through the logs for some of it, hoping to spot a pattern, but one was hard to find since:
- The log entries look like those of a human visitor (and it might be one), e.g. loading all images and CSS and generally taking a few minutes from first hit to the POST.
- The comment says “thanks for this article”, “very interesting”, or something along those lines. I.e. just one or two lines of praise (and often slight variations).
- Even the URL provided (for author) can look very non-spammy (including the landing page).
Given the above, I started to consider other things that could be used for filtering and arrived at the following list:
- While the comment has no spammy content, it also has no actual content, so comparing it to legit content (the post + previous comments) might provide a score for how likely it is to be a valid comment.
- The user agent generally includes Windows, yet all the cool kids are on Mac :)
- IP is often from Central America, sometimes Eastern Europe.
- URL referrer is often Google (searching for “blog” or similar).
- The comment is often added to an old post. While legit comments are also added to old posts, those tend to be long.
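To make the list a bit more concrete, here is a minimal sketch in Python of how a few of these signals could be combined into a score. Everything in it is an assumption made for illustration: the function names, the tiny stop-word list, the weights, and the thresholds (365 days, 20 words) are hypothetical and not tuned on any real data. It compares the comment's words to the post plus earlier comments using cosine similarity, then adds a penalty for a short comment on an old post and for a Google referrer.

```python
import re
from collections import Counter
from math import sqrt

# Small, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "for", "this", "that", "on", "with", "i", "you", "very", "thanks"}


def tokenize(text):
    """Lowercase the text and return its words, minus stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower())
            if w not in STOP_WORDS]


def cosine_similarity(a, b):
    """Cosine similarity between two bags of words (0.0 if either is empty)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def spam_score(comment, post_and_comments, post_age_days, referrer=""):
    """Return a score in [0, 1]; higher means more spam-like.

    The weights (0.5, 0.3, 0.2) are guesses, not measured values.
    """
    words = tokenize(comment)
    similarity = cosine_similarity(words, tokenize(post_and_comments))
    score = 0.5 * (1.0 - similarity)      # little overlap with legit content
    if post_age_days > 365 and len(words) < 20:
        score += 0.3                      # short comment on an old post
    if "google." in referrer.lower():
        score += 0.2                      # arrived via a Google search
    return min(score, 1.0)
```

A comment like “thanks for this article, very interesting” shares no words with a post about mod_rewrite and cookies, so it should score higher than a comment that actually discusses the post. User agent and IP origin are left out here on purpose, since they seem too weak to use without data.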
As said above, I have so far just thrown spam comments away, so I have no corpus to test the above on; take it as anecdotal only.
I plan to archive all spam from this point on so that I can later experiment with filters based on the above list. Though my other project might not allow me to get around to experimenting with this anytime soon, perhaps someone else will find inspiration in the above.
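For anyone wanting to build such a corpus, one simple option is to append each comment together with its request metadata to a JSON Lines file. The sketch below is only one way to do it; the function names, the field names, and the file layout are all hypothetical, chosen to match the signals listed above.

```python
import json
import time


def archive_comment(path, comment, user_agent, ip, referrer,
                    post_age_days, is_spam):
    """Append one comment and its request metadata to `path` as a JSON line."""
    record = {
        "archived_at": time.time(),   # Unix timestamp of archiving
        "comment": comment,
        "user_agent": user_agent,
        "ip": ip,
        "referrer": referrer,
        "post_age_days": post_age_days,
        "is_spam": is_spam,           # manual label, to test filters against
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_archive(path):
    """Read all archived records back as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Keeping the legit comments in the same file, labeled as such, would give any later filter something to be compared against.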