SIGPIPE 13

Fighting Comment Spam

September 25th, 2005

Update 2007-07-17: Since I installed the JS challenge almost two years ago it has blocked 83,837 POSTs. Roughly a dozen spam POSTs did defeat the challenge. Judging from the access log, these do seem to come from actual humans (the initial hit has a Google referrer, all resources (CSS and images) are fetched, and there is a delay between the last GET and the POST), but it could also be a cleverly scripted browser (I'm not sure of the “economy” of either, though).

Recently I've received a lot of comment spam: fake comments posted to a blog or wiki (for me once every hour) with the purpose of increasing the page rank of a website.

Looking at the comment spam I have received, I see that more than 90% of the IP addresses are unique (infected Windows machines used as proxies?), so for the challenge I decided to run SHA-1 on the visitor's IP (plus a constant) and ask for that back when he submits the form.

So to start we need to generate the value pair:

<?php
 
    $ip   = $_SERVER['REMOTE_ADDR'];
    $data = sha1($ip . "secret");
    $pair = sscanf($data, "%20s%20s");
 
?>

The $pair variable is now an array with two strings derived from the visitor's IP, and we'll use them as name and value in the submission, but set them using JavaScript. In the simplest case that'd be:

<script>
    function challenge () {
        var elm = document.getElementById("response");
        elm.value = "<?= $pair[1] ?>";
    }
</script>
 
<form … onSubmit="challenge()">
    <input id="response" type="hidden"
        name="<?= $pair[0] ?>" value="0">
</form>

This is laughably simple, but currently 100% effective. So let's analyze it a bit (in the unlikely event that spammers target this particular challenge).

The name/value pair is constructed from the visitor's IP plus a secret word from the server, so there's no way to have it pre-generated, and since it's unique for each individual visitor, re-use of the name/value pair is not possible among visitors. For the individual visitor, to be able to re-use the name/value pair he must first “calculate” it, but if the visitor can do the initial calculation, he can do them all, so I see no reason to give out different name/value pairs to the same visitor (e.g. based on time).
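The post does not show the receiving end of the challenge. A minimal sketch of that check follows, written in JavaScript for Node.js rather than the PHP used above. It covers the simple string version only. The names `challengePair`, `isValidSubmission`, `SECRET` and the shape of `fields` (a plain object of POSTed form fields) are assumptions made for illustration, not part of the original setup.

```javascript
const crypto = require("node:crypto");

// Hypothetical secret; in practice it would live in server configuration.
const SECRET = "secret";

// Derive the field name and expected value from the visitor's IP,
// mirroring sha1($ip . "secret") split into two 20-character halves.
function challengePair(ip, secret = SECRET) {
    const digest = crypto.createHash("sha1").update(ip + secret).digest("hex");
    return { name: digest.slice(0, 20), value: digest.slice(20, 40) };
}

// Accept a submission only if the field named after the first half
// carries the second half. `fields` is a plain object of POSTed fields.
function isValidSubmission(fields, ip, secret = SECRET) {
    const pair = challengePair(ip, secret);
    return Object.prototype.hasOwnProperty.call(fields, pair.name) &&
        fields[pair.name] === pair.value;
}
```

A client that never runs the onSubmit handler posts the default value "0" and is rejected, and a pair captured from one IP does not validate for another.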

I find it unlikely that spammers will start to run JavaScript, since that would open up for JavaScript traps, i.e. creating fake blogs with onSubmit scripts that perform infinite loops or similar to waste resources on the spammer's machine.

Though with our current very simple script, there's the possibility to just check for a script matching the pattern, and grab the value from it.
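To make that weakness concrete, here is a small sketch of such a pattern match. The function name `harvestValue` and the sample value in `page` are made up for illustration; the regular expression only assumes the script has the shape of the simple version shown earlier.

```javascript
// Extract the string literal assigned to elm.value in the simple
// challenge script, without executing any JavaScript.
function harvestValue(html) {
    const match = /elm\.value\s*=\s*"([0-9a-f]+)"/.exec(html);
    return match ? match[1] : null;
}

// Sample page fragment with a made-up 20-character value.
const page = `<script>
    function challenge () {
        var elm = document.getElementById("response");
        elm.value = "0123456789abcdef0123";
    }
</script>`;

console.log(harvestValue(page));
```

Any script that keeps the value as a literal is open to this kind of harvesting.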

To counter this, we need to make the JavaScript more complex (so that the value can't be harvested directly) and probably obfuscate it a bit (by using random names instead of fixed names for variables, the function name, and the ID of the input element (these can be totally random, since these are only used client-side)).

It's easier to make the script complex if instead of treating the value as a string, we treat it as a number, since then we can use math to calculate it. First let us grab the first four (hexadecimal) numbers from the string, and use these:

<?php
 
    $num = sscanf($pair[1], "%02x%02x%02x%02x");
 
?>

We can now do something like:

<script>
    function challenge () {
        var value = <?= $num[0] ?> * <?= $num[1] ?> +
            <?= $num[2] ?> * <?= $num[3] ?>;
 
        document.getElementById("response").value = value;
    }
</script>

It's still relatively simple, so to introduce a random element in the calculation, we can break each number into a product plus a remainder and use that instead. I.e. instead of putting 23 we can put 2 * 10 + 3, but do the decomposition randomly and recursively. Here's a PHP function to do that:

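function scramble ($a)
{
    // Pick a random factor, then split $a into $m * $f + $r.
    $factors = array(2, 3, 4, 5, 6, 7, 8, 9);
    $f = $factors[array_rand($factors)];
    $m = floor($a / $f);
    $r = $a - $m * $f;
    // Recurse on the quotient until it is a single digit.
    $m = $m < 10 ? $m : scramble($m);
    return "($m * $f + $r)";
}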

If we then write:

var value =
    <?= scramble($num[0]) ?> *
    <?= scramble($num[1]) ?> +
    <?= scramble($num[2]) ?> *
    <?= scramble($num[3]) ?>;

The result could look like:

var value =
    (((1 * 9 + 1) * 3 + 0) * 4 + 3) *
    ((5 * 2 + 0) * 4 + 3) +
    ((1 * 8 + 4) * 2 + 0) *
    (((1 * 8 + 7) * 5 + 3) * 2 + 0);
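As a sanity check that every random decomposition still evaluates to the original number, here is a JavaScript port of scramble with a round-trip test. It is a sketch, not the code running on the server: the optional `random` parameter and the `evaluate` helper are additions made here so the behaviour can be tested.

```javascript
// Port of the PHP scramble(): rewrite a non-negative integer as a
// nested expression of the form (m * f + r) with a random f in 2..9.
function scramble(a, random = Math.random) {
    const f = 2 + Math.floor(random() * 8);   // random factor 2..9
    const m = Math.floor(a / f);              // quotient
    const r = a - m * f;                      // remainder, 0 <= r < f
    const inner = m < 10 ? String(m) : scramble(m, random);
    return `(${inner} * ${f} + ${r})`;
}

// Evaluate the generated arithmetic (digits, parentheses, * and + only).
function evaluate(expr) {
    return Function(`"use strict"; return ${expr};`)();
}

for (const n of [0, 23, 123, 255]) {
    const expr = scramble(n);
    console.log(n, "=", expr, "->", evaluate(expr));
}
```

The recursion always terminates because the quotient is at most half of the input, so it drops below 10 after a few steps.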

There's no way to harvest the value, but one could grab the entire calculation and pipe it through bc or similar. To avoid this I've introduced two additional complications. The first is the use of temporary variables in the JavaScript and the second is that instead of outputting $m * $f, I output $func($m) where $func is a randomly named JavaScript function which will perform the actual multiplication.

That way, to harvest the value, one would have to deal with variables, functions, and math calculations. Basically one would have to write a language parser, and additionally recognize the randomized script (so as to run the parser only on that).

The script I ended up writing is conceptually what's outlined above, but ended up a tad more complex (although I wanted simplicity). An example of the function generated is below. Instead of generating random names, I have an array with all the names of the constellations and start by doing a random permutation of this array.

function camelopardalis () {
    function microscopium(x) { return 5*x; }
    function scutum(x)       { return 3*x; }
    function lepus(x)        { return 6*x; }
    function pavo(x)         { return 9*x; }
    function perseus(x)      { return 4*x; }
    function norma(x)        { return 2*x; }
    function chamaeleon(x)   { return 7*x; }
    function centaurus(x)    { return 8*x; }
    var apus = 42, hydra = 4+microscopium(norma(norma(8))),
        grus = 3+microscopium(2+lepus(2)),
        crux = perseus(2+scutum(1+scutum(6))),
        corvus = 3+microscopium(scutum(5+7)),
        cetus = 4+chamaeleon(1+9),
        pegasus = 1+lepus(5), delphinus = 2+7,
        pictor = 3+microscopium(6+chamaeleon(3));
    var cygnus = delphinus * pictor + crux * cetus + 
        pegasus * corvus + hydra * grus + 0;
    document.getElementById('pyxis').value = cygnus;
}
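For readers who want to experiment, here is a rough sketch of a generator that produces scripts of this shape. It is not the generator used on this blog. It is written in JavaScript rather than PHP, the short `NAMES` pool stands in for the full constellation list, and `shuffled`, `buildChallenge` and the returned object layout are hypothetical names chosen for this example.

```javascript
// Hypothetical name pool; the post uses the full list of constellations.
const NAMES = ["apus", "cetus", "corvus", "crux", "cygnus", "grus", "hydra", "lepus",
    "lyra", "norma", "orion", "pavo", "pictor", "pyxis", "scutum", "vela"];

// Fisher-Yates shuffle on a copy of the array.
function shuffled(items, random = Math.random) {
    const a = items.slice();
    for (let i = a.length - 1; i > 0; i--) {
        const j = Math.floor(random() * (i + 1));
        [a[i], a[j]] = [a[j], a[i]];
    }
    return a;
}

// Build a challenge script for four byte values (0..255). Returns the
// script source, the entry function name, the input ID, and the value
// the server should expect back.
function buildChallenge(nums, random = Math.random) {
    const names = shuffled(NAMES, random);
    const entry = names.pop();
    const fieldId = names.pop();
    const mul = {};                           // factor -> wrapper function name
    for (let f = 2; f <= 9; f++) mul[f] = names.pop();

    // Like scramble(), but emits r+func(m) instead of (m * f + r).
    function scramble(a) {
        const f = 2 + Math.floor(random() * 8);
        const m = Math.floor(a / f);
        const r = a - m * f;
        const inner = m < 10 ? String(m) : scramble(m);
        return `${r}+${mul[f]}(${inner})`;
    }

    const vars = nums.map(() => names.pop()); // one temporary per number
    const result = names.pop();
    const lines = [`function ${entry}() {`];
    for (let f = 2; f <= 9; f++) {
        lines.push(`    function ${mul[f]}(x) { return ${f}*x; }`);
    }
    nums.forEach((n, i) => lines.push(`    var ${vars[i]} = ${scramble(n)};`));
    lines.push(`    var ${result} = ${vars[0]} * ${vars[1]} + ${vars[2]} * ${vars[3]};`);
    lines.push(`    document.getElementById('${fieldId}').value = ${result};`);
    lines.push("}");
    return {
        script: lines.join("\n"),
        entry,
        fieldId,
        expected: nums[0] * nums[1] + nums[2] * nums[3],
    };
}

console.log(buildChallenge([123, 43, 24, 156]).script);
```

The server would keep `expected` (or recompute it from the visitor's IP) and compare it with the submitted field, while `entry` goes into the form's onSubmit attribute and `fieldId` into the hidden input's id.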

If you're using WordPress, then there already is a plugin (hashcash) which does something similar.

[by Allan Odgaard]


25 Responses to “Fighting Comment Spam”

  1. Spam Says:
    September 30th, 2005 at 09:26

    Testing the anti-spam thing.

  2. Spam Says:
    September 30th, 2005 at 09:27

    Doesn't seem to work.

  3. Spam Says:
    September 30th, 2005 at 09:28

    What do I have to do to trigger the filter? Put a URL like this great one — http://spammers.org/ — in there?

  4. Allan Odgaard Says:
    September 30th, 2005 at 14:59

    As long as you have JavaScript enabled, you can post whatever you want.

    Blacklisting doesn't work because spammers just post some generic praising phrase and a link — so the only way to blacklist is by disallowing people to post links.

    The JS challenge seems to work very well (stopped >100 comment spams since I added it a few days ago)

  5. Test Says:
    May 23rd, 2006 at 17:20

    test

  6. Testy Says:
    May 31st, 2006 at 20:55

    Does it work :(

  7. Allan Odgaard Says:
    May 31st, 2006 at 21:51

    I actually switched to Akismet on this blog since I wanted to test that out as well.

    But I run only my JS challenge on the macromates blog, and it has stopped 3,152 spam comments since I installed it (~13 per day.) I haven’t kept an exact count on how many spams made it through the JS challenge, but I think in the range of 8-20 (i.e. not many.)

    So it has turned out to be pretty effective. As for the comments that did make it through, judging from the Apache access log, these are either scripted browsers or manual work. One even had a referrer from Google on the first hit to the blog (indicating manual work.)

  8. JuandeSant Says:
    June 13th, 2006 at 23:59

    I'm using Spam Karma 2 for my personal weblog, and I'm really, really content with it. I guess my spam level is nowhere near yours, but I no longer have to babysit the comments. And it can offer an optional easy CAPTCHA for comments that fall just below the threshold.

  9. Joakim Nygård Says:
    July 14th, 2006 at 22:49

    This is most excellent. So very simple, yet effective, and without bothering the poster at all.

  10. Grahame Bowland Says:
    July 22nd, 2006 at 18:38

    I was thinking about how to make this proof against someone implementing an absolutely trivial Javascript interpreter. You could easily make the calculation depend on values accessed in the Javascript code through the DOM. Possibly with nesting, names, accessing through the various globals arrays.. etc. Of course, your whole scheme is defeated if someone just scripts WebKit (or whatever) to submit the form :-)

  11. Allan Odgaard Says:
    July 22nd, 2006 at 19:31

    Grahame: Just FYI here is the source for the generated script: http://pastie.caboo.se/5611.

    While it’s not that complex to parse such script, it does require some work to write an interpreter for it.

    I am not exactly sure what DOM stuff you want to rely on. The script (for me at least) needs to be independent of page content as I add it to different pages (from different webapps.)

    As for the spammer scripting WebKit: it’s a possibility, but as I mentioned it will open up for exploits (targeting the spammer.)

    Imagine for example a “Click here if you want to be blacklisted” link, hidden using CSS, whose destination page uses XmlHttpRequest to send the client's IP address to the server for blacklisting.

    Imagine even such blacklists were shared :)

  12. Lee Rogers Says:
    September 15th, 2006 at 11:31

    Are you sure 23063 about this?!?

  13. Brian Says:
    July 16th, 2007 at 09:22

    test

  14. Eoghan Murray Says:
    July 16th, 2007 at 09:31

    Nice work.

    This will fail if the spam bot is actually a browser, e.g. IE automated through COM.

  15. Haacked Says:
    July 16th, 2007 at 16:25

    I did something like this a while ago: http://haacked.com/archive/2006/09/26/Lightweight_Invisible_CAPTCHA_Validator_Control.aspx

    Instead of a hash of the IP, I just had javascript solve a simple addition problem. For maximum usability, if you have javascript turned off, you can see the actual math problem and solve it yourself.

    Unless you're Hotmail, it's not worth the spammer's time to defeat such a simple test. It doesn't fit into the economy of scale.

    Of course, this does nothing to thwart PingBack/TrackBack spam. ;)

  16. Tiago Serafim Says:
    July 16th, 2007 at 17:34

    This approach shouldn't work if you're using wp-cache, since it'll cache the unique string.

  17. wr Says:
    July 17th, 2007 at 00:53

    testing

  18. moof Says:
    July 21st, 2007 at 02:15

    testing…

  19. Sven Fuchs Says:
    September 23rd, 2007 at 11:29

    Actually I've been pretty happy with a pure CSS-based solution for months now. This is packed as a Mephisto plugin but of course the technique can be applied to any blogging engine.

    Mephisto Plugin: Inverse Captcha for Comments

  20. moof Says:
    September 26th, 2007 at 20:56

    testing

  21. Jordan Sissel Says:
    January 7th, 2008 at 06:58

    I've been using a javascript spam deterrent on my own blog for more than a year, and I've been happier ever since. I've only ever had about 3 spams get through, and 33779 rejections. The only spams that got through I'm reasonably certain were done by humans, or by bots I've never seen before. I'm pretty sure, though, that they were humans, or we'd see more javascript spam deterrents failing.

  22. Anonymous Says:
    April 27th, 2008 at 23:28

    gffgf

  23. SIGPIPE 13 » Blog Archive » Blog Spam Filtering Ideas Says:
    August 11th, 2009 at 11:50

    […] have previously detailed how I fight comment spam using a JavaScript […]

  24. Anonymous Says:
    September 24th, 2009 at 15:00

    testing

  25. f Says:
    July 10th, 2013 at 21:07

    f

