SIGPIPE 13

Programming, automation, algorithms, macOS, and more.

Fighting Comment Spam

Update 2007-07-17: Since I installed the JS challenge almost two years ago, it has blocked 83,837 POSTs. Roughly a dozen spam POSTs did defeat the challenge. Judging by the access log, these seem to come from actual humans (the initial hit has a Google referrer, all resources (CSS and images) are fetched, and there is a delay between the last GET and the POST), but it could also be a cleverly scripted browser (I’m not sure of the “economy” of either, though).

Recently I’ve received a lot of comment spam: fake comments posted to a blog or wiki (for me once every hour) with the purpose of increasing the page rank of a website.

Looking at the comment spam I have received, I see that more than 90% of the IP addresses are unique (infected Windows machines used as proxies?), so as a challenge I decided to run SHA-1 on the visitor’s IP (plus a constant) and ask for the result back when the visitor submits the form.

So to start we need to generate the value pair:

<?php

    $ip = $_SERVER['REMOTE_ADDR'];
    $data = sha1($ip . "secret");
    $pair = sscanf($data, "%20s%20s");

?>

The $pair variable is now an array with two strings derived from the visitor’s IP. We’ll use them as name and value in the submission, but set the value using JavaScript. In the simplest case that’d be:

<script>
    function challenge () {
        var elm = document.getElementById("response");
        elm.value = "<?= $pair[1] ?>";
    }
</script>

<form … onSubmit="challenge()">
    <input id="response" type="hidden"
        name="<?= $pair[0] ?>" value="0">
</form>
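The server side of the check is not shown here. A minimal sketch of what it might look like follows, written in JavaScript (Node.js) rather than PHP so that it matches the client script. The names `SECRET`, `challengePair`, and `isValidSubmission` are hypothetical, and `fields` stands in for the parsed POST body.

```javascript
const crypto = require("node:crypto");

// Assumption: the same constant the page used when it rendered the form.
const SECRET = "secret";

// Derive the name/value pair: SHA-1 of IP + secret, split into two
// 20-character halves (mirrors the sscanf("%20s%20s") call above).
function challengePair(ip) {
    const hex = crypto.createHash("sha1").update(ip + SECRET).digest("hex");
    return [hex.slice(0, 20), hex.slice(20)];
}

// `fields` is a plain object holding the submitted form fields.
// A missing field yields undefined, which never equals the expected value.
function isValidSubmission(ip, fields) {
    const [name, value] = challengePair(ip);
    return fields[name] === value;
}
```

A submission without JavaScript arrives with the field still set to "0" and is rejected, as is a pair copied from a different IP address.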

This is laughably simple, but currently 100% effective. So let’s analyze it a bit (in the unlikely event that spammers target this particular challenge).

The name/value pair is constructed from the visitor’s IP plus a secret word from the server, so it cannot be pre-generated, and since it is unique to each visitor, a pair cannot be re-used across visitors. To re-use the pair, an individual visitor must first “calculate” it, but a visitor who can do the initial calculation can do them all, so I see no reason to give out different name/value pairs to the same visitor (e.g. based on time).

I find it unlikely that spammers will start to run JavaScript, since that would expose them to JavaScript traps, i.e. fake blogs with onSubmit scripts that perform infinite loops or similar to waste resources on the spammer’s machine.

With our current, very simple script, though, a bot could just look for a script matching the pattern and grab the value from it.
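To illustrate how little effort such harvesting would take, here is a sketch of a harvester. The regular expressions are assumptions tailored to the simple script above, `harvest` is a hypothetical name, and the hash halves in the sample page are shortened to placeholders.

```javascript
// Sample page as a bot would receive it (placeholder hash halves).
const page = `
<script>
    function challenge () {
        var elm = document.getElementById("response");
        elm.value = "bbbbbbbbbbbbbbbbbbbb";
    }
</script>
<form onSubmit="challenge()">
    <input id="response" type="hidden"
        name="aaaaaaaaaaaaaaaaaaaa" value="0">
</form>`;

// Pull the field name out of the markup and the value out of the script.
function harvest(html) {
    const value = html.match(/elm\.value\s*=\s*"([0-9a-f]+)"/);
    const name = html.match(/id="response"[^>]*\bname="([0-9a-f]+)"/);
    return value && name ? { name: name[1], value: value[1] } : null;
}

console.log(harvest(page));
```

Two pattern matches are enough because the element ID, the variable name, and the literal value are all fixed and visible in the page source.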

To counter this, we need to make the JavaScript more complex (so that the value can’t be harvested directly) and probably obfuscate it a bit by using random names instead of fixed names for the variables, the function name, and the ID of the input element. These can be totally random, since they are only used client-side.

It’s easier to make the script complex if, instead of treating the value as a string, we treat it as a number, since then we can use math to calculate it. First let us grab the first four bytes (two hexadecimal digits each) from the string, and use these:

<?php

    $num = sscanf($pair[1], "%02x%02x%02x%02x");

?>

We can now do something like:

<script>
    function challenge () {
        var value = <?= $num[0] ?> * <?= $num[1] ?> +
            <?= $num[2] ?> * <?= $num[3] ?>;

        document.getElementById("response").value = value;
    }
</script>
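Since the expression is rendered on the server, the server can compute the number it expects back in the same way. A sketch in JavaScript (Node.js), assuming the same secret and hash layout as the PHP above; `expectedResponse` is a hypothetical name.

```javascript
const crypto = require("node:crypto");

// Number the browser should send back for this IP.
function expectedResponse(ip, secret) {
    const hex = crypto.createHash("sha1").update(ip + secret).digest("hex");
    const valueHalf = hex.slice(20);                 // corresponds to $pair[1]
    // First four bytes of that half, like sscanf("%02x%02x%02x%02x").
    const num = [0, 2, 4, 6].map(i => parseInt(valueHalf.slice(i, i + 2), 16));
    return num[0] * num[1] + num[2] * num[3];
}
```

Form fields arrive as strings, so the comparison would be against `String(expectedResponse(ip, secret))`.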

It’s still relatively simple, so to introduce a random element in the calculation, we can rewrite each number as a multiple plus a remainder and use that instead. I.e. instead of putting 23 we can put 7 * 3 + 2, but choose the multiplier randomly and apply the same step recursively. Here’s a PHP function to do that:

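function scramble ($a)
{
    // Pick a random multiplier between 2 and 9.
    $factors = array(2, 3, 4, 5, 6, 7, 8, 9);
    $f = $factors[array_rand($factors)];
    // Split $a into a multiple of $f plus a remainder: $a = $m * $f + $r.
    $m = floor($a / $f);
    $r = $a - $m * $f;
    // Keep splitting the quotient until it is a single digit.
    $m = $m < 10 ? $m : scramble($m);
    return "($m * $f + $r)";
}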
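One way to check that the scrambled expression always evaluates back to the original number is to port the function to JavaScript and evaluate its output. The port below is a sketch, and `evaluate` is a hypothetical helper that only accepts digits, spaces, parentheses, `*` and `+`.

```javascript
const FACTORS = [2, 3, 4, 5, 6, 7, 8, 9];

// Port of the PHP scramble(): write a as (m * f + r) with a random f,
// recursing on m until it is a single digit.
function scramble(a) {
    const f = FACTORS[Math.floor(Math.random() * FACTORS.length)];
    const m = Math.floor(a / f);
    const r = a - m * f;
    const inner = m < 10 ? String(m) : scramble(m);
    return `(${inner} * ${f} + ${r})`;
}

// Evaluate an arithmetic expression after checking it contains
// nothing but digits, whitespace, parentheses, * and +.
function evaluate(expr) {
    if (!/^[\d\s()*+]+$/.test(expr)) throw new Error("unexpected input");
    return Function(`"use strict"; return ${expr};`)();
}

// Every byte value must survive the round trip.
for (let n = 0; n < 256; n++) {
    if (evaluate(scramble(n)) !== n) throw new Error(`mismatch for ${n}`);
}
```

The output differs on every call, but its value does not, which is the property the challenge relies on.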

If we then write:

var value =
    <?= scramble($num[0]) ?> *
    <?= scramble($num[1]) ?> +
    <?= scramble($num[2]) ?> *
    <?= scramble($num[3]) ?>;

The result could look like:

var value =
    (((1 * 9 + 1) * 3 + 0) * 4 + 3) *
    ((5 * 2 + 0) * 4 + 3) +
    ((1 * 8 + 4) * 2 + 0) *
    (((1 * 8 + 7) * 5 + 3) * 2 + 0);

The value can no longer be harvested directly, but one could grab the entire calculation and pipe it through bc or similar. To avoid this I’ve introduced two additional complications. The first is the use of temporary variables in the JavaScript, and the second is that instead of outputting $m * $f, I output $func($m), where $func is a randomly named JavaScript function which performs the actual multiplication.

That way, to harvest the value, one would have to deal with variables, functions, and math calculations. Basically one would have to write a language parser, and additionally recognize the randomized script, so as to run the parser only on that.
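The two complications can be sketched as a small generator. This is an illustration under assumptions, not the script actually used on this site: it is written in JavaScript instead of PHP, and `NAMES`, `shuffle`, and `generateChallenge` are hypothetical names.

```javascript
// Assumption: a small pool of identifiers (at least 14 are needed).
const NAMES = ["apus", "crux", "grus", "hydra", "lepus", "norma", "pavo",
    "scutum", "cetus", "corvus", "pictor", "pyxis", "cygnus", "perseus",
    "pegasus", "centaurus"];

// Fisher-Yates shuffle on a copy of the list.
function shuffle(list) {
    const a = list.slice();
    for (let i = a.length - 1; i > 0; i--) {
        const j = Math.floor(Math.random() * (i + 1));
        [a[i], a[j]] = [a[j], a[i]];
    }
    return a;
}

// nums: the four bytes; elementId: ID of the hidden input.
// Returns the name and source text of a randomized challenge function.
function generateChallenge(nums, elementId) {
    const names = shuffle(NAMES);
    const funcName = names.pop();
    const mul = {};                       // factor -> multiplier function name
    for (let f = 2; f <= 9; f++) mul[f] = names.pop();

    // Like scramble(), but m * f + r is emitted as r+func(m).
    function scramble(a) {
        const f = 2 + Math.floor(Math.random() * 8);
        const m = Math.floor(a / f), r = a - m * f;
        return `${r}+${mul[f]}(${m < 10 ? m : scramble(m)})`;
    }

    const vars = nums.map(n => ({ name: names.pop(), expr: scramble(n) }));
    const result = names.pop();
    const decls = vars.map(v => `${v.name} = ${v.expr}`).join(",\n        ");

    const lines = [`function ${funcName} () {`];
    for (let f = 2; f <= 9; f++) {
        lines.push(`    function ${mul[f]}(x) { return ${f}*x; }`);
    }
    lines.push(`    var ${decls};`);
    lines.push(`    var ${result} = ${vars[0].name} * ${vars[1].name} + ` +
        `${vars[2].name} * ${vars[3].name};`);
    lines.push(`    document.getElementById('${elementId}').value = ${result};`);
    lines.push("}");
    return { name: funcName, source: lines.join("\n") };
}

console.log(generateChallenge([123, 43, 24, 156], "response").source);
```

The form’s onSubmit handler would then call the function named in `name`, and the server would compare the submitted number with the one it computed itself.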

The script I ended up writing is conceptually what’s outlined above, but turned out a tad more complex (although I wanted simplicity). An example of a generated function is below. Instead of generating random names, I have an array with all the names of the constellations and start by doing a random permutation of this array.

function camelopardalis () {
    function microscopium(x) { return 5*x; }
    function scutum(x) { return 3*x; }
    function lepus(x) { return 6*x; }
    function pavo(x) { return 9*x; }
    function perseus(x) { return 4*x; }
    function norma(x) { return 2*x; }
    function chamaeleon(x) { return 7*x; }
    function centaurus(x) { return 8*x; }
    var apus = 42, hydra = 4+microscopium(norma(norma(8))),
        grus = 3+microscopium(2+lepus(2)),
        crux = perseus(2+scutum(1+scutum(6))),
        corvus = 3+microscopium(scutum(5+7)),
        cetus = 4+chamaeleon(1+9),
        pegasus = 1+lepus(5), delphinus = 2+7,
        pictor = 3+microscopium(6+chamaeleon(3));
    var cygnus = delphinus * pictor + crux * cetus + 
        pegasus * corvus + hydra * grus + 0;
    document.getElementById('pyxis').value = cygnus;
}

If you’re using WordPress, there is already a plugin (hashcash) which does something similar.
