SIGPIPE 13

Cuckoo Hashing

August 18th, 2009

The Achilles’ heel of hashing is collisions: when we want to insert a new value into the hash table and the slot is already filled, we use a fallback strategy to find another slot, for example linear probing.

The fallback strategy can affect lookup time since we need to do the same probing when a lookup results in an entry with the wrong key, turning the nice O(1) time complexity into O(n) in the worst case.
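To make the probing concrete, here is a minimal linear-probing sketch in Python (illustrative only: fixed capacity, no deletion, no resizing). The point to note is that lookup has to retrace exactly the probe sequence used by insert, which in the worst case means scanning the whole table.

    # Minimal linear-probing sketch (illustrative only): fixed capacity,
    # no resizing and no deletion, so the table must never fill up.
    class LinearProbingTable:
        def __init__(self, capacity=8):
            self.slots = [None] * capacity          # each slot is (key, value) or None

        def _probe(self, key):
            # Yield slot indices in the order both insert and lookup must follow.
            i = hash(key) % len(self.slots)
            for _ in range(len(self.slots)):
                yield i
                i = (i + 1) % len(self.slots)

        def insert(self, key, value):
            for i in self._probe(key):
                if self.slots[i] is None or self.slots[i][0] == key:
                    self.slots[i] = (key, value)
                    return
            raise RuntimeError("table full")

        def lookup(self, key):
            # Same probe sequence as insert: in the worst case this walks the
            # entire table, which is where the O(n) bound comes from.
            for i in self._probe(key):
                if self.slots[i] is None:           # hit an empty slot: key absent
                    return None
                if self.slots[i][0] == key:
                    return self.slots[i][1]
            return None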

Of course the O(n) time is pessimistic, as we rehash to a larger table once the load reaches a certain threshold. Still, from a theoretical point of view an intriguing approach to handling collisions is cuckoo hashing, which guarantees O(1) lookup time (insertion can still be worse).

Quoting the Wikipedia page:

The basic idea is to use two hash functions instead of only one. This provides two possible locations in the hash table for each key.

When a new key is inserted, a greedy algorithm is used: The new key is inserted in one of its two possible locations, “kicking out”, that is, displacing, any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there, until a vacant position is found, or the procedure enters an infinite loop. In the latter case, the hash table is rebuilt in-place using new hash functions.

This means that no additional probing is required during lookup as an element will always be in one of the two slots given by the hash functions (if it is in the table).
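As a concrete illustration, here is a minimal cuckoo-hashing sketch in Python. Note the assumptions: the two hash functions are just salted variants of Python’s built-in hash (a real implementation would draw them from a universal family), and instead of rebuilding in place with new hash functions the sketch simply doubles the capacity when the displacement loop runs too long.

    # Minimal cuckoo-hashing sketch (illustrative only). Two tables, one hash
    # function per table; the "hash functions" are salted variants of Python's
    # built-in hash(), not a proper universal family.
    class CuckooTable:
        MAX_KICKS = 32                              # displacement limit before rebuilding

        def __init__(self, capacity=16):
            self.tables = [[None] * capacity, [None] * capacity]

        def _index(self, which, key):
            return hash((which, key)) % len(self.tables[which])

        def lookup(self, key):
            # O(1): the key can only ever live in one of these two slots.
            for which in (0, 1):
                entry = self.tables[which][self._index(which, key)]
                if entry is not None and entry[0] == key:
                    return entry[1]
            return None

        def insert(self, key, value):
            for which in (0, 1):                    # update in place if already present
                i = self._index(which, key)
                if self.tables[which][i] is not None and self.tables[which][i][0] == key:
                    self.tables[which][i] = (key, value)
                    return
            entry, which = (key, value), 0
            for _ in range(self.MAX_KICKS):
                i = self._index(which, entry[0])
                entry, self.tables[which][i] = self.tables[which][i], entry
                if entry is None:                   # landed in a vacant slot
                    return
                which = 1 - which                   # displaced key tries its other slot
            self._rebuild(entry)                    # too many kicks: grow and reinsert

        def _rebuild(self, pending):
            entries = [e for t in self.tables for e in t if e is not None] + [pending]
            capacity = 2 * len(self.tables[0])
            self.tables = [[None] * capacity, [None] * capacity]
            for k, v in entries:
                self.insert(k, v)

Lookup is two array probes and nothing more, which is exactly the O(1) guarantee; all the extra work has been shifted onto insert.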

In practice, linear probing with proper thresholds and a good hash function may perform better (due to locality of reference), and insertion using cuckoo hashing can be more expensive (as we do more memory writes on collisions than linear probing). Still, I love the theoretical property of this collision strategy :)

[by Allan Odgaard]


6 Responses to “Cuckoo Hashing”

  1. Martin Pfeiffer Says:
    August 19th, 2009 at 18:12

    Use bucket hashing instead; it is one of the best strategies for hashing because it uses dynamic resizing of the table (extending the table because of a collision in O(1)).

  2. Allan Odgaard Says:
    August 19th, 2009 at 18:39

    Martin: I am not sure what “bucket hashing” refers to, but a good implementation of pretty much all the schemes will dynamically resize the table, despite that (quoting Wikipedia):

    […] if 2500 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday paradox there is a 95% chance of at least two of the keys being hashed to the same slot

    So you will still get collisions.

    As for what is the best strategy, the Rubinius contributors did experiments and documented their results.

  3. Martin Pfeiffer Says:
    August 21st, 2009 at 22:27

    Sure you get collisions; that is the problem with every hash strategy. But the question is how long the chain will be when you do get a collision. To get the best worst-case runtime for hashing, a good approach is to use red-black trees to manage the buckets, because they have O(log n) insert and delete in the worst case. So if you hit the worst-case scenario for hashing (e.g. you get 1 as the hash result for every item you insert into the table) you can still walk the chain in O(log n), and in a realistic scenario with a good hash function you get O(1) for every operation. The realistic scenario is much more likely because you do not have only a million buckets, you have 2^n buckets for an n-bit hash function. So with a good hash function you should get amazing results. Sure, this method is not easy to implement, because insert and delete for red-black trees are not trivial and somewhat overkill for plain hashing, but from the theoretical point of view it should be the optimum in runtime.
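    For illustration, a rough separate-chaining (“bucket hashing”) sketch in Python; each bucket is a plain list here, and the red-black tree described above would take the list’s place to bound worst-case per-bucket operations at O(log n).

        # Rough separate-chaining sketch (illustrative only). Each bucket is a
        # plain Python list; in the scheme described above, each bucket would be
        # a red-black tree so even a fully degenerate chain costs O(log n).
        class ChainedTable:
            def __init__(self, capacity=16):
                self.buckets = [[] for _ in range(capacity)]

            def insert(self, key, value):
                bucket = self.buckets[hash(key) % len(self.buckets)]
                for i, (k, _) in enumerate(bucket):
                    if k == key:
                        bucket[i] = (key, value)    # replace an existing entry
                        return
                bucket.append((key, value))         # O(chain length); O(log n) with a tree

            def lookup(self, key):
                bucket = self.buckets[hash(key) % len(self.buckets)]
                for k, v in bucket:
                    if k == key:
                        return v
                return None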

    The problem with cuckoo hashing is that it depends on the load factor of the table. If the load factor goes over 50% the insert is no longer O(1); it is O(n) in the worst case. The problem is that if you have a very large table (e.g. 10 million keys) and must rebuild it, this takes time… Let's say you use the hash table in a web application for storing some data: maybe 1 million users get their results very fast, but the next one causes the table to rebuild, and the web application freezes for seconds or minutes.

    So to come to the point: with cuckoo hashing you go very fast if everything is OK, but from time to time there is a very large rebuild of the table. Bucket hashing is maybe slightly slower (I don't know exactly), but its time consumption is very constant over time.

    So it is as with every algorithm or data structure: it depends on the situation in which you want to use it, and the theoretical point of view always looks at infinity, and I don't think you want to build a hash table of that size :>

    I hope you get my point of view despite my horrible English and maybe found it interesting or useful. So have a nice weekend and keep up the good work.

  4. Allan Odgaard Says:
    August 22nd, 2009 at 09:07

    Martin: I think I get your point, but also, I think you may have read my post rather lightly ;)

    Hash algorithms have expected O(1) time for insert and lookup and worst-case O(n). You can make a hybrid (like you propose) and get that worst-case down to O(lg n) — though this a) is then no longer a pure hash algorithm, and b) still does not guarantee O(1) insert and lookup.

    Cuckoo hashing does guarantee O(1) lookup time and still has expected O(1) insert time (with O(n) as the worst case).

    This does not mean that cuckoo hashing is a better choice (in practice); it is a theoretical improvement which I found interesting and which motivated me to write this post! It was not a plug to sell cuckoo hashing; as I already say in the post, linear probing (with proper thresholds) is likely a better choice both for insert and lookup.

    Hope that makes my article’s point more clear.

  5. Joseph Lenox Says:
    August 14th, 2014 at 17:58

    One of the uses for a cuckoo-hash table as opposed to linear probing is that (especially with a stash) you can get good performance in a heavily multi-threaded environment that supports atomic exchange.

    References: http://dl.acm.org/citation.cfm?id=1618500, http://gradworks.umi.com/34/82/3482095.html (dissertation), and Chapter 4 of "GPU Computing Gems – Jade Edition".

  6. Joseph Lenox Says:
    August 14th, 2014 at 18:00

    Sorry for the extra post — forgot to add a CUDA implementation.

