## Data Structures: Hash Functions and Tables (Part 1)

In the past I've written about cryptographically-strong hash functions. But today I'd like to start a series of posts on *simple hash functions*. The difference between a cryptographically-strong hash function and a simple one lies primarily in the reversibility of the result. Reversal must be computationally "hard" in the crypto world, but when the hash function is used for other purposes - like consistent mapping - irreversibility matters far less.

A **hash function** is a function which consistently maps a data value, called the *key*, to an array (hash table) *index*. The standard practice is to feed a variety of input values (the keys) into a hash function and use the resulting array indices to store the key values in the array. By "consistently" we mean the function always produces the same result for a given input value. In mathematical terms - *deterministic*.

A *hash table* is a data structure that internally employs a hash function to efficiently map keys to values.

In the diagram below, the blue boxes on the left represent the keys whilst the boxes on the right-hand side represent the hash table. The green slots are indices occupied by the blue keys.

Ideally we want the function used to be consistent, fast and *injective (one-to-one)*; however, this is not always possible. To see this, consider *h* to be a hash function that operates on a set of keys, **A**, and maps them into a set of indices, **B**. If **|A|** > **|B|** then, by the pigeonhole principle, *h* cannot be injective over **A**.

When the function maps two different key values to the same index, we have what is termed a *collision* and the function is classified as *non-injective*. The array slots available to the hash function are referred to as *buckets* because collisions mean multiple items can end up in each slot, hence the collective name "bucket". As well, the array indices returned are commonly referred to as *bucket indices*. The diagram below shows a hash with a collision on the keys "Barry" and "Sally". In this case the hash table uses a simple linked list to point to all the items that hash to the given value. This is called *hashing with chaining*.

The astute reader will be wondering if a linked list is the most efficient data structure to use inside the bucket. It certainly isn't, since you need to walk the list to find a given item, which leads to O(n) worst-case running time where *n* is the number of items in the bucket. However, the number of items per bucket is, on average, very small in most cases, so whilst fancier bucket structures (even hash tables inside hash tables) perform better, simple linked lists are easier for illustration purposes.
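To make the chaining idea concrete, here's a minimal sketch in Python (rather than C#, purely for brevity) of a hash table whose buckets are plain lists standing in for the linked lists in the diagram. The bucket count and modulo-based index are illustrative choices, not a production design.

```python
class ChainedHashTable:
    """Toy hash table using chaining: each bucket is a list of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # hash() is Python's built-in; any deterministic hash function would do
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # empty slot or collision: append to chain

    def get(self, key):
        # worst case walks the whole chain: O(n) in the bucket size
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)
```

Using a deliberately tiny table (two buckets) forces collisions, so several keys share a chain yet every lookup still finds the right value.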

When collisions happen, the amount of work needed to subsequently look up a value in the hash table increases, since we now have multiple items at the same bucket index. This is not desirable, hence we prefer hash functions that minimise the number of collisions. Techniques used to resolve hash collisions include *linear probing*, *quadratic probing*, *double hashing* and *chaining*. The probing methods effectively find another vacant slot in the hash table using retry loops and modular arithmetic.
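As a sketch of the probing idea, here is a toy open-addressing table in Python that resolves collisions by linear probing: on a collision it scans forward, wrapping around, until a free slot appears. Capacity is fixed and deletion is omitted to keep the illustration short.

```python
class LinearProbingTable:
    """Toy open-addressing hash table using linear probing (no deletion)."""

    EMPTY = object()  # sentinel marking a vacant slot

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.slots = [self.EMPTY] * capacity

    def insert(self, key):
        i = hash(key) % self.capacity
        for _ in range(self.capacity):
            if self.slots[i] is self.EMPTY or self.slots[i] == key:
                self.slots[i] = key
                return i
            i = (i + 1) % self.capacity   # collision: probe the next slot
        raise RuntimeError("table full")

    def contains(self, key):
        i = hash(key) % self.capacity
        for _ in range(self.capacity):
            if self.slots[i] is self.EMPTY:
                return False              # an empty slot ends the probe chain
            if self.slots[i] == key:
                return True
            i = (i + 1) % self.capacity
        return False
```

Quadratic probing and double hashing differ only in how the next probe index is computed; the retry structure is the same.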

**The primary benefit of a well-behaved (collision-minimising) hash function is the speed of lookup in the resulting hash table.** The *expected* time for any operation is constant, O(1), which makes hash tables extremely appealing for tasks such as database indexing, de-duping lists, checking a set for inclusion, etc. In fact, several other data structures use hash functions under the covers to weave their magic. These include **dictionaries** / **associative arrays** / **maps**, and **Bloom filters**.

As an example, here's some C# code to de-dup a generic list in-place using a hash table.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class Program
{
    public static void Main()
    {
        var names = new List<string> { "Damien", "Alex", "Oscar", "Alex", "Benny",
                                       "Bjorn", "Oscar", "Oscar", "Agnita" };
        DeDup(names);
        foreach (var name in names)
            Console.WriteLine(name);
    }

    // order-preserving, quick, in-place generic list de-duping
    public static void DeDup<T>(List<T> input)
    {
        // key on the item itself rather than GetHashCode(): two distinct
        // items can share a hash code, and keying on the code alone would
        // wrongly remove one of them
        var ht = new Hashtable();
        int i = 0;
        foreach (var item in input.ToArray())
        {
            if (!ht.ContainsKey(item))
            {
                ht.Add(item, item);
                i++;
            }
            else
            {
                input.RemoveAt(i);
            }
        }
    }
}
```

In order to minimise collisions, the output values from the hash function should, ideally, be *uniformly distributed* over the output range. Statistical techniques like the *chi-squared test* can be used to check this, with the null hypothesis H₀: the function's output values follow a uniform distribution.
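As a sketch, the chi-squared statistic for a set of bucket counts can be computed directly. Under H₀ each of the *b* buckets expects *n*/*b* hits; a statistic larger than the critical value rejects uniformity. The 16.92 threshold below is the standard table entry for 9 degrees of freedom at the 5% significance level; MD5 is used only to make the bucketing stable across runs.

```python
import hashlib

def chi_squared(counts):
    """Chi-squared statistic against a uniform expected distribution."""
    n = sum(counts)
    expected = n / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def bucket_counts(keys, buckets):
    """Count how many keys a stable hash drops into each bucket."""
    counts = [0] * buckets
    for key in keys:
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        counts[digest % buckets] += 1
    return counts

counts = bucket_counts([f"key-{i}" for i in range(10_000)], buckets=10)
stat = chi_squared(counts)
# degrees of freedom = buckets - 1 = 9; 5% critical value is 16.92
uniform_enough = stat < 16.92
```

A perfectly even spread gives a statistic of 0; the more lopsided the buckets, the larger it grows.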

**Perfect Hash function**

If there are no duplicate keys in the input set and the hash function results in no bucket having more than one entry, the hash function is said to be a *perfect hash function*. In other words, a perfect hash function is *injective* on a fixed, known set of inputs, and therefore collision-free. A perfect hash function is ideal when the input set is large but static, and there are many lookups of the data - i.e. read-intensive, mostly-static data.

**Minimal Perfect Hash function**

A perfect hash function for *n* keys is said to be **minimal** if its range consists of *n* *consecutive* integers, usually from 0 to *n*−1. Besides providing single-step lookup, a minimal perfect hash function is extremely space-efficient as it produces a compact hash table without any vacant slots. Minimal perfect hash functions are much harder to find than perfect ones with a wider range. The diagram below shows a minimal perfect hash function.
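For a small, fixed key set a minimal perfect hash function can even be found by brute force. The Python sketch below searches the classic ((a·h(x) + b) mod p) mod n family of functions for parameters that map five names bijectively onto 0..4. Real constructions (CHD, BBHash and friends) are far cleverer; this merely demonstrates that such a function exists for a small static set.

```python
import hashlib

def base_hash(key):
    """Stable integer fingerprint for a string key (unlike built-in hash())."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def find_minimal_perfect(keys, p=1_000_003):
    """Brute-force search for ((a*h + b) mod p) mod n that is injective on keys."""
    n = len(keys)
    fingerprints = [base_hash(k) for k in keys]
    for a in range(1, 200):
        for b in range(200):
            slots = {((a * f + b) % p) % n for f in fingerprints}
            if len(slots) == n:   # n distinct values in 0..n-1: minimal and perfect
                return lambda key, a=a, b=b: ((a * base_hash(key) + b) % p) % n
    raise RuntimeError("no parameters found in search range")

names = ["Damien", "Alex", "Oscar", "Benny", "Bjorn"]
mph = find_minimal_perfect(names)
```

Because the range has exactly *n* values, injectivity alone guarantees the function hits every slot from 0 to *n*−1 - a compact table with no vacancies.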

**Cuckoo Hashing**

When one hash function isn't enough, how about using two? This is the idea behind *cuckoo hashing*. You employ two hash functions and, for each key, insert the item into the slot given by the first hash function. If that slot is already taken then you kick out the item in that slot and relocate it to its alternative slot, given by the other hash function. If that slot is taken too, you repeat the process until a vacant slot is found. This can lead to an infinite loop, in which case the hash table detects this and rebuilds itself in-place using new hash functions that hopefully lead to a better result.

The cute thing about cuckoo hashing is that the *worst-case* time for lookup operations is constant, O(1). This is much better than merely having *expected* constant time. Furthermore, as long as the load factor is less than 50% [i.e. the capacity of the hash table is more than twice the number of keys] insertions have expected constant time. So it does require a very sparse hash table to get good insertion times; the structure is either fast and takes a lot of space, or slow and space-efficient - but never both.
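Here is a toy Python sketch of the scheme: two salted hash functions over a single table, a bounded eviction loop, and a rebuild when a loop is suspected (capacity doubling stands in for choosing new hash functions). A lookup inspects at most two slots, which is where the O(1) worst case comes from.

```python
import hashlib

class CuckooTable:
    """Toy cuckoo hash table: two hash functions, one slot array."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.slots = [None] * capacity

    def _h(self, key, which):
        # salt the digest so which=0 and which=1 act as independent functions
        digest = hashlib.md5(f"{which}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.capacity

    def insert(self, key):
        for _ in range(self.capacity):          # bounded eviction chain
            i = self._h(key, 0)
            if self.slots[i] is None or self.slots[i] == key:
                self.slots[i] = key
                return
            j = self._h(key, 1)
            if self.slots[j] is None or self.slots[j] == key:
                self.slots[j] = key
                return
            # both slots taken: evict the occupant of slot i and re-home it
            self.slots[i], key = key, self.slots[i]
        self._rebuild(key)                      # likely a cycle: rebuild larger

    def _rebuild(self, pending):
        old = [k for k in self.slots if k is not None] + [pending]
        self.capacity *= 2
        self.slots = [None] * self.capacity
        for k in old:
            self.insert(k)

    def lookup(self, key):
        # worst-case constant time: only two possible homes to check
        return (self.slots[self._h(key, 0)] == key
                or self.slots[self._h(key, 1)] == key)
```

The eviction swap in `insert` is the "cuckoo" step: the newcomer takes the slot and the displaced item goes looking for its other home.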

If you're wondering about the name "cuckoo", it comes from the nesting habits of the European cuckoo, whose chick throws the other occupants out of the host's nest to make room for itself.

**Bloom Filters**

If two hash functions aren't enough, how about using three, or more? Believe it or not, this is what *Bloom filters* do. You start off with an empty bit array of, say, *n* bits, all set to 0. You also have *k* hash functions which map keys to array indices randomly and uniformly. When you need to **add** an item to the filter, you hash its key with all *k* hash functions and then put a 1 in each of the positions returned. The diagram below shows how three values {x, y, z} have been mapped into the bit array, whilst *w* is the item being queried.

So what use is this? Well, Bloom filters are used to test whether an element is a member of a set (you can add elements to the set but not easily remove them, unless you introduce counters in the array). To do a **query** you pass the key of the data under consideration through the same *k* hash functions and check whether ALL of the resulting array positions are equal to 1. If all positions are 1 it means that either the item has already been added, OR the bits were all set by insertions of other items. Thus it is a *probabilistic* data structure: there is a chance of *false positives*, but if items are never removed there is NO chance of a *false negative*. The likelihood of false positives **reduces** as the size of the bit array increases, and **increases** as the number of inserted elements increases.

They are space-efficient and, because they employ hash functions, exhibit constant time for both insertion and lookup operations - here the "lookup" operation is checking for inclusion in the set. As well, the *k* hash computations can be performed in parallel. Finally, multiple Bloom filters of the same size can be combined using *union* and *intersection* operations, implemented efficiently with bitwise OR/AND functions.
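A minimal Python sketch of the idea, using *k* salted MD5 digests to stand in for *k* independent hash functions:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter over a plain bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, key):
        # k salted digests approximate k independent hash functions
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1           # set all k positions

    def query(self, key):
        # True means "possibly present"; False is always correct
        return all(self.bits[pos] for pos in self._positions(key))
```

A union of two same-sized filters is just the element-wise OR of their `bits` arrays, which is why combining filters is so cheap.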

So what values of *k* - the number of hash functions - and *n* - the size of the bit array - should we choose for a given number of inserted items, *s*? Mathematical analysis has shown that the value of *k* that minimises the probability of false positives is given by:

*k* = (*n* / *s*) · ln 2

With around 10 bits of array per inserted item, this yields an expected false positive rate of under 1%. Also note that since the formula is expressed in terms of *s*, the number of inserted items, it follows that *n*, the size of the array used for the Bloom filter, needs to grow as *s* increases if the error rate is to stay fixed.
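Plugging numbers into the standard formulas makes this concrete - the false positive rate below uses the usual approximation (1 − e^(−ks/n))^k:

```python
import math

def optimal_k(n, s):
    """Number of hash functions minimising false positives: (n/s) * ln 2."""
    return max(1, round((n / s) * math.log(2)))

def false_positive_rate(n, s, k):
    """Standard approximation of the Bloom filter false positive rate."""
    return (1 - math.exp(-k * s / n)) ** k

n, s = 10_000, 1_000              # 10 bits of array per inserted item
k = optimal_k(n, s)               # comes out at 7 hash functions
rate = false_positive_rate(n, s, k)
```

At 10 bits per item the rate lands around 0.8%, just under the 1% figure quoted above; halving the bits per item pushes it an order of magnitude higher.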

In summary, a Bloom filter is a simple space-efficient randomised data structure for representing a set in order to support membership/inclusion queries. They can lead to false positives but the space savings often outweigh this drawback when the probability of error is controlled.

Note that if the set of possible values is small and many of them might be included in the set then other data structures like plain-and-simple bit arrays are more effective since they are deterministic.

In Part 2 of this series we'll move on from the background theory to look at implementation/usage details in .NET.

*29 Apr 2009*
*Damien Wintour*