Data Structures: Hash Functions and Tables (Part 2)

In Part 1 of this series we reviewed some fundamental theory of hash functions and how they are used to create hash tables. With the preliminary theory aside it's time to examine the practical use of hash tables in .Net.

System.Collections.Hashtable
Prior to the introduction of generic collections in .Net 2.0, the non-generic Hashtable class was used to create hash tables. Because it's non-generic you suffer all the usual boxing and casting side-effects, so it's much better to use a generic data structure. It should also be noted that this data structure uses re-hashing for collision resolution and employs a series of hash functions of the following form:

Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1)))] % hashsize
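
To make the formula concrete, here's a small illustrative sketch (not the actual BCL source) of how the k-th probe index could be computed, assuming the value returned by GetHash has already been reduced to a non-negative integer:

// Illustrative only: computes the k-th probe index for a given key hash.
static int ProbeIndex(int hash, int k, int hashsize)
{
    // with a prime hashsize the increment is co-prime with it, so the probe
    // sequence eventually visits every slot
    int increment = 1 + (((hash >> 5) + 1) % (hashsize - 1));
    return (int)(((long)hash + (long)k * increment) % hashsize);
}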

Re-hashing provides better collision avoidance than either linear or quadratic probing because it is less likely to lead to clustering. It is also important that the successive hash functions do not keep landing on the same slot for a given key; in other words, the probe increment must be co-prime with the table size so that the probe sequence can reach every slot. This can be guaranteed by choosing a hashsize that is prime, which is exactly why the size is always a prime number. In fact, to ensure the load factor (more on this later) is kept under control, the Hashtable expands to the next prime number that is bigger than twice the current number of buckets.

private void expand()
{
    int prime = HashHelpers.GetPrime(this.buckets.Length * 2);
    this.rehash(prime);
}

Interestingly, if Reflector is to be believed, Microsoft employs an internal lookup table of known prime numbers and supplements this with a fairly naive primality test.

internal static bool IsPrime(int candidate)
{
    if ((candidate & 1) == 0)
    {
        return (candidate == 2);
    }
    int num = (int) Math.Sqrt((double) candidate);
    for (int i = 3; i <= num; i += 2)
    {
        if ((candidate % i) == 0)
        {
            return false;
        }
    }
    return true;
}

System.Collections.Generic.Dictionary<TKey,TValue>
This is a better data structure to use because both the key and the value are generically typed, which gives you better type safety than the non-generic Hashtable, which effectively treats every key and value as a System.Object.

Unlike the Hashtable, this data structure uses chaining for collision resolution. Recall that with probing, in the event of a collision another slot in the list of buckets is tried. With re-hashing the hash is recomputed and that new slot is tried. With chaining, however, a secondary data structure is used to hold any collisions. Specifically, each bucket in the Dictionary refers to a chain of the elements that hash to it (internally the entries live in a single array and link to one another by index). In the event of a collision, the colliding element is prepended to the bucket's chain.
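
As a rough sketch of the chaining idea (not the Dictionary's actual internals), a chained hash table can be pictured as an array of buckets where each bucket holds a small list of the entries that hash to it:

// A toy chained hash table, purely for illustration; the real
// Dictionary<TKey,TValue> is considerably more sophisticated.
public class ChainedTable<TKey, TValue>
{
    private readonly LinkedList<KeyValuePair<TKey, TValue>>[] buckets;

    public ChainedTable(int size)
    {
        buckets = new LinkedList<KeyValuePair<TKey, TValue>>[size];
    }

    public void Add(TKey key, TValue value)
    {
        int index = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[index] == null)
            buckets[index] = new LinkedList<KeyValuePair<TKey, TValue>>();

        // a colliding entry simply joins the front of its bucket's chain
        buckets[index].AddFirst(new KeyValuePair<TKey, TValue>(key, value));
    }
}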

Adding and deleting items is pretty straightforward. You must remember that you cannot add the same key twice - keys must be unique, otherwise you'll get an ArgumentException - and you cannot use a null key. Although it is intuitive to use the Add() method to add items to the dictionary, it is usually better to use the dictionary's indexer since it has create-or-replace semantics, which means you avoid exceptions when you assign a value to a key that already exists. Here's an example in C# to illustrate these concepts. The dictionary stores the ages of a group of friends. As such the choice of key, the friend's name, isn't great, but we use it here only for illustrative purposes.

            // creating the dictionary
            var ht = new Dictionary<string, int>(50);

            // populating the dictionary
            ht.Add("Darren",27);
            ht.Add("Amanda",25);
            ht.Add("Oliver",2);
            ht.Add("Benny",65);
            ht.Add("Bjorn", 69);
            ht.Add("Agnita",57);
            //ht.Add("Agnita",53);    // throws an ArgumentException since keys must be unique
            ht.Add("Allison",27);

            // removing items from the dictionary
            ht.Remove("Allison");
            ht.Remove("Allison");   // does NOT throw an exception if it's not found

            // a better way to add items to the dictionary
            ht["Bobby"] = 45;
            ht["Cassandra"] = 27;
            ht["Felicity"] = 19;
            ht["Bjorn"] = 69;       // won't throw an exception even though key already exists

Regardless of the technique you use to add items the cost is the same: O(1) if the item count is less than the capacity, and O(n) - where n is the count - if you have hit the capacity of the dictionary. The reason for this is that the dictionary needs to be completely rebuilt when the capacity is reached. There is a read-only Count property on the dictionary which allows you to (cheaply) check the item count at any time. However, the only time you can specify the capacity of the dictionary is in the constructor! The attractive feature of hash tables - near-constant time lookup - actually depends on the load factor of the hash table, defined as count/capacity. To ensure near-constant lookup times it's best to keep the load factor below about 70%. However, unlike the Hashtable, which lets you specify a load factor (internally scaled back to a fraction of 0.72), the generic Dictionary only allows you to specify the capacity.
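
As a practical aside, if you know roughly how many items you will store you can pre-size the dictionary in the constructor so the load factor never creeps too high and no mid-stream rebuild is needed (the figure of 1,000 below is just an assumed example):

            // assumed example: we expect roughly 1,000 entries
            int expectedItems = 1000;
            int capacity = (int)(expectedItems / 0.7) + 1;      // keeps count/capacity under ~70%
            var ages = new Dictionary<string, int>(capacity);
            Console.WriteLine(ages.Count);                      // Count is a cheap, read-only check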

Iterating over all the elements in the dictionary is trivial and unsurprising since the collection implements IEnumerable.

            // iterating over the dictionary is easy since it implements IEnumerable<T>
            foreach(KeyValuePair<string,int> kvp in ht)
            {
                Console.WriteLine("Key = {0}; Value={1}", kvp.Key, kvp.Value);
            }

Finding items in a dictionary is done in one of four ways: ContainsKey(), ContainsValue(), TryGetValue() and the indexer. The two Contains* methods return a boolean indicating whether the specified key or value is in the dictionary; they never throw exceptions. Fetching values via the indexer is fine if you are sure the key is in the dictionary; if it is not, the call throws a KeyNotFoundException that you must catch. Throwing and catching exceptions is generally considered expensive and therefore not best practice. Instead, use the TryGetValue() method, which returns a boolean indicating whether the key was found and, if it was, populates the out parameter you passed in. Here's some code to demonstrate all four approaches:

            // querying the dictionary
            Console.WriteLine("Have someone called Benny: {0}", ht.ContainsKey("Benny"));
            Console.WriteLine("Have someone aged 65: {0}", ht.ContainsValue(65));
            Console.WriteLine("Have someone aged 160: {0}", ht.ContainsValue(160));

            // if you don't know if the key is in the dictionary...
            int age;
            if (ht.TryGetValue("Bobby", out age))
                Console.WriteLine("Bobby's age is {0}",age  );

            // getting the value associated with a key using the indexer
            Console.WriteLine(ht["Bobby"]);
            Console.WriteLine(ht["Bobby2"]);        // throws KeyNotFoundException

You may be tempted to use a SortedDictionary<TKey,TValue> instead of a Dictionary, thinking you get an extra feature - ordering - that you don't get in the generic dictionary. DON'T do this! The SortedDictionary is an entirely different beast. Internally it uses a balanced binary search tree, not a hash table, so its find operations run in O(log n) time, whereas all hash tables have close to constant time, O(1), find operations, depending on the load factor and the quality of the hash function used.
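
If ordered enumeration is what you actually need, here's a short example of the trade-off: the SortedDictionary enumerates its keys in sorted order, but every lookup costs O(log n) rather than O(1).

            // copy the hash-based dictionary into a tree-based one; enumeration
            // is now in key order, but lookups become O(log n)
            var sorted = new SortedDictionary<string, int>(ht);
            foreach (var kvp in sorted)
                Console.WriteLine("{0} is {1} years old", kvp.Key, kvp.Value);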

Data Structures: Hash Functions and Tables (Part 1)

In the past I've written about cryptographically-strong hash functions. But today I'd like to start a series of posts on simple hash functions. The difference between a cryptographically-strong hash function and a simple one is primarily around the reversibility of the result. This must be "hard" in the crypto world, but when the hash function is used for other purposes - like consistent mapping - the reversibility condition is not so imperative.

A hash function is a function which consistently maps a data value, called the key, to an array (hash table) index. The standard practice is to feed a variety of input values (the keys) into a hash function and use the resulting array indices to store the key values in the array. By "consistently" we mean the function always produces the same result for a given input value. In mathematical terms - deterministic.
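
As a tiny, deliberately naive illustration (not a hash you would use in practice), here's a deterministic function that maps a string key to an index in a fixed-size table:

        // Illustrative only: sum the character codes and fold the result into
        // the table's index range. The same key always yields the same index.
        public static int SimpleHash(string key, int tableSize)
        {
            int sum = 0;
            foreach (char c in key)
                sum += c;
            return sum % tableSize;
        }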

A hash table is a data structure that internally employs a hash function to efficiently map keys to values.

In the diagram below, the blue boxes on the left represent the keys whilst the boxes on the right-hand side represent the hash table. The green slots are indices occupied by the blue keys.

Ideally we want the function to be consistent, fast and injective (one-to-one); however this is not always possible. To see this, consider h to be a hash function that operates on a set of keys, A, and maps onto a set of bucket indices, B. If |A| > |B| then h cannot be injective over A.

When the function maps two different key values to the same index we have what is termed a collision, and the function is classified as non-injective. The array slots available to the hash function are referred to as buckets because collisions mean multiple items could end up in each slot, hence the collective name "bucket". As well, the array indices returned are commonly referred to as bucket indices.

The diagram below shows a hash with a collision on the keys "Barry" and "Sally". In this case the hash table uses a simple linked list to point to all the items that hash to the given value. This is called hashing with chaining. The astute reader will be wondering whether a linked list is the most efficient data structure to use inside a bucket. It certainly isn't, since you need to walk the list to find a given item, which gives O(n) worst-case running time where n is the number of items in the bucket. However the number of items per bucket is, on average, very small, so whilst more elaborate structures (even hash tables inside the buckets) perform better, simple linked lists are easier for illustration purposes.

When collisions happen, the amount of work needed to subsequently look up a value in the hash table increases since we now have multiple items at the same bucket index. This is not desirable, hence we prefer hash functions that minimise the number of collisions. Techniques used to resolve hash collisions include linear probing, quadratic probing, double hashing and chaining. The probing and double-hashing methods effectively find another vacant slot in the hash table using retry loops or trickery with modular arithmetic, whilst chaining stores the colliding items in a secondary data structure hanging off the bucket.
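
For example, linear probing - the simplest of these - just walks forward from the home slot until a free one turns up. Here's a minimal sketch, assuming string keys and a table that is never allowed to fill completely:

        // Linear probing sketch: on a collision, try the next slot (wrapping
        // around) until either a free slot or the key itself is found.
        public static int FindSlot(string[] buckets, string key)
        {
            int index = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
            while (buckets[index] != null && buckets[index] != key)
                index = (index + 1) % buckets.Length;
            return index;
        }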

The primary benefit of a well-behaved (low-collision) hash function is the speed of lookup in the resulting hash table. The expected time for any operation is a constant, O(1), which makes it extremely appealing for tasks such as database indexing, de-duping lists, checking a set for inclusion, etc. In fact, several other data structures use hash functions under the covers to weave their magic. These include dictionaries / associative arrays / maps, and Bloom filters.

As an example, here's some C# code to de-dup a generic list in-place using a hash table.

        public static void Main()
        {
            var names = new List<string> { "Damien", "Alex", "Oscar", "Alex", "Benny", "Bjorn", "Oscar", "Oscar", "Agnita" };

            DeDup(names);
            foreach (var name in names)
                Console.WriteLine(name);
        }

        // order-preserving, quick, in-place generic list de-duping
        public static void DeDup<T>(List<T> input)
        {
            // the items themselves are used as keys so that two distinct items
            // which happen to share a hash code are not treated as duplicates
            var ht = new Hashtable();

            int i = 0;
            foreach (var item in input.ToArray())
            {
                if (!ht.ContainsKey(item))
                {
                    ht.Add(item, null);     // first sighting: keep the item
                    i++;
                }
                else
                    input.RemoveAt(i);      // duplicate: remove it from the live list
            }
        }

In order to minimise collisions the output values from the hash function should, ideally, be uniformly distributed over the output range. Statistical techniques like the chi-squared test can be used to check this, using the null hypothesis H0: the function's output values follow a uniform distribution.
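
As a rough sketch of how that test could be applied, you can compute the chi-squared statistic over the observed bucket counts and compare it to the critical value for (number of buckets - 1) degrees of freedom:

        // Chi-squared statistic for bucket occupancy under H0 (uniformity):
        // the sum over all buckets of (observed - expected)^2 / expected.
        public static double ChiSquared(int[] bucketCounts, int totalKeys)
        {
            double expected = (double)totalKeys / bucketCounts.Length;
            double statistic = 0.0;
            foreach (int observed in bucketCounts)
                statistic += (observed - expected) * (observed - expected) / expected;
            return statistic;
        }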

Perfect Hash Function

If there are no duplicate keys in the input set and the hash function results in no bucket having more than one entry, the hash function is said to be a perfect hash function. In other words, a perfect hash function is injective on a fixed, known set of inputs, and therefore collision-free. A perfect hash function is ideal when the input set is large but static and there are many lookups of the data, i.e. read-intensive, mostly-static data.

Minimal Perfect Hash Function

A perfect hash function for n keys is said to be minimal if its range consists of consecutive integers, usually from 0 to n-1. Besides providing single-step lookup, a minimal perfect hash function is extremely space-efficient as it produces a compact hash table without any vacant slots. Minimal perfect hash functions are much harder to find than perfect hash functions with a wider range. The diagram below shows a minimal perfect hash function.

Cuckoo Hashing

When one hash function isn't enough, how about using two? This is the idea behind cuckoo hashing. You employ two hash functions and, for each key, insert the item into the slot given by the first hash function. If that slot is already taken then you kick out the item in that slot and relocate it to the slot given by its other hash function. If that slot is taken you do the same thing again until a vacant slot is found. This eviction process can cycle forever, in which case the hash table detects the loop and rebuilds itself in-place using new hash functions that hopefully lead to a better layout.
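
Here's a simplified sketch of that insertion loop, assuming a single table, two home-made string hash functions and a crude eviction limit standing in for proper cycle detection (a real implementation would rebuild with new hash functions at that point):

    // Cuckoo hashing sketch: every key lives in one of exactly two slots,
    // so a lookup never probes more than twice.
    public class CuckooTable
    {
        private readonly string[] slots;

        public CuckooTable(int capacity) { slots = new string[capacity]; }

        private static int Hash(string key, int seed)
        {
            unchecked
            {
                int h = seed;
                foreach (char c in key) h = h * 31 + c;
                return h & 0x7FFFFFFF;
            }
        }

        private int Slot1(string key) { return Hash(key, 17) % slots.Length; }
        private int Slot2(string key) { return Hash(key, 23) % slots.Length; }

        public bool Contains(string key)
        {
            return slots[Slot1(key)] == key || slots[Slot2(key)] == key;
        }

        public void Insert(string key)
        {
            string current = key;
            int slot = Slot1(current);
            for (int kicks = 0; kicks < slots.Length; kicks++)
            {
                if (slots[slot] == null) { slots[slot] = current; return; }

                // kick out the current occupant and send it to its other slot
                string evicted = slots[slot];
                slots[slot] = current;
                current = evicted;
                slot = (Slot1(current) == slot) ? Slot2(current) : Slot1(current);
            }
            throw new InvalidOperationException("Eviction loop detected - rebuild with new hash functions.");
        }
    }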

The cute thing about cuckoo hashing is that the worst case for lookup operations is constant time, O(1). This is much better than merely having constant expected time. Furthermore, as long as the load factor is less than 50% [i.e. the capacity of the hash table is more than twice the number of keys] insertions have expected constant time. So it does require a very sparse hash table to get good insertion times; hence it is either fast and takes a lot of space, or slow and space-efficient - but never both.

If you're wondering about the name "cuckoo", it comes from the nesting habits of the European cuckoo, whose newly-hatched chick throws the other occupants out of the nest to make room for itself.
 
Bloom Filters

If two hash functions aren't enough, how about using three, or more? Believe it or not, this is what Bloom filters do. You start off with an empty bit array of, say, n bits, all set to 0. You also have k hash functions which map keys to array indices randomly and uniformly. The idea is that when you need to add an item to the filter, you hash its key with all k hash functions and set a 1 at each of the positions returned. The diagram below shows how three values {x, y, z} have been mapped into the bit array, whilst w is the item being queried.

So what use is this? Well, Bloom filters are used to test whether an element is a member of a set (you can add elements to the set but not easily remove them unless you introduce counters in the array). To do a query you pass the key of the data under consideration through the same k hash functions and check whether ALL of the resulting array positions are equal to 1. If all positions are 1 it means that either the item has already been added, OR the bits were all set by insertions of other items. Thus it is a probabilistic data structure: there is a chance of false positives, but if items are never removed there is NO chance of a false negative. The likelihood of false positives decreases as the size of the bit array increases, and increases as the number of inserted elements increases.
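
Here's a compact sketch of both operations, using a BitArray and deriving the k indices from two base hashes (a common implementation trick, used here purely for illustration):

    // Bloom filter sketch: each key sets/tests k bit positions derived from
    // two base hashes as h1 + i*h2 (the usual double-hashing trick).
    public class BloomFilter
    {
        private readonly System.Collections.BitArray bits;
        private readonly int k;

        public BloomFilter(int sizeInBits, int hashCount)
        {
            bits = new System.Collections.BitArray(sizeInBits);
            k = hashCount;
        }

        private static int Hash(string key, int seed)
        {
            unchecked
            {
                int h = seed;
                foreach (char c in key) h = h * 31 + c;
                return h & 0x7FFFFFFF;
            }
        }

        public void Add(string key)
        {
            int h1 = Hash(key, 17), h2 = Hash(key, 23);
            for (int i = 0; i < k; i++)
                bits[(int)(((long)h1 + (long)i * h2) % bits.Length)] = true;
        }

        // true means "possibly in the set"; false means "definitely not in the set"
        public bool MightContain(string key)
        {
            int h1 = Hash(key, 17), h2 = Hash(key, 23);
            for (int i = 0; i < k; i++)
                if (!bits[(int)(((long)h1 + (long)i * h2) % bits.Length)])
                    return false;
            return true;
        }
    }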

They are space-efficient and, because they employ hash functions, exhibit constant time for both insertion and lookup operations (here the "lookup" operation is checking for inclusion in the set). As well, the k hash computations can be done in parallel. Moreover, multiple Bloom filters of the same size can be combined using union and intersection operations implemented efficiently with bitwise OR/AND.

So what value of k - the number of hash functions - and n - the size of the bit array - should we choose for a given number of inserted items, s? Mathematical analysis has shown that the value of k that minimises the probability of false positives is given by:

k = (n / s) · ln 2 ≈ 0.693 · (n / s)

With that optimal k the false positive rate works out to roughly 0.6185^(n/s), so keeping the bit array at around ten bits per inserted item brings the rate below 1%. Also note that since the formula above is expressed in terms of s, the number of inserted items, the size of the bit array, n, needs to grow as s increases if a given error rate is to be maintained.

In summary, a Bloom filter is a simple space-efficient randomised data structure for representing a set in order to support membership/inclusion queries. They can lead to false positives but the space savings often outweigh this drawback when the probability of error is controlled.

Note that if the set of possible values is small and many of them might be included in the set then other data structures like plain-and-simple bit arrays are more effective since they are deterministic.

In Part 2 of this series we'll move on from the background theory to look at implementation/usage details in .Net.

Stack with Constant-Time Minimum

Problem: You need to implement a stack data structure that can also report the minimum element in constant-time.

Solution: This is not all that difficult if you think about it. Regular stacks support push() and pop() operations; we need to add a new read-only property, Minimum. You can't just add a single variable to track the minimum, because when the stack is popped you wouldn't know how to update that variable without scanning through all the items in the stack, which would require you to pop and re-push each item - downright ugly. So we need some auxiliary record of the stack's contents that lets us recover the minimum after a pop. A clever way to do this is to use a second stack that stores the minimum of the items currently in the main stack. The running minimum is pushed alongside every push to the main stack, and when items are popped we also pop from the min stack to reveal the previous minimum.

Here's some simple C# code to show this:

public class MinStack<T> where T : IComparable<T>
{
    private readonly Stack<T> minStack = new Stack<T>();
    private readonly Stack<T> stack = new Stack<T>();

    public T Minimum
    {
        get { return minStack.Peek(); }
    }

    public void Push(T element)
    {
        stack.Push(element);
        if (minStack.Count == 0 || element.CompareTo(minStack.Peek()) <= 0)
        {
            minStack.Push(element);
        }
        else
            minStack.Push(minStack.Peek());
    }

    public T Pop()
    {
        minStack.Pop();
        return stack.Pop();
    }
}

And here's my Java code to do it.

import java.util.Stack;

class MinStack<T extends Comparable<T>>
{
    private Stack<T> minStack = new Stack<T>();
    private Stack<T> stack = new Stack<T>();

    public T getMinimum()
    {
        return minStack.peek();
    }

    public void push(T element)
    {
        stack.push(element);
        if (minStack.size() == 0 || element.compareTo(minStack.peek()) <= 0)
        {
            minStack.push(element);
        }
        else
            minStack.push(minStack.peek());
    }

    public T pop()
    {
        minStack.pop();
        return stack.pop();
    }
}
