Data Structures: Hash Functions and Tables (Part 2)
In Part 1 of this series we reviewed some fundamental theory of hash functions and how they are used to create hash tables. With the preliminary theory aside it's time to examine the practical use of hash tables in .Net.
System.Collections.Hashtable
Prior to the introduction of generic collections in .Net 2.0 the non-generic Hashtable class was used to create hash tables. Because it's non-generic you suffer all the usual boxing and casting side-effects so it's much better to use generic data structure. It should also be noted that this data structure uses re-hashing for collision resolution and employs a series of hash functions of the following form:
Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize – 1)))] % hashsize
Re-hashing provides better collision avoidance than either linear or quadratic probing because it is less likely to lead to clustering. As well, it is important that the different hash functions used do not result in the same slot being assigned for a given key. Mathematically this can be guaranteed by using a hashsize that is prime. It is for this reason that the size is always chosen to be a prime number. In fact, to ensure the load factor (more on this later) is kept under control the Hashtable expands to the next prime number that is bigger than twice the current number of buckets.
private void expand() { int prime = HashHelpers.GetPrime(this.buckets.Length * 2); this.rehash(prime); }
Interestingly, if Reflector is to be believed, Microsoft employs an internal lookup table of known prime numbers and supplements this with a fairly naive primality test.
internal static bool IsPrime(int candidate) { if ((candidate & 1) == 0) { return (candidate == 2); } int num = (int) Math.Sqrt((double) candidate); for (int i = 3; i <= num; i += 2) { if ((candidate % i) == 0) { return false; } } return true; }
System.Collections.Generic.Dictionary<TKey,TValue>
This is a better data structure to use because both the key and value are generically typed which gives you better type safety than the non-generic Hashtable or Dictionary class since they effectively treat everything as a System.Object.
Unlike the Hashtable, this data structure uses chaining for conflict resolution. Recall that with probing, in the event of a collision another slot in the list of buckets is tried. With re-hashing the hash is recomputed, and that new slot is tried. With chaining, however, a secondary data structure is utilized to hold any collisions. Specifically, each slot in the Dictionary has an array of elements that map to that bucket. In the event of a collision, the colliding element is prepended to the bucket's list.
Adding and deleting items is pretty straight forward. You must remember that you cannot add the same key twice - they need to be unique otherwise you'll get an ArgumentException, and you cannot give a null key value. Although it is intuitive to use the .Add() method to add items to the dictionary it is usually better to use the indexer of the dictionary since this method uses create-or-replace semantics which means you avoid exceptions if you assign a value to a key that already exists. Here's an example in C# to illustrate these concepts. The dictionary is used to store the age of a group of friends. As such the choice of the key, the friends' name, isn't great but we use it here only for illustrative purposes.
// creating the dictionary
var ht = new Dictionary<string, int>(50);
// populating the dictionary
ht.Add("Darren",27);
ht.Add("Amanda",25);
ht.Add("Oliver",2);
ht.Add("Benny",65);
ht.Add("Bjorn", 69);
ht.Add("Agnita",57);
//ht.Add("Agnita",53); // throws an ArgumentException since keys must be unique
ht.Add("Allison",27);
// removing items from the dictionary
ht.Remove("Allison");
ht.Remove("Allison"); // does NOT throw an exception if it's not found
// a better way to add items to the dictionary
ht["Bobby"] = 45;
ht["Cassandra"] = 27;
ht["Felicity"] = 19;
ht["Bjorn"] = 69; // won't throw an exception even though key already exists
Regardless of the technique you use to add items the cost is the same: O(1) if the item count is less than the capacity, and O(n) - where n is the count - if you have hit the capacity of the dictionary. The reason for this is that the dictionary needs to be completely rebuilt when you hit the capacity. There is a read-only property called Count on the dictionary which allows you to (cheaply) check the item count of the dictionary at any time. However, the only time you can specify the capacity of the dictionary is in the constructor! The attractive feature of hash tables - near-constant time lookup - is actually dependent on the load factor of the hash table. The load factor is defined as count/capacity. To ensure near-constant time lookup speed it's best to keep the load factor below 70%. However, unlike the Hashtable which allows you to specify the load factor (which is actually scaled back to a fraction of 0.72), the generic Dictionary only allows you to specify the capacity.
Iterating over all the elements in the dictionary is trivial and unsurprising since the collection implements IEnumerable
// iterating over the dictionary is easy since it implements IEnumerable<T>
foreach(KeyValuePair<string,int> kvp in ht)
{
Console.WriteLine("Key = {0}; Value={1}", kvp.Key, kvp.Value);
}
Finding items in a dictionary is done in one of four ways: ContainsKey(), ContainsValue(), TryGetValue() and the indexer. The two Contains* methods return a boolean value indicating if the specified key or value is in the dictionary. These methods never throw exceptions. Fetching values using the indexer is fine if you are sure the key is in the dictionary. If it is not the call throws a KeyNotFoundException that you must catch. Throwing and catching exceptions is generally considered to be expensive and therefore not best practice. Instead, use the TryGetValue() method that returns a boolean value indicating if the key was found and it if was the method populates the output parameter you passed in. Here's some code to demonstrate all four approaches:
// querying the dictionary
Console.WriteLine("Have someone called Benny: {0}", ht.ContainsKey("Benny"));
Console.WriteLine("Have someone aged 65: {0}", ht.ContainsValue(65));
Console.WriteLine("Have someone aged 160: {0}", ht.ContainsValue(160));
// if you don't know if the key is in the dictionary...
int age;
if (ht.TryGetValue("Bobby", out age))
Console.WriteLine("Bobby's age is {0}",age );
// getting the value associated with a key using the indexer
Console.WriteLine(ht["Bobby"]);
Console.WriteLine(ht["Bobby2"]); // throws KeyNotFoundException
You may be tempted to use a SortedDictionary<TKey,TValue> instead of a dictionary thinking you get an extra feature - ordering - that you don't get in the generic dictionary. DON'T do this! The SortedDictionary is an entirely different beast. Internally it uses a binary search tree, not a Hashtable. Thus it has running time ~ O(log n) for find operations whereas all hash table have close to constant time, O(1), find operations, depending on the load factor and the quality of the hash function used.
14 May 2009 Damien Wintour 1 comment











