Uncategorized

Data Structures: Hash Functions and Tables (Part 2)

In Part 1 of this series we reviewed some fundamental theory of hash functions and how they are used to create hash tables. With the preliminary theory aside it's time to examine the practical use of hash tables in .Net.

System.Collections.Hashtable
Prior to the introduction of generic collections in .Net 2.0 the non-generic Hashtable class was used to create hash tables. Because it's non-generic you suffer all the usual boxing and casting side-effects so it's much better to use generic data structure. It should also be noted that this data structure uses re-hashing for collision resolution and employs a series of hash functions of the following form:

Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize – 1)))] % hashsize

Re-hashing provides better collision avoidance than either linear or quadratic probing because it is less likely to lead to clustering. As well, it is important that the different hash functions used do not result in the same slot being assigned for a given key. Mathematically this can be guaranteed by using a hashsize that is prime. It is for this reason that the size is always chosen to be a prime number. In fact, to ensure the load factor (more on this later) is kept under control the Hashtable expands to the next prime number that is bigger than twice the current number of buckets.

private void expand()
{
    int prime = HashHelpers.GetPrime(this.buckets.Length * 2);
    this.rehash(prime);
}

Interestingly, if Reflector is to be believed, Microsoft employs an internal lookup table of known prime numbers and supplements this with a fairly naive primality test.

internal static bool IsPrime(int candidate)
{
    if ((candidate & 1) == 0)
    {
        return (candidate == 2);
    }
    int num = (int) Math.Sqrt((double) candidate);
    for (int i = 3; i <= num; i += 2)
    {
        if ((candidate % i) == 0)
        {
            return false;
        }
    }
    return true;
}

System.Collections.Generic.Dictionary<TKey,TValue>
This is a better data structure to use because both the key and value are generically typed which gives you better type safety than the non-generic Hashtable or Dictionary class since they effectively treat everything as a System.Object.

Unlike the Hashtable, this data structure uses chaining for conflict resolution. Recall that with probing, in the event of a collision another slot in the list of buckets is tried. With re-hashing the hash is recomputed, and that new slot is tried. With chaining, however, a secondary data structure is utilized to hold any collisions. Specifically, each slot in the Dictionary has an array of elements that map to that bucket. In the event of a collision, the colliding element is prepended to the bucket's list.

Adding and deleting items is pretty straight forward. You must remember that you cannot add the same key twice - they need to be unique otherwise you'll get an ArgumentException, and you cannot give a null key value. Although it is intuitive to use the .Add() method to add items to the dictionary it is usually better to use the indexer of the dictionary since this method uses create-or-replace semantics which means you avoid exceptions if you assign a value to a key that already exists. Here's an example in C# to illustrate these concepts. The dictionary is used to store the age of a group of friends. As such the choice of the key, the friends' name, isn't great but we use it here only for illustrative purposes.

            // creating the dictionary
            var ht = new Dictionary<string, int>(50);

            // populating the dictionary
            ht.Add("Darren",27);
            ht.Add("Amanda",25);
            ht.Add("Oliver",2);
            ht.Add("Benny",65);
            ht.Add("Bjorn", 69);
            ht.Add("Agnita",57);
            //ht.Add("Agnita",53);    // throws an ArgumentException since keys must be unique
            ht.Add("Allison",27);

            // removing items from the dictionary
            ht.Remove("Allison");
            ht.Remove("Allison");   // does NOT throw an exception if it's not found

            // a better way to add items to the dictionary
            ht["Bobby"] = 45;
            ht["Cassandra"] = 27;
            ht["Felicity"] = 19;
            ht["Bjorn"] = 69;       // won't throw an exception even though key already exists

Regardless of the technique you use to add items the cost is the same: O(1) if the item count is less than the capacity, and O(n) - where n is the count - if you have hit the capacity of the dictionary. The reason for this is that the dictionary needs to be completely rebuilt when you hit the capacity. There is a read-only property called Count on the dictionary which allows you to (cheaply) check the item count of the dictionary at any time. However, the only time you can specify the capacity of the dictionary is in the constructor! The attractive feature of hash tables - near-constant time lookup - is actually dependent on the load factor of the hash table. The load factor is defined as count/capacity. To ensure near-constant time lookup speed it's best to keep the load factor below 70%. However, unlike the Hashtable which allows you to specify the load factor (which is actually scaled back to a fraction of 0.72), the generic Dictionary only allows you to specify the capacity.

Iterating over all the elements in the dictionary is trivial and unsurprising since the collection implements IEnumerable.

            // iterating over the dictionary is easy since it implements IEnumerable<T>
            foreach(KeyValuePair<string,int> kvp in ht)
            {
                Console.WriteLine("Key = {0}; Value={1}", kvp.Key, kvp.Value);
            }

Finding items in a dictionary is done in one of four ways: ContainsKey(), ContainsValue(), TryGetValue() and the indexer. The two Contains* methods return a boolean value indicating if the specified key or value is in the dictionary. These methods never throw exceptions. Fetching values using the indexer is fine if you are sure the key is in the dictionary. If it is not the call throws a KeyNotFoundException that you must catch. Throwing and catching exceptions is generally considered to be expensive and therefore not best practice. Instead, use the TryGetValue() method that returns a boolean value indicating if the key was found and it if was the method populates the output parameter you passed in. Here's some code to demonstrate all four approaches:

            // querying the dictionary
            Console.WriteLine("Have someone called Benny: {0}", ht.ContainsKey("Benny"));
            Console.WriteLine("Have someone aged 65: {0}", ht.ContainsValue(65));
            Console.WriteLine("Have someone aged 160: {0}", ht.ContainsValue(160));

            // if you don't know if the key is in the dictionary...
            int age;
            if (ht.TryGetValue("Bobby", out age))
                Console.WriteLine("Bobby's age is {0}",age  );

            // getting the value associated with a key using the indexer
            Console.WriteLine(ht["Bobby"]);
            Console.WriteLine(ht["Bobby2"]);        // throws KeyNotFoundException

You may be tempted to use a SortedDictionary<TKey,TValue> instead of a dictionary thinking you get an extra feature - ordering - that you don't get in the generic dictionary. DON'T do this! The SortedDictionary is an entirely different beast. Internally it uses a binary search tree, not a Hashtable. Thus it has running time ~ O(log n) for find operations whereas all hash table have close to constant time, O(1), find operations, depending on the load factor and the quality of the hash function used.

Uncategorized

Applied Number Theory: One Way Functions

Whilst many developers have to incorporate a degree of "security" or "crypto" elements into products they build, it recently came to my attention that many developers have a thin grasp of the mathematical concepts that underpin the foundations of cryptography. I guess it shouldn't have surprised me, but none-the-less it did.

For example, many developers will tell you that passwords should not be stored in clear-text in the database, but rather the hash of the password should be stored. That is excellent advice, but when probed on hash functions - what exactly are they, how do they work, what assumptions do they make, etc - you'll quickly find that it is an area that isn't well understood by many developers.

Under the covers hash functions are specialized one way functions. One way functions are a cryptographic primitive that, by definition, are "easy" to compute in one direction but "hard" to calculate in the other direction, that is, finding the inverse.  In this context, "easy" means solvable in polynomial time, and "hard" means, on average, the inverse for some randomly given value is not solvable in polynomial time. That is, easy means ∈ P, and hard means ∈ NP.

An Example

Let me give an example to demonstrate the easy/hard asymmetry that gives the one way function it's power. Consider the following equation:

c ≡ be mod m   (1)           [ here b is the "base", e is the "exponent", and m is the "modulus" ]

This is called modular or discrete exponentiation and it is a carefully constructed equation to exhibit qualities that make it a suitable one way function. Essentially, (1) consists of 2 components: an exponentiation operation, be, and a modulus operation. It's clear that evaluating be requires e-1 multiplication operations. This can also be seen by considering the recurrence relationship :

be = b. b(e-1) (2)

= b. b. b(e-2)

...

= b. b. b. b.  ... .b

The astute reader will note that we can perform this more efficiently by computing b2, and then multiplying that by itself to get b4 and so on to get b8,  b16,  b32 and so on. We can see that this approach requires approximately log2(e) operations. This is called exponentiation by squaring. It should be noted that there is considerable efficiency to be gained in using an Θ(log2e) algorithm over a Θ(e) algorithm when e is large.

The second component needing evaluation in (1) is the modulus operation. This involves dividing be by m and finding the remainder. So overall we can be certain that we have an expression that can be solved in polynomial time. Let's now consider the inverse function. That is, calculating e for given values of c, b and m. The equation is:

e = loge (b) -> Zn (3)

This is a discrete logarithm to base e in Zn and the tractability of such a problem is known to be hard and have no efficient solutions (unless of course, you have a quantum computer). Ergo, modular exponentiation is an elegant construction that can be used as a cryptographic one way function. Indeed, this particular function is the basis of the ElGamal encryption scheme.

Underlying Assumptions

The strength of one-way functions is in the difficulty in calculating the inverse. In the forward direction we have a problem ∈ P, and in the the reverse direction we have a problem ∈ NP. Clearly, we are assuming that:

PNP (4)

(4) is considered to be one of the most fundamental, yet difficult, open unresolved mathematical equations. Presently this assumption stated in (4) is a safe, albeit unproven, assumption as the majority of mathematicians believe (4) to be true as it has been scrutinized over many years without being disproved. The importance of this assumption is exemplified by the lucrative financial rewards (1 million USD) on offer from the Clay Institute for a formal proof or disproof. To date, however, no one has been able to offer such a proof. Until that happens cryptographers will continue to assume this is the case.