Archive for February, 2010


Intersection of 2 Arrays


Problem: Given two numerical arrays, find their intersection, i.e. the elements common to both arrays.

Solution: If you understand how databases do this, the answer is straightforward. In essence, databases employ one of three join algorithms: nested loop, sort-merge, or hash join. The nested loop has a worst-case runtime complexity of O(n·m), where n and m are the sizes of the two arrays. That's NOT the one to use. At the other end of the spectrum is the hash join. The worst-case runtime complexity of the hash-join approach is O(n+m); if the size of one of the input arrays, say n, dominates the size of the other, m, we can say the complexity is O(n), i.e. overall it is linear in the size of the larger array.

Here's some C# code that does it. Note that we use the smaller array to build the hash table. The assumption we are making here is that it will fit into memory. If it doesn't, you can resort to a Grace Hash Join: use a hash function to partition both inputs into smaller chunks, then operate on each pair of matching chunks separately, and in parallel. This approach is highly amenable to parallelization and would fit well into a Map-Reduce computational model.

Also note that the HashSet<T> type available in .NET lets you attempt to add duplicates without bothering to check beforehand: when you try to add a duplicate it is silently ignored. Of course, being a hash table, Contains() has constant-time, O(1), complexity on average. I'm also using method group syntax inside the LINQ extension method call on line 20.

    1 private static void Main(string[] args)
    2 {
    3     var a = new[] { 4, 5, 6, 7, 2, 3, 33 };
    4     var b = new[] { 7, 7, 2, 33 };
    5 
    6     var ab = FindIntersection(a.ToList(), b.ToList());
    7     foreach(var item in ab)
    8         Console.WriteLine( item );
    9 
   10 }
   11 static IEnumerable<int> FindIntersection(List<int> a, List<int> b)
   12 {
   13     var hash = new HashSet<int>();
   14 
   15     IEnumerable<int> hashArray = a.Count > b.Count ? b : a;
   16     IEnumerable<int> otherArray = a.Count > b.Count ? a : b;
   17     foreach (var item in hashArray)
   18         hash.Add(item);
   19 
   20     return otherArray.Where(hash.Contains).ToArray();
   21 }
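
To make the Grace Hash Join idea above a little more concrete, here's a rough sketch of how the partition-then-probe step could look for this intersection problem. The bucket count, the Math.Abs-based partitioning and the method names are my own illustrative choices, not part of the original solution: both arrays are split with the same hash function, so any common value must land in the same bucket index on each side, and each bucket pair is then intersected independently (here in parallel via Parallel.For).

        // A rough sketch of a Grace-style, partitioned intersection (illustrative only).
        // Requires System, System.Collections.Concurrent, System.Collections.Generic
        // and System.Threading.Tasks.
        static List<int>[] Partition(int[] input, int bucketCount)
        {
            // the same hash function is used for both inputs, so any value common
            // to both arrays must land in the same bucket index on each side
            var buckets = new List<int>[bucketCount];
            for (int k = 0; k < bucketCount; k++)
                buckets[k] = new List<int>();

            foreach (var v in input)
                buckets[Math.Abs(v % bucketCount)].Add(v);

            return buckets;
        }

        static IEnumerable<int> GraceIntersection(int[] a, int[] b, int bucketCount)
        {
            var aBuckets = Partition(a, bucketCount);
            var bBuckets = Partition(b, bucketCount);

            // each bucket pair is small enough to fit in memory and is independent
            // of the others, so the pairs can be processed in parallel
            var result = new ConcurrentBag<int>();
            Parallel.For(0, bucketCount, k =>
            {
                // within a bucket pair, reuse the hash-and-probe trick:
                // hash one side, scan the other
                var hash = new HashSet<int>(bBuckets[k]);
                foreach (var v in aBuckets[k])
                    if (hash.Contains(v))
                        result.Add(v);
            });
            return result;
        }

With the sample arrays above, GraceIntersection(a, b, 8) yields the same three common values (7, 2, 33), just computed bucket by bucket.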


Join Algorithms Illustrated

A work colleague recently asked me about efficient inner join algorithms that could be applied to in-memory data collections. My mind immediately went back to the common algorithms that most modern databases employ:

  • Nested Loop Join
  • Sort-Merge Join
  • Hash Join

Let's explain each one of these using some simple C# 4.0 code to illustrate. First, suppose we have two lists of key-value pairs. The first is a list of customerID and name; the second is a list of customerID and item purchased:

            // a list of customers with corresponding customerID
            var names = new List<KeyValuePair<int,string>>();
            names.Add(new KeyValuePair<int, string>(1, "bob"));
            names.Add(new KeyValuePair<int, string>(2, "jim"));
            names.Add(new KeyValuePair<int, string>(3, "oscar"));
            names.Add(new KeyValuePair<int, string>(5, "mia"));

            // a list of things the customers purchased. integer value is customerID
            var items = new List<KeyValuePair<int, string>>();
            items.Add(new KeyValuePair<int, string>(2, "beer"));
            items.Add(new KeyValuePair<int, string>(2, "chips"));
            items.Add(new KeyValuePair<int, string>(2, "ketchup"));
            items.Add(new KeyValuePair<int, string>(5, "milk"));
            items.Add(new KeyValuePair<int, string>(5, "bread"));
            items.Add(new KeyValuePair<int, string>(5, "chocolate"));
            items.Add(new KeyValuePair<int, string>(1, "newspaper"));

The nested loop join is the simplest of all joins. It is an extremely naive algorithm: loop over one data set in an outer loop, and over the data set being joined in an inner loop (assuming the join involves only two lists). For an inner (equi-)join, whenever a match is found on the join key the resulting tuple is output. This approach has a worst-case runtime complexity of O(A·B), where A and B are the sizes of the lists/tables being joined, hence it's only feasible when the lists are fairly small. Here's an example of it in C#:

        static void NestedLoopJoin(List<KeyValuePair<int, string>> people, List<KeyValuePair<int, string>> purchases)
        {
            foreach (var n in people)
                foreach (var i in purchases)
                    if (n.Key == i.Key)
                        Console.WriteLine("Name index {0}, {1}, bought some {2} ", n.Key, n.Value, i.Value);
        }

As the name implies, the sort-merge join starts by sorting both lists/tables by the join key and then does an interleaved linear scan of both key sets. Because the key sets of both join tables are sorted, the algorithm knows when no more matches are possible for a given join key and can advance the pointers accordingly. Of course, the downside is that you have to pre-sort the data if it isn't already sorted. [On the flip side, it's a good candidate when the data is already sorted for you!]

        static void SortMergeJoin(List<KeyValuePair<int, string>> people,
                                  List<KeyValuePair<int, string>> purchases)
        {
            // first we sort both lists
            people.Sort(Compare);
            purchases.Sort(Compare);

            // then we do an interleaved linear scan to merge the sorted lists
            int i = 0, j = 0;

            while (i < people.Count && j < purchases.Count)
            {
                if (people[i].Key < purchases[j].Key)
                    i++;
                else if (people[i].Key > purchases[j].Key)
                    j++;
                else   // must be a match
                {
                    Console.WriteLine("Name index {0}, {1}, bought some {2} ",
                                       people[i].Key, people[i].Value, purchases[j].Value);
                    j++;
                }
            }
        }

        static int Compare(KeyValuePair<int, string> a, KeyValuePair<int, string> b)
        {
            if (a.Key > b.Key)
                return 1;

            if (a.Key == b.Key)
                return 0;

            return -1;
        }

Hash joins involve creating a hash table up front, in memory, for the smaller list/table in the join. You hash on the join keys of the smaller table, then loop over the larger table and, for each join key encountered, hash it and use it to probe the pre-prepared hash table for matching rows. The nice thing about this approach is that the hash-based lookup is a constant-time operation (on average). Hash joins excel when at least one of the join tables has no index or is not sorted.

        static void HashJoin(List<KeyValuePair<int, string>> people,
                             List<KeyValuePair<int, string>> purchases)
        {
            // build the hash table on the smaller list, keyed on the join key
            var peopleHash = new Dictionary<int, string>();
            foreach (var n in people)
                peopleHash.Add(n.Key, n.Value);

            // probe the hash table once per row of the larger list
            foreach (var i in purchases)
            {
                string custName;
                if (peopleHash.TryGetValue(i.Key, out custName))
                    Console.WriteLine("Name index {0}, {1}, bought some {2} ",
                                       i.Key, custName, i.Value);
            }
        }
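
If you want to see all three joins side by side, a small driver like the one below will run them over the sample lists defined earlier (the RunAllJoins name is mine, just for illustration). All three print the same seven matched rows, only in a different order: the nested loop in people order, the hash join in purchases order, and the sort-merge in key order.

        static void RunAllJoins(List<KeyValuePair<int, string>> names,
                                List<KeyValuePair<int, string>> items)
        {
            Console.WriteLine("--- nested loop join ---");
            NestedLoopJoin(names, items);

            Console.WriteLine("--- hash join ---");
            HashJoin(names, items);

            // run this one last: it sorts both input lists in place
            Console.WriteLine("--- sort-merge join ---");
            SortMergeJoin(names, items);
        }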

If you do this join inside a database, the query optimizer helps you pick which join is appropriate, but be warned: as the number of tables involved in the join grows, so too does the solution space the optimizer needs to consider. I wrote about this previously here. If you do such a join yourself on an in-memory data structure, it pays to understand the space/time complexity trade-offs involved.


Drunken Airline Passenger

Question: 228 people board a short-haul flight from Sydney to Melbourne that is fully booked - there are exactly 228 seats. All passengers are inclined to sit in their allocated seat; however, the first person to board has spent too long in the bar and ends up taking a completely random seat. All other passengers go to their allocated seat and take it, unless it is already occupied, in which case they too take a random vacant seat. What is the probability that the last passenger to board the flight finds the seat he/she was assigned unoccupied?

Answer: First we need to consider the possible outcomes from this scenario...

If the first passenger, by chance, takes the seat that was indeed allocated to him then every other passenger that boards after him will find their assigned seat. In this case, clearly the last person to board does indeed find his/her allocated seat.

If the first passenger takes a seat that was NOT allocated to him, then at least one of the remaining 227 passengers to board will not find their allocated seat. One of two things can eventuate from this:

  1. One of the following passengers finds his/her seat taken and, by chance, takes the seat that was assigned to the first passenger, in which case all subsequent passengers to board do find their assigned seat; or

  2. None of the 226 passengers who board in between, by chance, takes the seat that was assigned to the first passenger, in which case the last passenger to board can only take the seat that was assigned to the first passenger, since it will be the only one left.

So it's obvious that the last person will either get his/her allocated seat, or the seat allocated to the first person. But what is the probability of each of these outcomes? You might be tempted to get into something complicated, but it's actually quite simple. Every time a displaced passenger chooses a seat at random, the first passenger's seat and the last passenger's seat are equally likely to be picked, and the outcome is decided the moment either of them is taken. By this symmetry, the probability that the last unoccupied seat does indeed belong to the last passenger to board is exactly 1/2. Hence the answer to the puzzle is 50%.
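
If you'd rather convince yourself empirically, here's a quick Monte Carlo sketch in C# that simulates the boarding process; the class and method names and the trial count are my own, not part of the puzzle. Over a large number of trials the estimate settles at roughly 0.5.

        using System;
        using System.Linq;

        static class DrunkenPassengerSimulation
        {
            static readonly Random Rng = new Random();

            // simulate one boarding; returns true if the last passenger
            // finds his/her allocated seat unoccupied
            static bool LastPassengerGetsOwnSeat(int seats)
            {
                var occupied = new bool[seats];   // seat p is allocated to passenger p

                // the drunk first passenger takes a completely random seat
                occupied[Rng.Next(seats)] = true;

                // the in-between passengers take their own seat if free,
                // otherwise a random vacant one
                for (int p = 1; p < seats - 1; p++)
                {
                    if (!occupied[p])
                        occupied[p] = true;
                    else
                    {
                        var vacant = Enumerable.Range(0, seats).Where(s => !occupied[s]).ToArray();
                        occupied[vacant[Rng.Next(vacant.Length)]] = true;
                    }
                }

                // the last passenger gets his/her seat iff it is still vacant
                return !occupied[seats - 1];
            }

            static void Main()
            {
                const int trials = 100000;
                var wins = 0;
                for (int t = 0; t < trials; t++)
                    if (LastPassengerGetsOwnSeat(228))
                        wins++;

                Console.WriteLine("Estimated probability: {0:F3}", (double)wins / trials);
            }
        }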