Uncategorized

GCJ Africa 2010 - Store Credit

The Store Credit problem in the qualification round of Google Code Jam Africa 2010 is relatively straight forward. You are given a collection of priced items and a credit limit. You are also told that 2 items will exactly add up to given credit limit thus you must find which two items can be purchased. To solve it, just keep a Dictionary (hash table) of prices with the index of the item that carries that price as you traverse the price listing. By doing this you can work out - in constant time - if the "complementary price", that is the price that takes you up to the credit limit, is in the list and if so what it's index is. Furthermore the program bails out as soon as it has found the item pair that satisfies the credit limit. We know this approach must terminate with a feasible solution because the problem states that there is guaranteed to be exactly one pair that solves the problem. Just be careful with your indexes here. Collections in C# and Java are zero-based, but this item list is a 1-based list [you can tell that from the examples given in the problem definition] so you need to be a bit careful with your indexes.

Here's some simple C# code that solves it [running time is O(n) for a list of n items]:

using System;

using System.Collections.Generic;

using System.IO;

 

namespace GoogleCodeJam.Africa2010.StoreCredit

{

    class Program

    {

        static void Main()

        {

            const string basePath = @"C:\";

            var infile = new StreamReader(basePath + "large.txt");

            var outfile = new StreamWriter(basePath + "output.txt");

 

            int n = Int32.Parse(infile.ReadLine());

            for (int caseNo = 1; caseNo <= n; caseNo++)

            {

                int C = Int32.Parse(infile.ReadLine()); // credit

                int I = Int32.Parse(infile.ReadLine()); // item count

                var P = infile.ReadLine().Split(new[] { ' ' }); // prices

                var priceIndexMap = new Dictionary<int, int>();

                int k = -1;

                int j = 0;

                for (; j < P.Length; j++ )

                {

                    int currentPrice = Int32.Parse(P[j]);

                    if (priceIndexMap.ContainsKey(C - currentPrice))

                    {

                        k = priceIndexMap[C - currentPrice];

                        break;

                    }

                    if (!priceIndexMap.ContainsKey(currentPrice))

                        priceIndexMap.Add(currentPrice, j + 1);

                }

                j++;

                if (j < k)

                    outfile.WriteLine("Case #{0}: {1} {2}", caseNo, j, k);

                else

                    outfile.WriteLine("Case #{0}: {1} {2}", caseNo, k, j);

            }

            infile.Close();

            outfile.Close();

        }

    }

}

Uncategorized

Burrows-Wheeler Transform: Part 1

The Burrows-Wheeler Transform is a well-known text transformation that is used to pre-condition strings prior to compression. It's used by the popular bzip2 utility in Linux. To be more specific, it is a lossless block-sorting algorithm. It was invented by Mike Burrows (now at Google) and David Wheeler who advise that the size of the input block must be a few KBs to achieve good compression. It is a surprising simple algorithm to describe and has the remarkable property of being easily reversible. Let's look at some code in C#...

The Encode() method in lines 6-21 takes a given string and creates every cyclic permutation of the string in lines 9-12. It then does a lexicographical sort of this new list of strings. I chose an ordinal search to achieve the same interim results published on Wikipedia using "^BANANA@" as the string to transform (here "@" is an EOF marker). If you use the default Array.Sort() parameters in .NET the sort will be culture-sensitive which means the caret sign "^" will be ordered before alpha characters for most English-speaking cultures. See this article for a review of linguistic and ordinal string operations. But in reality you can use any sort variant here as long as you use the same approach in the Decode() method.

Lines 16-20 then create a new string of the same length by concatenating the last character in each string in the sorted list. The resulting text contains all of the characters that were in the original string differing only in their ordering. It is in fact a collection of prefix characters to the first characters of each word in the sorted list.

To decode the string we perform a series of Add+Sort operations as follows: take the given (transformed) string and make a list of new strings making the first character of each respective string in the list equal to the characters of the given string, then sort them. We then start the process again adding the characters of the given string as their new first character, sort them, etc until you have strings of the same length as the original. At that point you can identify the original by looking for the string with the EOF marker in it's rightful place - at the end.

    1 public class BurrowsWheelerTransform

    2 {

    3     public const string EOF = "\xFF";

    4 

    5     // unoptimised code

    6     public string Encode(string s)

    7     {

    8         s += EOF;

    9         int n = s.Length;

   10         var permutations = new string[n];

   11         for (int i = 0; i < n; i++)

   12             permutations[i] = s.Substring(n - i, i) + s.Substring(0, n - i);

   13 

   14         Array.Sort(permutations, String.CompareOrdinal); // uses quicksort.

   15 

   16         var result = new StringBuilder();

   17         foreach (var p in permutations)

   18             result.Append(p[n - 1]);

   19 

   20         return result.ToString();

   21     }

   22 

   23     public string Decode(string s)

   24     {

   25         int n = s.Length;

   26         var a = new List<string>();

   27         for (int j = 0; j < n; j++)

   28         {

   29             for (int i = 0; i < n; i++)

   30             {

   31                 if (j > 0)

   32                     a[i] = s[i] + a[i];

   33                 else

   34                     a.Add(s[i].ToString());

   35             }

   36             a.Sort(String.CompareOrdinal);

   37         }

   38         foreach (var w in a)

   39             if (w[n - 1].ToString() == EOF)

   40                 return w;

   41 

   42         throw new Exception();

   43     }

   44 }

Since the BWT is designed to work on files in memory we need to optimize this naive implementation to be more memory efficient. If the text block is large, as the authors recommend, then the list of all permutations created in the Encode() method is going to be wasteful with memory. This issue, and a few others will be something I'll explore in Part 2 of this series...

Uncategorized

Join Algorithms Illustrated

A work colleague recently asked me about efficient inner join algorithms that could be applied to in-memory data collections. My mind immediately went back to the common algorithms that most modern databases usually employ:

  • Nested Loop Join
  • Sort-Merge Join
  • Hash Join

Let's explain each one of these using some simple C# 4.0 code to illustrate. First, consider we have 2 lists of key-value pairs. The first is a list of customerID and name. The second list is a list of customerID and item purchased:

            // a list of customers with corresponding customerID
            var names = new List<KeyValuePair<int,string>>();
            names.Add(new KeyValuePair<int, string>(1, "bob"));
            names.Add(new KeyValuePair<int, string>(2, "jim"));
            names.Add(new KeyValuePair<int, string>(3, "oscar"));
            names.Add(new KeyValuePair<int, string>(5, "mia"));

            // a list of things the customers purchased. integer value is customerID
            var items = new List<KeyValuePair<int, string>>();
            items.Add(new KeyValuePair<int, string>(2, "beer"));
            items.Add(new KeyValuePair<int, string>(2, "chips"));
            items.Add(new KeyValuePair<int, string>(2, "ketchup"));
            items.Add(new KeyValuePair<int, string>(5, "milk"));
            items.Add(new KeyValuePair<int, string>(5, "bread"));
            items.Add(new KeyValuePair<int, string>(5, "chocolate"));
            items.Add(new KeyValuePair<int, string>(1, "newspaper"));

Nested loop joins are the simplest of all joins. It is an extremely naive algorithm that loops over one data set in an outer loop and loops over the data set to join on in an inner loop (assuming the join involves only 2 items). For an inner (equi-) join, when it finds matches on the join key it outputs the resulting tuples. This approach has worst case run time complexity of O(A.B) where A and B are the sizes of the lists/tables being joined, hence it's only feasible when the lists are fairly small. Here's an example of it in C#:

        static void NestedLoopJoin(List<KeyValuePair<int, string>> people, List<KeyValuePair<int, string>> purchases)
        {
            foreach (var n in people)
                foreach (var i in purchases)
                    if (n.Key == i.Key)
                        Console.WriteLine("Name index {0}, {1}, bought some {2} ", n.Key, n.Value, i.Value);
        }

As the name implies, the sort-merge join starts by sorting both lists/tables by the join key and then doing an interleaved linear scan of both join key sets. Because the key sets of the join tables are both sorted the algorithm knows when no more inner joins are possible for a given join key and can advance the set pointers accordingly. Of course, the downside is that you have to pre-sort the data if it isn't already sorted. [On the flip side, it's a good candidate when the data is already sorted for you!]

        static void SortMergeJoin(List<KeyValuePair<int, string>> people,
                                  List<KeyValuePair<int, string>> purchases)
        {
            // first we sort both lists
            people.Sort(Compare);
            purchases.Sort(Compare);

            // then we do an interleaved linear scan to merge the sorted lists
            int i = 0, j = 0;

            while(i<people.Count & j<purchases.Count)
            {
                if (people[i].Key < purchases[j].Key)
                    i++;
                else if (people[i].Key > purchases[j].Key)
                    j++;
                else   // must be a match
                {
                    Console.WriteLine("Name index {0}, {1}, bought some {2} ",
                                       people[i].Key, people[i].Value, purchases[j].Value);
                    j++;
                }
            }
        }

        static int Compare(KeyValuePair<int, string> a, KeyValuePair<int, string> b)
        {
            if (a.Key > b.Key)
                return 1;

            if (a.Key == b.Key)
                return 0;

            return -1;
        }

Hash joins involve creating a hash table up front, in memory, for the smaller list/table in the join. You hash on the join keys found in the smaller table so that you can loop over the larger table in the join and for each encountered join key take the hash of it and use this to interrogate the pre-prepared hash table for joining rows. The nice thing about this approach is that hash-based lookup component is a constant time operation. Hash joins excel when at-least one of the join tables has no index or is not sorted.

        static void HashJoin(List<KeyValuePair<int, string>> people,
                             List<KeyValuePair<int, string>> purchases)
        {
            var peopleHash = new Dictionary<int,string>();
            foreach(var n in people)
                peopleHash.Add(n.Key,n.Value);

            foreach (var i in purchases)
                if (peopleHash.ContainsKey(i.Key))
                {
                    string custName;
                    peopleHash.TryGetValue(i.Key, out custName);
                    Console.WriteLine("Name index {0}, {1}, bought some {2} ",
                                       i.Key, custName , i.Value);
                }
        }

If you do this join inside a database the query optimizer helps you pick which join is appropriate but be warned: as the number of tables involved in the join grows so too does the solution space the optimizer needs to consider. I wrote about this previously here. If you do such a join yourself on an in-memory data structure, it pays to understand the space/time complexity trade-offs involved.

« Prev - Next »