Uncategorized

Join Algorithms Illustrated

A work colleague recently asked me about efficient inner join algorithms that could be applied to in-memory data collections. My mind immediately went back to the common algorithms that most modern databases usually employ:

  • Nested Loop Join
  • Sort-Merge Join
  • Hash Join

Let's explain each one of these using some simple C# 4.0 code to illustrate. First, consider we have 2 lists of key-value pairs. The first is a list of customerID and name. The second list is a list of customerID and item purchased:

            // a list of customers with corresponding customerID
            var names = new List<KeyValuePair<int,string>>();
            names.Add(new KeyValuePair<int, string>(1, "bob"));
            names.Add(new KeyValuePair<int, string>(2, "jim"));
            names.Add(new KeyValuePair<int, string>(3, "oscar"));
            names.Add(new KeyValuePair<int, string>(5, "mia"));

            // a list of things the customers purchased. integer value is customerID
            var items = new List<KeyValuePair<int, string>>();
            items.Add(new KeyValuePair<int, string>(2, "beer"));
            items.Add(new KeyValuePair<int, string>(2, "chips"));
            items.Add(new KeyValuePair<int, string>(2, "ketchup"));
            items.Add(new KeyValuePair<int, string>(5, "milk"));
            items.Add(new KeyValuePair<int, string>(5, "bread"));
            items.Add(new KeyValuePair<int, string>(5, "chocolate"));
            items.Add(new KeyValuePair<int, string>(1, "newspaper"));

Nested loop joins are the simplest of all joins. It is an extremely naive algorithm that loops over one data set in an outer loop and loops over the data set to join on in an inner loop (assuming the join involves only 2 items). For an inner (equi-) join, when it finds matches on the join key it outputs the resulting tuples. This approach has worst case run time complexity of O(A.B) where A and B are the sizes of the lists/tables being joined, hence it's only feasible when the lists are fairly small. Here's an example of it in C#:

        static void NestedLoopJoin(List<KeyValuePair<int, string>> people, List<KeyValuePair<int, string>> purchases)
        {
            foreach (var n in people)
                foreach (var i in purchases)
                    if (n.Key == i.Key)
                        Console.WriteLine("Name index {0}, {1}, bought some {2} ", n.Key, n.Value, i.Value);
        }

As the name implies, the sort-merge join starts by sorting both lists/tables by the join key and then doing an interleaved linear scan of both join key sets. Because the key sets of the join tables are both sorted the algorithm knows when no more inner joins are possible for a given join key and can advance the set pointers accordingly. Of course, the downside is that you have to pre-sort the data if it isn't already sorted. [On the flip side, it's a good candidate when the data is already sorted for you!]

        static void SortMergeJoin(List<KeyValuePair<int, string>> people,
                                  List<KeyValuePair<int, string>> purchases)
        {
            // first we sort both lists
            people.Sort(Compare);
            purchases.Sort(Compare);

            // then we do an interleaved linear scan to merge the sorted lists
            int i = 0, j = 0;

            while(i<people.Count & j<purchases.Count)
            {
                if (people[i].Key < purchases[j].Key)
                    i++;
                else if (people[i].Key > purchases[j].Key)
                    j++;
                else   // must be a match
                {
                    Console.WriteLine("Name index {0}, {1}, bought some {2} ",
                                       people[i].Key, people[i].Value, purchases[j].Value);
                    j++;
                }
            }
        }

        static int Compare(KeyValuePair<int, string> a, KeyValuePair<int, string> b)
        {
            if (a.Key > b.Key)
                return 1;

            if (a.Key == b.Key)
                return 0;

            return -1;
        }

Hash joins involve creating a hash table up front, in memory, for the smaller list/table in the join. You hash on the join keys found in the smaller table so that you can loop over the larger table in the join and for each encountered join key take the hash of it and use this to interrogate the pre-prepared hash table for joining rows. The nice thing about this approach is that hash-based lookup component is a constant time operation. Hash joins excel when at-least one of the join tables has no index or is not sorted.

        static void HashJoin(List<KeyValuePair<int, string>> people,
                             List<KeyValuePair<int, string>> purchases)
        {
            var peopleHash = new Dictionary<int,string>();
            foreach(var n in people)
                peopleHash.Add(n.Key,n.Value);

            foreach (var i in purchases)
                if (peopleHash.ContainsKey(i.Key))
                {
                    string custName;
                    peopleHash.TryGetValue(i.Key, out custName);
                    Console.WriteLine("Name index {0}, {1}, bought some {2} ",
                                       i.Key, custName , i.Value);
                }
        }

If you do this join inside a database the query optimizer helps you pick which join is appropriate but be warned: as the number of tables involved in the join grows so too does the solution space the optimizer needs to consider. I wrote about this previously here. If you do such a join yourself on an in-memory data structure, it pays to understand the space/time complexity trade-offs involved.

Uncategorized

Sorting Algorithms in F#

A while ago I reviewed basic sorting algorithms in Java and C#. I've been playing a lot with F# over the past 6 months so I'd thought it would be worthwhile whipping up the same algorithms in F#.

The earlier posts with Java and C# implementations are here:
Sorting Fundamentals: Part 1
Sorting Fundamentals: Part 2

Bubble Sort

Everyone knows this is an inefficient algorithm but none-the-less it has "educational value". In my F# implementation I've defined 2 recursive functions, sink and outer. Both functions accept a list and return a new list. The sink function makes a single pass through the list sinking the largest value to the rightmost/last position in the list using simple pattern matching semantics. In lines 7-8 the cons operator (::) is used to separate the first 2 items in the list from the remainder of the list so a greater than pairwise comparison can be make on the leading pair. Note that lists, and indeed all identifiers not explicitly marked as mutable, are immutable so a new list is created rather than returning the original list. In other words, it is not in-place.

The second function, outer, is logically equivalent to the outer loop in the C#/Java implementation that you've no doubt seen before. We pass a list and the list size to this function. The function sinks one value using the function described above, then calls itself recursively on a new list that is a copy of the original list but has one fewer items - the "sunken" element is removed from the end.

    1 #light

    2 open System

    3 

    4 let rec sink ls =

    5     match ls with

    6     | [] -> []

    7     | x::y::rest when x > y -> y::(sink (x::rest))

    8     | x::rest -> x::(sink rest)

    9 

   10 let rec bubbleSort ls =

   11   let rec outer ls lsSize =

   12     match lsSize with

   13     | 0 -> ls

   14     | x -> outer (sink ls) (x - 1)

   15   outer ls (Seq.length ls)

   16 

   17 let mylist = [3;5;1;2;2;78;9;8;43;5;-6]

   18 let mylistsorted = (bubbleSort mylist)

   19 print_any mylistsorted

   20 Console.ReadLine() |> ignore

Insertion Sort

The insertion sort algorithm works by inserting each item into an already sorted sub-list. In F# I've implemented it using 2 recursive functions, insert and insertionSort. Both functions take a list and return a new list. The insert function takes an item and a list and creates a new list which is a copy of the one passed but it inserts the passed item into the new list. This function is designed to take a sorted list or an empty list - which then becomes a sorted list.

The insertionSort function picks off the first item in the list and inserts it into the already-sorted sublist .

    1 #light

    2 open System

    3 

    4 let rec insert x ls = 

    5     match ls with

    6     | []    -> [x]

    7     | y::rest -> if x <= y then x::y::rest

    8                  else y::(insert x rest)

    9 

   10 let rec insertionSort ls =

   11     match ls with

   12     | []    -> []

   13     | x::rest -> insert x (insertionSort rest)     

   14 

   15 let mylist = [3;5;1;2;2;78;9;8;43;5;-6]

   16 let mylistsorted = (insertionSort mylist)

   17 print_any mylistsorted

   18 Console.ReadLine() |> ignore

QuickSort

The quicksort algorithm is a classic divide-and-conquer algorithm. It selects a pivot from the list then creates 2 sub-lists containing all items less than, and greater than the pivot. The result is a new list which is the concatenation of the less-than list, the pivot, and the greater-than list. And, in order to sort the 2 sub-lists it calls itself recursively.

#light
open System
 
let rec quickSort ls =
  match ls with
  | [] -> []
  | p::r -> 
     quickSort [for i in r do if (compare i p) <= 0 then yield i] 
     @ [p] 
     @ quickSort [for i in r do if (compare i p) > 0 then yield i]          
 
let mylist = [3;5;1;2;2;78;9;8;43;5;-6]
let mylistsorted = (quickSort mylist)
print_any mylistsorted
Console.ReadLine() |> ignore

MergeSort

Of all the sort algorithms, quicksort is perhaps the most amenable to functional programming as it can exploit parallelization. If I had 10 billion records in numerous files over many machines a modified version of mergeSort is what you'd want top use, but that's a topic for another post...

    1 #light

    2 open System

    3 

    4 let rec splitList ls n =

    5     match ls with

    6     | [] -> [],[]

    7     | ls when n=0 -> ([],ls)

    8     | head::tail ->

    9         let l1, l2 = splitList tail (n-1);

   10         head::l1, l2

   11 

   12 let rec mergeLists ls1 ls2 =

   13     match ls1, ls2 with

   14     | [],[] -> []

   15     | [],ls2 -> ls2

   16     | ls1,[] -> ls1

   17     | h1::t1, h2::t2 when h1 <= h2 -> h1 :: mergeLists t1 ls2

   18     | h1::t1, h2::t2 -> h2 :: mergeLists ls1 t2

   19 

   20 let rec mergeSort ls =

   21     match ls with

   22     | [] -> []

   23     | [x] -> [x]

   24     | ls ->

   25         let ls1, ls2 = splitList ls (ls.Length/2)

   26         mergeLists (mergeSort ls1) (mergeSort ls2)

   27 

   28 let mylist = [3;5;1;2;2;78;9;8;43;5;-6]

   29 let mylistsorted = (mergeSort mylist)

   30 print_any mylistsorted

   31 Console.ReadLine() |> ignore

As you can see, F# code is succinct and highly expressive. I liken it to Python in it's terseness without the performance penalty! The beauty of F# is a lightweight syntax where whitespace is used instead of block- and scope-defining curly braces, and the clever use of type inference within a strongly-typed type system, so I needn't get bogged down defining explicit types which makes the code's purpose clearer.

Uncategorized

Query Plan Optimisation and Catalan Numbers

The only reason databases store data is so you can later query that data. To query the data you give the database a query, usually in the form of SQL. The database will them examine the query, check to see if it already has a cached query plan that will work for this query, and if not go ahead a create a query plan before executing the query with that plan. There is a small overhead to pay when creating query plans, and the quality of the query plan is influenced by the parameters used in the query. It is for these reasons that there are fanatical arguments over whether it is more performant to use stored procedures (pre-defined queries that have already had a plan created), or free form SQL that has it's query plan recalculated every time using the specific parameters given. I'm not going to open that can of worms but it did get me thinking about query optimisation...

How many plans does the query optimiser need to evaluate and how does it do so?

It is reasonable to assume that the difficultly of optimizing a query is proportional to the solution space for alternative execution plans. Since the solution space is dependent upon the complexity of the input query let us consider a query which joins n tables and examine it formally.

Let J(n) denote the number of different join orderings when joining n tables.

Let T(n) denote the number of different binary tree shapes that can be found for a collection of n leaf nodes.

It should be intuitive that J(n) = T(n) * n! because there are n! ways to organise the leaf nodes of a binary tree, and there are T(n) different tree shapes. But how do we calculate T(n)? First recall that a binary tree has at most 2 branches which are in fact binary sub-trees. Thus, the number of permutations for a binary tree is the product of the number of permutations for each of it's sub-trees summed over all possible leaf node splittings between the left and right sub-trees:

where T(1) = 1 is the base case.

It turns out that T(n) = C(n-1), where C(n) = the n-th Catalan number:

Substituting T(n) = C(n-1), we get:

The table below shows that this equation grows extremely fast!

n J(n)
1 1
2 2
3 12
4 120
5 1680
6 30,240
7 665,280
8 17,297,280
9 518,918,400
10 17,643,225,600

In other words, if you have, say 7 tables being joined, the query optimiser has 665,280 possible ways of doing this and therefore needs to evaluate quite a large number of possible plans. Of course there are more things to query plan optimisations than just working out table join order,including proper choices of access methods and indexes, but the point is the solution space in this optimisation problem grows extremely quickly as queries become more complex.

Typically, optimisation is performed using dynamic programming since the problem has the classic traits of overlapping sub-problems and optimal substructure but given the explosion of search space shown above it's not feasible to evaluate all possible plans. In other words, the optimiser is sub-optimal but it represents a best guess.

The seminal paper on cost-based query plan optimisation is Access Path Selection in a Relational Database Management System by Pat Selinger et al (pages 103-114 in this excellent manuscript). Be warned - it has the dot matrix printer font that makes it hard not to notice the paper was written in 1979!

« Prev - Next »