
## Google Code Jam 2009: Collecting Cards

The last problem in Round 1A was "Collecting Cards", a simply-stated problem that requires some understanding of combinatorics to solve. Basically, there are C distinct cards in a collection, each pack you buy contains N distinct cards drawn at random, and you need to figure out the expected number of packs you must buy to complete the set, assuming you start with none.

Firstly, some basic observations:

• Since order within a pack is irrelevant, we are dealing with combinations, not permutations.
• Combinatorics tells us that there are CCN distinct packs possible, all of which are equally likely.
• The theoretical minimum number of packs occurs only when every pack we buy contains completely new cards (rather unlikely). That is, the absolute lower limit is ⌈C/N⌉ packs.
• We need to account for the likelihood that the packs we purchase will contain a percentage of cards we already have. Dealing with this is the crux of the problem.

Background on Combinations
Note that xCy can also be written as:

xCy = x! / (y! × (x-y)!)

and, choosing y items from a population of x items is equivalent to choosing the x-y items to leave out. That is:

xCy = xC(x-y)

Clearly there is only 1 combination of x items in a population of x, and by the above rule there is only 1 combination of zero items (that confuses many people!). That is:

xCx = xC0 = 1

Problem Solution
To solve this problem let us define f(k) to be the expected number of packs we still need to buy when we are k cards short (i.e. we already have C-k distinct cards). The answer to the problem is then f(C). It is also obvious that f(0) = 0, and that f(C) = 1 + f(C-N), since the first pack we buy is guaranteed to contain N new cards. What we now need is a generic formula for f(k) that we can apply recursively.

When we need k (distinct) cards, it follows that we must already have C-k distinct cards. Thus purchasing an additional pack of N cards can result in 0, 1, 2, 3, ..., min(N, k) new cards being collected. If we let pj,m be the probability of getting j new cards from an additional pack of N randomly-drawn cards, given that we already have m distinct cards, then we can say:

f(k) = 1 + Σj=0..min(N,k) pj,C-k × f(k-j)

To calculate pj,m we simply count the packs that contain j cards we don't already have and N-j cards that we do have, and divide by the total number of possible packs. In other words:

pj,m = (C-m)Cj × mC(N-j) / CCN

To efficiently compute the combinations we make use of the well-known property of Pascal's Triangle: the [i,j] coefficient is the sum of the 2 coefficients "above" it. This recursive definition can be stated formally as:

iCj = (i-1)C(j-1) + (i-1)Cj,  with iC0 = iCi = 1

Since we know that N <= C <= 40 we can go ahead and pre-compute these, storing them in a 2-dimensional array.

Substituting for pj,C-k in our formula for f(k) yields:

f(k) = 1 + Σj=0..min(N,k) ( kCj × (C-k)C(N-j) / CCN ) × f(k-j)

But there is a problem here. If you use the above formula as-is you'll hit a stack overflow, as I did (dohh!), because the j = 0 term puts f(k) on both sides of the equation, creating infinite recursion. To remedy this we revise the equation as follows (we take the j = 0 case out of the summation and rearrange so that f(k) appears only on the LHS):

f(k) = ( 1 + Σj=1..min(N,k) pj,C-k × f(k-j) ) / ( 1 - p0,C-k )

Holy catfish, Batman! Who would have thought collecting cards was so complicated?

Here's the C# code to implement the above mess, errr equation:

```
using System;
using System.IO;

namespace CollectingCards
{
    class Program
    {
        private static readonly ulong[,] choose = new ulong[41, 41];
        private static double?[] solved;
        private static int C, N;

        static void Main()
        {
            const string basePath = "";
            var infile = new StreamReader(basePath + "small.txt");
            var outfile = new StreamWriter(basePath + "output.txt");

            // pre-compute the combinations using Pascal's Triangle
            for (int i = 0; i < 41; i++)
            {
                choose[i, i] = 1;
                choose[i, 0] = 1;
                for (int j = 1; j < i; j++)
                    choose[i, j] = choose[i - 1, j] + choose[i - 1, j - 1];
            }

            int testCases = Int32.Parse(infile.ReadLine());
            for (int caseNo = 1; caseNo <= testCases; caseNo++)
            {
                string[] data = infile.ReadLine().Split(' ');
                C = Int32.Parse(data[0]);
                N = Int32.Parse(data[1]);
                solved = new double?[41];

                outfile.WriteLine("Case #{0}: {1:#.00000}", caseNo, Solve(C));
            }
            infile.Close();
            outfile.Close();
        }

        static double Solve(int k)
        {
            if (k == 0)
                return 0;

            if (solved[k].HasValue)
                return solved[k].Value;

            // factor = 1 / (1 - p(0, C-k))
            double factor = 1.0 / (1.0 - (double)choose[C - k, N] / choose[C, N]);
            double ans = 1;
            for (int j = 1; j <= Math.Min(N, k); j++)
            {
                if (k - j == 0) continue; // f(0) = 0 contributes nothing
                double p = (double)(choose[k, j] * choose[C - k, N - j]) / choose[C, N];
                if (p != 0)
                    ans += p * Solve(k - j);
            }
            ans *= factor;
            solved[k] = ans;
            return ans;
        }
    }
}
```

And here's some C++ code to do the same:

```
#include <algorithm>
#include <cstdio>
#include <iomanip>
#include <iostream>
#include <string>

using namespace std;

long long choose[41][41];
int C;
int N;
double solved[41];

double Solve(int k)
{
    if (k == 0)
        return 0;

    if (solved[k] != -1)
        return solved[k];

    // factor = 1 / (1 - p(0, C-k))
    double factor = 1.0 / (1.0 - (double)choose[C - k][N] / choose[C][N]);
    double ans = 1;
    int minVal = min(k, N);
    for (int j = 1; j <= minVal; j++)
    {
        if (k - j == 0) continue; // f(0) = 0 contributes nothing
        double p = (double)(choose[k][j] * choose[C - k][N - j]) / choose[C][N];
        if (p != 0)
            ans += p * Solve(k - j);
    }
    ans *= factor;
    solved[k] = ans;
    return ans;
}

int main()
{
    string basePath = "";
    string inFile = basePath + "large.txt";
    string outFile = basePath + "output.txt";

    freopen(inFile.c_str(), "r", stdin);
    freopen(outFile.c_str(), "w", stdout);

    // pre-compute the combinations using Pascal's Triangle
    for (int i = 0; i < 41; i++)
    {
        choose[i][i] = 1;
        choose[i][0] = 1;
        for (int j = 1; j < i; j++)
            choose[i][j] = choose[i - 1][j] + choose[i - 1][j - 1];
    }

    cout << fixed << setprecision(5);
    int testCases; cin >> testCases;
    for (int caseNo = 1; caseNo <= testCases; caseNo++)
    {
        cin >> C >> N;
        for (int i = 0; i < 41; i++)
            solved[i] = -1;

        cout << "Case #" << caseNo << ": " << Solve(C) << endl;
    }
    return 0;
}
```


## Query Plan Optimisation and Catalan Numbers

The only reason databases store data is so you can later query it. To query the data you give the database a query, usually in the form of SQL. The database will then examine the query, check whether it already has a cached query plan that will work for it, and if not create a query plan before executing the query with that plan. There is a small overhead to pay when creating query plans, and the quality of a plan is influenced by the parameters used in the query. It is for these reasons that there are fanatical arguments over whether it is more performant to use stored procedures (pre-defined queries that already have a plan created), or free-form SQL that has its query plan recalculated every time using the specific parameters given. I'm not going to open that can of worms, but it did get me thinking about query optimisation...

How many plans does the query optimiser need to evaluate and how does it do so?

It is reasonable to assume that the difficulty of optimising a query is proportional to the solution space of alternative execution plans. Since the solution space depends on the complexity of the input query, let us consider a query which joins n tables and examine it formally.

Let J(n) denote the number of different join orderings when joining n tables.

Let T(n) denote the number of different binary tree shapes that can be found for a collection of n leaf nodes.

It should be intuitive that J(n) = T(n) × n! because there are n! ways to organise the leaf nodes of a binary tree, and there are T(n) different tree shapes. But how do we calculate T(n)? First recall that a binary tree has at most 2 branches, which are themselves binary sub-trees. Thus, the number of shapes for a binary tree is the product of the number of shapes for each of its sub-trees, summed over all possible splittings of the leaf nodes between the left and right sub-trees:

T(n) = Σi=1..n-1 T(i) × T(n-i)

where T(1) = 1 is the base case.

It turns out that T(n) = C(n-1), where C(n) is the n-th Catalan number:

C(n) = (2n)! / ((n+1)! × n!)

Substituting T(n) = C(n-1), we get:

J(n) = n! × C(n-1) = n! × (2n-2)! / (n! × (n-1)!) = (2n-2)! / (n-1)!

The table below shows that this equation grows extremely fast!

| n | J(n) |
|---|------|
| 1 | 1 |
| 2 | 2 |
| 3 | 12 |
| 4 | 120 |
| 5 | 1,680 |
| 6 | 30,240 |
| 7 | 665,280 |
| 8 | 17,297,280 |
| 9 | 518,918,400 |
| 10 | 17,643,225,600 |

In other words, if you have, say, 7 tables being joined, the query optimiser has 665,280 possible orderings to consider and therefore needs to evaluate quite a large number of possible plans. Of course there is more to query plan optimisation than just working out the table join order, including proper choices of access methods and indexes, but the point is that the solution space in this optimisation problem grows extremely quickly as queries become more complex.

Typically, optimisation is performed using dynamic programming, since the problem has the classic traits of overlapping sub-problems and optimal substructure, but given the explosion of the search space shown above it's not feasible to evaluate all possible plans. In other words, the optimiser is sub-optimal, but it represents a best guess.

The seminal paper on cost-based query plan optimisation is Access Path Selection in a Relational Database Management System by Pat Selinger et al (pages 103-114 in this excellent manuscript). Be warned - it has the dot matrix printer font that makes it hard not to notice the paper was written in 1979!


## Google Code Jam: Crop Triangles

Round 1B kicks off with a problem called "Crop Triangles". You are given a grid of integer-valued coordinates and need to find out how many distinct triangles can be formed that have an integer-valued midpoint. The obvious approach - construct every possible triangle and check its midpoint - is not really feasible as the problem size increases. With n grid points, and 3 points needed for a triangle, the number of triangles to examine in the brute-force approach is nC3 = n(n-1)(n-2)/6, so when n = 100,000, the upper limit of n in the large problem, the triangle count is approx. 1.7 × 10^14. Ergo the brute-force approach is not feasible as it has running time O(n^3). So how can we reduce the search space? A little bit of number theory comes in very handy here...

We are told that: (i) all coordinate pairs are integer values, (ii) the midpoint of a triangle must be an integer value, and (iii) the x- and y-coordinates of the midpoint are the averages of the x- and y-coordinates of the 3 vertices respectively. So in mathematical notation (focusing on just the x-coordinates):

(x1 + x2 + x3) / 3 is an integer  ⟺  (x1 + x2 + x3) mod 3 = 0  ⟺  ((x1 mod 3) + (x2 mod 3) + (x3 mod 3)) mod 3 = 0

The same logic applies to the y-coordinates, so we need only examine the coordinates mod 3, which I'm referring to herein as the "adjusted coordinates". Adjusting with a modulus-3 operation effectively reduces the x- and y-coordinates to values in the set {0, 1, 2}. The following matrix depicts this graphically.

|             | y mod 3 = 0 | y mod 3 = 1 | y mod 3 = 2 |
|-------------|-------------|-------------|-------------|
| x mod 3 = 0 | C0,0 | C0,1 | C0,2 |
| x mod 3 = 1 | C1,0 | C1,1 | C1,2 |
| x mod 3 = 2 | C2,0 | C2,1 | C2,2 |

Since the sum of the adjusted x-values, and likewise the sum of the adjusted y-values, must each be a multiple of 3 (the last equation given above), we only need to group together either:

1. Three x-values that have the same mod 3 result, since 0+0+0 = 0, 1+1+1 = 3 and 2+2+2 = 6 are all multiples of 3; or
2. Three x-values where each value has a different mod 3 result, since 0 + 1 + 2 = 3 is a multiple of 3.

Obviously the same logic applies to the y-values, so our approach is to create a 3×3 matrix of counts of the mod-3 adjusted values. We iterate over all the given points and increment the appropriate counter in the matrix after applying the mod 3 operation to both the x- and y-values. Then we examine the resulting 3×3 matrix of counts and produce the final answer by adding the number of 3-point combinations that can be made:

1. by items within each cell;
2. by items across columns for each row. This entails taking one point from each column for a given row in the matrix;
3. by items across rows for each column. This entails taking one point from each row for a given column in the matrix;
4. by items across both rows and columns. This entails taking exactly one point from each row such that all three columns are also distinct.

Here's my C# code:

```
using System;
using System.IO;

namespace CropTriangles
{
    class Program
    {
        static void Main()
        {
            string basePath = @"";
            var infile = new StreamReader(basePath + "large.txt");
            var outfile = new StreamWriter(basePath + "output.txt");

            int testCases = Int32.Parse(infile.ReadLine());
            for (int caseNo = 1; caseNo <= testCases; caseNo++) // note this is 1-based
            {
                // read in the input data
                string[] data = (infile.ReadLine()).Split(new char[] { ' ' }, StringSplitOptions.None);

                int treeCount = Int32.Parse(data[0]);
                long A = Int64.Parse(data[1]);
                long B = Int64.Parse(data[2]);
                long C = Int64.Parse(data[3]);
                long D = Int64.Parse(data[4]);

                long x = Int64.Parse(data[5]);
                long y = Int64.Parse(data[6]);
                long M = Int64.Parse(data[7]);

                long[,] bins = new long[3, 3];

                bins[x % 3, y % 3]++;

                // generate the remaining points and drop each into its mod-3 bin
                for (int i = 1; i < treeCount; i++)
                {
                    checked
                    {
                        x = (A * x + B) % M;
                        y = (C * y + D) % M;

                        bins[x % 3, y % 3]++;
                    }
                }

                // get 3-point combinations within each cell
                long answer = 0;
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 3; j++)
                        answer += (bins[i, j] * (bins[i, j] - 1) * (bins[i, j] - 2) / 6);

                // get column and row products
                for (int i = 0; i < 3; i++)
                {
                    answer += (bins[i, 0] * bins[i, 1] * bins[i, 2]);
                    answer += (bins[0, i] * bins[1, i] * bins[2, i]);
                }

                // get combinations taking one point from each row, all columns distinct
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 3; j++)
                        for (int k = 0; k < 3; k++)
                            if (i != j && i != k && j != k)
                                answer += (bins[0, i] * bins[1, j] * bins[2, k]);

                // write out the results