The Burrows-Wheeler Transform is a well-known text transformation that is used to pre-condition strings prior to compression. It's used by the popular bzip2 utility in Linux. To be more specific, it is a lossless block-sorting algorithm. It was invented by Mike Burrows (now at Google) and David Wheeler who advise that the size of the input block must be a few KBs to achieve good compression. It is a surprising simple algorithm to describe and has the remarkable property of being easily reversible. Let's look at some code in C#...
The Encode() method in lines 6-21 takes a given string and creates every cyclic permutation of the string in lines 9-12. It then does a lexicographical sort of this new list of strings. I chose an ordinal search to achieve the same interim results published on Wikipedia using "^BANANA@" as the string to transform (here "@" is an EOF marker). If you use the default Array.Sort() parameters in .NET the sort will be culture-sensitive which means the caret sign "^" will be ordered before alpha characters for most English-speaking cultures. See this article for a review of linguistic and ordinal string operations. But in reality you can use any sort variant here as long as you use the same approach in the Decode() method.
Lines 16-20 then create a new string of the same length by concatenating the last character in each string in the sorted list. The resulting text contains all of the characters that were in the original string differing only in their ordering. It is in fact a collection of prefix characters to the first characters of each word in the sorted list.
To decode the string we perform a series of Add+Sort operations as follows: take the given (transformed) string and make a list of new strings making the first character of each respective string in the list equal to the characters of the given string, then sort them. We then start the process again adding the characters of the given string as their new first character, sort them, etc until you have strings of the same length as the original. At that point you can identify the original by looking for the string with the EOF marker in it's rightful place - at the end.
1 public class BurrowsWheelerTransform
3 public const string EOF = "\xFF";
5 // unoptimised code
6 public string Encode(string s)
8 s += EOF;
9 int n = s.Length;
10 var permutations = new string[n];
11 for (int i = 0; i < n; i++)
12 permutations[i] = s.Substring(n - i, i) + s.Substring(0, n - i);
14 Array.Sort(permutations, String.CompareOrdinal); // uses quicksort.
16 var result = new StringBuilder();
17 foreach (var p in permutations)
18 result.Append(p[n - 1]);
20 return result.ToString();
23 public string Decode(string s)
25 int n = s.Length;
26 var a = new List<string>();
27 for (int j = 0; j < n; j++)
29 for (int i = 0; i < n; i++)
31 if (j > 0)
32 a[i] = s[i] + a[i];
38 foreach (var w in a)
39 if (w[n - 1].ToString() == EOF)
40 return w;
42 throw new Exception();
Since the BWT is designed to work on files in memory we need to optimize this naive implementation to be more memory efficient. If the text block is large, as the authors recommend, then the list of all permutations created in the Encode() method is going to be wasteful with memory. This issue, and a few others will be something I'll explore in Part 2 of this series...