Archive for April, 2009


Data Structures: Hash Functions and Tables (Part 1)

In the past I've written about cryptographically-strong hash functions. But today I'd like to start a series of posts on simple hash functions. The difference between a cryptographically-strong hash function and a simple one lies mainly in how hard the result is to reverse: inverting the hash must be "hard" in the crypto world, but when the hash function is used for other purposes - like consistent mapping - that requirement is far less important.

A hash function is a function which consistently maps a data value, called the key, to an array (hash table) index. The standard practice is to feed a variety of input values (the keys) into a hash function and use the resulting array indices to store the key values in the array. By "consistently" we mean the function always produces the same result for a given input value. In mathematical terms - deterministic.
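To make that concrete, here's a tiny illustrative sketch (not a production-quality hash by any stretch) of a function that deterministically maps a string key to an index in a table of some assumed size, tableSize:

        // A deliberately simple hash function: sum the character codes and fold
        // the total into the table's index range with the modulo operator.
        // Deterministic: the same key always produces the same index.
        public static int SimpleStringHash(string key, int tableSize)   // tableSize assumed > 0
        {
            uint sum = 0;
            foreach (char c in key)
                sum = unchecked(sum + c);           // accumulate character codes (wraps on overflow)
            return (int)(sum % (uint)tableSize);    // map into [0, tableSize)
        }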

A hash table is a data structure that internally employs a hash function to efficiently map keys to values.

In the diagram below, the blue boxes on the left represent the keys whilst the boxes on the right-hand side represent the hash table. The green slots are indices occupied by the blue keys.

Ideally we want the function to be consistent, fast and injective (one-to-one); however, this is not always possible. To see this, consider h to be a hash function that operates on a set of keys, A, and maps onto a set of bucket indices, B. If |A| > |B| then, by the pigeonhole principle, h cannot be injective over A.

When the function maps two different key values to the same index, we have what is termed a collision and the function is classified as non-injective. The array slots available to the hash function are referred to as buckets because collisions mean multiple items can end up in the same slot, hence the collective name "bucket". As well, the array indices returned are commonly referred to as bucket indices. The diagram below shows a hash with a collision on the keys "Barry" and "Sally". In this case the hash table uses a simple linked list to point to all the items that hash to the given value. This is called hashing with chaining. The astute reader will be wondering if a linked list is the most efficient data structure to use inside the bucket - it certainly isn't, since you need to walk the list to find a given item, which leads to O(n) worst-case running time where n is the number of items in the bucket. However the number of items per bucket is, on average, very small in most cases, so whilst structures like secondary hash tables inside the buckets perform better, simple linked lists are easier for illustration purposes.
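To make the chaining idea concrete, here's a rough sketch of a chained hash table (the class name is mine, and List<T> stands in for a linked list purely for brevity):

        // A bare-bones hash table using chaining: each bucket holds all the
        // key/value pairs whose keys hash to that bucket index.
        public class ChainedHashTable<TKey, TValue>
        {
            private readonly List<KeyValuePair<TKey, TValue>>[] buckets;

            public ChainedHashTable(int size)
            {
                buckets = new List<KeyValuePair<TKey, TValue>>[size];
            }

            private int BucketIndex(TKey key)
            {
                // mask off the sign bit so the index is never negative
                return (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
            }

            public void Add(TKey key, TValue value)
            {
                int index = BucketIndex(key);
                if (buckets[index] == null)
                    buckets[index] = new List<KeyValuePair<TKey, TValue>>();
                buckets[index].Add(new KeyValuePair<TKey, TValue>(key, value));  // collisions simply chain
            }

            public bool TryGetValue(TKey key, out TValue value)
            {
                int index = BucketIndex(key);
                if (buckets[index] != null)
                {
                    foreach (var pair in buckets[index])        // walk the chain: O(n) in the bucket size
                    {
                        if (pair.Key.Equals(key)) { value = pair.Value; return true; }
                    }
                }
                value = default(TValue);
                return false;
            }
        }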

When collisions happen, the amount of work needed to subsequently look up a value in the hash table increases, since we now have multiple items at the same bucket index. This is not desirable, hence we prefer hash functions that minimise the number of collisions. Techniques used to resolve hash collisions include linear probing, quadratic probing, double hashing and chaining. These methods effectively find another vacant slot in the hash table using retry loops or trickery with modular arithmetic.
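As a sketch of one of these, linear probing (open addressing in its simplest form), here's roughly what insertion and lookup look like; resizing and deletion are deliberately left out:

        // Linear probing: on a collision, step forward one slot at a time
        // (wrapping around the table) until a vacant slot is found.
        public class LinearProbingTable
        {
            private readonly string[] slots;

            public LinearProbingTable(int size)
            {
                slots = new string[size];
            }

            public void Add(string key)
            {
                int start = (key.GetHashCode() & 0x7FFFFFFF) % slots.Length;
                for (int probe = 0; probe < slots.Length; probe++)
                {
                    int candidate = (start + probe) % slots.Length;   // wrap around the table
                    if (slots[candidate] == null || slots[candidate] == key)
                    {
                        slots[candidate] = key;
                        return;
                    }
                }
                throw new InvalidOperationException("Table is full");
            }

            public bool Contains(string key)
            {
                int start = (key.GetHashCode() & 0x7FFFFFFF) % slots.Length;
                for (int probe = 0; probe < slots.Length; probe++)
                {
                    int candidate = (start + probe) % slots.Length;
                    if (slots[candidate] == null) return false;       // hit a vacant slot: key is not present
                    if (slots[candidate] == key) return true;
                }
                return false;
            }
        }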

The primary benefit of a well-behaved (injective) hash function is the speed of lookup in the resulting hash table. The expected time for any operation is a constant, O(1), which makes it extremely appealing for tasks such as database indexing, de-duping lists, checking a set for inclusion, etc. In fact, several other data structures use hash functions under the covers to weave their magic. These include dictionaries / associative arrays / maps, and Bloom filters.

As an example, here's some C# code to de-dup a generic list in-place using a hash table.

        public static void Main()
        {
            var names = new List<string> { "Damien", "Alex", "Oscar", "Alex", "Benny", "Bjorn", "Oscar", "Oscar","Agnita" };

            DeDup(names);
            foreach(var name in names)
                Console.WriteLine(name);
        }

        // order-preserving, quick, in-place generic list de-duping
        public static void DeDup<T>(List<T> input)
        {
            // Hashtable lives in System.Collections; it calls GetHashCode() and
            // Equals() on the key internally, so we use the item itself as the
            // key rather than its raw hash code (two distinct items can share a
            // hash code, which would wrongly drop one of them).
            var ht = new Hashtable();

            int i = 0;
            foreach (var item in input.ToArray())   // iterate over a copy so we can mutate the live list
            {
                if (!ht.ContainsKey(item))
                {
                    ht.Add(item, item);
                    i++;                            // first sighting: keep it and advance
                }
                else
                    input.RemoveAt(i);              // duplicate: remove it; the index stays put
            }
        }

In order to minimise collisions the output values from the hash function should, ideally, be uniformly distributed over the output range. Statistical techniques like the chi-squared test can be used to check this, with the null hypothesis H0: the function's output values follow a uniform distribution.
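As a rough sketch of how you might check this in practice, the following helper buckets a sample of hash outputs and computes the chi-squared statistic against the uniform expectation (comparing it to the critical value for your chosen significance level and bucketCount - 1 degrees of freedom is left to the reader):

        // Bucket a sample of keys and compute the chi-squared statistic
        // against a uniform expectation per bucket.
        public static double ChiSquaredStatistic(IEnumerable<string> keys, int bucketCount)
        {
            var observed = new int[bucketCount];
            int total = 0;
            foreach (var key in keys)
            {
                observed[(key.GetHashCode() & 0x7FFFFFFF) % bucketCount]++;
                total++;
            }

            double expected = (double)total / bucketCount;   // what a uniform hash would give each bucket
            double chiSquared = 0.0;
            for (int b = 0; b < bucketCount; b++)
            {
                double diff = observed[b] - expected;
                chiSquared += diff * diff / expected;
            }
            return chiSquared;   // small values support the uniformity hypothesis
        }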

Perfect Hash Function

If there are no duplicate keys in the input set and the hash function results in no bucket having more than one entry, the hash function is said to be a perfect hash function. In other words, a perfect hash function is injective on a fixed, known set of inputs, and therefore collision-free. A perfect hash function is ideal when the input set is large but static and there are many lookups of the data, i.e. read-intensive, mostly-static data.
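For a small, fixed key set you can even stumble on a perfect hash by brute force, e.g. by searching for a table size under which the built-in hash codes collide on none of the keys. Real perfect-hash generators (gperf, for instance) are far cleverer, but a naive sketch looks like this:

        // Brute-force perfect-hash construction: find a table size m for which
        // hashCode % m is collision-free over a fixed, known key set.
        // Assumes the keys themselves have distinct hash codes.
        public static int FindPerfectTableSize(IList<string> keys)
        {
            for (int m = keys.Count; m < keys.Count * 100; m++)
            {
                var used = new bool[m];
                bool collision = false;
                foreach (var key in keys)
                {
                    int index = (key.GetHashCode() & 0x7FFFFFFF) % m;
                    if (used[index]) { collision = true; break; }
                    used[index] = true;
                }
                if (!collision)
                    return m;   // m == keys.Count would make it a *minimal* perfect hash (see below)
            }
            throw new InvalidOperationException("No collision-free table size found in the search range");
        }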

Minimal Perfect Hash Function

A perfect hash function for n keys is said to be minimal if its range consists of consecutive integers, usually from 0 to n−1. Besides providing single-step lookup, a minimal perfect hash function is extremely space-efficient as it produces a compact hash table without any vacant slots. Minimal perfect hash functions are much harder to find than perfect ones with a wider range. The diagram below shows a minimal perfect hash function.

Cuckoo Hashing

When one hash function isn't enough, how about using two? This is the idea behind cuckoo hashing. You employ 2 hash functions and, for each key, insert the item into the slot given by the first hash function. If that slot is already taken then you kick out the item in that slot and relocate it to its alternative slot, the one given by its other hash function. If that slot is taken too, you do the same thing again until a vacant slot is found. This can lead to an infinite loop, in which case the hash table detects this and rebuilds itself in-place using new hash functions that hopefully lead to a better result.
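A sketch of the insertion routine makes this clearer. The member names here (table, maxKicks, Hash1, Hash2, Rehash) are my own placeholders, and the kick limit stands in for proper cycle detection:

        // Cuckoo insertion: claim the slot given by the first hash function;
        // if it's occupied, evict the occupant and move it to ITS alternative
        // slot, repeating until a vacant slot turns up or we give up and rehash.
        private void CuckooInsert(string key)
        {
            int pos = Hash1(key);
            for (int kicks = 0; kicks < maxKicks; kicks++)
            {
                if (table[pos] == null) { table[pos] = key; return; }

                // evict the current occupant and take its slot
                string evicted = table[pos];
                table[pos] = key;
                key = evicted;

                // the evicted key moves to its other possible slot
                pos = (pos == Hash1(key)) ? Hash2(key) : Hash1(key);
            }
            Rehash();           // probable cycle: rebuild in-place with new hash functions
            CuckooInsert(key);  // then retry the key left without a home
        }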

The cute thing about cuckoo hashing is that the worst case for lookup operations is constant time, O(1). This is much better than merely having constant expected time. Furthermore, as long as the load factor is less than 50% (i.e. the capacity of the hash table is more than twice the number of keys), insertions have expected constant time. So it does require a very sparse hash table to get good insertion times; hence it is either fast and takes a lot of space, or slow and space-efficient - but never both.

If you're wondering about the name "cuckoo", it comes from the nesting habits of a European Cuckoo which throws out the occupants of the nest to make room for itself.
 
Bloom Filters

If two hash functions aren't enough, how about using three, or more? Believe it or not, this is what Bloom filters do. You start off with an empty bit array of, say, n bits, all set to 0. You also have k hash functions which map keys to array indices randomly and uniformly. The idea is that when you need to add an item to the filter, you hash its key with all k hash functions and then put a 1 in each of the positions returned. The diagram below shows how three values {x, y, z} have been mapped into the bit array, whilst w is the item being queried.

So what use is this? Well, Bloom filters are used to test whether an element is a member of a set (you can add elements to the set but not easily remove them unless you introduce counters in the array). To do a query, you pass the key of the data under consideration through the same k hash functions and check whether ALL of the resulting array positions are equal to 1. If all positions are 1, then either the item has already been added, or the bits were all set by insertions of other items. Thus it is a probabilistic data structure: there is a chance of false positives, but if items are never removed there is NO chance of a false negative. The likelihood of false positives decreases as the size of the bit array increases, and increases as the number of inserted elements increases.
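Here's a rough C# sketch of that add/query behaviour. Deriving the k positions from two base hashes (h1 + j*h2) is a common shortcut for simulating k independent hash functions; the second hash used here is purely illustrative:

        // Minimal Bloom filter sketch built on System.Collections.BitArray.
        public class BloomFilter
        {
            private readonly System.Collections.BitArray bits;
            private readonly int k;

            public BloomFilter(int sizeInBits, int hashCount)
            {
                bits = new System.Collections.BitArray(sizeInBits);
                k = hashCount;
            }

            private int Position(string key, int j)
            {
                int h1 = key.GetHashCode();
                int h2 = key.Length * 31 + (key.Length > 0 ? key[0] : 0);   // crude second hash, for illustration only
                return ((h1 + j * h2) & 0x7FFFFFFF) % bits.Length;
            }

            public void Add(string key)
            {
                for (int j = 0; j < k; j++)
                    bits[Position(key, j)] = true;      // set all k positions
            }

            public bool MightContain(string key)
            {
                for (int j = 0; j < k; j++)
                    if (!bits[Position(key, j)])
                        return false;                   // any 0 bit means definitely not present
                return true;                            // all 1s means probably present (false positives possible)
            }
        }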

They are space-efficient and, because they employ hash functions, exhibit constant time for both insertion and lookup operations (here the "lookup" operation is checking for inclusion in the set). As well, the k hash computations can be done in parallel. Multiple Bloom filters of the same size can also be combined using union and intersection operations, implemented efficiently with bitwise OR/AND functions.

So what values of k (the number of hash functions) and n (the size of the bit array) should we choose for a given number of inserted items, s? Mathematical analysis has shown that the value of k that minimises the probability of false positives is:

k = (n / s) × ln 2 ≈ 0.693 × (n / s)

With k chosen this way the false positive rate is roughly (1/2)^k, so allowing about 10 bits per inserted item gives an expected false positive rate of a little under 1%. Also note that since the formula is expressed in terms of s, the number of inserted items, it follows that n, the size of the array used for the Bloom filter, needs to grow as s increases if the error rate is to be held constant.
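As a quick worked example of those formulas (the numbers here are purely illustrative): with s = 1,000 expected items and n = 10,000 bits (10 bits per item), the optimal k is about 7 and the false positive rate comes out a little under 1%:

        // Worked example of the Bloom filter sizing formulas.
        public static void BloomSizingExample()
        {
            double s = 1000;        // expected number of inserted items
            double n = 10000;       // bits in the filter (10 bits per item)

            double k = (n / s) * Math.Log(2.0);     // optimal number of hash functions, ~6.93, so use 7
            double kUsed = Math.Round(k);
            double p = Math.Pow(1 - Math.Exp(-kUsed * s / n), kUsed);   // resulting false positive rate

            Console.WriteLine("Optimal k         = {0:F2}", k);
            Console.WriteLine("False positive p  = {0:P2}", p);         // roughly 0.8%
        }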

In summary, a Bloom filter is a simple space-efficient randomised data structure for representing a set in order to support membership/inclusion queries. They can lead to false positives but the space savings often outweigh this drawback when the probability of error is controlled.

Note that if the set of possible values is small and many of them might be included in the set then other data structures like plain-and-simple bit arrays are more effective since they are deterministic.

In Part 2 of this series we'll move on from the background theory to look at implementation/usage details in .Net.


Quotes

Over the years I've heard a bunch of quotes. Most of them aren't terribly insightful or inspirational but every now and again one sticks. A quote that has meaning to you, one that you embrace and perpetuate to others as ideas and ideals you believe in. Here are a few quotes I've seen that have "stuck":

"A good engineer learns when compromise is more profitable than perfection" ...Robert Martin

"Talent hits a target no one else can hit; Genius hits a target no one else can see.” ...Arthur Schopenhauer (1788- 1860)

"Far batter to dare mighty things, to win glorious triumphs even though checkered by failure, than to rank with those poor spirits who neither enjoy nor suffer much because they live in the gray twilight that knows neither victory nor defeat" .. Theodore Roosevelt (quoted on the Harvard Business School Admissions website)

"No matter how busy you may think you are, you must find time for reading, or surrender yourself to self-chosen ignorance.” ...Confucius

"Live as if you were to die tomorrow. Learn as if you were to live forever." ...Gandhi

"If you cannot measure it, you cannot improve it." ...Lord Kelvin

"Intuition is regularly wrong. Let data drive decisions." ...Damien Wintour

"The mass market comprises the pragmatists, those who just want the damn thing to work." ..Geoffrey Moore, author of Crossing the Chasm

"Revenue is the proof of your ability to win customers, which will determine your success more so than your ability to squeeze a few pennies out of your costs." ...extract from Ahead of the Curve

"Before criticising someone, you should walk a mile in their shoes. That way, if they get nasty, they're a mile away and barefoot." Unknown

"The will to win means nothing without the will to prepare." ...Juma Ikangaa, elite marathon runner

"If you’re concerned about scalability, any algorithm that forces you to run agreement will eventually become your bottleneck. Take that as a given." ...Werner Vogels, Amazon CTO

"Logic will take you from A to B. Imagination will take you everywhere." ...Albert Einstein


Optimising Bit Counting using Iterative, Data-Driven Development

Based on several blogs and discussion boards I've seen lately, it seems that the old interview question chestnut of bit counting is still in vogue. I suspect it's more common at companies that seek quantitative skills in their programming staff - like investment banks, hedge funds, internet startups run by ex-Googlers, etc. So what is it? Bit counting is the problem of reporting the number of set (1) bits in the binary representation of an integer. In Java there is a built-in static method, Integer.bitCount(), that you could use, but it is interesting to examine what the internals of such a method would look like.

Many people have written on this topic, with varying degrees of accuracy and explanatory comment (a list of references comes later). I don't wish to repeat the good work others have done, and I'm not proposing anything new that other smart people haven't already written about. Instead I'd like to outline the pragmatic approach a sensible C# developer might take when faced with such a question in an interview situation, where access to a Google search engine and its accompanying "instant knowledge" is not available but a sensible (hopefully close to optimal) solution still needs to be delivered. By memorising the answer you might be able to rattle it off when needed, but IMHO it's far more valuable to understand the solution and, more importantly, the process taken to get there - i.e. teach a (wo)man to fish rather than give a (wo)man a fish. That said, if you are after a fast answer, or a link to great reference material, skip to the end for instant gratification.

Motivation
You might ask yourself why people ask such interview questions when the answers can be found using a search engine in no time. It's because they want to:

a) examine your logical approach to a problem they hope you haven't seen before; and

b) find out the level of understanding you have about a particular problem if indeed you have seen it before.

Bit counting has many different solutions, so even if you've heard this one before your interviewer can push you deeper and deeper into the problem until you hit unfamiliar ground and are forced to think on your feet - which is really the ultimate goal of all interview puzzle problems.

Version 1: The "Naive" Solution
Firstly, we try to formulate something that yields the right answer via the most obvious route. The solution most people will think of first is to convert the number to binary notation and then walk through the binary digits, summing the ones that are set. Here is some C# code showing an extension method to do just that:

        public static uint BitsSetCountNaive(this uint input)
        {
            uint count = 0;   

            while (input != 0)
            {
                count += input & 1;     // logical AND on the LSB
                input >>= 1;            // do a bitwise shift to the right to create a new LSB
            }
            return count;
        }

Notice that:

  1. We don't need to do anything explicit to convert the number from an integer into binary representation.
  2. Since pass-by-value is the default in C#, we can re-use the UInt32 variable, input, internally without affecting the caller's version of this variable.
  3. Figuring out if a bit is set or not is accomplished by doing a bitwise AND (&) of the input value with 1 which has the effect of isolating and returning the least significant bit, and conveniently returning the value we need to increment the counter by. Doing such bitwise AND operations is often called masking.
  4. Looping through each binary digit is accomplished by doing a bitwise shift to the right by 1 place. This just pushes all bits one position to the right and inserts zeros at the most significant (leftmost) end.
  5. For 32-bit unsigned integers that have lots of leading zeros in their binary representation, the loop terminates before iterating through all 32 bits. This is because the right shift operation after counting the leftmost non-zero bit produces zero which triggers the while{} exit condition.

In effect, this is a mask-add-shift loop which has no real memory overhead and running time of O(n) where n is the number of binary digits in the binary representation of the given integer. This approach is commonly referred to as the "Iterated Count". Note that running time is said to be O(n) despite the fact that the clever while loop will exit after processing the last non-zero bit. This is because asymptotic analysis refers to worst case (the upper bound), which in this case is all n bits hence O(n).

Because of (5) above, this approach works best when the set bits are clustered among the least significant bits, since the loop exits once the highest set bit has been shifted out.

Version 2: Tightening the Loop
Now we have a basic, naive solution we can iterate on it (in an Agile sense) to improve its performance. The initial solution had a running time of O(n) because we have to loop through all n binary digits. We could obviously improve on that if we could somehow jump from set bit to set bit. How do we do this? Again we can use a simple trick with bitwise operators:

        public static uint BitsSetCountWegner(this uint input)
        {
            uint count;
            for (count = 0; input!=0; count++)
            {
                input &= input - 1; // turn off the rightmost 1-bit
            }
            return count;
        }

Notice that:

  1. We use the formula:  x = x & (x-1) to turn off the rightmost set (1) bit.
  2. The loop now iterates on the counter and we jump from set bit (1) to set bit, exiting when there are no more set bits.
  3. The function is called "...Wegner" after the guy who originally came up with it but more on that later. Brian Kernighan from K&R C fame also mentions this approach in one of his books so many people incorrectly attribute this to him.

This uses no more memory than version 1 (both have O(1) space complexity), but should yield better running times since we don't enumerate through all binary digits, just the set ones. Running time is ~O(b) where b is the number of set (1) bits in the binary representation of the given integer, and common sense tells us that this approach is useful when there are few set bits in the input, i.e. when the input is sparse. For this reason, this approach is sometimes called the "Sparse Ones" approach.

Version 2b: From Sparse to Dense
The astute programmer will intuitively realise that if there exists an algo optimised for sparse input then there must be a complementary one optimised for dense input. "Dense" input, in this case, refers to a binary number which has a large number of set (1) bits relative to the number of 0 bits. It is shown below for the sake of completeness:

        public static uint BitsSetCountDense(this uint input)
        {
            uint count;
            for (count = 0; input != UInt32.MaxValue; count++)
            {
                input |= input + 1; // turn on the rightmost 0-bit
            }
            return 32-count;
        }

Here we repeatedly turn ON the rightmost 0-bit until all bits are set to 1, at which point the input equals UInt32.MaxValue. The "magic" bitwise operation used to turn on the rightmost 0-bit is: x = x | (x + 1).

Alternatively, the original number can be complemented (all bits flipped), and we can count down from the maximum theoretical bit count, 32, as follows:

        public static uint BitsSetCountDenseAlternate(this uint input)
        {
            uint count;
            input = ~input;        // complement (flips all bits)
            for (count = 32; input != 0; count--)
            {
                input &= input - 1; // turn off the rightmost 1-bit
            }
            return count;
        }

Sparse Ones and Dense Ones were first described by Peter Wegner in “A Technique for Counting Ones in a Binary Computer”, Communications of the ACM, Volume 3 (1960) Number 5, page 322.

Evaluation of V1 and V2
Regardless of what asymptotic analysis tells us, it's prudent to measure and compare the alternatives since we are data-driven. Accordingly, we could write a quick test case as follows:

        static void Main()
        {
            const int LoopMax = 1000000;

            // test code for bit counting
            uint xx = 3160637183;

            // test case 1
            Stopwatch sw = new Stopwatch();
            sw.Start();
            uint ans = 0;
            for (int l = 0; l < LoopMax; l++)
                ans = xx.BitsSetCountNaive();
            sw.Stop();                          // stop the clock before doing any console I/O
            Console.WriteLine(ans);
            Console.WriteLine(String.Format("Timing for Naive bit count: {0}", sw.ElapsedTicks));

            // test case 2
            sw = new Stopwatch();
            sw.Start();
            for (int l = 0; l < LoopMax; l++)
                ans = xx.BitsSetCountWegner();
            sw.Stop();
            Console.WriteLine(ans);
            Console.WriteLine(String.Format("Timing for Wegner bit count: {0}", sw.ElapsedTicks));

            Console.ReadLine();
        }

This test is very crude in that it doesn't use a variety of inputs. However, we know that the algorithm proposed in Version 2 should definitely outperform the Ver. 1 algo when there are only a few set (1) bits in the binary number. So our choice of 3160637183, which has 23 set bits out of 32 in total, should not favour the Ver. 2 algo very much. That said, the test figures we get are good enough to give us a relative measure:

23
Timing for Naive bit count: 588842
23
Timing for Wegner bit count: 451542

The data backs up our asymptotic analysis. We see a roughly 23% improvement from Ver. 1 (Naive) to Ver. 2 (Wegner), which lines up with iterating over 23 set bits rather than all 32 bits.

Version 3: The Penny has Dropped - Lookup Tables
A 32-bit unsigned integer represents a finite number of integer values: 2^32 = 4,294,967,296 distinct values. The number of bits set in each of these integer representations never changes, so it would be possible to precompute these values, or to use memoization/caching to store them after first use. If we have some sort of lookup table we can get constant run time, at the expense of space complexity. The feasibility of this approach depends on the environment and where the resource bottlenecks are. Often an interviewer will qualify the question with something like "you have unlimited memory storage". That's your cue to employ a lookup table. But what sort of lookup table should be used?

A bit count for a 32-bit value lies in the range 0-32, which fits comfortably in an 8-bit byte. Storing over 4 billion of these in a lookup array would consume 4 GB of memory. Since summing the 1-bits in a 32-bit number is the same as summing the 1-bits in the lower and upper 16 bits separately and then adding the two counts together, we can employ a divide-and-conquer approach and use a byte array lookup table that only has 2^16 = 65,536 distinct entries. This consumes only 64 KB of memory, a much more acceptable overhead. Here's what the code looks like:

        // 65,536 entries: one count per possible 16-bit value (note the + 1,
        // otherwise index 0xFFFF would fall outside the array)
        static readonly byte[] bitCounts = new byte[UInt16.MaxValue + 1];

        private static bool IsInitialized;  // will be false by default

        static public void InitializeBitcounts()
        {
            for (uint i = 0; i <= UInt16.MaxValue; i++)
            {
                // Get the bitcount using any method
                bitCounts[i] = (byte)i.BitsSetCountWegner();
            }
            IsInitialized = true;
        }

        public static uint BitsSetCountUsingLookup(this uint i)
        {
            if (!IsInitialized)
                InitializeBitcounts();

            return (uint)(bitCounts[i & 0xFFFF] + bitCounts[(i >> 16) & 0xFFFF]);
        }

Of course, it's also possible to precompute counts for all 8-bit values (a 256-entry table) and then break the 32-bit input value into 4 blocks of 8 bits, each of which can be looked up from the table. Note that we are assuming that the lookup table can be held in memory for the lifetime of the application, so the one-off cost of initialising it is amortised over many lookups.
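A sketch of that 8-bit variant (a 256-entry byte table and four lookups per 32-bit value) might look like this:

        // 256-entry table: bitCounts8[b] holds the number of set bits in the byte b.
        static readonly byte[] bitCounts8 = BuildByteTable();

        private static byte[] BuildByteTable()
        {
            var table = new byte[256];
            for (uint b = 0; b < 256; b++)
                table[b] = (byte)b.BitsSetCountWegner();    // populate using any counting method
            return table;
        }

        public static uint BitsSetCountUsingByteLookup(this uint i)
        {
            // split the 32-bit value into four bytes and sum the per-byte counts
            return (uint)(bitCounts8[i & 0xFF] +
                          bitCounts8[(i >> 8) & 0xFF] +
                          bitCounts8[(i >> 16) & 0xFF] +
                          bitCounts8[(i >> 24) & 0xFF]);
        }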

Version 4: The Quest for (Near) Constant Time without Memory Overhead
Getting this far is pretty good - we have a fast method albeit with some memory overhead. But what if the environment does not allow for additional memory being used for lookup tables? Is there a (near) constant time algo that doesn't have the memory overhead required by the previous solution?

The clue to this approach is mentioned above: divide-and-conquer, combined with some parallel processing. Many algorithms employ a divide-and-conquer approach and end up with running time ~O(log₂ n) because they successively divide the problem in half until it cannot be divided in half any more. Such an approach is taken by binary search, balanced binary tree data structures, Fibonacci search, etc.

In our case, we know that we can partition the 32-bit integer any way we please and sum the bits of all the parts individually and easily compute the required value by summing these interim counts. Furthermore we know that some counting operations can be done in parallel using cleverly crafted bitwise operators. To see this consider the case where we want to sum adjacent bits. Bits can be 0 or 1, so the value of the sum will be 0, 1 or 2 (in decimal), which is 00, 01 or 10 in binary. In other words, we can sum 2 individual bits and store the resulting sum in those same 2 bits! This can be expressed mathematically as follows:

a = (a & 0x55555555) + ((a >> 1) & 0x55555555)

Here, the hexadecimal constant 0x55555555 represents a binary mask (01010101010101010101010101010101) that picks up every even-numbered bit, that is, the bits in the 2^0, 2^2, 2^4, etc. positions of the binary representation. Doing a right shift by one position and applying the mask again picks up all the other, odd-numbered bit positions: 2^1, 2^3, 2^5, and so on. Hence the above equation sums adjacent bits and stores the results in-place.

This can be extended to sum the adjacent 4-bits, then the adjacent 8-bits, and finally the adjacent 16-bits to get our answer. All we need to do is adjust the bit mask used at each step and the magnitude of the shift operator. Here's the code to do this in C#:

        public static uint BitsSetCountUnOptimised(this uint i)
        {
            i = (i & 0x55555555) + ((i >> 1) & 0x55555555);     // adjacent bits grouped
            i = (i & 0x33333333) + ((i >> 2) & 0x33333333);     // adjacent 2-bit groups paired
            i = (i & 0x0F0F0F0F) + ((i >> 4) & 0x0F0F0F0F);     // adjacent 4-bit groups paired
            i = (i & 0x00FF00FF) + ((i >> 8) & 0x00FF00FF);      // adjacent 8-bit groups paired
            i = (i & 0x0000FFFF) + ((i >> 16) & 0x0000FFFF);     // adjacent 16-bit groups paired
            return (i & 0x0000003F);    // a 32-bit unsigned int can have at most 32 set bits, 
                                        // which needs only 6 bits to represent
        }

Version 5: Finesse and Extreme Optimisation
We've already made some great leaps forward in efficiency and asymptotic run time, but believe it or not there are further optimisations that can be made. First, note that the final 16-bit mask is not necessary: after the right shift by 16 places only 16 bits remain, so the masking is redundant. As well, other logical ANDs can safely be removed wherever there is no danger that a field will carry over into the adjacent field; in our case, the grouping of adjacent 4-, 8- and 16-bit groups can have redundant logical ANDs removed. Finally, the first line can be trimmed from 4 operations down to 3 by replacing it with:

x = x - ((x >> 1) & 0x55555555);

The revised code now looks like:

        // taken from Hacker's Delight, Chapter 5, p. 66
        public static uint BitsSetCountOptimised(this uint i)
        {
            i = i - ((i >> 1) & 0x55555555);                    // adjacent bits grouped
            i = (i & 0x33333333) + ((i >> 2) & 0x33333333);     // adjacent 2-bit groups paired
            i = i + ((i >> 4) & 0x0F0F0F0F);                    // adjacent 4-bit groups paired
            i = i + (i >> 8);                                   // adjacent 8-bit groups paired
            i = i + (i >> 16);                                  // adjacent 16-bit groups paired
            return (i & 0x0000003F);    // a 32-bit unsigned int can have at most 32 set bits, 
                                        // which needs only 6 bits to represent
        }

Putting this through a crude timing check we see the following results for 1 million iterations. The figures are total elapsed ticks:

Timing for parallel bit count (unoptimised): 224970
Timing for parallel bit count (optimised): 214467
Timing for lookup table bit count: 131663

This shows that the lookup table-based approach is still the fastest but the optimised approach given above is not too far behind. It is elegant in that it is branch-free (branching is usually a more expensive operation) and employs parallel bit counting.

If you have an inquisitive mind you are probably already asking yourself if the function can be made even faster by working with adjacent groups bigger than 2 bits. Well, it certainly can. In fact this was the approach taken by HAKMEM Item 169 (see references below). The solution offered there used adjacent 3-bit groups. This was further extended by other researchers to work with - you guessed it - adjacent 4-bit groups. Here's some C# code to do just that:

        // works by counting adjacent 4-bits
        public static uint BitsSetCountHackMem(this uint i)
        {
            uint n = (i >> 1) & 0x77777777;     // the three subtract-and-shift steps leave the
            i = i - n;                          // count of set bits within each 4-bit group
            n = (n >> 1) & 0x77777777;
            i = i - n;
            n = (n >> 1) & 0x77777777;
            i = i - n;
            i = (i + (i >> 4)) & 0x0F0F0F0F;    // sum adjacent 4-bit counts into each byte
            i = i * 0x01010101;                 // the multiply adds all four byte counts into the top byte
            return i >> 24;                     // the answer sits in the most significant byte
        }

So how does it compare? Because we're data-driven rather than theory-driven, let's measure it and let the data tell us:

Timing for parallel bit count (unoptimised): 216632
Timing for parallel bit count (optimised): 205692
Timing for lookup table bit count: 104143
Timing for modified HAKMEM bit count: 83938

Damn, that's fast! It actually edges out the lookup table-based approach and it doesn't have any memory overhead! That's good enough for me. At this point we're probably going to see diminishing returns for further optimisation efforts so let's settle on this approach.

One final bit of advice: your interviewer more than likely doesn't want a 20 minute sermon on the topic of bit counting. Think carefully before launching into a lengthy discussion upfront.

Advanced Topics
If you've dazzled the interviewer with your knowledge and reasoning skills it's more than likely they'll realise that you've seen the problem before and you'll get pushed onto the more esoteric issues of this vast topic. Here are a few you can think about in advance:

  • Why can't we apply these algorithms to signed integers?
  • Can you suggest some architecture-dependent optimisations?
  • How would you determine what constitutes a sparse/dense number prior to bit counting?
  • Can you suggest an algo which is cache oblivious?
  • Can you think of some examples where you might need to use bit counting?
  • What is the worst-case asymptotic run time of the above-mentioned algorithms?
  • If you were building a framework for other developers to use, and you needed to expose a BitCount() method, what would your method internals look like?

Further Research
This problem has several common names:

  • Hamming Weight
  • Population Count (or popCount) of Binary Values
  • Sideways Sum / Sideways Addition

Useful References
Hackers Delight, Chapter 5
Bitwise Tricks and Techniques by Don Knuth (The Art of Computer Programming, Volume 4)
Sean Anderson's Bit Twiddle Hacks
HAKMEM Item 169: MIT AI Memo 239, Feb. 29, 1972
Discussion of the Problem on StackOverflow
Gurmeet Singh's Weblog with C code
Jesse Farmer's Blog
Magic Algorithms

Bit Twiddle Basics
All of the bit twiddling hacks below are taken from a magnificent book called Hacker's Delight by Henry S. Warren Jnr. If you want comprehensive coverage of low-level, highly efficient tricks on various processor architectures this is a book you must read. (A short demo exercising a few of these tricks appears after the list.)

  1. To turn off the rightmost 1-bit in x:          x & (x-1)
  2. To isolate the rightmost bit in x:                x & 1
  3. To isolate the rightmost 1-bit in x:            x & (-x)
  4. To isolate the rightmost 0-bit in x:            ~x & (x+1)
  5. To create a mask for all trailing zeros:       ~x & (x-1)     (~ is the one's complement, aka bitwise NOT)
  6. To turn off the k-th bit from the right in x:   x &  ~(1<<k)
  7. To check if x is a power of 2:                      (x & (x - 1)) == 0     (this also treats 0 as a power of 2, so check x != 0 as well)
  8. To toggle multiple bits at once: create a mask for the bits to toggle, M, and use:  x ^ M
  9. To find the max value of an unsigned data type T, cast zero as T then complement:    ~(T)0
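As promised, here's a quick sanity check of a few of these tricks (the value 0x2C, binary 101100, is arbitrary):

        public static void BitTwiddleDemo()
        {
            uint x = 0x2C;                              // 101100 in binary

            Console.WriteLine(x & (x - 1));             // 40  (101000): rightmost 1-bit turned off
            Console.WriteLine(x & (uint)(-(int)x));     // 4   (000100): rightmost 1-bit isolated
            Console.WriteLine(~x & (x - 1));            // 3   (000011): mask of the trailing zeros
            Console.WriteLine((x & (x - 1)) == 0);      // False: 0x2C is not a power of two
            Console.WriteLine(~(uint)0);                // 4294967295: max value of a 32-bit unsigned type
        }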
