Based on several blogs and discussion boards I've seen lately, it seems that the old interview question chestnut of bit counting is still in vogue. I suspect it's more common at companies that seek quantitative skills in their programming staff -like investment banks, Hedge Funds, internet startups run by ex-Googlers, etc. So what is it? Bit counting is the problem of reporting the number of set (1) bits in the binary representation of an integer. In Java there is a built in static method, Integer.bitcount(), that you could use but it is interesting to examine what the internals of such a method would look like.

Motivation
You might ask yourself why people ask such interview questions when the answers can be found out using a search engine is no time. It's because they want to:

a) examine your logical approach to a problem they hope you haven't seen before; and

b) to find out the level of understanding you have about a particular problem if indeed you have seen it before.

Bit Counting has many various solutions so even if you've heard this one before your interviewer can push you deeper and deeper into the problem until you hit unfamiliar ground and are forced to think on your feet - which is really the ultimate goal of all interview puzzle problems.

Version 1: The "Naive" Solution
Firstly, we try to formulate an approach that yields the right answer using the most obvious approach. The solution that most people will think of first is to convert the number to binary notation and then walk through the binary digits summing the ones that are set. Here is some C# code showing an extension method to do just that:

```        public static uint BitsSetCountNaive(this uint input)
{
uint count = 0;

while (input != 0)
{
count += input & 1;     // logical AND on the LSB
input >>= 1;            // do a bitwise shift to the right to create a new LSB
}
return count;
}```

Notice that:

1. We don't need to do anything explicit to convert the number from an integer into binary representation.
2. Since pass-by-value is the default in C#, we can re-use the UInt32 variable, input, internally without affecting the caller's version of this variable.
3. Figuring out if a bit is set or not is accomplished by doing a bitwise AND (&) of the input value with 1 which has the effect of isolating and returning the least significant bit, and conveniently returning the value we need to increment the counter by. Doing such bitwise AND operations is often called masking.
4. Looping through each binary digit is accomplished by doing a bitwise shift to the right 1 place. This just pushes all bits to the right 1 position and inserts zeros at the most significant bit (big) end.
5. For 32-bit unsigned integers that have lots of leading zeros in their binary representation, the loop terminates before iterating through all 32 bits. This is because the right shift operation after counting the leftmost non-zero bit produces zero which triggers the while{} exit condition.

In effect, this is a mask-add-shift loop which has no real memory overhead and running time of O(n) where n is the number of binary digits in the binary representation of the given integer. This approach is commonly referred to as the "Iterated Count". Note that running time is said to be O(n) despite the fact that the clever while loop will exit after processing the last non-zero bit. This is because asymptotic analysis refers to worst case (the upper bound), which in this case is all n bits hence O(n).

Because of (4) above, this approach would work best when most of the set-bits are among the least significant bits.

Version 2: Tightening the Loop
Now we have a basic, naive solution we can iterate on it (in an Agile sense) to improve it's performance. The inital solution had running time of O(n) because we have to loop through all n binary digits. We could obviously improve that if we could somehow jump from set bit to set bit. How do we do this? Again we can use a simple trick with bitwise operators:

```        public static uint BitsSetCountWegner(this uint input)
{
uint count;
for (count = 0; input!=0; count++)
{
input &= input - 1; // turn off the rightmost 1-bit
}
return count;
}```

Notice that:

1. We use the formula:  x = x & (x-1) to turn off the rightmost set (1) bit.
2. The loop now iterates on the counter and we jump from set bit (1) to set bit, exiting when there are no more set bits.
3. The function is called "...Wegner" after the guy who originally came up with it but more on that later. Brian Kernighan from K&R C fame also mentions this approach in one of his books so many people incorrectly attribute this to him.

This uses no more memory than version 1 (both have O(1) space complexity), but should yield better running times since we don't enumerate through all binary digits, just the set ones. Running time is ~O(b) where b is the number of set (1) bits in the binary representation of the given integer, and common sense tells us that this approach would be useful when there are few set bits in the input. i.e when the input is sparse. For this reason, this approach is sometimes called the "Sparse Ones" approach.

Version 2b: From Sparse to Dense
The astute programmer will intuitively realise that if there exists an algo optimised for sparse input then there must be a complementary one optimised for dense input. "Dense" input, in this case, refers to a binary number which has a large number of set (1) bits relative to the number of 0 bits. It is shown below for the sake of completeness:

```        public static uint BitsSetCountDense(this uint input)
{
uint count;
for (count = 0; input != UInt32.MaxValue; count++)
{
input |= input + 1; // turn on the rightmost 0-bit
}
return 32-count;
}```

Here we repetitively turn ON the rightmost 0-bit until all bits are set to 1 in which case the input would equal UInt32.MaxValue. Here the "magic" bitwise operation used to turn on the rightmost 0-bit is: x =x | (x+1).

Alternately, the original number can be complemented (all bits are flipped), and we can count down from the max theoretical bit count, 32 as follows:

```        public static uint BitsSetCountDenseAlternate(this uint input)
{
uint count;
input = ~input;        // complement (flips all bits)
for (count = 32; input != 0; count--)
{
input &= input - 1; // turn off the rightmost 1-bit
}
return count;
}```

Sparse Ones and Dense Ones were first described by Peter Wegner in “A Technique for Counting Ones in a Binary Computer”, Communications of the ACM, Volume 3 (1960) Number 5, page 322.

Evaluation of V1 and V2
Regardless of what asymptotic anaylsis tells us, it's prudent to measure and compare the alternatives since we are data-driven. Accordingly, we could write a quick test case as follows:

```        static void Main()
{
const int LoopMax = 1000000;

// test code for bit counting
uint xx = 3160637183;

// test case 1
Stopwatch sw = new Stopwatch();
sw.Start();
uint ans = 0;
for (int l = 0; l < LoopMax; l++)
ans = xx.BitsSetCountNaive();
Console.WriteLine(ans);
Console.WriteLine(String.Format("Timing for Naive bit count: {0}", sw.ElapsedTicks));

// test case 2
sw = new Stopwatch();
sw.Start();
for (int l = 0; l < LoopMax; l++)
ans = xx.BitsSetCountWegner();
Console.WriteLine(ans);
Console.WriteLine(String.Format("Timing for Wegner bit count: {0}", sw.ElapsedTicks));

}```

This test is very crude in that it doesn't use a variety of inputs. However, we know that the algorithm proposed in Version 2 should definitely outperform the Ver.1 algo when there is only a few set (1) bits in the binary number. So our choice of 3160637183, which has 23 set bits out of 32 bits in total, should not favour the Ver. 2 algo very much. That said, the test figures we get are good enough to give us a relative measure:

```23
Timing for Naive bit count: 588842
23
Timing for Wegner bit count: 451542```

The data backs up our asymptotic analysis. We see a 25% improvement from Ver. 1 (Naive) to Ver. 2 (Wegner) which correlates to scanning 23 bits rather than 32 bits.

Version 3: The Penny has Dropped - Lookup Tables
A 32-bit unsigned integer repesents a finite number of integer values, 232 = 4,294,967,296 distinct values. The number of bits set in each of these integer representations does not change so it would be possible to precompute these values, or to use memoization /caching to store them after first use. If we have some sort of lookup table we could get constant run time, at the expense of space complexity. The feasibility of this approach depends on the environment and where the resource bottlenecks are. Often an interviewer will qualify the question with something like "you have unlimited memory storage". That's your queue to employ a lookup table. But what sort of lookup table should be used?

We need an 8-bit byte data type in C# to represent values in the range 0-32. Storing over 4 billion of these in a lookup array would consume 4 GB of memory. Since summing the 1-bits in a 32 bit number is the same as summing the 1-bits in both the lower and upper 16 bits separately and then adding them together, we can employ a divide-and-conquer approach here and use a byte array lookup table that only has 216 = 65,536 distinct values. This would consume only 64 KB of memory, a much more acceptable overhead. Here's what the code looks like:

```        static readonly byte[] bitCounts = new byte[UInt16.MaxValue];

private static bool IsInitialized;  // will be false by default

static public void InitializeBitcounts()
{
for (uint i = 0; i < UInt16.MaxValue; i++)
{
// Get the bitcount using any method
bitCounts[i] = (byte)i.BitsSetCountWegner();
}
IsInitialized = true;
}

public static uint BitsSetCountUsingLookup(this uint i)
{
if (!IsInitialized)
InitializeBitcounts();

return (uint)(bitCounts[i & 0xFFFF] + bitCounts[(i >> 16) & 0xFFFF]);
}```

Of course, it's possible to precompute 8-bit values and then break the 32-bit input value into 4 blocks of 8, each of which can be looked up from the table. Note that we are assuming that the lookup table can be

Version 4: The Quest for (Near) Constant Time without Memory Overhead
Getting this far is pretty good - we have a fast method albeit with some memory overhead. But what if the environment does not allow for additional memory being used for lookup tables? Is there a (near) constant time algo that doesn't have the memory overhead required by the previous solution?

The clue to this approach is mentioned above: divide-and-conquer, combined with some parallel processing. Many algorithms employ a divide-and-conquer approach and end up with running time ~O(log2 n) because they successively divide the problem in half until it cannot be divided in half any more. Such an approach is taken by binary search, balanced binary tree data structures, Fibonacci search, etc.

In our case, we know that we can partition the 32-bit integer any way we please and sum the bits of all the parts individually and easily compute the required value by summing these interim counts. Furthermore we know that some counting operations can be done in parallel using cleverly crafted bitwise operators. To see this consider the case where we want to sum adjacent bits. Bits can be 0 or 1, so the value of the sum will be 0, 1 or 2 (in decimal), which is 00, 01 or 10 in binary. In other words, we can sum 2 individual bits and store the resulting sum in those same 2 bits! This can be expressed mathematically as follows:

```a = (a & 0x55555555) + ( (a >> 1) & 0x55555555)
```

Here, the hexadecimal constant 0x55555555 represents a binary mask (01010101010101010101010101010101) that picks up every even-numbered bit. That is, the 20, 22, 24, etc positions in the binary representation. Doing a right shift by one position and applying the mask again picks up all the other, odd-numbered bit positions, 21, 23, 25... Hence the above equation sums adjacent bits and stores the results in-place.

This can be extended to sum the adjacent 4-bits, then the adjacent 8-bits, and finally the adjacent 16-bits to get our answer. All we need to do is adjust the bit mask used at each step and the magnitude of the shift operator. Here's the code to do this in C#:

```        public static uint BitsSetCountUnOptimised(this uint i)
{
i = (i & 0x55555555) + ((i >> 1) & 0x55555555);     // adjacent bits grouped
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);     // adjacent 2-bit groups paired
i = (i & 0x0F0F0F0F) + ((i >> 4) & 0x0F0F0F0F);     // adjacent 4-bit groups paired
i = (i & 0x00FF00FF) + ((i >> 8 ) & 0x00FF00FF);     // adjacent 8-bit groups paired
i = (i & 0x0000FFFF) + ((i >> 16) & 0x0000FFFF);     // adjacent 16-bit groups paired
return (i & 0x0000003F);    // a 32-bit unsigned int can have at most 32 set bits,
// which in decimal needs on 6 bits to represent
}```

Version 5: Finesse and Extreme Optimisation
We've already made some great leaps forward in efficiency and asymptotic run time but believe it or not there are further optimisations that can be made. First, note that the last logical AND is not necessary since it right shifts 16 places which would leave only 16 bits so the masking is redundant. As well, there are other logical ANDs that can safely be removed when there is no danger that a field will carry over into the adjacent field. In our case, the grouping of adjacent 4-, 8- and 16-bit groups can have redundant logical ANDs removed. Finally, the first line can be trimmed from 4 operations down to 3 by replacing it with:

x = x - (( x >> 1) & 0x55555555);

The revised code now looks like:

```        // taken from Hacker's Delight, chapter 5, pp66
public static uint BitsSetCountOptimised(this uint i)
{
i = i - ((i >> 1) & 0x55555555);                    // adjacent bits grouped
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);     // adjacent 2-bit groups paired
i = i + ((i >> 4) & 0x0F0F0F0F);                    // adjacent 4-bit groups paired
i = i + (i >> 8);                                   // adjacent 8-bit groups paired
i = i + (i >> 16);                                  // adjacent 16-bit groups pair
return (i & 0x0000003F);    // a 32-bit unsigned int can have at most 32 set bits,
// which in decimal needs on 6 bits to represent
}```

Putting this through a crude timing check we see the following results for 1 million iterations. The figures are total elapsed ticks:

```Timing for Lookup bit count UnOptimised: 224970
Timing for Lookup bit count Optimised: 214467
Timing for Lookup-based bit count: 131663
```

This shows that the lookup table-based approach is still the fastest but the optimised approach given above is not too far behind. It is elegant in that it is branch-free (branching is usually a more expensive operation) and employs parallel bit counting.

If you have an inquisitive mind you are probably already asking yourself if the function can be made even faster by working with adjacent groups bigger than 2-bits. Well, it certainly can. In fact this was the approach taken by HAKMEM Item 196 (see references below). The solution offered there used adjacent 3-bit groups. This was further extended by other researchers to work with - you guessed it - adjacent 4-bit groups. Here's some C# code to do just that:

```        // works by counting adjacent 4-bits
public static uint BitsSetCountHackMem(this uint i)
{
uint n = (i >> 1) & 0x77777777;
i = i - n;
n = (n >> 1) & 0x77777777;
i = i - n;
n = (n >> 1) & 0x77777777;
i = i - n;
i = (i + (i >> 4) & 0xF0F0F0F);
i = i * 0x01010101;
return i >> 24;
}```

So how does it compare? Because we're data-driven rather then theory-driven, let's measure it and let the data tell us:

```Timing for Lookup bit count UnOptimised: 216632
Timing for Lookup bit count Optimised: 205692
Timing for Lookup-based bit count: 104143
Timing for modified HAKMEM bit count: 83938
```

Damn, that's fast! It's on par with the lookup-table based approach and it doesn't have any memory overhead! That's good enough for me. At this point we're probably going to see diminishing returns for further optimisation efforts so let's settle on this approach.

One final bit of advice: your interviewer more than likely doesn't want a 20 minute sermon on the topic of bit counting. Think carefully before launching into a lengthy discussion upfront.
If you've dazzled the interviewer with your knowledge and reasoning skills it's more than likely they'll realise that you've seen the problem before and you'll get pushed onto the more esoteric issues of this vast topic. Here are a few you can think about in advance:

• Why can't we apply these algorithms to signed integers?
• Can you suggest some architecture-dependent optimisations?
• How would you determine what constitutes a spare/dense number prior to bit counting?
• Can you suggest an algo which is cache oblivious?
• Can you think of some examples where you might need to use bit counting?
• What is the worst case asymptopic run time of the above mentioned algorithms?
• If you were building a framework for other developers to use, and you needed to expose a BitCount() method, what would your method internals look like?

Further Research
This problem has several common names:

• Hamming Weight
• Population Count (or popCount) of Binary Values
• Sideways Sum / Sideways Addition

Bit Twiddle Basics
All of the below mentioned bit twiddling hacks have been elucidated from a magnificent book called Hacker's Delight by Henry S. Warren Jnr. If you want comprehensive coverage of low-level, highly efficient tricks on various processor architectures this is a book you must read.

1. To turn off the rightmost 1-bit in x:          x & (x-1)
2. To isolate the rightmost bit in x:                x & 1
3. To isolate the rightmost 1-bit in x:            x & (-x)
4. To isolate the rightmost 0-bit in x:            x & (x+1)
5. To create a mask for all trailing zeros:       ~x & (x-1)     (~ is the one's complement, aka bitwise NOT)
6. To turn off the k-th bit for the right in x:   x &  ~(1<<k)
7. To check if x is a power of 2:                      x & (x - 1) == 0
8. To toggle multiple bits at one: cretae a mask for the bits to toggle, M, and use:  x ^ M
9. To find the max value of a data type T, cast zero as T then complement:    ~(T)0