## Optimising Bit Counting using Iterative, Data-Driven Development

Based on several blogs and discussion boards I've seen lately, it seems that the old interview question chestnut of **bit counting** is still in vogue. I suspect it's more common at companies that seek quantitative skills in their programming staff -like investment banks, Hedge Funds, internet startups run by ex-Googlers, etc. So what is it? Bit counting is the problem of reporting the number of set (1) bits in the binary representation of an integer. In Java there is a built in static method, Integer.bitcount(), that you could use but it is interesting to examine what the internals of such a method would look like.

Many people have written on this topic, with varying degrees of accuracy and explanatory comments (a list of references comes later). I don't wish to repeat the good work others have done and I'm not proposing anything new that other smart people haven't already written about. Instead I'd like to outline a pragmatic approach that a sensible C# developer might go through in an interview situation when faced with such a question and where access to a Google search engine and the accompanying "instant-knowledge" is not available but a sensible solution needs to be delivered (hopefully it's close to optimal). By memorising the answer you might be able to rattle off a fast answer when needed, but IMHO it's far more valuable to understand the solution and more importantly the process taken to get there. i.e. Teach a (wo)man to fish, rather than give a (wo)man a fish. Alas, if you are after a fast answer, or a link to great reference material, skip to the end for instant gratification.

**Motivation**

You might ask yourself why people ask such interview questions when the answers can be found out using a search engine is no time. It's because they want to:

a) examine your logical approach to a problem they hope you haven't seen before; and

b) to find out the level of understanding you have about a particular problem if indeed you have seen it before.

Bit Counting has many various solutions so even if you've heard this one before your interviewer can push you deeper and deeper into the problem until you hit unfamiliar ground and are forced to think on your feet - which is really the ultimate goal of all interview puzzle problems.

**Version 1: The "Naive" Solution**

Firstly, we try to formulate an approach that yields *the right answer* using the most obvious approach. The solution that most people will think of first is to convert the number to binary notation and then walk through the binary digits summing the ones that are set. Here is some C# code showing an extension method to do just that:

public static uint BitsSetCountNaive(this uint input) { uint count = 0; while (input != 0) { count += input & 1; // logical AND on the LSB input >>= 1; // do a bitwise shift to the right to create a new LSB } return count; }

Notice that:

- We don't need to do anything explicit to convert the number from an integer into binary representation.
- Since pass-by-value is the default in C#, we can re-use the UInt32 variable, input, internally without affecting the caller's version of this variable.
- Figuring out if a bit is set or not is accomplished by doing a bitwise AND (&) of the input value with 1 which has the effect of isolating and returning the
*least significant bit*, and conveniently returning the value we need to increment the counter by. Doing such bitwise AND operations is often called*masking*. - Looping through each binary digit is accomplished by doing a
*bitwise shift*to the right 1 place. This just pushes all bits to the right 1 position and inserts zeros at the*most significant bit (big)*end. - For 32-bit unsigned integers that have lots of leading zeros in their binary representation, the loop terminates before iterating through all 32 bits. This is because the right shift operation after counting the leftmost non-zero bit produces zero which triggers the while{} exit condition.

In effect, this is a mask-add-shift loop which has no real memory overhead and running time of O(n) where *n* is the number of binary digits in the binary representation of the given integer. This approach is commonly referred to as the "Iterated Count". Note that running time is said to be O(n) despite the fact that the clever while loop will exit after processing the last non-zero bit. This is because asymptotic analysis refers to worst case (the *upper bound*), which in this case is all *n* bits hence O(n).

Because of (4) above, this approach would work best when most of the set-bits are among the least significant bits.

**Version 2: Tightening the Loop**

**Now we have a basic, naive solution we can iterate on it (in an Agile sense) to improve it's performance. The inital solution had running time of O(n) because we have to loop through all n binary digits. We could obviously improve that if we could somehow jump from set bit to set bit. How do we do this? Again we can use a simple trick with bitwise operators:**

public static uint BitsSetCountWegner(this uint input) { uint count; for (count = 0; input!=0; count++) { input &= input - 1; // turn off the rightmost 1-bit } return count; }

Notice that:

- We use the formula:
**x =****x & (x-1)**to turn off the rightmost set (1) bit. - The loop now iterates on the counter and we jump from set bit (1) to set bit, exiting when there are no more set bits.
- The function is called "...Wegner" after the guy who originally came up with it but more on that later. Brian Kernighan from K&R C fame also mentions this approach in one of his books so many people incorrectly attribute this to him.

This uses no more memory than version 1 (both have O(1) space complexity), but should yield better running times since we don't enumerate through all binary digits, just the set ones. Running time is ~O(b) where *b* is the number of set (1) bits in the binary representation of the given integer, and common sense tells us that this approach would be useful when there are few set bits in the input. i.e when the input is **sparse**. For this reason, this approach is sometimes called the "Sparse Ones" approach.

**Version 2b: From Sparse to Dense**

The astute programmer will intuitively realise that if there exists an algo optimised for sparse input then there must be a complementary one optimised for dense input. "Dense" input, in this case, refers to a binary number which has a large number of set (1) bits relative to the number of 0 bits. It is shown below for the sake of completeness:

public static uint BitsSetCountDense(this uint input) { uint count; for (count = 0; input != UInt32.MaxValue; count++) { input |= input + 1; // turn on the rightmost 0-bit } return 32-count; }

Here we repetitively turn ON the rightmost 0-bit until all bits are set to 1 in which case the input would equal UInt32.MaxValue. Here the "magic" bitwise operation used to turn on the rightmost 0-bit is: **x =x | (x+1)**.

Alternately, the original number can be *complemented* (all bits are flipped), and we can count down from the max theoretical bit count, 32 as follows:

public static uint BitsSetCountDenseAlternate(this uint input) { uint count; input = ~input; // complement (flips all bits) for (count = 32; input != 0; count--) { input &= input - 1; // turn off the rightmost 1-bit } return count; }

Sparse Ones and Dense Ones were first described by Peter Wegner in “A Technique for Counting Ones in a Binary Computer”, Communications of the ACM, Volume 3 (1960) Number 5, page 322.

**Evaluation of V1 and V2**

Regardless of what asymptotic anaylsis tells us, it's prudent to measure and compare the alternatives since we are **data-driven**. Accordingly, we could write a quick test case as follows:

static void Main() { const int LoopMax = 1000000; // test code for bit counting uint xx = 3160637183; // test case 1 Stopwatch sw = new Stopwatch(); sw.Start(); uint ans = 0; for (int l = 0; l < LoopMax; l++) ans = xx.BitsSetCountNaive(); Console.WriteLine(ans); Console.WriteLine(String.Format("Timing for Naive bit count: {0}", sw.ElapsedTicks)); // test case 2 sw = new Stopwatch(); sw.Start(); for (int l = 0; l < LoopMax; l++) ans = xx.BitsSetCountWegner(); Console.WriteLine(ans); Console.WriteLine(String.Format("Timing for Wegner bit count: {0}", sw.ElapsedTicks)); Console.ReadLine(); }

This test is very crude in that it doesn't use a variety of inputs. However, we know that the algorithm proposed in Version 2 should definitely outperform the Ver.1 algo when there is only a few set (1) bits in the binary number. So our choice of 3160637183, which has 23 set bits out of 32 bits in total, should not favour the Ver. 2 algo very much. That said, the test figures we get are good enough to give us a relative measure:

23 Timing for Naive bit count: 588842 23 Timing for Wegner bit count: 451542

The data backs up our asymptotic analysis. We see a 25% improvement from Ver. 1 (Naive) to Ver. 2 (Wegner) which correlates to scanning 23 bits rather than 32 bits.

**Version 3: The Penny has Dropped - Lookup Tables**

A 32-bit unsigned integer repesents a finite number of integer values, 2^{32} = 4,294,967,296 distinct values. The number of bits set in each of these integer representations does not change so it would be possible to precompute these values, or to use *memoization */caching to store them after first use. If we have some sort of lookup table we could get *constant run time*, at the expense of space complexity. The feasibility of this approach depends on the environment and where the resource bottlenecks are. Often an interviewer will qualify the question with something like "you have unlimited memory storage". That's your queue to employ a lookup table. But what sort of lookup table should be used?

We need an 8-bit **byte **data type in C# to represent values in the range 0-32. Storing over 4 billion of these in a lookup array would consume 4 GB of memory. Since summing the 1-bits in a 32 bit number is the same as summing the 1-bits in both the lower and upper 16 bits separately and then adding them together, we can employ a divide-and-conquer approach here and use a byte array lookup table that only has 2^{16} = 65,536 distinct values. This would consume only 64 KB of memory, a much more acceptable overhead. Here's what the code looks like:

static readonly byte[] bitCounts = new byte[UInt16.MaxValue]; private static bool IsInitialized; // will be false by default static public void InitializeBitcounts() { for (uint i = 0; i < UInt16.MaxValue; i++) { // Get the bitcount using any method bitCounts[i] = (byte)i.BitsSetCountWegner(); } IsInitialized = true; } public static uint BitsSetCountUsingLookup(this uint i) { if (!IsInitialized) InitializeBitcounts(); return (uint)(bitCounts[i & 0xFFFF] + bitCounts[(i >> 16) & 0xFFFF]); }

Of course, it's possible to precompute 8-bit values and then break the 32-bit input value into 4 blocks of 8, each of which can be looked up from the table. Note that we are assuming that the lookup table can be

**Version 4: The Quest for (Near) Constant Time without Memory Overhead**

Getting this far is pretty good - we have a fast method albeit with some memory overhead. But what if the environment does not allow for additional memory being used for lookup tables? Is there a (near) constant time algo that doesn't have the memory overhead required by the previous solution?

The clue to this approach is mentioned above: divide-and-conquer, combined with some parallel processing. Many algorithms employ a **divide-and-conquer** approach and end up with running time ~O(log_{2} n) because they successively divide the problem in half until it cannot be divided in half any more. Such an approach is taken by binary search, balanced binary tree data structures, Fibonacci search, etc.

In our case, we know that we can partition the 32-bit integer any way we please and sum the bits of all the parts individually and easily compute the required value by summing these interim counts. Furthermore we know that some counting operations can be done in parallel using cleverly crafted bitwise operators. To see this consider the case where we want to sum adjacent bits. Bits can be 0 or 1, so the value of the sum will be 0, 1 or 2 (in decimal), which is 00, 01 or 10 in binary. In other words, we can sum 2 individual bits and store the resulting sum in those same 2 bits! This can be expressed mathematically as follows:

a = (a & 0x55555555) + ( (a >> 1) & 0x55555555)

Here, the hexadecimal constant 0x55555555 represents a binary mask (01010101010101010101010101010101) that picks up every even-numbered bit. That is, the 2^{0}, 2^{2}, 2^{4}, etc positions in the binary representation. Doing a right shift by one position and applying the mask again picks up all the other, odd-numbered bit positions, 2^{1}, 2^{3}, 2^{5}... Hence the above equation sums adjacent bits and stores the results in-place.

This can be extended to sum the adjacent 4-bits, then the adjacent 8-bits, and finally the adjacent 16-bits to get our answer. All we need to do is adjust the bit mask used at each step and the magnitude of the shift operator. Here's the code to do this in C#:

public static uint BitsSetCountUnOptimised(this uint i) { i = (i & 0x55555555) + ((i >> 1) & 0x55555555); // adjacent bits grouped i = (i & 0x33333333) + ((i >> 2) & 0x33333333); // adjacent 2-bit groups paired i = (i & 0x0F0F0F0F) + ((i >> 4) & 0x0F0F0F0F); // adjacent 4-bit groups paired i = (i & 0x00FF00FF) + ((i >> 8 ) & 0x00FF00FF); // adjacent 8-bit groups paired i = (i & 0x0000FFFF) + ((i >> 16) & 0x0000FFFF); // adjacent 16-bit groups paired return (i & 0x0000003F); // a 32-bit unsigned int can have at most 32 set bits, // which in decimal needs on 6 bits to represent }

**Version 5: Finesse and Extreme Optimisation**

We've already made some great leaps forward in efficiency and asymptotic run time but believe it or not there are further optimisations that can be made. First, note that the last logical AND is not necessary since it right shifts 16 places which would leave only 16 bits so the masking is redundant. As well, there are other logical ANDs that can safely be removed when there is no danger that a field will carry over into the adjacent field. In our case, the grouping of adjacent 4-, 8- and 16-bit groups can have redundant logical ANDs removed. Finally, the first line can be trimmed from 4 operations down to 3 by replacing it with:

x = x - (( x >> 1) & 0x55555555);

The revised code now looks like:

// taken from Hacker's Delight, chapter 5, pp66 public static uint BitsSetCountOptimised(this uint i) { i = i - ((i >> 1) & 0x55555555); // adjacent bits grouped i = (i & 0x33333333) + ((i >> 2) & 0x33333333); // adjacent 2-bit groups paired i = i + ((i >> 4) & 0x0F0F0F0F); // adjacent 4-bit groups paired i = i + (i >> 8); // adjacent 8-bit groups paired i = i + (i >> 16); // adjacent 16-bit groups pair return (i & 0x0000003F); // a 32-bit unsigned int can have at most 32 set bits, // which in decimal needs on 6 bits to represent }

Putting this through a crude timing check we see the following results for 1 million iterations. The figures are total elapsed ticks:

Timing for Lookup bit count UnOptimised: 224970 Timing for Lookup bit count Optimised: 214467 Timing for Lookup-based bit count: 131663

This shows that the lookup table-based approach is still the fastest but the optimised approach given above is not too far behind. It is elegant in that it is branch-free (branching is usually a more expensive operation) and employs parallel bit counting.

If you have an inquisitive mind you are probably already asking yourself if the function can be made even faster by working with adjacent groups bigger than 2-bits. Well, it certainly can. In fact this was the approach taken by HAKMEM Item 196 (see references below). The solution offered there used adjacent 3-bit groups. This was further extended by other researchers to work with - you guessed it - adjacent 4-bit groups. Here's some C# code to do just that:

// works by counting adjacent 4-bits public static uint BitsSetCountHackMem(this uint i) { uint n = (i >> 1) & 0x77777777; i = i - n; n = (n >> 1) & 0x77777777; i = i - n; n = (n >> 1) & 0x77777777; i = i - n; i = (i + (i >> 4) & 0xF0F0F0F); i = i * 0x01010101; return i >> 24; }

So how does it compare? Because we're data-driven rather then theory-driven, let's measure it and let the data tell us:

Timing for Lookup bit count UnOptimised: 216632 Timing for Lookup bit count Optimised: 205692 Timing for Lookup-based bit count: 104143 Timing for modified HAKMEM bit count: 83938

Damn, that's fast! It's on par with the lookup-table based approach and it doesn't have any memory overhead! That's good enough for me. At this point we're probably going to see diminishing returns for further optimisation efforts so let's settle on this approach.

One final bit of advice: your interviewer more than likely doesn't want a 20 minute sermon on the topic of bit counting. Think carefully before launching into a lengthy discussion upfront.

**Advanced Topics**

If you've dazzled the interviewer with your knowledge and reasoning skills it's more than likely they'll realise that you've seen the problem before and you'll get pushed onto the more esoteric issues of this vast topic. Here are a few you can think about in advance:

- Why can't we apply these algorithms to signed integers?
- Can you suggest some architecture-dependent optimisations?
- How would you determine what constitutes a spare/dense number prior to bit counting?
- Can you suggest an algo which is
*cache oblivious*? - Can you think of some examples where you might need to use bit counting?
- What is the
*worst case*asymptopic run time of the above mentioned algorithms? - If you were building a framework for other developers to use, and you needed to expose a BitCount() method, what would your method internals look like?

**Further Research**

This problem has several common names:

- Hamming Weight
- Population Count (or popCount) of Binary Values
- Sideways Sum / Sideways Addition

**Useful References**

Hackers Delight, Chapter 5

Bitwise Tricks and Techniques by Don Knuth (The Art of Computer Programming, Part IV)

Sean Anderson's Bit Twiddle Hacks

HAKMEM Item 169: MIT AI Memo 239, Feb. 29, 1972

Discussion of the Problem on StackOverflow

Gurmeet Singh's Weblog with C code

Jesse Farmer's Blog

Magic Algorithms

**Bit Twiddle Basics**

All of the below mentioned bit twiddling hacks have been elucidated from a magnificent book called **Hacker's Delight** by Henry S. Warren Jnr. If you want comprehensive coverage of low-level, highly efficient tricks on various processor architectures this is a book you must read.

- To turn off the rightmost 1-bit in x: x & (x-1)
- To isolate the rightmost bit in x: x & 1
**To isolate the rightmost 1-bit in x: x & (-x)**

- To isolate the rightmost 0-bit in x: x & (x+1)
- To create a mask for all trailing zeros: ~x & (x-1) (~ is the one's complement, aka bitwise NOT)
- To turn off the
*k*-th bit for the right in x: x & ~(1<<k) - To check if x is a power of 2: x & (x - 1) == 0
- To toggle multiple bits at one: cretae a mask for the bits to toggle, M, and use: x ^ M
- To find the max value of a data type T, cast zero as T then complement: ~(T)0

*14 Apr 2009*
*Damien Wintour*

on 30 May 2012 at 4:50 am 1Calculating the number of bits in a Subnet Mask in C# | PHP Developer Resource[...] Bit counting algorithm taken from: http://www.necessaryandsufficient.net/2009/04/optimising-bit-counting-using-iterative-data-driven-de... [...]