## Reverse Bits

https://leetcode.com/problems/reverse-bits/

Reverse the bits of a given 32-bit unsigned integer.

For example, given input 43261596 (represented in binary as 00000010100101000001111010011100), return 964176192 (represented in binary as 00111001011110000010100101000000).

If this function is called many times, how would you optimize it?

This question is closely related to Number of 1 Bits.

Solution 1:

```
#include <cstdint>

class Solution {
public:
    uint32_t reverseBits(uint32_t n)
    {
        uint32_t value = 0;
        for (uint32_t i = 0; i < 32; ++i)
        {
            // Read bit (31 - i) of n, counting from the least significant bit...
            uint32_t bit = (n >> (31 - i)) & 1u;
            // ...and write it to position i of the result.
            value |= bit << i;
        }
        return value;
    }
};
```

Solution 2:

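```
#include <cstdint>

uint32_t reverse(uint32_t x)
{
    // Swap progressively larger groups: single bits, pairs, nibbles, bytes, halves.
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);  // adjacent bits
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);  // adjacent 2-bit pairs
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x & 0x0f0f0f0fu) << 4);  // adjacent nibbles
    x = ((x >> 8) & 0x00ff00ffu) | ((x & 0x00ff00ffu) << 8);  // adjacent bytes
    x = ((x >> 16) & 0xffffu) | ((x & 0xffffu) << 16);        // 16-bit halves
    return x;
}
```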
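For the follow-up question about repeated calls, one common option is to precompute the reversal of every possible byte once and then assemble each answer from four table lookups. The following is a minimal sketch of that idea, not part of the original solutions; the names `reverseByte`, `buildTable` and `reverseBitsTable` are made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: reverse the 8 bits of one byte with a plain loop.
static std::uint8_t reverseByte(std::uint8_t b)
{
    std::uint8_t r = 0;
    for (int i = 0; i < 8; ++i)
    {
        r = static_cast<std::uint8_t>((r << 1) | (b & 1));
        b = static_cast<std::uint8_t>(b >> 1);
    }
    return r;
}

// Hypothetical helper: build the 256-entry table of reversed bytes.
static std::array<std::uint8_t, 256> buildTable()
{
    std::array<std::uint8_t, 256> t{};
    for (std::size_t i = 0; i < 256; ++i)
        t[i] = reverseByte(static_cast<std::uint8_t>(i));
    return t;
}

std::uint32_t reverseBitsTable(std::uint32_t n)
{
    // The table is built on the first call and reused afterwards.
    static const std::array<std::uint8_t, 256> table = buildTable();

    // The reversed lowest byte becomes the highest byte, and so on.
    return (static_cast<std::uint32_t>(table[n & 0xffu]) << 24) |
           (static_cast<std::uint32_t>(table[(n >> 8) & 0xffu]) << 16) |
           (static_cast<std::uint32_t>(table[(n >> 16) & 0xffu]) << 8) |
           static_cast<std::uint32_t>(table[(n >> 24) & 0xffu]);
}
```

Whether this beats Solution 2 in practice depends on the cache behaviour of the surrounding code, so it is worth measuring rather than assuming.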

## Number of 1 Bits

https://leetcode.com/problems/number-of-1-bits/

Write a function that takes an unsigned integer and returns the number of '1' bits it has (also known as the Hamming weight).

For example, the 32-bit integer '11' has binary representation `00000000000000000000000000001011`, so the function should return 3.

The straightforward solution looks like this:

```
#include <cstdint>

class Solution {
public:
    int hammingWeight(uint32_t n)
    {
        int count = 0;
        while (n)
        {
            count += n & 1u;  // add the lowest bit
            n >>= 1;          // then shift it out
        }
        return count;
    }
};
```

A search on Stack Overflow turns up an interesting answer:

This is known as the ‘Hamming weight’, ‘popcount’ or ‘sideways addition’.

The ‘best’ algorithm really depends on which CPU you are on and what your usage pattern is.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. The parallel instructions will almost certainly be fastest. The single-instruction algorithms, however, are ‘usually microcoded loops that test a bit per cycle; a log-time algorithm coded in C is often faster’.
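As a rough illustration of relying on a built-in, the sketch below uses `__builtin_popcount`, which GCC and Clang provide, and falls back to the plain loop on other compilers. The wrapper name `popcountBuiltin` is made up for this example. In C++20, `std::popcount` from `<bit>` is the standard way to ask for the same thing.

```cpp
#include <cstdint>

// Hypothetical wrapper: count set bits with the compiler builtin when available.
int popcountBuiltin(std::uint32_t n)
{
#if defined(__GNUC__) || defined(__clang__)
    // GCC/Clang builtin; the compiler may emit a hardware popcount
    // instruction when the target supports it.
    return __builtin_popcount(n);
#else
    // Portable fallback: test one bit per iteration.
    int count = 0;
    while (n)
    {
        count += static_cast<int>(n & 1u);
        n >>= 1;
    }
    return count;
#endif
}
```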

A pre-populated table lookup method can be very fast if your CPU has a large cache and/or you are doing lots of these instructions in a tight loop. However it can suffer because of the expense of a ‘cache miss’, where the CPU has to fetch some of the table from main memory.
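A minimal sketch of the table lookup method, assuming a 256-entry table indexed by byte (other table sizes are possible). The names `buildBitCountTable` and `hammingWeightTable` are made up for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: table[i] holds the number of set bits in the byte i.
static std::array<std::uint8_t, 256> buildBitCountTable()
{
    std::array<std::uint8_t, 256> t{};  // t[0] starts at 0
    // The count for i is its lowest bit plus the count for i / 2,
    // which has already been filled in.
    for (std::size_t i = 1; i < 256; ++i)
        t[i] = static_cast<std::uint8_t>((i & 1u) + t[i >> 1]);
    return t;
}

int hammingWeightTable(std::uint32_t n)
{
    // Built on the first call, reused afterwards.
    static const std::array<std::uint8_t, 256> table = buildBitCountTable();

    // Sum the counts of the four bytes.
    return table[n & 0xffu] +
           table[(n >> 8) & 0xffu] +
           table[(n >> 16) & 0xffu] +
           table[(n >> 24) & 0xffu];
}
```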

If you know that your bytes will be mostly 0’s or mostly 1’s then there are very efficient algorithms for these scenarios.
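One well-known technique for the mostly-0 case, often attributed to Brian Kernighan, clears the lowest set bit on each iteration, so the loop runs once per set bit rather than once per bit position. The quoted answer does not say which algorithms it has in mind, so treat this as one possible example. The function names are made up, and the mostly-1 variant simply counts the zeros of the complement.

```cpp
#include <cstdint>

// Hypothetical name: efficient when few bits are set.
int hammingWeightSparse(std::uint32_t n)
{
    int count = 0;
    while (n)
    {
        n &= n - 1;  // clears the lowest set bit
        ++count;
    }
    return count;
}

// Hypothetical name: efficient when most bits are set.
int hammingWeightDense(std::uint32_t n)
{
    // Count the zero bits via the complement and subtract from the width.
    return 32 - hammingWeightSparse(~n);
}
```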

I believe a very good general purpose algorithm is the following, known as the ‘parallel’ or ‘variable-precision SWAR’ algorithm. I have expressed this in a C-like pseudo language; you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):

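```
int NumberOfSetBits(int i)
{
    i = i - ((i >> 1) & 0x55555555);                 // each 2-bit field holds its own bit count
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // each 4-bit field holds its own bit count
    // Combine into per-byte counts, then sum the four bytes into the top byte.
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
```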

It has the best worst-case behaviour of any of the algorithms discussed, so it deals efficiently with any usage pattern or values you throw at it.
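Following the answer's own advice to use uint32_t in C++, here is a sketch of how the SWAR routine above might be adapted. Using an unsigned type avoids the sign-extending right shifts and the signed overflow that the `int` version can hit. The name `numberOfSetBits` is chosen for this example.

```cpp
#include <cstdint>

// C++ adaptation of the SWAR pseudocode, using an unsigned 32-bit type.
int numberOfSetBits(std::uint32_t i)
{
    i = i - ((i >> 1) & 0x55555555u);                  // counts per 2-bit field
    i = (i & 0x33333333u) + ((i >> 2) & 0x33333333u);  // counts per 4-bit field
    i = (i + (i >> 4)) & 0x0f0f0f0fu;                  // counts per byte
    // Multiplying by 0x01010101 sums the four byte counts into the top byte.
    return static_cast<int>((i * 0x01010101u) >> 24);
}
```

For the example input 11 (binary 1011) this returns 3, matching the loop-based solution.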

References:

http://graphics.stanford.edu/~seander/bithacks.html

http://en.wikipedia.org/wiki/Hamming_weight

http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/

http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count)