Comparing two bitstrings

An often used technique to model genes, chromosomes, or DNA-strings is by making use of a bitstring. A bitstring is a long list of zeroess and ones. A naive way of implementing this would be as the following:

#include < vector> std::vector< bool> bitString; for(int i = 0; i < bitString_size; ++i) bitString.push_back( std::rand() % 2);

This, however, is relatively slow, especially when performing calculations on this vector. A much faster way is by using boosts "dynamic bitset" property. This is an array of booleans (1 or 0), optimized for only being 0 or 1. Initializing a bitset is more or less identical:

#include < boost/dynamic_bitset.hpp> boost::dynamic_bitset<> bitSet; for(int i = 0; i < bitString_size; ++i) bitSet.push_back( std::rand() % 2);

Often, the number of different bits between two bitsets has to be counted, in order to calculate the genetic similarity between two "genomes". When using vectors, one might write the following code to calculate these differences:

#include < vector> #include < iostream> int calcDiff(const std::vector& focal, const std::vector & other) { int differences = 0; if(focal.size() == other.size()) { for(int i = 0; i < focal.size(); ++i) if(focal[i] != other[i]) differences++; } else std::cout << "genomes mismatch!\n"; return differences; }

If we now use the dynamic bitset, we can easily and quickly compare two bitstrings, and count the differences between them. When using vectors with booleans, the bitstring will have to compared for every element, a computationally heavy task. Dynamic bitsets however can make use of lower level arythmetic, relieving the computer from quite some computational load:

#include < boost/dynamic_bitset.hpp> int calcDiff(const boost::dynamic_bitset<>& focal, const boost::dynamic_bitset<>& other) { boost::dynamic_bitset<> temp; temp = focal ^ other; return temp.count(); }

Here I have compared the performance of both algorithms, compared to the performance of calculating differences in a vector, using the above described "sequential" algorithm. All times are scaled to the time needed to do 1e4 repetitions of calculating differences between two vectors of length 100.

As we see from the figure, using a vector wil exponentially increase the computing time with size of the vector (note that the axes are 10_log transformed). Using a dynamic bitset does not increase as fast, although the difference does get smaller as the size of the vector increases past 1e6. Performance increase is still roughly 10x (even >40x for vector of size 1e5!)

Web Site (C) Thijs Janzen 2011-2014

Lake Tanganyika pictures courtesy Jen Reynolds