## Billions of Integers per Second with SIMD

Came across some good work from Daniel Lemire, CS Professor at University of Quebec, for highly-efficient, fast compression/encoding/decoding of integers. Full details here:

Decoding billions of integers per second through vectorization

It's a very nice use of vectorisation and Streaming SIMD Extensions (SSE2/3/4) in modern chipsets. Their SIMD-BP128 vectorised scheme makes use of vectorisation for bit packing prior to compress/decompress operations, and also for the delta encoding phase where, instead of storing the native integers, the differences are stored, with these differences calculated in a vectorised manner. Their scheme is not patented, and they have posted their code on github. You can find C, C++ and Java libraries for usage.

For the specific case of sorted integers, they have published a separate paper, given below, that also applies algorithms based on available SIMD instructions to boost compression speed. They claim their S4-BP128-D4 scheme "uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression".

SIMD Compression and the Intersection of Sorted Integers

Conveniently they provide the code for it on github: https://github.com/lemire/SIMDCompressionAndIntersection.

In finance most market data is effectively an append-only, immutable time-series where every record has a new timestamp, but timestamps are just long offsets from a time epoch (such as midnight 1/1/1900). Thus, using columnar storage, we get a column of "time" values which is a vector of non-descending integers/longs. These are very amenable to delta-encoding or some more specialised form of sorted integer compression.

## Ensemble Models in Machine Learning

Ensemble models have been around for a while. In essence co-operative ensemble models take many independent predictive models and combine their results for greater accuracy and generalisation. Think "the wisdoms of crowds" applied to predictive models.

For example, the well known Netflix prize, where competitors had to advance the existing RMSE scores on a movie recommendation challenge, was won by a team that employed a combination of 107 different models. Whilst they noted in the detailed write-up of their approach that they could have dramatically cut it back to far fewer models, they do note that even the reduced combination of models that yielded an 88% RMSE involved numerous approaches: k-NN, factorisation, restricted Boltzmann Machines (RBMs) and asymmetric factor models.

Whilst there are no rigorous guidelines for what models to use as constituent models - that is the "weak learners", the ideal scenario would be a set of models where each of the models generalise well, and when they do make errors on new data, these errors are not shared with any other models. i.e the models are not strongly correlated. In other words, diversity matters!

Enter the Kaggle Homesite Quote Conversion challenge. This open challenge hosted by Kaggle.com was a binary classification problem to predict which customers would purchase a given home insurance quote, thus "converting" them in marketing terms. More interestingly is that the top 2 entries have both provided detailed write up of their approaches, with both teams making use of ensemble learning.

Here is a detailed write up of the winners submission. They started with 500 models before using 125 of these in an 3-meta layer ensemble. Unsurprisingly they put a lot of effort into identifying good algorithms/parameters for single model performance, as well investigating options to ensure model diversity. Their approach is depicted below:

Here is a detailed write-up of the submission from the 2nd place team. They used 600 base models, of which 500 were created via a robot that automatically trained models using a variety of techniques including gradient boosting (XGBoost), logistic regression and neural networks, all on randomly chosen features. They used 3 different types of blenders all of which outperformed the best single model. Their approach is depicted below:

Ensemble techniques seem to be rather prevalent on Kaggle amongst the winning submissions. If you want to know more about employing this approach on Kaggle check out the Kaggle ensembling guide.

Ensemble learning... some may call it data snooping, others call it progress.

## Size Matters: Empirical Evidence of the Importance of Training Set Size in Machine Learning

There is much hype around "big data" these days - how it's going to change the world - which is causing data scientists to get excited about big data analytics, and technologists to scramble to understand how they employ scalable, distributed databases and compute clusters to store and process all this data.

Interestingly, Gartner dropped "big data" out of their infamous hype cycles charts in 2015...

Gartner’s 2012 Hype Cycle for Emerging Technology

Gartner’s 2013 Hype Cycle for Emerging Technology

Gartner’s 2014 Hype Cycle for Emerging Technology

so how useful is big data going to be? Not to mention the obvious question of what constitutes "big data".

Clearly acquiring and storing data alone produces no meaningful benefit to the organisation collecting/housing the data. However, "monetising the data" is where the quants and data scientists come in. Typically, the value-add from quants is often thought to come from sophisticated models crafted and tuned by extremely clever PhD-laden mathematicians with deep knowledge in a particular field. That might or might not be true, but it certainly raises the question...

Is the quality of the algorithm used in the model the most important ingredient?

If this were true, then:

• organisations should put the bulk of their effort into developing the best models possible;
• organisations employing the best data scientists should therefore be better equipped to monetise big data;
• organisations need not purse enormous data capture efforts.

BUT with all this hype around big data it is prudent to consider the relative importance of the size of training data available to quants.

In this paper by Banko and Brill, 2001 two researchers from Microsoft investigated how important training set size was in building a model for a problem from the NLP domain, confusion set disambiguation. Quoting the authors, "Confusion set disambiguation is the problem of choosing the correct use of a word, given a set of words with which it is commonly confused. Example confusion sets include: {principle, principal}, {then, than}, {to,two,too}, and {weather,whether}." The authors reviewed a variety of approaches which were considered state-of-the-art at the time of publication (2001) and examined model performance, as measured by accuracy, when various training set sizes were employed in those models. The chart below, extracted from their research paper, shows some interesting observations...

As the training set size, measured on the X-axis, increases by an order of magnitude we can see that even the worst-performing model often produces greater accuracy than the best-performing model that had less training data.

Whilst this paper is for a specific problem in a specific domain, it is widely recognised that across a variety of machine learning fields that a "dumber" model with more data will outperform a smarter model that has less data.

On this very topic, Pedro Domingos, a well-respected and leading researcher in machine learning, published a paper on A Few Useful Things to Know about Machine Learning where in Section 9, titled "More Data Beats a Cleverer Algorithms" he notes...

"... pragmatically the quickest path to success is often to just get more data. As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.) This does bring up another problem, however: scalability. In most of computer science, the two main limited resources are time and memory. In machine learning, there is a third one: training data. Which one is the bottleneck has changed from decade to decade. In the 1980’s it tended to be data. Today it is often time. Enormous mountains of data are available, but there is not enough time to process it, so it goes unused. This leads to a paradox: even though in principle more data means that more complex classifiers can be learned, in practice simpler classifiers wind up being used, because complex ones take too long to learn. Part of the answer is to come up with fast ways to learn complex classifiers, and indeed there has been remarkable progress in this direction."

The comments from Professor Domingos give us an insight into the evolution of learning systems based on the availability of (big) data and compute clusters:

1. Researchers add more data to improve the performance of a learning algorithm;
2. As training set size increases for a given model, training TIME also increases;
3. Researchers turn to more efficient learning algorithms to reduce the training time;
4. The availability of large, cost-effective compute grids in the cloud, and HPC technologies like GPUs allow researchers to deploy even "bigger" models (more models, more features).

### The Rise of Deep Learning

Indeed the above cycle has led to the rise of deep learning. The scale of available data and processing capacity is enabling large models, often Neural Networks, to train on large amounts of data, with sophisticated tools that allow researchers to still do experiments with reasonably short feedback loops. With near-unlimited data and compute power it becomes more important to pick models that scale well with available training data, and the current sentiment in academia is that deep-learning is the approach which scales best (see second image from a recent presentation by Andrew Ng).

Taken from this video, Andrew Ng has a nice picture explaining the rise of Deep Learning:

Next »