Archive for June, 2016


Billions of Integers per Second with SIMD

Came across some good work from Daniel Lemire, a computer science professor at the University of Quebec, on highly efficient compression, encoding and decoding of integers. Full details here:

Decoding billions of integers per second through vectorization

It's a very nice use of vectorisation via the Streaming SIMD Extensions (SSE2/3/4) found in modern chipsets. Their SIMD-BP128 scheme uses vectorised bit packing for the compress/decompress operations, and also vectorises the delta-encoding phase: instead of storing the native integers, it stores the differences between consecutive values, with those differences computed using SIMD instructions. Their scheme is not patented, and they have posted their code on GitHub, where you can find C, C++ and Java libraries.
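To make the delta-encoding idea concrete, here is a minimal sketch of computing 32-bit deltas four lanes at a time with SSE intrinsics. It is my own illustration rather than the library's code; the function name is made up and, for brevity, it assumes the array length is a multiple of four.

#include <cstddef>
#include <cstdint>
#include <emmintrin.h>   // SSE2
#include <tmmintrin.h>   // SSSE3: _mm_alignr_epi8

// Sketch: replace each value with its difference from the previous one,
// processing four 32-bit integers per iteration. Assumes n % 4 == 0; a real
// implementation also handles remainders and other delta variants.
void delta_encode_sse(uint32_t* data, std::size_t n) {
    __m128i prev = _mm_setzero_si128();  // stands in for "data[-1] == 0"
    for (std::size_t i = 0; i < n; i += 4) {
        __m128i curr = _mm_loadu_si128(reinterpret_cast<__m128i*>(data + i));
        // Concatenate prev:curr and shift right by 12 bytes, giving the lanes
        // [data[i-1], data[i], data[i+1], data[i+2]].
        __m128i shifted = _mm_alignr_epi8(curr, prev, 12);
        // Four subtractions at once: delta = data[i] - data[i-1].
        __m128i delta = _mm_sub_epi32(curr, shifted);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(data + i), delta);
        prev = curr;
    }
}

For sorted or slowly growing sequences the resulting deltas are small, so they can be bit-packed into far fewer than 32 bits each, which is where the SIMD-BP128 packing step comes in.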

For the specific case of sorted integers, they have published a separate paper, given below, that again applies algorithms built on the available SIMD instructions to boost compression speed. They claim their S4-BP128-D4 scheme "uses as little as 0.7 CPU cycles per decoded integer while still providing state-of-the-art compression".

SIMD Compression and the Intersection of Sorted Integers

Conveniently, they provide the code for it on GitHub: https://github.com/lemire/SIMDCompressionAndIntersection.

In finance, most market data is effectively an append-only, immutable time series in which every record carries a new timestamp, and timestamps are just long-integer offsets from a time epoch (such as midnight on 1/1/1900). With columnar storage, the "time" column is therefore a vector of non-decreasing integers/longs, which is very amenable to delta encoding or some more specialised form of sorted-integer compression.
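As a toy illustration (the tick times below are made up), the deltas of such a timestamp column are tiny compared with the raw epoch offsets, so they pack into very few bits:

#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Millisecond offsets from an epoch for successive ticks (made-up values).
    std::vector<int64_t> times{3600000000LL, 3600000004LL, 3600000004LL,
                               3600000017LL, 3600000023LL};

    // Delta-encode the column; std::adjacent_difference keeps the first
    // element as-is and stores differences for the rest.
    std::vector<int64_t> deltas(times.size());
    std::adjacent_difference(times.begin(), times.end(), deltas.begin());

    for (int64_t d : deltas) std::cout << d << ' ';  // 3600000000 4 0 13 6
    std::cout << '\n';
}

Everything after the first value fits in a handful of bits, which is exactly the case the sorted-integer schemes above are designed to exploit.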


Ensemble Models in Machine Learning

Ensemble models have been around for a while. In essence, co-operative ensemble models take many independent predictive models and combine their results for greater accuracy and generalisation. Think "the wisdom of crowds" applied to predictive models.

For example, the well-known Netflix Prize, where competitors had to improve on the existing RMSE score for a movie recommendation challenge, was won by a team that employed a combination of 107 different models. Whilst they noted in the detailed write-up of their approach that they could have cut back dramatically to far fewer models, even the reduced combination that yielded an 88% RMSE involved numerous approaches: k-NN, factorisation, restricted Boltzmann machines (RBMs) and asymmetric factor models.

Whilst there are no rigorous guidelines on which models to use as the constituent models - that is, the "weak learners" - the ideal scenario would be a set of models where each model generalises well and, when they do make errors on new data, those errors are not shared with the other models, i.e. the models' errors are not strongly correlated. In other words, diversity matters!
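A quick way to see why uncorrelated errors help (a toy simulation of my own, not any of the competition models): three classifiers that are each wrong 30% of the time, independently, are wrong under a majority vote only when at least two err at once, which happens roughly 3 × 0.3² × 0.7 + 0.3³ ≈ 22% of the time. The sketch below checks this empirically:

#include <iostream>
#include <random>

int main() {
    std::mt19937 rng(42);
    std::bernoulli_distribution err(0.30);  // each model errs 30% of the time

    const int trials = 1000000;
    int single_wrong = 0, vote_wrong = 0;
    for (int t = 0; t < trials; ++t) {
        // Independent error indicators for three diverse models.
        bool e1 = err(rng), e2 = err(rng), e3 = err(rng);
        single_wrong += e1;                 // baseline: one model on its own
        vote_wrong += (e1 + e2 + e3) >= 2;  // majority vote errs only when
                                            // at least two models err together
    }
    std::cout << "single model error:  " << 1.0 * single_wrong / trials << '\n'
              << "majority-vote error: " << 1.0 * vote_wrong / trials << '\n';
    // Expect roughly 0.30 vs 0.216 with independent errors; if the models'
    // errors were perfectly correlated, the vote would gain nothing.
}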

Enter the Kaggle Homesite Quote Conversion challenge. This open challenge hosted by Kaggle.com was a binary classification problem: predict which customers would purchase a given home insurance quote, thus "converting" them in marketing terms. More interestingly, the top two entries have both provided detailed write-ups of their approaches, and both teams made use of ensemble learning.

Here is a detailed write-up of the winning submission. They started with 500 models before using 125 of these in an ensemble with three meta layers. Unsurprisingly, they put a lot of effort into identifying good algorithms/parameters for single-model performance, as well as investigating options to ensure model diversity. Their approach is depicted below:

Here is a detailed write-up of the submission from the second-place team. They used 600 base models, of which 500 were created via a robot that automatically trained models using a variety of techniques including gradient boosting (XGBoost), logistic regression and neural networks, all on randomly chosen features. They used three different types of blenders, all of which outperformed the best single model. Their approach is depicted below:
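For a flavour of what a blender does (a generic sketch, not either team's actual pipeline; the function, weights and predictions are made up), the most basic form is just a weighted average of the base models' predicted probabilities, with the weights chosen on a holdout set:

#include <cstddef>
#include <iostream>
#include <vector>

// Blend per-row predicted probabilities from several base models into a
// single score per row using fixed weights (assumed to sum to 1).
std::vector<double> blend(const std::vector<std::vector<double>>& model_preds,
                          const std::vector<double>& weights) {
    std::size_t rows = model_preds.front().size();
    std::vector<double> out(rows, 0.0);
    for (std::size_t m = 0; m < model_preds.size(); ++m)
        for (std::size_t r = 0; r < rows; ++r)
            out[r] += weights[m] * model_preds[m][r];
    return out;
}

int main() {
    // Made-up conversion probabilities from three base models on four quotes.
    std::vector<std::vector<double>> preds{{0.91, 0.12, 0.55, 0.40},
                                           {0.85, 0.20, 0.61, 0.35},
                                           {0.88, 0.05, 0.47, 0.52}};
    for (double p : blend(preds, {0.5, 0.3, 0.2}))
        std::cout << p << ' ';
    std::cout << '\n';
}

More elaborate blenders follow the same idea with learned combinations, for example a logistic regression or gradient-boosted model trained on the base models' out-of-fold predictions.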

Ensemble techniques seem to be rather prevalent amongst the winning submissions on Kaggle. If you want to know more about employing this approach on Kaggle, check out the Kaggle ensembling guide.

Ensemble learning... some may call it data snooping, others call it progress. :-)