Archive for April, 2016


Size Matters: Empirical Evidence of the Importance of Training Set Size in Machine Learning

There is much hype around "big data" these days and how it's going to change the world. Data scientists are getting excited about big data analytics, and technologists are scrambling to understand how to employ scalable, distributed databases and compute clusters to store and process all this data.

Interestingly, Gartner dropped "big data" from its famous hype cycle charts in 2015...

Gartner’s 2012 Hype Cycle for Emerging Technology

Gartner’s 2013 Hype Cycle for Emerging Technology

Gartner’s 2014 Hype Cycle for Emerging Technology

so how useful is big data really going to be? Not to mention the obvious question of what actually constitutes "big data".

Clearly, acquiring and storing data alone produces no meaningful benefit to the organisation collecting and housing it. "Monetising the data" is where the quants and data scientists come in. The value-add from quants is typically thought to come from sophisticated models crafted and tuned by extremely clever, PhD-laden mathematicians with deep knowledge of a particular field. That might or might not be true, but it certainly raises the question...

Is the quality of the algorithm used in the model the most important ingredient?

If this were true, then:

  • organisations should put the bulk of their effort into developing the best models possible;
  • organisations employing the best data scientists should therefore be better equipped to monetise big data;
  • organisations need not pursue enormous data capture efforts.

BUT, with all this hype around big data, it is prudent to consider the relative importance of the size of the training data available to quants.

In this paper (Banko and Brill, 2001), two researchers from Microsoft investigated how important training set size is in building a model for a problem from the NLP domain: confusion set disambiguation. Quoting the authors, "Confusion set disambiguation is the problem of choosing the correct use of a word, given a set of words with which it is commonly confused. Example confusion sets include: {principle, principal}, {then, than}, {to,two,too}, and {weather,whether}." The authors reviewed a variety of approaches considered state-of-the-art at the time of publication (2001) and examined model performance, as measured by accuracy, when various training set sizes were employed. The chart below, extracted from their paper, shows some interesting observations...

As the training set size (measured on the X-axis) increases by an order of magnitude, we can see that even the worst-performing model often achieves greater accuracy than the best-performing model trained on less data.

Whilst this paper addresses a specific problem in a specific domain, it is widely recognised across a variety of machine learning fields that a "dumber" model with more data will often outperform a smarter model with less data.
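To make the idea of a learning curve concrete, here is a minimal, hypothetical sketch (not the Banko and Brill setup): a toy task loosely inspired by confusion set disambiguation, where a deliberately "dumb" learner simply memorises the majority word per context. Its test accuracy climbs as the training set grows, because more contexts get covered with better estimates. All names and numbers here are invented for illustration.

```python
import random
from collections import Counter, defaultdict

random.seed(0)
CONTEXTS = 100  # each example has one integer context feature

def sample(n):
    """Generate n (context, word) pairs: the true word is a fixed
    function of the context, corrupted by 10% label noise."""
    data = []
    for _ in range(n):
        ctx = random.randrange(CONTEXTS)
        word = "than" if ctx % 7 < 4 else "then"
        if random.random() < 0.1:  # label noise
            word = "then" if word == "than" else "than"
        data.append((ctx, word))
    return data

def train(data):
    """A deliberately dumb learner: majority word per seen context,
    with the overall majority word as a fallback for unseen contexts."""
    counts = defaultdict(Counter)
    for ctx, word in data:
        counts[ctx][word] += 1
    fallback = Counter(word for _, word in data).most_common(1)[0][0]
    model = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}
    return model, fallback

def accuracy(model, fallback, data):
    hits = sum(model.get(ctx, fallback) == word for ctx, word in data)
    return hits / len(data)

test_set = sample(2000)
accs = {}
for n in (20, 200, 10000):
    model, fallback = train(sample(n))
    accs[n] = accuracy(model, fallback, test_set)
print(accs)  # accuracy rises with training set size
```

Even this crude memoriser approaches the noise ceiling once every context has been seen many times; the model never changed, only the amount of data did.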

On this very topic, Pedro Domingos, a well-respected and leading researcher in machine learning, published a paper, A Few Useful Things to Know about Machine Learning, where in Section 9, titled "More Data Beats a Cleverer Algorithm", he notes...

"... pragmatically the quickest path to success is often to just get more data. As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.) This does bring up another problem, however: scalability. In most of computer science, the two main limited resources are time and memory. In machine learning, there is a third one: training data. Which one is the bottleneck has changed from decade to decade. In the 1980’s it tended to be data. Today it is often time. Enormous mountains of data are available, but there is not enough time to process it, so it goes unused. This leads to a paradox: even though in principle more data means that more complex classifiers can be learned, in practice simpler classifiers wind up being used, because complex ones take too long to learn. Part of the answer is to come up with fast ways to learn complex classifiers, and indeed there has been remarkable progress in this direction."

The comments from Professor Domingos give us an insight into the evolution of learning systems based on the availability of (big) data and compute clusters:

  1. Researchers add more data to improve the performance of a learning algorithm;
  2. As training set size increases for a given model, training TIME also increases;
  3. Researchers turn to more efficient learning algorithms to reduce the training time;
  4. The availability of large, cost-effective compute grids in the cloud, and of HPC technologies like GPUs, allows researchers to deploy even "bigger" models (more models, more features).

The Rise of Deep Learning

Indeed, the above cycle has led to the rise of deep learning. The scale of available data and processing capacity is enabling large models, often neural networks, to train on large amounts of data, with sophisticated tools that still allow researchers to run experiments with reasonably short feedback loops. With near-unlimited data and compute power, it becomes more important to pick models that scale well with the available training data, and the current sentiment in academia is that deep learning is the approach that scales best (see the second image from a recent presentation by Andrew Ng).

In this video, Andrew Ng has a nice picture explaining the rise of deep learning:


Perfect Bridge Hand

Question: Bridge is a game where you are dealt 13 cards. Assuming the deck is a standard 52-card deck, it is well shuffled, and no one else is in the deal, what are the chances that you would be dealt 13 cards of the same suit?

Answer: We have 13 events of interest: the 13 individual card draws. The draws are not independent, but by the chain rule we can multiply the conditional probability of each draw given the draws before it.

More concretely, the first card we draw from the 52-card deck has no real constraints, thus the probability of this is 1. Any card will do. The second event is a card drawn from the remaining deck which has 51 cards, and it must be one of the other 12 cards that have the same suit as that in the first drawn card. Thus the probability for the second event is 12/51. Similarly, the likelihood of drawing the 3rd card from the remaining deck and it having the same suit as the first two cards drawn is 11/50.

Continuing this for all 13 cards we get the answer:

1 × (12/51) × (11/50) × (10/49) × (9/48) × (8/47) × (7/46) × (6/45) × (5/44) × (4/43) × (3/42) × (2/41) × (1/40) = (12!)(39!) / 51!
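The product of conditional probabilities and the closed form can be checked exactly with rational arithmetic; a quick sketch:

```python
from fractions import Fraction
from math import factorial

# The first draw has probability 1; the 2nd through 13th draws
# contribute 12/51, 11/50, ..., 1/40 respectively.
p = Fraction(1)
for k in range(12, 0, -1):
    p *= Fraction(k, k + 39)   # k same-suit cards left, k + 39 cards in deck

# The closed form derived above: (12!)(39!) / 51!
closed_form = Fraction(factorial(12) * factorial(39), factorial(51))
print(p == closed_form)  # True
```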

Alternately, we can consider the problem in terms of the total number of combinations. The number of different ways we can draw 13 cards from a deck of 52 is 52C13 = 52! / ((39!)(13!)). Since there are 4 suits, there are only 4 ways to get 13 cards of the same suit, thus the probability of getting one of these 4 perfect bridge hands is:

4 / (52! / ((39!)(13!))) = 4 × (39!)(13!) / 52! = (39!)(13!) / (13 × 51!) = (39!)(12!) / 51!

So about 1 in 158,753,389,900.
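Both forms, and the final figure, can be confirmed with exact integer arithmetic:

```python
from fractions import Fraction
from math import comb, factorial

# 4 perfect hands out of C(52, 13) possible 13-card hands
p = Fraction(4, comb(52, 13))

# Agrees with the closed form (39!)(12!) / 51! derived above
assert p == Fraction(factorial(39) * factorial(12), factorial(51))

print(comb(52, 13) // 4)  # 158753389900
```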