Sunday, 8 September 2013

Return of the Machine (Learning)

After more than a year since my last blog post, I decided to resume blogging.

Why? Well, because many things have happened: I have opened up new avenues of excitement (online courses!), I have delved deeper into a very interesting and dynamic field of applied Computer Science (derivatives finance!) and I have embarked on a serious personal programme of reading as many and as varied technical books as possible (more about that below).

But the true reason for resuming blogging is that blogging is so much fun and I was really missing that fun.

So, not surprisingly, my future blog posts will cover three important aspects:

  • Courses I’ve found and/or pursued and interesting things I’ve learned from them
  • Topics on numeric processing I’ve encountered
  • Books, articles or personal projects that captured my interest (how about “Is there any use for a XAML processor for Java?”)

For starters, I will write (in future posts) about two courses which I’ve just completed:

  1. MongoDB for Developers (M101P) generously offered by 10gen (now MongoDB.com)
  2. Model Thinking coming from University of Michigan through Coursera.

But before that, I want to dedicate this rejuvenating post to a partial review of a book which I started reading recently: Building Machine Learning Systems with Python by Willi Richert and Luis Pedro Coelho, published by Packt Publishing.

Disclaimer:

I am in no way associated with Packt Publishing and I do not collect royalties on this work. Packt Publishing has kindly provided me with a reviewer’s copy in response to my interest in it. These reviews are, in part, my way to return the favour. My objectivity is guaranteed by the fact that, free copy or not, Machine Learning is a fascinating subject to me. 

Why Machine Learning? Why Python?

Machine Learning is an interesting field in its own right and, with the recent surge in Big Data, it becomes even more so.

For a developer in Finance like me, interest in Machine Learning may come from burning questions like: can we build systems that churn (better) through the huge amount of market data that we currently have? If we have smarter systems, can we use them to minimize risk more effectively? *)

The association between Python and Machine Learning may seem surprising (Python is interpreted and slow, while Machine Learning, as a sub-field of AI, demands high speed by its very nature). Yet given the beauty of its syntax, the large number of available libraries and the propensity of its users towards experimentation, Python proves to be an adequate tool for the job.

That being said, let’s see what Building Machine Learning Systems with Python has to offer.

Chapter 1: getting started

The first chapter of the book starts in the usual manner: who the book is for, what you get from reading it, how to install Python, how to download the necessary libraries and how to make sure everything works for what’s to come.

In the second part of the chapter the book invites the reader to build a first ML system straight away, with very little theoretical preparation. While this method of “learn it while doing it” works for lots of people, there is a stubborn minority that would prefer the underlying theory to be presented (and understood) first, and only then to jump into the practicalities.**)

The hands-on approach, however, doesn't fail me (unlike many other works), since it is nicely peppered with theoretical explanations, the "practical" instructions are kept to a minimum, and the results get quickly correlated with the actual expectations emerging from theory. As a bonus, the writing style is lively, crisp and entertaining – quite in contrast to what we sometimes find in academic papers (remember that the authors are both PhDs).

To make a long story short, this first chapter really teaches the reader how to build a very simple ML system based on binary discrimination, how to train it, how to perform testing on it and how to evaluate the quality of its learning abilities.
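That train/test/evaluate cycle can be sketched in a few lines of NumPy. To be clear, this is not the book's own listing: the data is synthetic and the "classifier" is just a learned threshold on one feature, but the workflow is the one the chapter teaches:

```python
import numpy as np

# Toy data: one feature, two classes (a made-up stand-in for the book's example)
rng = np.random.default_rng(0)
class0 = rng.normal(2.0, 0.5, 50)   # samples of class 0
class1 = rng.normal(4.0, 0.5, 50)   # samples of class 1
X = np.concatenate([class0, class1])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Split into training and test sets
idx = rng.permutation(len(X))
train, test = idx[:70], idx[70:]

# "Training": pick the threshold that best discriminates the training data
candidates = np.linspace(X[train].min(), X[train].max(), 200)
accuracies = [np.mean((X[train] > t) == y[train]) for t in candidates]
threshold = candidates[int(np.argmax(accuracies))]

# Evaluation: measure the quality of the learned rule on held-out data
test_accuracy = np.mean((X[test] > threshold) == y[test])
print(f"threshold={threshold:.2f}, test accuracy={test_accuracy:.2f}")
```

The point is the separation of concerns: fit on one subset, judge on another, and never let the two mix.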

Chapter 2: real-world examples

The second chapter builds on the good foundation laid by the first one: it teaches the reader how to build a (qualitatively) more complex system, using real data. It does a fair job of introducing the reader to the art of multi-classification. The style remains nice and crisp and the goal is (in my opinion) completely achieved.
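As a hedged illustration of what multi-classification on real data looks like (this is my own sketch, not the book's listing), here is the classic three-class Iris dataset with a k-nearest-neighbours classifier from scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris: 150 flowers, 4 measurements each, 3 species -> a multi-class problem
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# k-nearest-neighbours handles any number of classes out of the box
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(f"test accuracy: {acc:.2f}")
```

Note that nothing in the code cares that there are three classes instead of two; that is precisely what makes multi-classification feel like a natural next step after chapter 1.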

Chapter 3: slight disappointment

The third chapter keeps on building by exploring a more complex subject, that of clustering. It chooses a difficult topic (clustering texts!) and does a good job in explaining how to build the metrics that allow us to compute clustering on large amounts of texts of normal size (say, feed posts or articles).

After such a good performance, here comes a slight disappointment: for clustering it presents only the centroid algorithm. Having just finished the course on Model Thinking from Coursera / U. of Michigan, I am surprised that algorithms with more subtle metrics (like the clustering coefficient from graph theory or diversity indices from information science) are not used – or even presented.

To do justice to this chapter, though, I must say that although only one algorithm gets its place here, that algorithm is presented well, and a reader endowed with basic knowledge of clustering can employ any other algorithm on his own (provided it blends well with the tools).
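A minimal sketch of the approach, assuming TF-IDF features and scikit-learn's k-means (a centroid algorithm); the four toy posts below are invented for illustration, standing in for the feed posts the chapter works with:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

posts = [
    "machine learning in python",
    "python machine learning libraries",
    "derivatives pricing in finance",
    "finance derivatives and risk pricing",
]

# The "metric" step: turn each post into a TF-IDF vector, so that posts
# sharing rare words end up close to each other in vector space
vectors = TfidfVectorizer().fit_transform(posts)

# The centroid step: k-means groups the vectors around 2 cluster centres
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(km.labels_)
```

Building the vectoriser well (stop words, stemming, TF-IDF weighting) is where most of the chapter's effort goes; the clustering call itself is almost an afterthought.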

Chapter 4: dancing in the dark

Saying that I was confused when reading this chapter is somewhat of an understatement: on the one hand you have a very interesting subject (how to build a system that learns what a large collection of texts – like Wikipedia – talks about) and on the other hand you get very little on what topic modelling really is. Puzzling!

The chapter is short (only fourteen pages) and it starts by defining what a topic is. After that it introduces the reader to the latent Dirichlet allocation (LDA) algorithm – but without saying much about it (except that it assigns probabilistic weights to words when assigning them to topics; beyond that, it refers the reader to Wikipedia).

After saying so little about topic modelling, the chapter jumps right into coding. It presents the reader with a few lines of Python code consisting of a single method call, followed by a histogram. The problem is that this output could represent anything: if one were to treat the probability distribution of the return of a financial asset the same way, for example, the rendering would look very similar. Where’s the topic modelling, then?

The chapter redeems itself a little by providing the reader with an interesting fact: once the topics are learned, we can query them to obtain not-so-obvious results. For example, given a certain article, what is the closest article to it, according to topic? Think about it: this kind of result can prove invaluable when building systems that guide people, like wizards or teaching systems. Instead of forcing the user to follow a fixed, predefined path, the system offers the most relevant information based on what the user already knows or wants.
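The "closest article by topic" query can be sketched with scikit-learn's LDA implementation (the book's own code may use different tooling; the four mini-documents and the distance choice here are my own assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks bonds markets trading risk",
    "trading risk markets portfolio hedge",
    "neurons brain learning synapse memory",
    "memory brain neurons cognition learning",
]

# Bag-of-words counts, then LDA with two topics; each row of doc_topics
# is a document's probabilistic weight over the learned topics
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

def closest(i):
    # Compare topic distributions, not raw words: L1 distance here
    dists = np.abs(doc_topics - doc_topics[i]).sum(axis=1)
    dists[i] = np.inf  # exclude the document itself
    return int(np.argmin(dists))

print(closest(0))  # index of the article nearest to document 0 in topic space
```

The key idea is that similarity is computed in the low-dimensional topic space rather than over raw word counts, which is what makes the query robust to documents that share few literal words.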

After a small breath of fresh air, the chapter steers towards a big subject: modelling the whole of Wikipedia by topics. While rather useless from a teaching point of view, churning through Wikipedia does prove the algorithms are effective: they really can process very large amounts of (mostly) unstructured data. Choosing Wikipedia also lets the reader relate to something well known – a wise choice (from a marketing point of view).

Further chapters: here and beyond

Someone once said: “I don’t need to eat the whole omelette to know it’s rotten.” It’s a principle I’ve applied over and over: I’ve tried many things in the past and most of them I dropped soon after beginning.

Building Machine Learning Systems with Python does not fall into this infamous category. While it has its own, rather small, drawbacks, the book is definitely worth reading further. Therefore, I promise that in a week’s time (give or take) I’ll come up with a new post on:

  • How to detect poor answers in classification
  • Advanced classification: sentiment analysis
  • Regression

In the meantime, happy (machine) learning!


*) Risk is the fundamental problem in Finance, as the global recession of the past 5+ years has abundantly proved.

**) I mention this stubborn minority of people because I belong to it.
