How many things have happened since my last post! To name a few:
- I’ve learned (mostly by accident) about a new programming language with an exotic name and many strengths: Ceylon. More about it in later posts.
- I’ve started two new courses: Functional Programming Principles in Scala and Computing for Data Analysis. I’ll dedicate two future posts to these courses.
- I’ve finished Building Machine Learning Systems with Python. Yay!
And, in fact, this post is the continuation of my review of this book. Therefore, without further ado, I’ll continue with …
Chapter 5: classification – detecting poor answers
This chapter is a rehashing of Chapter 2, but in a much more complex context: classifying answers on Q/A websites (such as Stack Overflow) into good/bad, useful/not useful, etc. The techniques the chapter describes rely, as expected, on first measuring the usefulness of an answer (not an easy problem). They then apply two algorithms (nearest neighbour and logistic regression) to train the classifier and improve its performance. More importantly, the chapter introduces the reader to debugging ML systems – an activity which requires fine-tuning and ad-hoc tweaking.
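To make the setup concrete, here is a minimal sketch (not the book’s code; the “answer quality” features and the tiny data set below are entirely invented) of training the two classifiers the chapter discusses, using scikit-learn:

```python
# A rough sketch only: nearest neighbour and logistic regression trained on
# hand-made "answer quality" features. Features and labels are illustrative.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical features per answer: [length in words, links, code blocks]
X = np.array([[120, 2, 1], [15, 0, 0], [300, 5, 3], [40, 0, 0],
              [200, 1, 2], [10, 0, 0], [180, 3, 1], [25, 0, 0]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # 1 = good answer, 0 = poor answer

for clf in (KNeighborsClassifier(n_neighbors=3), LogisticRegression()):
    scores = cross_val_score(clf, X, y, cv=4)
    print(type(clf).__name__, "mean accuracy:", scores.mean())
```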
Not a bad chapter, but the problem I see is that it does not achieve its initial goal: the classifier only partially satisfies the criteria.
Chapter 6: classification II – sentiment analysis
This chapter deals with a simple yet powerful method of classification: Bayesian inference. The principle is to calculate posterior probabilities (“this belongs/does not belong to a class”) from prior probabilities and the observed evidence (“we see these features in the post”).
To achieve the goal, the chapter employs a large collection of Twitter posts, defines two features (aptly named F1 and F2: “number of ‘awesome’ occurrences in a post” and “number of ‘crazy’ occurrences in a post”) and one class C (“the post is ‘positive’”).
We’re introduced a little to Bayesian calculus and shown why the classifier is “naive”: it assumes the features are probabilistically independent**). We learn about “smoothing” (to account for features never observed in the training data when estimating the probabilities) as well as about replacing probabilities with their logarithms in order to avoid the underflows that result from multiplying many small numbers.
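For the curious, here is a toy version of those two tricks (entirely my own, on an invented four-post data set, not the book’s Twitter corpus): add-one smoothing so that unseen words never yield zero probabilities, and summing log probabilities instead of multiplying:

```python
# A toy naive Bayes classifier written from scratch, for illustration only.
import math
from collections import Counter

train = [("awesome awesome crazy", "positive"),
         ("awesome day", "positive"),
         ("crazy crazy bad", "negative"),
         ("bad bad day", "negative")]

class_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_counts}
vocab = set()
for text, label in train:
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def log_posterior(text, label):
    # log P(label) + sum of log P(word | label), with add-one smoothing
    logp = math.log(class_counts[label] / len(train))
    total = sum(word_counts[label].values())
    for word in text.split():
        count = word_counts[label][word] + 1          # smoothing: never zero
        logp += math.log(count / (total + len(vocab)))
    return logp

def classify(text):
    return max(class_counts, key=lambda label: log_posterior(text, label))

print(classify("awesome crazy day"))   # expected: "positive"
```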
After this theoretical preparation, the chapter jumps right into coding and makes use, of course, of libraries – many of them. The performance of the learning is evaluated and, after initially rather disappointing results, the chapter proceeds to improve the system. It then continues by extending it (using parts of speech to make the classification more intelligent) and, in the end, we get the conclusions.
Chapter 6 is a rich, exciting chapter. Its only drawback, I would say, is that it relies too much on “tweaking”, more specifically on tweaking the parameters of the libraries that this chapter so abundantly uses.
While understandable and pragmatic, I think this approach prevents the reader from really understanding the “inner nature” of the system he is building. In other words, it turns him into some sort of technician of numbers rather than a master of the details of Bayesian classification.
Chapters 7-8: regression applied
The next two chapters of the book are dedicated to applying the method of regression to machine learning. For someone accustomed to mathematics, the content is quite obvious: how to determine the dependency between various parameters under several assumed forms (linear, linear with penalty, non-linear, etc.).
What’s interesting, though, is that the book handles the case of insufficient data. More precisely, how can the machine learn when the number of examples is smaller than the number of features it needs to learn?***) The suggested answer is to use increased penalties to avoid the over-fitting that comes from having less data than what we need to learn (much as when we solve a system of two equations with three unknowns: we get an infinite number of solutions that match perfectly, but that doesn’t mean those solutions stand up to further tests).
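Here is a hedged illustration of this point on synthetic data of my own making: 40 examples, 100 features, of which only 5 actually matter. An unpenalised least-squares fit can match the training data almost perfectly; a penalised fit (Lasso here) trades some training accuracy for better behaviour on unseen data. Compare the train and test scores it prints:

```python
# Synthetic "fewer examples than features" experiment, for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.RandomState(0)
n_train, n_test, n_features = 40, 400, 100
w_true = np.zeros(n_features)
w_true[:5] = 1.0                                   # only 5 features matter

X_train = rng.randn(n_train, n_features)
y_train = X_train @ w_true + 0.1 * rng.randn(n_train)
X_test = rng.randn(n_test, n_features)
y_test = X_test @ w_true + 0.1 * rng.randn(n_test)

for model in (LinearRegression(), Lasso(alpha=0.1)):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train R^2:", round(model.score(X_train, y_train), 2),
          "test R^2:", round(model.score(X_test, y_test), 2))
```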
This is quite an important aspect of ML, and it is a pity that Chapter 7 doesn’t present the actual results but switches quickly to another topic: recommendations and rating prediction. This continues in Chapter 8, which presents improved methods for recommendations, including basket analysis (looking at what people buy together to understand what they like).
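To make the basket-analysis idea concrete, here is the simplest possible sketch (my own, on invented baskets; a real method such as the Apriori algorithm goes much further, mining rules with support and confidence thresholds): simply counting which items are bought together.

```python
# Counting item pairs that appear in the same basket, for illustration only.
from collections import Counter
from itertools import combinations

baskets = [{"milk", "bread", "butter"},
           {"milk", "bread"},
           {"beer", "bread"},
           {"milk", "butter"}]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pairs suggest "people who buy X also tend to buy Y"
print(pair_counts.most_common(3))
```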
In conclusion, I find the treatment of regression rather poor. The one topic which is more difficult (learning from a paucity of examples) is treated only briefly, and the switch to something else comes quite abruptly in the book.
Chapters 9-10: delving into form recognition
Being able to recognize forms (visual, auditory or symbolic – like meaning) is crucial in Artificial Intelligence. When we want to build machines that learn from/about those forms, the challenge is even greater: the machine needs to learn from information that is inherently hard to process.
That’s why I approached these two chapters with great interest. I was not disappointed: they present rather well two classifiers (one for music, another for images) that, essentially, depend on how the form (music or image) gets transformed into something that can be processed. In the first case it is about analysing the sound’s harmonics, and in the second case it’s about extracting features from the image’s pixels. How these features are extracted greatly depends on what we want to learn from the forms.
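As a rough illustration of that transformation step (my own sketch, not the book’s pipeline; the file name is hypothetical, and the book’s music chapter goes further, with cepstral features), a sound clip can be reduced to a fixed-length feature vector via its Fourier spectrum:

```python
# Turning a raw sound clip into a feature vector, for illustration only.
import numpy as np
from scipy.io import wavfile   # assumes a local WAV file is available

sample_rate, wave = wavfile.read("clip.wav")        # hypothetical file name
if wave.ndim > 1:                                   # stereo: average channels
    wave = wave.mean(axis=1)

spectrum = np.abs(np.fft.rfft(wave[:sample_rate]))  # first second of audio
features = spectrum[:1000]                          # keep the low frequencies
print(features.shape)   # a fixed-length vector, ready for any classifier
```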
Apart from that, these two chapters are not particularly intriguing. They rely (as before) on Python libraries and they show (as if it were still necessary) how making an ML system that really works is more of an art, requiring a lot of fine-tuning of carefully chosen parameters.
Chapter 11: do not believe everything
This chapter, “Dimensionality reduction”, is an important one; I would have put it somewhere closer to the beginning. It deals with the very sensitive subject of reducing the dimensionality of data, for the sake of avoiding over-fitting and of building faster, better ML systems. Reducing the dimensionality of data essentially means ignoring some of the features that we would otherwise use in training.
Which features get ignored largely depends on what we want to learn, but one thing is for sure: we don’t want to drop the essential ones. Beyond such selection there is feature extraction, for which the chapter presents two linear methods. Another method (multi-dimensional scaling), based not on processing the observation points themselves but on the distances between them, is also presented.
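For a flavour of what linear dimensionality reduction looks like in code, here is a minimal PCA sketch (my own, on random data; it only shows the mechanics of projecting many features down to a few):

```python
# Projecting 50 features down to 2 with PCA, for illustration only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 50)                   # 200 observations, 50 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (200, 2)
print(pca.explained_variance_ratio_)     # variance kept by each new axis
```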
The merit of Chapter 11 is not in the richness of the information it gives (although it does a good job at explaining just enough to understand) but in the fact that it clearly shows that in Machine Learning, like in many other fields, more is not always better.
Chapter 12: taking it to the cloud
The final chapter of the book deals with how to address the fact that ML problems are computationally intensive. The suggested solution is to take them to the cloud! For this purpose it presents the Jug framework (useful for splitting work into tasks that can be run in parallel on a massive infrastructure) and then suggests Amazon Web Services as that necessary infrastructure.
Finally, it suggests using starcluster for generating/managing the clusters in the cloud.
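For a flavour of what that looks like, here is a minimal Jug sketch (my own, in the spirit of the chapter, not its actual example): functions decorated with TaskGenerator become tasks that jug can run in parallel, on one machine or across a cluster.

```python
# A minimal Jug script; run it with:  jug execute this_file.py
# (launching several such processes lets the tasks run in parallel)
from jug import TaskGenerator

@TaskGenerator
def expensive_step(x):
    # stands in for a heavy ML computation
    return x * x

@TaskGenerator
def combine(values):
    return sum(values)

partials = [expensive_step(i) for i in range(100)]
total = combine(partials)
```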
The best comes for the most patient
Just as dessert always comes after, not before, a good meal, so the most delicious part of “Building Machine Learning Systems with Python” is its Appendix: “where to learn more about Machine Learning”. This is no wonder, because this book, being an introductory text in ML, leaves the reader with a great desire to find out more. So, from this perspective, I think it truly fulfils its purpose.
I will not say more about the content, because I believe it is well worth reading the book in order to get there. What I want to say is that, thanks to this text, my interest in both Python and Machine Learning was definitely and, I would dare to say, irrevocably rekindled.
More to come
As mentioned before, I want to write more about my latest experiences in computing. I will do so in future posts:
- How intelligence beautifully shines in its simplicity: Functional Programming Principles in Scala at Coursera
- How one can speak but not teach and another one can listen but not learn: Computing for Data Analysis at Coursera
- How what’s good doesn’t have to be hard: Model Thinking at Coursera
- How the good ones often die young: the Ceylon programming language, with references to Dart and Go programming languages
- How we’re playing with the big boys now: Principles of Reactive Programming at Coursera
- … and many other subjects …
Till then, happy programming!
*) The title comes from this famous film (moment 6’52”).
**) This is an assumption used in financial models, too. Without it, probabilistic calculations can become intractable. While “naive Bayes” works well in the case of ML classifiers (as this book shows), I find that independence between random variables is a rather gratuitous assumption in finance.
***) For example, how can a machine learn English (20000 words) from a limited number of texts (say, 5000 articles)?