AdaBoost Sparse Input Support
For introduction see: Sparse Support for scikit-learn Project Outline
This week, as part of my work on the scikit-learn code base, I implemented sparse input support for AdaBoost. This work is being done in pull request 3161. Below I give a demonstration of the value of AdaBoost and of how my contributions improve the scikit-learn implementation of the classifier. In addition, with the goal of implementing sparse output support in scikit-learn, I have been working on pull request 3203 for sparse label binarization, building off of code written previously by Rohit Sivaprasad. Of course I had help, and I would like to thank Arnaud Joly, Joel Nothman, and Olivier Grisel for reviewing my code and helping to finalize and verify its correctness!
What is AdaBoost?
AdaBoost is a meta-classifier: it repeatedly trains many base classifiers that are individually not very accurate and pools their results to make a more accurate combined classifier. This is a common ensemble method known as boosting. In addition, AdaBoost looks for the examples that most base classifiers are having trouble getting right and increases the focus on those examples in hopes of improving overall prediction accuracy.
Sparse Input Results
My improvements to AdaBoost are runtime improvements: I modified the classifier to accept sparse input data whenever its base classifiers do as well. Here I demonstrate the elapsed time for training the classifier and using it to make predictions, and the difference that sparse versus dense input data makes.
Using the 20 Newsgroups data, we benchmark the performance; the dataset used has 200 features and 1000 samples. The AdaBoost classifier is made up of 50 SVM base classifiers. Running the demonstration and timing the results in Python, we find that both training and prediction time come down considerably when using the sparse input data feature.
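The shape of that benchmark can be sketched as below. This is not the benchmark script from the pull request: it uses a synthetic 1000×200 sparse matrix instead of 20 Newsgroups (so it needs no download), and scikit-learn's default tree stumps rather than SVMs as the base classifiers, so the absolute timings are illustrative only.

```python
# Rough sketch of a dense-vs-sparse AdaBoost timing comparison.
# Synthetic data stands in for the 20 Newsgroups features used in the
# real benchmark; sizes match the post (1000 samples, 200 features).
import time

import numpy as np
import scipy.sparse as sp
from sklearn.ensemble import AdaBoostClassifier

# A genuinely sparse input matrix (5% of entries are nonzero).
X_sparse = sp.random(1000, 200, density=0.05, format="csr", random_state=0)
X_dense = X_sparse.toarray()

# Learnable labels derived from the data, for demonstration purposes.
row_sums = np.asarray(X_sparse.sum(axis=1)).ravel()
y = (row_sums > np.median(row_sums)).astype(int)

clf = AdaBoostClassifier(n_estimators=50, random_state=0)

start = time.perf_counter()
clf.fit(X_dense, y)
dense_fit = time.perf_counter() - start

start = time.perf_counter()
clf.fit(X_sparse, y)  # same estimator, sparse CSR input
sparse_fit = time.perf_counter() - start

print(f"dense fit: {dense_fit:.3f}s  sparse fit: {sparse_fit:.3f}s")
```

On high-dimensionality data with few nonzeros per row, avoiding the dense conversion is where the training and prediction speedups come from.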
Low training and prediction time is important because it allows us to iterate on experiments more rapidly, and it is also necessary for quick real-time applications of prediction, such as facial recognition or handwriting recognition from a live video stream.
Sparse Label Binarization
My work on sparse label binarization has been a small part of a bigger goal: sparse output support for one-vs-rest classifiers in scikit-learn. Label binarization takes target data, such as the categories or labels an example falls under, and standardizes it into a representation that uses one on/off indicator per category or label. Before transformation, this data is typically in what is called a sequence-of-sequences format, which is hard to work with and reason about efficiently. Sparse output support is an important part of my proposal, which I will expand on in coming blog posts once I am further along and able to demonstrate examples and performance figures that make use of my changes to the label binarizer.
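The transformation described above can be sketched with scikit-learn's `MultiLabelBinarizer`; the label names here are made up for illustration. Passing `sparse_output=True` asks for a SciPy CSR indicator matrix rather than a dense array, which matters when there are many possible labels but each example carries only a few.

```python
# Sketch: turning a "sequence of sequences" of labels into a sparse
# on/off indicator matrix. Label names are illustrative.
import scipy.sparse as sp
from sklearn.preprocessing import MultiLabelBinarizer

# Each inner sequence lists the labels one example falls under.
y = [["news", "sports"], ["politics"], ["news"]]

mlb = MultiLabelBinarizer(sparse_output=True)
Y = mlb.fit_transform(y)  # CSR matrix, one column per label

print(mlb.classes_)  # sorted label names: one indicator column each
print(Y.toarray())   # dense view, just for display
```

Each row of `Y` has a 1 in the columns for that example's labels and 0 elsewhere, and only the nonzero entries are stored.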