Hamzeh Alsalhi's blog

Scikit-learn Sparse Output Improvements

For an introduction, see: Sparse Support for scikit-learn Project Outline


Now, at the end of this GSoC, I have contributed four pull requests that have been merged into the code base. One planned pull request has not been started, and another is nearing its final stages. The list below gives details of each pull request: what was done and what still needs to be done. The merged contributions are now in the scikit-learn codebase, so anyone using one of the classifiers mentioned with sparse data benefits from faster processing and lower memory consumption.

This GSoC has been an excellent experience. I want to thank the members of the scikit-learn community, most of all Vlad, Gael, Joel, Oliver, and my mentor Arnaud, for their guidance and input, which improved the quality of my projects immeasurably.

Sparse Input for Ensemble Methods

Sparse Input for AdaBoost, Status: Merged (+241 -27)

The ensemble/weight_boosting module was edited to avoid densifying the input data and to simply pass sparse data along to the base estimators, so that they can train and predict on sparse data directly. Tests were written to validate the correctness of the AdaBoost classifier and the AdaBoost regressor on sparse data by making sure that training and prediction on the sparse and dense formats of the same data gave identical results, as well as verifying that the data remained in sparse format when the base estimator supported it.
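The kind of check described above can be sketched roughly as follows. This is only an illustration, not the test that was merged: the dataset, the thresholding used to create zeros, and the number of estimators are all assumptions made for the example.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Toy data (an assumption for this sketch): zero out small values so that
# a sparse representation is actually worthwhile.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X[np.abs(X) < 1.0] = 0.0
X_sparse = sp.csr_matrix(X)

# Train one model on the dense array and one on the sparse matrix,
# using the same random seed for both.
dense_model = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)
sparse_model = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X_sparse, y)

pred_dense = dense_model.predict(X)
pred_sparse = sparse_model.predict(X_sparse)

# Fraction of samples on which the two models agree.
agreement = np.mean(pred_dense == pred_sparse)
print("agreement between dense and sparse predictions:", agreement)
```

The two models should make the same predictions, since the only difference between them is the storage format of the input.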

Sparse Input for Gradient Boosted Regression Trees (GBRT), Status: Not started

Much like sparse input support for AdaBoost, the estimator will need to be modified to pass sparse data through to its base estimators, and similar tests will be written to ensure the implementation is correct. The usefulness of this functionality depends on sparse support for decision trees, which is pending in a mature pull request, #3173.

Sparse Output Support

Sparse Label Binarizer, Status: Merged (+968 -857)

The label binarizing function in scikit-learn's label code was modified to support conversion from sparse formats, and its helper functions in the utils module were modified so that they can detect the representation type of the target data when it is in sparse format.
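As a rough illustration of what this looks like from the user's side, the sketch below asks the binarizer for sparse output and then checks how the target type is detected. It uses the sparse_output option and the type_of_target helper as they exist in current scikit-learn releases; the labels are made up for the example.

```python
import scipy.sparse as sp
from sklearn.preprocessing import LabelBinarizer, label_binarize
from sklearn.utils.multiclass import type_of_target

# Functional form: one indicator column per class, stored sparsely.
Y = label_binarize([1, 6, 4, 1], classes=[1, 4, 6], sparse_output=True)
print(sp.issparse(Y))
print(Y.toarray())

# The utils helper inspects the sparse matrix without densifying it.
target_type = type_of_target(Y)
print(target_type)

# Estimator form: fit on made-up string labels, then map back.
lb = LabelBinarizer(sparse_output=True)
Y_fit = lb.fit_transform(["spam", "ham", "eggs", "spam"])
labels = lb.inverse_transform(Y_fit)
print(labels)
```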

Sparse Output One vs. Rest, Status: Merged (+252 -27)

The fit and predict functions for one vs. rest classifiers were modified to detect sparse target data and handle it without densifying the entire matrix at once. Instead, the fit function iterates over densified columns of the target data and fits an individual classifier for each column, and the predict function binarizes the results from each classifier individually before combining them into a sparse representation. A test was written to ensure that classifier accuracy was within a suitable range when using sparse target data.
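The column-at-a-time idea can be sketched as below. This is a simplified illustration rather than the merged code: fit_columns and predict_sparse are hypothetical helper names, the choice of LogisticRegression is arbitrary, and the data is randomly generated.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


def fit_columns(estimator, X, Y):
    """Fit one binary classifier per column of a sparse indicator matrix."""
    Y = sp.csc_matrix(Y)  # CSC makes column slicing cheap
    estimators = []
    for k in range(Y.shape[1]):
        # Only this single column is ever held in dense form.
        y_col = Y[:, [k]].toarray().ravel()
        estimators.append(clone(estimator).fit(X, y_col))
    return estimators


def predict_sparse(estimators, X):
    """Predict each column separately and stack the results sparsely."""
    columns = [sp.csc_matrix(est.predict(X).reshape(-1, 1)) for est in estimators]
    return sp.hstack(columns, format="csc")


# Made-up multilabel problem: label k is "feature k is positive".
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
Y = sp.csr_matrix((X[:, :3] > 0).astype(int))

estimators = fit_columns(LogisticRegression(), X, Y)
Y_pred = predict_sparse(estimators, X)
print(Y_pred.shape, sp.issparse(Y_pred))
```

At no point does the full target matrix exist as a dense array, which is where the memory saving comes from when there are many labels.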

Sparse Output Dummy Classifier, Status: Merged (+466 -32)

The fit and predict functions were adjusted to accept target data in sparse format. Two pieces were needed to reproduce the behavior of prediction on dense target data. First, a sparse class distribution function was written to get the classes of each column in the sparse matrix. Second, a random sampling function was created to produce a sparse matrix of values drawn at random from a user-specified distribution.
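To give a feel for these two helpers, here is a minimal sketch written directly against scipy. The function names sparse_class_distribution and sample_sparse_columns are hypothetical and are not the names used in scikit-learn, and the sampling step densifies one column at a time for simplicity, which the real implementation may well avoid.

```python
import numpy as np
import scipy.sparse as sp


def sparse_class_distribution(Y):
    """Return the classes and their relative frequencies for each column."""
    Y = sp.csc_matrix(Y)
    n_samples, n_outputs = Y.shape
    classes, priors = [], []
    for k in range(n_outputs):
        # Stored entries of column k; drop any explicitly stored zeros.
        col_data = Y.data[Y.indptr[k]:Y.indptr[k + 1]]
        col_data = col_data[col_data != 0]
        values, counts = np.unique(col_data, return_counts=True)
        # Zeros are implicit in a sparse matrix, so count them by subtraction.
        n_zeros = n_samples - col_data.shape[0]
        if n_zeros > 0:
            values = np.insert(values, 0, 0)
            counts = np.insert(counts, 0, n_zeros)
        classes.append(values)
        priors.append(counts / n_samples)
    return classes, priors


def sample_sparse_columns(classes, priors, n_samples, rng):
    """Draw each output column from its class distribution, stacked sparsely."""
    columns = []
    for values, p in zip(classes, priors):
        drawn = rng.choice(values, size=n_samples, p=p)
        columns.append(sp.csc_matrix(drawn.reshape(-1, 1)))
    return sp.hstack(columns, format="csc")


Y = sp.csc_matrix(np.array([[1, 0], [0, 2], [1, 0], [0, 0]]))
classes, priors = sparse_class_distribution(Y)
Y_random = sample_sparse_columns(classes, priors, n_samples=10,
                                 rng=np.random.default_rng(0))
print(classes, priors)
print(Y_random.shape)
```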

Sparse Output KNN Classifier, Status: Open

In the predict function of the classifier, the dense target data is indexed one column at a time. The main improvement made here is to leave the target data in sparse format and to convert a column to a dense array only when it is needed. This results in lower peak memory consumption; the improvement is proportional to the sparsity and overall size of the target matrix.
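A stripped-down version of this idea is sketched below. It is not the code from the open pull request: knn_predict_sparse_targets is a hypothetical function, neighbors are found by brute force with numpy, and the targets are assumed to be binary indicators so that a majority vote reduces to a sum.

```python
import numpy as np
import scipy.sparse as sp


def knn_predict_sparse_targets(X_train, Y_train, X_query, n_neighbors=3):
    """Majority-vote KNN for sparse binary targets, one column at a time."""
    Y_train = sp.csc_matrix(Y_train)
    # Brute-force squared Euclidean distances, shape (n_queries, n_train).
    dist = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    neigh_ind = np.argsort(dist, axis=1)[:, :n_neighbors]

    columns = []
    for k in range(Y_train.shape[1]):
        # Densify only the column currently being voted on.
        y_col = Y_train[:, [k]].toarray().ravel()
        votes = y_col[neigh_ind]  # shape (n_queries, n_neighbors)
        pred = (2 * votes.sum(axis=1) > n_neighbors).astype(Y_train.dtype)
        columns.append(sp.csc_matrix(pred.reshape(-1, 1)))
    return sp.hstack(columns, format="csc")


X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
Y_train = sp.csr_matrix(np.array([[1, 0], [1, 0], [0, 1], [0, 1]]))
X_query = np.array([[0.0, 0.5], [5.0, 5.5]])

Y_pred = knn_predict_sparse_targets(X_train, Y_train, X_query, n_neighbors=3)
print(Y_pred.toarray())
```

Peak memory for the targets is then one dense column plus the sparse matrix, rather than the whole matrix in dense form.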