Hamzeh Alsalhi's blog

Sparse Target Data for One vs. Rest Classifiers

For introduction see: Sparse Support for scikit-learn Project Outline


Going forward with sparse support for various classifiers, I have been working on a pull request for sparse one vs. rest classifiers that will allow for sparse target data formats. This will result in a significant improvement in memory usage when working with large amounts of sparse target data; a benchmark given below measures the improvement. Ultimately, what this means for users is that with the same amount of system memory it will be possible to train and predict with an OvR classifier on a larger target data set. A big thank you to both Arnaud and Joel for the close inspection of my code so far and the suggestions for improving it!

Implementation

The One vs. Rest classifier works by binarizing the target data and fitting an individual classifier for each class. The implementation of sparse target data support improves memory usage because it uses a sparse binarizer to produce a binary target data matrix that is highly space efficient.
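To make the idea of a sparse binarizer concrete, here is a minimal sketch built directly on scipy.sparse. The helper name binarize_sparse and the toy label sets are hypothetical, and this is not the code from the pull request; it only illustrates how an indicator matrix can be built without ever allocating the dense version.

```python
import numpy as np
import scipy.sparse as sp


def binarize_sparse(label_sets, n_classes):
    """Build a CSR indicator matrix with one row per sample (hypothetical helper)."""
    # Column indices of the nonzero entries, row by row, with duplicates removed.
    indices = [c for labels in label_sets for c in sorted(set(labels))]
    # indptr[i]:indptr[i + 1] delimits the entries that belong to row i.
    indptr = np.cumsum([0] + [len(set(labels)) for labels in label_sets])
    data = np.ones(len(indices), dtype=np.int8)
    return sp.csr_matrix((data, indices, indptr),
                         shape=(len(label_sets), n_classes))


# Four samples, three classes; the third sample has no labels at all.
Y = binarize_sparse([[0, 2], [1], [], [2]], n_classes=3)
print(Y.toarray())
```

Only the positions of the ones are stored, so the cost grows with the number of labels rather than with n_samples * n_classes.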

By avoiding a dense binarized matrix, we can slice out the single column required by each classifier, one at a time, and densify it only when necessary. At no point will the entire dense matrix be present in memory. The benchmark that follows illustrates this.
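The column-at-a-time approach can be sketched as follows. This is an illustration under my own assumptions (the generator name iter_dense_columns and the toy matrix are hypothetical), not the actual implementation in the pull request.

```python
import numpy as np
import scipy.sparse as sp


def iter_dense_columns(Y):
    """Yield (class_index, dense 1-D target vector) one class at a time."""
    Y = sp.csc_matrix(Y)  # CSC layout makes column slicing cheap
    for j in range(Y.shape[1]):
        # Only this single column is densified; the rest stays sparse.
        yield j, Y[:, j].toarray().ravel()


# Toy indicator matrix: 4 samples, 3 classes.
Y_sparse = sp.csr_matrix(np.array([[1, 0, 0],
                                   [0, 1, 0],
                                   [1, 0, 1],
                                   [0, 0, 0]]))

columns = {j: col for j, col in iter_dense_columns(Y_sparse)}
```

In a real one vs. rest fit, each dense column would be handed to its own binary classifier and then discarded, so the peak extra memory is one column of length n_samples rather than the full matrix.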

Benchmark

A significant part of the work on this pull request has involved devising benchmarks to validate intuition about the improvements provided. Arnaud has contributed the benchmark presented here to showcase the memory improvements.

By using the memory_profiler module we can see how the fit and predict functions of the OvR classifier affect memory consumption. In the following examples we initialize a classifier and fit it to the training dataset we provide in one step, then we predict on a test dataset. We first run a control benchmark which shows the state of one vs. rest classifiers as they are without this pull request. The second benchmark repeats the same steps, but instead of using dense target data it passes the target data to the fit function in a sparse format.
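The benchmark itself relies on memory_profiler. As a dependency-free stand-in, the sketch below measures the peak memory increase of a single call with Python's standard tracemalloc module. This is a swapped-in technique, not what the benchmark used: tracemalloc only sees allocations made through Python's allocators, so its numbers will not match memory_profiler's process-level figures. The helper name peak_increase_mb is hypothetical.

```python
import tracemalloc


def peak_increase_mb(func, *args, **kwargs):
    """Run func and return (result, peak memory increase in MiB)."""
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    result = func(*args, **kwargs)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, (peak - baseline) / 2.0 ** 20


# Example: allocating a 10 MiB buffer shows up as at least 10 MiB.
buf, used_mb = peak_increase_mb(bytearray, 10 * 2 ** 20)
print("peak increase: %.1f MiB" % used_mb)
```

The same wrapper could be put around a fit or predict call to compare the dense and sparse code paths on the same machine.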

The dataset used is generated with scikit-learn's make_multilabel_classification, using the following call:

from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(sparse=True, return_indicator=True,
                                      n_samples=20000, n_features=100,
                                      n_classes=4000, n_labels=4,
                                      random_state=0)

This results in a densely formatted target dataset with a density of about 0.001, i.e. only about 0.1% of its entries are nonzero.
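To see why such a target matrix is a good candidate for sparse storage, the sketch below builds a synthetic indicator matrix of the same shape (20000 x 4000, at most 4 labels per sample) and compares its storage with a dense array. The random labels are a stand-in I made up for illustration; they are not the output of make_multilabel_classification, and the byte counts are not the benchmark figures.

```python
import numpy as np
import scipy.sparse as sp

n_samples, n_classes, labels_per_sample = 20000, 4000, 4

rng = np.random.RandomState(0)
rows = np.repeat(np.arange(n_samples), labels_per_sample)
cols = rng.randint(0, n_classes, size=rows.shape[0])
data = np.ones(rows.shape[0], dtype=np.int8)

Y = sp.csr_matrix((data, (rows, cols)), shape=(n_samples, n_classes))
Y.data[:] = 1  # repeated (row, col) pairs were summed; reset them to 1

# Fraction of entries that are nonzero.
density = Y.nnz / float(n_samples * n_classes)

# Bytes actually stored by the CSR matrix versus a dense array that
# uses a single byte per entry (the dense array is never allocated here).
sparse_bytes = Y.data.nbytes + Y.indices.nbytes + Y.indptr.nbytes
dense_bytes_int8 = n_samples * n_classes

print("density: %.5f" % density)
print("sparse: %d bytes, dense (int8): %d bytes" % (sparse_bytes, dense_bytes_int8))
```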

Control Benchmark

est = OneVsRestClassifier(MultinomialNB(alpha=1)).fit(X, y) consumes 179.824 MB

est.predict(X) consumes -73.969 MB. The negative value indicates that data has been deleted from memory.

Sparse OvR PR Benchmark

est = OneVsRestClassifier(MultinomialNB(alpha=1)).fit(X, y) consumes 27.426 MB

est.predict(X) consumes 0.180 MB

Improvement

Taking the memory consumption of fit in the two cases as roughly 180 MB and 30 MB, we see a 6x improvement in peak memory consumption on the data set we benchmarked.
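For reference, the arithmetic behind that factor can be checked directly from the fit figures reported above. The variable names are my own.

```python
# Memory consumed by fit, as reported in the two benchmarks (MB).
fit_dense_mb = 179.824
fit_sparse_mb = 27.426

exact_ratio = fit_dense_mb / fit_sparse_mb  # using the measured values
rounded_ratio = 180.0 / 30.0                # using the rounded values

print("exact: %.2fx, rounded: %.0fx" % (exact_ratio, rounded_ratio))
```

The rounded figures are slightly conservative: the unrounded measurements give a ratio somewhat above 6.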

Upcoming Pull Request

The next focus in my summer of code, after finalizing the sparse one vs. rest classifier, will be to introduce sparse target data support for the kNN and dummy classifiers, which have built-in support for multiclass target data. I have begun the kNN pull request 3350. Implementing a row-wise mode calculation for sparse matrices will be the main challenge of the kNN PR.
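To give a feel for what a row-wise mode on a sparse matrix involves, here is a rough sketch. It is only my illustration of the problem, not the approach taken in the kNN pull request; the function name sparse_row_mode is hypothetical. The key point is that the zeros are implicit, so their count has to be derived from the row length instead of being read from the stored data.

```python
import numpy as np
import scipy.sparse as sp


def sparse_row_mode(X):
    """Most frequent value in each row of a sparse matrix (hypothetical helper).

    Ties are resolved in favour of zero, then the smallest nonzero value.
    """
    X = sp.csr_matrix(X)
    n_rows, n_cols = X.shape
    modes = np.zeros(n_rows, dtype=X.dtype)
    for i in range(n_rows):
        row = X.data[X.indptr[i]:X.indptr[i + 1]]
        row = row[row != 0]               # ignore explicitly stored zeros
        n_zeros = n_cols - row.shape[0]   # implicit zeros in this row
        if row.shape[0] == 0:
            continue                      # all zeros, mode stays 0
        values, counts = np.unique(row, return_counts=True)
        best = counts.argmax()
        if counts[best] > n_zeros:
            modes[i] = values[best]
    return modes


X = sp.csr_matrix(np.array([[0, 0, 3, 3, 3],
                            [1, 0, 0, 0, 2],
                            [2, 2, 0, 0, 1]]))
print(sparse_row_mode(X))
```

The Python-level loop over rows is the obvious weakness of this sketch; a practical version would need to avoid it or keep the per-row work very small.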