Hamzeh Alsalhi's blog

Sparse Support for scikit-learn Project Outline

All blog posts covering this project:

  1. Sparse Support for scikit-learn Project Outline
  2. AdaBoost Sparse Input Support
  3. Sparsely Formatted Target Data
  4. Sparse Target Data For One vs Rest Classifiers
  5. Scikit-learn Sparse Output Improvements

As a Google Summer of Code 2014 project, I am going to improve sparsity support in the Python machine learning library scikit-learn.

Large data sets are becoming ubiquitous; they come in text, image, audio, and, in rarer cases, even video format. Text datasets are the most widely used because of their versatility: corporations use text data to store consumer information, and scientific research projects can use text data to represent experimental results. To show how data is prepared for analysis, we walk through the steps of converting text data into a form that machine learning algorithms can take as input.

We look at an example based on the Wikipedia bag-of-words entry. We begin with three short documents and process them to get a numerical representation of the data.

Document 1   David likes to paint. Mary likes to sail.
Document 2   David also likes to write.
Document 3   Mary also likes to paint.

We take every unique word as we encounter it and insert it into a dictionary as an indexed entry.

Index   Entry
1       David
2       likes
3       to
4       paint
5       Mary
6       sail
7       also
8       write
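
For concreteness, this indexing step can be written in a few lines of plain Python (my own sketch, not part of the original pipeline):

# Assign each new word the next available index as it is first encountered.
docs = ["David likes to paint. Mary likes to sail.",
        "David also likes to write.",
        "Mary also likes to paint."]

dictionary = {}
for doc in docs:
    for word in doc.replace(".", "").split():
        if word not in dictionary:
            dictionary[word] = len(dictionary) + 1  # 1-based index, as in the table

print(dictionary)
# {'David': 1, 'likes': 2, 'to': 3, 'paint': 4, 'Mary': 5, 'sail': 6, 'also': 7, 'write': 8}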

We now go back over each original document and count how many times each dictionary entry occurs in it. We represent each document as a vector of eight numbers, one count per dictionary entry.

Document 1  [1, 2, 2, 1, 1, 1, 0, 0]
Document 2  [1, 1, 1, 0, 0, 0, 1, 1]
Document 3  [0, 1, 1, 1, 1, 0, 1, 0]
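
scikit-learn can do both steps for us. The sketch below (my own, assuming a recent scikit-learn) uses CountVectorizer, which builds the dictionary and the count vectors in one call and returns them as a sparse matrix; note that it lowercases the text and orders its vocabulary alphabetically, so the columns differ from the hand-built table above, but the counts carry the same information.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["David likes to paint. Mary likes to sail.",
        "David also likes to write.",
        "Mary also likes to paint."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # X is a scipy.sparse matrix of term counts

print(vectorizer.vocabulary_)  # word -> column index mapping
print(X.toarray())             # dense view of the three count vectors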

We now look at a second example designed to illustrate data sparsity. Our three documents this time are full Wikipedia articles; instead of pasting the text of the articles here, I will simply name them.

Document A   Tardigrade
Document B   Cinco de Mayo
Document C   Southeast Asia

Using the same approach as above we build a dictionary and find it has 2820 words.

Index   Entry
1       A
2       A.D
3       API
4       ASEAN
5       Abagatan
6       Abackay
...     ...
2820    Zone

Each document's vector representation now contains many more zeros, since the dictionary holds many words that any given document does not use. As we keep adding documents the dictionary keeps growing, but the number of non-zero entries in each document's vector stays the same, since a document only uses the words it actually contains.

Document A  [877 non-zero entries, 1943 zeros]
Document B  [706 non-zero entries, 2114 zeros]
Document C  [1790 non-zero entries, 1030 zeros]

This means the parts of our numeric data that actually encode the information in each document have become a smaller fraction of the numerical representation. As the collection grows with more documents, this trend continues until the zeros overwhelm the non-zero entries that represent the words actually present.

In practical contexts, where these documents correspond to text-heavy encyclopedia entries or news articles, we also have some metadata about each document, such as the topics or categories it falls under. One document could have the labels biology, nature, and wilderness; another could fall under history, Britain, and nature.

With machine learning, the process by which a computer improves at a task as it sees more data, we want to teach software to recognize documents and label them accurately. The matrix we generated above to describe the documents is called the input data. The output data, some of which we provide up front as examples of correct answers, is the set of labels or categories each document falls under.

A classifier is the piece of code we train to place data into categories. One such classifier is a decision tree, which labels new instances using the most informative features it has identified from the examples it has already seen. In machine learning, a common way to get the best performance is to train many classifiers and combine them into a population that makes one final decision; this is referred to as ensemble learning. One ensemble method is boosting, which builds a strong classifier by combining weak classifiers that each do only slightly better than random guessing. Bagging is another ensemble method, in which classifiers are trained independently on random subsets of the data and then combined, resulting in increased accuracy.
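
As a concrete reference, both of these ensemble strategies are available in scikit-learn, and both default to decision trees as their underlying estimators. The snippet below is only an illustrative sketch of that API, not code from this project:

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

# Boosting: weak decision trees are combined sequentially into a strong classifier.
boosting = AdaBoostClassifier(n_estimators=50)

# Bagging: decision trees are trained independently on random subsets of the data.
bagging = BaggingClassifier(n_estimators=50)

# Both follow the usual estimator interface: fit(X, y) to train, predict(X_new) to classify.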

Data Density

Consider a set of approximately 2.4 million documents, where each document can be labeled with a set of labels drawn from a collection of 325,000 labels. This is exactly the data set behind the Large Scale Hierarchical Text Classification (LSHTC) competition held on Kaggle, a host for machine learning competitions. With data sets this large we accumulate a massive dictionary because of the vast vocabulary encountered across the 2.4 million documents. Consequently each document uses only a small subset of the entries in the dictionary, and its term counts for most of the words in the dictionary are zero. This means our data consists mostly of zeros. Sparsity is the fraction of elements that are zero.

The LSHTC training data set consists of 3.8 trillion elements, only 100 million of which are non-zero. Assuming 4 bytes per element represented as an integer, a computer would need 15.31 TB to load this data set into memory. Since this is not at all realistic, it is useful to represent the data in a sparse format that excludes the zeros. Such a format records only the non-zero elements and their locations. Since the data is approximately 99.9% zeros, this shrinks the memory required down to about 700 MB.
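
To make the idea concrete, here is a small sketch (my own toy example, not the LSHTC data) of how the compressed sparse row (CSR) format provided by scipy.sparse stores only the non-zero values and their positions:

import numpy as np
from scipy import sparse

dense = np.array([[1, 0, 0, 2],
                  [0, 0, 3, 0],
                  [0, 0, 0, 0]])

X = sparse.csr_matrix(dense)
print(X.nnz)      # 3           -> number of stored non-zero elements
print(X.data)     # [1 2 3]     -> the non-zero values
print(X.indices)  # [0 3 2]     -> their column positions
print(X.indptr)   # [0 2 3 3]   -> where each row starts in data/indices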

Improvements With Sparsity

In the following example we see the runtime benefit of a sparse representation by training a k-nearest neighbors (KNN) classifier on the same data, once in sparse format and once in dense format.

We run an experiment on a subset of the 20 Newsgroups sparse data set. Counting the zeros in the input data and dividing by the total number of elements shows that the data consists of roughly 99% zeros: 39.9 million zeros out of 40 million elements. In the two benchmarks below we time the training step of the classifier plus the classification of three new examples. The two code listings differ only in their first two lines, X_train = X_train.toarray() and X_test = X_test.toarray(), which in the dense run convert the data to a matrix that explicitly stores all the zeros. The code that follows is identical in both cases: we initialize the classifier, start a timer, train the classifier, predict on three examples (index slices the test set down to three examples), and finally stop the timer and return the elapsed time. Note that the densification of the data is not included in the timed section, so we are really only comparing the classifiers' training and prediction time.

Dense Format

# Assumes: import time; from sklearn.neighbors import KNeighborsClassifier;
# X_train, X_test (sparse), Y_train, and index come from the
# 20 Newsgroups subset described above.

# Densify data, originally sparse
X_train = X_train.toarray()
X_test = X_test.toarray()

# Initialize K-nn
neigh = KNeighborsClassifier(n_neighbors=3)

# Begin Timing
start = time.clock()

# Train on data
neigh.fit(X_train, Y_train)

# Predict
neigh.predict(X_test[index])

return (time.clock() - start)

Result: 0.84 Seconds

Sparse Format

# Do not densify data; X_train and X_test stay in sparse format

# Initialize K-nn
neigh = KNeighborsClassifier(n_neighbors=3)

# Begin Timing
start = time.clock()

# Train on data
neigh.fit(X_train, Y_train)

# Predict
neigh.predict(X_test[index])

return (time.clock() - start)

Result: 0.01 Seconds

The sparse format shows a speedup of close to two orders of magnitude on the same data. To see how sparse and dense performance vary with data size, we rerun the above experiment with training sets of increasing size and plot how the performance changes.
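
For readers who want to reproduce a similar measurement, the sketch below (my own, not the exact script used for the numbers above) loads a sparse 20 Newsgroups feature matrix with scikit-learn and computes its fraction of zeros:

from sklearn.datasets import fetch_20newsgroups_vectorized

bunch = fetch_20newsgroups_vectorized(subset='train')
X, y = bunch.data, bunch.target                 # X is a scipy.sparse matrix

total_elements = X.shape[0] * X.shape[1]
sparsity = 1.0 - X.nnz / float(total_elements)  # fraction of elements that are zero
print("Sparsity: %.4f" % sparsity)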

Sparsity also arises in images, video, and audio given an appropriate data representation: images can be encoded with wavelets, and all of these formats can be represented sparsely through key feature extraction.

GSoC 2014 - Key Points

My GSoC Project will have two main goals:

  1. To improve input sparsity support for the ensemble methods and the underlying decision tree classifier.
  2. To improve sparse support for multioutput target data.

Sparse input support for the decision tree is making good progress thanks to @fareshedayati (see pull request 2984). Within the ensemble methods, sparse support for bagging has been completed by @glouppe (see pull request 2375) and tested by @msalahi (see pull request 3076); note that the documentation still needs to be updated. However, a lot remains to be done for the boosting meta-estimators, AdaBoost and gradient tree boosting. For sparse multi-output, and especially sparse multi-label, there is no support yet; this will be the next challenge during my GSoC. Finally, I will improve sample weight support, which is important in the ensemble context.
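
To illustrate what sparsely formatted target data means in practice, here is a small sketch of my own (not code from any of the pull requests above) that encodes the multilabel example from earlier as a sparse binary indicator matrix using MultiLabelBinarizer:

from sklearn.preprocessing import MultiLabelBinarizer

# Two documents with the label sets mentioned earlier in the post.
y = [["biology", "nature", "wilderness"],
     ["history", "Britain", "nature"]]

mlb = MultiLabelBinarizer(sparse_output=True)
Y = mlb.fit_transform(y)   # scipy.sparse matrix, one column per label

print(mlb.classes_)        # the label "dictionary"
print(Y.toarray())         # dense view: mostly zeros as the label set grows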