Hamzeh Alsalhi's blog

Sparsely Formatted Target Data

For introduction see: Sparse Support for scikit-learn Project Outline


There are different ways to represent target data. This week I worked on a system that converts the different formats to an easy-to-use matrix format. The work is being done in pull request 3202. My work on this system introduced support for optionally representing this data matrix sparsely. When the pull request is completed, the final result will be support for sparse target data, ready to be used by the upcoming sparse One vs. Rest classifiers.

The function of this data converter is to take multiclass or multilabel target data and represent it in a binary fashion, so that classifiers that work on binary data can be used with no modification. For example, target data might come from the user as integer classes: 1 for Car, 2 for Airplane, 3 for Boat, and 4 for Helicopter. We label each of the following 5 images with the appropriate class:

2
1
3
2
4

In a list-of-lists format, we list each image's labels one after the other, so this data would look like [[2], [1], [3], [2], [4]].

This label binarizer will give a matrix where each column is an indicator for a class and each row is an image/example:

[[0 1 0 0]
 [1 0 0 0]
 [0 0 1 0]
 [0 1 0 0]
 [0 0 0 1]]
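The conversion itself can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code in scikit-learn or in the pull request, and binarize_labels is a made-up helper name.

```python
def binarize_labels(y, classes):
    """Return one row per example and one 0/1 indicator column per class."""
    column_of = {label: j for j, label in enumerate(classes)}
    matrix = []
    for labels in y:
        # Accept a bare label (multiclass) or a list of labels (multilabel).
        if not isinstance(labels, (list, tuple, set)):
            labels = [labels]
        row = [0] * len(classes)
        for label in labels:
            row[column_of[label]] = 1
        matrix.append(row)
    return matrix


# The five images from the example, one label each, in list-of-lists form.
y = [[2], [1], [3], [2], [4]]
Y_bin = binarize_labels(y, classes=[1, 2, 3, 4])
for row in Y_bin:
    print(row)
```

An image with several labels, say [1, 3], would simply get a 1 in more than one column of its row.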

Before my pull request, every conversion by the label binarizer returned this matrix in dense format. My pull request lets the user request sparse output instead. The result is then a sparse matrix, which can save a lot of space and runtime, depending on how sparse the target data is.
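To get a feel for the potential savings, here is a rough back-of-the-envelope sketch in plain Python. It assumes a dense matrix stores every entry, while a CSR matrix stores one value and one column index per nonzero entry plus a row-pointer array. The 8-byte values, 4-byte indices, and the larger dataset size are assumptions for illustration, and the helper names are made up; real numbers depend on the dtypes in use.

```python
def dense_bytes(n_samples, n_classes, value_size=8):
    # A dense matrix stores every entry, zero or not.
    return n_samples * n_classes * value_size


def csr_bytes(n_samples, n_nonzero, value_size=8, index_size=4):
    # CSR keeps one value and one column index per nonzero entry,
    # plus a row-pointer array with n_samples + 1 entries.
    return n_nonzero * (value_size + index_size) + (n_samples + 1) * index_size


# The 5 x 4 example from this post: one label per image, so 5 nonzeros.
print(dense_bytes(5, 4), csr_bytes(5, 5))

# A hypothetical larger problem: 100,000 examples, 1,000 classes, one label each.
n_samples, n_classes = 100_000, 1_000
dense = dense_bytes(n_samples, n_classes)
sparse = csr_bytes(n_samples, n_nonzero=n_samples)
print("dense is roughly %.0f times larger" % (dense / sparse))
```

For a matrix as small as the example the difference hardly matters; the gap grows with the number of classes, because the dense cost scales with the number of columns while the sparse cost does not.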

These two calls to the label binarizer illustrate how sparse output can be enabled. The first call returns a dense matrix and the second returns a sparse matrix.

Dense Format

  Input:

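from sklearn.preprocessing import label_binarize

y = [2, 1, 3, 2, 4]  # class label of each of the five images
Y_bin = label_binarize(y, classes=[1, 2, 3, 4])
print(type(Y_bin))
print(Y_bin)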

  Output:

<type 'numpy.ndarray'>
[[0 1 0 0]
 [1 0 0 0]
 [0 0 1 0]
 [0 1 0 0]
 [0 0 0 1]]

Sparse Format

After my pull request, the label binarizer supports sparse output.

  Input:

Y_bin = label_binarize(y, classes=[1, 2, 3, 4], sparse_output=True)
print(type(Y_bin))
print(Y_bin)

  Output:

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 1)    1
  (1, 0)    1
  (2, 2)    1
  (3, 1)    1
  (4, 3)    1
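
Each line of the sparse printout is a (row, column) coordinate followed by the stored value; every entry not listed is zero. As a small sketch, the coordinates above can be fed back into SciPy to rebuild the dense matrix. Only csr_matrix and toarray are used here, and the variable names are just for illustration.

```python
from scipy.sparse import csr_matrix

# Coordinates and values copied from the sparse output above.
rows = [0, 1, 2, 3, 4]
cols = [1, 0, 2, 1, 3]
values = [1, 1, 1, 1, 1]

Y_sparse = csr_matrix((values, (rows, cols)), shape=(5, 4))

# toarray() expands the sparse matrix into an ordinary numpy array.
Y_dense = Y_sparse.toarray()
print(Y_dense)
print(Y_sparse.nnz, "stored entries out of", Y_dense.size)
```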

The next pull request, for sparse One vs. Rest support, is what motivated this update: on datasets with a large number of labels, dense target data leads to extreme runtime and space requirements, and we want to overcome those constraints.
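
To make the connection concrete: One vs. Rest trains one binary classifier per class, and the target for a class is simply that class's column of the binarized matrix. The sketch below is plain Python, not the code from either pull request, and positives_per_class is a hypothetical helper. The grouping step touches only the nonzero coordinates, which is why a sparse representation helps when there are many labels.

```python
from collections import defaultdict


def positives_per_class(coordinates):
    """Group the nonzero (row, column) pairs by column, i.e. by class."""
    positives = defaultdict(list)
    for row, col in coordinates:
        positives[col].append(row)
    return dict(positives)


# Nonzero coordinates of the 5 x 4 example matrix.
coordinates = [(0, 1), (1, 0), (2, 2), (3, 1), (4, 3)]
by_class = positives_per_class(coordinates)

n_samples = 5
for col in sorted(by_class):
    # Binary target for this class: 1 for its examples, 0 for the rest.
    # It is expanded to a full list here only so it can be printed.
    target = [1 if i in by_class[col] else 0 for i in range(n_samples)]
    print("class column", col, "->", target)
```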

Thank you to the reviewers Arnaud, Joel, and Oliver for their comments this week, and to Rohit for starting the code on which I based my changes.