
# Splitting a multilabel dataset with Python

## Implementing the Iterative Stratifier from Sechidis et al., 2011 to sample from a multilabel dataset

1. The specificity of multilabel datasets
2. Iterative Stratifier

# The specificity of multilabel datasets

A label is the (discrete) information you wish to infer from your dataset. This label can take one of several classes: it is a binary classification task if you have 2 classes, a multiclass problem if you have more.

The multilabel case is not about the number of values a label can take: it means that each sample in your dataset can have multiple target values simultaneously (each of these being one of the possible classes).

Still not clear? Let’s look at an example.

Say you want to predict pieces of information about a book:

• Single Label, Binary Classification: is the book a good read, or not?
• Single Label, Multiclass Classification: is the book about Love? Adventure? Philosophy? History? You may only choose one answer
• Multi Label, Multiclass Classification: is the book about Love? Adventure? Philosophy? History? You may choose several answers

I’ll let you think about the multi label, binary classification case.

In most Machine Learning problems you need to split your data into at least two sets, ideally three. You need a test set to evaluate your model trained on the training set. And you need these two to have the same label distributions!

## Why not go the easy way?

In the single-label situation, the usual and easy way to keep the datasets’ statistics equal is to sample independently each class of the original dataset. It’s a valid procedure in this case: if you want 70% of your data in the train set, you take 70% of samples with class A, 70% of samples with class B and so on.
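
This single-label recipe can be sketched in a few lines of plain Python (names here are illustrative, not from any particular library):

```python
import random

def single_label_split(samples, train_frac=0.7, seed=0):
    """Split a single-label dataset by sampling each class independently.

    `samples` is a list of (sample_id, label) pairs. Roughly `train_frac`
    of each class ends up in the train set, so per-class statistics match.
    """
    rng = random.Random(seed)
    by_class = {}
    for sample_id, label in samples:
        by_class.setdefault(label, []).append(sample_id)

    train, test = [], []
    for ids in by_class.values():
        rng.shuffle(ids)                      # sample each class at random
        cut = round(len(ids) * train_frac)    # 70% of *this* class
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    return train, test
```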

How would you do that if each sample can belong to multiple classes simultaneously? The single-label procedure is only valid as long as you can sample each class independently, which is no longer possible!

Say you have 3 classes and each sample can have up to 2 labels, and want a 60/40 split:

```
0: [C]
1: [A, B]
2: [B]
3: [A, C]
4: [A, C]
5: [C]
6: [A, C]
7: [A, C]
8: [A]
9: [A, B]
```

You want roughly 4 `A` in one split and 3 in the other. How do you choose which ones go where? The samples are not interchangeable! How do you choose which 2 `B` to put in one set and 1 in the other? It depends on how you allocated the `A` samples, and the distribution of `C` will be just as tricky.
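
Counting labels in the toy dataset makes the tension concrete (a quick pure-Python check):

```python
from collections import Counter

# The toy dataset from above: sample id -> labels.
dataset = {
    0: ["C"], 1: ["A", "B"], 2: ["B"], 3: ["A", "C"], 4: ["A", "C"],
    5: ["C"], 6: ["A", "C"], 7: ["A", "C"], 8: ["A"], 9: ["A", "B"],
}

counts = Counter(label for labels in dataset.values() for label in labels)
print(counts)  # A appears 7 times, C 6 times, B 3 times
```

Note that 6 of the 7 `A` samples carry a second label, so every `A` you place also drags a `B` or a `C` along with it: the classes cannot be sampled independently.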

## Stratifying

In their 2011 paper, Sechidis et al. propose a (partial) solution to this problem: first focus on the least represented labels, as they are the most likely to suffer from distribution variations, then put these samples in the subset which needs them the most.

Here is an example of such a procedure.

In our previous example, we want two sets, split 60/40, which would ideally contain:

• Train set:
• `0.6 * 7 = 4.2` samples with a label `A`
• `0.6 * 3 = 1.8` samples with a label `B`
• `0.6 * 6 = 3.6` samples with a label `C`
• Test set:
• `0.4 * 7 = 2.8` samples with a label `A`
• `0.4 * 3 = 1.2` samples with a label `B`
• `0.4 * 6 = 2.4` samples with a label `C`
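
These targets are just the label totals multiplied by the split ratios; sketched in Python (variable names are mine):

```python
from collections import Counter

# The toy dataset from above: sample id -> labels.
dataset = {
    0: ["C"], 1: ["A", "B"], 2: ["B"], 3: ["A", "C"], 4: ["A", "C"],
    5: ["C"], 6: ["A", "C"], 7: ["A", "C"], 8: ["A"], 9: ["A", "B"],
}
ratios = {"train": 0.6, "test": 0.4}

totals = Counter(l for labels in dataset.values() for l in labels)
desired = {
    split: {label: ratio * count for label, count in totals.items()}
    for split, ratio in ratios.items()
}
# e.g. desired["train"]["A"] ≈ 4.2 and desired["test"]["B"] ≈ 1.2
```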

The stratifying procedure says:

• Label count in the dataset: `A: 7, B: 3, C: 6`
• Choose label `B`
• The train set needs `1.8` so put sample `1: [A, B]` in the train set.
• It now needs `0.8` labels `B` and `3.2` labels `A`
• The test set now needs more labels `B`: `1.2 > 0.8` so put sample `2: [B]` in the test set
• It now needs `0.2` labels `B`
• The train set now needs more labels `B`: `0.8 > 0.2` so put sample `9: [A, B]` in the train set
• It now needs `2.2` labels `A` and `-0.2` labels `B` but it does not matter, we are done with `B`
• Label count in the dataset: `A: 5, B: 0, C: 6`
• Choose label `A` (surprise! even though `C` was rarer in the original dataset, the way the `B` samples were allocated means we should now focus on `A`)
• Do the same thing
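
The whole procedure can be sketched in plain Python. This follows the paper’s greedy loop but is my own simplified rendering (ties broken randomly rather than by remaining subset size), not the author’s `numpy` implementation:

```python
import random
from collections import Counter

def iterative_stratify(dataset, ratios, seed=0):
    """Greedy iterative stratification (after Sechidis et al., 2011), sketched.

    `dataset`: sample id -> list of labels (at least one label per sample);
    `ratios`: split name -> fraction. Returns split name -> set of sample ids.
    """
    rng = random.Random(seed)
    remaining = dict(dataset)
    totals = Counter(l for labels in dataset.values() for l in labels)
    # How many samples of each label every split still "needs".
    desired = {s: {l: r * c for l, c in totals.items()} for s, r in ratios.items()}
    splits = {s: set() for s in ratios}

    while remaining:
        # Focus on the rarest label still present in the unassigned pool.
        counts = Counter(l for labels in remaining.values() for l in labels)
        label = min(counts, key=counts.get)
        for i in [i for i, ls in remaining.items() if label in ls]:
            # Give the sample to the split that needs this label the most,
            # breaking exact ties at random.
            best = max(desired[s][label] for s in splits)
            split = rng.choice([s for s in splits if desired[s][label] == best])
            splits[split].add(i)
            for l in remaining.pop(i):
                desired[split][l] -= 1
    return splits
```

On the toy dataset this reproduces the walkthrough: samples `1` and `9` land in the train set, sample `2` in the test set, and the final sizes are 6 and 4.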

## Checking the validity of the procedure

### Labels and Examples Distributions

From the paper (lower is better): the Labels Distribution and the Examples Distribution measures.

### Kullback-Leibler divergence

This one is not mentioned in the paper, but I think it is also a good indicator that the procedure is valid. The Kullback-Leibler divergence is a measure of (dis)similarity between two probability distributions.

The distributions we’ll compare are the flattened class distributions: each sample’s labels count once for each class. So in our previous example, sample `3: [A, C]` adds 1 to the count of both `A` and `C`.
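
As a sketch (function names are mine), the flattened distribution and the divergence can be computed with `numpy`:

```python
import numpy as np

def class_distribution(dataset, classes):
    """Flattened class distribution: each label of each sample counts once."""
    counts = np.array(
        [sum(l == c for labels in dataset.values() for l in labels) for c in classes],
        dtype=float,
    )
    return counts / counts.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against classes absent from q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # 0 * log(0) is taken as 0, by convention
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))
```

Since KL(p‖p) = 0 and the divergence grows as the split distribution drifts away from the original, comparing `KL(original, train)` and `KL(original, test)` gives a quick sanity check.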

A low KL-divergence of the class distributions is not a guarantee but it’s still a good indicator in my opinion.

# Iterative Stratifier

## Implementation

This implementation relies only on `numpy` and Python 3.

## Experiment

I created a synthetic dataset, with `100` classes drawn from a decreasing exponential distribution. Each example in the dataset, `100 000` in total, has up to `10` labels (at least one, and non-repeating), each one being drawn from the class distribution.
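
Such a generator might look like this (the decay rate and names are my assumptions; the original setup isn’t reproduced here):

```python
import numpy as np

def make_synthetic_dataset(n_samples=100_000, n_classes=100, max_labels=10, seed=0):
    """Each sample gets 1..max_labels distinct labels drawn (without
    replacement) from a decreasing-exponential class distribution."""
    rng = np.random.default_rng(seed)
    weights = np.exp(-np.arange(n_classes) / 10.0)  # assumed decay rate
    probs = weights / weights.sum()
    return {
        i: list(rng.choice(n_classes, size=rng.integers(1, max_labels + 1),
                           replace=False, p=probs))
        for i in range(n_samples)
    }
```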

The following figures show:

• The target class distribution and its realization in the synthetic dataset
• The synthetic dataset’s class distribution and those of the sampled datasets

Zooming in, you’d see no difference: the lines overlap perfectly! The computed KL-divergences agree: `KL(original, train) = KL(original, test) = -0.0001`. The final split is `69912 | 14890 | 15198` samples, i.e. `train: 69.9% | val: 14.9% | test: 15.2%`.

Experiments with different distributions, numbers of classes, numbers of labels and numbers of examples agree: this is not a best-case scenario.