# Splitting a multilabel dataset with Python

## Implementing the Iterative Stratifier from Sechidis et al., 2011 to sample from a multilabel dataset

# The specificity of multilabel datasets

A *label* is the (discrete) information you which to infer from your dataset. This label can be of multiple *classes*: it is a *binary* classification task if you have 2 classes, a *multiclass* problem if you have more.

The **multilabel** case is quite different from the number of values it can take. It means that each sample in your dataset can have *multiple* target values (each of these being of a different class, possibly with replacement).

## Still not clear? Let’s look at an example (click to expand)

Say you want to predict pieces of information about a book :

- Single Label, Binary Classification: is the book a good read, or not?
- Single Label, Multiclass Classification: is the book about Love? Adventure? Philosophy? History? You may only choose one answer
- Multi Label, Multiclass Classification: is the book about Love? Adventure? Philosophy? History? You may choose several answers

I’ll let you think about the multi label, binary classification case.

In most Machine Learning problems you need to split your data into *at least* two sets, ideally three. You *need* a test set to evaluate your model trained on the training set. And you *need* these two to have the **same label distributions**!

## Why not go the easy way?

In the *single-label* situation, the usual and easy way to keep the datasets’ statistics equal is to sample independently each class of the original dataset. It’s a valid procedure in this case: if you want 70% of your data in the train set, you take 70% of samples with class *A*, 70% of samples with class *B* and so on.

How would you do that if each sample can be of multiple classes simultaneously? The single-label procedure is only valid as long as you can sample **independantly** each class, which is no longer possible!

Say you have 3 *classes* and each sample can have up to 2 *labels*, and want a 60/40 *split*:

```
0: [C]
1: [A, B]
2: [B]
3: [A, C]
4: [A, C]
5: [C]
6: [A, C]
7: [A, C]
8: [A]
9: [A, B]
```

You want 4 and 3 `A`

per split. How do you chose which one to put in which dataset? They are not equivalent! How do you chose how to put 2 `B`

in a set, 1 in the other? It depends on how you chose the repartition of `A`

and the distribution of `C`

will be as tricky.

## Stratifying

In their 2011 **paper**, Sechidis *et. al* propose a (partial) solution to this problem: **first** focus on the least represented labels, as they are the most likely to suffer from distriution variations, **then** put these samples in the subset which need it the most.

## Here is an example for such a procedure (click to expand)

In our previous example, we want to sets, 60/40, which would ideally contain:

- Train set:
`0.6 * 7 = 4.2`

samples with a label`A`

`0.6 * 3 = 1.8`

samples with a label`B`

`0.6 * 6 = 3.6`

samples with a label`C`

- Test set:
`0.4 * 7 = 2.8`

samples with a label`A`

`0.4 * 3 = 1.2`

samples with a label`B`

`0.4 * 6 = 2.4`

samples with a label`C`

The stratifying procedure says:

- Label count in the dataset:
`A: 7, B: 3, C: 6`

**Choose label**`B`

- The train set needs
`1.8`

so put sample`1: [A, B]`

in the train set.- It now needs
`0.8`

labels`B`

and`3.2`

labels`A`

- It now needs
- The test set now needs more labels
`B`

:`1.2 > 0.8`

so put sample`2: [B]`

in the test set- It now needs
`0.2`

labels`B`

- It now needs
- The train set now needs more labels
`B`

:`0.8 > 0.2`

so put sample`9: [A, B]`

in the train set- It now needs
`2.2`

labels`A`

and`-0.2`

labels`B`

but it does not matter, we are**done**with`B`

- It now needs

- The train set needs

- Label count in the dataset:
`A: 5, B: 0, C: 6`

**Choose label**(`A`

*surprise!*even though`C`

was more rare in the original dataset,`B`

’s repartition makes it so that we should focus on`A`

more, now)- Do the same thing

## Checking the validity of the procedure

### Labels and Examples Distributions

From the paper (lower is better):

### Kullback-Leibler divergence

This one is not mentionned in the paper but I think it’s also a good indicator that the procedure is valid. The Kullback-Leibler divergence is a measure of (dis-)similarity between two probabilty distributions.

The distributions we’ll compare are those of the flattened **class** distributions: each sample’s labels count one for each class. So in out previous example, `0: [A, C]`

adds 1 to the count of **both** `A`

and `C`

.

A low KL-divergence of the class distributions is not a guarantee but it’s still a good indicator in my opinion.

# Iterative Stratifier

## Implementation

This implementation only relies on `numpy`

and Python 3.

## Experiment

I created a synthetic dataset, with `100`

classes drawn from a decreasing exponential distribution. Each example in the dataset, `100 000`

in total, has up to `10`

labels (at least one, and non-repeating), each one being drawn from the class distribution.

The following figures show

- The target class distribution and their distribution in the synthetic dataset
- The synthetic dataset class distribution and the 2 sampled datasets’

Zooming in, you’d see no difference, lines overlap perfectly! Computed KL-divergences agree: `KL(original, train) = KL(original, test) = -0.0001`

. Final repartition is `69912 | 14890 | 15198`

, *i.e.* `train: 69.9% | val: 14.9% | test: 15.2%`

.

Experiments with different distributions, number of classes, number of labels and number of examples agree, this is not a best-case scenario.

## More info

- This implementation can handle string labels just as well
- For some reason, sampling 3 datasets (train, val, test) works best by sequentially stratifying than specifying 3 ratios. Could be because of the low classes
- I could not reproduce the paper’s metrics on the dataset. I’m not far from them but I may have made a mistake in the metrics’ code. I did raise the issue. But looking at the distribution, my implementation can’t be so far off
- I implemented the Second Order Iterative Stratifier by Szymański
*et. al*, 2017 but I do not use it because its complexity is squared in the number of labels and it does not seem to result in much more than slightly lowering the variance of the algorithm. If you wanted to, it would be quite easy to code from the regular Iterative Stratifier above