Which loss should you use? How to use the
tf.data.Dataset API with a train and a validation set? How to use streaming metrics? Here are my answers.
This post is about the specifics of the multilabel setting, and a little about how to handle sequences of sequences. It is not about an NLP pipeline, nor is it about the model you should use. The overall idea is to show how to use the
Dataset API and to point out a few gotchas when handling multilabel data. Check out the table of contents for more details.
A lot of the content here comes from
Feel free to comment on what’s written here: typos, suggestions, other changes I’ve missed or errors I’ve made! Also, some things may not be optimal, I’m open to improvements to my solutions!
Preparing the dataset
Train, Validation, Test: Sampling
In the single-label situation, the usual and easy way to keep the datasets’ statistics equal is to sample each class of the original dataset independently. It’s a valid procedure in this case: if you want 70% of your data in the train set, you take 70% of the samples with class A, 70% of the samples with class B, and so on.
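As a plain-Python sketch of that per-class sampling (the helper name and toy labels are made up for illustration):

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, rest_idx), sampling train_frac of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, rest_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(train_frac * len(indices))
        train_idx.extend(indices[:cut])
        rest_idx.extend(indices[cut:])
    return train_idx, rest_idx

labels = ["A"] * 10 + ["B"] * 20
train_idx, rest_idx = stratified_split(labels)
```

Each class contributes exactly 70% of its samples to the train split, so the class proportions are preserved.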
How would you do that if each sample can belong to multiple classes simultaneously? The single-label procedure is only valid as long as you can sample each class independently, which is no longer possible!
Check out my blog post on sampling multilabel datasets to appropriately do so.
Text: sequences of sequences
You obviously need to prepare your text according to standard NLP pipelines. As we’ll use the
tf.data.Dataset API, we’ll simply write our texts to a text file, one text to be classified per line. Something like:
My work involves working with 2-level sequences: the Hierarchical Attention Network requires the data to be processed as documents, which are lists of sentences, which are lists of words. If you don’t need this hierarchical structure, feel free to move on. If you do need it, my solution is to still write one document per line, but to separate sentences with a fixed token.
For instance, we’d write such a text file:

```
This is a comment from the yelp dataset .|&|It represents a document .|&|Its sentences are separated by a token .
This is another comment .|&|It says that John's pizzas are great .|&|The author would go back .|&|Nice staff .
...
```
This will allow us to split documents on
|&| and then sentences on whitespace.
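In plain-Python terms, with a toy document, the two-level split amounts to:

```python
# one line of the text file = one document; |&| separates its sentences
doc = "This is a comment .|&|It has two sentences ."

# split into sentences, then each sentence into words
sentences = [sentence.split() for sentence in doc.split("|&|")]
```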
In any case we need to write a text file with the vocabulary. Each line should contain a word and the line’s number will be the word’s index in the vocabulary.
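A minimal sketch of writing such a vocabulary file (`write_vocab` and `vocab.txt` are hypothetical names; putting the padding token first gives it index 0, which matters later for the padding value):

```python
def write_vocab(words, path):
    # one word per line; the line number is the word's index
    with open(path, "w") as f:
        # reserve index 0 for the padding token
        for word in ["<pad>"] + sorted(words):
            f.write(word + "\n")

write_vocab({"pizzas", "great", "staff"}, "vocab.txt")
```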
Again, we’ll write labels in a text file:
- Find all possible labels
- Assign them an index
- One-hot encode lists of labels
- Write to text file.
For instance say you have 4 classes, up to 3 labels and 5 samples:
```
1, 0, 0, 1
0, 1, 1, 1
1, 1, 0, 0
0, 0, 0, 1
1, 1, 0, 0
```
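The four steps above might be sketched like this (the label names, helper, and `labels.txt` are made up for illustration):

```python
# 1. find all possible labels, 2. assign them an index
all_labels = sorted({"ambiance", "food", "price", "service"})
label_index = {label: i for i, label in enumerate(all_labels)}

# 3. one-hot encode each sample's list of labels
def one_hot(sample_labels):
    row = [0] * len(all_labels)
    for label in sample_labels:
        row[label_index[label]] = 1
    return row

samples = [["food", "price"], ["service"], ["food", "service", "ambiance"]]
rows = [one_hot(s) for s in samples]

# 4. write to a text file, one sample per line
with open("labels.txt", "w") as f:
    for row in rows:
        f.write(", ".join(map(str, row)) + "\n")
```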
Loading the data with tf.data.Dataset
Feeding sequences of sequences inside a Tensorflow Dataset
In the regular situation, all we have to do is split texts into words. In the hierarchical situation, we also need to split sentences:
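Here is a self-contained sketch of that mapping, written against the current tf.strings API (in the TF 1.x of this post the equivalent was tf.string_split; the toy documents stand in for a tf.data.TextLineDataset over the text file):

```python
import tensorflow as tf

# toy stand-in for tf.data.TextLineDataset("documents.txt")
docs = tf.data.Dataset.from_tensor_slices(
    ["This is a sentence .|&|This is another .",
     "A one-sentence document ."])

def split_document(doc):
    # split the document on the |&| token into sentences,
    # then each sentence on whitespace into words
    sentences = tf.strings.split([doc], sep="|&|").values
    return tf.strings.split(sentences)  # ragged: [sentences, words]

dataset = docs.map(split_document)
```

The next step (not shown) would be a vocabulary lookup turning each word into its index.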
Processing the labels
We need to read the one-hot encoded text file and turn it into tensors:
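For instance (written for eager TF ≥ 2; the toy lines stand in for a tf.data.TextLineDataset over the labels file):

```python
import tensorflow as tf

# toy stand-in for tf.data.TextLineDataset("labels.txt")
lines = tf.data.Dataset.from_tensor_slices(["1, 0, 0, 1", "0, 1, 1, 1"])

def parse_labels(line):
    # "1, 0, 0, 1" -> [1.0, 0.0, 0.0, 1.0]
    return tf.strings.to_number(tf.strings.split([line], sep=", ").values)

labels_dataset = lines.map(parse_labels)
```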
Dataset and input Tensors
Now we need to zip the labels and texts datasets together so that we can shuffle them together, batch and prefetch them:
Repeating for several epochs will be done manually at run time for more flexibility.
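A runnable sketch of the zip / shuffle / padded_batch / prefetch chain (toy in-memory generators stand in for the text and label pipelines; written for eager TF ≥ 2.4 because of output_signature):

```python
import tensorflow as tf

def gen_features():
    # each document: a [num_sentences, num_words] matrix of word indexes
    yield [[1, 2, 3], [4, 5, 0]]
    yield [[6, 7, 0]]

def gen_labels():
    yield [1, 0, 0, 1]
    yield [0, 1, 1, 1]

text_dataset = tf.data.Dataset.from_generator(
    gen_features, output_signature=tf.TensorSpec([None, None], tf.int64))
labels_dataset = tf.data.Dataset.from_generator(
    gen_labels, output_signature=tf.TensorSpec([None], tf.int64))

dataset = tf.data.Dataset.zip((text_dataset, labels_dataset))
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.padded_batch(
    2,
    padded_shapes=(tf.TensorShape([None, None]), tf.TensorShape([None])),
    padding_values=(tf.constant(0, tf.int64), tf.constant(0, tf.int64)))
dataset = dataset.prefetch(1)
```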
padded_shapes is a tuple. The first
shape will be used to pad the features (i.e. the 3D Tensor with the list of word indexes for each sentence in each document), and the second is for the labels.
The labels won’t require padding as they are already a consistent 2D array in the text file, which will be converted to a 2D Tensor. But TensorFlow does not know it won’t need to pad the labels, so we still need to specify the
padded_shapes argument: if need be, the Dataset should pad each sample as a 1D Tensor (hence
tf.TensorShape([None])). For instance if a label was
[0, 1, 0, 0, 1] and the next one was
[0, 1], then the padding would be
[0, 0, 0] as we said that the
padding_value should be 0.
The features, on the other hand, will need padding: within a batch (a list of documents, the first dimension of the 3D batch Tensor), all documents won’t have the same number of sentences (2nd dimension of the Tensor), and all sentences within the batch won’t have the same number of words (last dimension). The Dataset may therefore need to pad 2 dimensions (sentences and words), hence
tf.TensorShape([None, None]). And as we put the padding token first in the vocabulary, its index is 0, so the
padding_value is also 0.
Lastly, we need the types of
padding_values to be consistent with the types of the tensors produced by the
text_dataset and the
labels_dataset, which is why I used
Handling the validation data
What you should actually do is have one dataset for your training data and another for your validation data. This lets you use one without affecting the other. People usually use a single dataset and re-initialize it with the validation data at the end of each epoch. I like having two datasets because I don’t want to wait for the end of an epoch to validate.
To do so, just do the previous procedure inside a
with tf.variable_scope(train_or_val): and use
tf.cond to choose the dataset:
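A toy version of that pattern, using tf.compat.v1 so it runs on current TF (the zeros/ones datasets stand in for the real train and validation pipelines):

```python
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()  # TF 1.x-style graph mode, as in this post

train_dataset = tf1.data.Dataset.from_tensor_slices(tf.zeros([4])).batch(2)
val_dataset = tf1.data.Dataset.from_tensor_slices(tf.ones([4])).batch(2)

with tf1.variable_scope("train"):
    train_iterator = tf1.data.make_initializable_iterator(train_dataset)
with tf1.variable_scope("val"):
    val_iterator = tf1.data.make_initializable_iterator(val_dataset)

# boolean placeholder deciding which dataset feeds the model
is_val = tf1.placeholder_with_default(False, [], name="is_val")
batch = tf.cond(is_val, val_iterator.get_next, train_iterator.get_next)

with tf1.Session() as sess:
    sess.run([train_iterator.initializer, val_iterator.initializer])
    train_batch = sess.run(batch)                # from the train dataset
    val_batch = sess.run(batch, {is_val: True})  # from the validation dataset
```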
We want each dimension of our model’s logits to be an independent logistic regression. We’ll therefore use
Notice how the loss is summed across classes before it is averaged over the batch. I’m not 100% positive this is the right reduction; if you have an opinion, do check out the discussion on GitHub here.
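Concretely, with tf.nn.sigmoid_cross_entropy_with_logits (one standard choice for independent per-class logistic losses; the logits and labels below are toy values):

```python
import tensorflow as tf

# toy batch: 2 samples, 3 classes
logits = tf.constant([[2.0, -1.0, 0.5], [-0.5, 1.5, -2.0]])
labels = tf.constant([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])

# one independent logistic loss per class: shape [batch, classes]
per_class = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)

# sum across classes, then average over the batch
loss = tf.reduce_mean(tf.reduce_sum(per_class, axis=-1))
```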
Unlike the single-label case, we should not output a softmax probability distribution, as labels are classified independently. We just need to apply a
sigmoid to the logits, as they are independent logistic regressions:
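For instance (toy logits; the 0.5 threshold is one simple choice):

```python
import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.5], [-0.5, 1.5, -2.0]])

probs = tf.sigmoid(logits)  # per-class probabilities, no softmax
predictions = tf.cast(probs > 0.5, tf.int32)  # threshold each class independently
```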
I highly recommend you learn about streaming metrics. Also, check out my previous blog post about the streaming F1 score in the multilabel setting to understand
streaming_f1. Here is a function meant to gather training and validation metrics:
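As a sketch of what “streaming” means here: the stateful tf.keras metrics accumulate counts across batches, playing the same role as TF 1.x’s tf.metrics ops used in this post (toy binary labels/predictions; the F1 combination mirrors the streaming_f1 idea):

```python
import tensorflow as tf

# stateful metrics: update_state accumulates counts across batches
precision = tf.keras.metrics.Precision()
recall = tf.keras.metrics.Recall()

batches = [([0, 1, 1], [0, 1, 0]),
           ([1, 0, 1], [1, 0, 1])]
for y_true, y_pred in batches:
    precision.update_state(y_true, y_pred)
    recall.update_state(y_true, y_pred)

p = float(precision.result().numpy())
r = float(recall.result().numpy())
f1 = 2 * p * r / (p + r)  # F1 from the *accumulated* counts
```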
Here is the skeleton of a training procedure.
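Under the same graph/session assumptions as this post (via tf.compat.v1 so it runs today), a toy skeleton might look like this; the dataset, linear model, and hyper-parameters are all dummies:

```python
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()  # TF 1.x-style graph mode

# dummy stand-in for the zipped (features, labels) dataset
dataset = tf1.data.Dataset.from_tensor_slices(
    (tf.random.uniform([8, 4]), tf.ones([8, 3]))).batch(4)
iterator = tf1.data.make_initializable_iterator(dataset)
features, labels = iterator.get_next()

# dummy linear "model": 4 inputs -> 3 independent logistic regressions
w = tf1.get_variable("w", [4, 3])
b = tf1.get_variable("b", [3], initializer=tf1.zeros_initializer())
logits = tf.matmul(features, w) + b

loss = tf.reduce_mean(tf.reduce_sum(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits),
    axis=-1))
train_op = tf1.train.AdamOptimizer(1e-3).minimize(loss)

losses = []
with tf1.Session() as sess:
    sess.run(tf1.global_variables_initializer())
    for epoch in range(2):                # epochs are handled manually
        sess.run(iterator.initializer)    # restart the dataset each epoch
        while True:                       # drain the dataset
            try:
                _, batch_loss = sess.run([train_op, loss])
                losses.append(batch_loss)
            except tf.errors.OutOfRangeError:
                break
```

Re-initializing the iterator at the top of the loop is what replaces Dataset.repeat(), giving per-epoch control at run time.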