My day-to-day work involves multilabel text classification with Tensorflow. It’s quite new to me and not so common, so I stumble upon a variety of problems. I’m sharing my solutions with you these days, including how to sample a multilabel dataset; today’s post is about how to compute the f1 score in Tensorflow.
The first section explains the difference between the single-label and multi-label cases; the second is about computing the multi-label f1 score from the predicted and target values; the third covers how to deal with batch-wise data and get an overall final score; and lastly I’ll share a piece of code proving it all works!
The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is: F1 = 2 * (precision * recall) / (precision + recall)
In a multi-label setting there are 3 main ways of extending this definition:
Define the precision and recall globally, for all labels: it is the micro f1 score
each label occurrence is counted on its own: if 10 of a sample’s labels are predicted correctly but one extra label is predicted, that extra label counts as a single False Positive (and a missing label as a single False Negative)
Define them per class and average:
if every class has the same importance, the f1 score is the mean of f1 scores per class: it is the macro f1 score
if each class should be weighted according to the number of samples with this class (the support), it is called the weighted f1 score
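To make the three averages concrete, here is a small hand-rolled NumPy sketch on a toy dataset (4 samples, 3 labels; the data and helper names are made up for illustration):

```python
import numpy as np

# Hypothetical toy data: rows are samples, columns are labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 1, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 1, 1]])

def counts(axis=None):
    tp = np.count_nonzero(y_pred * y_true, axis=axis)
    fp = np.count_nonzero(y_pred * (1 - y_true), axis=axis)
    fn = np.count_nonzero((1 - y_pred) * y_true, axis=axis)
    return tp, fp, fn

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

micro = f1(*counts(axis=None))               # global counts over all cells
per_class = f1(*counts(axis=0))              # one f1 score per label
macro = per_class.mean()                     # unweighted mean over labels
weights = y_true.sum(axis=0) / y_true.sum()  # per-class support
weighted = (per_class * weights).sum()

print(micro, macro, weighted)  # ~0.769, ~0.767, ~0.771
```

On this toy data the three averages already disagree, because the hardest class (the third) has the smallest support.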
This section is about implementing a multi-label f1 score in Tensorflow, in a similar way to Scikit-Learn. Most content originally comes from my answer on Stack Overflow.
I believe almost everything is straightforward except the axis argument in tf.count_nonzero. If something else is unclear, ask in the comments.
axis = None means that we count all non-zero locations in the matrix (y_pred * y_true for instance).
axis = 0 means that we count non-zero locations per class, which you can see as summing over rows, yielding 1 f1 score per column.
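Since tf.count_nonzero mirrors NumPy’s count_nonzero here, a quick NumPy sketch (toy matrix, illustrative only) shows the two axis behaviours:

```python
import numpy as np

# Toy matrix standing in for y_pred * y_true: 2 samples, 3 classes.
m = np.array([[1, 0, 1],
              [0, 1, 0]])

print(np.count_nonzero(m))          # 3 -> axis=None: one global count (micro)
print(np.count_nonzero(m, axis=0))  # [1 1 1] -> one count per column/class (macro)
```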
Streaming Multilabel f1 score
Now what if you can’t compute y_true and y_pred for your whole dataset? You need a way to aggregate these results and only then compute the final f1 score. Obviously you can’t just sum up f1 scores across batches. But you can sum counts! So we’ll sum TP, FP and FN over batches and compute the f1 score accordingly.
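A quick sketch with made-up micro counts shows why averaging per-batch f1 scores is wrong while summing counts is fine:

```python
def f1(tp, fp, fn):
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

# Hypothetical (tp, fp, fn) micro counts for two batches.
batches = [(3, 1, 0), (1, 0, 3)]

# Averaging the per-batch f1 scores...
mean_of_f1s = sum(f1(*b) for b in batches) / len(batches)
# ...does not match the f1 of the summed counts.
tp, fp, fn = map(sum, zip(*batches))
f1_of_sums = f1(tp, fp, fn)

print(mean_of_f1s, f1_of_sums)  # ~0.629 vs ~0.667 -- not the same
```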
Need a primer on Tensorflow metrics and updates to these?
When you compute a streaming metric you need 2 things: the Tensor holding the value you’re looking for, and an update_op to feed new values to the metric, typically once per batch.
Here is an example, counting the sum of natural numbers to 100:
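The original snippet builds this with a tf.Variable and an update op run in a session; here is a plain-Python sketch of the same value/update pattern (names illustrative) that produces the output below:

```python
total = 0  # plays the role of the metric Variable

def update_op(value):
    """Plays the role of the update_op (tf.assign_add on the Variable)."""
    global total
    total += value
    return total

for i in range(101):
    update_op(i)
    if i % 25 == 0:
        print(f"Sum of ints to {i}: {total}")

print(f"Should be: {sum(range(101))}")
```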
Sum of ints to 0: 0
Sum of ints to 25: 325
Sum of ints to 50: 1275
Sum of ints to 75: 2850
Sum of ints to 100: 5050
Should be: 5050
There are therefore 3 steps:
Define the Variables which will hold the values of TP, FP and FN.
a. We need 2 of each, because the micro counts are summed over all dimensions while the macro ones are summed only over the first (batch) axis.
b. We don’t need to define specific TP, FP and FN variables for the weighted f1 score, as it can be inferred from the macro per-class scores provided we have the weights (the supports).
Define the update_ops
Define the final f1 scores from the Variables
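The three steps can be sketched in NumPy (the real implementation holds the counts in tf.Variables and feeds them through update ops inside a session; all names and the toy batches below are illustrative):

```python
import numpy as np

n_classes = 3

# Step 1: variables holding the running counts.
micro_counts = np.zeros(3)               # [tp, fp, fn] summed over everything
macro_counts = np.zeros((3, n_classes))  # [tp, fp, fn] rows, one column per class
support = np.zeros(n_classes)            # needed for the weighted average

# Step 2: the update op, called once per batch.
def update(y_true, y_pred):
    tp = y_pred * y_true
    fp = y_pred * (1 - y_true)
    fn = (1 - y_pred) * y_true
    for i, m in enumerate((tp, fp, fn)):
        micro_counts[i] += np.count_nonzero(m)          # global counts
        macro_counts[i] += np.count_nonzero(m, axis=0)  # per-class counts
    support[:] += y_true.sum(axis=0)

# Step 3: the final f1 scores, computed from the accumulated counts.
def streamed_f1():
    def f1(tp, fp, fn):
        p, r = tp / (tp + fp), tp / (tp + fn)
        return 2 * p * r / (p + r)
    micro = f1(*micro_counts)
    per_class = f1(*macro_counts)
    macro = per_class.mean()
    weighted = (per_class * support / support.sum()).sum()
    return micro, macro, weighted

# Two hypothetical batches; streaming them gives the same scores as
# computing over the concatenated data would.
b1 = (np.array([[1, 0, 1], [0, 1, 0]]), np.array([[1, 0, 0], [0, 1, 1]]))
b2 = (np.array([[1, 1, 0], [0, 1, 1]]), np.array([[1, 0, 0], [0, 1, 1]]))
for yt, yp in (b1, b2):
    update(yt, yp)

micro, macro, weighted = streamed_f1()
print(micro, macro, weighted)  # ~0.769 0.767 0.771
```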
Here are 3 functions: one to compute and update the counts, another to compute the f1 scores from those counts, and metric_variable, which comes from Tensorflow’s core code and helps us define a Variable more easily when we know it’ll hold a metric.
In this section we’ll put the above functions to use. We’ll first generate some data and compute the f1 scores per batch of data.
Using the previously defined functions, running the following code would prove the implementation is valid!
Total, overall f1 scores: 0.665699032365699 0.6241802918567532 0.686824189759798
Streamed, batch-wise f1 scores: 0.665699032365699 0.6241802918567532 0.686824189759798
For reference, scikit f1 scores: 0.665699032365699 0.6241802918567531 0.6868241897597981