
# Streaming f1-score in Tensorflow: the multilabel setting

## Going further than tf.contrib

My day-to-day occupation involves multilabel text classification with Tensorflow. It’s quite new to me and not so common, so I stumble upon a variety of problems. I’m sharing my solutions with you these days, including “How to sample a multilabel dataset” and today’s post about how to compute the f1 score in Tensorflow.

The first section explains the difference between the single-label and multi-label cases; the second is about computing the multi-label f1 score from the predicted and target values; the third is about how to deal with batch-wise data and get an overall final score; and lastly I’ll share a piece of code proving it works!

TL;DR -> Check out the gist: https://gist.github.com/Vict0rSch/…

# Multi label f1 score

From Scikit-Learn:

The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is: `F1 = 2 * (precision * recall) / (precision + recall)`
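As a quick sanity check of the formula, here is a one-liner with made-up precision and recall values:

```python
# f1 is the harmonic mean of precision and recall:
# it is dragged toward the lower of the two values.
precision, recall = 0.5, 1.0
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 2 * 0.5 / 1.5 = 0.666...
```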

In a multi-label setting there are 3 main ways of extending this definition:

• Define the precision and recall globally, for all labels: it is the micro f1 score
  • for a `positive` prediction, if 10 labels are predicted correctly but 1 is missing, then it is a `False Positive`
• Define them per class and average:
  • if every class has the same importance, the f1 score is the mean of the per-class f1 scores: it is the macro f1 score
  • if each class should be weighted according to the number of samples with this class (the support), it is called the weighted f1 score
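Scikit-Learn exposes these three flavours through the `average` argument of `f1_score`. A small example with made-up data (4 samples, 3 labels):

```python
import numpy as np
from sklearn.metrics import f1_score

# made-up binary indicator matrices: rows are samples, columns are labels
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0], [1, 1, 0]])

for average in ("micro", "macro", "weighted"):
    print(average, f1_score(y_true, y_pred, average=average))
# micro    0.666... (global counts: TP=4, FP=2, FN=2)
# macro    0.5      (per-class f1 scores are 1, 0.5 and 0)
# weighted 0.666... (per-class supports are 3, 2 and 1)
```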

# Tensorflow implementation

This section is about implementing a multi-label f1 score in Tensorflow, in a similar way to Scikit-Learn. Most content originally comes from my answer on Stack Overflow.

I believe almost everything is straightforward except the `axis` argument in `tf.count_nonzero`. If something else is unclear, ask in the comments.

1. `axis = None` means that we count all non-zero entries in the matrix (`y_pred * y_true` for instance).
2. `axis = 0` means that we count non-zero entries per class, which you can see as summing over rows, yielding 1 f1 score per column.
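To illustrate, here is the same counting logic transcribed to NumPy, whose `count_nonzero` has the same `axis` semantics as Tensorflow’s. The `2*TP / (2*TP + FP + FN)` form used below is algebraically equal to `2 * precision * recall / (precision + recall)`, but avoids a 0/0 when a class has no true positive (function and variable names are mine, for illustration):

```python
import numpy as np

def f1_scores(y_true, y_pred):
    """Micro, macro and weighted f1 from 0/1 matrices of shape (samples, classes)."""
    y_true = y_true.astype(np.float64)
    y_pred = y_pred.astype(np.float64)

    def counts(axis):
        # axis=None -> scalar counts over the whole matrix (micro)
        # axis=0    -> one count per class, i.e. summed over rows (macro/weighted)
        tp = np.count_nonzero(y_pred * y_true, axis=axis)
        fp = np.count_nonzero(y_pred * (y_true - 1), axis=axis)
        fn = np.count_nonzero((y_pred - 1) * y_true, axis=axis)
        return tp, fp, fn

    def f1(tp, fp, fn):
        # same value as 2 * precision * recall / (precision + recall)
        return 2 * tp / (2 * tp + fp + fn)

    micro = f1(*counts(axis=None))
    per_class = f1(*counts(axis=0))          # one f1 score per column
    macro = per_class.mean()
    support = y_true.sum(axis=0)             # number of samples per class
    weighted = (per_class * support / support.sum()).sum()
    return micro, macro, weighted
```

Note that a class that never appears in `y_true` nor in `y_pred` would still yield a 0/0 in `per_class`; the counting itself is the point here.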

# Streaming Multilabel f1 score

Now what if you can’t compute `y_true` and `y_pred` for your whole dataset at once? You need a way to aggregate batch-wise results and only then compute the final f1 score. Obviously you can’t just sum up f1 scores across batches. But you can sum counts! So we’ll sum `TP`, `FP` and `FN` over batches and compute the f1 score from these sums.
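A tiny made-up illustration of why averaging per-batch f1 scores is wrong while summing counts is fine:

```python
def f1(tp, fp, fn):
    # equal to 2 * precision * recall / (precision + recall)
    return 2 * tp / (2 * tp + fp + fn)

# two hypothetical batches with their (TP, FP, FN) counts
b1 = (1, 0, 1)
b2 = (3, 1, 0)

mean_of_f1s = (f1(*b1) + f1(*b2)) / 2   # 0.7619...
f1_of_sums = f1(1 + 3, 0 + 1, 1 + 0)    # 0.8
print(mean_of_f1s, f1_of_sums)          # not the same number!
```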

Need a primer on Tensorflow metrics and how to update them? Here is one.

When you compute a streaming metric you need 2 things: the `Tensor` holding the value you’re looking for, and an `update_op` to feed new values to the metric, typically once per batch.

Here is an example, counting the sum of natural numbers to `100`:

```
Sum of ints to 0: 0
Sum of ints to 25: 325
Sum of ints to 50: 1275
Sum of ints to 75: 2850
Sum of ints to 100: 5050
Should be:  5050
```

There are therefore 3 steps:

1. Define the `Variables` which will hold the values of `TP`, `FP` and `FN`:
   a. we should have 2 of each, because the `micro` ones will be summed over all dimensions but the `macro` ones only over the first;
   b. we don’t need to define specific `TP`, `FP` and `FN` for the `weighted` f1 score, as it can be inferred from the `macro` f1 score provided we have the `weights`.
2. Define the `update_ops`
3. Define the final f1 scores from the `Variables`

## Functions

Here are 3 functions: one to define the metric `Variables`, one to compute and update the counts, and one to compute the f1 scores from the counts. The `metric_variable` function comes from Tensorflow’s core code and helps us define a `Variable` more easily, as we know it’ll hold a metric.

## Example

In this section we’ll put the above functions to use. We’ll first generate some data, then compute the f1 scores per batch of data.

# Correctness

Using the previously defined functions, running the following code would prove the implementation is valid!

```
Total, overall f1 scores:               0.665699032365699 0.6241802918567532 0.686824189759798

Streamed, batch-wise f1 scores:         0.665699032365699 0.6241802918567532 0.686824189759798

For reference, scikit f1 scores:        0.665699032365699 0.6241802918567531 0.6868241897597981
```

## Python file with everything

Check out the complete code here: https://gist.github.com/Vict0rSch/…