# Streaming f1-score in Tensorflow: the multilabel setting

## Going further than tf.contrib

My day-to-day occupation involves multilabel text classification with Tensorflow. It’s quite new to me and not so common, so I stumble upon a variety of problems. I’ve been sharing my solutions, including how to sample a multilabel dataset, and today’s post is about how to compute the f1 score in Tensorflow.

The first section will explain the *difference between the single and multi label cases*; the second will be about computing the *multi label f1 score* from the predicted and target values; the third will be about how to deal with *batch-wise data* and get an overall final score; and lastly I’ll share a piece of code proving *it works*!

**TL;DR** -> Check out the gist: https://gist.github.com/Vict0rSch/…

# Multi label f1 score

From Scikit-Learn:

> The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:
>
> `F1 = 2 * (precision * recall) / (precision + recall)`

In a multi-label setting there are 3 main ways of extending this definition:

- Define the precision and recall globally, for all labels: it is the **micro** f1 score
  - for a `positive` prediction, if 10 labels are predicted correctly but 1 is missing then it is a `False Positive`
- Define them per class and average:
  - if every class has the same importance, the f1 score is the mean of f1 scores per class: it is the **macro** f1 score
  - if each class should be weighted according to the number of samples with this class (the *support*), it is called the **weighted** f1 score
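To make the three averages concrete, here is a tiny worked example in plain Python (the 2×3 matrices are made up for illustration):

```python
# Hypothetical toy data: 2 samples, 3 classes, binary indicator format
y_true = [[1, 0, 1], [0, 1, 1]]
y_pred = [[1, 0, 0], [0, 1, 1]]

def counts(true_col, pred_col):
    """TP, FP, FN for one class (one column of the matrices)."""
    tp = sum(1 for t, p in zip(true_col, pred_col) if t and p)
    fp = sum(1 for t, p in zip(true_col, pred_col) if p and not t)
    fn = sum(1 for t, p in zip(true_col, pred_col) if t and not p)
    return tp, fp, fn

def f1(tp, fp, fn):
    # equivalent to 2 * precision * recall / (precision + recall)
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

per_class = [counts(t, p) for t, p in zip(zip(*y_true), zip(*y_pred))]

# micro: pool TP/FP/FN over all classes, then compute one f1
micro = f1(*(sum(c[i] for c in per_class) for i in range(3)))  # 6/7

# macro: compute one f1 per class, then average them
macro = sum(f1(*c) for c in per_class) / len(per_class)        # 8/9

# weighted: average per-class f1, weighted by class support
supports = [sum(col) for col in zip(*y_true)]
weighted = sum(f1(*c) * s for c, s in zip(per_class, supports)) / sum(supports)  # 5/6

print(micro, macro, weighted)
```

Note how the three averages already disagree on this tiny example: micro rewards every correctly predicted label equally, while macro treats the rare third class the same as the others.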

# Tensorflow implementation

This section is about implementing a multi-label f1 score in Tensorflow, in a similar way to Scikit-Learn. Most of the content originally comes from my answer on Stackoverflow.

I believe almost everything is straightforward except the `axis` argument in `tf.count_nonzero`. If something else is unclear, ask in the comments.

`axis = None` means that we count **all** non-zero locations in the matrix (`y_pred * y_true` for instance). `axis = 0` means that we count non-zero locations **per class**, which you could see as summing over rows, yielding one f1 score per column.
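As a concrete sketch of what this looks like (close in spirit to the Stackoverflow answer, but written with `tf.math.count_nonzero` so it runs eagerly under Tensorflow 2.x; `tf.math.divide_no_nan` is my addition to return 0 instead of `NaN` for empty classes):

```python
import tensorflow as tf

def tf_f1_score(y_true, y_pred):
    """Micro, macro and weighted f1 for binary indicator tensors
    of shape (n_samples, n_classes)."""
    y_true = tf.cast(y_true, tf.float64)
    y_pred = tf.cast(y_pred, tf.float64)

    def counts(axis):
        # axis=None pools the counts over the whole matrix (micro);
        # axis=0 keeps one count per class, i.e. sums over rows (macro / weighted)
        tp = tf.math.count_nonzero(y_pred * y_true, axis=axis, dtype=tf.float64)
        fp = tf.math.count_nonzero(y_pred * (y_true - 1), axis=axis, dtype=tf.float64)
        fn = tf.math.count_nonzero((y_pred - 1) * y_true, axis=axis, dtype=tf.float64)
        return tp, fp, fn

    def f1(tp, fp, fn):
        # same as 2 * precision * recall / (precision + recall)
        return tf.math.divide_no_nan(2 * tp, 2 * tp + fp + fn)

    micro = f1(*counts(axis=None))
    per_class = f1(*counts(axis=0))
    macro = tf.reduce_mean(per_class)
    weights = tf.reduce_sum(y_true, axis=0)  # class supports
    weighted = tf.reduce_sum(per_class * weights / tf.reduce_sum(weights))
    return micro, macro, weighted
```

The `y_pred * (y_true - 1)` trick is worth a second look: the product is non-zero exactly where the prediction is 1 and the target is 0, i.e. at the `False Positive` locations.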

# Streaming Multilabel f1 score

Now what if you can’t compute `y_true` and `y_pred` for your whole dataset? You need a way to aggregate these results and only *then* compute the final f1 score. Obviously you can’t just sum up f1 scores across batches. But you can sum counts! So we’ll sum `TP`, `FP` and `FN` over batches and compute the f1 score accordingly.

## Need a primer on Tensorflow metrics and updates to these?

When you compute a streaming metric you need 2 things: the `Tensor` holding the value you’re looking for, and an `update_op` to feed new values to the metric, typically once per batch.

Here is an example, counting the sum of natural numbers to `100`:
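Such an example could look like the following sketch (written with the `tf.compat.v1` API so it also runs under Tensorflow 2.x; the variable and op names are mine):

```python
import tensorflow.compat.v1 as tf  # on TF 1.x: `import tensorflow as tf`
tf.disable_eager_execution()

# the Tensor holding the value: a non-trainable accumulator Variable
total = tf.get_variable("total", shape=[], dtype=tf.int32,
                        initializer=tf.zeros_initializer(), trainable=False)
value = tf.placeholder(tf.int32, shape=[])
update_op = tf.assign_add(total, value)  # feeds a new value into the metric

logged = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(101):
        sess.run(update_op, feed_dict={value: i})  # one "batch" per integer
        if i % 25 == 0:
            logged.append(int(sess.run(total)))
            print("Sum of ints to {}: {}".format(i, logged[-1]))
    print("Should be: {}".format(100 * 101 // 2))
```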

```
Sum of ints to 0: 0
Sum of ints to 25: 325
Sum of ints to 50: 1275
Sum of ints to 75: 2850
Sum of ints to 100: 5050
Should be: 5050
```

There are therefore 3 steps:

- Define the `Variables` which will hold the values of `TP`, `FP` and `FN`:
  - we should have 2 of each because the `micro` ones will be summed over all dimensions but the `macro` ones only over the first
  - we don’t need to define specific `TP`, `FP` and `FN` for the `weighted` f1 score, as it can be inferred from the `macro` f1 score provided we have the `weights`
- Define the `update_ops`
- Define the final f1 scores from the `Variables`

## Functions

Here are 3 functions: one to define a metric `Variable`, one to compute and update the counts, and one to compute the f1 scores from the counts. The `metric_variable` function comes from Tensorflow’s core code and helps us define a `Variable` more easily, as we know it’ll hold a metric.
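The exact functions are in the gist linked at the end; the sketch below is a reconstruction following the 3 steps above (written with the `tf.compat.v1` API so it also runs under Tensorflow 2.x; the names `streaming_counts` and `streaming_f1` are mine):

```python
import tensorflow.compat.v1 as tf  # on TF 1.x: `import tensorflow as tf`
tf.disable_eager_execution()

def metric_variable(shape, dtype, name):
    """Non-trainable accumulator Variable, as in Tensorflow's core metrics code."""
    return tf.get_variable(
        name, shape=shape, dtype=dtype, initializer=tf.zeros_initializer(),
        trainable=False,
        collections=[tf.GraphKeys.LOCAL_VARIABLES, tf.GraphKeys.METRIC_VARIABLES])

def streaming_counts(y_true, y_pred, num_classes):
    """Step 1 and 2: Variables accumulating TP/FP/FN (pooled for micro,
    per class for macro), the class supports, and one grouped update op.
    Expects int64 binary indicator tensors of shape (batch, num_classes)."""
    micro = [metric_variable([], tf.int64, "micro_" + n) for n in ("tp", "fp", "fn")]
    macro = [metric_variable([num_classes], tf.int64, "macro_" + n)
             for n in ("tp", "fp", "fn")]
    weights = metric_variable([num_classes], tf.int64, "support")

    updates = []
    for axis, variables in ((None, micro), (0, macro)):
        tp = tf.math.count_nonzero(y_pred * y_true, axis=axis)
        fp = tf.math.count_nonzero(y_pred * (y_true - 1), axis=axis)
        fn = tf.math.count_nonzero((y_pred - 1) * y_true, axis=axis)
        updates += [tf.assign_add(v, c) for v, c in zip(variables, (tp, fp, fn))]
    updates.append(tf.assign_add(weights, tf.reduce_sum(y_true, axis=0)))
    return micro, macro, weights, tf.group(*updates)

def streaming_f1(micro, macro, weights):
    """Step 3: final micro/macro/weighted f1 scores from the accumulated counts."""
    def f1(tp, fp, fn):
        # equivalent to 2 * precision * recall / (precision + recall)
        return tf.math.divide_no_nan(2 * tp, 2 * tp + fp + fn)

    micro_f1 = f1(*[tf.cast(v, tf.float64) for v in micro])
    per_class = f1(*[tf.cast(v, tf.float64) for v in macro])
    macro_f1 = tf.reduce_mean(per_class)
    w = tf.cast(weights, tf.float64)
    weighted_f1 = tf.reduce_sum(per_class * w / tf.reduce_sum(w))
    return micro_f1, macro_f1, weighted_f1
```

In a training loop you would run the update op once per batch, then fetch the three scores once at the end of the epoch.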

## Example

In this section we’ll put the above functions to use. We’ll first generate some data and compute the f1 scores per batch of data.

# Correctness

Using the previously defined functions, running the full comparison code (included in the gist linked below) proves the implementation is valid!

```
Total, overall f1 scores: 0.665699032365699 0.6241802918567532 0.686824189759798
Streamed, batch-wise f1 scores: 0.665699032365699 0.6241802918567532 0.686824189759798
For reference, scikit f1 scores: 0.665699032365699 0.6241802918567531 0.6868241897597981
```
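The streamed and overall scores match exactly (not just approximately) because f1 depends on the data only through the aggregated `TP`, `FP` and `FN` counts, and summing these counts batch by batch yields the same totals as counting once over the whole dataset. A numpy-only sketch of that argument (random data, made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, (1000, 4))
y_pred = rng.randint(0, 2, (1000, 4))

def counts(t, p):
    # pooled TP/FP/FN over the whole array (the micro counts)
    return (t * p).sum(), (p * (1 - t)).sum(), (t * (1 - p)).sum()

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

# overall: count once over the full dataset
overall = f1(*counts(y_true, y_pred))

# streamed: sum the counts batch by batch, then compute f1 once at the end
tp = fp = fn = 0
for i in range(0, 1000, 100):
    btp, bfp, bfn = counts(y_true[i:i + 100], y_pred[i:i + 100])
    tp, fp, fn = tp + btp, fp + bfp, fn + bfn
streamed = f1(tp, fp, fn)

assert overall == streamed  # identical, not merely close
```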

## Python file with everything

Check out the complete code here: https://gist.github.com/Vict0rSch/…