# Lasagne Feedforward Tutorial

## Code walk-through - Jan. 2016

This section will walk you through the code of `feedforward_lasagne_mnist.py`, which I suggest you have open while reading. This tutorial is largely based on the Lasagne MNIST example. This official example is really well built and detailed, especially the comments in the code. The purpose here is to simplify the original code a little, make it similar to our Keras example, and understand in detail what is happening, when and why.

If you are not yet familiar with MNIST, please spend a couple of minutes reading about it first. It is basically a set of handwritten digit images of size 28*28 (= 784 pixels) in greyscale (0-255). There are 60,000 training examples and 10,000 test examples. The training examples can also be split into 50,000 training examples and 10,000 validation examples.

By the way, Lasagne’s documentation is really good: it is detailed and cites papers. The community is also quick to answer questions and help with implementation problems.

#### Lasagne Documentation

#### Lasagne’s Github

**/!\** Be aware that Lasagne relies heavily on Theano and that understanding it is **necessary** to be able to use Lasagne. Theano’s introduction is the minimum required, but knowing Theano in greater detail could be a good idea…

# Recognizing handwritten digits with Lasagne

## General organization

Lasagne is much more “*hands on*” than Keras. The Lasagne library is all about the **networks** (layers, optimizers, initializations and so on), and that’s it. You have to build everything else yourself, which is a big plus if you want control over your code. This also means concepts like callbacks are unnecessary, since the training loop is your own code.

First we **import** everything we’ll need (as usual). Then we define a **loading** function, `load_data()`, which we will not look at in detail since all that matters is that it returns the expected data.

Then we define two other helper functions: one to build the network itself (`build_mlp()`), the other to generate the mini-batches from the loaded data (`iterate_minibatches()`).

The main function is `run_network()`. It does everything you expect from it: load the data, build the model/network, compile the needed Theano functions, train the network and lastly test it.

As in the Keras example, the main function is wrapped in a `try/except` so that you can interrupt the training without losing everything.
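As a rough sketch of that interruption pattern, here is a toy loop that keeps its partial results when interrupted. The `train` function and its `interrupt_at` argument are hypothetical stand-ins invented for this illustration, not the script’s `run_network()`:

```python
def train(num_epochs, interrupt_at=None):
    """Hypothetical stand-in for a training loop; returns the losses seen so far."""
    history = []
    try:
        for epoch in range(num_epochs):
            if epoch == interrupt_at:
                # Simulates the user pressing Ctrl-C during this epoch.
                raise KeyboardInterrupt
            history.append(1.0 / (epoch + 1))  # fake, decreasing "loss"
    except KeyboardInterrupt:
        print("Interrupted after {} epochs, keeping partial results".format(len(history)))
    # Reached in both cases, so whatever was computed is still returned.
    return history


full = train(5)
partial = train(5, interrupt_at=3)
```

Because the `return` sits after the `try/except`, the caller gets the history whether or not the loop ran to completion.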

## Imports

- `sys`, `os`, `time` and `numpy` do not need explanations.
- We import `theano` and `theano.tensor` because we’ll use Theano variables and a few of its built-in functions.
- Then, we import the `lasagne` library as a whole.
- `rmsprop` is the optimizer we’ll use, just like in the Keras example. We use it mainly because it is one of the algorithms that scale the learning rate according to the gradient. To learn more, see G. Hinton’s explanatory video and the accompanying slides.
- Just like in Keras, `layers` are the core of the networks. Here we’ll only use `Dense` and `Dropout` layers. The `InputLayer` is a specific `layer` that takes in the data to be forwarded in the network.
- Again, we’ll use the `softmax` and rectified linear unit (`rectify`) activation functions.
- Last but not least, the cost/loss/objective function is a `categorical_crossentropy`.

## Loading the data

We will not get into the details of this function, since the only important thing to understand is what it returns. You could load the data another way if you do not want to re-download the MNIST dataset. For instance, you could use the copy you downloaded for the Keras example.

`load_data()` returns numpy `ndarrays`, with `numpy.float32` values for the inputs.

For the inputs (`X`), the dimensions are as follows: `(nb_of_examples, nb_of_channels, image_first_dimension, image_second_dimension)`. This means that if you had colored images in `rgb`, you’d have a `3` instead of a `1` in `nb_of_channels`. Also, if we reshaped the images like in the Keras example to have vector-like inputs, we’d have `784, 1` instead of `28, 28` as image dimensions.

The targets are `ndarrays` with one dimension, filled with the labels as `numpy.uint8` values.
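To make these shapes and types concrete, here is a small sketch on synthetic data (12 random images instead of the real dataset). The rescaling to `[0, 1)` and the way the validation examples are held out are assumptions borrowed from the official Lasagne example; the actual loading function may differ.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic stand-in for raw MNIST: 12 greyscale images with values in 0-255.
raw_images = rng.randint(0, 256, size=(12, 28, 28)).astype(np.uint8)
raw_labels = rng.randint(0, 10, size=12).astype(np.uint8)

# Add the channel axis and rescale to float32 (assumed preprocessing).
X = raw_images.reshape(-1, 1, 28, 28).astype(np.float32) / np.float32(256)
y = raw_labels

# Hold out the last examples for validation
# (this would be 10,000 out of 60,000 on the real training set).
X_train, X_val = X[:-2], X[-2:]
y_train, y_val = y[:-2], y[-2:]

print(X_train.shape, X_train.dtype)  # 4 dimensions, float32
print(y_train.shape, y_train.dtype)  # 1 dimension, uint8

# Vector-like inputs, as in the Keras example: `784, 1` instead of `28, 28`.
X_vec = X.reshape(-1, 1, 784, 1)
```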

## Creating the network

Here we stack layers to build a network. Each `layer` takes the previous `layer` as argument. This is how Theano works: one step at a time, we define how variables depend on each other. Basically we say: the input layer will be modified as follows by the first hidden layer, the next layer will do the same, and so on. So the whole network is contained in the `l_out` object, which is an instance of `lasagne.layers.dense.DenseLayer` and, through its chain of incoming layers, describes a Theano expression that depends only on `input_var`.

**To summarize**, this function takes a Theano Variable as input and says how the **forward** pass in our network affects this variable.

The network in question is as follows:

- The `InputLayer` expects 4-dimensional inputs with shape `(None, 1, 28, 28)`. The `None` means the number of examples to pass forward is not fixed, so the network can take any batch size.
- The first hidden layer has 500 units, a rectified linear unit activation function and 40% dropout (`l_hid1_drop`). The weights are initialized according to the `GlorotUniform()` distribution (which is the default); in Lasagne the biases default to zeros.
- The second hidden layer has 300 units, a rectified linear unit activation function, 40% dropout and the same initialization.
- The output layer has 10 units (because we have 10 categories / labels in MNIST), no dropout (of course…) and a softmax activation function to output a probability. `softmax` output + `categorical_crossentropy` is standard for multiclass classification.
- This 500-300-10 structure comes from Y. LeCun’s website, citing G. Hinton’s unpublished work.
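To see what this stack computes numerically, here is a plain NumPy sketch of the forward pass. It is not the Lasagne code from the script: the helper names (`glorot_uniform`, `forward`, and so on) are made up for this illustration, and the inverted-dropout rescaling is an assumption about how the dropout layers behave.

```python
import numpy as np

rng = np.random.RandomState(42)


def glorot_uniform(n_in, n_out):
    # Glorot/Xavier uniform: U(-a, a) with a = sqrt(6 / (n_in + n_out)).
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out)).astype(np.float32)


def rectify(x):
    return np.maximum(x, 0)


def softmax(x):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def dropout(x, p, deterministic):
    # Inverted dropout: zero a fraction p of the units, rescale the survivors.
    if deterministic or p == 0:
        return x
    mask = rng.binomial(1, 1 - p, size=x.shape)
    return x * mask / (1 - p)


sizes = [784, 500, 300, 10]
# One (weights, biases) pair per dense layer; biases start at zero.
params = [(glorot_uniform(n_in, n_out), np.zeros(n_out, dtype=np.float32))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(batch, deterministic=True):
    h = batch.reshape(len(batch), -1)  # (N, 1, 28, 28) -> (N, 784)
    for W, b in params[:-1]:
        h = dropout(rectify(h.dot(W) + b), p=0.4, deterministic=deterministic)
    W, b = params[-1]
    return softmax(h.dot(W) + b)  # no dropout on the output layer


batch = rng.rand(5, 1, 28, 28).astype(np.float32)
probs = forward(batch)  # one row of 10 class probabilities per example
n_params = sum(W.size + b.size for W, b in params)
```

Each row of `probs` sums to 1, and `n_params` counts the 3 weight matrices and 3 bias vectors that training will later update.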

## Throwing one batch at a time

Again, we won’t dive into the Python code since it’s just a helper function; rather, we’ll look at what it does.

This function takes data (`input` and `target`) as input and generates (random) subsets of this data (of length `batchsize`). The point here is to iterate over the datasets without reloading them in memory each time we start a new batch. If they are new to you, make sure you understand Python’s `yield` and generators.

These batches are what the model/network learns from or is evaluated on (to train, validate or test it).
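For reference, a generator of this kind can be sketched as follows. This version is modeled on the helper in the official Lasagne MNIST example, so treat the exact signature (in particular the `shuffle` flag) as an assumption; the script’s own helper may differ in details.

```python
import numpy as np


def iterate_minibatches(inputs, targets, batchsize, shuffle=False):
    """Yield (inputs, targets) batches; a trailing partial batch is dropped."""
    assert len(inputs) == len(targets)
    indices = np.arange(len(inputs))
    if shuffle:
        np.random.shuffle(indices)
    for start in range(0, len(inputs) - batchsize + 1, batchsize):
        excerpt = indices[start:start + batchsize]
        # `yield` hands back one batch and pauses here until the next one is requested.
        yield inputs[excerpt], targets[excerpt]


X = np.zeros((10, 1, 28, 28), dtype=np.float32)
y = np.arange(10, dtype=np.uint8)

batches = list(iterate_minibatches(X, y, batchsize=4, shuffle=True))
```

With 10 examples and a batch size of 4, only two full batches are produced; the two leftover examples are skipped for that pass.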

## Running the network

This is the core of our training, the function we’ll call to actually train a network. It first loads the data and builds the network, then it defines the Theano expressions we’ll need to train (mainly the train and test losses, the updates and the accuracy calculation) before compiling them. Then we switch to the ‘numerical’ applications by iterating over our training and validation data `num_epoch` times. Finally, we evaluate the network on the test data.

#### Data

The validation phase is often split into two parts:

- In the first part, you just look at your models and select the best-performing approach using the validation data (= validation).
- Then you estimate the accuracy of the selected approach (= test).

Another way to see it is that you use the *validation* data to check that your network’s parameters don’t overfit your training data. Then, the *test* data is used to check that you have not overfitted your hyperparameters to the validation data.

Because you may not want to reload the whole dataset each time you modify your network, you can optionally pass data as an argument to `run_network()`.

#### Theano variables: creating the network and the losses

There is a lot going on here, so we’ll go line by line.

Lines 2 and 3: we create the Theano variables that will be propagated into the network.

Line 6: we build the network from the `input_var` Theano variable. As stated before, `network` is an instance of `lasagne.layers.dense.DenseLayer` stating how the forward pass into our network affects `input_var`.

Line 8: we get the Theano variable generated by `network` from `input_var`. It is an instance of `theano.tensor.var.TensorVariable`.

Lines 9 and 10: we define the loss. Again, be aware that we are still talking “*literally*”: at this point no number is involved. What happens is that we express how the loss depends on `prediction` and `target_var`.

Lines 12 to 15: the same thing happens, except this time there is a parameter `deterministic=True`, which basically means no dropout because we are testing our network, not training it.

Line 16: we define the accuracy of our network on the validation data. Within the `mean`, we count the number of times the right number is predicted.
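To put numbers on the loss and the accuracy, here is a NumPy sketch of what those symbolic expressions compute once actual values flow through them. It is only an illustration on a made-up batch of 4 examples and 3 classes, not the Theano code from the script.

```python
import numpy as np


def categorical_crossentropy(probs, targets):
    # Per-example loss: minus the log-probability given to the true class.
    return -np.log(probs[np.arange(len(targets)), targets])


def accuracy(probs, targets):
    # Fraction of examples whose most probable class is the true one.
    return np.mean(np.argmax(probs, axis=1) == targets)


# Made-up softmax outputs (one row per example) and integer labels.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.5, 0.25, 0.25]])
targets = np.array([0, 1, 0, 1])

loss = categorical_crossentropy(probs, targets).mean()
acc = accuracy(probs, targets)  # 2 of the 4 predictions are right
```

The first two examples are classified correctly and contribute small losses; the last two are wrong, and their low probability on the true class dominates the mean loss.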

#### Compiling the graph : Theano functions

Here we need to look at a (slightly) bigger picture. The point of *training* a network is to forward examples, evaluate the cost function and then update the weights and biases according to an update algorithm (`rmsprop` here).

This is what the Theano function `train_fn` (line 5) does: given the input (`input_var`) and its target (`target_var`), evaluate the cost function and then update the weights and biases accordingly.

The updates are defined on lines 1 and 2 and triggered in the Theano function (`updates=updates`). First we get all the network’s parameters that can be trained, that is to say the weights and biases. In our case, it will be a list of 3 weight and 3 bias shared variables. Then `rmsprop` builds the update expressions from the training loss and these parameters. Read up on shared variables if they are not clear to you (see also Quora).
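To give an intuition of what such an update does to each parameter, here is a NumPy sketch of one common formulation of the RMSProp rule, applied to a toy problem. The function name `rmsprop_step` and the hyperparameter values are illustrative assumptions, and Lasagne’s own implementation may differ in details.

```python
import numpy as np


def rmsprop_step(param, grad, accu, learning_rate=0.01, rho=0.9, epsilon=1e-6):
    # Running average of squared gradients, one entry per parameter.
    accu = rho * accu + (1 - rho) * grad ** 2
    # Each parameter's step is divided by the root of its own running average.
    param = param - learning_rate * grad / np.sqrt(accu + epsilon)
    return param, accu


# Toy problem: minimise f(w) = sum(scale * w**2), whose gradient is 2 * scale * w.
scale = np.array([1.0, 100.0])  # the two gradients differ by a factor of 100
w = np.array([1.0, 1.0])
accu = np.zeros_like(w)

for _ in range(50):
    grad = 2 * scale * w
    w, accu = rmsprop_step(w, grad, accu)
```

Even though one gradient is 100 times larger than the other, both coordinates of `w` move towards zero at almost the same pace: this is the sense in which the learning rate is scaled according to the gradient.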

The `val_fn`, on the other hand, only computes the loss and accuracy on the data it is given, which can therefore be either the validation or the test data.

When we declare those Theano functions, the graph linking variables and expressions through operations is compiled, which can take some time.

#### Actual training in the epoch loop

For each epoch, we train over the whole training data and evaluate the training loss. Then we go over the validation data and evaluate both the validation loss and the validation accuracy.

What happens is that we get a batch of examples, which we divide into `inputs` and `targets`. We give these numerical inputs to the associated Theano function (`train_fn` or `val_fn`), which computes the associated results.

Everything else is about averaging the losses and accuracies over the number of batches fed to the network.

We can see here that you are completely free to do *whatever* you want during training, since you have access to both the epoch and batch loops.
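The overall shape of such a loop can be sketched as follows. To keep the example runnable without Theano, `train_fn` and `val_fn` are hypothetical stand-ins that return fake values, and the batch generator is a minimal unshuffled one; only the accumulate-then-average structure is the point here.

```python
import numpy as np


def iterate_minibatches(inputs, targets, batchsize):
    # Minimal, unshuffled batch generator used only for this sketch.
    for start in range(0, len(inputs) - batchsize + 1, batchsize):
        yield inputs[start:start + batchsize], targets[start:start + batchsize]


def train_fn(inputs, targets):
    # Hypothetical stand-in for the compiled Theano function: returns a fake loss.
    return float(np.mean(inputs))


def val_fn(inputs, targets):
    # Hypothetical stand-in: returns a fake (loss, accuracy) pair.
    return float(np.mean(inputs)), float(np.mean(targets == 1))


X_train, y_train = np.ones((8, 4)), np.ones(8, dtype=np.uint8)
X_val, y_val = np.full((4, 4), 0.5), np.array([1, 0, 1, 1], dtype=np.uint8)

history = []
for epoch in range(3):
    train_err, train_batches = 0.0, 0
    for inputs, targets in iterate_minibatches(X_train, y_train, batchsize=4):
        train_err += train_fn(inputs, targets)
        train_batches += 1

    val_err, val_acc, val_batches = 0.0, 0.0, 0
    for inputs, targets in iterate_minibatches(X_val, y_val, batchsize=2):
        err, acc = val_fn(inputs, targets)
        val_err += err
        val_acc += acc
        val_batches += 1

    # Average over the number of batches seen in this epoch.
    history.append((train_err / train_batches,
                    val_err / val_batches,
                    val_acc / val_batches))
    print("epoch {}: train loss {:.3f}, val loss {:.3f}, val acc {:.2%}".format(
        epoch + 1, *history[-1]))
```

Since both loops are plain Python, anything you would do with a callback elsewhere (logging, early stopping, saving parameters) can simply be written inside them.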

#### Test and return the network

With everything we’ve seen so far, this part is a piece of cake. We simply test the network by feeding `val_fn` with the test data instead of the validation data.

Finally, we print the relevant quantities and return the network (which is, again, an instance of `lasagne.layers.dense.DenseLayer`).

As an exercise (very easy…), you could try to implement the LossHistory callback from the Keras example.

A more difficult exercise is to modify the code so as to be able to retrain a network (passing `network=None` as a parameter to `run_network()` is the easiest part).

## Usage

If you do not want to reload the data every time, load it once and pass it to `run_network()`.

Using an Intel i7 CPU at 3.5GHz and an NVidia GTX 970 GPU, we achieve 0.9829 accuracy (1.71% error) in 32.2 seconds of training using this implementation (including loading and compilation).

## Quick Exercise

Ok, now you’ve seen how Lasagne uses Theano. To make sure you’ve got the concepts as a whole, here is a little exercise. Say I give you the last layer of a network and an example. How would you predict the associated number using this already trained network?

For instance, try writing a `get_class()` function that does this.