This section will walk you through the code of
feedforward_lasagne_mnist.py (here), which I suggest you have open while reading. This tutorial is widely based on the Lasagne mnist example. This official example is really well built and detailed, especially the comments in the code. The purpose here is to simplify a little bit the original code, make it similar to our Keras example and understand in details what is happenning, when and why.
If you are not yet familiar with what mnist is, please spend a couple minutes there. It is basically a set of hadwritten digit images of size 28*28 (= 784) in greyscale (0-255). There are 60,000 training examples and 10,000 testing examples. The training examples could be also split into 50,000 training examples and 10,000 validation examples.
By the way, Lasagne’s documentation is really good, detailed and cites papers. Also the community answers fast to questions or implementation problems.
/!\ Be aware that Lasagne relies heavily on Theano and that understanding it is necessary to be able to use Lasagne. The introduction is the minimum required but knowing Theano in greater details could be a good idea…
Recognizing handwritten digits with Lasagne
Table of Contents
Lasagne is much more “hands on” than Keras. This means the Lasagne Library is all about the networks (layers, optimizers, initializations and so on) but that’s it. You have to build everything else yourself, which is a big plus if you want control over your code. This also means concepts like callbacks are useless since you have an open training code.
First we import everything we’ll need (as usual). Then we define a loading function
load_data() which we will not look at in details since all that matters is that it returns the expected data.
Then we define two other helper functions: one to build the network itself (
build_mlp()), the other to generate the mini-batches from the loaded data (
The main function is
run_network(). It does everything you expect from it: load the data, build the model/network, compile the needed Theano functions, train the network and lastly test it.
As in the Keras example the main function is within a
try/except so that you can interrupt the training without losing everything.
numpydo not need explanations.
- We import
theano.tensorbecause we’ll use Theano variables and a few of it’s built-in functions.
- Then, we import the
lasagnelibrary as a whole
rmspropis the optimizer we’ll use, just like in the Keras example. We use it mainly because it is one of the algorithm that scale the learning rate according to the gradient. To learn more see here G. Hinton’s explanatory video and there the slides
- Just like in Keras,
layersare the core of the networks. Here we’ll only use
InputLayeris a specific
layerthat takes in the data to be forwarded in the network.
- Again, we’ll use the
softmaxand rectified linear unit (
rectify) activation functions
- Last but not least, the cost/loss/objective function is a
Loading the data
We will not get into the details of this function, since the only important thing to understand is what it returns. You could load the data another way if you do not want to re-download the mnist dataset. For instance you could use the one you downloaded doing the Keras example.
loading_data() returns numpy
numpy.float32 values with shapes:
For the inputs (
X), the dimensions are as follows :
(nb_of_examples, nb_of_channels, image_first_dimension, image_second_dimension). This means if you had colored images in
rgb you’d have a
3 instead of a
1 in the
number_of_channels. Also if we reshaped the images like in the Keras example to have vector-like inputs, we’d have
784, 1 instead of
28, 28 as image dimension.
The targets are
ndarrays with one dimension, filled with the labels as
Creating the network
Here we stack layers to build a network. Each
layer takes as argument the previous
layer. This is how Theano works: one step at a time, we define how variables depend on each other. Basically we say: the input layer will be modified as follows by the first hidden layer. The next layer will do the same etc. So the whole network is contained in the
l_out object, which is an instance of
lasagne.layers.dense.DenseLayer and is basically a Theano expression that depends only on the
To summarize, this function takes a Theano Variable as input and says how the forward pass in our network affects this variable.
The network in question is as follows:
InputLayerexpects 4-dimentional inputs with shapes
(None, 1, 28 ,28). The
Nonemeans the number of example to pass forward is not fixed and the network is can take any batch size.
- The first hidden layer is has 500 units, rectified linear unit activation function and 40% of dropout (
l_hid1_drop). Weights and Biases are initialized according to the
GlorotUniform()distribution (which is default).
- The second hidden layer has 300 units, rectified linear unit activation function and 40% of dropout and same initialization.
- The output layer has 10 units (because we have 10 categories / labels in mnist), no dropout (of course…) and a softmax activation function to output a probability.
categorical_crossentropyis standard for multiclass classification.
- This structure 500-300-10 comes from Y. LeCun’s website citing G. Hinton’s unpublished work
Throwing one batch at a time
Again, we won’t dive into the Python code since it’s just a helper function, rather we’ll look at what it does.
This function takes data (
target) as input and generates (random) subsets of this data (of length
batchsize). The point here is to iterate over the datasets without reloading them in memory each time we start with a new batch. Understand python’s
yield and generators.
The point here is to generate batches to learn from (either to train, validate or test the model/network).
Running the network
This is the core of our training, the function we’ll call to effectively train a network. It first loads the data and builds the network, then it defines the Theano expressions we’ll need to train (mainly train and test losses, the updates and the accuracy calculation) before compiling them. Then we switch to the ‘numerical’ applications by iterating over our training and validation data
num_epoch times. Finally we evaluate the network on the test data.
The validation phase is often split into two parts:
In the first part you just look at your models and select the best performing approach using the validation data (=validation)
Then you estimate the accuracy of the selected approach (=test).
Another way to see it is that you use the validation data to check that your network’s parameters don’t overfit your training data. Then, the test data is used to check that you have not overfitted your hyper parameters to the validation data.
Because you may not want to reload the whole dataset each time you modify your network, you can optionnaly pass data as an argument to
Theano variables: creating the network and the losses
There is a lot going on here so we’ll go line by line.
Lines 2 and 3 we create the Theano variables that will be propagated into the network.
Line 6, we build the network from the
input_var Theano variable. As stated before network is an instance of
lasagne.layers.dense.DenseLayer stating how the forward pass into our network affects
Line 8 we get the Theano variable generated by
input_var. It is an instance of
Line 9 and 10 we evaluate the loss. Again, be aware we are still talking “literally “, at this point no number is involved. What happends is we compute how the loss depends on
Lines 12 to 15, the same thing happens except this time there is a parameter
deterministic=True which basically means no dropout because we are testing our network, not training it.
Line 16 we evaluate the accuracy of our network on the validation data. Within the
mean we count the number of times the right number is predicted.
Compiling the graph : Theano functions
Here we need to look at a (slightly) bigger picture. The point of training a network is to forward examples, evaluate the cost function and then update the weights and biases according to an aupdate algorithm (
This is what the Theano function
train_fn line 5 does: given the input (
input_var) and its target (
target_var), evaluate the cost function and then update the weights and biases accordingly.
The updates are defined lines 1 and 2 and triggered in the Theano function (
First we get all the networks parameters that can be trained, that is to say the weights and biases. In our case, it will be a list of 3 weights and 3 biases shared variables. Dig into it if you’re not clear with shared variables (see also Quora).
val_fn on the other hand only computes the loss and accuracy of the data it is given. It can therefore be the validation or the test data.
When we declare those Theano functions, the graph linking variables and expressions through operations is computed, which could take some time.
Actual training in the epoch loop
For each epoch we train over the whole training data and evaluate the training loss. Then we go over the validation data and evaluate both the validation loss and validation accuracy.
What happens is we get a batch of examples which we divide into
targets. We give these numerical inputs to the associated Theano function (
val_fn) that computes the associated results.
Everything else is about averaging the losses and accuracies regarding the number of batches fed to the network.
We can see here that you are completely free of doing whatever you want during the training easily since you have access to both the epoch and batch loops.
Test and return the network
With everything we’ve seen so far, this part is a piece of cake. We simply test the network feeding
val_fn with the test data and not the validation data.
Finally we print the relevant quantities and return the network (which is, again, an instance of
As an exercise (very easy…) you could try to implement the LossHistory callback from the Keras example.
A more difficult example is to modify the code so as to be able to retrain a network (passing
network=None as parameters to
run_network() is the easiest part).
if you do not want to reload the data every time:
Using an Intel i7 CPU at 3.5GHz and an NVidia GTX 970 GPU, we achieve 0.9829 accuracy (1.71% error) in 32.2 seconds of training using this implementation (including loading and compilation).
Ok, now you’ve seen how Lasagne uses Theano. To make sure you’ve got the concepts as a whole here is a little exercise. Say I give you the last layer of a network and an example. How would you predict the associated number using this already trained network?
For instance write the function
get_class() here :