## Train a Convolutional Neural Network as a Classifier

This tutorial deals with training a classifier using convolutional neural networks.

## Introduction

In this tutorial we teach you how to implement a simple neural network image classifier using Convolutional Neural Networks (CNNs). The main goal of this post is to show how to train a CNN classifier using TensorFlow, the deep learning framework developed by Google. Deep learning concepts such as the details of CNNs will not be discussed here; in order to get a better idea of convolutional layers and realize how they work, please refer to this post. In the next section we start to describe the procedure of training the classifier.

## Input Pipeline

The dataset we work with in this tutorial is MNIST, probably the most famous dataset in computer vision because of its simplicity! The main dataset consists of 60,000 training and 10,000 test images. However, there can be different setups for these images. The one we use keeps the test set the same but splits the training set into 55,000 training images and 5,000 validation images, in case using cross-validation for determining some hyper-parameters is desired. The images are 28x28x1, each representing a hand-written digit from 0 to 9. Since this tutorial is supposed to be ready-to-use, we provide the code to download and extract the MNIST data as a data object. Thanks to TensorFlow, this code is already written and ready to use, and its source code is available at this repository. The code for downloading and extracting the MNIST dataset is as below:


```python
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

# Download and extract MNIST; keep images as 28x28x1 and labels as integers.
mnist = input_data.read_data_sets("MNIST_data/", reshape=False, one_hot=False)

# The 'input.provide_data' helper (from the accompanying repository) is
# provided to organize any custom dataset into the structured data object.
data = input.provide_data(mnist)
```



The above code downloads and extracts the MNIST data into the MNIST_data/ folder in the directory where the python script is running. The reshape flag is set to False because we want the image format left as it is, i.e., 28x28x1, since we aim to train a CNN classifier that takes images as input. If the one_hot flag is set to True, the class labels are returned as one-hot vectors; however, we set the one_hot flag to False for customized preprocessing and data organization. The input.provide_data function is provided to take any data with this specific format, separated into training and testing sets, and return a structured data object for further processing. From now on we refer to data as the data object, which has the following structure:

Each of the train, validation and test attributes has images and labels sub-attributes; they have just not been depicted, for the simplicity of the above chart. As an example, if data.train.images is called, its shape is [number_of_training_samples, 28, 28, 1]. It is recommended to play around a little bit with the data object to get a better idea of how it works and what its output is. The code is available in the GitHub repository for this post.
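Since one_hot is set to False, data.*.labels are plain integer class labels. When one-hot vectors are needed later, a conversion like the following can be used (a minimal NumPy sketch; the dense_to_one_hot name is ours, not from the repository):

```python
import numpy as np

def dense_to_one_hot(labels, num_classes=10):
    """Convert integer labels of shape [N] to one-hot vectors [N, num_classes]."""
    one_hot = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    one_hot[np.arange(labels.shape[0]), labels] = 1.0
    return one_hot

labels = np.array([3, 0, 9])
print(dense_to_one_hot(labels).argmax(axis=1))  # recovers the original labels
```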

## Network Architecture

After explaining the data input pipeline, it's now time to go through the neural network architecture used in this tutorial. The implemented architecture is very similar to LeNet, although ours is implemented in a fully-convolutional fashion, i.e., there is no fully-connected layer and all fully-connected layers are transformed into corresponding convolutional layers. In order to get a better idea of how to go from a fully-connected layer to a convolutional one and vice versa, please refer to this link. The general architecture schematic is as below:

Figure 1: The general architecture of the network.

The image is produced by Tensorboard, a visualization tool for TensorFlow. Later in this tutorial, how to use Tensorboard and make the most of it will be explained. As can be seen in the figure, the convolutional layers are followed by pooling layers, and the last fully-connected layer is followed by a dropout layer to decrease overfitting. Dropout is only applied in the training phase. The code for designing the architecture is as below:

Network Architecture

```python
import tensorflow as tf
slim = tf.contrib.slim

def net_architecture(images, num_classes=10, is_training=False,
                     dropout_keep_prob=0.5,
                     spatial_squeeze=True,
                     scope='Net'):

    # Create empty dictionary
    end_points = {}

    with tf.variable_scope(scope, 'Net', [images, num_classes]) as sc:
        end_points_collection = sc.name + '_end_points'

        # Collect outputs for conv2d and max_pool2d.
        with tf.contrib.framework.arg_scope([tf.contrib.layers.conv2d, tf.contrib.layers.max_pool2d],
                                            outputs_collections=end_points_collection):

            # Layer-1
            net = tf.contrib.layers.conv2d(images, 32, [5, 5], scope='conv1')
            net = tf.contrib.layers.max_pool2d(net, [2, 2], 2, scope='pool1')

            # Layer-2
            net = tf.contrib.layers.conv2d(net, 64, [5, 5], scope='conv2')
            net = tf.contrib.layers.max_pool2d(net, [2, 2], 2, scope='pool2')

            # Layer-3
            net = tf.contrib.layers.conv2d(net, 1024, [7, 7], padding='VALID', scope='fc3')
            net = tf.contrib.layers.dropout(net, dropout_keep_prob, is_training=is_training,
                                            scope='dropout3')

            # Last layer which is the logits for classes
            logits = tf.contrib.layers.conv2d(net, num_classes, [1, 1], activation_fn=None, scope='fc4')

            # Return the collections as a dictionary
            end_points = slim.utils.convert_collection_to_dict(end_points_collection)

            # Squeeze spatially to eliminate extra dimensions.
            if spatial_squeeze:
                logits = tf.squeeze(logits, [1, 2], name='fc4/squeezed')
                end_points[sc.name + '/fc4'] = logits
    return logits, end_points


def net_arg_scope(weight_decay=0.0005):
    """Defines the default network argument scope."""
    with tf.contrib.framework.arg_scope(
            [tf.contrib.layers.conv2d],
            weights_regularizer=slim.l2_regularizer(weight_decay),
            weights_initializer=tf.contrib.layers.variance_scaling_initializer(factor=1.0, mode='FAN_AVG',
                                                                               uniform=False, seed=None,
                                                                               dtype=tf.float32),
            activation_fn=tf.nn.relu) as sc:
        return sc
```



### Default Parameters and Operations

The function net_arg_scope is defined to share some attributes between layers. It is very useful in cases where some attributes, like 'SAME' padding (which is zero-padding in essence), are shared between different layers. Basically it enables us to specify different operations and/or a set of default arguments to be passed to any of the operations defined in the arg_scope. In this specific case tf.contrib.layers.conv2d is listed, so all convolutional layers take the default parameters set by the arg_scope. There is more to using this handy arg_scope operation, and it will be explained among the general TensorFlow implementation details later in this tutorial. It is worth noting that any parameter defined by arg_scope can be overridden locally in a specific layer definition. As an example, take a look at the 'fc3' layer: while defining that tf.contrib.layers.conv2d (convolutional) layer, the padding is set to 'VALID' although its default has been set to 'SAME' by the arg_scope operation. Now it's time to explain the architecture itself by describing how to create convolutional and pooling layers.

ReLU is used as the non-linear activation function for all layers except the last one (the embedding layer). Instead of the famous Xavier initialization, the variance-scaling initializer has been used, which provided more promising results when using ReLU activations. Its advantage is keeping the scale of the input variance constant, so it is claimed that the variance neither explodes nor diminishes on the way to the final layer [reference]. There are different types of variance-scaling initializers; the one we use is related to the one proposed in the paper Understanding the difficulty of training deep feedforward neural networks and is provided by TensorFlow.
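For intuition, the 'FAN_AVG' variance-scaling rule with factor=1.0 draws weights with variance factor / ((fan_in + fan_out) / 2). A simplified NumPy sketch of the idea (TensorFlow's initializer additionally uses a truncated normal with a correction factor, which is omitted here):

```python
import numpy as np

def variance_scaling_init(shape, factor=1.0, seed=0):
    """Simplified 'FAN_AVG' draw: Var(w) = factor / ((fan_in + fan_out) / 2)."""
    fan_in, fan_out = shape
    n = (fan_in + fan_out) / 2.0
    rng = np.random.RandomState(seed)
    return rng.normal(0.0, np.sqrt(factor / n), size=shape)

w = variance_scaling_init((784, 1024))
print(w.var())  # close to 1 / ((784 + 1024) / 2)
```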

### Convolution and Pooling Layers

Now it's time to build our convolutional architecture using the convolution and pooling layers defined in the net_architecture function of the above script. It is worth noting that since the output tensors of the layers differ in size, and the spatial sizes decrease gradually with the depth of the network, the matching between the inputs and outputs of consecutive layers must be considered; in the end, the output of the last layer should form a feature vector in order to be fed to the embedding layer.
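The size bookkeeping can be checked with the standard output-size formulas: a 'SAME' convolution with stride 1 preserves the spatial size, while a 'VALID' operation with kernel k and stride s gives floor((in - k) / s) + 1. Tracing a 28x28 MNIST input through the architecture (an illustrative check, not repository code):

```python
def same_out(size, stride=1):
    """'SAME' padding: output size = ceil(input / stride)."""
    return -(-size // stride)

def valid_out(size, kernel, stride=1):
    """'VALID' padding: output size = floor((input - kernel) / stride) + 1."""
    return (size - kernel) // stride + 1

s = 28
s = same_out(s)         # conv1: 5x5, 'SAME', stride 1 -> 28
s = valid_out(s, 2, 2)  # pool1: 2x2, stride 2         -> 14
s = same_out(s)         # conv2: 5x5, 'SAME', stride 1 -> 14
s = valid_out(s, 2, 2)  # pool2: 2x2, stride 2         -> 7
s = valid_out(s, 7)     # fc3:   7x7, 'VALID'          -> 1
print(s)  # 1: a single spatial position, i.e. a feature vector per image
```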

Defining pooling layers is straightforward, as shown by the max_pool2d calls ('pool1' and 'pool2'). Each defined pooling layer has a kernel size of 2x2 and a stride of 2 in each dimension. This is equivalent to extracting the maximum of each 2x2 window, and the stride ensures the windows chosen for the max-pooling operation do not overlap. In order to have a better understanding of pooling layers please refer to this link.
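The 2x2, stride-2 operation can be sketched in NumPy for a single feature map (illustrative only):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a [height, width] map (no overlap)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 10, 13, 14],
              [11, 12, 15, 16]])
print(max_pool_2x2(x))
# [[ 4  8]
#  [12 16]]
```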

Convolution layers can be defined via tf.contrib.layers, as in the 'conv1' layer. The default padding is set to 'SAME' as mentioned before. Loosely speaking, 'SAME' padding means the output feature map has the same spatial dimensions as the input feature map: zero padding is added to match the shapes, and theoretically it is distributed equally on every side of the input map. On the other hand, 'VALID' means no padding. The overall architecture of the convolution layer is as depicted below:

Figure 2: The operations in convolutional layer.

Let's get back to the 'conv1' layer. The number of output feature maps is set to 32 and the spatial kernel size is set to [5, 5]. The stride is [1, 1] by default. The scope argument defines the name of the layer, which is useful in different scenarios such as returning the output of the layer, fine-tuning the network, and graphical advantages like drawing a nicer graph of the network using Tensorboard. Basically it is the representative of the layer and gathers all of its operations into a higher-level node.

In the 'fc3' layer we override the padding type, changing it to 'VALID'. The reason lies in the characteristics of this convolutional layer: it operates as a fully-connected layer. That is not because of the 'VALID' padding per se; the 'VALID' padding is just part of the mathematical operation. The reason is that the input to this layer has a spatial size of 7x7 and the kernel size of the layer is the same. When the input size of a convolutional layer equals its kernel size and 'VALID' padding is used, the output is a single neuron per output feature map. So with the number of output feature maps set to 1024, this layer operates like a fully-connected layer with 1024 output hidden units!
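This equivalence is easy to verify numerically: a 'VALID' convolution whose kernel covers the entire input computes, per output map, exactly the dot product a fully-connected layer computes on the flattened input. A NumPy sketch with the shapes used above (illustrative only):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(7, 7, 64)          # input feature map to 'fc3'
w = rng.randn(7, 7, 64, 1024)    # conv kernel: full spatial extent, 1024 output maps

# 'VALID' convolution with a kernel the size of the input: only one spatial
# position exists, so the output is a single 1x1x1024 "pixel".
conv_out = np.tensordot(x, w, axes=([0, 1, 2], [0, 1, 2]))

# The equivalent fully-connected layer on the flattened input.
fc_out = x.reshape(-1) @ w.reshape(-1, 1024)

print(np.allclose(conv_out, fc_out))  # True
```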

### Dropout Layer

Dropout is one of the most famous methods for preventing over-fitting. The operation randomly kills a portion of the neurons, which stochastically forces the remaining neurons to learn more useful information. Although the method is stochastic, it is widely used in neural network architectures and has presented promising results. The layer is defined in the 'dropout3' scope. The dropout_keep_prob argument determines the portion of the neurons that remain untouched and will not be disabled by the dropout layer. Moreover, the is_training flag activates and deactivates the dropout layer, so that dropout is active in the training phase and disabled in the test/evaluation phase.
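The behavior can be sketched in NumPy: each activation survives with probability dropout_keep_prob, and in the inverted formulation the survivors are scaled by 1/keep_prob so the expected activation is unchanged; at test time the layer is the identity. An illustrative sketch (not the TensorFlow implementation itself):

```python
import numpy as np

def dropout(x, keep_prob, is_training, rng):
    """Inverted dropout: scale the kept units so E[output] equals x."""
    if not is_training:
        return x
    mask = rng.uniform(size=x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.RandomState(0)
x = np.ones((10000,))
y = dropout(x, keep_prob=0.5, is_training=True, rng=rng)
print(y.mean())  # close to 1.0: the expectation is preserved
print(dropout(x, 0.5, is_training=False, rng=rng).mean())  # exactly 1.0 at test time
```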

### Embedding Layer

Convolutional layers produce 4-dimensional tensors with dimensions [batch_size, width, height, channel]. The embedding layer collapses all dimensions except the first, which indexes the batch, so the dimension [batch_size, width, height, channel] becomes [batch_size, width x height x channel]. Let's look at the 'fc4' layer. This is the last fully-connected layer prior to softmax, and its number of output units must equal the number of classes. The output of this layer has the dimensionality [batch_size, 1, 1, num_classes]. The spatial_squeeze block performs the embedding operation, whose output dimension is [batch_size, num_classes]. It is worth noting that the squeeze operation is named under the 'fc4' scope of the last layer.
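The squeeze step simply drops the two singleton spatial dimensions; a NumPy illustration with a hypothetical batch size of 32:

```python
import numpy as np

logits = np.zeros((32, 1, 1, 10))           # [batch_size, 1, 1, num_classes]
squeezed = np.squeeze(logits, axis=(1, 2))  # drop the two singleton spatial dims
print(squeezed.shape)  # (32, 10)
```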

## The TensorFlow Graph

Now, after describing the network design and its layers, it is time to present how to implement this architecture using TensorFlow. With TensorFlow, everything has to be defined on something called a Graph. The graph tells the TensorFlow backend what to do and how to do the desired operations, and TensorFlow uses a Session to run them.

The graph operations are executed in a session environment, which contains the state of the variables. Each created session needs a specific graph, because a session can only operate on a single graph; multiple graphs cannot be used in a single session. If the user does not explicitly use a session by name, the default session will be used by TensorFlow.

A graph contains the tensors and the operations defined on it, so a graph can be used in multiple sessions. As with sessions, if a graph is not explicitly defined by the user, TensorFlow sets a default graph. Although there is no harm in working with the default graph, explicitly defining the graph is recommended. The general graph of our experimental setup is as below:

Figure 3: The TensorFlow Graph.

The graph is explicitly defined in our experiments. The following script, panel by panel, shows the graph design of our experiments:

TensorFlow Graph Design

```python
import numpy as np
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():

    # global step
    global_step = tf.Variable(0, name="global_step", trainable=False)

    # learning rate policy
    decay_steps = int(num_train_samples / FLAGS.batch_size *
                      FLAGS.num_epochs_per_decay)
    learning_rate = tf.train.exponential_decay(FLAGS.initial_learning_rate,
                                               global_step,
                                               decay_steps,
                                               FLAGS.learning_rate_decay_factor,
                                               staircase=True,
                                               name='exponential_decay_learning_rate')

    # Place holders
    image_place = tf.placeholder(tf.float32, shape=([None, height, width, num_channels]), name='image')
    label_place = tf.placeholder(tf.float32, shape=([None, FLAGS.num_classes]), name='gt')
    dropout_param = tf.placeholder(tf.float32)

    # MODEL
    arg_scope = net.net_arg_scope(weight_decay=0.0005)
    with tf.contrib.framework.arg_scope(arg_scope):
        logits, end_points = net.net_architecture(image_place, num_classes=FLAGS.num_classes,
                                                  dropout_keep_prob=dropout_param,
                                                  is_training=FLAGS.is_training)

    # Define loss
    with tf.name_scope('loss'):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=label_place))

    # Accuracy
    with tf.name_scope('accuracy'):
        # Evaluate model
        correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(label_place, 1))

        # Accuracy calculation
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    # Define optimizer by its default values
    with tf.name_scope('train'):
        optimizer = tf.train.AdamOptimizer(learning_rate)

        # Gradient update: each run of 'train_op' applies the gradients of the
        # loss to the parameters and increments 'global_step' by one.
        grads_and_vars = optimizer.compute_gradients(loss)
        train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

    # Image summaries: 3 random images from the training set.
    arr = np.random.randint(data.train.images.shape[0], size=(3,))
    tf.summary.image('images', data.train.images[arr], max_outputs=3,
                     collections=['per_epoch_train'])

    # Histogram and scalar summaries
    for end_point in end_points:
        x = end_points[end_point]
        tf.summary.scalar('sparsity/' + end_point,
                          tf.nn.zero_fraction(x), collections=['train', 'test'])
        tf.summary.histogram('activations/' + end_point, x, collections=['per_epoch_train'])

    # Summaries for loss, accuracy, global step and learning rate.
    tf.summary.scalar("loss", loss, collections=['train', 'test'])
    tf.summary.scalar("accuracy", accuracy, collections=['train', 'test'])
    tf.summary.scalar("global_step", global_step, collections=['train'])
    tf.summary.scalar("learning_rate", learning_rate, collections=['train'])

    # Merge all summaries together.
    summary_train_op = tf.summary.merge_all('train')
    summary_test_op = tf.summary.merge_all('test')
    summary_epoch_train_op = tf.summary.merge_all('per_epoch_train')
```


Each part of the above script will be explained in the following subsections, using the same naming convention for convenience.

### Graph Default

As mentioned before, it is recommended to set the graph manually, and in this section we named the graph graph. Later on it will be noticed that this definition is useful because we can pass the graph to other functions and sessions and it will be recognized.

### Parameters

Different parameters are necessary for the learning procedure. global_step is one of them; the reason behind defining it is to keep track of where we are in the training procedure. It is a non-learnable tensor and should be incremented by one per gradient update, which is done over each batch. decay_steps determines after how many steps the learning rate should be decreased by a predefined policy; as can be seen, num_epochs_per_decay ties the decay schedule to the number of passed epochs. The learning_rate tensor determines the learning-rate policy. Please refer to the official TensorFlow documentation to get a better idea of the tf.train.exponential_decay operation. It is worth noting that tf.train.exponential_decay takes global_step as its counter to realize when it has to change the learning rate.
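With staircase=True, the policy is lr = initial_lr * decay_factor ** floor(global_step / decay_steps). A small worked sketch with hypothetical values (55,000 training images, batch size 512, one epoch per decay):

```python
def exponential_decay(initial_lr, global_step, decay_steps, decay_factor, staircase=True):
    """lr = initial_lr * decay_factor ** (global_step / decay_steps),
    with the exponent floored when staircase=True."""
    exponent = global_step / decay_steps
    if staircase:
        exponent = global_step // decay_steps
    return initial_lr * (decay_factor ** exponent)

# Hypothetical setup: 55000 training images, batch size 512, decay every epoch.
decay_steps = int(55000 / 512 * 1)   # 107 steps between decays
for step in [0, 107, 500]:
    print(exponential_decay(0.001, step, decay_steps, 0.95))
```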

### Place Holders

The tf.placeholder operation creates a placeholder tensor which will be fed to the network in the training/testing phase. The images and labels must have placeholders because they are, in essence, the inputs to the network. The type and shape of the placeholders must be defined as required parameters. As can be seen in the image placeholder, the first dimension of the shape argument is set to None, which allows the placeholder to accept any size in that dimension. That first dimension is the batch_size, and it is flexible.

The dropout_param placeholder takes the probability of keeping a neuron active. The reason for defining a placeholder for the dropout parameter is to let each run of a session set this parameter arbitrarily, which enriches the experiment by allowing dropout to be disabled when running the testing session.

### Model and Evaluation Tensors

The model section applies the default parameters determined by the arg_scope operator and builds the network. tf.nn.softmax_cross_entropy_with_logits on the un-normalized logits is used as the loss function; this function computes the softmax activation internally, which makes it more stable. Finally, the accuracy name scope computes the accuracy.
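The stability comes from computing the log-softmax directly with the log-sum-exp trick instead of exponentiating the raw logits first. A NumPy sketch of both the loss and the accuracy computed in the graph (illustrative, for one-hot labels):

```python
import numpy as np

def softmax_cross_entropy_with_logits(logits, labels):
    """Per-example cross-entropy from unnormalized logits and one-hot labels,
    using the log-sum-exp (max-shift) trick for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_softmax).sum(axis=1)

logits = np.array([[1000.0, 0.0], [0.0, 1000.0]])   # would overflow a naive softmax
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = softmax_cross_entropy_with_logits(logits, labels).mean()
accuracy = (logits.argmax(axis=1) == labels.argmax(axis=1)).mean()
print(loss, accuracy)
```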

### Training Tensors

Now it's time to define the training tensors. The Adam optimizer is used: one of the best current optimization algorithms, widely used and famous for its adaptive characteristics. In the train name scope, the gradients are computed using the defined loss tensor, and applying them is added as the train operation to the graph. Basically, train_op is the operation run for a gradient update on the parameters, and each execution of train_op is one training step. By passing global_step to the optimizer, each time train_op is run TensorFlow updates global_step and increments it by one!
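For intuition about Adam's adaptive behavior, here is a sketch of its update rule with the standard defaults (beta1=0.9, beta2=0.999, eps=1e-8), applied to a toy one-parameter problem; tf.train.AdamOptimizer performs these moment updates internally, and t plays the role of global_step:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: running first/second moments, bias correction, scaled step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5; each call is one
# "train_op" run, with t incremented per step just like global_step.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 10001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # close to the minimum at 0
```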

### Summaries

In this section we describe how to create summary operations and save them into allocated tensors. Eventually the summaries will be presented in Tensorboard in order to visualize what is happening inside the network black box. There are different types of summaries; three of them, image, scalar and histogram summaries, are used in this implementation. In order to keep this post from becoming too verbose, we do not go into depth on summary operations and will get back to them in another post.

First, image summaries are created, whose duty is to visualize the elements fed to the summary tensor; the elements here are 3 random images from the training data. Next, the outputs of the different layers are fed to the relevant scalar and histogram summary tensors. Finally, some scalar summaries are created in order to track training convergence and testing performance. The collections argument in the summary definitions acts as a supervisor that directs each summary tensor to the relevant operation: some summaries only need to be generated in the training phase and some only in testing. We also have a collection named 'per_epoch_train', and the summaries which only have to be generated once per training epoch are stored in this list. Eventually, the tf.summary.merge_all calls gather the summaries into the corresponding summary operations using the collections key.

## Training

Now it's time to go through the training procedure. It consists of different steps, starting from session configuration and ending with saving the model checkpoint.

### Configuration and Initialization

First of all the tensors should be gathered for convenience and the session must be configured. The code is as below:

Session Configuration

```python
tensors_key = ['cost', 'accuracy', 'train_op', 'global_step', 'image_place', 'label_place', 'dropout_param',
               'summary_train_op', 'summary_test_op', 'summary_epoch_train_op']
tensors = [loss, accuracy, train_op, global_step, image_place, label_place, dropout_param, summary_train_op,
           summary_test_op, summary_epoch_train_op]
tensors_dictionary = dict(zip(tensors_key, tensors))

# Configuration of the session
session_conf = tf.ConfigProto(
    allow_soft_placement=FLAGS.allow_soft_placement,
    log_device_placement=FLAGS.log_device_placement)
sess = tf.Session(graph=graph, config=session_conf)
```


As is clear from the script, all the tensors are stored in a dictionary to be used later by their corresponding keys, and then the session is configured. The allow_soft_placement flag allows switching back and forth between devices. This is useful when the user assigns the GPU to all operations without considering that not all operations are supported on the GPU by TensorFlow: if allow_soft_placement is disabled, errors can show up and the user must start debugging, but with the flag active this issue is prevented by automatically switching from an unsupported device to a supported one. The log_device_placement flag reports which operations are placed on which devices; this is useful for debugging and produces a verbose dialog in the terminal. Eventually, the session is created on the defined graph. The training phase starts using the following script:

Training Operations

```python
with sess.as_default():
    # Run the saver.
    # 'max_to_keep' determines the maximum number of models that TensorFlow
    # saves and keeps (TensorFlow's default is 5).
    saver = tf.train.Saver(max_to_keep=FLAGS.max_num_checkpoint)

    # Initialize all variables
    sess.run(tf.global_variables_initializer())

    ###################################################
    ############ Training / Evaluation ###############
    ###################################################
    train_evaluation.train(sess, saver, tensors_dictionary, data,
                           train_dir=FLAGS.train_dir,
                           finetuning=FLAGS.fine_tuning,
                           num_epochs=FLAGS.num_epochs, checkpoint_dir=FLAGS.checkpoint_dir,
                           batch_size=FLAGS.batch_size)

    train_evaluation.evaluation(sess, saver, tensors_dictionary, data,
                                checkpoint_dir=FLAGS.checkpoint_dir)
```


First, tf.train.Saver is instantiated in order to provide operations for saving and loading models; the max_to_keep flag determines the maximum number of saved models that TensorFlow keeps, and its default is set to 5 by TensorFlow. Then the session is run to initialize all the variables, which is necessary. Finally, the train_evaluation functions are called to run the training/testing phases.

### Training Operations

The training function is as below:
Training Function

```python
from __future__ import print_function
import tensorflow as tf
import numpy as np
import progress_bar
import os
import sys

def train(sess, saver, tensors, data, train_dir, finetuning,
          num_epochs, checkpoint_dir, batch_size):
    """
    This function runs the session in training mode.
    :param sess: The default session.
    :param saver: The saver operator to save and load the model weights.
    :param tensors: The tensors dictionary defined by the graph.
    :param data: The data structure.
    :param train_dir: The training directory, used for saving the logs and model checkpoints.
    :param finetuning: If fine-tuning should be done or random initialization is needed.
    :param num_epochs: Number of epochs for training.
    :param checkpoint_dir: The directory of the checkpoints.
    :param batch_size: The training batch size.

    :return:
        Run the session.
    """

    # The prefix for checkpoint files
    checkpoint_prefix = 'model'

    ###################################################################
    ########## Defining the summary writers for train/test ############
    ###################################################################

    train_summary_dir = os.path.join(train_dir, "summaries", "train")
    train_summary_writer = tf.summary.FileWriter(train_summary_dir)

    test_summary_dir = os.path.join(train_dir, "summaries", "test")
    test_summary_writer = tf.summary.FileWriter(test_summary_dir)

    # If the fine-tuning flag is 'True' the model will be restored.
    if finetuning:
        saver.restore(sess, os.path.join(checkpoint_dir, checkpoint_prefix))
        print("Model restored for fine-tuning...")

    ###################################################################
    ########## Run the training and loop over the batches #############
    ###################################################################
    for epoch in range(num_epochs):
        total_batch_training = int(data.train.images.shape[0] / batch_size)

        # go through the batches
        for batch_num in range(total_batch_training):
            #################################################
            ########## Get the training batches #############
            #################################################

            start_idx = batch_num * batch_size
            end_idx = (batch_num + 1) * batch_size

            # Fit training using batch data
            train_batch_data, train_batch_label = data.train.images[start_idx:end_idx], \
                                                  data.train.labels[start_idx:end_idx]

            ########################################
            ########## Run the session #############
            ########################################

            # Run the optimization op (backprop) and calculate batch loss and accuracy.
            # When the tensor tensors['global_step'] is evaluated, it will be incremented by one.
            batch_loss, _, train_summaries, training_step = sess.run(
                [tensors['cost'], tensors['train_op'], tensors['summary_train_op'],
                 tensors['global_step']],
                feed_dict={tensors['image_place']: train_batch_data,
                           tensors['label_place']: train_batch_label,
                           tensors['dropout_param']: 0.5})

            ########################################
            ########## Write summaries #############
            ########################################

            # Write the specific summaries for the training phase.
            train_summary_writer.add_summary(train_summaries, global_step=training_step)

            #################################################
            ########## Plot the progressive bar #############
            #################################################

            progress = float(batch_num + 1) / total_batch_training
            progress_bar.print_progress(progress, epoch_num=epoch + 1, loss=batch_loss)

        # ################################################################
        # ############ Summaries per epoch of training ###################
        # ################################################################
        train_epoch_summaries = sess.run(tensors['summary_epoch_train_op'],
                                         feed_dict={tensors['image_place']: train_batch_data,
                                                    tensors['label_place']: train_batch_label,
                                                    tensors['dropout_param']: 0.5})

        # Put the summaries to the train summary writer.
        train_summary_writer.add_summary(train_epoch_summaries, global_step=training_step)

        #####################################################
        ########## Evaluation on the test data ##############
        #####################################################

        # WARNING: In this evaluation the whole test data is fed. In case the test data is huge this implementation
        #          may lead to memory error. In presence of large testing samples, batch evaluation on testing is
        #          recommended as in the training phase.
        test_accuracy_epoch, test_summaries = sess.run(
            [tensors['accuracy'], tensors['summary_test_op']],
            feed_dict={tensors['image_place']: data.test.images,
                       tensors['label_place']: data.test.labels,
                       tensors['dropout_param']: 1.})
        print("Epoch " + str(epoch + 1) + ", Testing Accuracy= " +
              "{:.5f}".format(test_accuracy_epoch))

        ###########################################################
        ########## Write the summaries for test phase #############
        ###########################################################

        # Returning the value of global_step if necessary
        current_step = tf.train.global_step(sess, tensors['global_step'])

        # Use the global step counter for proper scaling between train and test summaries.
        test_summary_writer.add_summary(test_summaries, global_step=current_step)

    ###########################################################
    ############ Saving the model checkpoint ##################
    ###########################################################

    # The model will be saved when the training is done.

    # Create the path for saving the checkpoints.
    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)

    # save the model
    save_path = saver.save(sess, os.path.join(checkpoint_dir, checkpoint_prefix))
    print("Model saved in file: %s" % save_path)


############################################################################
########## Run the session for pure evaluation on the test data ############
############################################################################
def evaluation(sess, saver, tensors, data, checkpoint_dir):

    # The prefix for checkpoint files
    checkpoint_prefix = 'model'

    # Restoring the saved weights.
    saver.restore(sess, os.path.join(checkpoint_dir, checkpoint_prefix))
    print("Model restored...")

    # Evaluation of the model
    test_accuracy = 100 * sess.run(tensors['accuracy'],
                                   feed_dict={tensors['image_place']: data.test.images,
                                              tensors['label_place']: data.test.labels,
                                              tensors['dropout_param']: 1.})

    print("Final Test Accuracy is %% %.2f" % test_accuracy)
```



The input parameters to the function are described by the comments in its docstring. The summary writers are defined separately for the train and test phases. The program then checks whether fine-tuning is desired; if so, the model is loaded and the operation continues from there. Inside the loop, the batches are extracted from the training data, and for each training step the model is evaluated on a batch of data and the model parameters and weights are updated. The same applies for the test set at the end of each epoch, except that only evaluation is performed and the model is not updated. Finally, the model is saved.

## Training Summaries and Results

The training loop saves the summaries in the train summary directory. By using Tensorboard and pointing it to the directory where the logs are saved, we can visualize the training procedure. The loss and accuracy for training are depicted jointly below:

Figure 4: The loss and accuracy curves for training.
The activations of the last fully-connected layer are depicted in the following figure:
Figure 5: The activation of the last layer.

For the last layer it is good to have a visualization of the distribution of the neurons outputs. By using the histogram summary the distribution can be shown over the whole training steps. The result is as below:

Figure 6: The histogram summary of the last layer.
Eventually the test accuracy per step is plotted as the following curve:
Figure 7: Test Accuracy.
A representation of the terminal progressive bar for the training phase is as below:
Figure 8: Terminal scene in training phase.
A few things need to be considered in order to clarify the results:
• The initial learning rate for the Adam optimizer has been set to a small number. Setting it to a larger number would make the accuracy increase faster; we deliberately set it to a small number to be able to track the procedure more easily.
• The histogram summaries are saved per epoch and not per step. Since the generation of histogram summaries is very time-consuming, they are only generated once per training epoch.
• While training is in progress, an evaluation is performed over the whole test set at each epoch. If the test set is too big, batch-wise evaluation is recommended in order to avoid memory-exhaustion issues.

## Summary

In this tutorial we trained a neural network classifier using convolutional neural networks. MNIST data has been used for its simplicity and wide usage, and TensorFlow has been used as the deep learning framework. The main goal of this tutorial was to present an easy, ready-to-use implementation of training classifiers with TensorFlow. Many tutorials in this category seem to be too verbose in code or too short in explanations. My effort was to provide a tutorial that is easily understandable in terms of code and comprehensive in terms of description. Some details about TensorFlow (like summaries) and the data input pipeline have been ignored for simplicity; we will get back to them in future posts. I hope you enjoyed it.
