Losing sleep over neural networks that are too large? Take a look at these tips to make them smaller.
The first and most popular method is to train deep neural networks that are lightweight by design. A good example of this is MobileNet by Google. MobileNet's architecture has only 4.2 million parameters while achieving about 70% top-1 accuracy on ImageNet. This compression maintains reasonable performance and is achieved by replacing standard convolutions with depthwise separable convolutions, which reduce both model size and computational complexity.
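To make this concrete, here is a minimal sketch of the depthwise separable convolution block that MobileNet stacks: a per-channel depthwise convolution followed by a 1x1 pointwise convolution. The code is PyTorch and the channel counts are illustrative, not MobileNet's actual configuration.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one 3x3 filter per input channel) followed by a 1x1
    pointwise conv that mixes channels -- the MobileNet building block."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x

# A standard 3x3 convolution from 64 to 128 channels uses 64*128*3*3 = 73,728
# weights; the separable version uses 64*3*3 + 64*128 = 8,768 -- roughly 8x fewer.
block = DepthwiseSeparableConv(64, 128)
conv_weights = block.depthwise.weight.numel() + block.pointwise.weight.numel()
print(conv_weights)  # 8768
```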
A modern approach to reducing generalization error is to use a larger model combined with regularization during training that keeps the model's weights small. These techniques not only reduce overfitting but can also lead to faster optimization and better overall performance.
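The most common way to keep weights small is an L2 penalty, applied either as a term in the loss or through the optimizer's weight decay. A minimal PyTorch sketch (the model and the 1e-4 decay value are arbitrary illustrations):

```python
import torch
import torch.nn as nn

# Deliberately over-sized model for the task; regularization keeps it in check.
model = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10))

# weight_decay adds an L2 penalty on the weights at every update,
# nudging the large model toward small weights and less overfitting.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
```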
Tips for Training Deep Neural Networks
- Before you start building your network architecture, verify that each input (x) actually corresponds to its label (y). For dense prediction, make sure the ground-truth labels (y) are properly encoded as label indexes (or one-hot vectors); otherwise, training won't work (see the sanity-check sketch after this list).
- Always use normalization layers in your network. If you train the network with a large batch size (say, 10 or more), use the BatchNormalization layer. If you train with a small batch size (say, 1), use the InstanceNormalization layer instead. Several authors have found that BatchNormalization improves performance as the batch size increases but degrades it when the batch size is small, whereas InstanceNormalization gives slight improvements with small batch sizes. You may also try GroupNormalization (see the normalization sketch after this list).
- Another regularization technique is to constrain, or bound, your network weights. This can also help prevent exploding gradients, since the weights are always bounded. In contrast to L2 regularization, which penalizes large weights in the loss function, this constraint regularizes the weights directly (see the max-norm sketch after this list).
- Always shuffle your training data, both before training and during training, unless your task benefits from temporal ordering. This can help improve network performance (see the DataLoader sketch after this list).
- To capture contextual information around objects, use multi-scale feature pooling modules. This can further improve accuracy; the idea has been used successfully in semantic segmentation and foreground segmentation (see the pooling sketch after this list).
- Exclude void labels (or ambiguous regions), if any, from your loss and accuracy computations. This can help your network be more confident in its predictions (see the ignore-label sketch after this list).
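Below are small sketches for the tips above. All of them are written in PyTorch, with names, shapes, and hyperparameters that are illustrative assumptions rather than recommendations from any specific paper. First, the input/label sanity check: make sure every input has a label and that dense labels are encoded as class indexes.

```python
import torch

# Hypothetical dense-prediction dataset: images and per-pixel label maps.
images = torch.randn(100, 3, 128, 128)          # (N, C, H, W)
labels = torch.randint(0, 21, (100, 128, 128))  # class indexes, not RGB colors
num_classes = 21

# Every input (x) must line up with exactly one ground-truth label (y).
assert images.shape[0] == labels.shape[0], "x/y counts do not match"

# Dense labels must be encoded as class indexes (or one-hot), not raw values.
assert labels.dtype == torch.long and int(labels.max()) < num_classes
```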
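Next, picking a normalization layer by batch size. PyTorch layer names stand in for the BatchNormalization/InstanceNormalization/GroupNormalization layers mentioned above, and the threshold of 10 follows the rule of thumb in the list:

```python
import torch.nn as nn

def make_norm(num_channels: int, batch_size: int) -> nn.Module:
    """Choose a normalization layer based on the training batch size."""
    if batch_size >= 10:
        # Large batches: batch statistics are reliable.
        return nn.BatchNorm2d(num_channels)
    # Small batches (even a batch size of 1): normalize each sample on its own.
    return nn.InstanceNorm2d(num_channels, affine=True)

# A batch-size-independent alternative worth trying:
# nn.GroupNorm(num_groups=8, num_channels=num_channels)
```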
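Bounding the weights directly can be done with a max-norm constraint applied after each optimizer step. This sketch rescales the whole weight tensor for simplicity (per-neuron variants are also common); the limit of 3.0 is an arbitrary illustration:

```python
import torch
import torch.nn as nn

def apply_max_norm(model: nn.Module, max_norm: float = 3.0) -> None:
    """Rescale each Linear/Conv2d weight tensor so its L2 norm stays <= max_norm."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                norm = module.weight.norm()
                if norm > max_norm:
                    module.weight.mul_(max_norm / norm)

# Usage, once per training step:
#   optimizer.step()
#   apply_max_norm(model)
```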
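Shuffling is usually a single flag on the data loader. With PyTorch's DataLoader, shuffle=True draws a fresh random ordering of the samples every epoch (the dataset here is a placeholder):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; substitute your own.
dataset = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))

# shuffle=True reshuffles the sample order at the start of every epoch.
loader = DataLoader(dataset, batch_size=16, shuffle=True)
```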
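A multi-scale pooling module in the spirit of pyramid pooling: pool the feature map at several scales, project each pooled map with a 1x1 convolution, upsample it back, and concatenate everything with the original features. The scales and channel counts below are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePooling(nn.Module):
    """Concatenate the input features with context pooled at several scales."""
    def __init__(self, in_channels, channels_per_scale=64, scales=(1, 2, 4, 8)):
        super().__init__()
        self.scales = scales
        self.projections = nn.ModuleList(
            nn.Conv2d(in_channels, channels_per_scale, kernel_size=1)
            for _ in scales
        )

    def forward(self, x):
        h, w = x.shape[2:]
        features = [x]
        for scale, project in zip(self.scales, self.projections):
            context = F.adaptive_avg_pool2d(x, output_size=scale)  # coarse context
            context = project(context)
            context = F.interpolate(context, size=(h, w), mode='bilinear',
                                    align_corners=False)           # back to full size
            features.append(context)
        return torch.cat(features, dim=1)  # context-enriched feature map
```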
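Finally, excluding void labels from the loss. In PyTorch this is a single argument; 255 is a common convention for void pixels (e.g. in Cityscapes) but is only an assumption here:

```python
import torch.nn as nn

VOID_LABEL = 255  # index used for ambiguous / unlabeled pixels (assumption)

# Pixels carrying the void label contribute neither to the loss nor to its
# gradient, so the network is never pushed toward a guess in those regions.
criterion = nn.CrossEntropyLoss(ignore_index=VOID_LABEL)

# For accuracy, mask the void pixels out as well:
#   valid = target != VOID_LABEL
#   accuracy = (prediction[valid] == target[valid]).float().mean()
```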
Intuitions Behind Network Pruning
Pruning is a simple, intuitive algorithm. There are many variants, but the basic idea works on any neural network and rests on two intuitions.
The first is weight redundancy: several neurons, or filters in the particular case of CNNs, are activated by very similar input values. Consequently, most networks are over-parameterized, and we can safely assume that deleting redundant weights won't harm performance too much.
The second is that not all weights contribute equally to the output prediction. Intuitively, we can assume that lower-magnitude weights are less important to the network; Y. LeCun calls them low-saliency weights. Indeed, all other things being equal, lower-magnitude weights have a smaller effect on the network's training error.
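A small sketch of that intuition in action: zero out the lowest-magnitude fraction of weights in each layer. The code is PyTorch and the 30% sparsity level is an arbitrary illustration; in practice pruning is usually followed by fine-tuning to recover any lost accuracy, and utilities such as torch.nn.utils.prune offer ready-made versions of the same idea.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, fraction: float = 0.3) -> None:
    """Zero out the `fraction` of weights with the smallest absolute value
    in every Linear/Conv2d layer -- the low-saliency weights."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                weight = module.weight
                k = int(fraction * weight.numel())
                if k == 0:
                    continue
                # Magnitude below which a weight is treated as redundant.
                threshold = weight.abs().flatten().kthvalue(k).values
                mask = weight.abs() > threshold
                weight.mul_(mask)

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
magnitude_prune(model, fraction=0.3)
print((model[0].weight == 0).float().mean())  # roughly 0.30
```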