
Hyperparameter tuning, Regularization and Optimization

Initialization

  • Different initializations lead to different results
  • Random initialization is used to break symmetry and make sure different hidden units can learn different things
  • Don’t initialize to values that are too large.
  • He initialization works well for networks with ReLU activations (a sketch follows this list).
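
A minimal numpy sketch of He initialization, assuming the layer sizes are given as a list `layer_dims` (an illustrative name):

```python
import numpy as np

def initialize_parameters_he(layer_dims):
    """He initialization: draw weights from a standard normal and scale
    by sqrt(2 / fan_in), which pairs well with ReLU activations."""
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    * np.sqrt(2.0 / layer_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))  # biases start at zero
    return parameters
```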

What is L2-regularization actually doing?

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights toward smaller values: large weights become too costly. This leads to a smoother model in which the output changes more slowly as the input changes.
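
For concreteness, with a cross-entropy loss the regularized cost takes the standard form, where $\lambda$ is the regularization hyperparameter:

$$J_{regularized} = \underbrace{-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log a^{[L](i)} + (1-y^{(i)})\log\left(1-a^{[L](i)}\right)\right)}_{\text{cross-entropy cost}} + \underbrace{\frac{\lambda}{2m}\sum_{l}\sum_{k}\sum_{j}\left(W_{k,j}^{[l]}\right)^2}_{\text{L2 regularization cost}}$$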

The implications of L2-regularization on:

  • The cost computation:
    • A regularization term is added to the cost
  • The backpropagation function:
    • There are extra terms in the gradients with respect to the weight matrices (both changes are sketched after this list)
  • Weights end up smaller (“weight decay”):
    • Weights are pushed to smaller values.
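
A minimal numpy sketch of both changes, assuming (hypothetically) that the weight matrices are stored in a `parameters` dict keyed "W1", …, "WL", the unregularized cross-entropy cost is already computed, and `lambd` holds $\lambda$:

```python
import numpy as np

def compute_cost_with_regularization(cross_entropy_cost, parameters, lambd, m):
    """Cost change: add the L2 penalty (lambda / 2m) * sum of squared weights."""
    L = len(parameters) // 2  # parameters holds W1..WL and b1..bL
    l2_cost = (lambd / (2 * m)) * sum(
        np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))
    return cross_entropy_cost + l2_cost

# Backprop change: each dW gains an extra (lambda / m) * W term, e.g. for layer 3:
# dW3 = (1. / m) * np.dot(dZ3, A2.T) + (lambd / m) * W3
```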

Dropout

Dropout is a widely used regularization technique that is specific to deep learning. It randomly shuts down some neurons in each iteration.

  • You only use dropout during training. Don’t use dropout (randomly eliminating nodes) during test time.
  • Apply dropout during both forward and backward propagation.
  • During training, divide each dropout layer’s output by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then on average we shut down half the nodes, so the output is scaled by 0.5 since only the remaining half contribute to the solution. Dividing by 0.5 is equivalent to multiplying by 2, so the output has the same expected value. You can check that this works for values of keep_prob other than 0.5 (see the sketch below).
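
A minimal numpy sketch of this “inverted dropout” for one layer’s activations A (names are illustrative):

```python
import numpy as np

def dropout_forward(A, keep_prob):
    """Inverted dropout (training time only)."""
    D = np.random.rand(*A.shape) < keep_prob  # mask: keep each unit with prob keep_prob
    A = (A * D) / keep_prob                   # shut units down, rescale to keep E[A]
    return A, D                               # cache the mask D for backprop

def dropout_backward(dA, D, keep_prob):
    """Backprop: apply the same mask and scaling to the gradients."""
    return (dA * D) / keep_prob
```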

What you should remember:

  • Regularization will help you reduce overfitting.
  • Regularization will drive your weights to lower values.
  • L2 regularization and Dropout are two very effective regularization techniques.

Stochastic Gradient Descent (SGD)

  • It is equivalent to mini-batch gradient descent where each mini-batch has just 1 example.
  • What changes is that you would be computing gradients on just one training example at a time, rather than on the whole training set.
  • In Stochastic Gradient Descent, you use only 1 training example before each parameter update. When the training set is large, SGD can be faster. But the parameters will “oscillate” toward the minimum rather than converge smoothly. Note also that implementing SGD requires 3 for-loops in total (sketched after this list):
    1. Over the number of iterations
    2. Over the $m$ training examples
    3. Over the layers (to update all parameters, from $(W^{[1]},b^{[1]})$ to $(W^{[L]},b^{[L]})$)
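
A sketch of those three loops, using hypothetical helpers (`forward_backward` stands in for computing the gradients on a single example; `alpha` is the learning rate); this is an illustration, not a complete implementation:

```python
for i in range(num_iterations):                  # loop 1: iterations
    for j in range(m):                           # loop 2: one training example at a time
        grads = forward_backward(X[:, j:j+1], Y[:, j:j+1], parameters)  # hypothetical helper
        for l in range(1, L + 1):                # loop 3: layers
            parameters["W" + str(l)] -= alpha * grads["dW" + str(l)]
            parameters["b" + str(l)] -= alpha * grads["db" + str(l)]
```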

Mini-batch gradient descent

It uses an intermediate number of examples for each step. With mini-batch gradient descent, you loop over the mini-batches instead of looping over individual training examples.

  • Shuffle: Create a shuffled version of the training set (X, Y). Each column of X and Y represents a training example. Note that the random shuffling is done synchronously between X and Y, so that after the shuffling the $i^{th}$ column of X is the example corresponding to the $i^{th}$ label in Y. The shuffling step ensures that examples will be split randomly into different mini-batches.

  • Partition: Partition the shuffled (X, Y) into mini-batches of size mini_batch_size. Note that the number of training examples is not always divisible by mini_batch_size. The last mini-batch might be smaller, but you don’t need to worry about this. (Both steps are sketched below.)
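
A minimal numpy sketch of both steps, assuming examples are stored as columns of X and Y (the function name is illustrative):

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    np.random.seed(seed)
    m = X.shape[1]            # number of training examples
    mini_batches = []

    # Step 1: shuffle the columns of X and Y with the same permutation
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation]

    # Step 2: partition; the last mini-batch may be smaller than mini_batch_size
    num_complete = m // mini_batch_size
    for k in range(num_complete):
        mini_batches.append((shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size],
                             shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]))
    if m % mini_batch_size != 0:
        mini_batches.append((shuffled_X[:, num_complete * mini_batch_size:],
                             shuffled_Y[:, num_complete * mini_batch_size:]))
    return mini_batches
```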

What you should remember:

  • The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
  • You have to tune a learning rate hyperparameter $\alpha$.
  • With a well-tuned mini-batch size, mini-batch gradient descent usually outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).

Momentum

Using momentum can reduce oscillations. Momentum takes into account the past gradients to smooth out the update.
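
Concretely, for each layer $l$, momentum keeps a velocity $v$ that is an exponentially weighted average of past gradients and moves the parameters along it (this is the standard formulation):

$$v_{dW^{[l]}} = \beta\, v_{dW^{[l]}} + (1-\beta)\, dW^{[l]}, \qquad W^{[l]} = W^{[l]} - \alpha\, v_{dW^{[l]}}$$

and analogously for $b^{[l]}$.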

  • How do you choose $\beta$?

  • The larger the momentum $\beta$ is, the smoother the update, because it takes the past gradients into account more heavily. But if $\beta$ is too big, it can also smooth out the updates too much.
  • Common values for $\beta$ range from 0.8 to 0.999. If you don’t feel inclined to tune this, $\beta = 0.9$ is often a reasonable default.
  • Tuning the optimal $\beta$ for your model may require trying several values to see what works best in terms of reducing the value of the cost function $J$. (One update step is sketched below.)
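
A minimal numpy-style sketch of one momentum step, assuming (hypothetically) that parameters, grads and the velocities v are dicts keyed "W1", "b1", … / "dW1", "db1", …:

```python
def update_parameters_with_momentum(parameters, grads, v, beta, alpha):
    L = len(parameters) // 2  # number of layers
    for l in range(1, L + 1):
        # velocity: exponentially weighted average of past gradients
        v["dW" + str(l)] = beta * v["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]
        v["db" + str(l)] = beta * v["db" + str(l)] + (1 - beta) * grads["db" + str(l)]
        # move along the velocity instead of the raw gradient
        parameters["W" + str(l)] -= alpha * v["dW" + str(l)]
        parameters["b" + str(l)] -= alpha * v["db" + str(l)]
    return parameters, v
```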

What you should remember:

  • Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
  • You have to tune a momentum hyperparameter $\beta$ and a learning rate $\alpha$.

Adam

How does Adam work?

  1. It calculates an exponentially weighted average of past gradients, and stores it in variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction).
  2. It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction).
  3. It updates parameters in a direction based on combining information from steps 1 and 2 (the update rule and a sketch follow).
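
Putting the three steps together, the standard update for a weight matrix $W$ at step $t$ is:

$$v = \beta_1 v + (1-\beta_1)\, dW, \qquad v^{corrected} = \frac{v}{1-\beta_1^t}$$

$$s = \beta_2 s + (1-\beta_2)\, (dW)^2, \qquad s^{corrected} = \frac{s}{1-\beta_2^t}$$

$$W = W - \alpha\, \frac{v^{corrected}}{\sqrt{s^{corrected}} + \varepsilon}$$

A minimal numpy sketch, assuming the same hypothetical dict layout as the momentum sketch above:

```python
import numpy as np

def update_parameters_with_adam(parameters, grads, v, s, t,
                                alpha=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
    L = len(parameters) // 2
    for l in range(1, L + 1):
        for p, g in (("W", "dW"), ("b", "db")):
            k = g + str(l)
            v[k] = beta1 * v[k] + (1 - beta1) * grads[k]             # step 1: first moment
            s[k] = beta2 * s[k] + (1 - beta2) * np.square(grads[k])  # step 2: second moment
            v_corr = v[k] / (1 - beta1 ** t)                         # bias correction
            s_corr = s[k] / (1 - beta2 ** t)
            parameters[p + str(l)] -= alpha * v_corr / (np.sqrt(s_corr) + epsilon)  # step 3
    return parameters, v, s
```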