Unbelievable Usefulness of Batchnorm

Here are some basic, hands-on notes on batchnorm, an incredibly powerful weapon in deep neural networks, and some pitfalls in using it with TensorFlow.

Proposed by Ioffe and Szegedy in 2015, batchnorm has since gained popularity across deep learning architectures. It was designed to address the problem of killed (vanishing) gradients in deep networks: before this technique came out, training deep networks was a really tricky task, because the gradients were easily killed in the earlier layers.
Take a simple example, $y = xw + b$, which denotes a single fully-connected layer. In the beginning we use small random numbers to initialize the weights $w$, so as the layers get stacked up, the output $y$ of each layer gets smaller and smaller. Thus, in the backward pass, the gradient of $w$, which is some coefficient times the layer input $x$, will be very small, and practically zero in the first layers.
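To make this concrete, here is a tiny NumPy sketch (the layer width of 500 and the 0.01 scale are made up for illustration) of how the activations collapse toward zero as small-random-initialized affine layers are stacked; since the gradient of each $w$ is proportional to that layer's input, the gradients collapse with them:

import numpy as np

np.random.seed(0)
x = np.random.randn(64, 500)              # a batch of 64 examples with 500 features
h = x
for i in range(10):                       # stack 10 affine layers, y = x.dot(w) + b
    w = 0.01 * np.random.randn(500, 500)  # small random initialization
    b = np.zeros(500)
    h = h.dot(w) + b
    print(i, h.std())                     # the activations shrink toward zero layer by layer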
So what kind of inputs do these layers want? A good answer is unit Gaussian, something like this:
(figure: a desirable distribution of layer inputs)
So what the batchnorm layer actually does is add a step that transforms the data toward a unit Gaussian distribution. More concretely:
$$
\Large \hat{x}^{(k)}=\frac{x^{(k)}-E[x^{(k)}]}{ \sqrt{VAR[x^{(k)}]} }
$$
In practice, when you implement this transformation, a small constant $\epsilon$ is added in the denominator to avoid dividing by zero. The reason you can insert this step at all is that the function is differentiable, and a learnable scale and shift are applied to the normalized output (the scale and beta in the code below), so the network can learn to do little in these layers, or a lot, depending on what it needs.
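Here is a minimal sketch of the forward transformation in plain NumPy (toy shapes; the name batchnorm_forward and the toy batch are mine, and the running statistics needed at test time are left to the TensorFlow code further down):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, num_features); normalize each feature across the batch
    mean = x.mean(axis=0)                    # E[x^(k)]
    var = x.var(axis=0)                      # VAR[x^(k)]
    x_hat = (x - mean) / np.sqrt(var + eps)  # roughly unit gaussian; eps avoids division by zero
    return gamma * x_hat + beta              # learnable scale and shift let the network undo the normalization

x = np.random.randn(32, 4) * 3.0 + 7.0       # a toy batch that is far from unit gaussian
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))     # ~0 and ~1 per feature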
I encountered the killed-gradient problem in a task whose model was an 11-layer CNN. At first I found that the network learned nothing after several batches, so obviously the gradients were being killed. I then applied batchnorm to some of the layers, and the network started to learn like a charm.
But still, there are things to be careful about when you actually implement batchnorm in TensorFlow.

import tensorflow as tf  # TF1-style graph code

def batchnorm_for_affine(layer):
    # layer has shape (batch, D): one learnable scale/shift per feature
    D = layer.get_shape()[-1]
    scale = tf.Variable(tf.ones([D]))
    beta = tf.Variable(tf.zeros([D]))
    # non-trainable running statistics, used only at test time
    pop_mean = tf.Variable(tf.zeros([D]), trainable=False)
    pop_var = tf.Variable(tf.ones([D]), trainable=False)
    epsilon = 1e-3
    decay = 0.999

    if args["test_mode"] <= 0:  # training mode (args is an external config dict)
        # normalize with the statistics of the current mini-batch...
        batch_mean, batch_var = tf.nn.moments(layer, [0])
        # ...while keeping an exponential moving average for test time
        train_mean = tf.assign(pop_mean,
            pop_mean * decay + batch_mean * (1 - decay))
        train_var = tf.assign(pop_var,
            pop_var * decay + batch_var * (1 - decay))
        with tf.control_dependencies([train_mean, train_var]):
            return tf.nn.batch_normalization(layer,
                batch_mean, batch_var, beta, scale, epsilon)
    else:
        # test mode: use the accumulated population statistics instead
        return tf.nn.batch_normalization(layer,
            pop_mean, pop_var, beta, scale, epsilon)

Here scale and beta are the learnable parameters. The first thing to notice is that you need two sets of means and variances, because at test time you cannot use the same procedure as at training time: normalizing a test example against its own batch statistics would give you all-zero outputs (a batch of one is its own mean). So you keep a running mean and variance during training and use those at test time.
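For context, here is roughly how this function might be wired into a single fully-connected layer at graph-construction time (TF1-style; the shapes, the args dictionary, and the placeholder are made up for illustration):

import tensorflow as tf

args = {"test_mode": 0}                             # assumed external config; <= 0 means training
x = tf.placeholder(tf.float32, [None, 128])         # a batch of 128-dimensional inputs
w = tf.Variable(tf.truncated_normal([128, 64], stddev=0.01))
b = tf.Variable(tf.zeros([64]))
affine = tf.matmul(x, w) + b                        # raw affine output
hidden = tf.nn.relu(batchnorm_for_affine(affine))   # normalize before the nonlinearity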
The second thing is that the batchnorm call looks different for different kinds of layers. For the output of a convolutional layer, for example, the batch and spatial dimensions are effectively flattened together: you compute one mean and variance per channel, over the batch, height, and width axes, instead of over axis 0 alone, as sketched below. This is because batchnorm normalizes every individual feature against the others in the mini-batch, and for a convolutional layer the natural "feature" is the channel.
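Here is a rough sketch of the convolutional variant (training-time statistics only; the running-average bookkeeping from the affine version above is omitted for brevity, and the function name is mine):

def batchnorm_for_conv(layer):
    # layer has shape (batch, height, width, channels): keep one mean/var per channel
    C = layer.get_shape()[-1]
    scale = tf.Variable(tf.ones([C]))
    beta = tf.Variable(tf.zeros([C]))
    epsilon = 1e-3
    # average over the batch AND the spatial axes, not just axis 0
    batch_mean, batch_var = tf.nn.moments(layer, [0, 1, 2])
    return tf.nn.batch_normalization(layer,
        batch_mean, batch_var, beta, scale, epsilon)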

So, when you are using small random number initialization, or simple Xavier initialization, making smart use of batchnorm is almost always a good choice. You can also pair it with more intricate initialization schemes, and with activation functions other than ReLU, say leaky ReLU or something else.