An easy guide to gradients of softmax cross entropy loss

https://www.jianshu.com/p/c02a1fbffad6

Softmax function

The softmax function is commonly used in neural networks as the output layer of a classification task. You can think of softmax as turning raw scores into a probability for each candidate category. For example, in a three-class classification task, softmax converts the three raw scores into three probabilities that reflect their relative sizes and sum to 1. The inputs to the softmax function are usually called logits, written z_i. The function takes the following form.
$$
a_i = \frac{e^{z_i}}{\sum_k e^{z_k}}
$$
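As a quick illustration, here is a minimal NumPy sketch of this definition (the function name and example logits are mine, not from the original; subtracting the maximum logit is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability; the output is unchanged.
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

z = np.array([2.0, 1.0, 0.1])   # example logits
a = softmax(z)
print(a)          # roughly [0.659 0.242 0.099]
print(a.sum())    # 1.0 up to floating-point error
```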
It is easy to see that this function is a multiple-input, multiple-output mapping, so its gradient is a Jacobian matrix with elements
$$
\frac{\partial a_i}{\partial z_j}
$$
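To make this Jacobian concrete, the sketch below assembles it from the softmax outputs using the standard closed form ∂a_i/∂z_j = a_i (δ_ij - a_j) and checks one column against a finite-difference approximation (the helper names and numbers are mine):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(a):
    # J[i, j] = da_i/dz_j = a_i * (delta_ij - a_j), written in terms of the outputs a.
    return np.diag(a) - np.outer(a, a)

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
J = softmax_jacobian(a)

# Finite-difference check of column j: how every a_i moves when z_j is nudged.
eps, j = 1e-6, 1
z_plus = z.copy()
z_plus[j] += eps
print(np.allclose(J[:, j], (softmax(z_plus) - a) / eps, atol=1e-5))  # True
```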

Cross-entropy loss function

The loss function can take many forms; the cross-entropy function is used here mainly because its derivative is relatively simple and cheap to compute, and because cross-entropy avoids the learning slowdown that a quadratic loss suffers from when the output units saturate. The cross-entropy function looks like,
$$
L(z, y) = -\sum_i y_i \ln a_i
$$
where the y_i are the so-called labels, indicating the true category that each input sample falls into. The loss L is a multivariate function, so gradients would flow into both the logits and the labels. Since labels are usually not trainable variables, it is practical in TensorFlow to pass the label tensor through tf.stop_gradient before feeding it to this function, which disallows backpropagation into the labels.
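
A short TensorFlow 2 sketch of this setup (the logits and one-hot labels are illustrative): tf.nn.softmax_cross_entropy_with_logits applies the softmax and the cross entropy together, and the labels are wrapped in tf.stop_gradient exactly as described above.

```python
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])   # z, shape (batch, num_classes)
labels = tf.constant([[1.0, 0.0, 0.0]])   # y, one-hot true categories

with tf.GradientTape() as tape:
    tape.watch(logits)                    # logits is a constant here, so watch it explicitly
    loss = tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.stop_gradient(labels),  # block backpropagation into the labels
        logits=logits)

print(loss)                               # per-example cross-entropy loss
print(tape.gradient(loss, logits))        # gradient with respect to the logits z
```

The gradient printed here equals softmax(logits) - labels, which is exactly what the chain-rule computation in the next section arrives at.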

Chain Rule

Simply apply the chain rule. Since L depends on each logit z_j through every output a_i, the derivative sums over i,
$$
\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial a_i}\frac{\partial a_i}{\partial z_j}
$$
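
Completing the computation (a short sketch, assuming one-hot labels so that the y_i sum to 1, and using the cross-entropy derivative ∂L/∂a_i = -y_i / a_i together with the softmax Jacobian ∂a_i/∂z_j = a_i(δ_ij - a_j)),
$$
\frac{\partial L}{\partial z_j} = \sum_i \left(-\frac{y_i}{a_i}\right) a_i (\delta_{ij} - a_j) = -y_j + a_j\sum_i y_i = a_j - y_j
$$
This is the well-known compact gradient of softmax cross entropy, and it matches the gradient with respect to the logits printed by the TensorFlow snippet above.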
