Backpropagation

01 Background

Here, we discuss the math behind backpropagation, the driving force of neural networks. In essence, we use gradient descent to minimize the cost function. The chain rule is applied to compute the gradients used to update the weights and biases (the learnable parameters) throughout the neural network after each forward pass of a training example (or batch of training examples).

Below, we answer the questions:

What is gradient descent?
What is its use in deep learning?
How to calculate gradient descent?

Computation Graph

Let’s review the fundamentals of the chain rule using a computation graph. Assume we have the following computation graph for a simple function $J = 3 (a + b c)$, built from the intermediate variables $u = b c$ and $v = a + u$, so that $J = 3 v$. In this example $b = 3$ and $c = 2$.

[Figure: computation graph for $J = 3 (a + b c)$]

Let’s calculate the derivative of $J$ with respect to $v$, which gives us the first step of backpropagation: one step backward in the graph. $J$ is also known as the final output variable.

\frac{\partial J}{\partial v} = 3

We then find the derivative of $J$ with respect to the other variables.

\frac{\partial J}{\partial a} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial a} = 3 \times 1 = 3

\frac{\partial J}{\partial u} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial u} = 3 \times 1 = 3

\frac{\partial J}{\partial b} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial u} \frac{\partial u}{\partial b} = 3 \times 1 \times 2 = 6

\frac{\partial J}{\partial c} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial u} \frac{\partial u}{\partial c} = 3 \times 1 \times 3 = 9

Here, we have calculated the gradients of the function $J$ with respect to $v, a, u, b, c$ using the chain rule. In a machine learning model, $J$ could be a cost function that we are trying to minimize. In code, we often shorten $\frac{\partial J}{\partial v}$ to $d v$.
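The backward pass above can be checked with a few lines of Python. This is a minimal sketch: the value `a = 5` is an assumption chosen for illustration (the derivatives above only fix $b = 3$ and $c = 2$), and the `d*` names follow the shorthand just introduced.

```python
# Forward pass through the graph: u = b*c, v = a + u, J = 3*v
a, b, c = 5.0, 3.0, 2.0   # a = 5 is an assumed value; b and c match the derivatives above

u = b * c
v = a + u
J = 3 * v

# Backward pass: apply the chain rule from J back to the inputs
dv = 3.0          # dJ/dv
da = dv * 1.0     # dv/da = 1
du = dv * 1.0     # dv/du = 1
db = du * c       # du/db = c
dc = du * b       # du/dc = b

print(J, dv, da, du, db, dc)
```

Each gradient reuses the one computed just before it, which is what makes the backward sweep cheap.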

Linear Regression

Logistic Regression

Let’s decipher backpropagation for a logistic regression task. Let’s recap the related functions. These are the “forward” computations.

z = w^{T} x + b \quad (1)

\hat{y} = a = \sigma (z) \quad (2)

L (a, y) = - [y \log a + (1 - y) \log (1 - a)] \quad (3)

This loss function $(3)$ is known as the logistic loss or log-loss function. Logistic Regression uses a sigmoid activation function, which can be defined as:

\hat{y} = a = \sigma (z) = \frac{1}{1 + e^{- z}} \quad (4)
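The forward computations $(1)$ to $(4)$ can be sketched for a single example as follows. The function names and the sample inputs are illustrative assumptions, not part of the original notes.

```python
import math

def sigmoid(z):
    """Sigmoid activation, equation (4)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x, b, y):
    """Forward computations (1)-(3) for one example."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b        # z = w^T x + b
    a = sigmoid(z)                                          # prediction y_hat
    loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))   # log-loss
    return z, a, loss

# With all parameters at zero the prediction is 0.5, so the loss is log(2)
z, a, loss = forward(w=[0.0, 0.0], x=[1.0, 2.0], b=0.0, y=1)
print(z, a, loss)
```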

Example with only Two Features

Let’s look at a simple example with only $n = 2$ features. Our computation graph is shown below.

[Figure: computation graph for logistic regression with two features]

In the backward computations, we first compute the derivative of the loss with respect to $a$.

d a = \frac{\partial L}{\partial a} = - \frac{y}{a} + \frac{1 - y}{1 - a}

d z = \frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z} = \left(- \frac{y}{a} + \frac{1 - y}{1 - a}\right) (a (1 - a)) = a - y

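How do we obtain $\frac{\partial a}{\partial z}$? Differentiating the sigmoid $(4)$ gives $\frac{\partial a}{\partial z} = \frac{e^{- z}}{(1 + e^{- z})^{2}} = a (1 - a)$.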
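Both results, $\frac{\partial a}{\partial z} = a (1 - a)$ and $d z = a - y$, can be sanity-checked against finite differences. The sketch below uses an arbitrary test point; the values of `z`, `y`, and `eps` are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_of_z(z, y):
    """Log-loss written as a function of z, so it can be differentiated numerically."""
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6   # arbitrary test point (assumed values)
a = sigmoid(z)

# Central finite differences approximate the true derivatives
da_dz_numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
dz_numeric = (loss_of_z(z + eps, y) - loss_of_z(z - eps, y)) / (2 * eps)

da_dz_analytic = a * (1 - a)   # sigma'(z) = a(1 - a)
dz_analytic = a - y            # dL/dz = a - y

print(da_dz_numeric, da_dz_analytic)
print(dz_numeric, dz_analytic)
```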

Then, use the chain rule to compute how much you need to change $w$ and $b$ .

d w_{1} = \frac{\partial L}{\partial w _{1}} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial w _{1}} = x_{1} d z

Then, we would update $w_{1}$ by stepping against the gradient, where $\alpha$ is the learning rate:

w_{1} = w_{1} - \alpha d w_{1}

The same goes for $w_{2}$ and $b$:

w_{2} = w_{2} - \alpha d w_{2}

d b = \frac{\partial L}{\partial b} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial b} = d z, \quad \therefore b = b - \alpha d z
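As a minimal sketch, one update from a single example with two features could look like this. The function name `sgd_step` and the sample values are assumptions; each parameter moves by the learning rate times its gradient, in the direction that decreases the loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w1, w2, b, x1, x2, y, alpha):
    """One gradient-descent update from a single example (n = 2 features)."""
    z = w1 * x1 + w2 * x2 + b
    a = sigmoid(z)
    dz = a - y                             # dL/dz
    dw1, dw2, db = x1 * dz, x2 * dz, dz    # chain rule through z
    return w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db

# Starting from zero parameters with a positive example (y = 1)
w1, w2, b = sgd_step(0.0, 0.0, 0.0, x1=1.0, x2=2.0, y=1.0, alpha=0.1)
print(w1, w2, b)
```

Since the label is 1 and the initial prediction is 0.5, all three parameters increase, which pushes the prediction toward 1.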

For $m$ training examples, we initialize $J = 0, d w_{1} = 0, d w_{2} = 0, d b = 0$, then use a for loop over the training set, adding each example’s loss to $J$ and its derivatives to $d w_{1}$, $d w_{2}$, and $d b$. Dividing by $m$ at the end gives the average cost and the average gradients.

Thus, let’s define our algorithm.

\begin{algorithm}
    \caption{Logistic Regression Forward and Backpropagation} 
    \begin{algorithmic}
      \Procedure{LogisticRegression}{$x, b, w$}
	    \State $J \gets 0,\ dw_{1} \gets 0,\ dw_{2} \gets 0,\ db \gets 0$
	    \For{$i=1$ to $m$} \Comment{For all $m$ training examples}
	    \State $z^{(i)} \gets w^Tx^{(i)}+b$
	    \State $a^{(i)} \gets \sigma(z^{(i)})$
	    \State $J \gets J - [y^{(i)}\log{a^{(i)}}+ (1-y^{(i)})\log{(1-a^{(i)})}] $
	    \State $dz^{(i)} \gets a^{(i)}-y^{(i)}$ \Comment{$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z}$}
	    \State $dw_{1} \gets dw_{1} + x^{(i)}_1dz^{(i)}$ \Comment{accumulate $\frac{\partial L}{\partial a} \frac{\partial a}{\partial z}\frac{\partial z}{\partial w_{1}}$}
	     \State $dw_{2} \gets dw_{2} + x^{(i)}_2dz^{(i)}$ \Comment{accumulate $\frac{\partial L}{\partial a} \frac{\partial a}{\partial z}\frac{\partial z}{\partial w_{2}}$}
	     \State $db \gets db + dz^{(i)}$ \Comment{accumulate $\frac{\partial L}{\partial a} \frac{\partial a}{\partial z}\frac{\partial z}{\partial b}$}
	    \EndFor 
	    
	    \Comment{Divide by $m$ to get average}
	    \State $dw_{1} \gets \frac{dw_1}{m}$ \Comment{Average for $w_1$}
	    \State $dw_{2} \gets \frac{dw_2}{m}$ \Comment{Average for $w_2$}
	    \State $db \gets \frac{db}{m}$
	    \State $J \gets \frac{J}{m}$ 
      \EndProcedure
      \end{algorithmic}
    \end{algorithm}

We are using $d w_{1}$ as the derivative of the overall cost function with respect to $w_{1}$.

d w_{1} = \frac{\partial J}{\partial w _{1}}

So, to implement one step of gradient descent, we update:

w_{1} := w_{1} - \alpha d w_{1}

w_{2} := w_{2} - \alpha d w_{2}

b := b - α d b

If we want to implement multiple steps, we run this procedure multiple times.
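Putting the algorithm and the update rule together, a loop-based version might look like the sketch below. The dataset, the learning rate, the number of steps, and the function name `logistic_gradients` are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_gradients(X, y, w, b):
    """Average cost and gradients over m examples using explicit loops.

    X is a list of m examples (each a list of n features), y a list of m labels.
    """
    m, n = len(X), len(w)
    J, dw, db = 0.0, [0.0] * n, 0.0
    for x_i, y_i in zip(X, y):              # outer loop over training examples
        z = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        a = sigmoid(z)
        J += -(y_i * math.log(a) + (1 - y_i) * math.log(1 - a))
        dz = a - y_i
        for j in range(n):                  # inner loop over features
            dw[j] += x_i[j] * dz
        db += dz
    return J / m, [g / m for g in dw], db / m

# Tiny made-up dataset for illustration
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, 0, 0]

w, b, alpha = [0.0, 0.0], 0.0, 0.1
costs = []
for step in range(100):                     # multiple steps of gradient descent
    J, dw, db = logistic_gradients(X, y, w, b)
    costs.append(J)
    w = [w_j - alpha * g for w_j, g in zip(w, dw)]
    b -= alpha * db

print(costs[0], costs[-1])
```

The two nested loops inside `logistic_gradients` are the ones that vectorization removes.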

Okay, how would we improve this? Notice that there are essentially two for loops, which isn’t very efficient: an outer loop over the $m$ training examples and an inner loop over the features. We can use vectorization to speed this up. First, let’s note the inner loop. In this graph, we only use two features, and thus two weights. What if there are $n$ weights? Then we would require a for loop over all input features. We show the generalization below.

[Figure: computation graph generalized to $n$ input features]

Vectorize

Note that $(4)$ can be substituted into $(3)$ to give the loss $L (\hat{y}^{(i)}, y^{(i)}) = - [y^{(i)} \log (\hat{y}^{(i)}) + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)})]$ for the $i$-th example. Let’s average this loss over the $m$ training examples to get the total cost.

J (w, b) = \frac{1}{m} \sum_{i = 1}^{m} L (\hat{y}^{(i)}, y^{(i)}) = - \frac{1}{m} \sum_{i = 1}^{m} \left(y^{(i)} \log (\hat{y}^{(i)}) + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)})\right)

where $\hat{y}^{(i)} = \sigma (w^{T} x^{(i)} + b)$ is the prediction for the $i$-th example.

Our goal is to find the gradient $\nabla J$, that is, $\frac{\partial J}{\partial w_{j}}$ for every weight and $\frac{\partial J}{\partial b}$.

We know that $\frac{\partial L}{\partial z^{(i)}} = a^{(i)} - y^{(i)}$ for a single example, as calculated above. Since $J$ averages the per-example losses, each example contributes

\frac{\partial J}{\partial z^{(i)}} = \frac{1}{m} (\hat{y}^{(i)} - y^{(i)})

Summing over the examples with the chain rule, it’s then easy to get

\frac{\partial J}{\partial w_{j}} = \sum_{i = 1}^{m} \frac{\partial J}{\partial z^{(i)}} \frac{\partial z^{(i)}}{\partial w_{j}} = \frac{1}{m} \sum_{i = 1}^{m} (\hat{y}^{(i)} - y^{(i)}) x_{i, j}

where $x_{i, j}$ is the $j$-th feature of the $i$-th example. Let’s assume each example has $n$ features, so $\frac{\partial J}{\partial w_{j}}$ is the partial derivative with respect to the $j$-th of $n$ weights. The process is similar for $b$.

\frac{\partial J}{\partial b} = \sum_{i = 1}^{m} \frac{\partial J}{\partial z^{(i)}} \frac{\partial z^{(i)}}{\partial b} = \frac{1}{m} \sum_{i = 1}^{m} (\hat{y}^{(i)} - y^{(i)})
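These sums map directly onto matrix operations. The following NumPy sketch computes the cost and both gradients without any explicit loop; the function name, the `(m, n)` layout of `X`, and the sample data are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradients_vectorized(X, y, w, b):
    """Cost and gradients for all m examples at once.

    X has shape (m, n) so that X[i, j] is the j-th feature of the i-th
    example; y has shape (m,) and w has shape (n,).
    """
    m = X.shape[0]
    y_hat = sigmoid(X @ w + b)                       # shape (m,)
    J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    dz = y_hat - y                                   # shape (m,)
    dw = X.T @ dz / m                                # dJ/dw_j for every j
    db = dz.sum() / m                                # dJ/db
    return J, dw, db

# Tiny made-up dataset for illustration
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])
J, dw, db = logistic_gradients_vectorized(X, y, w=np.zeros(2), b=0.0)
print(J, dw, db)
```

The product `X.T @ dz` performs the sum over $i$ for every weight $j$ in one call, replacing both for loops.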

Neural Network (Multi-Layer Perceptron)

Math

Here we discuss the math behind backpropagation for a simple multi-layer perceptron network (2 layers). Notice that we use the vectorized version, with $m$ training examples in one pass. A training example can have $d$ dimensions, or features. Let’s review the basic notation first.

[Figure: the example network and its notation]

The output of the nodes in layer 1 can be written as a function $a^{(1)} = \sigma (z^{(1)})$, where $z^{(1)}$ is

z^{(1)} = W^{(1) T} X^{(0)} + b

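and where the input (or the output of the previous layer) is denoted by $X$:

X_{n \times m} = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ x_{1} & x_{2} & x_{3} & \dots & x_{m} \\ \vdots & \vdots & \vdots & & \vdots \end{bmatrix}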

where each column $x_{i}$ is a single training example. Each $x_{i}, i = 1, 2, \dots, m$ has dimension $n \times 1$, where $n$ represents the number of nodes in the previous layer, or 3 in this case. For the input layer only, $n$ equals the number of features $d$. The inputs for example $i$ would be

x_{i} = \begin{bmatrix} x_{1, i} \\ x_{2, i} \\ x_{3, i} \end{bmatrix}_{3 \times 1}, \quad i = 1, \dots, m

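If we look at a single training example $x$,

a^{(1)} = \sigma (z^{(1)}) = \begin{bmatrix} \sigma (z_{1}^{(1)}) \\ \sigma (z_{2}^{(1)}) \\ \sigma (z_{3}^{(1)}) \\ \sigma (z_{4}^{(1)}) \end{bmatrix}_{4 \times 1}, \quad z^{(1)} = W^{(1) T} x + b = \begin{bmatrix} w_{1, 1} & w_{2, 1} & w_{3, 1} & w_{4, 1} \\ w_{1, 2} & w_{2, 2} & w_{3, 2} & w_{4, 2} \\ w_{1, 3} & w_{2, 3} & w_{3, 3} & w_{4, 3} \end{bmatrix}_{3 \times 4}^{T} \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \end{bmatrix}_{3 \times 1} + b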

Notice that in $W$, each column holds the weights connected to the same node in the current layer. In other terms, the indices read $w_{(To, From)}$. For example, the output of the first node of the hidden layer is $a_{1}^{(1)} = \sigma (z_{1}^{(1)})$ with $z_{1}^{(1)} = w_{1, 1}^{(1)} x_{1}^{(0)} + w_{1, 2}^{(1)} x_{2}^{(0)} + w_{1, 3}^{(1)} x_{3}^{(0)} + b$. Because these weights sit in a column of $W$, we have to take the transpose for the math to work out correctly. Thus, the vectorized formula is $z_{k \times m}^{(1)} = W_{n \times k}^{(1) T} X_{n \times m}^{(0)} + b_{k \times 1}$, where $k$ is the number of nodes in the new layer.
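The shapes in this formula are easy to verify in NumPy. The sketch below uses random values and assumed sizes ($n = 3$, $k = 4$, $m = 5$); note that the array index order is `W[from, to]`, because each column of `W` belongs to one node of the new layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, k, m = 3, 4, 5                  # previous-layer nodes, new-layer nodes, examples
rng = np.random.default_rng(0)     # fixed seed so the sketch is reproducible

W = rng.normal(size=(n, k))        # column j holds the weights into node j
b = np.zeros((k, 1))               # one bias per node, broadcast over the m examples
X = rng.normal(size=(n, m))        # each column is one example

Z = W.T @ X + b                    # shape (k, m)
A = sigmoid(Z)                     # shape (k, m)

# Pre-activation of the first node for the first example, written term by term
z_11 = W[0, 0] * X[0, 0] + W[1, 0] * X[1, 0] + W[2, 0] * X[2, 0] + b[0, 0]

print(Z.shape, A.shape)
```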

Forward Propagation

For layer 1 (the layer after the input layer), $k$ is the number of nodes in the new layer and $n$ is the number of nodes in the previous layer (layer 0).

α_{0}^{(1)} = σ (w_{0, 0} α_{0}^{(0)} + w_{0, 1} α_{1}^{(0)} + w_{0, 2} α_{2}^{(0)} + \dots + w_{0, n} α_{n}^{(0)})

α_{1}^{(1)} = σ (w_{1, 0} α_{0}^{(0)} + w_{1, 1} α_{1}^{(0)} + w_{1, 2} α_{2}^{(0)} + \dots + w_{1, n} α_{n}^{(0)})

α_{k}^{(1)} = σ (w_{k, 0} α_{0}^{(0)} + w_{k, 1} α_{1}^{(0)} + w_{k, 2} α_{2}^{(0)} + \dots + w_{k, n} α_{n}^{(0)})

In matrix form:

[w_{k, n}]_{k \times n}^{(0)} [α_{n}]_{n \times 1}^{(0)} + [β_{k}]_{k \times 1}^{(0)} = [α_{k}]_{k \times 1}^{(1)}

Cost Function

Remember the following about the cost function (or loss function) of a deep model:

We need to consider the average cost over all training examples.
The inputs of the cost function are all the weights and biases, and it outputs a single number describing how bad those weights and biases are.
The full cost function averages a cost-per-example over all training examples, so the way we adjust the weights and biases in a single step also depends on every single example.

Below, we first explain how to calculate the gradient of the cost with respect to the parameters of a simplified model.

Gradient and Gradient Descent

The gradient $\nabla C$ tells us which direction we should step to increase the cost function most quickly. Thus, its negative, $- \nabla C$, gives the direction in which we should step to decrease the function most quickly. The length of the gradient vector tells us how steep that slope is. “Gradient descent” is the algorithm for minimizing this function: take a step downhill and repeat.

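- \nabla C (W) = - \begin{bmatrix} \frac{\partial C}{\partial w_{1}} \\ \frac{\partial C}{\partial w_{2}} \\ \vdots \\ \frac{\partial C}{\partial w_{N}} \end{bmatrix}_{N}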

Backpropagation is an algorithm for calculating the negative gradient.

The magnitude of each component of the gradient tells you how sensitive the cost function is to the corresponding weight or bias.
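To make "take a step downhill and repeat" concrete, here is a minimal sketch of gradient descent on a hand-picked cost function. The bowl-shaped cost, its minimum at $(3, -1)$, the learning rate, and the step count are all assumptions chosen so the result is easy to check.

```python
def cost(w):
    """A simple bowl-shaped cost with its minimum at (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def gradient(w):
    """Gradient of the cost: the direction of steepest increase."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

w = [0.0, 0.0]
learning_rate = 0.1
for _ in range(100):
    g = gradient(w)
    # Step along the negative gradient to decrease the cost
    w = [w_i - learning_rate * g_i for w_i, g_i in zip(w, g)]

print(w, cost(w))
```

In a neural network the gradient is not written by hand like this; backpropagation is what computes it.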

Updating the Weights and Biases

Example

Let’s calculate the gradient with backpropagation on an extremely simple network.

[Figure: the simple network used in this example]

Let’s define the final node as $a^{(L)}$ and the second-to-last node as $a^{(L - 1)}$. The cost for a single training example is then $C_{o} = (a^{(L)} - y)^{2}$, where $y$ is the desired output. We already know $a^{(L)} = \sigma (w^{(L)} a^{(L - 1)} + b^{(L)})$. Let’s set $z^{(L)} = w^{(L)} a^{(L - 1)} + b^{(L)}$, so that $a^{(L)} = \sigma (z^{(L)})$.

Now consider $\frac{\partial C_{o}}{\partial w^{(L)}}$, which tells us how a nudge to a particular weight in the last layer will affect the cost for that one particular training example. By the chain rule, $\frac{\partial C_{o}}{\partial w^{(L)}} = \frac{\partial z^{(L)}}{\partial w^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_{o}}{\partial a^{(L)}}$. We can calculate each factor: $\frac{\partial C_{o}}{\partial a^{(L)}} = 2 (a^{(L)} - y), \frac{\partial z^{(L)}}{\partial w^{(L)}} = a^{(L - 1)}, \frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma' (z^{(L)})$.

The gradient for the bias is $\frac{\partial C_{o}}{\partial b^{(L)}} = \frac{\partial z^{(L)}}{\partial b^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_{o}}{\partial a^{(L)}}$. Since $\frac{\partial z^{(L)}}{\partial b^{(L)}} = 1$, it simplifies to $\sigma' (z^{(L)}) \cdot 2 (a^{(L)} - y)$.
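The three chain-rule factors can be multiplied out and compared with a numerical derivative. This is a sketch with assumed values for `w`, `b`, `a_prev`, and `y`, and it assumes the activation is the sigmoid so that $\sigma' (z) = a (1 - a)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, a_prev, y):
    """Cost of one example for a single-neuron last layer."""
    return (sigmoid(w * a_prev + b) - y) ** 2

# Assumed example values
w, b, a_prev, y = 0.5, -0.2, 0.8, 1.0

z = w * a_prev + b
a = sigmoid(z)

dC_da = 2 * (a - y)              # dC/da
da_dz = a * (1 - a)              # sigma'(z) for the sigmoid
dC_dw = a_prev * da_dz * dC_da   # dz/dw = a_prev
dC_db = 1.0 * da_dz * dC_da      # dz/db = 1

# Central finite differences as an independent check
eps = 1e-6
dC_dw_numeric = (cost(w + eps, b, a_prev, y) - cost(w - eps, b, a_prev, y)) / (2 * eps)
dC_db_numeric = (cost(w, b + eps, a_prev, y) - cost(w, b - eps, a_prev, y)) / (2 * eps)

print(dC_dw, dC_dw_numeric)
print(dC_db, dC_db_numeric)
```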

The full cost function for the network is $C = \frac{1}{n} \sum_{k = 0}^{n - 1} C_{k}$, where $n$ is the number of training examples, so the gradient of $C$ is the average of the per-example gradients. Thus, with some matrix multiplication, the above equations can be used to update all weights and biases in the last layer.

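But this is just the last layer. How would we update the weights for the previous layers? Well, notice how $a^{(L)}$ is defined: $a^{(L)} = \sigma (w^{(L)} a^{(L - 1)} + b^{(L)})$. It depends on the previous activation $a^{(L - 1)}$, so the chain rule can be applied again to move one layer further back.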

Let’s summarize this process.

    \begin{algorithm}
    \caption{DeepLearning}
    \begin{algorithmic}
	\Input{$a^{[l-1]}$}
	\Output{$a^{[l]}, \text{cache}\left(z^{[l]}\right)$}
      \Procedure{Forward}{} \Comment{for layer $l$}
        \State $z^{[l]} \gets w^{[l]}a^{[l-1]} + b^{[l]}$
        \State $a^{[l]} \gets g^{[l]}\left(z^{[l]}\right)$
      \EndProcedure
      \Input{$da^{[l]}$}
      \Output{$da^{[l-1]}, dw^{[l]}, db^{[l]}$}
      \Procedure{Backward}{}
        \State $dz^{[l]} \gets da^{[l]}*g^{[l]'}\left(z^{[l]}\right)$ \Comment{element-wise product}
        \State $dw^{[l]} \gets dz^{[l]}\cdot a^{[l-1]T}$
        \State $db^{[l]} \gets dz^{[l]}$
        \State $da^{[l-1]} \gets w^{[l]T} \cdot dz^{[l]}$
      \EndProcedure
      \end{algorithmic}
    \end{algorithm}

If vectorized, we can use $A$ instead of $a$ , $Z$ instead of $z$ , $W$ instead of $w$ .
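A vectorized version of the two procedures might look like the following NumPy sketch. Several choices here are assumptions rather than part of the notes: the activation $g$ is taken to be the sigmoid, examples are stored as columns, `dA` is assumed to already carry the $\frac{1}{m}$ factor of the cost, and `dW` and `db` therefore sum over the examples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(A_prev, W, b):
    """Forward step for one layer; returns the activation and the cached Z."""
    Z = W @ A_prev + b
    return sigmoid(Z), Z

def backward(dA, Z, A_prev, W):
    """Backward step for one layer, vectorized over m examples (columns)."""
    dZ = dA * sigmoid_prime(Z)              # element-wise product
    dW = dZ @ A_prev.T                      # summed over examples
    db = dZ.sum(axis=1, keepdims=True)      # summed over examples
    dA_prev = W.T @ dZ                      # passed on to the previous layer
    return dA_prev, dW, db

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(3, 5))            # 3 inputs, 5 examples (assumed sizes)
W = rng.normal(size=(2, 3))
b = np.zeros((2, 1))
Y = rng.uniform(size=(2, 5))

A, Z = forward(A_prev, W, b)
dA = (A - Y) / Y.shape[1]                   # gradient of the assumed cost sum((A - Y)^2) / (2m)
dA_prev, dW, db = backward(dA, Z, A_prev, W)
print(dW.shape, db.shape, dA_prev.shape)
```

Each returned gradient has the same shape as the quantity it belongs to, which is a quick way to catch a misplaced transpose.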

MLP From Scratch

Here we discuss how an MLP can be implemented from scratch, using only NumPy. First, let’s visualize our simple neural network with one node in the hidden layer.

[Figure: neural network with one node in the hidden layer]

Let’s mathematically define our neural network.

We define input $X = [x_{1}, x_{2}, \dots, x_{m}]$
We define weights $W = [w_{1}, w_{2}, \dots, w_{m}]$
We define the activation function $σ$ to be the ReLU function.
We define the output of the hidden layer before the activation to be $z = (W^{T} X + b)$
We define the output of the hidden layer after the activation to be $h = \mathrm{ReLU} (z)$
We define the output layer to be $\hat{y} = h$
We define the loss function $J$ to be the mean squared error $J = \frac{1}{m} \sum_{i = 1}^{m} (y^{(i)} - \hat{y}^{(i)})^{2}$, where $i$ indexes the training examples.

Before building our model, let’s calculate our gradients.

\frac{\partial J}{\partial \hat{y}} = \frac{2}{m} \sum_{i = 1}^{m} (\hat{y}^{(i)} - y^{(i)}) \quad (1)

\frac{\partial J}{\partial z} = \frac{\partial J}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial h} \frac{\partial h}{\partial z} = \frac{\partial J}{\partial \hat{y}} (1) (\mathrm{ReLU}' (z)) \quad (2)

\frac{\partial J}{\partial W} = \frac{\partial J}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial h} \frac{\partial h}{\partial z} \frac{\partial z}{\partial W} = \frac{\partial J}{\partial z} (X) \quad (3)

\frac{\partial J}{\partial b} = \frac{\partial J}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial h} \frac{\partial h}{\partial z} \frac{\partial z}{\partial b} = \frac{\partial J}{\partial z} \quad (4)

where the derivative of $\mathrm{ReLU}$ is $\mathrm{ReLU}' (x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$. Let’s now define our model.
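The model code itself is not reproduced in these notes, so the following is only a sketch of what such a NumPy model could look like under the definitions above. The names `W1`, `b1`, `dW1`, and `db1` mirror the ones used in the remark further down; the function `train`, the data, the initialization, and the hyperparameters are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)

def train(X, Y, n_hidden=1, lr=0.05, steps=500, seed=0):
    """Fit y_hat = ReLU(W1 X + b1) by gradient descent on the mean squared error.

    X has shape (k, m): k features, m examples. Y has shape (n_hidden, m).
    """
    rng = np.random.default_rng(seed)
    k, m = X.shape
    W1 = rng.uniform(0.1, 0.5, size=(n_hidden, k))   # small positive start keeps ReLU active
    b1 = np.zeros((n_hidden, 1))
    losses = []
    for _ in range(steps):
        Z = W1 @ X + b1                              # pre-activation, shape (n_hidden, m)
        Y_hat = relu(Z)                              # y_hat = h = ReLU(z)
        losses.append(np.sum((Y - Y_hat) ** 2) / m)  # mean squared error
        dY_hat = 2.0 * (Y_hat - Y) / m               # equation (1), one column per example
        dZ = dY_hat * relu_prime(Z)                  # equation (2)
        dW1 = np.dot(dZ, X.T)                        # equation (3), summed over examples
        db1 = np.sum(dZ, axis=1, keepdims=True)      # equation (4), summed over examples
        W1 -= lr * dW1
        b1 -= lr * db1
    return W1, b1, losses

# Made-up data: targets follow y = 2*x1 + x2 + 1 on inputs in [0, 1)
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(2, 50))
Y = (2.0 * X[0] + 1.0 * X[1] + 1.0).reshape(1, -1)
W1, b1, losses = train(X, Y)
print(losses[0], losses[-1])
```

Because the targets are positive, the single ReLU unit stays in its linear region here and the loss falls steadily; with other data a unit can get stuck at zero, which this sketch does not handle.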

And that’s an MLP network!

Remark: Here, we generalize the math behind db1 and dW1 to an MLP with multiple hidden neurons and multiple outputs, but still one layer. Notice that the gradient with respect to the bias is a summation over the training examples, which is why we perform np.sum for db1. What about the summation for dW1? Well, we know that $\frac{\partial J}{\partial W} = \frac{2}{m} \sum_{i = 1}^{m} (\hat{y}^{(i)} - y^{(i)}) (1) (\mathrm{ReLU}' (z)) (X)$. Let’s break down this formula. We first define $X$ and $W$ as

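W_{(n, k)} = \begin{bmatrix} w_{1, 1} & w_{1, 2} & \dots & w_{1, k} \\ w_{2, 1} & w_{2, 2} & \dots & w_{2, k} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n, 1} & w_{n, 2} & \dots & w_{n, k} \end{bmatrix}, \quad X_{(k, m)} = \begin{bmatrix} x_{1, 1} & x_{2, 1} & \dots & x_{m, 1} \\ x_{1, 2} & x_{2, 2} & \dots & x_{m, 2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1, k} & x_{2, k} & \dots & x_{m, k} \end{bmatrix}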

where $k$ is the number of features of each training example, $n$ is the number of nodes in the hidden layer, and $x_{i, j}$ is the $j$-th feature of the $i$-th example. Then we get $Z = W X + b$, which is an $(n, m)$ matrix. Note that $b$ is an $(n, 1)$ matrix. This means that $H = \mathrm{ReLU} (Z)$ and its derivative $\mathrm{ReLU}' (Z)$ are also $(n, m)$ matrices. We can ignore the derivative in the calculations below because it has the same shape as our output and is applied element-wise. Essentially, equation $(2)$ shown above has a shape of $(n, m)$ as well.

$y$, the ground truth, is an $n \times m$ matrix, where $n$ is the number of outputs (output layer). Thus, we can represent $\hat{y} - y$ as

(\hat{y} - y)_{(n, m)} = \begin{bmatrix} (\hat{y}_{1}^{1} - y_{1}^{1}) & (\hat{y}_{1}^{2} - y_{1}^{2}) & \dots & (\hat{y}_{1}^{m} - y_{1}^{m}) \\ (\hat{y}_{2}^{1} - y_{2}^{1}) & (\hat{y}_{2}^{2} - y_{2}^{2}) & \dots & (\hat{y}_{2}^{m} - y_{2}^{m}) \\ \vdots & \vdots & \ddots & \vdots \\ (\hat{y}_{n}^{1} - y_{n}^{1}) & (\hat{y}_{n}^{2} - y_{n}^{2}) & \dots & (\hat{y}_{n}^{m} - y_{n}^{m}) \end{bmatrix}

Note that the superscript in $\hat{y}^{i}$ does not mean a power; it indexes the training example. Note also that the gradient with respect to $W$ must have the same shape as $W$, so that we can update every weight in the first layer. We perform basic matrix multiplication, reflected by np.dot.

\frac{\partial J}{\partial W}_{(n, k)} = \frac{2}{m} \begin{bmatrix} (\hat{y}_{1}^{1} - y_{1}^{1}) & (\hat{y}_{1}^{2} - y_{1}^{2}) & \dots & (\hat{y}_{1}^{m} - y_{1}^{m}) \\ (\hat{y}_{2}^{1} - y_{2}^{1}) & (\hat{y}_{2}^{2} - y_{2}^{2}) & \dots & (\hat{y}_{2}^{m} - y_{2}^{m}) \\ \vdots & \vdots & \ddots & \vdots \\ (\hat{y}_{n}^{1} - y_{n}^{1}) & (\hat{y}_{n}^{2} - y_{n}^{2}) & \dots & (\hat{y}_{n}^{m} - y_{n}^{m}) \end{bmatrix}_{(n, m)} \begin{bmatrix} x_{1, 1} & x_{1, 2} & \dots & x_{1, k} \\ x_{2, 1} & x_{2, 2} & \dots & x_{2, k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m, 1} & x_{m, 2} & \dots & x_{m, k} \end{bmatrix}_{(m, k)}

which gives us

\frac{\partial J}{\partial W}_{(n, k)} = \frac{2}{m} \begin{bmatrix} \sum_{i = 1}^{m} x_{i, 1} (\hat{y}_{1}^{i} - y_{1}^{i}) & \sum_{i = 1}^{m} x_{i, 2} (\hat{y}_{1}^{i} - y_{1}^{i}) & \dots & \sum_{i = 1}^{m} x_{i, k} (\hat{y}_{1}^{i} - y_{1}^{i}) \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i = 1}^{m} x_{i, 1} (\hat{y}_{n}^{i} - y_{n}^{i}) & \sum_{i = 1}^{m} x_{i, 2} (\hat{y}_{n}^{i} - y_{n}^{i}) & \dots & \sum_{i = 1}^{m} x_{i, k} (\hat{y}_{n}^{i} - y_{n}^{i}) \end{bmatrix}

For example, the first entry expands to $x_{1, 1} (\hat{y}_{1}^{1} - y_{1}^{1}) + x_{2, 1} (\hat{y}_{1}^{2} - y_{1}^{2}) + \dots + x_{m, 1} (\hat{y}_{1}^{m} - y_{1}^{m})$.
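The claim that one matrix product performs all of these sums can be checked numerically. In this sketch the sizes and random values are assumptions, `E` stands in for $(\hat{y} - y)$, and the $\mathrm{ReLU}'$ factor is left out as in the derivation above.

```python
import numpy as np

n, k, m = 2, 3, 4                      # outputs, features, examples (assumed sizes)
rng = np.random.default_rng(0)
E = rng.normal(size=(n, m))            # stands in for (y_hat - y), one column per example
X = rng.normal(size=(k, m))            # one column per example

dW = (2.0 / m) * np.dot(E, X.T)        # shape (n, k), same as W

# The same gradient written as explicit sums over the m examples
dW_loops = np.zeros((n, k))
for r in range(n):
    for c in range(k):
        for i in range(m):
            dW_loops[r, c] += (2.0 / m) * X[c, i] * E[r, i]

db = (2.0 / m) * np.sum(E, axis=1, keepdims=True)   # shape (n, 1), same as b

print(np.allclose(dW, dW_loops), dW.shape, db.shape)
```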
