Backpropagation in Neural Networks: A Step-by-Step Guide

Backpropagation is the core algorithm for training neural networks. It efficiently computes gradients of the loss function with respect to each weight using the chain rule of calculus, enabling optimization via gradient descent.


1. Key Concepts

A. Neural Network Basics

  • Layers: Input → Hidden → Output.
  • Neurons: Apply weights, bias, and activation (e.g., ReLU, Sigmoid).
  • Forward Pass: Compute predictions from input to output.
  • Loss Function: Measures prediction error (e.g., Mean Squared Error, Cross-Entropy).
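To make the first three bullets concrete, here is a minimal sketch of a single neuron in Python, using arbitrary example inputs, weights, and a ReLU activation:

# One neuron: weighted sum of the inputs, plus a bias, passed through an activation
inputs  = [0.5, -1.0, 2.0]       # example inputs
weights = [0.2,  0.4, -0.1]      # example weights
bias    = 0.3

z = sum(w * x for w, x in zip(weights, inputs)) + bias   # 0.1 - 0.4 - 0.2 + 0.3 = -0.2
a = max(0.0, z)                                          # ReLU: max(0, z) -> 0.0
print(z, a)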

B. Backpropagation Intuition

  • Goal: Adjust weights to minimize loss.
  • Method: Propagate error backward from output to input, computing gradients for each weight.

2. Mathematical Foundations

A. Chain Rule

For a composite function \( f(g(x)) \):
\[ \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} \]
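As a quick numerical illustration (using an arbitrarily chosen composite \( f(g) = g^2 \) with \( g(x) = \sin x \)), the chain-rule product can be checked against a finite-difference estimate:

import math

def g(x): return math.sin(x)
def f(u): return u ** 2           # so f(g(x)) = sin(x)^2

x = 0.7
# Chain rule: df/dx = (df/dg) * (dg/dx) = 2*g(x) * cos(x)
analytic = 2 * g(x) * math.cos(x)

# Central finite difference of d/dx f(g(x))
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)
print(analytic, numeric)          # the two values should agree to several decimals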

B. Gradient Descent Update Rule

\[ w_{new} = w_{old} - \eta \cdot \frac{\partial L}{\partial w} \]

  • \( \eta \): Learning rate.
  • \( \frac{\partial L}{\partial w} \): Gradient of loss \( L \) w.r.t. weight \( w \).
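In code, the update is a single line; the sketch below assumes a scalar weight and an already-computed gradient:

def gradient_descent_step(w_old, grad_L_w, eta=0.1):
    # w_new = w_old - eta * dL/dw
    return w_old - eta * grad_L_w

print(gradient_descent_step(2.0, 0.5))   # 2.0 - 0.1 * 0.5 = 1.95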

3. Step-by-Step Backpropagation

Consider a simple 2-layer NN:

  1. Input Layer → Hidden Layer (activation: Sigmoid \( \sigma \)).
  2. Hidden Layer → Output Layer (activation: Linear, for regression).

Step 1: Forward Pass

  • Input to Hidden: \[ z_h = w_1 x + b_1, \quad a_h = \sigma(z_h) \]
  • Hidden to Output: \[ z_o = w_2 a_h + b_2, \quad \hat{y} = z_o \quad (\text{Linear}) \]
  • Loss (MSE): \[ L = \frac{1}{2}(y - \hat{y})^2 \]
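A minimal numeric sketch of this forward pass, with arbitrarily chosen scalar values for the input, target, and parameters:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary example values
x, y = 1.0, 0.5
w1, b1 = 0.4, 0.1
w2, b2 = 0.3, -0.2

# Input -> hidden
z_h = w1 * x + b1            # 0.5
a_h = sigmoid(z_h)           # ~0.6225

# Hidden -> output (linear)
y_hat = w2 * a_h + b2        # ~-0.0133

# MSE loss
L = 0.5 * (y - y_hat) ** 2   # ~0.1317
print(y_hat, L)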

Step 2: Backward Pass (Gradient Calculation)

1. Compute \( \frac{\partial L}{\partial w_2} \) (Output Layer)

\[ \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \cdot \frac{\partial z_o}{\partial w_2} \]
\[ \frac{\partial L}{\partial \hat{y}} = -(y - \hat{y}), \quad \frac{\partial \hat{y}}{\partial z_o} = 1, \quad \frac{\partial z_o}{\partial w_2} = a_h \]
\[ \boxed{\frac{\partial L}{\partial w_2} = -(y - \hat{y}) \cdot a_h} \]

2. Compute \( \frac{\partial L}{\partial w_1} \) (Hidden Layer)

\[ \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_o} \cdot \frac{\partial z_o}{\partial a_h} \cdot \frac{\partial a_h}{\partial z_h} \cdot \frac{\partial z_h}{\partial w_1} \]
\[ \frac{\partial z_o}{\partial a_h} = w_2, \quad \frac{\partial a_h}{\partial z_h} = \sigma(z_h)(1 - \sigma(z_h)), \quad \frac{\partial z_h}{\partial w_1} = x \]
\[ \boxed{\frac{\partial L}{\partial w_1} = -(y - \hat{y}) \cdot w_2 \cdot \sigma'(z_h) \cdot x} \]

3. Compute Bias Gradients (Similar Logic)

\[ \frac{\partial L}{\partial b_2} = -(y - \hat{y}), \quad \frac{\partial L}{\partial b_1} = -(y - \hat{y}) \cdot w_2 \cdot \sigma'(z_h) \]
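Continuing the same scalar example (the values assumed in the forward-pass sketch above), all four gradients follow directly from the boxed formulas, and one of them can be spot-checked with a finite difference:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.0, 0.5
w1, b1, w2, b2 = 0.4, 0.1, 0.3, -0.2

# Forward pass (as above)
z_h = w1 * x + b1
a_h = sigmoid(z_h)
y_hat = w2 * a_h + b2

# Shared error term: dL/dy_hat = -(y - y_hat)
err = -(y - y_hat)
sig_prime = a_h * (1.0 - a_h)     # sigma'(z_h)

dL_dw2 = err * a_h
dL_db2 = err
dL_dw1 = err * w2 * sig_prime * x
dL_db1 = err * w2 * sig_prime

# Finite-difference check of dL/dw1
def loss(w1_val):
    a = sigmoid(w1_val * x + b1)
    return 0.5 * (y - (w2 * a + b2)) ** 2

h = 1e-6
print(dL_dw1, (loss(w1 + h) - loss(w1 - h)) / (2 * h))   # should agree closely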


4. Generalization to Deep Networks

For a network with \( L \) layers:

  1. Forward Pass: Compute activations \( a^{(l)} \) for all layers.
  2. Backward Pass:
    • Initialize the error term at the output: \[ \delta^{(L)} = \nabla_{\hat{y}} L \odot f'(z^{(L)}) \]
    • Propagate it backward: \[ \delta^{(l)} = (w^{(l+1)})^T \delta^{(l+1)} \odot f'(z^{(l)}) \]
    • Compute gradients: \[ \frac{\partial L}{\partial w^{(l)}} = \delta^{(l)} (a^{(l-1)})^T, \quad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)} \]
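A minimal vectorized sketch of this recursion, assuming column-vector inputs, sigmoid activations at every layer, and a squared-error loss (not a production implementation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_deep(x, y, W, b):
    # W, b: lists of weight matrices / bias column vectors for layers 1..L
    # Forward pass: store pre-activations z and activations a
    a, zs = [x], []
    for Wl, bl in zip(W, b):
        zs.append(Wl @ a[-1] + bl)
        a.append(sigmoid(zs[-1]))

    # Output error: delta^(L) = dL/da^(L) * f'(z^(L)) elementwise; dL/da^(L) = a^(L) - y here
    delta = (a[-1] - y) * sigmoid_prime(zs[-1])

    grads_W, grads_b = [None] * len(W), [None] * len(b)
    grads_W[-1] = delta @ a[-2].T
    grads_b[-1] = delta

    # Propagate backward: delta^(l) = (W^(l+1))^T delta^(l+1) * f'(z^(l)) elementwise
    for l in range(len(W) - 2, -1, -1):
        delta = (W[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grads_W[l] = delta @ a[l].T
        grads_b[l] = delta

    return grads_W, grads_b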

5. Practical Considerations

A. Activation Derivatives

Activation | Derivative \( f'(z) \)
Sigmoid    | \( \sigma(z)(1 - \sigma(z)) \)
ReLU       | \( 1 \text{ if } z > 0 \text{ else } 0 \)
Tanh       | \( 1 - \tanh^2(z) \)
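These derivatives map directly onto elementwise NumPy one-liners (a small sketch, assuming array inputs):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_prime(z):
    return (z > 0).astype(float)      # 1 where z > 0, else 0

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2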

B. Common Loss Functions

Loss                   | Derivative \( \frac{\partial L}{\partial \hat{y}} \)
MSE                    | \( \hat{y} - y \)
Cross-Entropy (binary) | \( \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \)
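The matching gradient functions w.r.t. the prediction; the clipping constant is an added assumption for numerical safety, not part of the formulas:

import numpy as np

def mse_grad(y_hat, y):
    # dL/dy_hat for L = 0.5 * (y - y_hat)^2
    return y_hat - y

def binary_cross_entropy_grad(y_hat, y, eps=1e-12):
    # dL/dy_hat for L = -[y log(y_hat) + (1 - y) log(1 - y_hat)]
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # keep the denominator away from 0 and 1
    return (y_hat - y) / (y_hat * (1.0 - y_hat))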

C. Numerical Stability

  • Vanishing Gradients: Use ReLU/Leaky ReLU, skip connections (ResNet).
  • Exploding Gradients: Gradient clipping, weight initialization (Xavier/He).
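A sketch of two of these mitigations in NumPy (global-norm gradient clipping and He initialization); the threshold and shapes are arbitrary example values:

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # Rescale all gradients together if their combined L2 norm exceeds max_norm
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

def he_init(fan_in, fan_out, rng=None):
    # He initialization: std = sqrt(2 / fan_in), suited to ReLU layers
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))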

6. Pseudocode for Backpropagation

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(X, y, weights, biases):
    # Forward pass
    z1 = X @ weights['w1'] + biases['b1']
    a1 = sigmoid(z1)
    z2 = a1 @ weights['w2'] + biases['b2']
    y_pred = z2                              # linear output layer
    
    # Loss gradient (MSE): dL/dy_pred = y_pred - y
    dL_dy = y_pred - y
    
    # Backward pass: output layer
    dL_dw2 = a1.T @ dL_dy
    dL_db2 = np.sum(dL_dy, axis=0)
    
    # Backward pass: hidden layer
    dL_da1 = dL_dy @ weights['w2'].T
    dL_dz1 = dL_da1 * sigmoid_derivative(z1)
    dL_dw1 = X.T @ dL_dz1
    dL_db1 = np.sum(dL_dz1, axis=0)
    
    return {'w1': dL_dw1, 'b1': dL_db1, 'w2': dL_dw2, 'b2': dL_db2}
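
A usage sketch of the function above: a plain gradient-descent loop on a hypothetical toy regression problem (the data, layer sizes, learning rate, and step count are arbitrary choices for illustration):

# Toy data: 1 input feature, 4 hidden units, 1 output, simple linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 1))
y = 0.5 * X + 0.2

weights = {'w1': rng.normal(scale=0.5, size=(1, 4)),
           'w2': rng.normal(scale=0.5, size=(4, 1))}
biases = {'b1': np.zeros(4), 'b2': np.zeros(1)}

eta = 1e-3                                   # learning rate
for step in range(1000):
    grads = backprop(X, y, weights, biases)
    for k in weights:
        weights[k] -= eta * grads[k]         # w_new = w_old - eta * dL/dw
    for k in biases:
        biases[k] -= eta * grads[k]
    if step % 200 == 0:
        y_pred = sigmoid(X @ weights['w1'] + biases['b1']) @ weights['w2'] + biases['b2']
        print(step, 0.5 * np.sum((y - y_pred) ** 2))   # loss should trend downward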

7. Visual Explanation

Input (x) → [w1, b1] → Hidden Layer (σ) → [w2, b2] → Output (ŷ)
          ←  Backward pass (gradients)  ←
  • Top line: the forward pass, computing ŷ from x.
  • Bottom arrow: the backward pass, propagating the error from output → input; the resulting gradients drive the weight updates.

Key Takeaways

  1. Backpropagation computes gradients efficiently using the chain rule.
  2. Forward Pass: Compute predictions and loss.
  3. Backward Pass: Propagate errors and update weights.
  4. Critical for Deep Learning: Enables training of complex architectures (CNNs, RNNs).