Intro to Neural Networks

A goldmine for AI beginners: I can't believe I am giving this for free.

Abhiram kandiyana
August 01, 2024

I know. I took a long break from writing blogs due to some expected and unexpected events. But, I am back now. And we are going to continue the NLP series right where we stopped.

So, the last blog was on SVMs and kernel tricks. And that was the end of linear models (or traditional ML models). On their own, without tricks like kernels, linear models cannot fit data with non-linear relationships.

Now it’s time to move to your favorite algorithm, Neural Networks (NN). This blog will briefly introduce neural networks; the next blog will be about using them for NLP.

Heads up! This is a long a** blog. So, get a pen and paper, fill up your bottle of water, put your phone in DnD, and prepare for the best intro to NN you have ever seen.

Firstly, what is a neuron? Let’s say a neuron is a computational unit with scalar inputs and a scalar output. Each neuron has an associated weight for each input. The neuron multiplies each input by its corresponding weight and then sums these products. There is also an additional bias term that is added to this sum. The bias shifts the result, so the neuron is not forced to output zero when all of its inputs are zero.
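As a minimal sketch of the neuron described above, the weighted sum plus bias can be written in plain Python. The function name `neuron` and the sample numbers are made up for illustration.

```python
def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias (no activation yet)."""
    # Multiply each input by its corresponding weight, then add everything up.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return weighted_sum + bias


# Illustrative values only: two inputs, two weights, one bias.
z = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
print(z)
```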

But these are simple multiplications and additions. There is no non-linearity in these operations, so on their own they cannot capture non-linear relationships in data. That is where activation functions come in. The weighted sum of each neuron is passed through a non-linear activation function, and the result is then sent to other neurons. There are many popular activation functions, and the right choice depends on the task and the layer. Let’s look at a few of them below.

1. Activation Functions

1.1 ReLU

ReLU(x) = max(0, x)

x is the input of the ReLU function.

1.2 Tanh

1.3 Sigmoid

1.4 Softmax

1.5 GeLU

1.6 Rules of Thumb

Start with ReLU for hidden layers, then try GeLU and Tanh
If binary classification, use Sigmoid for output
If multi-class, use Softmax for output
For transformer-based models, start with GeLU
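The five activation functions in this section can be sketched in plain Python as follows. These are the standard textbook definitions. The GeLU here is the exact form based on the normal CDF, and frameworks often use a faster approximation instead.

```python
import math


def relu(x):
    # max(0, x): passes positive values through, clips negatives to 0.
    return max(0.0, x)


def tanh(x):
    # Squashes x into the range (-1, 1).
    return math.tanh(x)


def sigmoid(x):
    # Squashes x into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    # Turns a list of scores into probabilities that sum to 1.
    # Subtracting the max first is a common trick to avoid overflow.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gelu(x):
    # Exact GeLU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Note that Softmax is the odd one out: it works on a whole list of scores (one per class) rather than on a single number.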

2. Layers in a Neural Network

Neural networks are organized into layers. Each layer in a NN has multiple neurons stacked on top of each other. In this way, we have multiple layers where the output of each layer is passed as the input to the next layer:

Input Layer: The first layer that receives the initial data.
Hidden Layers: Intermediate layers that process the data. A network can have multiple hidden layers.
Output Layer: The final layer that provides the output of the network.

[Figure: a neural network with input, hidden, and output layers. Image Source]

NOTE

1. The circles in the input layer are not neurons. They are just individual features of the data passed as input to the hidden layers.

2. Each circle in the hidden and output layers represents a neuron followed by an activation function.

Now, the above figure is a simple example of a neural network. Companies build and use neural networks with hundreds of hidden layers.
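To make the layer idea concrete, here is a rough sketch of how a stack of layers could be wired together. The helper names (`layer_forward`, `network_forward`), the choice of sigmoid everywhere, and all the numbers are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """One layer: every neuron sees all inputs and produces one output."""
    outputs = []
    # Each inner list in `weights` holds the weights of one neuron.
    for neuron_weights, bias in zip(weights, biases):
        z = sum(x * w for x, w in zip(inputs, neuron_weights)) + bias
        outputs.append(sigmoid(z))
    return outputs


def network_forward(features, layers):
    """Pass the features through each layer in turn."""
    activations = features
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 3 input features -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.1, 0.2, 0.3], [-0.3, 0.2, 0.1]], [0.0, 0.0]),  # hidden layer
    ([[0.5, -0.5]], [0.0]),                              # output layer
]
output = network_forward([1.0, 2.0, 3.0], layers)
print(output)
```

Adding more hidden layers is just a matter of appending more `(weights, biases)` pairs to the list.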

So far, I have described how a neural network works in the feed-forward direction, i.e., the inputs are passed to the network and go through a chain of linear and non-linear operations to produce the output. The model’s output can then be used to make decisions for our problem, be it classification or text completion. Let’s dive deep into the feed-forward mechanism.

3. Feed-forward

The best way to teach humans or AI is with examples. So, let’s take a simple example to understand feedforward in neural networks.

Here is what we know

1. x₁ and x₂ are the inputs

2. h₁ and h₂ are the outputs of the first and second neurons in the hidden layer respectively

3. y is the output of the neuron in the output layer. This is the final output of the neural network.

4. w₁₁ - w₂₃ are the weights associated with each neuron as represented in the above figure.

Let’s first deal with h₁. Look at the figure above carefully. What are the weights associated with h₁? They are w₁₁ for x₁ and w₂₁ for x₂. Now let us multiply these pairs and then add. Let’s call this z₁:

z₁ = w₁₁x₁ + w₂₁x₂

Alright. This is a simple linear transformation. It doesn’t allow the network to represent complex functions. To add non-linearity, we apply an activation function to it. Let us go with the sigmoid activation function for all the neurons in this NN.

h₁ = sigmoid(z₁) = 1 / (1 + e^(−z₁))

This gives us h₁, the output of the first hidden neuron. Similarly, we calculate h₂ = sigmoid(z₂), where z₂ = w₁₂x₁ + w₂₂x₂.

Let us take example values for the input and weights to calculate h₁ and h₂.
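As a small sketch, here is that calculation in Python. The input and weight values below are made up purely for illustration (they are assumptions, not the numbers from the figure).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Made-up example values (assumptions, not taken from the post's figure).
x1, x2 = 0.5, 1.0
w11, w21 = 0.2, 0.4    # weights feeding the first hidden neuron
w12, w22 = -0.3, 0.1   # weights feeding the second hidden neuron

z1 = w11 * x1 + w21 * x2   # 0.2*0.5 + 0.4*1.0
z2 = w12 * x1 + w22 * x2   # -0.3*0.5 + 0.1*1.0
h1 = sigmoid(z1)
h2 = sigmoid(z2)
print(h1, h2)
```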

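Now, these above equations can be written in a simple form (for convenience) using matrices. One way to write it is

H = sigmoid(X · W)

where,

X = [x₁ x₂] is the row vector of inputs
W is the 2×2 weight matrix with rows [w₁₁ w₁₂] and [w₂₁ w₂₂]
H = [h₁ h₂] is the row vector of hidden outputs, with sigmoid applied to each element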

Similarly, let’s calculate y. Now, for the output layer, h₁ and h₂ are the inputs. The same process repeats but with different inputs and weights. Here the intermediate results are z₃ and z₄.
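The whole feed-forward pass for this small network can be sketched with plain Python lists standing in for matrices. The values are again made up, and `z_y` is just an illustrative name chosen here for the weighted sum feeding the output neuron.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec_row(row, matrix):
    """Multiply a row vector by a matrix (given as a list of rows)."""
    n_cols = len(matrix[0])
    return [sum(row[i] * matrix[i][j] for i in range(len(row)))
            for j in range(n_cols)]


def forward(x, w_hidden, w_out):
    """Forward pass of the 2-2-1 network; returns (hidden outputs, y)."""
    h = [sigmoid(z) for z in matvec_row(x, w_hidden)]   # [h1, h2]
    z_y = matvec_row(h, w_out)[0]                       # w13*h1 + w23*h2
    y = sigmoid(z_y)
    return h, y


# Made-up values for illustration only.
x = [0.5, 1.0]                      # [x1, x2]
w_hidden = [[0.2, -0.3],            # [w11, w12]
            [0.4, 0.1]]             # [w21, w22]
w_out = [[0.7],                     # [w13]
         [-0.6]]                    # [w23]

h, y = forward(x, w_hidden, w_out)
print(h, y)
```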

This is the end of feed-forward. We have our y, output from the NN. Now, you can use this for

classification: for eg, positive if y>0.5 or else negative
prediction: for eg, the estimated stock price for 2036 is y
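next-token generation: for eg, y is mapped to a token, which becomes the first word of ChatGPT’s response (real models output one score per token in the vocabulary rather than a single number).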

But, do we know if this y is accurate? Can we use this NN for any application without causing life-threatening (mostly not) issues?

Well, for that we should evaluate the model. And as per our evaluation, we should change the weights so that the new y is closer to being correct. This whole process is called model training.

There are many ways to train a model. But the most popular method by far is Backpropagation.

4. Backpropagation

Backpropagation is an efficient way to train Neural Networks (NN) by adjusting weights to minimize the loss (error). But

What is loss?
How do we update the weights to minimize this loss?

“Loss in neural network training refers to the measure of how well or poorly a model's predictions match the actual target values. It quantifies the difference between the predicted output of the model and the true output,” says ChatGPT

And it is correct. Let’s assume you know the stock price for Apple in 2024. You train a model to predict this for you. But you use the data until 2023 for training. Your trained model tells you that Apple's stock price in 2024 will be 240 dollars. This value is called the prediction. But you know that the correct value is 218 dollars. This is called ground truth or target value. So, the model was wrong by 22 dollars, which is the loss.

So, the output of the network is used to calculate the loss. Our goal is to minimize this loss function across the training examples using gradient descent. We tune the parameters of the model (weights) in such a way that the loss function is minimized. Gradient descent needs the gradient of the loss with respect to every weight, and an efficient way to compute these gradients is backpropagation.

This is possibly the best video to understand gradient descent:

There are three steps in backpropagation.

Loss Calculation
Backpass
Weight update

4.1 Loss Calculation

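We cannot simply subtract the NN’s prediction from the true value as we did in the above stock price example. The raw difference can be positive or negative, so errors on different examples can cancel each other out, and it leads to other issues (stability, differentiability) with large datasets. Hence, we use a loss function.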

A loss function, also known as a cost function or objective function, is a mathematical function used to quantify the difference between the predicted output of a machine learning model and the ground truth.

To apply back-propagation, the loss function has to be differentiable (at least almost everywhere). Ideally, it should also be:

Non-negative
Convex
Scalable
Robust to outliers

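For the purposes of this blog let us go with a common loss function in ML, Mean Squared Error (MSE). MSE is defined as

L = (1/n) · Σ (y_pred − y_true)², summed over the n training examples

where,

L is the loss function (MSE)
y_pred is the output of the NN
y_true is the expected true value, ground truth
n is the number of training examples (for a single example, L is just the squared difference)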
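A minimal sketch of MSE in Python, assuming the usual definition (the average of the squared differences between predictions and targets). The function name `mse` is just a label chosen here.

```python
def mse(y_pred, y_true):
    """Mean Squared Error over paired predictions and targets."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(y_pred, y_true)]
    return sum(squared_errors) / len(squared_errors)


# Single example from the stock-price story: predicted 240, actual 218.
print(mse([240.0], [218.0]))   # (240 - 218)^2 = 484.0
```

Squaring keeps every term non-negative, so errors in opposite directions no longer cancel out.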

4.2 Weight Update

Before we go ahead with Backpass, let’s take a look at what the update to each weight should be. We can then use Backpass to calculate it.

The weight update equation is,

w_new = w_old − η · ∂L/∂w_old

where

w_old is the weight before the update. For initialization, we use numbers drawn from a random distribution as weights. There are a few popular random initialization functions.
L is the loss value, as discussed above
w_new is the new weight after the update
η is the learning rate, a parameter that controls the step size of each iteration

Here, we are calculating the direction and rate of change of the loss with respect to the weight using ∂L/ ∂W_old and then moving the weight in the opposite direction (hence the subtraction), so that the loss is decreased.

We repeat feed-forward with the new weights, calculate the loss, and then change the weights again to reduce the loss. This process repeats until we reach the convergence point. The convergence point refers to the point after which the loss doesn’t change significantly with updates to the weights.
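Here is a rough sketch of that loop on a toy problem with a single weight. The loss L(w) = (w − 3)², the learning rate, and the tolerance used to decide "doesn’t change significantly" are all assumptions picked for illustration.

```python
def loss(w):
    # Toy loss with its minimum at w = 3 (an assumption for this sketch).
    return (w - 3.0) ** 2


def grad(w):
    # dL/dw for the toy loss above.
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, tolerance=1e-9, max_steps=10_000):
    """Repeat w_new = w_old - lr * dL/dw until the loss stops changing."""
    previous_loss = loss(w)
    for step in range(1, max_steps + 1):
        w = w - learning_rate * grad(w)      # the weight update
        current_loss = loss(w)
        if abs(previous_loss - current_loss) < tolerance:
            return w, step                    # convergence point reached
        previous_loss = current_loss
    return w, max_steps


w_final, steps = gradient_descent(w=0.0)
print(w_final, steps)
```

Try a larger learning rate (say 1.1) to see the loss blow up instead of converging.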

Now, our goal is to calculate ∂L/ ∂W_old for each weight and update the weight using it.

4.3 BackPass

We will have to update 6 weights (w₁₁-w₂₃) to complete an iteration of weight updates. But for this blog, let us focus on w₂₁.

This part includes solving some complex derivatives. But don’t worry. We have a trick up our sleeves. We will use the almighty chain rule extensively to make the math simple.

Firstly, take a hard look at the figure. Nothing new, just a summary of what we learnt so far.

[Figure: feed-forward and backprop focused on w₂₁]

And a legend for the above diagram just in case,

Alright. You have all the equations we know in the above figure. And you know what needs to be calculated, ∂L/ ∂w₂₁. Let’s go for a Differential Calculus Trip.

Before we go ahead, I would highly recommend taking a quick review of the Chain rule. Here is a good resource.

Ok. Let’s start:

Using the chain rule, we rewrite the partial derivative we want to calculate. We come one step back and rewrite ∂L / ∂w₂₁ using intermediate derivatives. Look through the above figure of NN backprop for reference as we go through the equations below.

Now, solving ∂y / ∂w₂₁ will be challenging as we have a complex operation including many variables (h₁, h₂, w₁₃, w₂₃, and the sigmoid activation function). So, let’s make this even simpler using the Chain rule.

This is better. But ∂y / ∂h₁ can be simplified even further. So, let’s rewrite this equation again using the Chain rule.

Good. The first three derivatives can be solved easily as they are direct equations of the denominator. But ∂h₁ / ∂w₂₁ has an activation function in it, which can cause issues. So, let’s simplify that using …., you guessed it right. Chain rule!
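Written out, the successive rewrites described above look roughly like this. The symbol z_y for the weighted sum feeding the output neuron is a label assumed here for illustration, and z_1 is the weighted sum feeding h_1.

```latex
\begin{aligned}
\frac{\partial L}{\partial w_{21}}
  &= \frac{\partial L}{\partial y}\cdot\frac{\partial y}{\partial w_{21}} \\
  &= \frac{\partial L}{\partial y}\cdot\frac{\partial y}{\partial h_1}\cdot\frac{\partial h_1}{\partial w_{21}} \\
  &= \frac{\partial L}{\partial y}\cdot\frac{\partial y}{\partial z_y}\cdot\frac{\partial z_y}{\partial h_1}\cdot\frac{\partial h_1}{\partial w_{21}} \\
  &= \frac{\partial L}{\partial y}\cdot\frac{\partial y}{\partial z_y}\cdot\frac{\partial z_y}{\partial h_1}\cdot\frac{\partial h_1}{\partial z_1}\cdot\frac{\partial z_1}{\partial w_{21}}
\end{aligned}
```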

Ok. we have simplified it to the lowest level. The equation may look daunting but trust me, each partial derivative is very simple to solve (except for the 2 derivatives of the sigmoid activation function). Let’s start solving each one of them, from right side to left.

Simple right? Let’s move to the next derivative. Now, ∂h1 / ∂z1 is the derivative of the sigmoid function. Just a reminder that the sigmoid function is

sigmoid(z) = 1 / (1 + e^(−z))

Let’s use this to calculate ∂h1 / ∂z1:

If you look closely into the above solution, we have used the Chain rule here, again. Now, we have a value for ∂h1 / ∂z1 but it still looks complex. Let’s simplify this.
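For reference, one way to write out the derivative of the sigmoid and its simplification is shown below. This is the standard result, written with h_1 = σ(z_1), where σ denotes the sigmoid.

```latex
\begin{aligned}
h_1 &= \sigma(z_1) = \frac{1}{1 + e^{-z_1}} \\
\frac{\partial h_1}{\partial z_1}
  &= \frac{e^{-z_1}}{\left(1 + e^{-z_1}\right)^2} \\
  &= \frac{1}{1 + e^{-z_1}} \cdot \frac{e^{-z_1}}{1 + e^{-z_1}} \\
  &= \sigma(z_1)\,\bigl(1 - \sigma(z_1)\bigr) = h_1\,(1 - h_1)
\end{aligned}
```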

This is so much better. Ok. let’s move to the next one.

Pfft! So simple. Next?

Another equation with Sigmoid function. No need to solve it as we know what the answer would be.

And lastly,

Can you imagine? We used the chain rule, again! Now, let’s assemble the Avengers.., sorry. Let’s assemble the derivatives.

Hmm. So satisfying. We did nothing but use the Chain rule, but still. Now, this can be easily solved and then plugged into our weight update equation for w₂₁
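Putting the pieces side by side, the assembled result would look something like this. It assumes the loss for a single example is the squared error L = (y − y_true)², and z_y is again an illustrative label for the weighted sum feeding the output neuron.

```latex
\begin{aligned}
\frac{\partial z_1}{\partial w_{21}} &= x_2 \\
\frac{\partial h_1}{\partial z_1} &= h_1\,(1 - h_1) \\
\frac{\partial z_y}{\partial h_1} &= w_{13} \\
\frac{\partial y}{\partial z_y} &= y\,(1 - y) \\
\frac{\partial L}{\partial y} &= 2\,(y - y_{\text{true}}) \\[6pt]
\frac{\partial L}{\partial w_{21}}
  &= 2\,(y - y_{\text{true}})\cdot y\,(1 - y)\cdot w_{13}\cdot h_1\,(1 - h_1)\cdot x_2
\end{aligned}
```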

And you repeat the same for all 6 weights. That completes one iteration. You repeat this process until you get to the convergence point. And that completes training. Woohoo!
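If you want to convince yourself that the chain-rule product is right, a quick sketch like the one below compares it against a numerical estimate (nudge w₂₁ a little and watch the loss). The values, the single-example squared-error loss, and the dictionary of weights are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x1, x2, w, y_true):
    """Forward pass of the 2-2-1 network; w maps names like 'w21' to values."""
    h1 = sigmoid(w["w11"] * x1 + w["w21"] * x2)
    h2 = sigmoid(w["w12"] * x1 + w["w22"] * x2)
    y = sigmoid(w["w13"] * h1 + w["w23"] * h2)
    loss = (y - y_true) ** 2        # squared error for a single example
    return h1, h2, y, loss


def grad_w21(x1, x2, w, y_true):
    """dL/dw21 as the product of the five chain-rule factors."""
    h1, _, y, _ = forward(x1, x2, w, y_true)
    dL_dy = 2.0 * (y - y_true)
    dy_dzy = y * (1.0 - y)          # sigmoid derivative at the output
    dzy_dh1 = w["w13"]
    dh1_dz1 = h1 * (1.0 - h1)       # sigmoid derivative at h1
    dz1_dw21 = x2
    return dL_dy * dy_dzy * dzy_dh1 * dh1_dz1 * dz1_dw21


# Made-up values for illustration only.
x1, x2, y_true = 0.5, 1.0, 1.0
w = {"w11": 0.2, "w21": 0.4, "w12": -0.3, "w22": 0.1, "w13": 0.7, "w23": -0.6}

analytic = grad_w21(x1, x2, w, y_true)

# Sanity check: nudge w21 slightly and see how much the loss changes.
eps = 1e-6
w_plus = dict(w, w21=w["w21"] + eps)
w_minus = dict(w, w21=w["w21"] - eps)
numeric = (forward(x1, x2, w_plus, y_true)[3]
           - forward(x1, x2, w_minus, y_true)[3]) / (2 * eps)

learning_rate = 0.1
w21_new = w["w21"] - learning_rate * analytic   # the weight update for w21
print(analytic, numeric, w21_new)
```

The two gradient values should agree to several decimal places, which is a handy check whenever you derive gradients by hand.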

Now, this is an extremely simple example with only one hidden layer and it already has quite a few operations. LLMs like GPT and LLaMA have dozens of layers with thousands of neurons in each. Backprop’ing through all of them becomes overwhelming very quickly. So, how do we make this process efficient? Can you think of any optimization technique that can be used here easily?

Ok. Let us take a look at ∂L / ∂w11. After applying the chain rule multiple times we get to,

Great. Do you note anything interesting here? Solutions for ∂L / ∂w11 and ∂L / ∂w21 have many derivatives in common.

As the NN becomes longer (more layers) and wider (more neurons per layer), many calculations are repeated across weight updates. So, why don’t we store these partial derivatives and reuse them for various calculations instead of repeatedly solving them? This technique is called Memoization and has become an integral part of NN training. But to use memoization efficiently, we should know beforehand which operations will be repeated. There should be a way to structure the different calculations that need to be solved and the operands required for them. For this, we use the Computation Graph Abstraction.
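Here is a small sketch of the reuse idea for our six weights. The shared products are computed once, stored in variables (named `delta_*` here, which is just a convention assumed for this sketch), and reused. The values and the single-example squared-error loss are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Made-up values for illustration only.
x1, x2, y_true = 0.5, 1.0, 1.0
w11, w21, w12, w22, w13, w23 = 0.2, 0.4, -0.3, 0.1, 0.7, -0.6

# Forward pass (intermediate results are kept for the backward pass).
h1 = sigmoid(w11 * x1 + w21 * x2)
h2 = sigmoid(w12 * x1 + w22 * x2)
y = sigmoid(w13 * h1 + w23 * h2)

# Shared by all six weights, computed once: dL/dy times the output sigmoid derivative.
delta_y = 2.0 * (y - y_true) * y * (1.0 - y)

# Shared by w11 and w21: everything up to (and including) dh1/dz1.
delta_h1 = delta_y * w13 * h1 * (1.0 - h1)
# Shared by w12 and w22.
delta_h2 = delta_y * w23 * h2 * (1.0 - h2)

grads = {
    "w11": delta_h1 * x1,   # only the last factor differs: dz1/dw11 = x1
    "w21": delta_h1 * x2,   #                               dz1/dw21 = x2
    "w12": delta_h2 * x1,
    "w22": delta_h2 * x2,
    "w13": delta_y * h1,    # last factor is h1
    "w23": delta_y * h2,    # last factor is h2
}
print(grads)
```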

5. Computation Graph Abstraction

A computation graph abstraction is a graphical representation of the mathematical operations and dependencies in a computation, particularly in the context of machine learning models like neural networks.

We will use it to visualize the flow of data (forward pass) and the flow of gradients (backward pass). A computation graph is a Directed Acyclic Graph (DAG). It has two parts

Nodes to represent input, operations, and output
Directed edges connecting nodes to represent the flow of data

Ok. let us build a computation graph for our one hidden-layer neural network.

First, let’s build a computation graph for the forward pass through the NN.

We started with inputs x1 and x2 all the way to the loss calculation. I have used dotted arrows to represent intermediate results, like h1,h2, and y. This is kind of a single computation as every operation in each layer is used to calculate the loss.

For backprop, you will have to update each weight separately, and hence you will see the reuse of intermediate results across weight updates.

Uff, drawing this was not a piece of cake (sorry about my handwriting). Here you can see that for w21 and w11 all the operations are the same except for the multiplication with the partial derivative of z3. The shared calculations do not need to be repeated, as the intermediate result can be reused directly. Same with w12 and w22.

All your favorite ML frameworks, like TensorFlow and PyTorch, build graphs like this in the backend. Some build the whole graph first without executing any operations (TensorFlow’s graph mode), while others record it on the fly as the forward pass runs (PyTorch). Either way, the graph lets them calculate each derivative as required and reuse calculations that were performed for other weight updates.
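To get a feel for the idea, here is a toy computation graph with a backward pass. This is only a sketch for intuition and not how TensorFlow or PyTorch are implemented internally. The `Node` class, the `backward` helper, and all values are made up for this example.

```python
import math


class Node:
    """A value in the computation graph plus links to the nodes it came from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def sigmoid(node):
    s = 1.0 / (1.0 + math.exp(-node.value))
    return Node(s, ((node, s * (1.0 - s)),))


def backward(output):
    """Walk the graph in reverse topological order, applying the chain rule."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local   # node.grad is computed once, reused


# Made-up values for illustration only.
x1, x2, y_true = Node(0.5), Node(1.0), Node(1.0)
w11, w21, w12, w22 = Node(0.2), Node(0.4), Node(-0.3), Node(0.1)
w13, w23 = Node(0.7), Node(-0.6)

h1 = sigmoid(w11 * x1 + w21 * x2)
h2 = sigmoid(w12 * x1 + w22 * x2)
y = sigmoid(w13 * h1 + w23 * h2)
diff = y - y_true
loss = diff * diff            # squared error for one example

backward(loss)
print(w11.grad, w21.grad)
```

One call to `backward` fills in the gradient of every weight, and each intermediate gradient is computed exactly once.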

I know what you are thinking. Backpropagation is just the combination of the chain rule and memoization. And you are mostly correct. But there are a few issues with backprop that need to be taken care of.

6. Disadvantages of Backpropagation

Contrary to what many people think, backprop is not a holy grail. It has a few issues that can be a pain in the butt (from my experience), especially for larger networks. Let us go over each one of them so you are prepared when you face an issue.

6.1 Vanishing gradient problem

In the above equation for the w₂₁ weight update, you saw that we are multiplying 5 partial derivatives. But this is a tiny network. For image classification, we build networks with 30-40 layers. The number of multiplications becomes huge for a weight update. And when you multiply a small fraction (less than 1) with another small fraction, again and again, the product shrinks towards zero, so the weights in the early layers barely get updated. Tanh and Sigmoid activations make this worse: they squash their output into a small range and their derivatives are small (the sigmoid derivative is never larger than 0.25), which leads to small gradients. This can also happen when weights are not initialized properly.
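A quick back-of-the-envelope sketch shows how fast this shrinks. It only looks at the sigmoid derivative factors and ignores the weight terms that also appear in the product, so treat it as an illustration rather than an exact gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


# The sigmoid derivative peaks at z = 0, where it equals 0.25.
best_case = sigmoid_derivative(0.0)

# Even in that best case, stacking layers shrinks the product quickly.
for n_layers in (5, 10, 20, 40):
    print(n_layers, best_case ** n_layers)
```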

6.2 High Sensitivity to Initialization

As Backprop follows gradient descent to update the weights, the initial weights play an important role. If the initial weights are not set correctly, the network might take a very long time to converge or might not converge at all. This is because poor initialization can lead to small gradients (vanishing gradient problem) or large gradients (exploding gradient problem), both of which hinder efficient learning.

6.3 Weight Transport Problem

As you learnt above, the backward pass needs the same weights that were used in the forward pass in order to calculate the gradients (one of the derivatives we solved was simply the weight w₁₃). This means the weights, along with the intermediate results, have to be kept in memory during training, leading to synchronization issues in distributed systems and memory issues in all systems. This memory cost is one of the reasons why training LLMs requires so many GPUs.

6.4 Differentiability Requirements

Both the loss function and activation function have to be differentiable to allow gradient calculation. For eg, when using ReLU, the gradient becomes undefined when the input is 0. For practical purposes, we consider it to be 0, so every gradient calculated through it using the chain rule also becomes 0. This means that the weights associated with these neurons do not get updated during training because the gradient is zero.
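A tiny sketch of this convention for ReLU is below. The `upstream` value stands for whatever gradient arrives from the later layers and is a made-up number.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # The derivative is 1 for x > 0 and 0 for x < 0. At exactly x = 0 it is
    # undefined, so we follow the convention described above and return 0.
    return 1.0 if x > 0 else 0.0


upstream = 0.8   # made-up gradient arriving from the layers after this neuron
for z in (2.0, 0.0, -1.5):
    # Chain rule: the gradient passed backwards is upstream * local derivative.
    print(z, relu(z), upstream * relu_grad(z))
```

Notice that the same thing happens for any negative input, not just for exactly 0: the local derivative is 0, so nothing flows back through that neuron for that example.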

“Ok, Abhi. You have shown how backprop sucks sometimes. Also tell us how to fix these issues”. No. That is something you have to figure out yourself. Build a small network (you can take the above example) and play around. Be ready to break things and learn the consequences of bad decisions.

7. Conclusion

All right guys. That’s it for this post. These topics are all you need to start building your first neural network. There are many other topics to cover like regularization, evaluation metrics, convolutional NNs, recurrent NNs, etc., but unless you get some hands-on practice first, learning these things is not useful.

If you want to start your career in deep learning, follow these steps

Find a problem no matter how superficial you think it is.
Find the dataset relevant to the problem. If you can’t find any suitable data, make plans to collect data: smaller datasets are fine.
Build your model. Try different things. Use your intuition or rules of thumb to start. Be patient and train the model until you are satisfied with the results.
Deploy your model and build a simple UI for this so people can use it. Store all the requests and responses.
Use the stored info as data and re-train the model until the improvement saturates.

But, if you want personal guidance on this journey and want to save days of time and effort spent on research, book a 1:1 call with me: https://topmate.io/akandiyana/684714

That’s it for this week. For next week, we will use what we learned here along with a few other things to build our first non-linear model for text classification.