Learning with gradient descent
Our goal in training a neural network is to find weights and biases which minimize the quadratic cost function C(w,b).
We already know how a neural network is structured: an input layer, one or more hidden layers, and an output layer.
To let the network learn by itself, we need some data, for example handwritten digits.
Actually, these digits come from a well-known training dataset called the MNIST dataset, which contains tens of thousands of scanned images of handwritten digits along with their correct classifications.
The MNIST data is divided into two parts: Training and Test datasets. We primarily use the training dataset to improve the model’s accuracy and the test dataset to evaluate the model’s performance.
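As a concrete illustration (a minimal sketch, assuming TensorFlow/Keras is installed; any other MNIST loader would work just as well), the two parts can be obtained like this:

```python
# Load MNIST and inspect the training/test split.
from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

print(train_images.shape)  # (number of training images, 28, 28)
print(test_images.shape)   # (number of test images, 28, 28)
```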
To keep things simple, let's strip the process down. From linear algebra we know that a list of values defines a point in space. So we treat the entire input layer as a vector (which we can think of as a point), and then use a function to map it to the target value y.
Here y is the desired output. Since there are ten possible digits, 0 through 9, y is a 10-dimensional vector.
As for x, we collapse the entire input layer into a single notation: x. Let me explain further:
x is a 28×28 = 784-dimensional vector. Each pixel is an input neuron, and each entry of the vector holds the grayscale value of a single pixel in the image.
So, in this process, we map the 784-dimensional vector (input) into a 10-dimensional vector (output).
For example, if a particular image x shows a 6, the desired output is y(x) = (0,0,0,0,0,0,1,0,0,0)^T, where T is the transpose operation, turning a row vector into an ordinary (column) vector.
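To make this concrete, here is a minimal sketch (assuming NumPy; the helper names to_input_vector and to_target_vector are my own, not from any library) of how an image and its label could be turned into the vectors x and y(x) described above:

```python
import numpy as np

def to_input_vector(image):
    """Flatten a 28x28 grayscale image (values 0-255) into a
    784-dimensional column vector with entries scaled to [0, 1]."""
    return image.reshape(784, 1) / 255.0

def to_target_vector(digit):
    """One-hot encode a digit label 0-9 as a 10-dimensional column vector."""
    y = np.zeros((10, 1))
    y[digit] = 1.0
    return y

# A label of 6 becomes (0,0,0,0,0,0,1,0,0,0)^T
print(to_target_vector(6).ravel())
```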
What we actually want is a method that lets us find weights and biases so that the output from the network approximates y(x) for all training inputs x.
To achieve that we introduce a cost function: C(w,b) = (1/2n) Σ_x ‖y(x) − a‖².
A quick explanation of the notation:
w denotes the collection of all weights in the network, b all the biases
n is the total number of training inputs
a is the vector of outputs from the network when x is input, and the sum is over all training inputs
We call C the quadratic cost function; it is also known as the mean squared error, or just MSE.
C(w,b) is non-negative, and roughly speaking our training algorithm has done a good job if it can find weights and biases so that C(w,b) ≈ 0.
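As an illustration, here is a minimal sketch of this cost (assuming NumPy; feedforward is a hypothetical stand-in for whatever function produces the network's output a for an input x):

```python
import numpy as np

def quadratic_cost(feedforward, training_data):
    """Quadratic cost C(w,b) = (1/2n) * sum_x ||y(x) - a||^2,
    where a = feedforward(x) is the network's output for input x."""
    n = len(training_data)
    total = 0.0
    for x, y in training_data:
        a = feedforward(x)               # network output for input x
        total += np.sum((y - a) ** 2)    # squared error for this example
    return total / (2 * n)
```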
OK, but why introduce the quadratic cost at all? Why not just directly maximize the number of images the network classifies correctly?
The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network.
For the most part, making small changes to the weights and biases won't change that count at all, so it gives us very little information to work with. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost, it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost.
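Here is a toy illustration of that point (entirely my own example, not from the text): a one-variable "classifier" that predicts 1 when w·x > 0.5. Nudging the weight w barely ever changes the number of correct classifications, while the quadratic cost changes smoothly with w:

```python
import numpy as np

# Tiny toy dataset and a threshold "classifier": predict 1 if w * x > 0.5.
xs = np.array([0.2, 0.4, 0.6, 0.8])
ys = np.array([0.0, 0.0, 1.0, 1.0])

def num_correct(w):
    """Number of correctly classified examples (a step-like function of w)."""
    preds = (w * xs > 0.5).astype(float)
    return int(np.sum(preds == ys))

def smooth_cost(w):
    """Smooth quadratic cost of the raw output w * x."""
    return np.sum((ys - w * xs) ** 2) / (2 * len(xs))

for w in (0.9, 1.0, 1.1):
    print(w, num_correct(w), smooth_cost(w))
# The count stays flat while the cost varies smoothly with w.
```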
So let's focus on minimizing some function C(v). This could be any real-valued function of many variables, v = v1, v2, …
To minimize C(v) it helps to imagine C as a function of just two variables, which we'll call v1 and v2.
The two-variable function is just an example; the main idea is to find where C attains its minimum. However, the usual calculus approach of solving for where the derivatives vanish doesn't work here, because in practice the function will have an enormous number of variables, far more than 2.
Fortunately, there is a beautiful analogy which suggests an algorithm that works pretty well: think of C as a valley, and imagine a ball rolling down its slope. We could simulate this simply by computing derivatives (and perhaps some second derivatives) of C; those derivatives tell us everything we need to know about the local "shape" of the valley. There is no real physics involved, we are just using the picture to keep moving downhill until the slope flattens out.
Let's think about what happens when we move the ball a small amount Δv1 in the v1 direction, and a small amount Δv2 in the v2 direction. Calculus tells us that C changes as ΔC ≈ (∂C/∂v1)Δv1 + (∂C/∂v2)Δv2.
With a little linear algebra we can relate the change in v to the change in C more compactly. Define the gradient vector ∇C ≡ (∂C/∂v1, ∂C/∂v2)^T and the change vector Δv ≡ (Δv1, Δv2)^T.
ΔC can then be rewritten as ΔC ≈ ∇C · Δv.
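A quick numerical sanity check of this approximation (my own toy example, using C(v) = v1² + v2²):

```python
import numpy as np

def C(v):
    """Toy cost: C(v) = v1^2 + v2^2."""
    return v[0] ** 2 + v[1] ** 2

def grad_C(v):
    """Gradient of the toy cost: (2*v1, 2*v2)."""
    return np.array([2 * v[0], 2 * v[1]])

v = np.array([1.0, 2.0])
dv = np.array([0.01, -0.02])      # a small step Δv

exact = C(v + dv) - C(v)          # the true ΔC
approx = grad_C(v) @ dv           # the approximation ∇C · Δv

print(exact, approx)              # the two values are very close
```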
What's really exciting about this equation is that it lets us see how to choose Δv so as to make ΔC negative; don't forget, we want to keep moving down the slope. In particular, suppose we choose Δv = −η∇C, so that ΔC ≈ −η ∇C · ∇C = −η‖∇C‖² ≤ 0,
where η is a small, positive parameter called the learning rate.
We then update the ball's position again and again using the rule v → v' = v − η∇C,
which on a graph looks like the ball "falling down" toward the bottom of the valley.
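Putting the update rule into code, here is a minimal sketch of gradient descent (assuming NumPy; the toy cost C(v) = v1² + v2² and its gradient stand in for whatever function we actually want to minimize):

```python
import numpy as np

def grad_C(v):
    """Gradient of the toy cost C(v) = v1^2 + v2^2."""
    return 2 * v

eta = 0.1                       # learning rate
v = np.array([1.0, 1.0])        # starting position of the "ball"

for step in range(100):
    v = v - eta * grad_C(v)     # update rule: v -> v' = v - η∇C

print(v)                        # close to the minimum at (0, 0)
```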
But we have one problem here: we need to choose the learning rate η to be small enough that the approximation ΔC ≈ ∇C · Δv remains good. ==If we don't, we might end up with ΔC > 0.== At the same time, we don't want η to be too small, since that will make the changes Δv tiny and the descent very slow.
With learning rate 0.03, it took 2495 steps and the output curve was relatively smooth.
With learning rate 3, the output curve is far more erratic, and whether we land near a minimum becomes a matter of luck.
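To play with this kind of comparison yourself, here is a rough sketch (same toy cost C(v) = v1² + v2² as above; the step counts it prints have nothing to do with the 2495 quoted above, which comes from a different setup) that counts how many steps gradient descent needs at different learning rates:

```python
import numpy as np

def steps_to_converge(eta, v0=(2.0, 2.0), tol=1e-6, max_steps=10000):
    """Run gradient descent on C(v) = v1^2 + v2^2 and count the steps
    needed to bring C below tol; return None if it diverges or stalls."""
    v = np.array(v0)
    for step in range(1, max_steps + 1):
        v = v - eta * 2 * v              # Δv = -η∇C, with ∇C = 2v
        cost = np.sum(v ** 2)
        if cost < tol:
            return step
        if cost > 1e12:                  # clearly diverging: η too large
            return None
    return None

for eta in (0.03, 0.3, 3.0):
    print(eta, steps_to_converge(eta))
# A tiny η converges slowly, a moderate η quickly, and a huge η not at all.
```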