Massey University, Dr. Juncheng Liu
Under the instruction of Dr. Liu I had an opportunity to start my very early academic exploration.
3/6 Lab, introductory work before formal academic preparation
OK, let's get into the real problem: handwriting recognition.
First, we’d like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit.
Second, we need a way to classify each individual digit. We'll focus on writing a program to solve this second problem, because it turns out that the segmentation problem is not so difficult to solve once you have a good way of classifying individual digits.
The first layer is the input layer. For simplicity I've omitted most of the 784 input neurons in the diagram above.
The second layer is the hidden layer. We denote the number of neurons in this hidden layer by n.
The output layer contains 10 neurons; we number them from 0 through 9 and figure out which neuron has the highest activation value.
Why do we need three layers to recognize a digit instead of two? What are the roles of the hidden and output layers? Why do we need an output layer at all? Where do the weights come from?
To answer these questions, let's think about what the neural network is doing from first principles: the input layer holds every pixel, and the output layer adds up all the evidence and decides true or false.
That's quite simple, but where does the evidence come from?
The hidden layer provides the evidence. Let's concentrate on the first hidden neuron, which detects whether or not an image fragment like the following is present.
It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs.
In the same way, if the other hidden neurons fire for their own image fragments, we can conclude that the digit is a 0 and get a 0 from the output.
There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below.
Find a set of weights and biases for the new output layer. Assume that the first three layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has an activation of at least 0.99, and incorrect outputs have activations less than 0.01.
We have an input layer which holds the pixels in grayscale; passing through the hidden layer, we get an output from the output layer. As said, the output closest to 1 indicates the number. Now we add a new output layer which converts the old output into 0s and 1s. We have the map as below.
For the first neuron, only digits 8 or 9 can activate it. This means that neurons 8 and 9 in the old output layer have a greater weight influence on the first neuron in the new output layer.
The same principle applies to other neurons.
But how do we determine the actual values of the biases and weights?
That doesn’t matter—we don’t need to manually design a set of weights or biases. Instead, we should understand that a node in the old output layer that is closer to 1 can be interpreted as having a positive influence. This influence is then mapped to the corresponding node in the new output layer.
For example, if neuron 8 is close to 1, we fire the first node in the new output layer.
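Here is one possible set of weights and biases for the new binary output layer, sketched in NumPy. The weight value 10, the bias −5, and the helper names are my own illustrative choices rather than something from these notes, but they satisfy the 0.99/0.01 assumption above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Row i of W connects the old 10-neuron output layer to the new neuron for
# binary bit (3 - i), so the first row responds only to digits 8 and 9.
W = 10.0 * np.array([[(d >> bit) & 1 for d in range(10)] for bit in (3, 2, 1, 0)])
b = -5.0 * np.ones(4)   # negative bias: stay off unless strong evidence arrives

# Old output layer: digit 9 almost fully active, everything else nearly off
old_output = np.full(10, 0.01)
old_output[9] = 0.99

print(np.round(sigmoid(W @ old_output + b), 2))  # ~[1, 0, 0, 1], i.e. binary 1001 = 9
```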
In practice, the weights and biases are optimized by the network itself using a method called gradient descent.
Our goal in training a neural network is to find weights and biases which minimize the quadratic cost function C(w,b).
We already know how a neural network functions: an input layer, a hidden layer, and an output layer.
To let the network learn by itself, we need some data, like digits.
Actually, these digits come from a well-known training dataset called the MNIST dataset, which contains tens of thousands of scanned images of handwritten digits along with their correct classifications.
The MNIST data is divided into two parts: Training and Test datasets. We primarily use the training dataset to improve the model’s accuracy and the test dataset to evaluate the model’s performance.
Let's simplify the process. From linear algebra, we know that a set of values defines a point in space. So we treat the whole input layer as a vector (which we can think of as a point), and then use a function to map it to the target value y.
In this case, y is the output, representing a digit from 0 to 9; we encode it as a 10-dimensional vector.
As for x, we simplify the entire input layer into a single notation: x. Let me explain further:
x is a 28×28 = 784-dimensional vector. We treat each pixel as an input neuron, and each component of the vector holds the grayscale value of a single pixel in the image.
So, in this process, we map the 784-dimensional vector (input) into a 10-dimensional vector (output).
For example, if a particular training image x depicts a 6, then the desired output is y(x) = (0,0,0,0,0,0,1,0,0,0)ᵀ, where T is the transpose operation, turning a row vector into an ordinary (column) vector.
Actually, we want a method that lets us find weights and biases so that the output from the network approximates y(x) for all training inputs x.
To achieve that, we introduce a cost function:
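In the notation of the referenced chapter, the quadratic cost is:

$$C(w,b) \equiv \frac{1}{2n} \sum_x \left\| y(x) - a \right\|^2$$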
A quick explanation of each symbol:
- w denotes the collection of all weights in the network, and b all the biases.
- n is the total number of training inputs.
- a is the vector of outputs from the network when x is the input, and the sum runs over all training inputs x.
- We call C the quadratic cost function; it is also known as the mean squared error, or just MSE.
C(w,b) is non-negative, and we can consider that our training algorithm has done a good job if it can find weights and biases such that C(w,b) ≈ 0.
OK, but why don't we just compare y(x) and a directly, for example by counting the number of correctly classified images?
The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network.
We can barely extract useful information from such a count; it makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost, it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost.
Consider minimizing a function C(v). This could be any real-valued function of many variables, v = v1, v2, ….
To minimize C(v) it helps to imagine C as a function of just two variables, which we’ll call v1 and v2
The function above is just an example; the main idea is to find the value of v that minimizes C. However, calculus doesn't work well here, because the function will have a huge number of variables, far more than two.
Fortunately, there is a beautiful analogy which suggests an algorithm that works pretty well: imagine a ball rolling down a valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of C; those derivatives would tell us everything we need to know about the local "shape" of the valley. This has nothing to do with physics; we are just trying to find where the derivatives vanish.
Let's think about what happens when we move the ball a small amount Δv1 in the v1 direction, and a small amount Δv2 in the v2 direction.
Using a little linear algebra and calculus, let's relate the small changes in v to the resulting change in C.
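To first order, as in the referenced chapter:

$$\Delta C \approx \frac{\partial C}{\partial v_1}\,\Delta v_1 + \frac{\partial C}{\partial v_2}\,\Delta v_2$$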
ΔC can be rewritten more compactly in terms of the gradient vector:
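Collecting the partial derivatives into the gradient vector ∇C (again following the referenced chapter):

$$\nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^{T}, \qquad \Delta C \approx \nabla C \cdot \Delta v$$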
What's really exciting about this equation is that it lets us see how to choose Δv so as to make ΔC negative. Don't forget, we keep tracing down the shape of the valley.
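In particular, the choice used in the referenced chapter is

$$\Delta v = -\eta\, \nabla C$$

which gives ΔC ≈ −η‖∇C‖² ≤ 0, so the cost never increases, to first order.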
Here η is a small, positive parameter that we call the learning rate.
We indicate the ball's position by v and keep updating it with the rule:
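$$v \;\to\; v' = v - \eta\, \nabla C$$

(the standard gradient descent update rule, reconstructed here since the rendered equation is missing).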
Plotted on a graph, this looks like the ball "falling down" into the valley.
But we have one problem here: we need to choose the learning rate η small enough that the approximation holds. ==If we don't, we might end up with ΔC > 0.== At the same time, we don't want η to be too small, since that would make the changes Δv tiny.
With learning rate 0.03 we used 2495 steps and the output curve was relatively smooth.
With learning rate 3 the output curve is much more erratic, and reaching a minimum requires luck.
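Here is a minimal sketch of that effect. It uses a toy quadratic cost rather than the actual network, and the starting point and step count are made-up illustrative values:

```python
import numpy as np

def C(v):
    # Toy quadratic cost: C(v) = v1^2 + v2^2 (a stand-in for the real network cost)
    return np.sum(v ** 2)

def grad_C(v):
    # Gradient of the toy cost: dC/dv = 2v
    return 2 * v

def gradient_descent(eta, steps=50):
    v = np.array([2.0, 3.0])          # arbitrary starting point
    for _ in range(steps):
        v = v - eta * grad_C(v)       # update rule: v -> v - eta * grad C(v)
    return C(v)

print(gradient_descent(0.03))  # small learning rate: cost shrinks smoothly toward 0
print(gradient_descent(3))     # large learning rate: each step overshoots and the cost blows up
```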
Patience in the face of such frustration is the only way to truly understand and internalize a subject. FIND YOUR PROJECT.
Let's start from a basic task: recognizing those digits. We don't usually appreciate how tough a problem our visual systems solve. Try to write a computer program to recognize digits like those above, and the attempt to make such rules precise will quickly leave you confused.
The idea is to take a large number of handwritten digits and develop a system which can learn from those training examples.
We start with a type of artificial neuron called a perceptron; the main neuron model used today is one called the sigmoid neuron.
We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general.
A perceptron is a device that makes decisions by weighing up evidence.
###### How do perceptrons work?

In algebraic terms, a perceptron takes several binary inputs, x1, x2, …, and produces a single binary output:
Rosenblatt used weights, real numbers expressing the importance of the respective inputs to the output.
The neuron’s output, 0 or 1, is determined by whether the weighted sum is less than or greater than some threshold value.
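In symbols, this is the standard perceptron rule (reconstructed here since the rendered formula is missing):

$$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases}$$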
A way you can think about the perceptron is that it’s a device that makes decisions by weighing up evidence.
By varying the weights and the threshold, we can get different models of decision-making.
The first layer of perceptrons makes three very simple decisions by weighing the input evidence.
In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer.
In this way, a many-layer network of perceptrons can engage in sophisticated decision making.
In fact, each perceptron still has a single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons.
Let's simplify the notation. The first change is to write the weighted sum as a dot product, w⋅x ≡ Σj wj xj, where w and x are vectors whose components are the weights and inputs respectively. The second change is to move the threshold to the other side of the inequality and replace it by what's known as the perceptron's bias, b ≡ −threshold.
The bias can be seen as a measure of how easy it is to get the perceptron to output a 1, or, in more biological terms, how easy it is to get the perceptron to fire. Using the bias instead of the threshold, the perceptron rule can be rewritten:
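In the book's notation:

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}$$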
Introducing the bias is only a small notational change, but it leads to further simplifications. Because of this, in the remainder of the book we won't use the threshold; we'll always use the bias.
###### Computing the elementary logical functions
Suppose we have a perceptron with two inputs, each with weight −2, and an overall bias of 3.
Then the input 11 produces output 0, since (−2)·1 + (−2)·1 + 3 = −1 is negative,
while inputs 00, 01, and 10 produce output 1.
Our perceptron implements a NAND gate!
Because the NAND gate is universal for computation, we can use networks of perceptrons to compute any logical function at all.
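As a quick sanity check, here is a tiny Python sketch (my own illustration, not from these notes) of that perceptron computing NAND:

```python
def perceptron(x1, x2, w1=-2, w2=-2, b=3):
    # Perceptron rule: output 1 if w.x + b > 0, otherwise 0
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

# Truth table: with weights -2, -2 and bias 3 this perceptron computes NAND
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron(x1, x2))
# prints: 0 0 -> 1, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```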
Actually, in neural networks, a neuron has only one output, which is then broadcasted to all its outgoing connections.
For encoding the input, we define an input layer. In this notation the input perceptrons have an output but no inputs; it is better to think of them as special units which are simply defined to output the desired values.
It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons.
In other words, the network can learn to solve problems on its own by tuning its own weights and biases.
This tuning happens in response to external stimuli, without direct intervention by a programmer.
how can we devise such algorithms for a neural network?
Before designing such an algorithm ourselves, let's see how the network behaves when a small change happens.
Ideally, a small change in any weight or bias causes only a small corresponding change in the output.
This property is what makes learning possible: we repeatedly make small changes that bring the output closer to the actual value.
But the problem is that when a small change happens, the output of a perceptron is likely to flip completely, say from 0 to 1. That flip may then cause the behaviour of the rest of the network to change in a complicated, unpredictable way.
That's where the sigmoid neuron comes in.
We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron
Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output.
The output, instead of being 0 or 1, is σ(w⋅x + b),
where σ is called the sigmoid function (also known as the logistic function).
To put it more explicitly:
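In the notation of the referenced chapter:

$$\sigma(z) \equiv \frac{1}{1+e^{-z}}, \qquad \text{output} = \frac{1}{1+\exp\!\left(-\sum_j w_j x_j - b\right)}$$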
We can treat the algebraic form as a technical detail rather than a barrier to understanding.
When z = w⋅x + b is large and positive, the output from the sigmoid neuron is approximately 1; when z is very negative, the output is approximately 0.
In those regimes the behaviour of a sigmoid neuron closely approximates a perceptron; it's only when z has a modest size that there's much deviation from the perceptron model.
Indeed, it's the smoothness of the σ function that is the crucial fact, not its detailed form.
Here's another version to consider: suppose σ had instead been a step function.
In that case, the sigmoid neuron would be a perceptron, since the output would be 1 or 0 (ignoring the modest-size case where w⋅x+b is near 0). So the sigmoid is basically a smoothed-out version of the perceptron.
It is this smoothness of σ that matters: it means that small changes Δwj in the weights and Δb in the bias produce a small change Δoutput in the output, which we already know can be approximated by:
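In the book's notation:

$$\Delta \text{output} \approx \sum_j \frac{\partial\, \text{output}}{\partial w_j}\,\Delta w_j + \frac{\partial\, \text{output}}{\partial b}\,\Delta b$$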
Don't panic! Only the shape of the function matters. In fact, later we will consider other activation functions,
where the output is f(w⋅x+b) for some other activation function f(⋅)
Δoutput is a linear function of the changes Δwj and Δb in the weights and bias
This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output.
So sigmoid neurons, while simulating perceptrons, make it much easier to figure out how changing the weights and biases will change the output.
Anyway, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don’t just output 0 or 1.
Sigmoid neurons simulating perceptrons, part I
Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, c>0. Show that the behaviour of the network doesn’t change.
Sigmoid neurons simulating perceptrons, part II
Suppose we have the same setup as the last problem - a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won’t need the actual input value, we just need the input to have been fixed.
Suppose the weights and biases are such that w⋅x + b ≠ 0 for the input x to any particular perceptron in the network.
Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant c>0.
Show that in the limit as c→∞ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons.
How can this fail when w⋅x+b=0 for one of the perceptrons?
###### Feedforward networks

A feedforward network contains input, hidden, and output layers, with no loops. Our neural network has three layers: an input layer, an output layer, and a hidden layer (i.e., a layer that is "not an input or an output").
Again, we have a network:
The leftmost layer consists of input neurons and the rightmost of output neurons. The middle layer is called a hidden layer, which really means nothing more than "not an input or an output".
The input and output layers are often straightforward, while there is often an art to the design of the hidden layers.
There are some design heuristics for the hidden layers, but we won't cover them here.
A network in which the output from one layer is used as input to the next layer is called a feedforward network: information is always fed forward, and no loops are allowed.
Recurrent neural networks allow such loops; they're much closer in spirit to how our brains work than feedforward networks.
A remote repository of Massey.
References:
- Michael Nielsen, Neural Networks and Deep Learning, Chapter 1: http://neuralnetworksanddeeplearning.com/chap1.html#learning_with_gradient_descent
- Learn the Basics — PyTorch Tutorials 2.6.0+cu124 documentation (pytorch.org). Most machine learning workflows involve working with data, creating models, optimizing model parameters, and saving the trained models.
- Deep Learning with PyTorch: A 60 Minute Blitz — PyTorch Tutorials 2.6.0+cu124 documentation (pytorch.org). PyTorch is an open-source deep learning framework known for its flexibility and ease of use.
Think of Python as a playground with infrastructure. Define a function using def; a function in Python can return multiple values, and no type declarations are needed.
In Python a list is mutable (it can be reassigned and modified), while a tuple is fixed and unchangeable.
```python
import math

def move(x, y, step, angle=0):
    # Move the point (x, y) by `step` in direction `angle` (radians)
    nx = x + step * math.cos(angle)
    ny = y - step * math.sin(angle)
    return nx, ny  # returning two values (really a tuple)
```

```python
>>> x, y = move(100, 100, 60, math.pi / 6)
>>> print(x, y)
151.96152422706632 70.0
```
A huge improvement is that Python supports default parameters. When a call would otherwise need many arguments, we can omit the fixed ones and only pass the unique ones.
WARN: Python's default parameter values are created outside the function call, at definition time. Always remember that Python passes arguments by reference, so default parameters must point to immutable objects!
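A small sketch of the pitfall behind this warning; the function name add_end is just an illustrative example:

```python
def add_end(L=[]):
    # Buggy: the default list is created once, at definition time, and reused
    L.append('END')
    return L

print(add_end())  # ['END']
print(add_end())  # ['END', 'END']  <- the shared default object was mutated

def add_end_fixed(L=None):
    # Correct: use an immutable default and create a fresh list per call
    if L is None:
        L = []
    L.append('END')
    return L
```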
Written in Python, PyTorch is relatively easy for most machine learning developers to learn and use.
For data, we have two primitives in torch: torch.utils.data.Dataset and torch.utils.data.DataLoader.
A Dataset stores a collection of samples and their corresponding labels. (Note: what many languages call a map is called a dict in Python.)
A DataLoader wraps an iterable around the Dataset.
We mainly use the FashionMNIST dataset from the TorchVision package; PyTorch offers such domain-specific libraries.
This Dataset includes two arguments, transform and target_transform, to modify the samples and labels respectively.
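A minimal sketch of loading FashionMNIST with these pieces, following the pattern of the PyTorch "Learn the Basics" tutorial referenced above (the root path and batch size are arbitrary choices):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# Download the FashionMNIST training split and convert each image to a tensor
training_data = datasets.FashionMNIST(
    root="data",           # where to store the downloaded files
    train=True,
    download=True,
    transform=ToTensor(),  # transform applied to the samples (images)
)

# DataLoader wraps the Dataset in an iterable of shuffled mini-batches
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)

X, y = next(iter(train_dataloader))
print(X.shape)  # torch.Size([64, 1, 28, 28])
print(y.shape)  # torch.Size([64])
```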
PyTorch uses a data type called the Tensor, similar to a multidimensional array, to store and compute the inputs and outputs of a model. An important feature is that
tensors can run on GPUs to accelerate computing.
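A tiny sketch of creating a tensor and moving it to the GPU when one is available:

```python
import torch

# A tensor built from nested Python lists
t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# Move it to the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
t = t.to(device)

print(t.shape, t.dtype, t.device)
```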