Deep Learning Coursera
Neural Networks and Deep Learning
FALL SEMESTER 2019
INSTRUCTOR: Dr. Andrew Ng
no_reply@example.com
26 Feb 2019
https://github.com/Kulbear/deep-learning-coursera/tree/master/Neural%20Networks%20and%20Deep%20Learning
WEEK-1
Video 1: What is a Neural Network?
ReLU: Rectified Linear Unit
Q1: Every input feature is interconnected with every hidden layer unit - True
Video 2: Supervised Learning with Neural Network
Application -> Network type
Real Estate -> Standard NN
Online Advertising -> Standard NN
Photo Tagging -> CNN
Speech Recognition -> RNN
Machine Translation -> RNN
Autonomous Driving -> Custom / Hybrid NN
Structured data: tables, databases
Unstructured data: audio, images, text
Q1: Which type of data has features like pixel values or individual words - Unstructured
Video 3: Why is DL taking off?
Neural networks need more data to reach higher performance.
With less data, even an SVM can perform better than a neural network.
The factors below made DL take off:
-Data
-Computation
-Algorithms
-Switching from sigmoid to ReLU made learning faster, because ReLU avoids the near-zero gradients of the
sigmoid's flat regions, so gradient descent converges faster (see the sketch below)
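A minimal sketch (not from the lecture) of the two activations and their gradients; the test values are just illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(relu(z))                         # [ 0.  0.  0.  1. 10.]
# The sigmoid gradient is nearly zero for large |z| (it saturates),
# while the ReLU gradient stays at 1 for every positive z, so gradient
# descent keeps making progress.
print(sigmoid(z) * (1 - sigmoid(z)))   # ~[0.00005 0.197 0.25 0.197 0.00005]
print((z > 0).astype(float))           # [0. 0. 0. 1. 1.]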
Q1: What does m stand for - the number of training examples
WEEK-2
Video 1: Binary Classification
Binary classification is a problem whose output is 1 or 0, e.g. whether an image is of a cat or not.
Example: an image of 64x64 pixels.
Steps: unroll all the RGB pixel values into one feature vector, i.e. a vector of size 64 x 64 x 3 = 12288 (H x W x channels).
Notation: a single training example is (x, y), where x is an nx-dimensional feature vector
and y is either 0 or 1, i.e.
(x, y): x ∈ R^nx, y ∈ {0, 1}
m_test = number of test examples
m_train = number of training examples
X is an nx x m matrix (each column is one training example)
Y is a 1 x m matrix
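A small numpy sketch (not from the video) of how this (nx, m) layout can be built; the random images and labels are just placeholders:

import numpy as np

m = 100                                   # placeholder number of examples
images = np.random.rand(m, 64, 64, 3)     # hypothetical batch of RGB images

# Unroll each 64x64x3 image into a column of length 12288 and stack the
# columns so that X has shape (nx, m)
X = images.reshape(m, -1).T
Y = np.random.randint(0, 2, size=(1, m))  # labels, shape (1, m)
print(X.shape, Y.shape)                   # (12288, 100) (1, 100)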
Video 2: Logistic Regression
Given x, what is the chance that y = 1? i.e. ŷ = P(y = 1 | x).
Here x is a feature vector and the parameters are w and b.
We could use the linear formula ŷ = w^T x + b, but this can be a huge negative or positive number, so we
pass it through the sigmoid function to squash its output into the range (0, 1):
z = w^T x + b, then sigmoid(z) = 1 / (1 + e^-z)
Q1: What are the parameters of logistic regression?
Ans - w, an nx-dimensional vector, and b, a real number
Video 3: Logistic Regression Cost Function
We have to compare ŷ with the actual y and calculate a loss. In the past we used squared error as the loss function, L(ŷ, y) = 1/2 (ŷ - y)^2,
but in NN we use the function below so that the optimization has a single global minimum:
Loss function: L(ŷ, y) = -( y log(ŷ) + (1 - y) log(1 - ŷ) )
Cost function: J(w, b) = (1/m) Σ (i=1..m) L(ŷ^(i), y^(i)). The loss function applies to a single observation; the cost function sums the loss over all the data and takes the average.
We select the parameters w, b for which the cost is minimum; training starts with initial values and keeps
changing them until we get the lowest cost.
Q1 : What is the difference between the cost function and the loss function
for logistic regression?
Ans - the loss function computes the error for a single training example;
the cost function is the average of the loss function over the entire training set
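A tiny numpy sketch of the loss-versus-cost distinction above; the predictions and labels are made-up placeholder values:

import numpy as np

def loss(a, y):
    # Loss for a single prediction a = ŷ against its label y
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

y_hat = np.array([0.9, 0.2, 0.7, 0.4])    # ŷ for m = 4 examples
y     = np.array([1.0, 0.0, 1.0, 1.0])    # true labels
print(loss(y_hat, y))                     # per-example losses
print(np.mean(loss(y_hat, y)))            # cost J = average of the losses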
Video 4 : Gradient Descent
In gradient descent we try to find the parameter values that give the global minimum of the cost function J(w, b).
Start with only one parameter w, initialized to some value.
Repeat the update below in a loop until it converges, i.e. reaches the minimum (alpha is the learning rate):
w := w - alpha * dJ(w)/dw
dJ(w)/dw is also written as dw (the derivative with respect to w), so the update becomes
w := w - alpha * dw
With both parameters w and b, the updates for J(w, b) are (a toy sketch follows after the notation note below):
w := w - alpha * dJ(w, b)/dw
b := b - alpha * dJ(w, b)/db
When J is a function of two or more variables, the partial derivative symbol ∂ is used,
and for a function of one variable the lowercase d is used, as shown in the image.
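A toy sketch of the update rule (not from the lecture): minimizing J(w) = (w - 3)^2, whose derivative is 2(w - 3):

alpha = 0.1          # learning rate
w = 0.0              # arbitrary starting value
for _ in range(100):
    dw = 2 * (w - 3)        # dJ/dw at the current w
    w = w - alpha * dw      # the update w := w - alpha * dw
print(w)             # converges towards 3, the minimum of J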
Q : A convex function always has multiple local optima - false
Video 5 : Derivatives
The chart is for f(a) = 3a.
If a = 2 then f(a) = 6. If we change a to 2.001 then f(a) becomes 6.003; from these two points on the line
we can see a small triangle forming in the image.
The slope (derivative) = height/width, i.e. the change in y over the change in x on the chart: 0.003/0.001 = 3.
Thus df(a)/da = 3.
The informal definition of the derivative: if a changes by a very tiny amount, how much does f(a) change?
Q: On a straight line, the function's derivative - Doesn't change
Video 6 : More Derivatives Example
1. f(a) = a^2, then df(a)/da = 2a, which at a = 2 equals 4.
If a = 2, f(a) = 4, and if a = 2.001, f(a) ≈ 4.004.
2. f(a) = a^3, then df(a)/da = 3a^2, which at a = 2 equals 3 * 2^2 = 12.
If a = 2, f(a) = 8, and if a = 2.001, f(a) ≈ 8.012.
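A quick numerical check of these two derivatives using the same tiny nudge eps as in the examples above:

eps = 0.001
a = 2.0
f_square = lambda x: x ** 2
f_cube   = lambda x: x ** 3

print((f_square(a + eps) - f_square(a)) / eps)  # ~4.001, close to 2a = 4
print((f_cube(a + eps) - f_cube(a)) / eps)      # ~12.006, close to 3a^2 = 12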
Video 7 : Computational Graph
The image above is an example of a computation graph. The forward (left-to-right) pass comes in handy when some
specific output variable, like J here, needs to be optimized.
Q: One step of backward propagation on a computation graph yields the derivative of the final output variable - True
Video 8 : Derivatives with Computational Graph
In the image above we have the formula J = 3(a + bc), shown as a computation graph (u = bc, v = a + u, J = 3v).
To calculate the slope (derivative) of J we start with some input values of a, b, c, compute forward, and then
change the values in backward order. First v is changed from 11 to 11.001,
and we see that J changes from 33 to 33.003, so the slope is dJ/dv = 3. Then we change a from 5 to 5.001,
v changes from 11 to 11.001 (so dv/da = 1), and by the chain rule dJ/da = dJ/dv * dv/da, i.e. 3 * 1 = 3.
So in backward propagation the derivative computed for each variable is "d(final output) / d(var)".
Below the derivatives for b and c are calculated as well (a numeric check follows after the Q&A):
Q: What does the coding convention dvar represent?
- The derivative of a final output variable with respect to various intermediate quantities
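A small numeric check of the same graph, J = 3(a + bc), using the example values a = 5, b = 3, c = 2:

def J_of(a, b, c):
    u = b * c        # first node
    v = a + u        # second node
    return 3 * v     # final output J

eps = 0.001
a, b, c = 5.0, 3.0, 2.0
print(J_of(a, b, c))                                 # 33
print((J_of(a + eps, b, c) - J_of(a, b, c)) / eps)   # ~3  -> dJ/da
print((J_of(a, b + eps, c) - J_of(a, b, c)) / eps)   # ~6  -> dJ/db = 3*c
print((J_of(a, b, c + eps) - J_of(a, b, c)) / eps)   # ~9  -> dJ/dc = 3*b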
Video 9 : Logistic Regression Gradient Descent
Here z = w^T x + b, ŷ = a = sigmoid(z) is the prediction, and L(a, y) is the loss function.
We are going with two features, x1 and x2.
We calculate z first,
then ŷ using the sigmoid,
and lastly the loss.
We do this repeatedly, changing w1, w2 and b, until we reach the minimum loss (the global minimum).
Below we calculate the derivatives for each input parameter by changing values:
Here we are working backwards.
First the derivative of a, i.e. da = dL(a, y)/da (how the loss changes if a changes),
then the derivative of z, i.e. dz = dL(a, y)/dz, which is also dL/da * da/dz (how the loss changes if z changes),
and lastly the derivatives of w1, w2, and b.
Q: Calculus also provides a simplified formula for the derivative of the loss function L with respect to z.
Ans: dz = a - y (a numeric sketch of this single-example backward pass follows below)
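A numeric sketch of the single-example backward pass; the feature values, label, and starting parameters are placeholders:

import numpy as np

x1, x2, y = 1.0, 2.0, 1.0          # one training example (placeholder)
w1, w2, b = 0.1, -0.2, 0.0         # starting parameters (placeholder)

z = w1 * x1 + w2 * x2 + b          # forward pass
a = 1.0 / (1.0 + np.exp(-z))       # ŷ = sigmoid(z)

dz  = a - y                        # dL/dz, the simplified formula above
dw1 = x1 * dz                      # dL/dw1
dw2 = x2 * dz                      # dL/dw2
db  = dz                           # dL/db

alpha = 0.01                       # one gradient descent step
w1, w2, b = w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db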
This was for one example; let's see how it works across all rows of the training data.
Video 10 : Gradient Descent on m examples
The first row of the image shows the cost function formula, and below it, its derivative: the average of the per-example loss derivatives.
In the image, the code implements logistic regression for m examples, with a for loop running over each
observation in the dataset. In the code we accumulate the derivatives into dw1, dw2 and db, and at the end divide
by m to get the averages (the accumulated cost J is divided by m in the same way). In this example there are only two input variables, so
we store the derivatives directly in dw1 and dw2 (for x1 and x2); with more variables we would need another for loop,
but it is better to avoid for loops for performance, and vectorization makes that possible (a runnable version follows below).
Q: In the for loop depicted in the video, why is there only one dw variable (i.e. no i superscripts in the for loop)?
Ans - The value of dw in the code is cumulative
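A runnable version of the looped computation described above, for two features; the small arrays are placeholder data:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1 = np.array([1.0, 2.0, 0.5, 1.5])     # feature 1 for m = 4 examples
x2 = np.array([0.5, 1.0, 2.0, 0.0])     # feature 2
y  = np.array([1.0, 0.0, 1.0, 0.0])     # labels
w1, w2, b = 0.0, 0.0, 0.0
m = len(y)

J = dw1 = dw2 = db = 0.0
for i in range(m):
    z = w1 * x1[i] + w2 * x2[i] + b
    a = sigmoid(z)
    J   += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))
    dz   = a - y[i]
    dw1 += x1[i] * dz                   # accumulate: hence a single dw1
    dw2 += x2[i] * dz
    db  += dz
J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
print(J, dw1, dw2, db)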
Python and Vectorization
Video 11 : Vectorization
To calculate z = w^T x + b where w and x are long vectors, a for loop is time consuming; the same thing can be done in numpy as z = np.dot(w, x) + b.
Vectorization ran about 300 times faster in the video's test: the for loop took about 480 ms while numpy took about 1.5 ms (a similar timing sketch is below).
Q: Vectorization cannot be done without a GPU - False
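A sketch of the timing comparison from the video; the exact numbers will depend on the machine:

import time
import numpy as np

n = 1_000_000
w = np.random.rand(n)
x = np.random.rand(n)

tic = time.time()
z = np.dot(w, x)                 # vectorized dot product
print("numpy:", (time.time() - tic) * 1000, "ms")

tic = time.time()
z = 0.0
for i in range(n):               # explicit for loop
    z += w[i] * x[i]
print("loop :", (time.time() - tic) * 1000, "ms")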
Video 12 : More Vectorization Example
Other examples of vectorization are given below.
Whenever possible, avoid explicit for loops and use built-in vectorized functions instead.
Now we apply vectorization to logistic regression as shown below:
We remove the separate dw1 and dw2 (in this case there are only two, but there may be more if there are more features)
and replace them with a single numpy vector dw, updated with one vectorized operation.
Video 13 : Vectorizing Logistic Regression
Here we calculate z and a (the activation) for every example at once. We can do so using numpy if we arrange
X and w as matrices and use the formula below:
Z = np.dot(w.T, X) + b
Afterwards we can also calculate the sigmoid of all the z values to get all the activations (see the sketch after the Q&A below).
Q: What are the dimensions of the matrix X in this video?
Ans: (nx, m)
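A minimal sketch of this vectorized forward pass; nx, m, and the data are placeholders:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

nx, m = 3, 5
X = np.random.randn(nx, m)       # each column is one example
w = np.zeros((nx, 1))
b = 0.0

Z = np.dot(w.T, X) + b           # shape (1, m); b is broadcast to every column
A = sigmoid(Z)                   # activations ŷ for all m examples at once
print(Z.shape, A.shape)          # (1, 5) (1, 5)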
Video 14 : Vectorizing Logistic Regression's Gradient Output
Here we show how one iteration of gradient descent over the parameters w and b can be computed
without using any explicit loop (a sketch follows below):
Q: How do you compute the derivative of b in one line of Python/numpy?
Ans: db = (1/m) * np.sum(dZ)
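A sketch of one fully vectorized iteration over w and b; the data and learning rate alpha are placeholders:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

nx, m, alpha = 3, 5, 0.01
X = np.random.randn(nx, m)
Y = np.random.randint(0, 2, size=(1, m))
w = np.zeros((nx, 1))
b = 0.0

Z = np.dot(w.T, X) + b           # forward pass, shape (1, m)
A = sigmoid(Z)
dZ = A - Y                       # shape (1, m)
dw = np.dot(X, dZ.T) / m         # shape (nx, 1)
db = np.sum(dZ) / m              # scalar, the one-liner from the quiz
w = w - alpha * dw               # parameter update
b = b - alpha * db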
Video 15 : Broadcasting in python
Broadcasting is a technique that helps make Python/numpy code faster and shorter.
Here we calculate the percentage of total calories contributed by each nutrient in each food item,
which can be achieved in two lines without using a for loop (see the sketch after the Q&A below):
If two matrices have different dimensions, numpy first expands (broadcasts) the smaller one to the matching shape and then performs the
operation element-wise.
In the first case a 1x4 matrix is added to a 1x1 value, so the 1x1 is first expanded to 1x4 by repeating the 100; the same applies
to the rest of the calculations. This is the general principle of broadcasting in Python.
Q: Which numpy code would sum the values in a matrix A vertically (down the columns)?
Ans: A.sum(axis=0)
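A sketch of the calories example; the numbers are close to the ones used in the video (rows are carbs, protein, fat, columns are four foods):

import numpy as np

A = np.array([[56.0,   0.0,  4.4, 68.0],    # carbs
              [ 1.2, 104.0, 52.0,  8.0],    # protein
              [ 1.8, 135.0, 99.0,  0.9]])   # fat

cal = A.sum(axis=0)                        # total calories per food, shape (4,)
percentage = 100 * A / cal.reshape(1, 4)   # (3,4) / (1,4): broadcasting
print(percentage)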
Video 16 : A note on python numpy vectors
Always create a vector as a matrix by making the second dimension (1) explicit:
a = np.random.randn(5, 1) (this is a column vector) or a = np.random.randn(1, 5) (this is a row vector),
and not a = np.random.randn(5), which just creates a one-dimensional array (a rank-1 array) and not a vector,
so matrix operations like transpose will not work as expected.
We can also convert such arrays into a vector/matrix using a = a.reshape((5, 1)).
Q: What kind of array has dimension in this format: (10,)
Ans: A rank one Array
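A short sketch showing the difference between a rank-1 array and a proper column vector:

import numpy as np

a = np.random.randn(5)          # rank-1 array, shape (5,)
print(a.shape)                  # (5,)
print(a.T.shape)                # still (5,) - transpose does nothing useful

b = np.random.randn(5, 1)       # proper column vector, shape (5, 1)
print(np.dot(b, b.T).shape)     # (5, 5) outer product, as expected

a = a.reshape((5, 1))           # convert the rank-1 array into a column vector
assert a.shape == (5, 1)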
Video 17 : A quick tour of python/iPython notebook
Use Shift+Enter to run a cell
Video 18 : Explanation of Logistic Regression Cost Function (Optional)