Training Neural Networks Data Science & Business Analytics Lab 2015/11/23, Minsik Park
Training neural networks Prior knowledge for understanding training neural nets. Cost function (loss function) In neural networks, the cost function measures the MSE (mean squared error) between the actual output and the target output; the smaller the difference, the better the network has learned. The goal of training is to repeatedly adjust w and b so that the cost function reaches its minimum. n : number of training inputs, $y(x)$ : target value, a : actual output, w : weights, b : bias. $C(w,b) \equiv \frac{1}{2n}\sum_x \|y(x) - a\|^2$
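As a minimal illustration (not from the slides), the quadratic cost above can be computed directly with NumPy; the array names `targets` and `outputs` are placeholders for the example.

```python
import numpy as np

def quadratic_cost(targets, outputs):
    """Quadratic cost C(w, b) = 1/(2n) * sum_x ||y(x) - a||^2."""
    n = targets.shape[0]                      # number of training inputs
    return np.sum((targets - outputs) ** 2) / (2 * n)

# toy check: three training cases, scalar outputs
targets = np.array([1.0, 0.0, 1.0])
outputs = np.array([0.8, 0.2, 0.6])
print(quadratic_cost(targets, outputs))       # 0.04
```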
Training neural networks Prior knowledge for understanding training neural nets. Gradient-descent method Starting from an arbitrary point $x_0$, compute the slope (gradient) of the function and move towards lower slope until a local minimum is reached. In neural networks, the values of w and b are changed iteratively so that the cost function is minimized.
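A minimal sketch of the idea (not from the slides): repeatedly step a parameter against the gradient of a toy one-dimensional cost; `grad_fn`, `x0`, and the learning rate are illustrative names and values.

```python
import numpy as np

def gradient_descent(grad_fn, x0, learning_rate=0.1, steps=100):
    """Move repeatedly in the direction of steepest descent (negative gradient)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad_fn(x)
    return x

# toy cost C(x) = (x - 3)^2 with gradient 2 * (x - 3); minimum at x = 3
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # approaches 3.0
```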
Learning the weights of a linear neuron The perceptron learning procedure cannot be extended to more complex networks. Instead of showing that the weights get closer to a good set of weights, show that the actual output values get closer to the target values. The simplest example is a linear neuron with a squared error measure: $\hat{y} = \sum_i w_i x_i$, where $\hat{y}$ is the neuron's estimate of the desired output, $x$ the input vector, and $w$ the weight vector.
Learning the weights of a linear neuron Example of the iterative method (price of a meal). True weights: fish = 150, chips = 50, ketchup = 100; portions 2, 5, 3 -> price of meal = 850 (target). Arbitrary initial weights: 50, 50, 50; same portions 2, 5, 3 -> price of meal = 500, so the residual error is 350. Delta-rule: $\Delta w_i = \epsilon\, x_i (t - y)$. With a learning rate of 1/35, the weight changes are +20, +50, +30, giving new weights 70, 100, 80 and a new price of meal = 70*2 + 100*5 + 80*3 = 880.
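The slide's arithmetic can be reproduced with a few lines of NumPy; this is just a sketch of one delta-rule update on the fish/chips/ketchup example.

```python
import numpy as np

# fish, chips, ketchup portions and the true price of the meal
portions = np.array([2.0, 5.0, 3.0])
true_weights = np.array([150.0, 50.0, 100.0])
target = portions @ true_weights                 # 850

weights = np.array([50.0, 50.0, 50.0])           # arbitrary initial weights
learning_rate = 1.0 / 35.0

estimate = portions @ weights                    # 500
residual = target - estimate                     # 350
weights += learning_rate * portions * residual   # delta-rule update: +20, +50, +30

print(weights)                  # [ 70. 100.  80.]
print(portions @ weights)       # 880.0
```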
The error surface in extended weight space The error surface for a linear neuron The error surface lies in a space with one horizontal axis for each weight and one vertical axis for the error. For a linear neuron with a squared error, it is a quadratic bowl. (Figure: quadratic bowl over weight axes w1, w2 with vertical error axis E.)
The error surface in extended weight space Online learning vs. batch learning (Figure: two panels in w1-w2 weight space, one for online learning and one for batch learning, with constraint lines from training case 1 and training case 2.)
The error surface in extended weight space Why learning can be slow (Figure: elongated elliptical error contours in w1-w2 space.) If the ellipse is very elongated, the direction of steepest descent is almost perpendicular to the direction towards the minimum.
Backpropagation algorithm An algorithm that propagates the error backwards (backward propagation of error) through the network to find the best learning result. First developed in the 1970s. In 1986, the paper by Rumelhart and Hinton [1] showed that multi-layer perceptrons can be trained efficiently with backpropagation, reviving neural network research from its dark age.
Backpropagation algorithm An MLP (multi-layer perceptron; feed-forward network) consists of an input layer, hidden layer(s), and an output layer. 1. Feed forward: input layer -> hidden layer -> output layer; the error and the cost function are computed at the final output. 2. Backpropagation: output layer -> hidden layer -> input layer; partial derivatives are taken sequentially in the reverse direction of the output, and the weights and biases are updated.
Step by step : Backpropagation [5] Multi-layer perceptron (Figure: network with an input layer, hidden layers, and an output layer.)
Step by step : Backpropagation Neuron's structure $f(e)$ : activation function (sigmoid), $e = \sum_i x_i w_i$ : weighted sum of the inputs.
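A minimal sketch of the neuron in the figure (the inputs and weights are made-up numbers):

```python
import numpy as np

def sigmoid(e):
    """Logistic activation f(e) = 1 / (1 + exp(-e))."""
    return 1.0 / (1.0 + np.exp(-e))

def neuron_output(x, w):
    """Single neuron: weighted sum of the inputs followed by the sigmoid activation."""
    e = np.dot(x, w)          # e = sum of input * weights
    return sigmoid(e)

print(neuron_output(np.array([1.0, 0.5]), np.array([0.4, -0.2])))
```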
Step by step : Backpropagation Feed forward
Step by step : Backpropagation Backpropagation : difference
Step by step : Backpropagation Backpropagation : difference
Step by step : Backpropagation Backpropagation : difference
Step by step : Backpropagation Backpropagation : weights update 𝜂 : Learning Rate
Step by step : Backpropagation Backpropagation : weights update
Step by step : Backpropagation Backpropagation : weights update
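Pulling the step-by-step figures together, here is a hedged sketch of one training iteration for a tiny 2-3-1 sigmoid network; the layer sizes, initial weights, and the single training case are made up rather than taken from the figures.

```python
import numpy as np

def sigmoid(e):
    return 1.0 / (1.0 + np.exp(-e))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))        # input -> hidden weights
W2 = rng.normal(size=(1, 3))        # hidden -> output weights
eta = 0.5                           # learning rate

x = np.array([0.3, 0.7])            # one training case
t = np.array([1.0])                 # its target value

# 1. feed forward
h = sigmoid(W1 @ x)                 # hidden activations
y = sigmoid(W2 @ h)                 # network output

# 2. backpropagate the difference (delta signals)
delta_out = (t - y) * y * (1 - y)             # output-layer delta
delta_hid = (W2.T @ delta_out) * h * (1 - h)  # hidden-layer delta

# 3. weights update
W2 += eta * np.outer(delta_out, h)
W1 += eta * np.outer(delta_hid, x)
```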
Backpropagation Sigmoid function A function whose output varies continuously between 0 and 1; mainly used as the activation function. $\sigma(z) = \frac{1}{1 + e^{-z}}$ (Figure: sigmoid curve crossing 0.5 at z = 0 and saturating towards 1.)
Backpropagation Sigmoid function The derivative of the sigmoid is convenient for the backpropagation computation: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$.
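A quick sketch (not from the slides) showing why this form is handy: the derivative can be computed from the sigmoid value that the forward pass already produced.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative expressed through the sigmoid itself: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1 - s)

# numerical check against a finite-difference approximation
z = 0.7
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
print(sigmoid_prime(z), numeric)     # both ~0.2217
```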
Backpropagation Delta-rule Gradient descent learning rule for updating the weights of a neural network (linear neuron). Error as the squared residuals summed over all training cases: $E = \frac{1}{2}\sum_{n} (t^n - y^n)^2$. Error derivatives for the weights: $\frac{\partial E}{\partial w_i} = -\sum_{n} x_i^n (t^n - y^n)$. The batch delta rule changes the weights in proportion to the error derivatives summed over all training cases: $\Delta w_i = -\varepsilon \frac{\partial E}{\partial w_i} = \varepsilon \sum_{n} x_i^n (t^n - y^n)$.
Backpropagation Delta-rule Gradient descent learning rule for updating the weights of a neural network (logistic neuron). Compared with the linear delta-rule there is an extra term, the slope of the logistic: $\frac{\partial E}{\partial w_i} = -\sum_{n} x_i^n\, y^n (1 - y^n)\, (t^n - y^n)$.
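A minimal sketch of this gradient with the extra logistic-slope term; the function name and toy data are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_delta_rule_grad(X, t, w):
    """dE/dw for a single logistic neuron with squared error.

    Same as the linear delta-rule gradient, but each training case is scaled
    by the extra term y * (1 - y), the slope of the logistic at that case.
    """
    y = sigmoid(X @ w)
    return -(X.T @ ((t - y) * y * (1 - y)))

X = np.array([[0.0, 1.0], [1.0, 1.0]])   # two toy training cases
t = np.array([0.0, 1.0])
w = np.zeros(2)
print(logistic_delta_rule_grad(X, t, w))
```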
Backpropagation Backpropagating $\partial E/\partial y$ through the hidden -> output connections: the logit of output unit $j$ is $z_j = b_j + \sum_i y_i w_{ij}$ and its output is $y_j = \frac{1}{1 + e^{-z_j}}$.
Backpropagation example : Space exploration rovers A rover is a robot vehicle that moves across the surface of a planet and conducts detailed geological studies (physical analysis of planetary terrains and astronomical bodies, and collecting data about air pressure, climate, temperature, wind, ...) pertaining to the properties of the landing cosmic environment. Youssef Bassil, the Chief Science Officer of the LACSC association. http://photojournal.jpl.nasa.gov/catalog/PIA04413
Backpropagation example : Space exploration rovers Space exploration robots The paper [2] proposes an ANN model trained with the supervised back-propagation algorithm that allows rovers to plan paths autonomously, navigating through challenging planetary terrains towards their goal location while avoiding dangerous obstacles. Input layer: 2 neurons (x, y nodes), fed by the rover's sensors. Hidden layer: 3 neurons (sigmoid activation function). Output layer: 2 neurons (activation y = v), directly linked to the rover's motors, which control its movement and its mechanical operation.
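A hedged sketch of the 2-3-2 topology described above; the weights are random placeholders rather than the paper's trained values, and "y = v" is read here as a linear output unit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
W_in_hid = rng.normal(size=(3, 2))    # 2 sensor inputs (x, y) -> 3 hidden neurons
W_hid_out = rng.normal(size=(2, 3))   # 3 hidden neurons -> 2 motor outputs

def rover_controller(sensor_xy):
    """Map the rover's (x, y) sensor reading to its two motor command outputs."""
    hidden = sigmoid(W_in_hid @ sensor_xy)   # sigmoid hidden layer, as in the slide
    return W_hid_out @ hidden                # linear output (y = v), an assumption

print(rover_controller(np.array([0.2, 0.9])))
```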
Backpropagation example : Space exploration rovers Space exploration robots
Improved backpropagation Learning slowdown problem Because of the shape of the sigmoid's derivative, training can slow down. Cost function: $C = \frac{(y-a)^2}{2}$, output value: $a = \sigma(z)$, where $z = wx + b$. Then $\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x$ and $\frac{\partial C}{\partial b} = (a - y)\,\sigma'(z)$ (for a single training example with $x = 1$, $y = 0$, both reduce to $a\,\sigma'(z)$). Since $\sigma'(z)$ approaches 0 as $z$ moves away from 0, learning slows down even when the $(a - y)$ term is large.
Improved backpropagation Learning slowdown problem (solution) Use the cross-entropy cost function instead of the quadratic cost function. The $\sigma'(z)$ term cancels out, so learning proceeds faster than with the MSE. $C = -\frac{1}{n}\sum_x \left[ y \ln a + (1 - y)\ln(1 - a) \right]$, where $a = \sigma(z)$, $z = \sum_j w_j x_j + b$; $C > 0$ and $0 \le y \le 1$.
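A minimal sketch of the cross-entropy cost and the resulting output error term (array names and values are illustrative):

```python
import numpy as np

def cross_entropy_cost(y, a):
    """C = -1/n * sum_x [ y ln a + (1 - y) ln(1 - a) ]."""
    n = y.shape[0]
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / n

def cross_entropy_delta(y, a):
    """Output error for the cross-entropy cost: the sigma'(z) factor cancels,
    leaving just (a - y), so learning does not slow down when a saturates."""
    return a - y

y = np.array([1.0, 0.0, 1.0])
a = np.array([0.8, 0.2, 0.6])
print(cross_entropy_cost(y, a), cross_entropy_delta(y, a))
```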
Linear regression in neural networks Normalization When the activation function is the sigmoid, the output lies between 0 and 1, so to evaluate the cost function the actual target values must be min-max normalized: $y' = \frac{y - \min}{\max - \min}$, $0 \le y' \le 1$ (min-max normalization). Predictions can be brought back to the original scale by denormalizing: $a' = a \times (\max - \min) + \min$ (denormalization).
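A minimal sketch of the two transformations (the target values are made up):

```python
import numpy as np

def minmax_normalize(y):
    """Scale targets into [0, 1] so they match the sigmoid output range."""
    return (y - y.min()) / (y.max() - y.min())

def denormalize(a, y_min, y_max):
    """Map network outputs in [0, 1] back to the original target scale."""
    return a * (y_max - y_min) + y_min

prices = np.array([15.0, 22.5, 50.0, 31.0])             # made-up target values
scaled = minmax_normalize(prices)
print(scaled)
print(denormalize(scaled, prices.min(), prices.max()))  # recovers the originals
```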
Linear regression in neural networks BostonHousing dataset Housing data for 506 census tracts of Boston from the 1970 census.
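The R tutorials in [10] and [11] fit a neural network regression on this data; a rough Python equivalent is sketched below, assuming the BostonHousing table has been exported to a local CSV (the file name, column names, and hyperparameters are assumptions, not the presentation's actual setup).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# assumed local export of the BostonHousing data (e.g. from R's mlbench package)
df = pd.read_csv("BostonHousing.csv")
X = df.drop(columns=["medv"]).to_numpy()   # "medv" assumed to be the target column
y = df["medv"].to_numpy()

# min-max normalize features and targets into [0, 1]
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
y_min, y_max = y.min(), y.max()
y_scaled = (y - y_min) / (y_max - y_min)

X_tr, X_te, y_tr, y_te = train_test_split(X, y_scaled, random_state=0)
net = MLPRegressor(hidden_layer_sizes=(10,), activation="logistic", max_iter=5000)
net.fit(X_tr, y_tr)

# denormalize predictions back to the original price scale
pred = net.predict(X_te) * (y_max - y_min) + y_min
print(pred[:5])
```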
Optimization issues in using the weight derivatives How often to update the weights - Online : after each training case. - Full batch: after a full sweep through the training data. - Mini-batch: after a small sample of training cases. How much to update - Use a fixed learning rate? - Adapt the global learning rate? - Adapt the learning rate on each connection separately? - Don’t use steepest descent?
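As a hedged illustration of the "how often to update" choices, the sketch below applies delta-rule updates once per mini-batch; setting `batch_size` to 1 gives online learning and to `len(X)` gives full-batch learning (the linear model and the data are placeholders).

```python
import numpy as np

def minibatch_sgd(X, t, batch_size=10, eta=0.01, epochs=100):
    """Delta-rule updates applied once per mini-batch of training cases."""
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            y = X[batch] @ w
            w += eta * X[batch].T @ (t[batch] - y) / len(batch)
    return w

X = np.random.randn(100, 3)
t = X @ np.array([1.5, -2.0, 0.5])
print(minibatch_sgd(X, t))           # approaches [1.5, -2.0, 0.5]
```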
Overfitting Overfitting When a model becomes specialized to the training data, its predictions on new samples get worse or the benefit of training disappears. (a) has large errors; (b) has some error but captures the character of the given points well; (c) fits every point exactly but is not suited to new samples (overfitting).
Ways to reduce overfitting Weight-decay: keep the weights small, or drive many of them to 0. Weight-sharing: force groups of weights to take the same value. Early stopping: make a fake test set (validation set); when the result on it gets worse, stop training. Model averaging: train lots of different neural nets and average their predictions. Dropout: omit some neurons during training and repeat the process; the resulting model is robust and not overly influenced by particular training data.
Ways to reduce overfitting 1. Regularization The surest way to prevent overfitting is to collect more training data, but this can take a lot of time and money and also increases training time. Regularization instead adds a penalty term that steers the model towards a simpler one.
Ways to reduce overfitting 1. Regularization (L2 regularization) Learning in the direction of smaller w means that "local noise" has little influence on training and "outliers" have less effect, which helps generalization. L2 regularization: $C = C_0 + \frac{\lambda}{2n}\sum_w w^2$, where $C_0$ is the original cost function, n is the number of training inputs, $\lambda$ is the regularization parameter, $\eta$ is the learning rate, and w are the weights. Taking the partial derivative of the newly defined cost function with respect to w gives $w \rightarrow w - \eta\frac{\partial C_0}{\partial w} - \frac{\eta\lambda}{n} w = \left(1 - \frac{\eta\lambda}{n}\right) w - \eta\frac{\partial C_0}{\partial w}$: the original w is multiplied by $(1 - \eta\lambda/n)$, so the weights keep shrinking; this is called "weight decay".
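A one-line sketch of the weight-decay update above; `grad_C0` stands for the gradient of the original cost $C_0$, and all hyperparameter values are placeholders.

```python
import numpy as np

def l2_update(w, grad_C0, eta=0.5, lam=0.1, n=1000):
    """L2 'weight decay': shrink w by (1 - eta*lam/n), then take the usual
    gradient step against the original cost C0."""
    return (1 - eta * lam / n) * w - eta * grad_C0

w = np.array([2.0, -3.0])
print(l2_update(w, grad_C0=np.zeros(2)))   # weights shrink even with zero gradient
```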
Ways to reduce overfitting 1. Regularization (L1 regularization) Because a constant amount is subtracted at every update, small weights all converge to 0 and only a few important weights remain, which suits "sparse models". Since |w| is not differentiable at 0, some care is needed in gradient-based learning. L1 regularization: $C = C_0 + \frac{\lambda}{n}\sum_w |w|$, where $C_0$ is the original cost function, n is the number of training inputs, $\lambda$ is the regularization parameter, and w are the weights. Differentiating the newly defined cost function with respect to w gives $w \rightarrow w - \frac{\eta\lambda}{n}\,\mathrm{sgn}(w) - \eta\frac{\partial C_0}{\partial w}$: a constant amount is subtracted or added depending on the sign of w.
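And the corresponding L1 sketch under the same placeholder hyperparameters:

```python
import numpy as np

def l1_update(w, grad_C0, eta=0.5, lam=0.1, n=1000):
    """L1 regularization: subtract a constant eta*lam/n in the direction of
    sgn(w), which pushes small weights towards exactly zero (sparsity)."""
    return w - eta * lam / n * np.sign(w) - eta * grad_C0

w = np.array([2.0, -3.0])
print(l1_update(w, grad_C0=np.zeros(2)))
```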
Ways to reduce overfitting 2. Drop out [3] Instead of training the full network of figure (a), some neurons in the input and hidden layers are omitted (dropped out) as in figure (b); training is then repeated while a different random subset of neurons is chosen each time. (The averaging effect of this voting gives an effect similar to regularization, and the resulting model is robust to the peculiarities of any particular training data.)
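A minimal sketch of a dropout mask applied to one layer's activations during training; the drop probability and the inverted-dropout scaling are common choices, not necessarily the presentation's exact setup (the original paper [3] instead rescales the weights at test time).

```python
import numpy as np

def dropout(activations, drop_prob=0.5, rng=None):
    """Randomly zero out a fraction of the neurons' activations during training.
    Scaling by 1/(1 - drop_prob) keeps the expected activation unchanged."""
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= drop_prob
    return activations * mask / (1.0 - drop_prob)

h = np.array([0.2, 0.9, 0.4, 0.7, 0.1, 0.8])
print(dropout(h))     # roughly half of the hidden activations are dropped
```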
References
-papers-
[1] Rumelhart, D. E., Hinton, G. E., and Williams, R. J., "Learning representations by back-propagating errors", Nature, 323, 533-536, 1986
[2] Youssef Bassil, "Neural Network Model for Path-Planning of Robotic Rover Systems", International Journal of Science and Technology (IJST), E-ISSN: 2224-3577, Vol. 2, No. 2, February 2012
[3] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research 15, 1929-1958, 2014
-web pages-
[4] http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
[5] http://blog.naver.com/laonple
[6] http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
[7] http://neuralnetworksanddeeplearning.com/chap3.html
[8] https://en.wikipedia.org/wiki/Cross_entropy#cite_note-1
[9] https://www.youtube.com/watch?v=GlcnxUlrtek Neural Networks Demystified [Part 4: Backpropagation]
[10] https://heuristically.wordpress.com/2011/11/17/using-neural-network-for-regression/
[11] http://beyondvalence.blogspot.kr/2014/04/r-comparing-multiple-and-neural-network.html