Parallel software Lab. 박 창 규

Chapter 7 Regularization for Deep Learning

What is Regularization?
One of the central problems in machine learning is performing well on new inputs, not just on the training data. Regularization refers to any strategy designed to reduce test error, possibly at the cost of increased training error.

Goals of Regularization
Regularization places additional constraints on the objective function. This amounts to imposing soft constraints on the parameter values, chosen to improve performance on the test set. Goals of regularization: (1) encode prior knowledge, (2) express a preference for a simpler model, (3) make an underdetermined problem determined.

Regularizing Estimators
In deep learning, most regularization strategies are based on regularizing estimators. Regularizing an estimator trades increased bias for reduced variance. A good regularizer reduces variance significantly without increasing bias too much.
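The bias/variance trade can be made concrete with a toy estimator. The sketch below is an illustration that is not in the slides: it assumes a hypothetical estimator that multiplies the sample mean of n draws by a shrinkage factor c, and uses the standard decomposition MSE = bias^2 + variance. The names `shrunk_mean_error`, `mu`, `sigma`, `n` and `c` are made up for this example.

```python
def shrunk_mean_error(mu, sigma, n, c):
    """Bias, variance and MSE of the estimator c * (sample mean of n draws).

    mu and sigma are the true mean and standard deviation of the data,
    c = 1 is the unregularized sample mean, c < 1 shrinks toward zero.
    """
    bias = (c - 1.0) * mu
    variance = c * c * sigma * sigma / n
    return bias, variance, bias * bias + variance


for c in (1.0, 0.5, 0.1):
    bias, variance, mse = shrunk_mean_error(mu=1.0, sigma=2.0, n=4, c=c)
    print(f"c={c}: bias={bias:+.2f} variance={variance:.2f} mse={mse:.2f}")
```

With these assumed numbers, moderate shrinkage (c = 0.5) halves the error of the unshrunk estimator, while heavy shrinkage (c = 0.1) adds so much bias that most of the gain is lost.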

Model Family and Regularization
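Chapter 5, in its discussion of generalization and overfitting, describes three regimes for the model family being trained, defined relative to the data generating process: (1) the model family excludes the true data generating process (underfitting), (2) the model family matches the true data generating process, (3) the model family includes the true data generating process but also many other possible generating processes (overfitting). The goal of regularization is to move a model from the third regime to the second, ruling out the other generating processes so that the model matches the true one.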

Importance of Regularization
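An overly complex model family does not necessarily include the target function or the true data generating process, or even a close approximation of either. Most deep learning applications are in domains where the true data generating process is almost certainly outside the model family, for example the complex domains of images, audio sequences and text. In practice we are always trying to fit a square peg (the data generating process) into a round hole (our model family).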

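What is the Best Model?
The best model cannot be found simply by choosing a model of the right size with the right number of parameters. Instead, the best-fitting model (in the sense of lowest generalization error) is a large model that has been regularized appropriately. This chapter therefore looks at strategies for obtaining large, deep, regularized models.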

7.1 Parameter Norm Penalties
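Many regularization approaches limit the capacity of a model by adding a parameter norm penalty $\Omega(\theta)$ to the objective function $J$:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\,\Omega(\theta)$$
where $\alpha \in [0, \infty)$ is a hyperparameter that weights the penalty relative to $J$. When the training algorithm minimizes the regularized objective $\tilde{J}$, it decreases both the original objective $J$ on the training data and some measure of the size of the parameters $\theta$ (or of a subset of them).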
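A minimal sketch of the regularized objective, assuming an L2-style penalty and a made-up quadratic loss. `toy_loss`, `l2_penalty` and `regularized_objective` are hypothetical names for this illustration, not part of any library.

```python
def l2_penalty(theta):
    """Omega(theta) = 1/2 * sum of squared parameters."""
    return 0.5 * sum(t * t for t in theta)


def regularized_objective(loss, theta, alpha, penalty=l2_penalty):
    """J_tilde(theta) = J(theta) + alpha * Omega(theta), with alpha in [0, inf)."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return loss(theta) + alpha * penalty(theta)


def toy_loss(theta):
    """Hypothetical unregularized loss J with its minimum at theta = (3, -2)."""
    return (theta[0] - 3.0) ** 2 + (theta[1] + 2.0) ** 2


theta = [3.0, -2.0]
print(regularized_objective(toy_loss, theta, alpha=0.0))  # J alone
print(regularized_objective(toy_loss, theta, alpha=0.1))  # J plus the weighted penalty
```

Setting `alpha=0.0` recovers the original objective; larger values make the size of the parameters count for more.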

7.1.1 L2 Parameter Regularization
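L2 parameter regularization is commonly known as weight decay. It adds the regularization term $\Omega(\theta) = \frac{1}{2}\lVert\omega\rVert_2^2$ to the objective function $J$, which drives the weights toward the origin.
Assume no bias parameter ($\theta \to \omega$):
$$\tilde{J}(\omega; X, y) = \frac{\alpha}{2}\,\omega^\top\omega + J(\omega; X, y)$$
Corresponding parameter gradient:
$$\nabla_\omega \tilde{J}(\omega; X, y) = \alpha\omega + \nabla_\omega J(\omega; X, y)$$
Take a single gradient step with learning rate $\epsilon$:
$$\omega \leftarrow \omega - \epsilon\,(\alpha\omega + \nabla_\omega J(\omega; X, y))$$
$$\omega \leftarrow (1 - \epsilon\alpha)\,\omega - \epsilon\,\nabla_\omega J(\omega; X, y)$$
The weight decay term multiplicatively shrinks the weight vector by the constant factor $(1 - \epsilon\alpha)$ on every step, just before the usual gradient update.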
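The two forms of the update are algebraically identical, which a few lines of code can confirm. This is a sketch under simple assumptions: weights and gradients are plain Python lists, `lr` stands for the learning rate epsilon, and the function names are hypothetical.

```python
def sgd_step_l2(w, grad, lr, alpha):
    """One gradient step on J_tilde = J + (alpha / 2) * w^T w."""
    return [wi - lr * (alpha * wi + gi) for wi, gi in zip(w, grad)]


def sgd_step_decay_form(w, grad, lr, alpha):
    """The same step written as shrink-then-update: (1 - lr * alpha) * w - lr * grad."""
    shrink = 1.0 - lr * alpha
    return [shrink * wi - lr * gi for wi, gi in zip(w, grad)]


w = [1.0, -2.0]
grad = [0.3, 0.0]  # stands in for the gradient of the unregularized J
print(sgd_step_l2(w, grad, lr=0.1, alpha=0.5))
print(sgd_step_decay_form(w, grad, lr=0.1, alpha=0.5))
```

When the gradient of J is zero, the step reduces to multiplying every weight by `1 - lr * alpha`, which is where the name weight decay comes from.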

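To see what happens over the entire course of training, make a quadratic approximation of the objective function around the minimizer of the unregularized cost, $\omega^* = \arg\min_\omega J(\omega)$:
$$\hat{J}(\theta) = J(\omega^*) + \frac{1}{2}(\omega - \omega^*)^\top H\,(\omega - \omega^*)$$
where $H$ is the Hessian of $J$ with respect to $\omega$, evaluated at $\omega^*$.
Minimum of $\hat{J}$:
$$\nabla_\omega \hat{J}(\omega) = H(\omega - \omega^*) = 0$$
Adding the weight decay gradient, with $\tilde{\omega}$ denoting the regularized minimum:
$$\alpha\tilde{\omega} + H(\tilde{\omega} - \omega^*) = 0$$
$$(H + \alpha I)\,\tilde{\omega} = H\omega^*$$
$$\tilde{\omega} = (H + \alpha I)^{-1} H \omega^*$$
As $\alpha$ approaches 0, the regularized solution $\tilde{\omega}$ approaches $\omega^*$, the minimizer of the unregularized training cost.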
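The closed form is easiest to read when H is diagonal, because the matrix inverse then acts on each coordinate separately. The sketch below makes that simplifying assumption (a diagonal Hessian with positive entries, passed as the list `h_diag`); the function name is hypothetical.

```python
def regularized_minimum_diag(h_diag, w_star, alpha):
    """w_tilde = (H + alpha * I)^{-1} H w_star for a diagonal Hessian H.

    Each coordinate of w_star is rescaled by h / (h + alpha).
    """
    return [h / (h + alpha) * w for h, w in zip(h_diag, w_star)]


h_diag = [10.0, 0.1]   # strong curvature in the first direction, weak in the second
w_star = [1.0, 1.0]
print(regularized_minimum_diag(h_diag, w_star, alpha=1.0))
print(regularized_minimum_diag(h_diag, w_star, alpha=0.0))  # no penalty: w_star itself
```

Under this assumption the direction with large curvature is barely changed, while the direction with small curvature is shrunk almost to zero.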

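7.1.2 L1 Regularization
The L1 penalty is the sum of the absolute values of the individual model parameters $\omega_i$:
$$\Omega(\theta) = \lVert\omega\rVert_1 = \sum_i \lvert\omega_i\rvert$$
As with L2 regularization, the hyperparameter $\alpha$ controls the strength of the penalty $\Omega$:
$$\tilde{J}(\omega; X, y) = \alpha\lVert\omega\rVert_1 + J(\omega; X, y)$$
$$\nabla_\omega \tilde{J}(\omega; X, y) = \alpha\,\mathrm{sign}(\omega) + \nabla_\omega J(\omega; X, y)$$
$$\omega \leftarrow \omega - \epsilon\,(\alpha\,\mathrm{sign}(\omega) + \nabla_\omega J(\omega; X, y))$$
The regularization therefore subtracts a constant whose sign follows the sign of each $\omega_i$, rather than an amount proportional to $\omega_i$ as in L2.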
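The L1 update can be sketched the same way as the L2 one. Assumptions: plain lists for weights and gradients, `lr` for the learning rate epsilon, and sign(0) taken as 0 (one valid subgradient choice). The function names are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sgd_step_l1(w, grad, lr, alpha):
    """One (sub)gradient step on J_tilde = J + alpha * ||w||_1."""
    return [wi - lr * (alpha * sign(wi) + gi) for wi, gi in zip(w, grad)]


w = [0.5, -0.5, 0.0]
grad = [0.0, 0.0, 0.0]  # stands in for the gradient of the unregularized J
print(sgd_step_l1(w, grad, lr=0.1, alpha=1.0))
```

Every nonzero weight moves toward zero by the same amount `lr * alpha`, regardless of how large the weight is.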

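Because L1 parameter regularization subtracts a constant amount at each step regardless of the size of the weight, a sufficiently large $\alpha$ drives small weights to zero and leaves only a few important weights. L1 regularization is therefore effective when only a few meaningful values should survive, and it suits sparse models (models in which some parameters have an optimal value of zero).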
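The sparsity effect shows up clearly under the same quadratic approximation used for L2, if the Hessian is additionally assumed to be diagonal. That assumption is not made in the slides; under it, the L1-regularized minimum has the closed form sign(w*_i) * max(|w*_i| - alpha / H_ii, 0). The sketch compares it with the L2 rescaling; all names are hypothetical.

```python
def l1_minimum_diag(h_diag, w_star, alpha):
    """L1-regularized minimum for a diagonal Hessian (soft thresholding)."""
    out = []
    for h, w in zip(h_diag, w_star):
        magnitude = abs(w) - alpha / h
        if magnitude <= 0.0:
            out.append(0.0)          # small weights are set exactly to zero
        else:
            out.append(magnitude if w > 0 else -magnitude)
    return out


def l2_minimum_diag(h_diag, w_star, alpha):
    """L2-regularized minimum for a diagonal Hessian (rescaling only)."""
    return [h / (h + alpha) * w for h, w in zip(h_diag, w_star)]


h_diag = [1.0, 1.0, 1.0]
w_star = [3.0, 0.2, -0.05]
print(l1_minimum_diag(h_diag, w_star, alpha=0.5))
print(l2_minimum_diag(h_diag, w_star, alpha=0.5))
```

In this toy setting L1 zeroes the two small weights and keeps the large one, whereas L2 shrinks all three but leaves none at zero.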

7.2 Norm Penalties as Constrained Optimization
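To require the regularization term $\Omega(\theta)$ to be less than some constant $k$, construct the generalized Lagrange function:
$$\mathcal{L}(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha\,(\Omega(\theta) - k)$$
The solution to the constrained problem is
$$\theta^* = \arg\min_\theta \max_{\alpha,\,\alpha \ge 0} \mathcal{L}(\theta, \alpha)$$
With $\alpha$ fixed at its optimal value $\alpha^*$, the problem depends on $\theta$ alone:
$$\theta^* = \arg\min_\theta \mathcal{L}(\theta, \alpha^*) = \arg\min_\theta J(\theta; X, y) + \alpha^*\,\Omega(\theta)$$
In any procedure for solving the problem, $\alpha$ must increase whenever $\Omega(\theta) > k$ and decrease whenever $\Omega(\theta) < k$. Viewed this way, a norm penalty restricts the parameters to a region. If $\Omega$ is the L2 norm, the weights are constrained to lie in an L2 ball.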
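The rule for alpha can be sketched as a simple alternating scheme: take a gradient step on theta, then move alpha up or down depending on whether the constraint is violated. This is an illustration under stated assumptions, not a method given in the slides: Omega is the L2 norm, the loss is a made-up quadratic J(theta) = 0.5 * ||theta - center||^2, the same step size is used for both updates, and theta is assumed never to reach the origin (where the norm has no gradient). All names are hypothetical.

```python
def l2_norm(theta):
    return sum(t * t for t in theta) ** 0.5


def update_alpha(alpha, theta, k, step):
    """Raise alpha when Omega(theta) > k, lower it when Omega(theta) < k, keep alpha >= 0."""
    return max(0.0, alpha + step * (l2_norm(theta) - k))


def constrained_minimize(center, k, step=0.05, iterations=3000):
    """Minimize 0.5 * ||theta - center||^2 subject to ||theta||_2 <= k."""
    theta, alpha = list(center), 0.0
    for _ in range(iterations):
        norm = l2_norm(theta)
        # Gradient of the Lagrangian in theta: grad J + alpha * grad Omega.
        grad = [(t - c) + alpha * t / norm for t, c in zip(theta, center)]
        theta = [t - step * g for t, g in zip(theta, grad)]
        alpha = update_alpha(alpha, theta, k, step)
    return theta, alpha


theta, alpha = constrained_minimize(center=[3.0, 4.0], k=1.0)
print(theta, l2_norm(theta), alpha)
```

In this toy problem the unconstrained minimum lies outside the ball of radius k, so the iterate settles on the boundary of the ball in the direction of `center`.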

7.3 Regularization and Under-Constrained Problems
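Many linear models (for example, linear regression) depend on inverting the matrix $X^\top X$. This is not possible when $X^\top X$ is singular. The matrix can be singular when the data generating distribution has no variance in some direction, or when no variance is observed in some direction because there are fewer examples (rows of $X$) than input features (columns of $X$). With regularization, the model inverts $X^\top X + \alpha I$ instead, which is guaranteed to be invertible for $\alpha > 0$. Most forms of regularization also guarantee the convergence of iterative methods applied to underdetermined problems.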
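A tiny worked case, restricted to two features so the linear algebra fits in a few lines of plain Python. The helper names and the data (one example, two features) are assumptions made for this illustration.

```python
def gram_plus_ridge(X, alpha):
    """Return X^T X + alpha * I for a design matrix with two feature columns."""
    a = sum(row[0] * row[0] for row in X) + alpha
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X) + alpha
    return [[a, b], [b, d]]


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def ridge_fit(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y with Cramer's rule (two features only)."""
    m = gram_plus_ridge(X, alpha)
    det = det2(m)
    if det == 0:
        raise ValueError("X^T X + alpha * I is singular; use alpha > 0")
    r0 = sum(row[0] * yi for row, yi in zip(X, y))
    r1 = sum(row[1] * yi for row, yi in zip(X, y))
    return [(r0 * m[1][1] - m[0][1] * r1) / det,
            (m[0][0] * r1 - m[1][0] * r0) / det]


X, y = [[1.0, 2.0]], [3.0]             # one example, two features
print(det2(gram_plus_ridge(X, 0.0)))   # determinant of X^T X alone
print(ridge_fit(X, y, alpha=0.1))      # solvable once alpha > 0
```

With a single example the determinant of X^T X is zero, so the unregularized normal equations have no unique solution, while any positive alpha makes the system solvable.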

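7.4 Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on more data. When data are limited, one remedy is to create fake data and add it to the training set. This is easiest for classification, because the main task of a classifier is to be invariant to a wide variety of transformations of its input. It is harder to apply to many other tasks: for a density estimation task, for example, generating new fake data is difficult unless the density estimation problem has already been solved. Transformations that would change the correct class must not be applied.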
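A minimal sketch of label-preserving augmentation for image classification, assuming images are small 2D lists of pixel values. The two transformations (horizontal flip and a one-pixel shift) and all function names are choices made for this illustration. Whether a transformation preserves the class depends on the task: a horizontal flip, for instance, would not be safe for telling the letters "b" and "d" apart.

```python
def flip_horizontal(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels=1, fill=0):
    """Translate the image to the right, padding the left edge with `fill`."""
    width = len(image[0])
    return [[fill] * pixels + row[: width - pixels] for row in image]


def augment(dataset):
    """Return the original (image, label) pairs plus transformed copies with the same label."""
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((flip_horizontal(image), label))
        augmented.append((shift_right(image), label))
    return augmented


dataset = [([[1, 2, 3], [4, 5, 6]], "cat")]
for image, label in augment(dataset):
    print(label, image)
```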

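7.5 Noise Robustness
For some models, noise can be applied to the inputs as a form of dataset augmentation. Another way to use noise for regularization is to add it to the weights. Under some assumptions, noise added to the weights is equivalent to a more traditional form of regularization that encourages stability of the function being learned.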
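Both uses of noise can be sketched with a linear model. Assumptions for this illustration: zero-mean Gaussian noise, a seeded `random.Random` generator for reproducibility, and hypothetical function names and standard deviations.

```python
import random


def add_gaussian_noise(values, std, rng):
    """Return a copy of `values` with independent N(0, std^2) noise added to each entry."""
    return [v + rng.gauss(0.0, std) for v in values]


def noisy_forward(weights, x, rng, weight_std=0.0, input_std=0.0):
    """Linear prediction w . x with optional noise on the weights and/or the inputs."""
    noisy_weights = add_gaussian_noise(weights, weight_std, rng)
    noisy_x = add_gaussian_noise(x, input_std, rng)
    return sum(w * xi for w, xi in zip(noisy_weights, noisy_x))


rng = random.Random(0)
weights, x = [0.5, -1.0], [2.0, 1.0]
print(noisy_forward(weights, x, rng))                   # noise-free prediction
print(noisy_forward(weights, x, rng, weight_std=0.1))   # noise on the weights
print(noisy_forward(weights, x, rng, input_std=0.1))    # noise on the inputs
```

During training, a fresh noise sample would be drawn at every step, so the model is pushed toward weights whose predictions change little under small perturbations.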

7.6 Semi-Supervised Learning
In semi-supervised learning, both unlabeled examples from P(x) and labeled examples from P(x, y) are used to estimate P(y | x) or to predict y from x. The goal is to learn a representation in which examples from the same class have similar representations. For example, starting from a single image labeled "penguin", one can pick out the images in the full dataset that look similar to it and label them "penguin" as well.
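The penguin example can be sketched as nearest-neighbor label spreading. This is a deliberately simplified illustration, not a full semi-supervised method: it assumes each image has already been mapped to a representation vector, that Euclidean distance in that space reflects similarity, and that a hand-picked `max_distance` threshold decides what counts as similar. All names are hypothetical.

```python
import math


def pseudo_label(labeled, unlabeled, max_distance):
    """Give each unlabeled point the label of its nearest labeled point,
    or None when no labeled point lies within max_distance."""
    result = []
    for x in unlabeled:
        features, label = min(labeled, key=lambda pair: math.dist(pair[0], x))
        result.append(label if math.dist(features, x) <= max_distance else None)
    return result


labeled = [((0.0, 0.0), "penguin")]     # one labeled representation
unlabeled = [(0.1, 0.2), (5.0, 5.0)]    # representations without labels
print(pseudo_label(labeled, unlabeled, max_distance=1.0))
```

The quality of the spread labels depends entirely on the representation: if examples of the same class are close together, a single labeled example goes a long way.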

7.7 Multi-Task Learning
Multi-task learning improves generalization by pooling the examples arising out of several tasks. The supervised tasks (predict y(i) given x) share the same input x as well as an intermediate-level representation h(shared). The model's parameters fall into two groups: (1) task-specific parameters, which benefit only from the examples of their own task to achieve good generalization, and (2) generic parameters shared across all tasks, which benefit from the pooled data of all the tasks.
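A minimal forward pass showing the two groups of parameters. Assumptions for this sketch: a single shared layer with a ReLU activation produces h(shared), each task has one linear output head, and the weights and task names are made up.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def multitask_forward(x, shared_weights, heads):
    """Compute h_shared once from the generic parameters, then one output per task head."""
    h_shared = relu(matvec(shared_weights, x))
    return {task: sum(w * h for w, h in zip(head, h_shared))
            for task, head in heads.items()}


shared_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]   # generic parameters
heads = {                                                # task-specific parameters
    "task_a": [1.0, 0.0, 0.0],
    "task_b": [0.5, 0.5, 0.0],
}
print(multitask_forward([1.0, 2.0], shared_weights, heads))
```

In training, the gradient from every task's loss would flow into `shared_weights`, while each head would only be updated by its own task's examples.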
