Gradient Descent Algorithm
Optimization of Multivariate Functions
Optimization of multivariate functions
- Optimization problem: find the minimizer $x^* = \arg\min_{x \in D} f(x)$
  - $x = (x_1, x_2, \ldots, x_p)^T \in D \subset \mathbb{R}^p$
  - $f : D \to \mathbb{R}$: convex function
  - $x^* \in D$: exists uniquely
Descent algorithm
- $x_1 \to x_2 \to x_3 \to \cdots \to x^*$
- Update: $x_{n+1} = x_n + \alpha_n d_n$
  - $\alpha_n > 0$: step size (or learning rate)
  - $d_n \in \mathbb{R}^p$: direction vector
  - $f(x_{n+1}) < f(x_n)$: descent condition
- Choosing $d_n$ and $\alpha_n$
  - Choice of $d_n$: a descent direction, i.e. one for which $f(x_{n+1}) < f(x_n)$
  - Choice of $\alpha_n$: the optimal step size $\alpha_n = \arg\min_{t > 0} f(x_n + t\, d_n)$ (see the sketch below)
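As a small illustration of the optimal step size rule, the sketch below uses optimize() to approximate $\alpha_n = \arg\min_{t>0} f(x_n + t d_n)$ numerically for the example function used on the following slides; the search interval $(0, 1]$ and the current point are choices made for this illustration only.

```r
# Example objective from the following slides:
# f(x, y) = log(1 + exp(x - y)) + 2x^2 + y^2 - 2xy - 2x - y
f <- function(v) log(1 + exp(v[1] - v[2])) + 2 * v[1]^2 + v[2]^2 - 2 * v[1] * v[2] - 2 * v[1] - v[2]

# Approximate exact line search: alpha_n = argmin_{t > 0} f(x_n + t * d_n),
# computed numerically on a bounded interval (the upper bound 1 is an arbitrary choice here)
line_search <- function(f, x, d, upper = 1) {
  optimize(function(t) f(x + t * d), interval = c(0, upper))$minimum
}

x_n <- c(2, 2)
d_n <- c(-2.5, 1.5)                 # -grad f(2, 2), taken from the slides
alpha_n <- line_search(f, x_n, d_n)
x_next <- x_n + alpha_n * d_n
c(alpha = alpha_n, f_old = f(x_n), f_new = f(x_next))
```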
Example
- $f(x, y) = \log(1 + \exp(x - y)) + 2x^2 + y^2 - 2xy - 2x - y$
- [Figure: level curve of $f$ with descent/ascent directions at the current point]
  - Black point: present point $(2, 2)$
  - Black solid line: level curve
  - Black dashed line: tangent line
  - Red arrows: descent directions
  - Blue arrows: ascent directions
Example
- $f(x, y) = \log(1 + \exp(x - y)) + 2x^2 + y^2 - 2xy - 2x - y$
- $\nabla f(2, 2) = (2.5, -1.5)^T$
- Blue arrow: $+\nabla f(2, 2)$
- Red arrow: $-\nabla f(2, 2)$
Descent direction
- $x(\alpha) = x + \alpha d$, with $d = -\nabla f(x)$
- Linear approximation: for sufficiently small $\alpha > 0$,
  $f(x(\alpha)) \approx f(x) + (x(\alpha) - x)^T \nabla f(x) = f(x) - \alpha \nabla f(x)^T \nabla f(x) < f(x)$
Descent direction
- $x(\alpha) = x + \alpha d$, with $d = -A \nabla f(x)$, where $A$ is a positive definite matrix
- Linear approximation: for sufficiently small $\alpha > 0$,
  $f(x(\alpha)) \approx f(x) + (x(\alpha) - x)^T \nabla f(x) = f(x) - \alpha \nabla f(x)^T A \nabla f(x) < f(x)$
Gradient descent algorithm
- $x_{n+1} = x_n - \alpha_n A_n \nabla f(x_n)$
  - $\alpha_n > 0$: step size
  - $A_n$: positive definite matrix
  - $\nabla f(x_n)$: gradient vector at $x = x_n$
- Examples (see the sketch below)
  - Steepest descent: $A_n = I$
  - Newton-Raphson: $A_n = [\nabla^2 f(x_n)]^{-1}$
  - Diagonally scaled steepest descent: $A_n = [\mathrm{diag}\, \nabla^2 f(x_n)]^{-1}$
  - ...
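The sketch below is a minimal R implementation of the update $x_{n+1} = x_n - \alpha_n A_n \nabla f(x_n)$ on the example function above, comparing steepest descent ($A_n = I$) with Newton-Raphson ($A_n = [\nabla^2 f(x_n)]^{-1}$); the constant step sizes and iteration counts are illustrative choices, not part of the slides.

```r
# f, its gradient, and its Hessian for
# f(x, y) = log(1 + exp(x - y)) + 2x^2 + y^2 - 2xy - 2x - y
f  <- function(v) log(1 + exp(v[1] - v[2])) + 2 * v[1]^2 + v[2]^2 - 2 * v[1] * v[2] - 2 * v[1] - v[2]
gf <- function(v) {
  s <- plogis(v[1] - v[2])                     # exp(x - y) / (1 + exp(x - y))
  c(s + 4 * v[1] - 2 * v[2] - 2, -s + 2 * v[2] - 2 * v[1] - 1)
}
hf <- function(v) {
  w <- plogis(v[1] - v[2]) * (1 - plogis(v[1] - v[2]))
  matrix(c(w + 4, -w - 2, -w - 2, w + 2), 2, 2)
}

# Generic update x_{n+1} = x_n - alpha_n * A_n * grad f(x_n)
gd <- function(x0, A_fun, alpha, iters = 100) {
  x <- x0
  for (n in 1:iters) x <- x - alpha * A_fun(x) %*% gf(x)
  drop(x)
}

x0 <- c(2, 2)
gd(x0, A_fun = function(x) diag(2), alpha = 0.1)              # steepest descent: A_n = I
gd(x0, A_fun = function(x) solve(hf(x)), alpha = 1, iters = 10)  # Newton-Raphson
```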
Example
- $f(x, y) = \log(1 + \exp(x - y)) + 2x^2 + y^2 - 2xy - 2x - y$
- $x(\alpha) = (2, 2)^T + \alpha d$
- Direction: $d = -\nabla f(2, 2) = -(2.5, -1.5)^T$
- Blue arrow: $\alpha = 0.50$
- Red arrow: $\alpha = 0.17$
Step size selection
- Constant step size: $\alpha_n = 0.1, 0.01, \ldots$
- Deterministic sequential step size: $\alpha_n = 1/n, 1/\sqrt{n}, \ldots$
- Successive step size reduction (see the sketch below)
  - $\alpha_{nk} = \alpha_{n1} \gamma^{k-1}$, $k = 1, 2, \ldots$, $0 < \gamma < 1$
  - Choose as $\alpha_n$ the first $\alpha_{nk}$ that satisfies $f(x_n + \alpha_{nk} d_n) < f(x_n)$
  - e.g., try $1 \to \tfrac{1}{2} \to \tfrac{1}{4} \to \tfrac{1}{8} \to \cdots$ in order and pick the first step size satisfying the descent condition
  - Armijo condition, curvature condition, ...
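A minimal sketch of successive step size reduction with $\gamma = 1/2$ and initial step $\alpha_{n1} = 1$ (both illustrative choices); it returns the first $\alpha_{nk}$ satisfying the descent condition, with a cap on the number of reductions to keep the loop finite.

```r
# Successive step size reduction: try alpha1, alpha1*gamma, alpha1*gamma^2, ...
# and return the first step that satisfies the descent condition f(x + alpha*d) < f(x)
reduce_step <- function(f, x, d, alpha1 = 1, gamma = 0.5, max_reductions = 50) {
  alpha <- alpha1
  for (k in 1:max_reductions) {
    if (f(x + alpha * d) < f(x)) return(alpha)
    alpha <- alpha * gamma
  }
  alpha                              # fall back to the last (smallest) trial step
}

# Example with the function from the previous slides
f <- function(v) log(1 + exp(v[1] - v[2])) + 2 * v[1]^2 + v[2]^2 - 2 * v[1] * v[2] - 2 * v[1] - v[2]
x_n <- c(2, 2); d_n <- c(-2.5, 1.5)  # d_n = -grad f(2, 2)
reduce_step(f, x_n, d_n)
```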
R practice
- $f(x, y) = \log(1 + \exp(x - y)) + 2x^2 + y^2 - 2xy - 2x - y$
- Steepest descent algorithm
- Step size, 100 iterations (see the sketch below)
  - Constant: $\alpha_n = 1, 0.1, 0.01, 0.001$
  - Sequence: $\alpha_n = n^{-c}$, $c = \tfrac{1}{4}, \tfrac{1}{2}, 1, 2$
  - Size reduction: $\alpha_{nk} = 0.5^{k-1}$
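One possible version of this practice is sketched below: steepest descent on $f$ for 100 iterations with the three step size rules. The starting point $(2, 2)$ is an assumption; the slides do not fix it.

```r
f  <- function(v) log(1 + exp(v[1] - v[2])) + 2 * v[1]^2 + v[2]^2 - 2 * v[1] * v[2] - 2 * v[1] - v[2]
gf <- function(v) {
  s <- plogis(v[1] - v[2])
  c(s + 4 * v[1] - 2 * v[2] - 2, -s + 2 * v[2] - 2 * v[1] - 1)
}

# Steepest descent with a user-supplied step size rule alpha(n, x, d)
steepest <- function(x0, alpha_fun, iters = 100) {
  x <- x0
  for (n in 1:iters) {
    d <- -gf(x)
    x <- x + alpha_fun(n, x, d) * d
  }
  c(x = x, f = f(x))
}

x0 <- c(2, 2)                        # starting point (assumed)

# Constant step sizes (alpha = 1 overshoots and diverges for this f)
sapply(c(1, 0.1, 0.01, 0.001), function(a) steepest(x0, function(n, x, d) a))

# Sequential step sizes alpha_n = n^(-c)
sapply(c(1/4, 1/2, 1, 2), function(cc) steepest(x0, function(n, x, d) n^(-cc)))

# Successive step size reduction: alpha_nk = 0.5^(k-1)
reduce_step <- function(n, x, d) {
  alpha <- 1
  while (f(x + alpha * d) >= f(x) && alpha > 1e-12) alpha <- alpha / 2
  alpha
}
steepest(x0, reduce_step)
```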
Linear regression + square loss
- Model: $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i$, $i = 1, 2, \ldots, n$
- Square loss: $f(\beta_0, \beta_1, \ldots, \beta_k) = \frac{1}{n} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2$
- Matrix form: $Y = X\beta + \epsilon$, $f(\beta) = \frac{1}{n} (Y - X\beta)^T (Y - X\beta)$, where
  $Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$, $X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}$, $\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}$, $\epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$
Linear regression + square loss
- $f(\beta) = \frac{1}{n} (Y - X\beta)^T (Y - X\beta)$
  - Gradient: $\nabla f(\beta) = -\frac{2}{n} X^T (Y - X\beta)$
  - Hessian: $\nabla^2 f(\beta) = \frac{2}{n} X^T X$
  - Exact solution: $\beta^* = (X^T X)^{-1} X^T Y$
- Steepest gradient descent algorithm (see the sketch below)
  - $\beta_{n+1} = \beta_n - \alpha_n \nabla f(\beta_n)$
  - Step size reduction
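A minimal sketch on simulated data (the data themselves are an assumption): steepest descent with step size reduction on the square loss, compared with the exact solution $(X^T X)^{-1} X^T Y$.

```r
set.seed(1)
n <- 200; k <- 3
X <- cbind(1, matrix(rnorm(n * k), n, k))      # design matrix with intercept column
beta_true <- c(1, 2, -1, 0.5)
Y <- drop(X %*% beta_true + rnorm(n))

f_sq  <- function(b) sum((Y - X %*% b)^2) / n
gf_sq <- function(b) -2 / n * drop(t(X) %*% (Y - X %*% b))

beta <- rep(0, k + 1)
for (it in 1:100) {
  d <- -gf_sq(beta)
  alpha <- 1                                   # step size reduction: 1, 1/2, 1/4, ...
  while (f_sq(beta + alpha * d) >= f_sq(beta) && alpha > 1e-12) alpha <- alpha / 2
  beta <- beta + alpha * d
}

cbind(gd = beta, exact = drop(solve(t(X) %*% X, t(X) %*% Y)))
```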
Linear regression + ridge
- Model: $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i$, $i = 1, 2, \ldots, n$
- Ridge loss: $f(\beta_0, \beta_1, \ldots, \beta_k) = \frac{1}{n} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2 + \lambda \sum_{j=1}^k \beta_j^2$
- Note that $f(\beta_0, \tilde{\beta}) = (\bar{y} - \beta_0 - \beta_1 \bar{x}_1 - \cdots - \beta_k \bar{x}_k)^2 + g(\tilde{\beta})$, where
  $g(\tilde{\beta}) = \frac{1}{n} (\tilde{Y} - \tilde{X}\tilde{\beta})^T (\tilde{Y} - \tilde{X}\tilde{\beta}) + \lambda \tilde{\beta}^T \tilde{\beta}$,
  $\tilde{Y} = \begin{pmatrix} y_1 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{pmatrix}$, $\tilde{X} = \begin{pmatrix} x_{11} - \bar{x}_1 & \cdots & x_{1k} - \bar{x}_k \\ \vdots & \ddots & \vdots \\ x_{n1} - \bar{x}_1 & \cdots & x_{nk} - \bar{x}_k \end{pmatrix}$, $\tilde{\beta} = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}$
- Minimizer of $f$ from the minimizer of $g$: take $\beta_1, \ldots, \beta_k$ from the minimizer of $g(\tilde{\beta})$ and set $\beta_0 = \bar{y} - \beta_1 \bar{x}_1 - \cdots - \beta_k \bar{x}_k$
Linear regression + ridge
- $g(\tilde{\beta}) = \frac{1}{n} (\tilde{Y} - \tilde{X}\tilde{\beta})^T (\tilde{Y} - \tilde{X}\tilde{\beta}) + \lambda \tilde{\beta}^T \tilde{\beta}$
  - Gradient: $\nabla g(\tilde{\beta}) = -\frac{2}{n} \tilde{X}^T (\tilde{Y} - \tilde{X}\tilde{\beta}) + 2\lambda \tilde{\beta}$
  - Hessian: $\nabla^2 g(\tilde{\beta}) = \frac{2}{n} \tilde{X}^T \tilde{X} + 2\lambda I$
  - Exact solution: $\tilde{\beta}^* = (\tilde{X}^T \tilde{X} + n\lambda I)^{-1} \tilde{X}^T \tilde{Y}$
- Steepest gradient descent algorithm (see the sketch below)
  - $\tilde{\beta}_{n+1} = \tilde{\beta}_n - \alpha_n \nabla g(\tilde{\beta}_n)$
  - Step size reduction
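The same sketch adapted to the ridge loss on simulated, centered data (the data and $\lambda = 0.5$ are assumptions), compared with the exact solution $(\tilde{X}^T \tilde{X} + n\lambda I)^{-1} \tilde{X}^T \tilde{Y}$.

```r
set.seed(1)
n <- 200; k <- 3; lambda <- 0.5
X <- matrix(rnorm(n * k), n, k)
Y <- drop(X %*% c(2, -1, 0.5) + rnorm(n))
Xc <- scale(X, center = TRUE, scale = FALSE)   # centered predictors
Yc <- Y - mean(Y)                              # centered response

g_ridge  <- function(b) sum((Yc - Xc %*% b)^2) / n + lambda * sum(b^2)
gg_ridge <- function(b) -2 / n * drop(t(Xc) %*% (Yc - Xc %*% b)) + 2 * lambda * b

b <- rep(0, k)
for (it in 1:100) {
  d <- -gg_ridge(b)
  alpha <- 1                                   # step size reduction
  while (g_ridge(b + alpha * d) >= g_ridge(b) && alpha > 1e-12) alpha <- alpha / 2
  b <- b + alpha * d
}

exact <- drop(solve(t(Xc) %*% Xc + n * lambda * diag(k), t(Xc) %*% Yc))
cbind(gd = b, exact = exact)
```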
Linear regression + LASSO
- Model: $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i$, $i = 1, 2, \ldots, n$
- LASSO loss: $f(\beta_0, \beta_1, \ldots, \beta_k) = \frac{1}{n} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2 + \lambda \sum_{j=1}^k |\beta_j|$
- Note that $f(\beta_0, \tilde{\beta}) = (\bar{y} - \beta_0 - \beta_1 \bar{x}_1 - \cdots - \beta_k \bar{x}_k)^2 + g(\tilde{\beta})$, where
  $g(\tilde{\beta}) = \frac{1}{n} (\tilde{Y} - \tilde{X}\tilde{\beta})^T (\tilde{Y} - \tilde{X}\tilde{\beta}) + \lambda \|\tilde{\beta}\|_1$
- Minimizer of $f$ from the minimizer of $g$: take $\beta_1, \ldots, \beta_k$ from the minimizer of $g(\tilde{\beta})$ and set $\beta_0 = \bar{y} - \beta_1 \bar{x}_1 - \cdots - \beta_k \bar{x}_k$
Linear regression + LASSO
- $g(\tilde{\beta}) = \frac{1}{n} (\tilde{Y} - \tilde{X}\tilde{\beta})^T (\tilde{Y} - \tilde{X}\tilde{\beta}) + \lambda \|\tilde{\beta}\|_1$
- Not differentiable at $\beta_j = 0$, so the gradient descent algorithm cannot be used
- Remedies
  - Use the coordinate descent algorithm
  - Use a sub-gradient instead of the gradient!!
LASSO + coordinate descent algorithm
- $h(x) = a x^2 + b x + c|x| + d$, with $a, c > 0$ and $b, d \in \mathbb{R}$
- Exact minimizer (see the sketch below):
  $x^* = \left( -\frac{b}{2a} - \mathrm{sign}\!\left(-\frac{b}{2a}\right) \frac{c}{2a} \right) I\!\left( \left| -\frac{b}{2a} \right| > \frac{c}{2a} \right)$
  - $-\frac{b}{2a} > \frac{c}{2a} > 0 \Rightarrow x^* = -\frac{b}{2a} - \frac{c}{2a} \in \left(0, -\frac{b}{2a}\right)$
  - $0 < -\frac{b}{2a} \le \frac{c}{2a} \Rightarrow x^* = 0$
  - $-\frac{b}{2a} = 0 \Rightarrow x^* = 0$
  - $-\frac{c}{2a} \le -\frac{b}{2a} < 0 \Rightarrow x^* = 0$
  - $-\frac{b}{2a} < -\frac{c}{2a} < 0 \Rightarrow x^* = -\frac{b}{2a} + \frac{c}{2a} \in \left(-\frac{b}{2a}, 0\right)$
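A minimal sketch of this closed-form minimizer (soft-thresholding of $-b/(2a)$), checked numerically with optimize() on arbitrarily chosen coefficients.

```r
# Exact minimizer of h(x) = a*x^2 + b*x + c*|x| + d with a, c > 0:
# soft-thresholding of -b/(2a) at threshold c/(2a)
min_quad_abs <- function(a, b, c) {
  m <- -b / (2 * a)
  (m - sign(m) * c / (2 * a)) * (abs(m) > c / (2 * a))
}

# Numerical check on example coefficients (chosen arbitrarily for illustration)
a <- 1.5; b <- -2; c <- 0.8; d <- 3
h <- function(x) a * x^2 + b * x + c * abs(x) + d
c(closed_form = min_quad_abs(a, b, c),
  numeric     = optimize(h, interval = c(-10, 10))$minimum)
```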
LASSO + coordinate descent algorithm
- $g(\tilde{\beta}) = \frac{1}{n} (\tilde{Y} - \tilde{X}\tilde{\beta})^T (\tilde{Y} - \tilde{X}\tilde{\beta}) + \lambda \|\tilde{\beta}\|_1$
- Expanded (dropping the constant term): $g(\beta_1, \ldots, \beta_k) = \frac{1}{n} \sum_i \sum_j \beta_i \tilde{x}_i^T \tilde{x}_j \beta_j - \frac{2}{n} \sum_j \tilde{x}_j^T \tilde{Y} \beta_j + \lambda \sum_j |\beta_j|$, where $\tilde{x}_j$ is the $j$-th column of $\tilde{X}$
- Along the $\beta_j$ axis (all other coordinates fixed), $g$ has the form $a\beta_j^2 + b\beta_j + c|\beta_j| + \text{const}$ with
  $a = \frac{\tilde{x}_j^T \tilde{x}_j}{n}$, $b = -\frac{2}{n} \tilde{x}_j^T \left( \tilde{Y} - \sum_{i \ne j} \beta_i \tilde{x}_i \right)$, $c = \lambda$
- Coordinate update (see the sketch below):
  $\beta_j^* = \arg\min_{\beta_j} g(\beta_1, \ldots, \beta_k) = \left( -\frac{b}{2a} - \mathrm{sign}\!\left(-\frac{b}{2a}\right) \frac{c}{2a} \right) I\!\left( \left| -\frac{b}{2a} \right| > \frac{c}{2a} \right)$
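A minimal coordinate descent sketch for the LASSO on simulated, centered data (the data, $\lambda = 0.2$, and the number of sweeps are assumptions), applying the closed-form coordinate update above.

```r
set.seed(1)
n <- 200; k <- 5; lambda <- 0.2
X <- matrix(rnorm(n * k), n, k)
Y <- drop(X %*% c(2, -1, 0.5, 0, 0) + rnorm(n))
Xc <- scale(X, center = TRUE, scale = FALSE)
Yc <- Y - mean(Y)

soft <- function(m, t) (m - sign(m) * t) * (abs(m) > t)   # soft-thresholding

b <- rep(0, k)
for (sweep in 1:50) {                        # 50 full sweeps over the coordinates
  for (j in 1:k) {
    r_j <- Yc - Xc[, -j, drop = FALSE] %*% b[-j]          # partial residual
    a   <- sum(Xc[, j]^2) / n
    bb  <- -2 / n * sum(Xc[, j] * r_j)
    b[j] <- soft(-bb / (2 * a), lambda / (2 * a))
  }
}
b                                            # LASSO coefficients (some exactly zero)
```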
Partial quadratic approximation
- $f(x) = g(x) + h(x)$
- Idea
  - $g(x)$: convex with second derivatives
  - $g(x) \approx q(x)$ for $x \in \mathcal{N}(x_n, \delta)$ (a quadratic approximation near $x_n$)
  - $f(x) \approx q(x) + h(x)$
  - $x_{n+1} = \arg\min_{x \in D} \, q(x) + h(x)$
Example: simple logistic regression
- Data: $Y_i \sim B(1, p_i)$, $p_i = \dfrac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}$, $i = 1, 2, \ldots, n$, independent
- $f(\beta) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] + \lambda \|\beta\|_1$
- Partial quadratic approximation + coordinate descent algorithm (see the sketch below)
- [Figure: scatter plot of $Y$ versus $X$]
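The sketch below illustrates only the quadratic approximation step, assuming simulated data and an arbitrary current value of $\beta$: near that point the negative log-likelihood part of $f$ is replaced by a weighted least squares quadratic $q(\beta)$, with weights $p_i(1 - p_i)$ and working response $z_i = \eta_i + (y_i - p_i)/\{p_i(1 - p_i)\}$, which could then be handed to a coordinate descent solver together with the $\ell_1$ penalty.

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))

# Negative average log-likelihood part g(beta) of f(beta); the penalty is handled separately
g <- function(b) mean(log(1 + exp(b[1] + b[2] * x)) - y * (b[1] + b[2] * x))

# Quadratic approximation q(beta) of g(beta) around beta_cur:
# g(beta) ~ (1/(2n)) * sum( w_i * (z_i - beta_0 - beta_1 * x_i)^2 ) + const,
# with w_i = p_i(1 - p_i) and working response z_i = eta_i + (y_i - p_i)/w_i
quad_approx <- function(beta_cur) {
  eta <- beta_cur[1] + beta_cur[2] * x
  p   <- plogis(eta)
  w   <- p * (1 - p)
  z   <- eta + (y - p) / w
  const <- g(beta_cur) - mean(w * (z - eta)^2) / 2   # so that q(beta_cur) = g(beta_cur)
  function(b) mean(w * (z - b[1] - b[2] * x)^2) / 2 + const
}

# Check the approximation at and near the expansion point (beta_cur chosen arbitrarily)
beta_cur <- c(0, 1)
q <- quad_approx(beta_cur)
rbind(g = sapply(list(c(0, 1), c(0.1, 1.1)), g),
      q = sapply(list(c(0, 1), c(0.1, 1.1)), q))
```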
Sub-gradient
- Gradient of $f$ at $x = a$ ($f$: convex and differentiable)
  - $f(x) \ge f(a) + \nabla f(a)^T (x - a)$, $\forall x \in D$
- Sub-gradient of $f$ at $x = a$ ($f$: convex)
  - $f(x) \ge f(a) + s^T (x - a)$, $\forall x \in D$
  - $s$: sub-gradient at $x = a$
- $f$ convex $\Rightarrow$ at least one sub-gradient exists; if $f$ is also differentiable, $s = \nabla f(a)$ is the unique sub-gradient
Example: $f(x) = |x|$
- $|x| \ge |a| + s(x - a)$, $s = ?$
  - $a > 0 \Rightarrow s = 1$
  - $a < 0 \Rightarrow s = -1$
  - $a = 0 \Rightarrow |s| \le 1$
- $s = \mathrm{sign}(a)$ is a sub-gradient
Sub-gradient descent algorithm
- Gradient descent $x_{n+1} = x_n - \alpha_n \nabla f(x_n)$, with the gradient replaced by a sub-gradient
- Sub-gradient descent algorithm (see the sketch below): $x_{n+1} = x_n - \alpha_n s_n$
  - $s_n$: any sub-gradient of $f$ at $x = x_n$
  - The descent condition may fail to hold
  - Step size: constant or sequence
  - Stopping rule: $0 < f_n^{best} - f_{n+1}^{best} < \epsilon$, where $f_n^{best} = \min_{i \le n} f(x_i)$
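A minimal sketch on the example $f(x) = |x|$ from the previous slide, with $s_n = \mathrm{sign}(x_n)$ and $\alpha_n = 1/n$; the starting point, step sizes, and fixed iteration count are illustrative choices. It shows that the descent condition can fail while the best value $f_n^{best}$ still approaches the minimum.

```r
# Sub-gradient descent on f(x) = |x|, starting from x = 2 with step size alpha_n = 1/n
f <- function(x) abs(x)
s <- function(x) sign(x)            # a valid sub-gradient of |x| (s = 0 at x = 0)

x <- 2
f_path <- f_best <- f(x)
for (n in 1:200) {
  x <- x - (1 / n) * s(x)
  f_path <- c(f_path, f(x))
  f_best <- c(f_best, min(tail(f_best, 1), f(x)))   # f_n^best = min_{i <= n} f(x_i)
}

any(diff(f_path) > 0)               # TRUE: the descent condition can fail
tail(f_best, 1)                     # best value found, close to the minimum value 0
```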
Linear regression + elastic net
- Centered data: $y_i = \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i$, $i = 1, 2, \ldots, n$
- Ridge: $f(\beta) = \frac{1}{n} \sum_{i=1}^n (y_i - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2 + \lambda \sum_{j=1}^k \beta_j^2$
- LASSO: $f(\beta) = \frac{1}{n} \sum_{i=1}^n (y_i - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2 + \lambda \sum_{j=1}^k |\beta_j|$
- Elastic net: $f(\beta) = \frac{1}{n} \sum_{i=1}^n (y_i - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2 + \lambda \left[ \gamma \sum_{j=1}^k |\beta_j| + (1 - \gamma) \sum_{j=1}^k \beta_j^2 \right]$, $0 \le \gamma \le 1$
  - $\gamma = 0$: ridge, $\gamma = 1$: LASSO
Linear regression + elastic net
- Sub-gradient of $f(\beta) = \frac{1}{n} (Y - X\beta)^T (Y - X\beta) + \lambda \left[ \gamma \|\beta\|_1 + (1 - \gamma) \beta^T \beta \right]$ (see the sketch below):
  $s = -\frac{2}{n} X^T (Y - X\beta) + \lambda \left[ \gamma\, \mathrm{sign}(\beta) + 2(1 - \gamma) \beta \right]$
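A minimal sub-gradient descent sketch for the elastic net on simulated, centered data; the data, $\lambda = 0.2$, $\gamma = 0.5$, the constant step size, and the iteration count are all assumptions for the illustration. The best objective value is tracked as on the sub-gradient descent slide.

```r
set.seed(1)
n <- 200; k <- 5
X <- matrix(rnorm(n * k), n, k)
X <- scale(X, center = TRUE, scale = FALSE)
Y <- drop(X %*% c(2, -1, 0.5, 0, 0) + rnorm(n)); Y <- Y - mean(Y)
lambda <- 0.2; gamma <- 0.5

f_en <- function(b) sum((Y - X %*% b)^2) / n +
  lambda * (gamma * sum(abs(b)) + (1 - gamma) * sum(b^2))
s_en <- function(b) -2 / n * drop(t(X) %*% (Y - X %*% b)) +      # sub-gradient
  lambda * (gamma * sign(b) + 2 * (1 - gamma) * b)

b <- rep(0, k)
f_best <- f_en(b)
for (m in 1:500) {
  b <- b - 0.1 * s_en(b)               # constant step size (illustrative choice)
  f_best <- min(f_best, f_en(b))       # track the best objective value so far
}
list(beta = round(b, 3), f_best = f_best)
```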
Stochastic gradient descent algorithm
- Optimization problems in statistical applications
  - Objective: $f(\beta) = \frac{1}{n} \sum_{i=1}^n L(\beta; y_i, x_{i1}, \ldots, x_{ik}) \approx E\left[ L(\beta; y, x_1, \ldots, x_k) \right]$
    - empirical risk function: an average over the sample
  - $\nabla f(\beta) = \frac{1}{n} \sum_{i=1}^n \nabla L(\beta; y_i, x_{i1}, \ldots, x_{ik}) \approx E\left[ \nabla L(\beta; y, x_1, \ldots, x_k) \right]$: also an average
- When $n$ is large (see the sketch below)
  - (batch) whole sample: $\beta_{m+1} = \beta_m - \alpha_m \frac{1}{n} \sum_{i=1}^n \nabla L(\beta_m; y_i, x_{i1}, \ldots, x_{ik})$
  - (stochastic) one random sample: $\beta_{m+1} = \beta_m - \alpha_m \nabla L(\beta_m; y_i, x_{i1}, \ldots, x_{ik})$
  - (mini-batch) several random samples: $\beta_{m+1} = \beta_m - \alpha_m \frac{1}{|S|} \sum_{i \in S} \nabla L(\beta_m; y_i, x_{i1}, \ldots, x_{ik})$
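A minimal mini-batch SGD sketch for the square loss $L(\beta; y_i, x_i) = (y_i - x_i^T \beta)^2$ on simulated data; the data, batch size 20, step sizes $\alpha_m = 0.1/\sqrt{m}$, and iteration count are assumptions. The result should land near the least squares solution.

```r
set.seed(1)
n <- 5000; k <- 3
X <- cbind(1, matrix(rnorm(n * k), n, k))
beta_true <- c(1, 2, -1, 0.5)
Y <- drop(X %*% beta_true + rnorm(n))

# Average gradient of the square loss over an index set S:
# (1/|S|) * sum_{i in S} grad L(beta; y_i, x_i), with L(beta; y_i, x_i) = (y_i - x_i' beta)^2
grad_batch <- function(beta, S) {
  XS <- X[S, , drop = FALSE]
  -2 * drop(t(XS) %*% (Y[S] - XS %*% beta)) / length(S)
}

beta <- rep(0, k + 1)
batch_size <- 20
for (m in 1:2000) {
  S <- sample(n, batch_size)             # mini-batch of random sample indices
  beta <- beta - (0.1 / sqrt(m)) * grad_batch(beta, S)
}

cbind(sgd = round(beta, 3), ols = round(drop(solve(t(X) %*% X, t(X) %*% Y)), 3))
```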