Lecture 4: Regularizers, basis functions and cross-validation (Oxford machine learning)

Regularization

답을 예측하는 방식은 항상 다음과 같았다.

$\hat{\theta}=(\mathrm{X}^T\mathrm{X})^{-1}\mathrm{X}^T\mathrm{y}$

They require the inversion of $\mathrm{X}^T\mathrm{X}$ . This can lead to problems if the system of equations is poorly conditioned. A solution is to add a small element to the diagonal:

$\hat{\theta}=(\mathrm{X}^T\mathrm{X}+\delta^2\mathrm{I}_d)^{-1}\mathrm{X}^T\mathrm{y}$

This is the ridge regression estimate. It is the solution to the following regularized quadratic cost function

$J(\theta)=(\mathrm{y}-\mathrm{X}\theta)^T(\mathrm{y}-\mathrm{X}\theta)+\delta^2\theta^T\theta$

이때의 $\delta$ 를 tuning parameter라고 부른다.

Derivation

$\begin{align} \frac{\partial J(\theta)}{\partial \theta}&=\frac{\partial}{\partial\theta}(\theta^T\mathrm{X}^T\mathrm{X}\theta -2 \mathrm{y}^T\mathrm{X}\theta+\mathrm{y}^T\mathrm{y}+\delta^2\theta^T\theta)\\ &=2\mathrm{X}^T\mathrm{X}\theta-2\mathrm{X}^T\mathrm{y}+2\delta^2I{\theta}\\ &=2(\mathrm{X}^T\mathrm{X}+\delta^2I)\theta-2\mathrm{X}^T\mathrm{y}\\ &=0 \\ \hat{\theta}_{ridge}&=(\mathrm{X}^T\mathrm{X}+\delta^2I)\mathrm{X}^T\mathrm{y} \end{align}$

샘플 사이즈가 제한 되어 있을 경우에는 MLE의 결과값이 overfitting하는 경향을 보인다. 때문에 이를 해결해 주기 위하여 Euclidean norm을 넣어준다.

Ridge regression as constrained optimization

$J(\theta)$ 를 최소화 하는 것은 $\theta^T\theta<t(\delta)$ 를 만족하는 $\theta$ 에 대해서 $(\mathrm{y}-\mathrm{X}\theta)^T(\mathrm{y}-\mathrm{X}\theta)$ 를 최소화하는 문제와 같다.

즉,

$J(\theta)=(\mathrm{y}-\mathrm{X}\theta)^T(\mathrm{y}-\mathrm{X}\theta)+\delta^2\theta^T\theta \equiv \min_{\theta : \theta^T\theta \leq t(\delta)}\{(\mathrm{y}-\mathrm{X}\theta)^T(\mathrm{y}-\mathrm{X}\theta)\}$

이때 $\theta^T=\begin{bmatrix}{\theta_1 \ \theta_2}\end{bmatrix}$ 라고 하자. 이때 $\theta^T\theta = \theta_1^2+\theta_2^2$ 가 일정하다고 가정하면 원점을 중심으로 하는 원을 그릴 수 있다.

(OLS란 Ordinary least squares라는 뜻이며, 기존의 least square 식을 뜻한다.)

빨간 타원의 중심은 $\theta_{ML}$ 이 될 것이다.

OLS에서는 최대한 값은 $\theta$ 값을 찾으려 할 것이다. 하지만 이 Ridge regression에서는 shrinkage penalty가 존재한다. 이 값은 원점에서 멀어질 수록 큰 값을 가지게 된다. 그래서 이 두 항을 조절해주는 $\delta$ 가 중요한 역할을 한다.

만약 $\delta^2=0$ 일 경우에는 빨간 원의 중심이 가장 optimal한 해가 될 것이다.

만약 $\delta^2 \to \infty$ 일 경우에는 $\theta^T\theta$ 를 0으로 만드는 것이 가장 효율적이므로 파란 원의 중심이 가장 optimal한 해가 될 것이다.

즉 $\delta$ 값과 $\theta^T\theta$ 의 값은 반비례 관계를 가질 것이다.

이때의 최적의 $\delta$ 값은 빨간 타원과 파란 원의 접점에 위치하게 될 것이다.

Lasso regression

위의 예시에서는 원의 형태로 경계면을 설정하였다.

Lasso에서는 $\left\vert \theta_1 \right\vert + \left\vert \theta_2 \right\vert$ 가 일정하다고 생각할 것이다.

즉 다시말해 Ridge가 $L_2$ norm을 사용하는데 반하여 Lasso는 $L_1$ norm을 사용할 것이다.

Lasso는 경계가 구가 아니라 다각형이기 때문에 한 축의 계수가 0인 지점을 더 쉽게 확인 할 수 있다.

이는 몇몇 변수를 완전히 제외시키는 것으로 이는 더 높은 효율을 낼 수 있도록 한다.

Ridge regression and Maximum a Posteriori (MAP) learning

$J(\theta)=(\mathrm{y}-\mathrm{X}\theta)^T(\mathrm{y}-\mathrm{X}\theta)+\delta^2\theta^T\theta$ $\begin{align} P(y\vert \mathrm{X},\theta)&=\frac{1}{z}e^{-E(\theta,\mathrm{X},y)}\\ P(\theta)&=\frac{1}{z_2}e^{-\delta^2\theta^T\theta} \end{align}$ $P(\theta \vert\mathrm{X},y)=\max_\theta \text{const } P(y\vert\mathrm{X},\theta)P(\theta) = \max_\theta\frac{P(y\vert\mathrm{X},\theta)}{P(y\vert\mathrm{X})}$

MAP의 기본적인 생각은 형태를 선택해주면 그 선택에 해당하는 가장 큰 가능성을 가지는 결과를 찾는 것이다. 때문에 형태가 실제 데이터에 적합하다면 높은 결과를 가져오지만 그렇지 않다면 오히려 낮은 결과를 가져오게 된다. MLE에서는 아무런 근거가 없기 때문에 MAP과는 확연히 다르다.

실제로 주어진 데이터들의 꼴은 $f(\mathrm{X} \vert \theta)$ 이다. 이를 이를 바꾸는 공식은 Bayes’ Theorem이라고 불린다.

Going nonlinear via basis functions

이제는 nonlinear한 경우를 생각해보자.

이때의 식은 다음과 같을 것이다.

$y(\mathbb{x})=\phi(\mathbb{x})\theta+\epsilon$

그리고 이때의 예측값 $\hat{\theta}$ 는 다음과 같이 주어질 것이다.

$\hat{\theta}=(\phi^T\phi+\delta^2I)^{-1}\phi^Ty$

예를 들어 $\phi(x)=[ 1, x,x^2]$ 이라고 하자.

그렇다면

$\begin{align} \hat{y}&=\phi(x)\theta+\epsilon \\ &=\theta_0+x\theta_1+x^2\theta_2+\epsilon \end{align}$

이 될 것이다.

이 경우에도 앞에서 했던 방식과 같은 방식으로 진행을 해 줄 수 있다.

Effect of data

우리가 만약 정확한 모델을 가졌다면 test값과 train의 노이즈가 비슷하고 일정할 것이다.

너무 간단한 모델을 가졌다면 test, train값들에 노이즈가 심할 것이고 bias가 크게 생길 것이다.

너무 복잡한 모델을 가졌다면 트레이닝을 진행함에 있어서 train의 값이 떨어지지 않고 다시 올라가는 지점이 생길 것이다.

너무 복잡한 모델을 가졌을 때에는 Regression을 적절하게 사용하는 것도 좋다. $\delta$ 값에 따라서 어떤 $\theta$ 값들은 없어질 수도 있으며 이는 전체적인 그래프를 부드럽게 만들어준다.

Kernel regression and RBFs

우리는 feature로 kernel이나 radial basis function (RBFs)를 이용할 수 있다.

Kernel trick 이란 고차원으로 매핑을 할 수 있도록 제작된 배열 Kernel을 이용하여 더 높은 차원에서 러닝이 진행되는 효과를 내도록 하는 방법이다.

$\phi(x)=[\kappa(x,\mu_1,\lambda),\dots,\kappa(x,\mu_d,\lambda)], \text{e.g.}\quad \kappa(x,\mu_i,\lambda)=e^{(-\frac{1}{\lambda}\Vert x-\mu_i\Vert^2)}$

그렇다면 예측값은

$\hat{y}(x_i)=\phi(x_i)\theta=1\theta_0+\kappa(x_i,\mu_1,\lambda)\theta_1+\dots+\kappa(x_i,\mu_d,\lambda)\theta_d$

각각을 행렬 $\theta$ , $\Phi$ 로 관리 했을 때의 결과는 다음과 같이 나올 것이다.

$\hat{Y}=\Phi\theta$

이때의 optimal theta값은

$\hat{\theta}_{ls}=(\Phi^T\Phi)^{-1}\Phi^Ty$

또는

$\hat{\theta}_{ridge}=(\Phi^T\Phi+\delta^2I)^{-1}\Phi^Ty$

이는 여전히 linear regression인 것을 나타낸다.

Cross-validation

항상 모델은 테스트 되어야 한다. 그래야 우리는 무엇이 잘 러닝 되었는지에 대해서 판단할 수 있다. 다음과 같은 순서로 진행이 된다.

주어진 트레이닝 데이터 $(X_{train},Y_{train})$ 과 예측할 $\delta^2$ 에 대해서 , $\hat{\theta}$ 를 계산한다.
트레이닝 셋을 계산한다.
테스트 셋을 예측해본다.

K-fold crossvalidation

training data를 K 개의 folds로 나눈다. 그리고 각각의 fold를 테스트 데이터로 사용하여 cross-validation을 진행한다. 일반적으로 $K=5$ 를 사용하며 이때는 5-fold CV라 불린다.

만약 $K=N$ 인 경우에 특수하게 이것을 leave-one out cross validation이나 LOOCV라고 불린다.

[References]

[1] 정규화 선형회귀 (데이터 사이언스 스쿨) link

[2] Go’s BLOG - [ISL] 6장 -Lasso, Ridge, PCR이해하기 link

[3] SanghyukChun’s Blog -Machine Learning 스터디 (2) Probability Theory link