 # Implementation of "glmnet" with C++

Created: 7/30/2021, 8:27:00 AM
Tags: Programming, C++

The goal of this post is to reproduce the glmnet package using C++.

## 0. Standardization

"glmnet" standardizes each variable using the standard deviation computed with denominator $n$ (i.e., scaling by $\sqrt{\frac{1}{n}\sum_i (x_{ij} - \bar{x}_j)^2}$) instead of $n-1$.
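As a minimal sketch, this convention can be implemented as follows (the function name `standardize_glmnet` is an illustrative choice, not part of glmnet):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standardize a variable the way glmnet does: center by the mean and
// scale by the standard deviation with denominator n (not n - 1).
std::vector<double> standardize_glmnet(const std::vector<double>& x) {
    const std::size_t n = x.size();
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;
    double ss = 0.0;
    for (double v : x) ss += (v - mean) * (v - mean);
    const double sd = std::sqrt(ss / n);  // denominator n, not n - 1
    std::vector<double> z(n);
    for (std::size_t i = 0; i < n; ++i) z[i] = (x[i] - mean) / sd;
    return z;
}
```

With this scaling, a standardized column satisfies $\sum_i z_i = 0$ and $\sum_i z_i^2 = n$ (rather than $n-1$).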

## 1. Linear regression

### Theory

We can consider a linear regression framework:
$y_i = \boldsymbol{x}_i^T \boldsymbol{\beta} + \epsilon_i.$
Least squares minimizes $\| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2$, where $\| \cdot \|_2$ denotes the $\ell_2$ norm.
However, in several cases, such as logistic regression in generalized linear models, an explicit solution is not available. To handle this problem, we can use iterative approximation algorithms such as the Newton-Raphson and Fisher-scoring methods.
As an alternative, we can use the coordinate descent algorithm, which optimizes the objective function with respect to one variable at a time while the others are held fixed.
For example, suppose five variables are associated with a continuous response. Starting from some initial values, cyclical coordinate descent (CCD) estimates the coefficient of the first predictor with the others held fixed. It then estimates the second coefficient using the updated first coefficient, and so on.

### Derivation

The coordinate-wise estimate in linear regression is derived as follows:
\begin{aligned} L &= \frac{1}{2} \| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2 \\ &= \frac{1}{2} \sum_i{( y_i - \sum_j x_{ij} \beta_j )^2} \\ \frac{\partial L}{\partial \beta_k} &= -\sum_i x_{ik} (y_i - \sum_j x_{ij} \beta_j ) \\ &= - \sum_i x_{ik} (y_i - \sum_{j \ne k} x_{ij} \beta_j - x_{ik} \beta_k ) \\ &= -\boldsymbol{x_k}^T (\boldsymbol{y} - \boldsymbol{X}_{(-k)}\boldsymbol{\beta_{(-k)}}) + \boldsymbol{x}_k^T \boldsymbol{x}_k \beta_k \\ \therefore \hat{\beta}_k &= (\boldsymbol{x_k}^T\boldsymbol{x_k})^{-1}\boldsymbol{x_k}^T(\boldsymbol{y} - \boldsymbol{X_{(-k)}}\boldsymbol{\beta_{(-k)}}) \end{aligned}

### Code
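A minimal coordinate descent sketch built directly on the update above; `cd_least_squares` and its iteration/tolerance parameters are illustrative choices:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cyclical coordinate descent for least squares, applying
//   beta_k = (x_k^T x_k)^{-1} x_k^T (y - X_{(-k)} beta_{(-k)})
// one coordinate at a time. X is stored column-wise; the residual is
// updated incrementally so each coordinate update costs O(n).
std::vector<double> cd_least_squares(const std::vector<std::vector<double>>& X,
                                     const std::vector<double>& y,
                                     int max_iter = 1000, double tol = 1e-10) {
    const std::size_t n = y.size(), p = X.size();
    std::vector<double> beta(p, 0.0);
    std::vector<double> r = y;  // residual y - X*beta (beta starts at 0)
    for (int it = 0; it < max_iter; ++it) {
        double max_change = 0.0;
        for (std::size_t k = 0; k < p; ++k) {
            double xx = 0.0, xr = 0.0;
            for (std::size_t i = 0; i < n; ++i) {
                xx += X[k][i] * X[k][i];
                // x_k^T (partial residual): add x_k's own contribution back
                xr += X[k][i] * (r[i] + X[k][i] * beta[k]);
            }
            const double beta_new = xr / xx;
            const double delta = beta_new - beta[k];
            for (std::size_t i = 0; i < n; ++i) r[i] -= X[k][i] * delta;
            beta[k] = beta_new;
            max_change = std::max(max_change, std::abs(delta));
        }
        if (max_change < tol) break;
    }
    return beta;
}
```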

## 2. The Lasso

### The underlying model

Using a Lagrange multiplier, the lasso solves the problem with the constraint $\| \boldsymbol{\beta} \|_1 < t$ by minimizing
$\frac{1}{2N} \| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2 + \lambda \| \boldsymbol{\beta} \|_1,$
where $\| \cdot \|_1$ indicates the $\ell_1$ norm.

### Derivation

\begin{aligned} L &= \frac{1}{2N} \| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2 + \lambda \| \boldsymbol{\beta} \|_1 \\ &= \frac{1}{2N} \sum_i (y_i - \sum_j x_{ij} \beta_j )^2 + \lambda \sum_{j=1}^p | \beta_j | \\ &= \frac{1}{2N} \sum_i (y_i - \sum_j x_{ij} \beta_j )^2 + \lambda \sum_{j=1}^p \beta_j sgn(\beta_j) \\ \frac{\partial L}{\partial \beta_k} &= -\frac{1}{N} \sum_i x_{ik} (y_i - \sum_{j \ne k} x_{ij} \beta_j - x_{ik} \beta_k ) + \lambda sgn(\beta_k) \\ &= -\frac{1}{N} \boldsymbol{x}_k^T (\boldsymbol{y} - \boldsymbol{X}_{(-k)}\boldsymbol{\beta}_{(-k)}) + \frac{1}{N} \boldsymbol{x}_k^T \boldsymbol{x}_k \beta_k + \lambda sgn(\beta_k) \end{aligned}
if $\beta_k > 0,$
$\hat{\beta_k} = (\boldsymbol{x}_k^T \boldsymbol{x}_k)^{-1}\{ \boldsymbol{x}_k^T(\boldsymbol{y} - \boldsymbol{X}_{(-k)}\boldsymbol{\beta}_{(-k)}) - N\lambda \}_+$
if $\beta_k < 0,$
\begin{aligned} \hat{\beta_k} &= (\boldsymbol{x}_k^T \boldsymbol{x}_k)^{-1}\{ \boldsymbol{x}_k^T(\boldsymbol{y} - \boldsymbol{X}_{(-k)}\boldsymbol{\beta}_{(-k)}) + N\lambda \} \\ &= (\boldsymbol{x}_k^T \boldsymbol{x}_k)^{-1}\{ -\boldsymbol{x}_k^T(\boldsymbol{y} - \boldsymbol{X}_{(-k)}\boldsymbol{\beta}_{(-k)}) - N\lambda \}_+ sgn(\beta_k) \end{aligned}
$\therefore \hat{\beta}_k = (\boldsymbol{x}_k^T \boldsymbol{x}_k)^{-1} \{ |\boldsymbol{x}_k^T (\boldsymbol{y} - \boldsymbol{X}_{(-k)} \boldsymbol{\beta}_{(-k)})| - N\lambda \}_+ sgn(\beta_k)$

### Code
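The soft-thresholding form of the update above can be sketched as follows; `soft_threshold` and `cd_lasso` are illustrative names:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Soft-thresholding operator S(z, g) = sgn(z) (|z| - g)_+.
double soft_threshold(double z, double g) {
    if (z > g)  return z - g;
    if (z < -g) return z + g;
    return 0.0;
}

// Lasso via cyclical coordinate descent, using the closed-form update
//   beta_k = (x_k^T x_k)^{-1} ( |x_k^T (y - X_{(-k)} beta_{(-k)})| - N*lambda )_+ sgn(.),
// i.e. a soft-threshold of the univariate least-squares update.
std::vector<double> cd_lasso(const std::vector<std::vector<double>>& X,
                             const std::vector<double>& y, double lambda,
                             int max_iter = 1000, double tol = 1e-10) {
    const std::size_t n = y.size(), p = X.size();
    std::vector<double> beta(p, 0.0), r = y;  // r is the residual y - X*beta
    for (int it = 0; it < max_iter; ++it) {
        double max_change = 0.0;
        for (std::size_t k = 0; k < p; ++k) {
            double xx = 0.0, xr = 0.0;
            for (std::size_t i = 0; i < n; ++i) {
                xx += X[k][i] * X[k][i];
                xr += X[k][i] * (r[i] + X[k][i] * beta[k]);  // x_k^T (partial residual)
            }
            const double beta_new = soft_threshold(xr, n * lambda) / xx;
            const double delta = beta_new - beta[k];
            for (std::size_t i = 0; i < n; ++i) r[i] -= X[k][i] * delta;
            beta[k] = beta_new;
            max_change = std::max(max_change, std::abs(delta));
        }
        if (max_change < tol) break;
    }
    return beta;
}
```

With $\lambda = 0$ this reduces to the least-squares update of Section 1.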

## 3. Ridge

### Derivation

\begin{aligned} L & = \frac{1}{2N} \parallel y-X\beta \parallel_2^2 + \frac{\lambda}{2} \parallel \beta \parallel_2^2 \\ & = \frac{1}{2N} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^p x_{ij} \beta_j \right)^2 + \frac{\lambda}{2} \sum_{j=1}^p \beta_j^2 \\ \frac{ \partial L }{\partial \beta_k} & = - \frac{1}{N} \sum_{i=1}^n x_{ik} \left( y_i - \sum_{j\ne k}x_{ij} \beta_j - x_{ik}\beta_k \right) + \lambda \beta_k \\ & = - \frac{1}{N} x_k^T(y-X_{(-k)}\beta_{(-k)} ) + \frac{1}{N} x_k^Tx_k \beta_k + \lambda \beta_k \\ \therefore \hat{\beta}_k & = ( x_k^T x_k + N \lambda )^{-1}x_k^T (y-X_{(-k)}\beta_{(-k)}). \end{aligned}

### Code
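A sketch of the ridge coordinate update, using the closed form above; `cd_ridge` is an illustrative name:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Ridge regression via cyclical coordinate descent, using
//   beta_k = (x_k^T x_k + N*lambda)^{-1} x_k^T (y - X_{(-k)} beta_{(-k)}).
// X is stored column-wise; the residual r = y - X*beta is kept up to date.
std::vector<double> cd_ridge(const std::vector<std::vector<double>>& X,
                             const std::vector<double>& y, double lambda,
                             int max_iter = 1000, double tol = 1e-10) {
    const std::size_t n = y.size(), p = X.size();
    std::vector<double> beta(p, 0.0), r = y;
    for (int it = 0; it < max_iter; ++it) {
        double max_change = 0.0;
        for (std::size_t k = 0; k < p; ++k) {
            double xx = 0.0, xr = 0.0;
            for (std::size_t i = 0; i < n; ++i) {
                xx += X[k][i] * X[k][i];
                xr += X[k][i] * (r[i] + X[k][i] * beta[k]);  // x_k^T (partial residual)
            }
            const double beta_new = xr / (xx + n * lambda);  // shrink the denominator
            const double delta = beta_new - beta[k];
            for (std::size_t i = 0; i < n; ++i) r[i] -= X[k][i] * delta;
            beta[k] = beta_new;
            max_change = std::max(max_change, std::abs(delta));
        }
        if (max_change < tol) break;
    }
    return beta;
}
```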

## 4. Elastic-net

### Derivation

\begin{aligned} L & = \frac{1}{2N} \parallel y-X\beta \parallel_2^2 + \lambda \left( \frac{1-\alpha}{2} \parallel \beta \parallel_2^2 + \alpha \parallel \beta \parallel _1 \right) \\ & = \frac{1}{2N} \sum_{i=1}^n \left( y_i - \sum_{j=1}^px_{ij}\beta_j \right)^2 + \lambda \left( \frac{1-\alpha}{2} \sum_{j=1}^p \beta_j^2 + \alpha \sum_{j=1}^p | \beta_j | \right) \\ \frac{\partial L}{\partial \beta_k} & = - \frac{1}{N} \sum_{i=1}^n x_{ik} \left( y_i - \sum_{j\ne k}^p x_{ij} \beta_j - x_{ik} \beta_k \right) + \lambda \left( (1-\alpha)\beta_k + \alpha sgn(\beta_k) \right) \\ & = - \frac{1}{N} x_k^T (y - X_{(-k)} \beta_{(-k)}) + \frac{1}{N} x_k^T x_k \beta_k + \lambda \left( (1-\alpha)\beta_k + \alpha sgn(\beta_k) \right) \\ \\ \therefore \hat{\beta_k} & = \frac{ \left( | x_k^T(y-X_{(-k)}\beta_{(-k)}) | - N \lambda \alpha \right )_+ sgn(\beta_k) }{x_k^T x_k + N \lambda(1-\alpha) } \end{aligned}
Sequence of $\lambda$:
\begin{aligned} \lambda_{max} &= \underset{\ell}{\max} \left( \frac{1}{n \alpha} \lvert x_\ell^T y \rvert \right) \\ \lambda_{min} &= \epsilon \lambda_{max}, \quad \text{where } \epsilon = \begin{cases} 0.0001 & \text{if } n > p, \\ 0.01 & \text{otherwise.} \end{cases} \end{aligned}
Reference: https://stats.stackexchange.com/questions/166630/glmnet-compute-maximal-lambda-value
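The $\lambda$ sequence above can be sketched as follows, assuming centered predictors. glmnet generates 100 values decreasing log-linearly from $\lambda_{max}$ to $\lambda_{min}$; the function name `lambda_sequence` is an illustrative choice:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Build the lambda path: lambda_max = max_l |x_l^T y| / (n * alpha),
// then K values decreasing log-linearly to eps * lambda_max.
std::vector<double> lambda_sequence(const std::vector<std::vector<double>>& X,
                                    const std::vector<double>& y,
                                    double alpha, int K = 100) {
    const std::size_t n = y.size(), p = X.size();
    double lam_max = 0.0;
    for (std::size_t k = 0; k < p; ++k) {
        double xy = 0.0;
        for (std::size_t i = 0; i < n; ++i) xy += X[k][i] * y[i];
        lam_max = std::max(lam_max, std::abs(xy) / (n * alpha));
    }
    // glmnet's default lambda.min.ratio: 1e-4 when n > p, 1e-2 otherwise
    const double eps = (n > p) ? 1e-4 : 1e-2;
    std::vector<double> lambdas(K);
    for (int j = 0; j < K; ++j)
        lambdas[j] = lam_max * std::pow(eps, static_cast<double>(j) / (K - 1));
    return lambdas;
}
```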

### Implementation in C++ (Standardized predictor with intercept)

Code for the standardized predictor with intercept.
Code for the full version, which has been verified only with centered predictors.
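The core elastic-net update can be sketched as below, in a simplified form that assumes centered predictors and no intercept (it omits the standardization and intercept handling referred to above); `cd_enet` and its parameters are illustrative names:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Elastic-net coordinate descent for centered predictors, following
//   beta_k = ( |x_k^T r_k| - N*lambda*alpha )_+ sgn(.)
//            / ( x_k^T x_k + N*lambda*(1-alpha) ),
// where r_k is the partial residual. alpha = 1 recovers the lasso,
// alpha = 0 recovers ridge.
double soft_thresh(double z, double g) {
    if (z > g)  return z - g;
    if (z < -g) return z + g;
    return 0.0;
}

std::vector<double> cd_enet(const std::vector<std::vector<double>>& X,
                            const std::vector<double>& y,
                            double lambda, double alpha,
                            int max_iter = 1000, double tol = 1e-10) {
    const std::size_t n = y.size(), p = X.size();
    std::vector<double> beta(p, 0.0), r = y;  // residual y - X*beta
    for (int it = 0; it < max_iter; ++it) {
        double max_change = 0.0;
        for (std::size_t k = 0; k < p; ++k) {
            double xx = 0.0, xr = 0.0;
            for (std::size_t i = 0; i < n; ++i) {
                xx += X[k][i] * X[k][i];
                xr += X[k][i] * (r[i] + X[k][i] * beta[k]);  // x_k^T (partial residual)
            }
            const double beta_new =
                soft_thresh(xr, n * lambda * alpha) / (xx + n * lambda * (1.0 - alpha));
            const double delta = beta_new - beta[k];
            for (std::size_t i = 0; i < n; ++i) r[i] -= X[k][i] * delta;
            beta[k] = beta_new;
            max_change = std::max(max_change, std::abs(delta));
        }
        if (max_change < tol) break;
    }
    return beta;
}
```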

## 5. Block-wise coordinate descent (for multi-response elastic-net)

### The underlying model

Let us consider a framework to simulate multivariate multiple regression:
$Y = XB+W,$
where $W \sim MVN(0, \Sigma_Y)$, and $\Sigma_Y$ is an equicorrelation matrix with $\rho_y=0, ~0.2,~ 0.5, ~0.8$.
The sequence of $\lambda$ is given by
$\lambda_{max} = \underset{\ell}{\max} \left( \frac{1}{n\alpha}\parallel x_\ell^T Y \parallel_2 \right).$

### Derivation of multi-response elastic-net

\begin{aligned} L & = \frac{1}{2N} \parallel Y-XB \parallel_F^2 + \lambda \left( \frac{1-\alpha}{2} \parallel B \parallel_F^2 + \alpha \sum_{j=1}^p \parallel B_{j\cdot} \parallel_2 \right) \\ & = \frac{1}{2N} \sum_{i=1}^n \parallel y_{i\cdot} - \sum_{j=1}^p x_{ij} B_{j\cdot} \parallel_2^2 + \lambda \left( \frac{1-\alpha}{2} \sum_{j=1}^p \parallel B_{j\cdot} \parallel_2^2 + \alpha \sum_{j=1}^p \parallel B_{j\cdot} \parallel_2 \right) \\ \frac{\partial L}{\partial B_{k\cdot}} & = - \frac{1}{N} \sum_{i=1}^n x_{ik} \left( y_{i\cdot} - \sum_{j\ne k} x_{ij} B_{j\cdot} - x_{ik} B_{k\cdot} \right) + \lambda \left( (1-\alpha)B_{k\cdot} + \alpha \frac{B_{k\cdot}}{\parallel B_{k\cdot} \parallel_2} \right) \\ & = - \frac{1}{N} x_k^T ( Y - X_{(-k)} B_{(-k)} ) + \frac{1}{N} x_k^T x_k B_{k\cdot} + \lambda (1-\alpha) B_{k\cdot} + \lambda \alpha \frac{B_{k\cdot}}{\parallel B_{k\cdot} \parallel_2} \end{aligned}
At a stationary point, $B_{k\cdot}$ is a positive multiple of $x_k^T ( Y - X_{(-k)} B_{(-k)} )$, so the unit direction $B_{k\cdot} / \parallel B_{k\cdot} \parallel_2$ can be replaced by $x_k^T (Y-X_{(-k)}B_{(-k)}) / \parallel x_k^T( Y-X_{(-k)}B_{(-k)} ) \parallel_2$. Setting the gradient to zero then gives
$\therefore \hat{B}_{k\cdot} = \frac{1}{x_k^T x_k + N \lambda(1-\alpha) } \left( 1 - \frac{ N \lambda \alpha} {\parallel x_k^T (Y-X_{(-k)}B_{(-k)}) \parallel_2 } \right)_+ x_k^T (Y-X_{(-k)} B_{(-k)}).$

### Remark

If $B_{j1} = \cdots = B_{jM}=b$, where $j=1,\cdots,p$, then
\begin{aligned} \parallel B_{j\cdot} \parallel_2 & = \sqrt{\sum_{m=1}^{M} B_{jm}^2} \\ & = \sqrt{ M \times b^2 } \\ & = \sqrt{M} \times \lvert b \rvert \\ & = \sum_{m=1}^M \frac{\lvert b \rvert}{\sqrt{M}} \\ \therefore \sum_{j=1}^p\parallel B_{j\cdot} \parallel_2 & = \sum_{m=1}^M \frac{1}{\sqrt{M}} \parallel B_{\cdot m} \parallel_1 \end{aligned}
where $m=1,\cdots,M$.
When $\rho_y$ is high and $\gamma=M$ (the number of response variables), the assumption of $B_{j1} = \cdots = B_{jM}=b$ can be satisfied.
The objective function can be expressed as
$L = \sum_{m=1}^M \left\{ \frac{1}{2N} \parallel Y_{\cdot m} - XB_{\cdot m} \parallel_2^2 + \lambda ( \frac{1-\alpha}{2}\parallel B_{\cdot m} \parallel_2^2 + \frac{\alpha}{\sqrt{M}} \parallel B_{\cdot m} \parallel_1 ) \right\}.$
This yields selection probabilities equivalent to those of the univariate elastic-net.
On the other hand, when $\rho_y$ is low and $\gamma=M$, the causal variants in *mgaussian* are selected more frequently than in the univariate elastic-net, since $\sqrt{\frac{1}{M} \sum_{m=1}^M B_{jm}^2} \ge B_{jm}$.
In the case where at least one coefficient $B_{jm}$ differs across the responses, the causal regression coefficients are selected more often, because the denominator $\parallel x_k^T (Y-X_{(-k)}B_{(-k)}) \parallel_2 (= \parallel B_{k \cdot} \parallel_2 )$ of *mgaussian* is larger than that of the univariate elastic-net, as seen from:
$\hat{B}_{k\cdot} = \frac{1}{x_k^T x_k + N \lambda(1-\alpha) } { \left( 1 - \frac{ N \lambda \alpha} {\parallel x_k^T (Y-X_{(-k)}B_{(-k)}) \parallel_2 } \right) _+ x_k^T (Y-X_{(-k)} B_{(-k)}) }$

### Implementation in C++

This code does not include standardization.
code
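One sweep of the block-wise update can be sketched as follows, again assuming centered predictors without standardization; `bcd_pass` and its argument layout (column-wise X and Y, row-wise B) are illustrative choices:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One pass of block-wise coordinate descent for the multi-response
// elastic net. Y is stored as M response columns of length n; B is p x M
// (row k holds B_{k.}); R is the residual Y - X*B in the same layout as Y.
// Update: B_k = (1 - N*lambda*alpha / ||u||_2)_+ * u / (x_k^T x_k + N*lambda*(1-alpha)),
// where u = x_k^T (Y - X_{(-k)} B_{(-k)}).
void bcd_pass(const std::vector<std::vector<double>>& X,
              const std::vector<std::vector<double>>& Y,
              std::vector<std::vector<double>>& B,
              std::vector<std::vector<double>>& R,
              double lambda, double alpha) {
    const std::size_t n = Y[0].size(), p = X.size(), M = Y.size();
    for (std::size_t k = 0; k < p; ++k) {
        double xx = 0.0;
        for (std::size_t i = 0; i < n; ++i) xx += X[k][i] * X[k][i];
        // u_m = x_k^T (partial residual for response m)
        std::vector<double> u(M, 0.0);
        double norm2 = 0.0;
        for (std::size_t m = 0; m < M; ++m) {
            for (std::size_t i = 0; i < n; ++i)
                u[m] += X[k][i] * (R[m][i] + X[k][i] * B[k][m]);
            norm2 += u[m] * u[m];
        }
        const double norm = std::sqrt(norm2);
        // group-wise shrinkage factor applied to the whole row B_{k.}
        const double shrink =
            (norm > 0.0) ? std::max(0.0, 1.0 - n * lambda * alpha / norm) : 0.0;
        const double denom = xx + n * lambda * (1.0 - alpha);
        for (std::size_t m = 0; m < M; ++m) {
            const double b_new = shrink * u[m] / denom;
            const double delta = b_new - B[k][m];
            for (std::size_t i = 0; i < n; ++i) R[m][i] -= X[k][i] * delta;
            B[k][m] = b_new;
        }
    }
}
```

In practice the pass would be repeated until the largest coefficient change falls below a tolerance, exactly as in the single-response solvers.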