
Implementation of "glmnet" with C++

Created: 7/30/2021, 8:27:00 AM
Tags: Programming, C++
The goal of this post is to reproduce the glmnet package in C++.

0. Standardization

"glmnet" scales variables with standard deviation of n\sqrt{n} instead of n1\sqrt{n-1}.

1. Linear regression

Theory

We can consider a linear regression framework:
$$y_i = \boldsymbol{x}_i \boldsymbol{\beta} + \epsilon_i$$
Least squares solves the problem $\min_{\boldsymbol{\beta}} \| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2$, where $\| \cdot \|_2$ denotes the $\ell_2$ norm.
However, in several cases there is no closed-form solution, for example logistic regression in the generalized linear model. To handle this, we can use approximation algorithms such as the Newton-Raphson and Fisher scoring methods.
As an alternative, we can consider the coordinate descent algorithm, which optimizes the objective function with respect to one variable at a time while the others are held fixed.
For example, suppose five predictors are associated with a continuous response. We begin with some initial values; cyclic coordinate descent (CCD) then estimates the coefficient of the first predictor with the others held fixed, estimates the second coefficient using the updated first one, and so on.

Derivation

A derivation of estimate in linear regression is as follows:
$$
\begin{aligned}
L &= \frac{1}{2} \| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2 \\
&= \frac{1}{2} \sum_i \Big( y_i - \sum_j x_{ij} \beta_j \Big)^2 \\
\frac{\partial L}{\partial \beta_k} &= -\sum_i x_{ik} \Big( y_i - \sum_j x_{ij} \beta_j \Big) \\
&= -\sum_i x_{ik} \Big( y_i - \sum_{j \ne k} x_{ij} \beta_j - x_{ik} \beta_k \Big) \\
&= -\boldsymbol{x}_k^T (\boldsymbol{y} - \boldsymbol{X}_{(-k)}\boldsymbol{\beta}_{(-k)}) + \boldsymbol{x}_k^T \boldsymbol{x}_k \beta_k \\
\therefore \hat{\beta}_k &= (\boldsymbol{x}_k^T\boldsymbol{x}_k)^{-1}\boldsymbol{x}_k^T(\boldsymbol{y} - \boldsymbol{X}_{(-k)}\boldsymbol{\beta}_{(-k)})
\end{aligned}
$$

Implementation in C++

Code
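The original code block is collapsed above. A minimal sketch of cyclic coordinate descent for least squares, based on the update $\hat{\beta}_k = (\boldsymbol{x}_k^T\boldsymbol{x}_k)^{-1}\boldsymbol{x}_k^T(\boldsymbol{y} - \boldsymbol{X}_{(-k)}\boldsymbol{\beta}_{(-k)})$ derived above, could look like this (Armadillo, with function and variable names of my own choosing):

```cpp
#include <armadillo>
#include <algorithm>
#include <cmath>

// Cyclic coordinate descent for least squares.
// Each update regresses the partial residual on x_k:
//   beta_k = x_k^T (y - X_{(-k)} beta_{(-k)}) / (x_k^T x_k)
arma::vec ccd_ols(const arma::mat& X, const arma::vec& y,
                  int max_iter = 1000, double tol = 1e-8) {
    const arma::uword p = X.n_cols;
    arma::vec beta(p, arma::fill::zeros);
    arma::vec r = y;                              // full residual y - X * beta
    for (int it = 0; it < max_iter; ++it) {
        double max_change = 0.0;
        for (arma::uword k = 0; k < p; ++k) {
            const double xkxk = arma::dot(X.col(k), X.col(k));
            // x_k^T (y - X_{(-k)} beta_{(-k)}) = x_k^T r + (x_k^T x_k) * beta_k
            const double rho  = arma::dot(X.col(k), r) + xkxk * beta(k);
            const double bnew = rho / xkxk;
            r -= X.col(k) * (bnew - beta(k));     // keep the full residual up to date
            max_change = std::max(max_change, std::abs(bnew - beta(k)));
            beta(k) = bnew;
        }
        if (max_change < tol) break;              // coordinate updates have converged
    }
    return beta;
}
```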


2. The Lasso

The underlying model

The lasso solves the following problem, obtained from the constraint $\| \boldsymbol{\beta} \|_1 < t$ via a Lagrange multiplier:
$$\| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2 + \lambda \| \boldsymbol{\beta} \|_1,$$
where $\| \cdot \|_1$ denotes the $\ell_1$ norm.

Derivation

$$
\begin{aligned}
L &= \frac{1}{2N} \| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2 + \lambda \| \boldsymbol{\beta} \|_1 \\
&= \frac{1}{2N} \sum_i \Big( y_i - \sum_j x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^p | \beta_j | \\
&= \frac{1}{2N} \sum_i \Big( y_i - \sum_j x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^p \beta_j \operatorname{sgn}(\beta_j) \\
\frac{\partial L}{\partial \beta_k} &= -\frac{1}{N} \sum_i x_{ik} \Big( y_i - \sum_{j \ne k} x_{ij} \beta_j - x_{ik} \beta_k \Big) + \lambda \operatorname{sgn}(\beta_k) \\
&= -\frac{1}{N} \boldsymbol{x}_k^T (\boldsymbol{y} - \boldsymbol{X}_{(-k)}\boldsymbol{\beta}_{(-k)}) + \frac{1}{N} \boldsymbol{x}_k^T \boldsymbol{x}_k \beta_k + \lambda \operatorname{sgn}(\beta_k)
\end{aligned}
$$
If $\beta_k > 0$,
$$\hat{\beta}_k = (\boldsymbol{x}_k^T \boldsymbol{x}_k)^{-1}\{ \boldsymbol{x}_k^T(\boldsymbol{y} - \boldsymbol{X}_{(-k)}\boldsymbol{\beta}_{(-k)}) - N\lambda \}_+$$
If $\beta_k < 0$,
$$
\begin{aligned}
\hat{\beta}_k &= (\boldsymbol{x}_k^T \boldsymbol{x}_k)^{-1}\{ \boldsymbol{x}_k^T(\boldsymbol{y} - \boldsymbol{X}_{(-k)}\boldsymbol{\beta}_{(-k)}) + N\lambda \} \\
&= (\boldsymbol{x}_k^T \boldsymbol{x}_k)^{-1}\{ -\boldsymbol{x}_k^T(\boldsymbol{y} - \boldsymbol{X}_{(-k)}\boldsymbol{\beta}_{(-k)}) - N\lambda \}_+ \operatorname{sgn}(\beta_k)
\end{aligned}
$$
$$\therefore \hat{\beta}_k = (\boldsymbol{x}_k^T \boldsymbol{x}_k)^{-1} \{ |\boldsymbol{x}_k^T (\boldsymbol{y} - \boldsymbol{X}_{(-k)} \boldsymbol{\beta}_{(-k)})| - N\lambda \}_+ \operatorname{sgn}(\beta_k)$$

Implementation in C++

code
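The collapsed code is not reproduced here. A minimal sketch of the lasso update $\hat{\beta}_k = (\boldsymbol{x}_k^T \boldsymbol{x}_k)^{-1} \{ |\boldsymbol{x}_k^T (\boldsymbol{y} - \boldsymbol{X}_{(-k)} \boldsymbol{\beta}_{(-k)})| - N\lambda \}_+ \operatorname{sgn}(\cdot)$, i.e. soft-thresholding of the partial-residual inner product, might look like this (again Armadillo; names are my own):

```cpp
#include <armadillo>
#include <algorithm>
#include <cmath>

// Soft-thresholding operator S(z, g) = sign(z) * max(|z| - g, 0).
inline double soft_threshold(double z, double g) {
    if (z >  g) return z - g;
    if (z < -g) return z + g;
    return 0.0;
}

// Cyclic coordinate descent for the lasso:
//   objective 1/(2N) ||y - X b||^2 + lambda ||b||_1
//   update    beta_k = S(x_k^T (y - X_{(-k)} beta_{(-k)}), N * lambda) / (x_k^T x_k)
arma::vec ccd_lasso(const arma::mat& X, const arma::vec& y, double lambda,
                    int max_iter = 1000, double tol = 1e-8) {
    const arma::uword p = X.n_cols;
    const double N = static_cast<double>(X.n_rows);
    arma::vec beta(p, arma::fill::zeros);
    arma::vec r = y;                              // residual y - X * beta
    for (int it = 0; it < max_iter; ++it) {
        double max_change = 0.0;
        for (arma::uword k = 0; k < p; ++k) {
            const double xkxk = arma::dot(X.col(k), X.col(k));
            const double rho  = arma::dot(X.col(k), r) + xkxk * beta(k);
            const double bnew = soft_threshold(rho, N * lambda) / xkxk;
            r -= X.col(k) * (bnew - beta(k));
            max_change = std::max(max_change, std::abs(bnew - beta(k)));
            beta(k) = bnew;
        }
        if (max_change < tol) break;
    }
    return beta;
}
```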

3. Ridge

The underlying model

Derivation

$$
\begin{aligned}
L &= \frac{1}{2N} \| y - X\beta \|_2^2 + \frac{\lambda}{2} \| \beta \|_2^2 \\
&= \frac{1}{2N} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^p x_{ij} \beta_j \Big)^2 + \frac{\lambda}{2} \sum_{j=1}^p \beta_j^2 \\
\frac{\partial L}{\partial \beta_k} &= -\frac{1}{N} \sum_{i=1}^n x_{ik} \Big( y_i - \sum_{j \ne k} x_{ij} \beta_j - x_{ik}\beta_k \Big) + \lambda \beta_k \\
&= -\frac{1}{N} x_k^T (y - X_{(-k)}\beta_{(-k)}) + \frac{1}{N} x_k^T x_k \beta_k + \lambda \beta_k \\
\therefore \hat{\beta}_k &= ( x_k^T x_k + N \lambda )^{-1} x_k^T (y - X_{(-k)}\beta_{(-k)}).
\end{aligned}
$$

Implementation in C++

code
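The collapsed code is not reproduced here. A minimal sketch of the ridge update $\hat{\beta}_k = (x_k^T x_k + N\lambda)^{-1} x_k^T (y - X_{(-k)}\beta_{(-k)})$ could look like this (Armadillo; names are my own):

```cpp
#include <armadillo>
#include <algorithm>
#include <cmath>

// Cyclic coordinate descent for ridge regression:
//   objective 1/(2N) ||y - X b||^2 + lambda/2 ||b||_2^2
//   update    beta_k = x_k^T (y - X_{(-k)} beta_{(-k)}) / (x_k^T x_k + N * lambda)
arma::vec ccd_ridge(const arma::mat& X, const arma::vec& y, double lambda,
                    int max_iter = 1000, double tol = 1e-8) {
    const arma::uword p = X.n_cols;
    const double N = static_cast<double>(X.n_rows);
    arma::vec beta(p, arma::fill::zeros);
    arma::vec r = y;                              // residual y - X * beta
    for (int it = 0; it < max_iter; ++it) {
        double max_change = 0.0;
        for (arma::uword k = 0; k < p; ++k) {
            const double xkxk = arma::dot(X.col(k), X.col(k));
            const double rho  = arma::dot(X.col(k), r) + xkxk * beta(k);
            const double bnew = rho / (xkxk + N * lambda);   // shrunken, never exactly zero
            r -= X.col(k) * (bnew - beta(k));
            max_change = std::max(max_change, std::abs(bnew - beta(k)));
            beta(k) = bnew;
        }
        if (max_change < tol) break;
    }
    return beta;
}
```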

4. Elastic-net

The underlying model

Derivation

$$
\begin{aligned}
L &= \frac{1}{2N} \| y - X\beta \|_2^2 + \lambda \Big( \frac{1-\alpha}{2} \| \beta \|_2^2 + \alpha \| \beta \|_1 \Big) \\
&= \frac{1}{2N} \sum_{i=1}^n \Big( y_i - \sum_{j=1}^p x_{ij}\beta_j \Big)^2 + \lambda \Big( \frac{1-\alpha}{2} \sum_{j=1}^p \beta_j^2 + \alpha \sum_{j=1}^p |\beta_j| \Big) \\
\frac{\partial L}{\partial \beta_k} &= -\frac{1}{N} \sum_{i=1}^n x_{ik} \Big( y_i - \sum_{j \ne k} x_{ij} \beta_j - x_{ik} \beta_k \Big) + \lambda \big( (1-\alpha)\beta_k + \alpha \operatorname{sgn}(\beta_k) \big) \\
&= -\frac{1}{N} x_k^T (y - X_{(-k)} \beta_{(-k)}) + \frac{1}{N} x_k^T x_k \beta_k + \lambda \big( (1-\alpha)\beta_k + \alpha \operatorname{sgn}(\beta_k) \big) \\
\therefore \hat{\beta}_k &= \frac{ \big( | x_k^T (y - X_{(-k)}\beta_{(-k)}) | - N \lambda \alpha \big)_+ \operatorname{sgn}(\beta_k) }{ x_k^T x_k + N \lambda (1-\alpha) }
\end{aligned}
$$
Sequence of lambda:
$$
\begin{aligned}
n \alpha \lambda_{max} &= \underset{\ell}{\max} \, \lvert \langle x_\ell, y \rangle \rvert \\
\lambda_{max} &= \underset{\ell}{\max} \, \frac{1}{n \alpha} \lvert (y - \bar{y}(1-\bar{y}))^T x_\ell \rvert \\
\lambda_{min} &= \epsilon \lambda_{max}, \quad \text{where } \epsilon = \begin{cases} 0.01 & \text{if } n > p, \\ 0.0001 & \text{if } n < p. \end{cases}
\end{aligned}
$$
(See https://stats.stackexchange.com/questions/166630/glmnet-compute-maximal-lambda-value.)
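A small sketch of this lambda sequence for the Gaussian case, taking $\lambda_{max} = \max_\ell |x_\ell^T y| / (n\alpha)$ and a log-spaced grid down to $\epsilon\lambda_{max}$ (Armadillo; names and the grid length are my own choices):

```cpp
#include <armadillo>
#include <cmath>

// Lambda sequence for the Gaussian elastic net (assumes alpha > 0 and centered data).
// lambda_max = max_l |x_l^T y| / (n * alpha); grid is log-spaced down to eps * lambda_max.
arma::vec lambda_sequence(const arma::mat& X, const arma::vec& y,
                          double alpha, arma::uword n_lambda = 100) {
    const double n = static_cast<double>(X.n_rows);
    const double p = static_cast<double>(X.n_cols);
    const double lambda_max = arma::max(arma::abs(X.t() * y)) / (n * alpha);
    const double eps = (n > p) ? 0.01 : 0.0001;
    // equally spaced on the log scale from lambda_max down to eps * lambda_max
    return arma::exp(arma::linspace(std::log(lambda_max),
                                    std::log(eps * lambda_max), n_lambda));
}
```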

Implementation in C++ (Standardized predictor with intercept)

Code for the standardized predictor with an intercept
Code for the full version, which has been verified only for centered predictors
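Neither collapsed block is reproduced here. A minimal sketch of the first variant, elastic-net coordinate descent on standardized predictors with an intercept, could look like the following. The standardization, back-transformation, and naming are my own assumptions about how to organize this, not glmnet's actual source:

```cpp
#include <armadillo>
#include <algorithm>
#include <cmath>

inline double soft_threshold(double z, double g) {
    if (z >  g) return z - g;
    if (z < -g) return z + g;
    return 0.0;
}

// Elastic-net coordinate descent with an intercept, fit on standardized predictors.
// Objective: 1/(2N) ||y - b0 - X b||^2 + lambda * ((1-alpha)/2 ||b||_2^2 + alpha ||b||_1).
// Returns a vector of length p + 1: (intercept, coefficients on the original scale).
arma::vec enet_ccd(const arma::mat& X, const arma::vec& y,
                   double lambda, double alpha,
                   int max_iter = 1000, double tol = 1e-8) {
    const arma::uword n = X.n_rows, p = X.n_cols;
    const double N = static_cast<double>(n);

    // standardize columns with the 1/n standard deviation (glmnet's convention)
    arma::rowvec mu = arma::mean(X, 0);
    arma::rowvec sd(p);
    arma::mat Z(n, p);
    for (arma::uword j = 0; j < p; ++j) {
        arma::vec xj = X.col(j) - mu(j);
        sd(j) = std::sqrt(arma::dot(xj, xj) / N);
        Z.col(j) = xj / sd(j);
    }
    const double ybar = arma::mean(y);
    arma::vec r = y - ybar;                       // centered response = current residual

    arma::vec beta(p, arma::fill::zeros);         // coefficients on the standardized scale
    for (int it = 0; it < max_iter; ++it) {
        double max_change = 0.0;
        for (arma::uword k = 0; k < p; ++k) {
            const double zkzk = arma::dot(Z.col(k), Z.col(k));   // equals N after standardization
            const double rho  = arma::dot(Z.col(k), r) + zkzk * beta(k);
            const double bnew = soft_threshold(rho, N * lambda * alpha)
                                / (zkzk + N * lambda * (1.0 - alpha));
            r -= Z.col(k) * (bnew - beta(k));
            max_change = std::max(max_change, std::abs(bnew - beta(k)));
            beta(k) = bnew;
        }
        if (max_change < tol) break;
    }

    // back-transform to the original scale and recover the intercept
    arma::vec b = beta / sd.t();
    const double b0 = ybar - arma::as_scalar(mu * b);
    arma::vec out(p + 1);
    out(0) = b0;
    out.subvec(1, p) = b;
    return out;
}
```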

5. Block-wise coordinate descent (for multi-response elastic-net)

The underlying model

Let us consider a framework to simulate multivariate multiple regression:
$$Y = XB + W,$$
where $W \sim MVN(0, \Sigma_Y)$, and $\Sigma_Y$ is an equicorrelation matrix with $\rho_y = 0,\ 0.2,\ 0.5,\ 0.8$.
Sequence of lambda:
$$\lambda_{max} = \underset{\ell}{\max} \Big( \frac{1}{n\alpha} \| x_\ell^T Y \|_2 \Big)$$

Derivation of multi-response elastic-net

$$
\begin{aligned}
L &= \frac{1}{2N} \| Y - XB \|_F^2 + \lambda \Big( \frac{1-\alpha}{2} \| B \|_F^2 + \alpha \sum_{j=1}^p \| B_{j\cdot} \|_2 \Big) \\
&= \frac{1}{2N} \sum_{i=1}^n \| y_{i\cdot} - \sum_{j=1}^p x_{ij} B_{j\cdot} \|_2^2 + \lambda \Big( \frac{1-\alpha}{2} \sum_{j=1}^p \| B_{j\cdot} \|_2^2 + \alpha \sum_{j=1}^p \| B_{j\cdot} \|_2 \Big) \\
\frac{\partial L}{\partial B_{k\cdot}} &= -\frac{1}{N} \sum_{i=1}^n x_{ik} \Big( y_{i\cdot} - \sum_{j \ne k} x_{ij} B_{j\cdot} - x_{ik} B_{k\cdot} \Big) + \lambda \Big( (1-\alpha) B_{k\cdot} + \alpha \frac{B_{k\cdot}}{\| B_{k\cdot} \|_2} \Big) \\
&= -\frac{1}{N} x_k^T \Big( Y - \sum_{j \ne k} x_{\cdot j} B_{j\cdot} \Big) + \frac{1}{N} x_k^T x_k B_{k\cdot} + \lambda (1-\alpha) B_{k\cdot} + \lambda \alpha \frac{B_{k\cdot}}{\| B_{k\cdot} \|_2}
\end{aligned}
$$
Since $B_{k\cdot} = (x_k^T x_k)^{-1} x_k^T (Y - X_{(-k)} B_{(-k)})$,
$$
\begin{aligned}
\frac{\partial L}{\partial B_{k\cdot}} &= -\frac{1}{N} x_k^T \Big( Y - \sum_{j \ne k} x_{\cdot j} B_{j\cdot} \Big) + \frac{1}{N} x_k^T x_k B_{k\cdot} + \lambda (1-\alpha) B_{k\cdot} + \lambda \alpha \frac{x_k^T (Y - X_{(-k)} B_{(-k)})}{\| x_k^T (Y - X_{(-k)} B_{(-k)}) \|_2} \\
\therefore \hat{B}_{k\cdot} &= \frac{1}{x_k^T x_k + N \lambda (1-\alpha)} \left( 1 - \frac{N \lambda \alpha}{\| x_k^T (Y - X_{(-k)} B_{(-k)}) \|_2} \right)_+ x_k^T (Y - X_{(-k)} B_{(-k)})
\end{aligned}
$$
It is not clear to me why $B_{k\cdot}$ in the penalty term can be replaced by this partial-residual expression.

Remark

If $B_{j1} = \cdots = B_{jM} = b$ for $j = 1, \cdots, p$, then
$$
\begin{aligned}
\| B_{j\cdot} \|_2 &= \sqrt{\sum_{m=1}^{M} B_{jm}^2} \\
&= \sqrt{M \times b^2} \\
&= \sqrt{M} \times \lvert b \rvert \\
&= \sum_{m=1}^M \frac{\lvert b \rvert}{\sqrt{M}} \\
\therefore \sum_{j=1}^p \| B_{j\cdot} \|_2 &= \sum_{m=1}^M \frac{1}{\sqrt{M}} \| B_{\cdot m} \|_1
\end{aligned}
$$
where $m = 1, \cdots, M$.
When $\rho_y$ is high and $\gamma = M$ (the number of response variables), the assumption $B_{j1} = \cdots = B_{jM} = b$ can be satisfied.
The objective function can be expressed as
$$L = \sum_{m=1}^M \left\{ \frac{1}{2N} \| Y_{\cdot m} - X B_{\cdot m} \|_2^2 + \lambda \Big( \frac{1-\alpha}{2} \| B_{\cdot m} \|_2^2 + \frac{\alpha}{\sqrt{M}} \| B_{\cdot m} \|_1 \Big) \right\}.$$
This yields selection probabilities equivalent to those of the univariate elastic-net.
On the other hand, when $\rho_y$ is low and $\gamma = M$, the causal variants in mgaussian are more frequently selected than in the univariate elastic-net, since $\sqrt{\frac{1}{M} \sum_{m=1}^M B_{jm}^2} \ge B_{jm}$.
In the case where at least one coefficient $B_{jm}$ differs across responses, the causal regression coefficients are selected more often because the denominator $\| x_k^T (Y - X_{(-k)} B_{(-k)}) \|_2$ (proportional to $\| B_{k\cdot} \|_2$) in *mgaussian* is larger than its counterpart in the univariate elastic-net, as seen from the update:
$$\hat{B}_{k\cdot} = \frac{1}{x_k^T x_k + N \lambda (1-\alpha)} \left( 1 - \frac{N \lambda \alpha}{\| x_k^T (Y - X_{(-k)} B_{(-k)}) \|_2} \right)_+ x_k^T (Y - X_{(-k)} B_{(-k)})$$

Implementation in C++

This code does not include standardization.
code
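The collapsed code is again not reproduced. Below is a minimal sketch under the same assumption (no standardization, i.e. X and Y already centered), implementing the row-wise update and the multi-response $\lambda_{max}$ derived above. Armadillo and all names are my own choices:

```cpp
#include <armadillo>
#include <algorithm>
#include <cmath>

// Block-wise coordinate descent for the multi-response (mgaussian) elastic net.
// Row-wise update:
//   B_k = (1 - N*lambda*alpha / ||x_k^T R_(-k)||_2)_+ * x_k^T R_(-k) / (x_k^T x_k + N*lambda*(1-alpha))
arma::mat ccd_mgaussian(const arma::mat& X, const arma::mat& Y,
                        double lambda, double alpha,
                        int max_iter = 1000, double tol = 1e-8) {
    const arma::uword p = X.n_cols, M = Y.n_cols;
    const double N = static_cast<double>(X.n_rows);
    arma::mat B(p, M, arma::fill::zeros);
    arma::mat R = Y;                                  // residual Y - X * B
    for (int it = 0; it < max_iter; ++it) {
        double max_change = 0.0;
        for (arma::uword k = 0; k < p; ++k) {
            const double xkxk = arma::dot(X.col(k), X.col(k));
            // x_k^T (Y - X_{(-k)} B_{(-k)}) as a 1 x M row vector
            arma::rowvec rho = X.col(k).t() * R + xkxk * B.row(k);
            const double scale =
                std::max(0.0, 1.0 - N * lambda * alpha / arma::norm(rho, 2));
            arma::rowvec bnew = scale * rho / (xkxk + N * lambda * (1.0 - alpha));
            R -= X.col(k) * (bnew - B.row(k));
            max_change = std::max(max_change, arma::norm(bnew - B.row(k), 2));
            B.row(k) = bnew;
        }
        if (max_change < tol) break;
    }
    return B;
}

// lambda_max for the multi-response case: max_l ||x_l^T Y||_2 / (n * alpha).
double lambda_max_mgaussian(const arma::mat& X, const arma::mat& Y, double alpha) {
    double lam = 0.0;
    for (arma::uword l = 0; l < X.n_cols; ++l)
        lam = std::max(lam, arma::norm(Y.t() * X.col(l), 2));
    return lam / (static_cast<double>(X.n_rows) * alpha);
}
```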