Sparsity plays an important role in machine learning, statistics, and signal processing. In the basic setup, we are given a noisy observation $y \in \mathbb{R}^m$ and a dictionary $D \in \mathbb{R}^{m \times n}$. The goal is to estimate the weights $w \in \mathbb{R}^n$ that produce the observation, under the assumption that most of their entries are zero. A popular approach to obtaining sparse solutions is the class of $\ell_1$-penalized regression problems. Mathematically, these are written as

$$w = \underset{w}{\operatorname{argmin}} \; \frac{1}{2}\|Dw - y\|_2^2 + \lambda\|Fw\|_1 \tag{1}$$

where $F \in \mathbb{R}^{K \times n}$ encodes $K$ distinct $\ell_1$-norm penalties, and $\lambda \geq 0$ is a Lagrange multiplier that determines the strength of those penalty terms. In statistics, this is also known as the generalized Lasso problem [1]. The main challenge in solving equation 1 lies in the non-differentiability of the $\ell_1$-norm: we cannot obtain an analytical solution and must resort to iterative solvers.
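To make the notation concrete, here is a minimal NumPy sketch that simply evaluates the objective in equation 1; the function name and arguments are illustrative, not part of any library.

import numpy as np

def generalized_lasso_objective(w, y, D, F, lam):
    # 0.5 * ||D w - y||_2^2 + lam * ||F w||_1
    residual = D @ w - y
    return 0.5 * np.dot(residual, residual) + lam * np.sum(np.abs(F @ w))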



Deriving an ADMM algorithm

An overview of the ADMM framework can be found here. We can rewrite the generalized Lasso in ADMM form by introducing a variable $z$ and a constraint, which transforms equation 1 into the following problem

$$\min_{w,\,z} \; \frac{1}{2}\|Dw - y\|_2^2 + \lambda\|z\|_1 \quad \text{subject to} \quad Fw = z \tag{2}$$

The corresponding scaled augmented Lagrangian is given by

$$L(w, z, u) = \frac{1}{2}\|Dw - y\|_2^2 + \lambda\|z\|_1 + \frac{\rho}{2}\|Fw - z + u\|_2^2 - \frac{\rho}{2}\|u\|_2^2 \tag{3}$$

Note that introducing $z$ does not change the optimal solution, since the additional penalty terms in $L$ are minimized when $Fw - z = 0$. However, we benefit from separating the $\ell_1$ penalty from the reconstruction term. ADMM proceeds with block minimization of $L$. This consists of a $w$-minimization step, a $z$-minimization step, and a gradient ascent step on the dual variable $u$.
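Written out for problem 2, one full ADMM iteration applies the following three updates in sequence; each is derived in the subsections below.

$$\begin{aligned}
w^{k+1} &= \underset{w}{\operatorname{argmin}}\; L(w, z^k, u^k) \\
z^{k+1} &= \underset{z}{\operatorname{argmin}}\; L(w^{k+1}, z, u^k) \\
u^{k+1} &= u^k + Fw^{k+1} - z^{k+1}
\end{aligned}$$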


Updating w

The standard ADMM update for w is given by

$$w^{k+1} = \underset{w}{\operatorname{argmin}}\; L(w, z^k, u^k)$$

To obtain the RHS, we set the partial derivative of equation 3 with respect to $w$ to zero.

$$\nabla_w L = D^\top (Dw - y) + \rho F^\top (Fw - z + u) = 0$$

Now rearrange to collect terms with w on the LHS and all other terms on the RHS.

$$D^\top D w + \rho F^\top F w = D^\top y + \rho F^\top z - \rho F^\top u$$

Factorize and solve for w.

$$w = \left(D^\top D + \rho F^\top F\right)^{-1}\left(D^\top y + \rho F^\top (z^k - u^k)\right)$$

This $w$ minimizes the augmented Lagrangian given our previous estimates $z^k$ and $u^k$, and it becomes the closed-form update rule for $w^{k+1}$.
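In code, this update is a single linear solve. The sketch below is a minimal version assuming NumPy arrays with the shapes defined earlier (the helper name update_w is illustrative); since the system matrix $D^\top D + \rho F^\top F$ is fixed across iterations, it could also be factorized once and reused. The demo code at the end uses np.linalg.lstsq instead, which also handles singular systems.

import numpy as np

def update_w(D, y, F, z, u, rho):
    # Solve (D^T D + rho F^T F) w = D^T y + rho F^T (z - u)
    A = D.T @ D + rho * (F.T @ F)
    b = D.T @ y + rho * (F.T @ (z - u))
    return np.linalg.solve(A, b)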


Updating z

Next, we find the ADMM update for z

$$z^{k+1} = \underset{z}{\operatorname{argmin}}\; L(w^{k+1}, z, u^k)$$

Similarly, we set the partial derivative of $L$ with respect to $z$ to zero

$$\nabla_z L = \nabla_z\left(\lambda\|z\|_1\right) - \rho(Fw - z + u) = 0 \tag{4}$$

Immediately, we encounter the non-differentiable $\ell_1$ term. We can proceed by observing two facts. 1) The overall $\ell_1$ penalty decomposes into a sum of separate penalties on each component of $z$, i.e. $\|z\|_1 = \sum_{i=1}^{K} |z_i|$, so we can treat each component independently. 2) For each component, there is a single point of non-differentiability at zero, and the derivative exists for all other values. Therefore, we can split the problem into two cases, $z = 0$ and $z \neq 0$.


For $z \neq 0$, the derivative is equal to $\mathrm{sign}(z)$, which can be substituted into equation 4 to get

$$\lambda\,\mathrm{sign}(z) - \rho(Fw - z + u) = 0$$

Collecting all terms with z on the LHS and all others on the RHS, we get

$$z + \frac{\lambda}{\rho}\,\mathrm{sign}(z) = Fw + u$$

When $z < 0$, then $Fw + u < -\frac{\lambda}{\rho}$, and when $z > 0$, then $Fw + u > \frac{\lambda}{\rho}$. This observation can be written compactly as $|Fw + u| > \frac{\lambda}{\rho}$ with $\mathrm{sign}(z) = \mathrm{sign}(Fw + u)$, which gives the update

$$z = Fw + u - \frac{\lambda}{\rho}\,\mathrm{sign}(Fw + u)$$


When $z = 0$, we must handle the discontinuity using subgradients; the subdifferential of the $\ell_1$ term at zero is $\partial\|z\|_1 = [-1, 1]$. The optimality condition is that zero lies within the subdifferential of the objective. Substituting $z = 0$ into equation 4,

$$0 \in \lambda\,[-1, 1] - \rho(Fw + u)$$

which can be rearranged to get the form

$$Fw + u \in \left[-\frac{\lambda}{\rho}, \frac{\lambda}{\rho}\right]$$

In words, $z = 0$ is an optimal solution only when $Fw + u$ lies within the region $\left[-\frac{\lambda}{\rho}, \frac{\lambda}{\rho}\right]$. Combining the results for $z \neq 0$ and $z = 0$, we get the following update

$$z^{k+1} = \begin{cases} 0, & |Fw^{k+1} + u^k| \leq \frac{\lambda}{\rho} \\ Fw^{k+1} + u^k - \frac{\lambda}{\rho}\,\mathrm{sign}\!\left(Fw^{k+1} + u^k\right), & |Fw^{k+1} + u^k| > \frac{\lambda}{\rho} \end{cases}$$

More compactly, this can be written using the soft-thresholding operator $S$. The update equation for $z$ is given by

$$z^{k+1} = \mathrm{sign}\!\left(Fw^{k+1} + u^k\right)\,\max\!\left(|Fw^{k+1} + u^k| - \frac{\lambda}{\rho},\, 0\right) = S_{\lambda/\rho}\!\left(Fw^{k+1} + u^k\right)$$
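In NumPy, the soft-thresholding operator is a one-liner; the sketch below mirrors the soft_threshold helper used in the demo code at the end.

import numpy as np

def soft_threshold(x, thresh):
    # Elementwise S_thresh(x) = sign(x) * max(|x| - thresh, 0)
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

# z-update: z = soft_threshold(F @ w + u, lam / rho)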


Updating u

Given the updated $w^{k+1}$ and $z^{k+1}$, we perform a gradient ascent step on the dual variable, which amounts to a running sum of the residuals

$$u^{k+1} = u^k + Fw^{k+1} - z^{k+1}$$

Special Cases of the Generalized Lasso

For different choices of F, we recover several well-studied problems as special cases.



Standard Lasso

When $F$ is the identity matrix $I \in \mathbb{R}^{n \times n}$, we recover the standard Lasso problem [2]

$$\min_w \frac{1}{2}\|Dw - y\|_2^2 + \lambda\|w\|_1$$

This model is commonly used in compressed sensing, where we are interested in sparse signal recovery, and can be viewed as a convex relaxation of the $\ell_0$-norm variable selection problem.
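As a rough sanity check, the ADMM solution with $F = I$ can be compared against an off-the-shelf solver such as scikit-learn's Lasso. Note that scikit-learn minimizes $\frac{1}{2m}\|Dw - y\|_2^2 + \alpha\|w\|_1$, so its regularization parameter must be rescaled as $\alpha = \lambda / m$; the two solutions should roughly agree once ADMM has converged. The snippet assumes y, D, and lam are defined as in the demo code below.

from sklearn.linear_model import Lasso

# sklearn scales the quadratic term by 1/(2m), so alpha = lam / m
m = len(y)
w_sklearn = Lasso(alpha=lam / m, fit_intercept=False).fit(D, y).coef_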



Variable Fusion

When $F \in \mathbb{R}^{(n-1) \times n}$ is a first-order difference matrix,

$$F_{ij} = \begin{cases} 1, & i = j - 1 \\ -1, & i = j \\ 0, & \text{otherwise} \end{cases}$$

we obtain the variable fusion model [3], which corresponds to the problem

$$\min_w \frac{1}{2}\|Dw - y\|_2^2 + \lambda \sum_{i=2}^{n} |w_i - w_{i-1}|$$

This model is used when we expect the ordering of the weights to have smooth structure.
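A first-order difference matrix of this form can be built directly in NumPy. The construction below is one of several equivalent options and matches the $(n-1) \times n$ shape used above; the small example simply checks that $Fw$ returns the consecutive differences.

import numpy as np

n = 5
# F[i, i] = -1 and F[i, i+1] = 1, so (F @ w)[i] = w[i+1] - w[i]
F_fusion = np.eye(n, k=1)[:-1] - np.eye(n)[:-1]

w = np.arange(n, dtype=float)
print(F_fusion @ w)  # [1. 1. 1. 1.]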



Fused Lasso

When $F \in \mathbb{R}^{(2n-1) \times n}$ combines both the penalty from the standard Lasso and from variable fusion,

$$F = \begin{bmatrix} F_{\text{Lasso}} \\ F_{\text{Fusion}} \end{bmatrix}$$

we obtain the Fused Lasso model [4], which combines both a sparsity and a smoothness penalty.

$$\min_w \frac{1}{2}\|Dw - y\|_2^2 + \lambda_1\|w\|_1 + \lambda_2 \sum_{i=2}^{n} |w_i - w_{i-1}|$$
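The stacked penalty matrix can be assembled with a vertical concatenation. Since equation 1 has a single multiplier $\lambda$, the identity block can be rescaled by $\lambda_1 / \lambda_2$ so that setting $\lambda = \lambda_2$ reproduces both terms; the demo code below uses the same trick with lam and lam2. The values lam1 and lam2 here are hypothetical.

import numpy as np

n = 5
lam1, lam2 = 0.3, 0.5  # hypothetical sparsity and smoothness multipliers

F_lasso = np.eye(n)
F_fusion = np.eye(n, k=1)[:-1] - np.eye(n)[:-1]

# with lambda = lam2 in equation 1, the scaled identity block contributes
# lam1 * ||w||_1 and the difference block contributes lam2 * sum |w_i - w_{i-1}|
F_fused = np.vstack([(lam1 / lam2) * F_lasso, F_fusion])
print(F_fused.shape)  # (2n - 1, n) = (9, 5)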

Demo

I illustrate the impact of the various penalties mentioned earlier on the inference results. Noisy observations are generated from a random dictionary using smooth and sparse weights. For a given set of hyperparameters, we obtain the following ADMM solutions


The standard Lasso problem encourages shrinkage of each weight towards zero independently. While it accurately captures certain regions of the active support, adjacent weights can have large variance. On the other hand, the variable fusion model prioritizes smooth patterns and effectively identifies distinct areas of smoothness. However, since its penalty does not promote shrinkage towards zero, it incorrectly identifies the entire support as active. The fused Lasso model excels at identifying both the smoothness and the active support set. Importantly, note that solutions across all models exhibit a bias towards lower magnitudes than the actual weights. This is because $\ell_1$-penalized inference provides a biased estimate of the weight vector.



Code

This experiment can be replicated with the code below and is also available here.



import numpy as np
import numpy.random as npr

def soft_threshold(x, thresh):
    # Elementwise soft-thresholding: sign(x) * max(|x| - thresh, 0)
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def ADMM_generalized_lasso(y, D, F, rho, lam, n_iters=100):
    # y:   (m,) array, observation
    # D:   (m, n) array, dictionary
    # F:   (k, n) array, penalty matrix
    # rho: augmented Lagrangian penalty parameter
    # lam: L1 penalty multiplier (lambda)

    n = D.shape[1]
    k = F.shape[0]

    # random initialization
    w = npr.randn(n)
    u = npr.randn(k)
    z = npr.randn(k)

    # precompute matrices that do not change across iterations
    DtD = D.T @ D
    Dty = D.T @ y
    FtF = F.T @ F

    for i in range(n_iters):
        # w-update: solve (D^T D + rho F^T F) w = D^T y + rho F^T (z - u)
        w = np.linalg.lstsq(DtD + rho * FtF, Dty + rho * F.T @ (z - u), rcond=None)[0]
        # z-update: soft-thresholding of Fw + u
        z = soft_threshold(F @ w + u, lam / rho)
        # u-update: running sum of residuals
        u = u + F @ w - z

    return w

from sklearn.datasets import make_sparse_coded_signal
import matplotlib.pyplot as plt

n, m = 50, 100
n_nonzero_coefs = 10

# generate dictionary

_, D, _ = make_sparse_coded_signal(
    n_samples=1,
    n_components=n,
    n_features=m,
    n_nonzero_coefs=n_nonzero_coefs,
    random_state=1,
)
D = D.T



# generate structured coefficients

np.random.seed(1)
w_true = np.zeros(n)
for i in range(5):
    ix = np.random.choice(n)
    length = np.random.randint(5,10)
    w_true[ix:ix+length] = npr.randn()
    
    
# generate noisy observations 

y_true = D @ w_true
y = y_true + npr.randn(m)*0.2


# define hyperparameters

rho = 0.3   # augmented Lagrangian penalty parameter
lam = 0.5   # L1 multiplier (sparsity for Lasso, smoothness for Fusion / Fused Lasso)
lam2 = 0.3  # sparsity multiplier for the Fused Lasso


# construct F matrices

F_Lasso = np.eye(n)                              # identity penalty (standard Lasso)
F_fusion = np.eye(n, k=1)[:-1] - np.eye(n)[:-1]  # first-order difference penalty (Variable Fusion)

# scale the identity block by lam2/lam so that the single multiplier lam
# yields lam2 * ||w||_1 + lam * sum |w_i - w_{i-1}| (Fused Lasso)
F_fusedLasso = np.concatenate([np.eye(n) * lam2 / lam, F_fusion])


# compute ADMM solution

w_lasso = ADMM_generalized_lasso(y, D, F_Lasso, rho, lam)
w_fusion = ADMM_generalized_lasso(y, D, F_fusion, rho, lam)
w_fusedLasso = ADMM_generalized_lasso(y, D, F_fusedLasso, rho, lam)


    
# Plot

plt.figure(figsize=(6,3))
plt.plot(w_lasso, label="Lasso")
plt.plot(w_fusion, label="Fusion")
plt.plot(w_fusedLasso, label="Fused Lasso")
plt.plot(w_true, 'k--', label="true")
plt.ylabel("Weight")
plt.xlabel("index")
plt.legend()
plt.show()
    


  1. Ryan J Tibshirani. “The solution path of the generalized lasso”. 2011.
  2. Robert Tibshirani. “Regression shrinkage and selection via the lasso”. Journal of the Royal Statistical Society Series B: Statistical Methodology, 1996.
  3. S Land, and J Friedman. “Variable fusion: a new method of adaptive signal regression”. Technical Report, Department of Statistics, Stanford University, 1996.
  4. Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. “Sparsity and smoothness via the fused lasso”. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2005.