Logistic Regression

Bernoulli Setting

Assume $Y_i$ follows a Bernoulli distribution given the $i$-th observation $x_i$ and the parameters $\beta$
$$Y_i \mid x_i \sim \text{Bernoulli}(p_i)$$
We can write the probability mass function for $y_i$ as follows
$$P(Y_i = y_i \mid x_i, p_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}$$
We assume that the log-odds is a linear combination of the input
$$\ln\left(\frac{p_i}{1 - p_i}\right) = x_i^\top \beta \qquad \text{i.e.} \qquad p_i = \frac{1}{1 + \exp(-x_i^\top\beta)} = \frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)} = \sigma(x_i^\top\beta)$$
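
As a quick sanity check of the log-odds identity, here is a minimal NumPy sketch. The function `sigmoid` and the toy arrays `x_i` and `beta` are illustrative names introduced here, not part of the original derivation.

```python
import numpy as np

def sigmoid(t):
    # Logistic function sigma(t) = 1 / (1 + exp(-t))
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
x_i = rng.normal(size=3)    # one hypothetical observation with 3 features
beta = rng.normal(size=3)   # a hypothetical parameter vector
p_i = sigmoid(x_i @ beta)

# The log-odds of p_i recovers the linear predictor x_i^T beta.
print(np.log(p_i / (1 - p_i)), x_i @ beta)
```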

Joint Log-Likelihood

The joint likelihood is then found assuming IID-ness
$$p(y \mid \beta) = \prod_{i=1}^n P(Y_i = y_i \mid x_i, p_i) = \prod_{i=1}^n p_i^{y_i}(1 - p_i)^{1 - y_i}$$
The log-likelihood is then
$$\ln p(y \mid \beta) = \ln\left(\prod_{i=1}^n p_i^{y_i}(1 - p_i)^{1 - y_i}\right) = \sum_{i=1}^n y_i \ln(p_i) + (1 - y_i)\ln(1 - p_i)$$
Alternatively, the likelihood can be written as follows
$$
\begin{aligned}
p(y \mid \beta)
&= \prod_{i=1}^n p_i^{y_i}(1 - p_i)^{1 - y_i} \\
&= \prod_{i=1}^n \left(\frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}\right)^{y_i}\left(1 - \frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}\right)^{1 - y_i} \\
&= \prod_{i=1}^n \left(\frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}\right)^{y_i}\left(\frac{1}{1 + \exp(x_i^\top\beta)}\right)^{1 - y_i} \\
&= \prod_{i=1}^n \left(\frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}\right)^{y_i}\left(\frac{1}{1 + \exp(x_i^\top\beta)}\right)\left(1 + \exp(x_i^\top\beta)\right)^{y_i} \\
&= \prod_{i=1}^n \exp(x_i^\top\beta\, y_i)\,\frac{1}{\left(1 + \exp(x_i^\top\beta)\right)^{y_i}}\left(\frac{1}{1 + \exp(x_i^\top\beta)}\right)\left(1 + \exp(x_i^\top\beta)\right)^{y_i} \\
&= \prod_{i=1}^n \frac{\exp(x_i^\top\beta\, y_i)}{1 + \exp(x_i^\top\beta)}
\end{aligned}
$$

Taking the logarithm of this expression gives
$$
\begin{aligned}
\ln(p(y \mid \beta))
&= \sum_{i=1}^n \ln\left(\frac{\exp(x_i^\top\beta\, y_i)}{1 + \exp(x_i^\top\beta)}\right) \\
&= \sum_{i=1}^n \ln(\exp(x_i^\top\beta\, y_i)) - \ln(1 + \exp(x_i^\top\beta)) \\
&= \sum_{i=1}^n x_i^\top\beta\, y_i - \ln(1 + \exp(x_i^\top\beta))
\end{aligned}
$$
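
The two forms of the log-likelihood can be checked numerically. Below is a small NumPy sketch that evaluates both on toy data; the function names and the toy arrays `X`, `y`, `beta` are made up for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loglik_bernoulli(beta, X, y):
    # First form: sum_i y_i ln(p_i) + (1 - y_i) ln(1 - p_i)
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def loglik_compact(beta, X, y):
    # Second form: sum_i x_i^T beta y_i - ln(1 + exp(x_i^T beta))
    eta = X @ beta
    return np.sum(eta * y - np.log1p(np.exp(eta)))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))        # toy design matrix
y = rng.integers(0, 2, size=5)     # toy labels in {0, 1}
beta = rng.normal(size=3)
print(np.isclose(loglik_bernoulli(beta, X, y), loglik_compact(beta, X, y)))  # True
```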

Maximum Likelihood as Loss Minimization

Our aim is to maximize the likelihood. This is equivalent to maximizing the log-likelihood, which is equivalent to minimizing the negative log-likelihood.
$$\min_\beta\; -\sum_{i=1}^n \left[y_i\ln(p_i) + (1 - y_i)\ln(1 - p_i)\right]$$
Let's consider the expression inside the summation
$$y_i\ln(p_i) + (1 - y_i)\ln(1 - p_i)$$
We can notice that
$$y_i\ln(p_i) + (1 - y_i)\ln(1 - p_i) = \begin{cases}\ln(1 - p_i) & \text{if } y_i = 0\\ \ln(p_i) & \text{if } y_i = 1\end{cases}$$
Remember that $p_i = \sigma(x_i^\top\beta)$ and that $1 - \sigma(x_i^\top\beta) = \sigma(-x_i^\top\beta)$
$$y_i\ln(\sigma(x_i^\top\beta)) + (1 - y_i)\ln(\sigma(-x_i^\top\beta)) = \begin{cases}\ln(\sigma(-x_i^\top\beta)) & \text{if } y_i = 0\\ \ln(\sigma(x_i^\top\beta)) & \text{if } y_i = 1\end{cases}$$
We are thus looking for a sign that is $-1$ when $y_i = 0$ and $+1$ when $y_i = 1$. In particular, notice that $2y_i - 1$ does the job. Thus we can write our problem as
$$
\begin{aligned}
\min_\beta -\sum_{i=1}^n \ln\left(\sigma\left((2y_i - 1)\,x_i^\top\beta\right)\right)
&= \min_\beta -\sum_{i=1}^n \ln\left(\frac{1}{1 + \exp((1 - 2y_i)\,x_i^\top\beta)}\right) \\
&= \min_\beta \sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) \\
&= \min_\beta \sum_{i=1}^n \mathcal{L}(x_i, y_i; \beta)
\end{aligned}
$$
where the loss incurred using parameter $\beta$ when predicting the label for $x_i$ whose true label is $y_i$ is
$$\mathcal{L}(x_i, y_i; \beta) = \begin{cases}\ln(1 + \exp(x_i^\top\beta)) & \text{if } y_i = 0\\ \ln(1 + \exp(-x_i^\top\beta)) & \text{if } y_i = 1\end{cases}$$
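
The case analysis of the loss can also be verified directly. A minimal sketch, assuming a hypothetical value `eta` for $x_i^\top\beta$:

```python
import numpy as np

def loss(eta_i, y_i):
    # Per-example loss: ln(1 + exp((1 - 2 y_i) * eta_i)), with eta_i = x_i^T beta
    return np.log1p(np.exp((1 - 2 * y_i) * eta_i))

eta = 0.7  # hypothetical value of x_i^T beta
print(np.isclose(loss(eta, 0), np.log1p(np.exp(eta))))   # y_i = 0 case: True
print(np.isclose(loss(eta, 1), np.log1p(np.exp(-eta))))  # y_i = 1 case: True
```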

Maximum-A-Posteriori (MAP)

Recall from Bayes' rule
$$p(\beta \mid y) = \frac{p(y \mid \beta)\, p(\beta)}{p(y)} \propto p(y \mid \beta)\, p(\beta)$$
Taking the logarithm on both sides and multiplying by $-1$ we obtain, up to an additive constant that does not depend on $\beta$,
$$-\ln(p(\beta \mid y)) = -\ln(p(y \mid \beta)) - \ln(p(\beta)) + \text{const}$$
We can choose to use a Gaussian prior on the parameters, $p(\beta) = N(\mu_0, \Sigma_0)$. If $\beta \in \mathbb{R}^{p\times 1}$ then
$$\ln(p(\beta)) = -\frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln(|\Sigma_0|) - \frac{1}{2}(\beta - \mu_0)^\top\Sigma_0^{-1}(\beta - \mu_0)$$
Plugging this in, we obtain the following. Notice how we neglect terms that do not depend on $\beta$ because they will not matter when we minimize this with respect to $\beta$.
$$
\begin{aligned}
\min_\beta -\ln(p(\beta \mid y))
&= \min_\beta -\ln(p(y \mid \beta)) - \ln(p(\beta)) \\
&= \min_\beta -\ln(p(y \mid \beta)) + \frac{p}{2}\ln(2\pi) + \frac{1}{2}\ln(|\Sigma_0|) + \frac{1}{2}(\beta - \mu_0)^\top\Sigma_0^{-1}(\beta - \mu_0) \\
&= \min_\beta -\ln(p(y \mid \beta)) + \frac{1}{2}(\beta - \mu_0)^\top\Sigma_0^{-1}(\beta - \mu_0) \\
&= \min_\beta \sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{1}{2}(\beta - \mu_0)^\top\Sigma_0^{-1}(\beta - \mu_0)
\end{aligned}
$$

Ridge Regularization: Isotropic Gaussian Prior

Now if we set $\mu_0 = 0_p$ and $\Sigma_0 = \sigma_\beta^2 I_p$ (this is equivalent to setting a univariate normal prior on each coefficient, with $p(\beta_j) = N(0, \sigma_\beta^2)$) we have
$$\min_\beta \sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{1}{2}\beta^\top(\sigma_\beta^2 I_p)^{-1}\beta$$
Using the fact that for an invertible matrix $A$ and a non-zero constant $c \in \mathbb{R}\setminus\{0\}$ we have $(cA)^{-1} = \frac{1}{c}A^{-1}$, we obtain
$$\min_\beta \sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{1}{2\sigma_\beta^2}\beta^\top\beta$$
Setting $\lambda := \frac{1}{\sigma_\beta^2}$ we have regularized logistic regression
$$\min_\beta \sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{\lambda}{2}\beta^\top\beta$$
It is more stable to multiply through by $\sigma_\beta^2$, so we get
$$\min_\beta \sigma_\beta^2\sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{1}{2}\beta^\top\beta$$
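
The rescaled objective is straightforward to evaluate in code. A minimal sketch, where `sigma2_beta` stands for $\sigma_\beta^2$ and the remaining names are illustrative:

```python
import numpy as np

def ridge_logistic_objective(beta, X, y, sigma2_beta):
    # sigma_beta^2 * sum_i ln(1 + exp((1 - 2 y_i) x_i^T beta)) + 0.5 * beta^T beta
    eta = X @ beta
    nll = np.sum(np.log1p(np.exp((1 - 2 * y) * eta)))
    return sigma2_beta * nll + 0.5 * beta @ beta
```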

Ridge Regularization Except on the Intercept

Write
$$\beta = \begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_{p-1}\end{pmatrix} =: \begin{pmatrix}\beta_0\\ \beta_{1:p-1}\end{pmatrix}$$
where we define $\beta_{1:p-1} := (\beta_1, \ldots, \beta_{p-1})^\top$, because often we don't really regularize the intercept. This means that we place a multivariate Gaussian prior on $\beta_{1:p-1}$ as follows
$$p(\beta_{1:p-1}) = N\!\left(0_{p-1},\; \sigma_{\beta_{1:p-1}}^2 I_{p-1}\right)$$
(again, this is equivalent to putting a univariate normal prior on each of $\beta_1, \ldots, \beta_{p-1}$ with $p(\beta_j) = N(0, \sigma_{\beta_{1:p-1}}^2)$). On $\beta_0$ we could instead place a flat (uniform) prior, whose contribution does not depend on $\beta_0$ and so can be dropped from the expression.
$$\min_\beta \sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{1}{2\sigma_{\beta_{1:p-1}}^2}\beta_{1:p-1}^\top\beta_{1:p-1}$$
It is more stable to multiply through by $\sigma_{\beta_{1:p-1}}^2$, therefore
$$\min_\beta \sigma_{\beta_{1:p-1}}^2\sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{1}{2}\beta_{1:p-1}^\top\beta_{1:p-1}$$
Notice that this is consistent with the implementation used by scikit-learn provided here.
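
In code, excluding the intercept from the penalty amounts to summing the squares of all coefficients except the first. A minimal sketch under the assumption that the first column of `X` is a column of ones; all names here are illustrative:

```python
import numpy as np

def ridge_logistic_objective_no_intercept(beta, X, y, sigma2):
    # Rescaled objective in which the intercept beta[0] is NOT penalized;
    # the first column of X is assumed to be a column of ones.
    eta = X @ beta
    nll = np.sum(np.log1p(np.exp((1 - 2 * y) * eta)))
    return sigma2 * nll + 0.5 * beta[1:] @ beta[1:]
```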

Full-Bayesian (Laplace Approximation)

Full Bayesian inference is intractable. The Laplace approximation approximates $p(\beta \mid y)$ with a Gaussian distribution $q(\beta)$. To find such a distribution, we use the multivariate version of Taylor's expansion to expand the log posterior around its mode $\beta_{\text{MAP}}$. We take a second-order approximation
$$\ln p(\beta \mid y) \approx \ln p(\beta_{\text{MAP}} \mid y) + \nabla\ln p(\beta_{\text{MAP}} \mid y)^\top(\beta - \beta_{\text{MAP}}) + \frac{1}{2}(\beta - \beta_{\text{MAP}})^\top\nabla^2\ln p(\beta_{\text{MAP}} \mid y)(\beta - \beta_{\text{MAP}})$$
Since $\beta_{\text{MAP}}$ is a stationary point, the gradient at this point is zero, so we have
$$\ln p(\beta \mid y) \approx \ln p(\beta_{\text{MAP}} \mid y) + \frac{1}{2}(\beta - \beta_{\text{MAP}})^\top\nabla^2\ln p(\beta_{\text{MAP}} \mid y)(\beta - \beta_{\text{MAP}})$$
We take the exponential on both sides to obtain
$$p(\beta \mid y) \approx p(\beta_{\text{MAP}} \mid y)\exp\left(\frac{1}{2}(\beta - \beta_{\text{MAP}})^\top\nabla^2\ln p(\beta_{\text{MAP}} \mid y)(\beta - \beta_{\text{MAP}})\right)$$
We recognize that this has the shape of a multivariate normal distribution. We therefore define our Laplace approximation to be
$$q(\beta) = (2\pi)^{-\frac{p}{2}}\det\left(-\nabla^2\ln p(\beta_{\text{MAP}} \mid y)\right)^{\frac{1}{2}}\exp\left(\frac{1}{2}(\beta - \beta_{\text{MAP}})^\top\nabla^2\ln p(\beta_{\text{MAP}} \mid y)(\beta - \beta_{\text{MAP}})\right)$$
That is
$$q(\beta) = N\!\left(\beta_{\text{MAP}},\; \left[-\nabla^2\ln p(\beta_{\text{MAP}} \mid y)\right]^{-1}\right)$$
To find $\nabla^2_\beta\ln p(\beta \mid y)\big|_{\beta = \beta_{\text{MAP}}}$ we write
$$\nabla^2_\beta\ln p(\beta \mid y) = \nabla^2_\beta\ln p(y \mid \beta) + \nabla^2_\beta\ln p(\beta)$$
and we find each of the expressions on the right-hand side separately. We start with $\nabla^2_\beta\ln p(\beta)$. Recall that if we have a quadratic form $x^\top A x$, its derivative with respect to $x$ is given by $x^\top(A + A^\top)$. Applying this to our case, and using the fact that $\Sigma_0^{-1}$ is symmetric, we have
$$\nabla_\beta\ln p(\beta) = -\frac{1}{2}(\beta - \mu_0)^\top\, 2\Sigma_0^{-1} = -(\beta - \mu_0)^\top\Sigma_0^{-1} = -\beta^\top\Sigma_0^{-1} + \mu_0^\top\Sigma_0^{-1}$$
Taking the derivative with respect to $\beta$ again we get
$$\nabla^2_\beta\ln p(\beta) = -\Sigma_0^{-1}$$
To find $\nabla_\beta\ln p(y \mid \beta)$ we take the derivative componentwise
$$
\begin{aligned}
\frac{\partial}{\partial\beta_j}\ln p(y \mid \beta)
&= \sum_{i=1}^n y_i\frac{\partial}{\partial\beta_j}\ln\sigma(x_i^\top\beta) + (1 - y_i)\frac{\partial}{\partial\beta_j}\ln\sigma(-x_i^\top\beta) \\
&= \sum_{i=1}^n y_i\frac{\sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))}{\sigma(x_i^\top\beta)}x_i^{(j)} + (1 - y_i)\frac{\sigma(-x_i^\top\beta)(1 - \sigma(-x_i^\top\beta))}{\sigma(-x_i^\top\beta)}(-x_i^{(j)}) \\
&= \sum_{i=1}^n y_i x_i^{(j)}\sigma(-x_i^\top\beta) + \left(y_i x_i^{(j)} - x_i^{(j)}\right)\sigma(x_i^\top\beta)
\end{aligned}
$$
Now take the derivative with respect to $\beta_k$
$$
\begin{aligned}
\frac{\partial}{\partial\beta_k}\frac{\partial}{\partial\beta_j}\ln p(y \mid \beta)
&= \sum_{i=1}^n y_i x_i^{(j)}\frac{\partial}{\partial\beta_k}\sigma(-x_i^\top\beta) + \left(y_i x_i^{(j)} - x_i^{(j)}\right)\frac{\partial}{\partial\beta_k}\sigma(x_i^\top\beta) \\
&= \sum_{i=1}^n y_i x_i^{(j)}\sigma(-x_i^\top\beta)(1 - \sigma(-x_i^\top\beta))(-x_i^{(k)}) + \left(y_i x_i^{(j)} - x_i^{(j)}\right)\sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))x_i^{(k)} \\
&= \sum_{i=1}^n -y_i x_i^{(j)}x_i^{(k)}\sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta)) + \left(y_i x_i^{(j)}x_i^{(k)} - x_i^{(j)}x_i^{(k)}\right)\sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta)) \\
&= -\sum_{i=1}^n x_i^{(j)}x_i^{(k)}\sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))
\end{aligned}
$$
where in the third line we used $\sigma(-t)(1 - \sigma(-t)) = \sigma(t)(1 - \sigma(t))$. This tells us that
$$\left[\nabla^2_\beta\ln p(y \mid \beta)\right]_{kj} = -\sum_{i=1}^n x_i^{(j)}x_i^{(k)}\sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))$$
Note that for a vector $x_i := (1, x_i^{(1)}, \ldots, x_i^{(p-1)})^\top$ the outer product gives the following symmetric matrix
$$
x_i x_i^\top = \begin{pmatrix}1\\ x_i^{(1)}\\ \vdots\\ x_i^{(p-1)}\end{pmatrix}\begin{pmatrix}1 & x_i^{(1)} & \cdots & x_i^{(p-1)}\end{pmatrix} = \begin{pmatrix}1 & x_i^{(1)} & \cdots & x_i^{(p-1)}\\ x_i^{(1)} & (x_i^{(1)})^2 & \cdots & x_i^{(1)}x_i^{(p-1)}\\ \vdots & \vdots & \ddots & \vdots\\ x_i^{(p-1)} & x_i^{(p-1)}x_i^{(1)} & \cdots & (x_i^{(p-1)})^2\end{pmatrix}
$$
In particular $[x_i x_i^\top]_{kj} = x_i^{(j)}x_i^{(k)}$, so that
$$\nabla^2_\beta\ln p(y \mid \beta) = -\sum_{i=1}^n \sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))\,x_i x_i^\top$$
Putting everything together we get
$$\nabla^2_\beta\ln p(\beta \mid y) = -\Sigma_0^{-1} - \sum_{i=1}^n \sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))\,x_i x_i^\top$$
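
Given $\beta_{\text{MAP}}$ and the prior precision $\Sigma_0^{-1}$, the Laplace approximation is a direct translation of the last formula. A minimal NumPy sketch, assuming `beta_map`, `X` and `Sigma0_inv` are already available (names are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def laplace_approximation(beta_map, X, Sigma0_inv):
    # Gaussian approximation q(beta) = N(beta_MAP, H^{-1}) with
    # H = -Hessian of the log posterior at beta_MAP
    #   = Sigma0^{-1} + sum_i sigma(x_i^T b)(1 - sigma(x_i^T b)) x_i x_i^T
    p_hat = sigmoid(X @ beta_map)
    w = p_hat * (1.0 - p_hat)                 # diagonal of D
    H = Sigma0_inv + (X * w[:, None]).T @ X   # prior precision + X^T D X
    return beta_map, np.linalg.inv(H)         # mean and covariance of q
```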

Gradient Ascent Optimization (MLE, No Regularization)

The simplest way to find the $\beta$ that maximizes the likelihood is gradient ascent. Remember that when working on the Laplace approximation we found the derivative of the log-likelihood with respect to the $j$-th component of $\beta$. We can rearrange that expression to get a nicer form.
$$
\begin{aligned}
\frac{\partial}{\partial\beta_j}\ln p(y \mid \beta)
&= \sum_{i=1}^n y_i x_i^{(j)}\sigma(-x_i^\top\beta) + \left(y_i x_i^{(j)} - x_i^{(j)}\right)\sigma(x_i^\top\beta) \\
&= \sum_{i=1}^n y_i x_i^{(j)}(1 - \sigma(x_i^\top\beta)) + \left(y_i x_i^{(j)} - x_i^{(j)}\right)\sigma(x_i^\top\beta) \\
&= \sum_{i=1}^n y_i x_i^{(j)} - y_i x_i^{(j)}\sigma(x_i^\top\beta) + y_i x_i^{(j)}\sigma(x_i^\top\beta) - x_i^{(j)}\sigma(x_i^\top\beta) \\
&= \sum_{i=1}^n \left[y_i - \sigma(x_i^\top\beta)\right]x_i^{(j)}
\end{aligned}
$$
Therefore the full gradient is given by
$$
\begin{aligned}
\nabla_\beta\ln p(y \mid \beta)
&= \left(\frac{\partial}{\partial\beta_0}\ln p(y \mid \beta), \ldots, \frac{\partial}{\partial\beta_{p-1}}\ln p(y \mid \beta)\right)^\top \\
&= \left(\sum_{i=1}^n (y_i - \sigma(x_i^\top\beta))x_i^{(0)}, \ldots, \sum_{i=1}^n (y_i - \sigma(x_i^\top\beta))x_i^{(p-1)}\right)^\top \\
&= \sum_{i=1}^n (y_i - \sigma(x_i^\top\beta))\left(x_i^{(0)}, \ldots, x_i^{(p-1)}\right)^\top \\
&= \sum_{i=1}^n (y_i - \sigma(x_i^\top\beta))\,x_i
\end{aligned}
$$
Gradient ascent then becomes
$$\beta_{k+1} \leftarrow \beta_k + \gamma\nabla_\beta\ln p(y \mid \beta_k) = \beta_k + \gamma\sum_{i=1}^n (y_i - \sigma(x_i^\top\beta_k))\,x_i$$
This can be vectorized when programming as follows
$$\beta_{k+1} \leftarrow \beta_k + \gamma X^\top(y - \sigma(X\beta_k))$$
where $X \in \mathbb{R}^{n\times p}$ is the design matrix
$$X = \begin{pmatrix}x_1^\top\\ \vdots\\ x_n^\top\end{pmatrix}$$
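
A minimal NumPy sketch of the vectorized update, assuming a fixed step size `gamma` and a design matrix `X` whose first column is ones (all names are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient_ascent_mle(X, y, gamma=0.01, n_iter=5000):
    # Fixed-step gradient ascent on the log-likelihood:
    # beta <- beta + gamma * X^T (y - sigma(X beta))
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = beta + gamma * X.T @ (y - sigmoid(X @ beta))
    return beta
```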

One can change the step size at every iteration. One possible choice for $\gamma_k$ is
$$\gamma_k = \frac{\left|(\beta_k - \beta_{k-1})^\top\left[\nabla\ln p(y \mid \beta_k) - \nabla\ln p(y \mid \beta_{k-1})\right]\right|}{\left\|\nabla\ln p(y \mid \beta_k) - \nabla\ln p(y \mid \beta_{k-1})\right\|^2}$$
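
A possible translation of this step-size rule into code, a sketch only; the argument names are made up for illustration:

```python
import numpy as np

def step_size(beta_k, beta_prev, grad_k, grad_prev):
    # gamma_k = |s^T g| / ||g||^2 with s = beta_k - beta_{k-1}
    # and g = grad_k - grad_{k-1}
    s = beta_k - beta_prev
    g = grad_k - grad_prev
    return np.abs(s @ g) / (g @ g)
```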

Newton’s Method (MLE, No Regularization)

Again, during the Laplace approximation section we found that the Hessian is given by
$$\nabla^2_\beta\ln p(y \mid \beta) = -\sum_{i=1}^n \sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))\,x_i x_i^\top$$
This expression can be vectorized as follows
$$\nabla^2_\beta\ln p(y \mid \beta) = -X^\top D X$$
where
$$D = \operatorname{diag}\left(\sigma(X\beta)\odot(1 - \sigma(X\beta))\right) = \begin{pmatrix}\sigma(x_1^\top\beta)(1 - \sigma(x_1^\top\beta)) & 0 & \cdots & 0\\ 0 & \sigma(x_2^\top\beta)(1 - \sigma(x_2^\top\beta)) & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma(x_n^\top\beta)(1 - \sigma(x_n^\top\beta))\end{pmatrix}$$
Newton's method then updates the weights as follows (where $\alpha$ is a learning rate to control convergence)
$$\beta_{k+1} \leftarrow \beta_k + \alpha\left(X^\top D X\right)^{-1}X^\top(y - \sigma(X\beta_k))$$

Of course, in practice we never invert the matrix but rather compute the direction $d_k$ by solving the linear system
$$X^\top D X\, d_k = \alpha X^\top(y - \sigma(X\beta_k))$$
and then find the next iterate as
$$\beta_{k+1} \leftarrow \beta_k + d_k$$
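
A minimal NumPy sketch of this update, solving the linear system at each iteration rather than inverting the Hessian (function and variable names are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_mle(X, y, alpha=1.0, n_iter=25):
    # Newton's method: solve (X^T D X) d = alpha * X^T (y - sigma(X beta))
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p_hat = sigmoid(X @ beta)
        w = p_hat * (1.0 - p_hat)          # diagonal of D
        H = (X * w[:, None]).T @ X         # X^T D X
        d = np.linalg.solve(H, alpha * X.T @ (y - p_hat))
        beta = beta + d
    return beta
```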

Gradient Ascent (MAP, Ridge Regularization)

We want to maximize
$$\ln p(\beta \mid y) = \ln p(y \mid \beta) + \ln p(\beta) + \text{const} = -\sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) - \frac{1}{2\sigma_\beta^2}\beta^\top\beta + \text{const}$$
The gradient of the log posterior is given by
$$\nabla_\beta\ln p(\beta \mid y) = X^\top(y - \sigma(X\beta)) - \frac{1}{\sigma_\beta^2}\beta$$
Thus gradient ascent with regularization to do MAP becomes (multiplying through by $\sigma_\beta^2$ as before, so that the constant is absorbed into the step size)
$$\beta_{k+1} \leftarrow \beta_k + \gamma\left(\sigma_\beta^2 X^\top(y - \sigma(X\beta_k)) - \beta_k\right)$$
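
A minimal sketch of this regularized update in NumPy, with `sigma2_beta` standing for $\sigma_\beta^2$ (all names illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient_ascent_map(X, y, sigma2_beta, gamma=0.01, n_iter=5000):
    # beta <- beta + gamma * (sigma_beta^2 * X^T (y - sigma(X beta)) - beta)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = beta + gamma * (sigma2_beta * X.T @ (y - sigmoid(X @ beta)) - beta)
    return beta
```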

Newton’s Method (MAP, Ridge Regularization)

Similarly, we want to maximize $\ln p(\beta \mid y)$. The Hessian is given by
$$\nabla^2_\beta\ln p(\beta \mid y) = -X^\top D X - \frac{1}{\sigma_\beta^2}I$$
therefore Newton's method update formula becomes
$$\beta_{k+1} \leftarrow \beta_k + \alpha\left[\sigma_\beta^2 X^\top D X + I\right]^{-1}\left(\sigma_\beta^2 X^\top(y - \sigma(X\beta_k)) - \beta_k\right)$$
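
As before, in practice one solves the corresponding linear system rather than inverting the matrix. A minimal sketch under the same naming assumptions as the previous snippets:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_map(X, y, sigma2_beta, alpha=1.0, n_iter=25):
    # beta <- beta + alpha * [sigma_beta^2 X^T D X + I]^{-1}
    #                        (sigma_beta^2 X^T (y - sigma(X beta)) - beta)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        p_hat = sigmoid(X @ beta)
        w = p_hat * (1.0 - p_hat)
        A = sigma2_beta * (X * w[:, None]).T @ X + np.eye(p)
        b = sigma2_beta * X.T @ (y - p_hat) - beta
        beta = beta + alpha * np.linalg.solve(A, b)
    return beta
```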

Iteratively Reweighted Least-Squares

We can manipulate the expression in Newton's method by defining a new variable
$$z_k = X\beta_k + D_k^{-1}(y - \sigma(X\beta_k))$$
Then the update takes the form
$$\beta_{k+1} \leftarrow \left(X^\top D_k X\right)^{-1}X^\top D_k z_k$$
(expanding $z_k$ shows this is exactly the Newton step $\beta_k + (X^\top D_k X)^{-1}X^\top(y - \sigma(X\beta_k))$ with $\alpha = 1$). In practice we would follow these steps

  • Evaluate $\eta_k = X\beta_k$ and $D_k$.
  • Solve the system $D_k r_k = y - \sigma(\eta_k)$ for $r_k$.
  • Compute $z_k = \eta_k + r_k$.
  • Solve the system $(X^\top D_k X)\,d_k = X^\top D_k z_k$ for $d_k$.
  • Set $\beta_{k+1} = d_k$.

Alternatively, noticing that $D_{ii} = \sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta)) > 0$, one can take the square root of its elements and rewrite the problem as
$$\left(D_k^{1/2}X\right)^\top\left(D_k^{1/2}X\right)d_k = \left(D_k^{1/2}X\right)^\top\left(D_k^{1/2}z_k\right)$$
which is a simple least-squares problem in the new variables $\tilde{X}_k = D_k^{1/2}X$ and $\tilde{z}_k = D_k^{1/2}z_k$, and can be solved using the QR decomposition $\tilde{X}_k = QR$ by solving the following system for $d_k$
$$R\,d_k = Q^\top\tilde{z}_k$$
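
A minimal NumPy sketch of IRLS with this square-root/QR formulation (the function `irls` and all variable names are illustrative; no safeguards are included for weights close to zero):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def irls(X, y, n_iter=25):
    # Each iteration solves the weighted least-squares problem with
    # X_tilde = D^{1/2} X and z_tilde = D^{1/2} z via a QR decomposition.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                    # eta_k = X beta_k
        p_hat = sigmoid(eta)
        w = p_hat * (1.0 - p_hat)         # diagonal of D_k
        z = eta + (y - p_hat) / w         # working response z_k
        sw = np.sqrt(w)
        Q, R = np.linalg.qr(sw[:, None] * X)          # QR of D^{1/2} X
        beta = np.linalg.solve(R, Q.T @ (sw * z))     # beta_{k+1} = d_k
    return beta
```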

Logistic Regression with {-1, 1} Labels

Notice that so far we used $y_i \in \{0, 1\}$ with
$$P(Y_i = y_i \mid x_i) = \sigma(x_i^\top\beta)^{y_i}\,\sigma(-x_i^\top\beta)^{1 - y_i} = \begin{cases}\sigma(-x_i^\top\beta) & \text{if } y_i = 0\\ \sigma(x_i^\top\beta) & \text{if } y_i = 1\end{cases}$$
In particular $p(y_i = 1) = \sigma(x_i^\top\beta)$. This gave us the loss function
$$\sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right)$$
Now the key point is to notice that
$$1 - 2y_i = \begin{cases}1 & \text{if } y_i = 0\\ -1 & \text{if } y_i = 1\end{cases}$$
So the mapping that makes $\{0,1\}$-logistic regression and $\{-1,1\}$-logistic regression equivalent is $0 \mapsto -1$ and $1 \mapsto 1$, i.e. $z_i = 2y_i - 1$, so that $1 - 2y_i = -z_i$.

With labels $z_i \in \{-1, 1\}$ we instead have $p(z_i = 1) = \sigma(x_i^\top\beta)$ and the loss function becomes
$$\sum_{i=1}^n \ln\left(1 + \exp(-z_i\, x_i^\top\beta)\right)$$
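
The equivalence of the two label conventions can be checked numerically. A minimal sketch on toy data (all arrays are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
beta = rng.normal(size=3)
y = rng.integers(0, 2, size=6)   # labels in {0, 1}
z = 2 * y - 1                    # mapped labels in {-1, 1}

eta = X @ beta
loss_01 = np.sum(np.log1p(np.exp((1 - 2 * y) * eta)))   # {0,1} formulation
loss_pm = np.sum(np.log1p(np.exp(-z * eta)))            # {-1,1} formulation
print(np.isclose(loss_01, loss_pm))                     # True
```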
