Logistic Regression

Bernoulli Setting

Assume $Y_i$ follows a Bernoulli distribution given the $i$-th observation $x_i$ and the parameters $\beta$
$$Y_i \mid x_i \sim \text{Bernoulli}(p_i)$$
We can write the probability mass function for $y_i$ as follows
$$P(Y_i = y_i \mid x_i, p_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}$$
We assume that the log-odds is a linear combination of the input
$$\ln\left(\frac{p_i}{1 - p_i}\right) = x_i^\top \beta \qquad \text{i.e.} \qquad p_i = \frac{1}{1 + \exp(-x_i^\top\beta)} = \frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)} = \sigma(x_i^\top\beta)$$
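
As a quick sanity check of the log-odds identity, here is a minimal NumPy sketch. The function `sigmoid` and the toy arrays `x_i` and `beta` are illustrative names introduced here, not part of the original derivation.

```python
import numpy as np

def sigmoid(t):
    # Logistic function sigma(t) = 1 / (1 + exp(-t))
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
x_i = rng.normal(size=3)    # one hypothetical observation with 3 features
beta = rng.normal(size=3)   # a hypothetical parameter vector
p_i = sigmoid(x_i @ beta)

# The log-odds of p_i recovers the linear predictor x_i^T beta.
print(np.log(p_i / (1 - p_i)), x_i @ beta)
```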

Joint Log-Likelihood

The joint likelihood is then found assuming IID-ness
$$p(y \mid \beta) = \prod_{i=1}^n P(Y_i = y_i \mid x_i, p_i) = \prod_{i=1}^n p_i^{y_i}(1 - p_i)^{1 - y_i}$$
The log-likelihood is then
$$\ln p(y \mid \beta) = \ln\left(\prod_{i=1}^n p_i^{y_i}(1 - p_i)^{1 - y_i}\right) = \sum_{i=1}^n y_i \ln(p_i) + (1 - y_i)\ln(1 - p_i)$$
Alternatively, the likelihood can be written as follows
$$
\begin{aligned}
p(y \mid \beta)
&= \prod_{i=1}^n p_i^{y_i}(1 - p_i)^{1 - y_i} \\
&= \prod_{i=1}^n \left(\frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}\right)^{y_i}\left(1 - \frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}\right)^{1 - y_i} \\
&= \prod_{i=1}^n \left(\frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}\right)^{y_i}\left(\frac{1}{1 + \exp(x_i^\top\beta)}\right)^{1 - y_i} \\
&= \prod_{i=1}^n \left(\frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}\right)^{y_i}\left(\frac{1}{1 + \exp(x_i^\top\beta)}\right)\left(1 + \exp(x_i^\top\beta)\right)^{y_i} \\
&= \prod_{i=1}^n \exp(x_i^\top\beta\, y_i)\,\frac{1}{\left(1 + \exp(x_i^\top\beta)\right)^{y_i}}\left(\frac{1}{1 + \exp(x_i^\top\beta)}\right)\left(1 + \exp(x_i^\top\beta)\right)^{y_i} \\
&= \prod_{i=1}^n \frac{\exp(x_i^\top\beta\, y_i)}{1 + \exp(x_i^\top\beta)}
\end{aligned}
$$

Taking the logarithm of this expression gives
$$
\begin{aligned}
\ln(p(y \mid \beta))
&= \sum_{i=1}^n \ln\left(\frac{\exp(x_i^\top\beta\, y_i)}{1 + \exp(x_i^\top\beta)}\right) \\
&= \sum_{i=1}^n \ln(\exp(x_i^\top\beta\, y_i)) - \ln(1 + \exp(x_i^\top\beta)) \\
&= \sum_{i=1}^n x_i^\top\beta\, y_i - \ln(1 + \exp(x_i^\top\beta))
\end{aligned}
$$
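
The two forms of the log-likelihood can be checked numerically. Below is a small NumPy sketch that evaluates both on toy data; the function names and the toy arrays `X`, `y`, `beta` are made up for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loglik_bernoulli(beta, X, y):
    # First form: sum_i y_i ln(p_i) + (1 - y_i) ln(1 - p_i)
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def loglik_compact(beta, X, y):
    # Second form: sum_i x_i^T beta y_i - ln(1 + exp(x_i^T beta))
    eta = X @ beta
    return np.sum(eta * y - np.log1p(np.exp(eta)))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))        # toy design matrix
y = rng.integers(0, 2, size=5)     # toy labels in {0, 1}
beta = rng.normal(size=3)
print(np.isclose(loglik_bernoulli(beta, X, y), loglik_compact(beta, X, y)))  # True
```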

Maximum Likelihood as Loss Minimization

Our aim is to maximize the likelihood. This is equivalent to maximizing the log-likelihood, which is equivalent to minimizing the negative log-likelihood.
$$\min_\beta\; -\sum_{i=1}^n \left[y_i\ln(p_i) + (1 - y_i)\ln(1 - p_i)\right]$$
Let's consider the expression inside the summation
$$y_i\ln(p_i) + (1 - y_i)\ln(1 - p_i)$$
We can notice that
$$y_i\ln(p_i) + (1 - y_i)\ln(1 - p_i) = \begin{cases}\ln(1 - p_i) & \text{if } y_i = 0\\ \ln(p_i) & \text{if } y_i = 1\end{cases}$$
Remember that $p_i = \sigma(x_i^\top\beta)$ and that $1 - \sigma(x_i^\top\beta) = \sigma(-x_i^\top\beta)$
$$y_i\ln(\sigma(x_i^\top\beta)) + (1 - y_i)\ln(\sigma(-x_i^\top\beta)) = \begin{cases}\ln(\sigma(-x_i^\top\beta)) & \text{if } y_i = 0\\ \ln(\sigma(x_i^\top\beta)) & \text{if } y_i = 1\end{cases}$$
We are thus looking for a sign that is $-1$ when $y_i = 0$ and $+1$ when $y_i = 1$. In particular, notice that $2y_i - 1$ does the job. Thus we can write our problem as
$$
\begin{aligned}
\min_\beta -\sum_{i=1}^n \ln\left(\sigma\left((2y_i - 1)\,x_i^\top\beta\right)\right)
&= \min_\beta -\sum_{i=1}^n \ln\left(\frac{1}{1 + \exp((1 - 2y_i)\,x_i^\top\beta)}\right) \\
&= \min_\beta \sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) \\
&= \min_\beta \sum_{i=1}^n \mathcal{L}(x_i, y_i; \beta)
\end{aligned}
$$
where the loss incurred using parameter $\beta$ when predicting the label for $x_i$ whose true label is $y_i$ is
$$\mathcal{L}(x_i, y_i; \beta) = \begin{cases}\ln(1 + \exp(x_i^\top\beta)) & \text{if } y_i = 0\\ \ln(1 + \exp(-x_i^\top\beta)) & \text{if } y_i = 1\end{cases}$$
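
The case analysis of the loss can also be verified directly. A minimal sketch, assuming a hypothetical value `eta` for $x_i^\top\beta$:

```python
import numpy as np

def loss(eta_i, y_i):
    # Per-example loss: ln(1 + exp((1 - 2 y_i) * eta_i)), with eta_i = x_i^T beta
    return np.log1p(np.exp((1 - 2 * y_i) * eta_i))

eta = 0.7  # hypothetical value of x_i^T beta
print(np.isclose(loss(eta, 0), np.log1p(np.exp(eta))))   # y_i = 0 case: True
print(np.isclose(loss(eta, 1), np.log1p(np.exp(-eta))))  # y_i = 1 case: True
```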

Maximum-A-Posteriori (MAP)

Recall from Bayes' rule
$$p(\beta \mid y) = \frac{p(y \mid \beta)\, p(\beta)}{p(y)} \propto p(y \mid \beta)\, p(\beta)$$
Taking the logarithm on both sides and multiplying by $-1$ we obtain, up to an additive constant that does not depend on $\beta$,
$$-\ln(p(\beta \mid y)) = -\ln(p(y \mid \beta)) - \ln(p(\beta)) + \text{const}$$
We can choose to use a Gaussian prior on the parameters, $p(\beta) = N(\mu_0, \Sigma_0)$. If $\beta \in \mathbb{R}^{p\times 1}$ then
$$\ln(p(\beta)) = -\frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln(|\Sigma_0|) - \frac{1}{2}(\beta - \mu_0)^\top\Sigma_0^{-1}(\beta - \mu_0)$$
Plugging this in, we obtain the following. Notice how we neglect terms that do not depend on $\beta$ because they will not matter when we minimize this with respect to $\beta$.
$$
\begin{aligned}
\min_\beta -\ln(p(\beta \mid y))
&= \min_\beta -\ln(p(y \mid \beta)) - \ln(p(\beta)) \\
&= \min_\beta -\ln(p(y \mid \beta)) + \frac{p}{2}\ln(2\pi) + \frac{1}{2}\ln(|\Sigma_0|) + \frac{1}{2}(\beta - \mu_0)^\top\Sigma_0^{-1}(\beta - \mu_0) \\
&= \min_\beta -\ln(p(y \mid \beta)) + \frac{1}{2}(\beta - \mu_0)^\top\Sigma_0^{-1}(\beta - \mu_0) \\
&= \min_\beta \sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{1}{2}(\beta - \mu_0)^\top\Sigma_0^{-1}(\beta - \mu_0)
\end{aligned}
$$

Ridge Regularization: Isotropic Gaussian Prior

Now if we set $\mu_0 = 0_p$ and $\Sigma_0 = \sigma_\beta^2 I_p$ (this is equivalent to setting a univariate normal prior on each coefficient, with $p(\beta_j) = N(0, \sigma_\beta^2)$) we have
$$\min_\beta \sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{1}{2}\beta^\top(\sigma_\beta^2 I_p)^{-1}\beta$$
Using the fact that for an invertible matrix $A$ and a non-zero constant $c \in \mathbb{R}\setminus\{0\}$ we have $(cA)^{-1} = \frac{1}{c}A^{-1}$, we obtain
$$\min_\beta \sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{1}{2\sigma_\beta^2}\beta^\top\beta$$
Setting $\lambda := \frac{1}{\sigma_\beta^2}$ we have regularized logistic regression
$$\min_\beta \sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{\lambda}{2}\beta^\top\beta$$
It is more stable to multiply through by $\sigma_\beta^2$, so we get
$$\min_\beta \sigma_\beta^2\sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{1}{2}\beta^\top\beta$$
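
The rescaled objective is straightforward to evaluate in code. A minimal sketch, where `sigma2_beta` stands for $\sigma_\beta^2$ and the remaining names are illustrative:

```python
import numpy as np

def ridge_logistic_objective(beta, X, y, sigma2_beta):
    # sigma_beta^2 * sum_i ln(1 + exp((1 - 2 y_i) x_i^T beta)) + 0.5 * beta^T beta
    eta = X @ beta
    nll = np.sum(np.log1p(np.exp((1 - 2 * y) * eta)))
    return sigma2_beta * nll + 0.5 * beta @ beta
```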

Ridge Regularization Except on the Intercept

Write
$$\beta = \begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_{p-1}\end{pmatrix} =: \begin{pmatrix}\beta_0\\ \beta_{1:p-1}\end{pmatrix}$$
where we define $\beta_{1:p-1} := (\beta_1, \ldots, \beta_{p-1})^\top$, because often we don't really regularize the intercept. This means that we place a multivariate Gaussian prior on $\beta_{1:p-1}$ as follows
$$p(\beta_{1:p-1}) = N\!\left(0_{p-1},\; \sigma_{\beta_{1:p-1}}^2 I_{p-1}\right)$$
(again, this is equivalent to putting a univariate normal prior on each of $\beta_1, \ldots, \beta_{p-1}$ with $p(\beta_j) = N(0, \sigma_{\beta_{1:p-1}}^2)$). On $\beta_0$ we could instead place a flat (uniform) prior, whose contribution does not depend on $\beta_0$ and so can be dropped from the expression.
$$\min_\beta \sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{1}{2\sigma_{\beta_{1:p-1}}^2}\beta_{1:p-1}^\top\beta_{1:p-1}$$
It is more stable to multiply through by $\sigma_{\beta_{1:p-1}}^2$, therefore
$$\min_\beta \sigma_{\beta_{1:p-1}}^2\sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) + \frac{1}{2}\beta_{1:p-1}^\top\beta_{1:p-1}$$
Notice that this is consistent with the implementation used by scikit-learn provided here.
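
In code, excluding the intercept from the penalty amounts to summing the squares of all coefficients except the first. A minimal sketch under the assumption that the first column of `X` is a column of ones; all names here are illustrative:

```python
import numpy as np

def ridge_logistic_objective_no_intercept(beta, X, y, sigma2):
    # Rescaled objective in which the intercept beta[0] is NOT penalized;
    # the first column of X is assumed to be a column of ones.
    eta = X @ beta
    nll = np.sum(np.log1p(np.exp((1 - 2 * y) * eta)))
    return sigma2 * nll + 0.5 * beta[1:] @ beta[1:]
```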

Full-Bayesian (Laplace Approximation)

Full Bayesian inference is intractable. The Laplace approximation approximates $p(\beta \mid y)$ with a Gaussian distribution $q(\beta)$. To find such a distribution, we use the multivariate version of Taylor's expansion to expand the log posterior around its mode $\beta_{\text{MAP}}$. We take a second-order approximation
$$\ln p(\beta \mid y) \approx \ln p(\beta_{\text{MAP}} \mid y) + \nabla\ln p(\beta_{\text{MAP}} \mid y)^\top(\beta - \beta_{\text{MAP}}) + \frac{1}{2}(\beta - \beta_{\text{MAP}})^\top\nabla^2\ln p(\beta_{\text{MAP}} \mid y)(\beta - \beta_{\text{MAP}})$$
Since $\beta_{\text{MAP}}$ is a stationary point, the gradient at this point is zero, so we have
$$\ln p(\beta \mid y) \approx \ln p(\beta_{\text{MAP}} \mid y) + \frac{1}{2}(\beta - \beta_{\text{MAP}})^\top\nabla^2\ln p(\beta_{\text{MAP}} \mid y)(\beta - \beta_{\text{MAP}})$$
We take the exponential on both sides to obtain
$$p(\beta \mid y) \approx p(\beta_{\text{MAP}} \mid y)\exp\left(\frac{1}{2}(\beta - \beta_{\text{MAP}})^\top\nabla^2\ln p(\beta_{\text{MAP}} \mid y)(\beta - \beta_{\text{MAP}})\right)$$
We recognize that this has the shape of a multivariate normal distribution. We therefore define our Laplace approximation to be
$$q(\beta) = (2\pi)^{-\frac{p}{2}}\det\left(-\nabla^2\ln p(\beta_{\text{MAP}} \mid y)\right)^{\frac{1}{2}}\exp\left(\frac{1}{2}(\beta - \beta_{\text{MAP}})^\top\nabla^2\ln p(\beta_{\text{MAP}} \mid y)(\beta - \beta_{\text{MAP}})\right)$$
That is
$$q(\beta) = N\!\left(\beta_{\text{MAP}},\; \left[-\nabla^2\ln p(\beta_{\text{MAP}} \mid y)\right]^{-1}\right)$$
To find $\nabla^2_\beta\ln p(\beta \mid y)\big|_{\beta = \beta_{\text{MAP}}}$ we write
$$\nabla^2_\beta\ln p(\beta \mid y) = \nabla^2_\beta\ln p(y \mid \beta) + \nabla^2_\beta\ln p(\beta)$$
and we find each of the expressions on the right-hand side separately. We start with $\nabla^2_\beta\ln p(\beta)$. Recall that if we have a quadratic form $x^\top A x$, its derivative with respect to $x$ is given by $x^\top(A + A^\top)$. Applying this to our case, and using the fact that $\Sigma_0^{-1}$ is symmetric, we have
$$\nabla_\beta\ln p(\beta) = -\frac{1}{2}(\beta - \mu_0)^\top\, 2\Sigma_0^{-1} = -(\beta - \mu_0)^\top\Sigma_0^{-1} = -\beta^\top\Sigma_0^{-1} + \mu_0^\top\Sigma_0^{-1}$$
Taking the derivative with respect to $\beta$ again we get
$$\nabla^2_\beta\ln p(\beta) = -\Sigma_0^{-1}$$
To find $\nabla_\beta\ln p(y \mid \beta)$ we take the derivative componentwise
$$
\begin{aligned}
\frac{\partial}{\partial\beta_j}\ln p(y \mid \beta)
&= \sum_{i=1}^n y_i\frac{\partial}{\partial\beta_j}\ln\sigma(x_i^\top\beta) + (1 - y_i)\frac{\partial}{\partial\beta_j}\ln\sigma(-x_i^\top\beta) \\
&= \sum_{i=1}^n y_i\frac{\sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))}{\sigma(x_i^\top\beta)}x_i^{(j)} + (1 - y_i)\frac{\sigma(-x_i^\top\beta)(1 - \sigma(-x_i^\top\beta))}{\sigma(-x_i^\top\beta)}(-x_i^{(j)}) \\
&= \sum_{i=1}^n y_i x_i^{(j)}\sigma(-x_i^\top\beta) + \left(y_i x_i^{(j)} - x_i^{(j)}\right)\sigma(x_i^\top\beta)
\end{aligned}
$$
Now take the derivative with respect to $\beta_k$
$$
\begin{aligned}
\frac{\partial}{\partial\beta_k}\frac{\partial}{\partial\beta_j}\ln p(y \mid \beta)
&= \sum_{i=1}^n y_i x_i^{(j)}\frac{\partial}{\partial\beta_k}\sigma(-x_i^\top\beta) + \left(y_i x_i^{(j)} - x_i^{(j)}\right)\frac{\partial}{\partial\beta_k}\sigma(x_i^\top\beta) \\
&= \sum_{i=1}^n y_i x_i^{(j)}\sigma(-x_i^\top\beta)(1 - \sigma(-x_i^\top\beta))(-x_i^{(k)}) + \left(y_i x_i^{(j)} - x_i^{(j)}\right)\sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))x_i^{(k)} \\
&= \sum_{i=1}^n -y_i x_i^{(j)}x_i^{(k)}\sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta)) + \left(y_i x_i^{(j)}x_i^{(k)} - x_i^{(j)}x_i^{(k)}\right)\sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta)) \\
&= -\sum_{i=1}^n x_i^{(j)}x_i^{(k)}\sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))
\end{aligned}
$$
where in the third line we used $\sigma(-t)(1 - \sigma(-t)) = \sigma(t)(1 - \sigma(t))$. This tells us that
$$\left[\nabla^2_\beta\ln p(y \mid \beta)\right]_{kj} = -\sum_{i=1}^n x_i^{(j)}x_i^{(k)}\sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))$$
Note that for a vector $x_i := (1, x_i^{(1)}, \ldots, x_i^{(p-1)})^\top$ the outer product gives the following symmetric matrix
$$
x_i x_i^\top = \begin{pmatrix}1\\ x_i^{(1)}\\ \vdots\\ x_i^{(p-1)}\end{pmatrix}\begin{pmatrix}1 & x_i^{(1)} & \cdots & x_i^{(p-1)}\end{pmatrix} = \begin{pmatrix}1 & x_i^{(1)} & \cdots & x_i^{(p-1)}\\ x_i^{(1)} & (x_i^{(1)})^2 & \cdots & x_i^{(1)}x_i^{(p-1)}\\ \vdots & \vdots & \ddots & \vdots\\ x_i^{(p-1)} & x_i^{(p-1)}x_i^{(1)} & \cdots & (x_i^{(p-1)})^2\end{pmatrix}
$$
In particular $[x_i x_i^\top]_{kj} = x_i^{(j)}x_i^{(k)}$, so that
$$\nabla^2_\beta\ln p(y \mid \beta) = -\sum_{i=1}^n \sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))\,x_i x_i^\top$$
Putting everything together we get
$$\nabla^2_\beta\ln p(\beta \mid y) = -\Sigma_0^{-1} - \sum_{i=1}^n \sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))\,x_i x_i^\top$$
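
Given $\beta_{\text{MAP}}$ and the prior precision $\Sigma_0^{-1}$, the Laplace approximation is a direct translation of the last formula. A minimal NumPy sketch, assuming `beta_map`, `X` and `Sigma0_inv` are already available (names are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def laplace_approximation(beta_map, X, Sigma0_inv):
    # Gaussian approximation q(beta) = N(beta_MAP, H^{-1}) with
    # H = -Hessian of the log posterior at beta_MAP
    #   = Sigma0^{-1} + sum_i sigma(x_i^T b)(1 - sigma(x_i^T b)) x_i x_i^T
    p_hat = sigmoid(X @ beta_map)
    w = p_hat * (1.0 - p_hat)                 # diagonal of D
    H = Sigma0_inv + (X * w[:, None]).T @ X   # prior precision + X^T D X
    return beta_map, np.linalg.inv(H)         # mean and covariance of q
```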

Gradient Ascent Optimization (MLE, No Regularization)

The simplest way to find the $\beta$ that maximizes the likelihood is gradient ascent. Remember that when working on the Laplace approximation we found the derivative of the log-likelihood with respect to the $j$-th component of $\beta$. We can rearrange that expression to get a nicer form.
$$
\begin{aligned}
\frac{\partial}{\partial\beta_j}\ln p(y \mid \beta)
&= \sum_{i=1}^n y_i x_i^{(j)}\sigma(-x_i^\top\beta) + \left(y_i x_i^{(j)} - x_i^{(j)}\right)\sigma(x_i^\top\beta) \\
&= \sum_{i=1}^n y_i x_i^{(j)}(1 - \sigma(x_i^\top\beta)) + \left(y_i x_i^{(j)} - x_i^{(j)}\right)\sigma(x_i^\top\beta) \\
&= \sum_{i=1}^n y_i x_i^{(j)} - y_i x_i^{(j)}\sigma(x_i^\top\beta) + y_i x_i^{(j)}\sigma(x_i^\top\beta) - x_i^{(j)}\sigma(x_i^\top\beta) \\
&= \sum_{i=1}^n \left[y_i - \sigma(x_i^\top\beta)\right]x_i^{(j)}
\end{aligned}
$$
Therefore the full gradient is given by
$$
\begin{aligned}
\nabla_\beta\ln p(y \mid \beta)
&= \left(\frac{\partial}{\partial\beta_0}\ln p(y \mid \beta), \ldots, \frac{\partial}{\partial\beta_{p-1}}\ln p(y \mid \beta)\right)^\top \\
&= \left(\sum_{i=1}^n (y_i - \sigma(x_i^\top\beta))x_i^{(0)}, \ldots, \sum_{i=1}^n (y_i - \sigma(x_i^\top\beta))x_i^{(p-1)}\right)^\top \\
&= \sum_{i=1}^n (y_i - \sigma(x_i^\top\beta))\left(x_i^{(0)}, \ldots, x_i^{(p-1)}\right)^\top \\
&= \sum_{i=1}^n (y_i - \sigma(x_i^\top\beta))\,x_i
\end{aligned}
$$
Gradient ascent then becomes
$$\beta_{k+1} \leftarrow \beta_k + \gamma\nabla_\beta\ln p(y \mid \beta_k) = \beta_k + \gamma\sum_{i=1}^n (y_i - \sigma(x_i^\top\beta_k))\,x_i$$
This can be vectorized when programming as follows
$$\beta_{k+1} \leftarrow \beta_k + \gamma X^\top(y - \sigma(X\beta_k))$$
where $X \in \mathbb{R}^{n\times p}$ is the design matrix
$$X = \begin{pmatrix}x_1^\top\\ \vdots\\ x_n^\top\end{pmatrix}$$
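
A minimal NumPy sketch of the vectorized update, assuming a fixed step size `gamma` and a design matrix `X` whose first column is ones (all names are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient_ascent_mle(X, y, gamma=0.01, n_iter=5000):
    # Fixed-step gradient ascent on the log-likelihood:
    # beta <- beta + gamma * X^T (y - sigma(X beta))
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = beta + gamma * X.T @ (y - sigmoid(X @ beta))
    return beta
```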

One can change the step size at every iteration. One possible choice for $\gamma_k$ is
$$\gamma_k = \frac{\left|(\beta_k - \beta_{k-1})^\top\left[\nabla\ln p(y \mid \beta_k) - \nabla\ln p(y \mid \beta_{k-1})\right]\right|}{\left\|\nabla\ln p(y \mid \beta_k) - \nabla\ln p(y \mid \beta_{k-1})\right\|^2}$$
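
A possible translation of this step-size rule into code, a sketch only; the argument names are made up for illustration:

```python
import numpy as np

def step_size(beta_k, beta_prev, grad_k, grad_prev):
    # gamma_k = |s^T g| / ||g||^2 with s = beta_k - beta_{k-1}
    # and g = grad_k - grad_{k-1}
    s = beta_k - beta_prev
    g = grad_k - grad_prev
    return np.abs(s @ g) / (g @ g)
```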

Newton’s Method (MLE, No Regularization)

Again, during the Laplace approximation section we found that the Hessian is given by
$$\nabla^2_\beta\ln p(y \mid \beta) = -\sum_{i=1}^n \sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta))\,x_i x_i^\top$$
This expression can be vectorized as follows
$$\nabla^2_\beta\ln p(y \mid \beta) = -X^\top D X$$
where
$$D = \operatorname{diag}\left(\sigma(X\beta)\odot(1 - \sigma(X\beta))\right) = \begin{pmatrix}\sigma(x_1^\top\beta)(1 - \sigma(x_1^\top\beta)) & 0 & \cdots & 0\\ 0 & \sigma(x_2^\top\beta)(1 - \sigma(x_2^\top\beta)) & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma(x_n^\top\beta)(1 - \sigma(x_n^\top\beta))\end{pmatrix}$$
Newton's method then updates the weights as follows (where $\alpha$ is a learning rate to control convergence)
$$\beta_{k+1} \leftarrow \beta_k + \alpha\left(X^\top D X\right)^{-1}X^\top(y - \sigma(X\beta_k))$$

Of course, in practice we never invert the matrix but rather compute the direction $d_k$ by solving the linear system
$$X^\top D X\, d_k = \alpha X^\top(y - \sigma(X\beta_k))$$
and then find the next iterate as
$$\beta_{k+1} \leftarrow \beta_k + d_k$$
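
A minimal NumPy sketch of this update, solving the linear system at each iteration rather than inverting the Hessian (function and variable names are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_mle(X, y, alpha=1.0, n_iter=25):
    # Newton's method: solve (X^T D X) d = alpha * X^T (y - sigma(X beta))
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p_hat = sigmoid(X @ beta)
        w = p_hat * (1.0 - p_hat)          # diagonal of D
        H = (X * w[:, None]).T @ X         # X^T D X
        d = np.linalg.solve(H, alpha * X.T @ (y - p_hat))
        beta = beta + d
    return beta
```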

Gradient Ascent (MAP, Ridge Regularization)

We want to maximize
$$\ln p(\beta \mid y) = \ln p(y \mid \beta) + \ln p(\beta) + \text{const} = -\sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right) - \frac{1}{2\sigma_\beta^2}\beta^\top\beta + \text{const}$$
The gradient of the log posterior is given by
$$\nabla_\beta\ln p(\beta \mid y) = X^\top(y - \sigma(X\beta)) - \frac{1}{\sigma_\beta^2}\beta$$
Thus gradient ascent with regularization to do MAP becomes (multiplying through by $\sigma_\beta^2$ as before, so that the constant is absorbed into the step size)
$$\beta_{k+1} \leftarrow \beta_k + \gamma\left(\sigma_\beta^2 X^\top(y - \sigma(X\beta_k)) - \beta_k\right)$$
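
A minimal sketch of this regularized update in NumPy, with `sigma2_beta` standing for $\sigma_\beta^2$ (all names illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient_ascent_map(X, y, sigma2_beta, gamma=0.01, n_iter=5000):
    # beta <- beta + gamma * (sigma_beta^2 * X^T (y - sigma(X beta)) - beta)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = beta + gamma * (sigma2_beta * X.T @ (y - sigmoid(X @ beta)) - beta)
    return beta
```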

Newton’s Method (MAP, Ridge Regularization)

Similarly, we want to maximize $\ln p(\beta \mid y)$. The Hessian is given by
$$\nabla^2_\beta\ln p(\beta \mid y) = -X^\top D X - \frac{1}{\sigma_\beta^2}I$$
therefore Newton's method update formula becomes
$$\beta_{k+1} \leftarrow \beta_k + \alpha\left[\sigma_\beta^2 X^\top D X + I\right]^{-1}\left(\sigma_\beta^2 X^\top(y - \sigma(X\beta_k)) - \beta_k\right)$$
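
As before, in practice one solves the corresponding linear system rather than inverting the matrix. A minimal sketch under the same naming assumptions as the previous snippets:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_map(X, y, sigma2_beta, alpha=1.0, n_iter=25):
    # beta <- beta + alpha * [sigma_beta^2 X^T D X + I]^{-1}
    #                        (sigma_beta^2 X^T (y - sigma(X beta)) - beta)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        p_hat = sigmoid(X @ beta)
        w = p_hat * (1.0 - p_hat)
        A = sigma2_beta * (X * w[:, None]).T @ X + np.eye(p)
        b = sigma2_beta * X.T @ (y - p_hat) - beta
        beta = beta + alpha * np.linalg.solve(A, b)
    return beta
```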

Iteratively Reweighted Least-Squares

We can manipulate the expression in Newton's method by defining a new variable
$$z_k = X\beta_k + D_k^{-1}(y - \sigma(X\beta_k))$$
Then the update takes the form
$$\beta_{k+1} \leftarrow \left(X^\top D_k X\right)^{-1}X^\top D_k z_k$$
(expanding $z_k$ shows this is exactly the Newton step $\beta_k + (X^\top D_k X)^{-1}X^\top(y - \sigma(X\beta_k))$ with $\alpha = 1$). In practice we would follow these steps

  • Evaluate $\eta_k = X\beta_k$ and $D_k$.
  • Solve the system $D_k r_k = y - \sigma(\eta_k)$ for $r_k$.
  • Compute $z_k = \eta_k + r_k$.
  • Solve the system $(X^\top D_k X)\,d_k = X^\top D_k z_k$ for $d_k$.
  • Set $\beta_{k+1} = d_k$.

Alternatively, noticing that $D_{ii} = \sigma(x_i^\top\beta)(1 - \sigma(x_i^\top\beta)) > 0$, one can take the square root of its elements and rewrite the problem as
$$\left(D_k^{1/2}X\right)^\top\left(D_k^{1/2}X\right)d_k = \left(D_k^{1/2}X\right)^\top\left(D_k^{1/2}z_k\right)$$
which is a simple least-squares problem in the new variables $\tilde{X}_k = D_k^{1/2}X$ and $\tilde{z}_k = D_k^{1/2}z_k$, and can be solved using the QR decomposition $\tilde{X}_k = QR$ by solving the following system for $d_k$
$$R\,d_k = Q^\top\tilde{z}_k$$
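
A minimal NumPy sketch of IRLS with this square-root/QR formulation (the function `irls` and all variable names are illustrative; no safeguards are included for weights close to zero):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def irls(X, y, n_iter=25):
    # Each iteration solves the weighted least-squares problem with
    # X_tilde = D^{1/2} X and z_tilde = D^{1/2} z via a QR decomposition.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                    # eta_k = X beta_k
        p_hat = sigmoid(eta)
        w = p_hat * (1.0 - p_hat)         # diagonal of D_k
        z = eta + (y - p_hat) / w         # working response z_k
        sw = np.sqrt(w)
        Q, R = np.linalg.qr(sw[:, None] * X)          # QR of D^{1/2} X
        beta = np.linalg.solve(R, Q.T @ (sw * z))     # beta_{k+1} = d_k
    return beta
```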

Logistic Regression with {-1, 1} Labels

Notice that so far we used $y_i \in \{0, 1\}$ with
$$P(Y_i = y_i \mid x_i) = \sigma(x_i^\top\beta)^{y_i}\,\sigma(-x_i^\top\beta)^{1 - y_i} = \begin{cases}\sigma(-x_i^\top\beta) & \text{if } y_i = 0\\ \sigma(x_i^\top\beta) & \text{if } y_i = 1\end{cases}$$
In particular $p(y_i = 1) = \sigma(x_i^\top\beta)$. This gave us the loss function
$$\sum_{i=1}^n \ln\left(1 + \exp((1 - 2y_i)\,x_i^\top\beta)\right)$$
Now the key point is to notice that
$$1 - 2y_i = \begin{cases}1 & \text{if } y_i = 0\\ -1 & \text{if } y_i = 1\end{cases}$$
So the mapping that makes $\{0,1\}$-logistic regression and $\{-1,1\}$-logistic regression equivalent is $0 \mapsto -1$ and $1 \mapsto 1$, i.e. $z_i = 2y_i - 1$, so that $1 - 2y_i = -z_i$.

With labels $z_i \in \{-1, 1\}$ we instead have $p(z_i = 1) = \sigma(x_i^\top\beta)$ and the loss function becomes
$$\sum_{i=1}^n \ln\left(1 + \exp(-z_i\, x_i^\top\beta)\right)$$
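
The equivalence of the two label conventions can be checked numerically. A minimal sketch on toy data (all arrays are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
beta = rng.normal(size=3)
y = rng.integers(0, 2, size=6)   # labels in {0, 1}
z = 2 * y - 1                    # mapped labels in {-1, 1}

eta = X @ beta
loss_01 = np.sum(np.log1p(np.exp((1 - 2 * y) * eta)))   # {0,1} formulation
loss_pm = np.sum(np.log1p(np.exp(-z * eta)))            # {-1,1} formulation
print(np.isclose(loss_01, loss_pm))                     # True
```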
