Bernoulli Setting
Assume $y_i \in \{0, 1\}$ follows a Bernoulli distribution given the observation $x_i \in \mathbb{R}^p$ and the parameters $\beta \in \mathbb{R}^p$:
$$y_i \mid x_i, \beta \sim \text{Bernoulli}(p_i), \qquad p_i := p(y_i = 1 \mid x_i, \beta).$$
We can write the probability mass function for $y_i$ as follows
$$p(y_i \mid x_i, \beta) = p_i^{y_i} (1 - p_i)^{1 - y_i}, \qquad y_i \in \{0, 1\}.$$
We assume that the log-odds is a linear combination of the input
$$\log \frac{p_i}{1 - p_i} = \beta^\top x_i \quad \Longleftrightarrow \quad p_i = \sigma(\beta^\top x_i) = \frac{1}{1 + \exp(-\beta^\top x_i)}.$$
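As a quick numerical illustration of this link function, here is a minimal sketch in NumPy; the names `sigmoid`, `x_i`, and `beta` and the toy values are ours, not from the text.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy example: one observation x_i (with a leading 1 for the intercept) and parameters beta.
x_i = np.array([1.0, 2.0, -0.5])
beta = np.array([0.1, 0.8, -0.3])

log_odds = beta @ x_i            # linear score beta^T x_i
p_i = sigmoid(log_odds)          # P(y_i = 1 | x_i, beta)
print(p_i, 1.0 - p_i)            # the two Bernoulli probabilities
```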
Joint Log-Likelihood
The joint likelihood is then found assuming the observations are IID
$$L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}.$$
The log-likelihood is then
$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right].$$
Alternatively, the likelihood can be written as follows
$$L(\beta) = \prod_{i=1}^{n} \frac{\exp(y_i \beta^\top x_i)}{1 + \exp(\beta^\top x_i)}.$$
Taking the logarithm of this expression gives
$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \beta^\top x_i - \log\left(1 + \exp(\beta^\top x_i)\right) \right].$$
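The two forms of the log-likelihood are easy to check numerically. Below is a small sketch; the function names are ours, `X` is the $n \times p$ design matrix and `y` the vector of $0/1$ labels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ] with p_i = sigma(beta^T x_i)."""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def log_likelihood_alt(beta, X, y):
    """Equivalent form sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.integers(0, 2, size=10)
beta = rng.normal(size=3)
print(np.isclose(log_likelihood(beta, X, y), log_likelihood_alt(beta, X, y)))  # True
```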
Maximum Likelihood = Minimize Loss Function
Our aim is to maximize the likelihood. This is equivalent to maximizing the log likelihood, which is equivalent to minimizing the negative log likelihood.
Let’s consider the expression inside the summation
$$y_i \log p_i + (1 - y_i) \log(1 - p_i).$$
We can notice that when $y_i = 1$ only the first term survives, while when $y_i = 0$ only the second term survives. Remember that $p_i = \sigma(\beta^\top x_i)$ and that $1 - p_i = \sigma(-\beta^\top x_i)$. We are thus looking for something that is $\beta^\top x_i$ when $y_i = 1$ and that is $-\beta^\top x_i$ when $y_i = 0$. In particular, notice that $(2y_i - 1)\beta^\top x_i$ does the job. Thus we can write our problem as
$$\hat{\beta} = \arg\min_\beta \sum_{i=1}^{n} -\log \sigma\!\left((2y_i - 1)\beta^\top x_i\right) = \arg\min_\beta \sum_{i=1}^{n} L(\beta; x_i, y_i),$$
where the loss incurred using parameter $\beta$ when predicting the label for $x_i$ whose true label is $y_i$ is
$$L(\beta; x_i, y_i) = -\log \sigma\!\left((2y_i - 1)\beta^\top x_i\right) = \log\left(1 + \exp\!\left(-(2y_i - 1)\beta^\top x_i\right)\right).$$
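For concreteness, here is a sketch of this loss in NumPy; the function name is ours, and it uses the relabelling $2y_i - 1$ from the derivation above. Minimizing it over `beta` is exactly maximizing the log-likelihood.

```python
import numpy as np

def logistic_loss(beta, X, y):
    """Total loss sum_i log(1 + exp(-(2 y_i - 1) beta^T x_i)), i.e. the negative log-likelihood."""
    margins = (2 * y - 1) * (X @ beta)
    return np.sum(np.log1p(np.exp(-margins)))
```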
Maximum-A-Posteriori (MAP)
Recall from Bayes' rule
$$p(\beta \mid y, X) = \frac{p(y \mid X, \beta)\, p(\beta)}{p(y \mid X)} \propto p(y \mid X, \beta)\, p(\beta).$$
Taking the logarithm on both sides and multiplying by $-1$ we obtain
$$-\log p(\beta \mid y, X) = -\log p(y \mid X, \beta) - \log p(\beta) + \text{const}.$$
We can choose to use a Gaussian prior on the parameters $\beta$. If $\beta \sim N(\mu_0, \Sigma_0)$ then
$$\log p(\beta) = -\frac{1}{2}(\beta - \mu_0)^\top \Sigma_0^{-1} (\beta - \mu_0) - \frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_0|.$$
Plugging this in, we obtain the following. Notice how we neglect terms that do not depend on $\beta$ because they will not matter when we minimize this with respect to $\beta$.
$$-\log p(\beta \mid y, X) = -\log p(y \mid X, \beta) + \frac{1}{2}(\beta - \mu_0)^\top \Sigma_0^{-1} (\beta - \mu_0) + \text{const}.$$
Ridge Regularization = Isotropic Gaussian Prior
Now if we set $\mu_0 = 0$ and $\Sigma_0 = \sigma_0^2 I$ (this is equivalent to setting a univariate normal prior on each coefficient, with $\beta_j \sim N(0, \sigma_0^2)$) we have
$$-\log p(\beta \mid y, X) = -\log p(y \mid X, \beta) + \frac{1}{2}\beta^\top \left(\sigma_0^2 I\right)^{-1} \beta + \text{const}.$$
Using the fact that for an invertible matrix $A$ and a non-zero constant $c$ we have $(cA)^{-1} = \frac{1}{c}A^{-1}$, we obtain
$$-\log p(\beta \mid y, X) = -\log p(y \mid X, \beta) + \frac{1}{2\sigma_0^2}\|\beta\|_2^2 + \text{const}.$$
Setting $\lambda = \frac{1}{\sigma_0^2}$ we have regularized logistic regression
$$\hat{\beta}_{\text{MAP}} = \arg\min_\beta \; -\log p(y \mid X, \beta) + \frac{\lambda}{2}\|\beta\|_2^2.$$
It is more stable to multiply through by $\sigma_0^2$, so we get
$$\hat{\beta}_{\text{MAP}} = \arg\min_\beta \; -\sigma_0^2 \log p(y \mid X, \beta) + \frac{1}{2}\|\beta\|_2^2.$$
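A sketch of the resulting objective in NumPy; the names are ours and `lam` plays the role of $\lambda = 1/\sigma_0^2$.

```python
import numpy as np

def ridge_objective(beta, X, y, lam):
    """Negative log-likelihood plus (lam / 2) * ||beta||^2, constants in beta dropped."""
    eta = X @ beta
    nll = -np.sum(y * eta - np.log1p(np.exp(eta)))
    return nll + 0.5 * lam * (beta @ beta)
```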
Ridge Regularization except on Intercept
$$\hat{\beta}_{\text{MAP}} = \arg\min_{\beta} \; -\log p(y \mid X, \beta) + \frac{\lambda}{2}\|\tilde{\beta}\|_2^2,$$
where
$$\tilde{\beta} = (\beta_1, \ldots, \beta_p)^\top$$
is the coefficient vector without the intercept $\beta_0$. We've defined $\tilde{\beta}$ because often we don't really regularize the intercept. This means that we place a multivariate Gaussian prior on $\tilde{\beta}$ as follows (again, this is equivalent to putting a univariate normal prior on each of $\beta_1, \ldots, \beta_p$ with $\beta_j \sim N(0, \sigma_0^2)$)
$$\tilde{\beta} \sim N(0, \sigma_0^2 I).$$
Instead, on $\beta_0$ we could place a uniform prior, whose density does not depend on $\beta_0$ and so can be dropped from the expression. It is more stable to multiply through by $\sigma_0^2$, therefore
$$\hat{\beta}_{\text{MAP}} = \arg\min_{\beta} \; -\sigma_0^2 \log p(y \mid X, \beta) + \frac{1}{2}\|\tilde{\beta}\|_2^2.$$
Notice that this is consistent with the implementation used by scikit-learn, provided here.
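As a usage sketch (not taken from the text): in scikit-learn's `LogisticRegression`, the L2 objective scales the data-fit term by `C` while leaving the penalty $\frac{1}{2}\|w\|^2$ unscaled and the intercept unpenalized, so under the Gaussian-prior reading above `C` corresponds to $\sigma_0^2 = 1/\lambda$.

```python
from sklearn.linear_model import LogisticRegression

lam = 10.0  # our choice of regularization strength, i.e. 1 / sigma_0^2
clf = LogisticRegression(penalty="l2", C=1.0 / lam, solver="lbfgs", fit_intercept=True)
# clf.fit(X, y)  # X: (n, p) features without an intercept column, y: labels in {0, 1}
```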
Full-Bayesian (Laplace Approximation)
The full-Bayesian approach is intractable. The Laplace approximation approximates the posterior $p(\beta \mid y, X)$ with a Gaussian distribution $q(\beta)$. To find such a distribution, we use the multivariate version of Taylor's expansion to expand the log posterior around its mode $\beta_{\text{MAP}}$. We take a second-order approximation
$$\log p(\beta \mid y, X) \approx \log p(\beta_{\text{MAP}} \mid y, X) + (\beta - \beta_{\text{MAP}})^\top \nabla_\beta \log p(\beta_{\text{MAP}} \mid y, X) + \frac{1}{2}(\beta - \beta_{\text{MAP}})^\top \nabla^2_\beta \log p(\beta_{\text{MAP}} \mid y, X)\, (\beta - \beta_{\text{MAP}}).$$
Since $\beta_{\text{MAP}}$ is a stationary point, the gradient at this point will be zero, so we have
$$\log p(\beta \mid y, X) \approx \log p(\beta_{\text{MAP}} \mid y, X) + \frac{1}{2}(\beta - \beta_{\text{MAP}})^\top \nabla^2_\beta \log p(\beta_{\text{MAP}} \mid y, X)\, (\beta - \beta_{\text{MAP}}).$$
We take the exponential on both sides to obtain
$$p(\beta \mid y, X) \approx p(\beta_{\text{MAP}} \mid y, X)\, \exp\!\left( -\frac{1}{2}(\beta - \beta_{\text{MAP}})^\top \left[ -\nabla^2_\beta \log p(\beta_{\text{MAP}} \mid y, X) \right] (\beta - \beta_{\text{MAP}}) \right).$$
We recognize this has the shape of a multivariate normal distribution. We therefore define our Laplace approximation to be a Gaussian centered at the mode, with covariance given by the inverse of the negative Hessian of the log posterior at the mode. That is
$$q(\beta) = N\!\left(\beta \,\middle|\, \beta_{\text{MAP}},\; \left[ -\nabla^2_\beta \log p(\beta_{\text{MAP}} \mid y, X) \right]^{-1} \right).$$
To find $\nabla^2_\beta \log p(\beta \mid y, X)$ we write
$$\log p(\beta \mid y, X) = \log p(y \mid X, \beta) + \log p(\beta) - \log p(y \mid X),$$
and we find each of the expressions on the right-hand side separately. We start with $\log p(\beta)$. Recall that if we have a quadratic form $\beta^\top A \beta$, the derivative of this quadratic form with respect to $\beta$ is given by $(A + A^\top)\beta$. Applying this to our case, and using the fact that $\Sigma_0^{-1}$ is symmetric, we have
$$\nabla_\beta \log p(\beta) = -\Sigma_0^{-1}(\beta - \mu_0).$$
Taking the derivative with respect to $\beta$ again we get
$$\nabla^2_\beta \log p(\beta) = -\Sigma_0^{-1}.$$
To find $\nabla^2_\beta \log p(y \mid X, \beta)$ we take the derivative componentwise. Using $\log p(y \mid X, \beta) = \sum_{i=1}^{n} \left[ y_i \beta^\top x_i - \log(1 + \exp(\beta^\top x_i)) \right]$,
$$\frac{\partial}{\partial \beta_j} \log p(y \mid X, \beta) = \sum_{i=1}^{n} \left[ y_i x_{ij} - \sigma(\beta^\top x_i)\, x_{ij} \right].$$
Now take the derivative with respect to $\beta_k$
$$\frac{\partial^2}{\partial \beta_k \partial \beta_j} \log p(y \mid X, \beta) = -\sum_{i=1}^{n} \sigma(\beta^\top x_i)\left(1 - \sigma(\beta^\top x_i)\right) x_{ij} x_{ik}.$$
This tells us that
$$\nabla^2_\beta \log p(y \mid X, \beta) = -\sum_{i=1}^{n} \sigma(\beta^\top x_i)\left(1 - \sigma(\beta^\top x_i)\right) x_i x_i^\top.$$
Note that for a vector $x_i = (x_{i1}, \ldots, x_{ip})^\top$ the outer product $x_i x_i^\top$ gives the following symmetric matrix
$$x_i x_i^\top =
\begin{pmatrix}
x_{i1}^2 & x_{i1} x_{i2} & \cdots & x_{i1} x_{ip} \\
x_{i2} x_{i1} & x_{i2}^2 & \cdots & x_{i2} x_{ip} \\
\vdots & \vdots & \ddots & \vdots \\
x_{ip} x_{i1} & x_{ip} x_{i2} & \cdots & x_{ip}^2
\end{pmatrix}.$$
In particular $\left(x_i x_i^\top\right)_{jk} = x_{ij} x_{ik}$, so that the componentwise second derivatives above are exactly the entries of $-\sum_i \sigma(\beta^\top x_i)(1 - \sigma(\beta^\top x_i))\, x_i x_i^\top$. Putting everything together we get
$$\nabla^2_\beta \log p(\beta \mid y, X) = -\sum_{i=1}^{n} \sigma(\beta^\top x_i)\left(1 - \sigma(\beta^\top x_i)\right) x_i x_i^\top - \Sigma_0^{-1}.$$
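Putting the pieces into code, here is a minimal sketch of the Laplace approximation; the function and variable names are ours, `beta_map` is assumed to have been found by one of the optimizers described below, and `Sigma0_inv` is the prior precision $\Sigma_0^{-1}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_approximation(beta_map, X, Sigma0_inv):
    """Return the mean and covariance of the Gaussian approximation q(beta).
    The covariance is the inverse of the negative Hessian of the log posterior at the mode:
    [ sum_i p_i (1 - p_i) x_i x_i^T + Sigma0^{-1} ]^{-1} = (X^T W X + Sigma0^{-1})^{-1}."""
    p = sigmoid(X @ beta_map)
    W = p * (1.0 - p)                                 # diagonal of W as a vector
    precision = X.T @ (W[:, None] * X) + Sigma0_inv   # negative Hessian at beta_map
    return beta_map, np.linalg.inv(precision)
```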
Gradient Ascent Optimization (MLE, No Regularization)
The simplest way to find the $\beta$ that maximizes the likelihood is to do gradient ascent. Remember that when working on the Laplace approximation we found the derivative of the log-likelihood with respect to the $j$-th component of $\beta$. We can rearrange such an expression to get a nicer form
$$\frac{\partial}{\partial \beta_j} \log p(y \mid X, \beta) = \sum_{i=1}^{n} \left( y_i - \sigma(\beta^\top x_i) \right) x_{ij}.$$
Therefore the full gradient is given by
$$\nabla_\beta \log p(y \mid X, \beta) = \sum_{i=1}^{n} \left( y_i - \sigma(\beta^\top x_i) \right) x_i.$$
Gradient ascent then becomes
$$\beta^{(t+1)} = \beta^{(t)} + \gamma \sum_{i=1}^{n} \left( y_i - \sigma\!\left({\beta^{(t)}}^{\top} x_i\right) \right) x_i.$$
This can be vectorized when programming as follows
$$\beta^{(t+1)} = \beta^{(t)} + \gamma\, X^\top\!\left( y - \sigma\!\left(X \beta^{(t)}\right) \right),$$
where $X \in \mathbb{R}^{n \times p}$ is the design matrix and $\sigma(\cdot)$ is applied elementwise.
One can also change the step size at every iteration, using an iteration-dependent step size $\gamma_t$ in place of the fixed $\gamma$.
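A minimal NumPy sketch of these updates follows; the function name, fixed step size, and iteration count are our own choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent_mle(X, y, step=0.01, n_iter=5000):
    """Fixed-step gradient ascent on the log-likelihood:
    beta <- beta + step * X^T (y - sigma(X beta))."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ beta))
        beta = beta + step * grad
    return beta
```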
Newton’s Method (MLE, No Regularization)
Again, during the Laplace approximation section we found that the Hessian of the log-likelihood is given by
$$\nabla^2_\beta \log p(y \mid X, \beta) = -\sum_{i=1}^{n} \sigma(\beta^\top x_i)\left(1 - \sigma(\beta^\top x_i)\right) x_i x_i^\top.$$
This expression can be vectorized as follows
$$\nabla^2_\beta \log p(y \mid X, \beta) = -X^\top W X,$$
where
$$W = \operatorname{diag}\!\left( \sigma(\beta^\top x_1)\left(1 - \sigma(\beta^\top x_1)\right), \ldots, \sigma(\beta^\top x_n)\left(1 - \sigma(\beta^\top x_n)\right) \right).$$
Newton's method then updates the weights as follows (where $\gamma$ is a learning rate to control convergence)
$$\beta^{(t+1)} = \beta^{(t)} + \gamma \left( X^\top W^{(t)} X \right)^{-1} X^\top\!\left( y - \sigma\!\left(X \beta^{(t)}\right) \right).$$
Of course, in practice we never invert the matrix but rather compute the direction $d^{(t)}$ by solving the linear system
$$\left( X^\top W^{(t)} X \right) d^{(t)} = X^\top\!\left( y - \sigma\!\left(X \beta^{(t)}\right) \right),$$
and then we find the next iterate as
$$\beta^{(t+1)} = \beta^{(t)} + \gamma\, d^{(t)}.$$
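Here is a sketch of the damped Newton iteration, solving the linear system rather than inverting; the names and default values are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_mle(X, y, gamma=1.0, n_iter=25):
    """Newton's method: solve (X^T W X) d = X^T (y - p), then beta <- beta + gamma * d."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        W = p * (1.0 - p)
        grad = X.T @ (y - p)
        H = X.T @ (W[:, None] * X)       # X^T W X, the negative Hessian
        d = np.linalg.solve(H, grad)     # Newton direction, no explicit inverse
        beta = beta + gamma * d
    return beta
```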
Gradient Ascent (MAP, Ridge Regularization)
We want to maximize the ridge-regularized log posterior
$$\log p(\beta \mid y, X) = \log p(y \mid X, \beta) - \frac{\lambda}{2}\|\beta\|_2^2 + \text{const}.$$
The gradient of the log posterior is given by
$$\nabla_\beta \log p(\beta \mid y, X) = X^\top\!\left( y - \sigma(X\beta) \right) - \lambda \beta.$$
Thus gradient ascent with ridge regularization to do MAP becomes
$$\beta^{(t+1)} = \beta^{(t)} + \gamma \left[ X^\top\!\left( y - \sigma\!\left(X\beta^{(t)}\right) \right) - \lambda \beta^{(t)} \right].$$
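The same gradient-ascent sketch as before, now with the extra $-\lambda\beta$ term; names and defaults are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent_map(X, y, lam=1.0, step=0.01, n_iter=5000):
    """Gradient ascent on the ridge-regularized log posterior:
    beta <- beta + step * (X^T (y - sigma(X beta)) - lam * beta)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ beta)) - lam * beta
        beta = beta + step * grad
    return beta
```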
Newton’s Method (MAP, Ridge Regularization)
Similarly, we want to maximize $\log p(y \mid X, \beta) - \frac{\lambda}{2}\|\beta\|_2^2$. The Hessian is given by
$$\nabla^2_\beta \log p(\beta \mid y, X) = -X^\top W X - \lambda I,$$
therefore the Newton's method update formula becomes
$$\beta^{(t+1)} = \beta^{(t)} + \gamma \left( X^\top W^{(t)} X + \lambda I \right)^{-1} \left[ X^\top\!\left( y - \sigma\!\left(X\beta^{(t)}\right) \right) - \lambda \beta^{(t)} \right].$$
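And the corresponding Newton sketch with $\lambda I$ added to the system matrix; names and defaults are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_map(X, y, lam=1.0, gamma=1.0, n_iter=25):
    """Solve (X^T W X + lam I) d = X^T (y - p) - lam * beta, then beta <- beta + gamma * d."""
    p_dim = X.shape[1]
    beta = np.zeros(p_dim)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        W = p * (1.0 - p)
        grad = X.T @ (y - p) - lam * beta
        H = X.T @ (W[:, None] * X) + lam * np.eye(p_dim)
        beta = beta + gamma * np.linalg.solve(H, grad)
    return beta
```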
Iteratively Reweighted Least-Squares
We can manipulate the expression in Newton's method (taking $\gamma = 1$ and no regularization for simplicity) by defining a new variable, the working response
$$z^{(t)} = X\beta^{(t)} + \left(W^{(t)}\right)^{-1}\left( y - \sigma\!\left(X\beta^{(t)}\right) \right).$$
Then the updates take the form of a weighted least-squares problem
$$\beta^{(t+1)} = \left( X^\top W^{(t)} X \right)^{-1} X^\top W^{(t)} z^{(t)}.$$
In practice we would follow these steps:
- Evaluate $p^{(t)} = \sigma\!\left(X\beta^{(t)}\right)$ and $W^{(t)} = \operatorname{diag}\!\left( p_i^{(t)}\left(1 - p_i^{(t)}\right) \right)$.
- Solve the system $W^{(t)} u^{(t)} = y - p^{(t)}$ for $u^{(t)}$.
- Compute $z^{(t)} = X\beta^{(t)} + u^{(t)}$.
- Solve the system $\left( X^\top W^{(t)} X \right) b = X^\top W^{(t)} z^{(t)}$ for $b$.
- Compute $\beta^{(t+1)} = b$ and repeat until convergence.
Alternatively, noticing that $W^{(t)}$ is diagonal with non-negative entries, one can take the square root of its elements and rewrite the problem as
$$\beta^{(t+1)} = \arg\min_\beta \left\| \left(W^{(t)}\right)^{1/2} z^{(t)} - \left(W^{(t)}\right)^{1/2} X \beta \right\|_2^2,$$
which is a simple least-squares problem on the new variables $\tilde{X}^{(t)} = \left(W^{(t)}\right)^{1/2} X$ and $\tilde{z}^{(t)} = \left(W^{(t)}\right)^{1/2} z^{(t)}$, and can be solved by using the QR decomposition $\tilde{X}^{(t)} = Q R$, i.e. by solving the following system for $\beta^{(t+1)}$
$$R\, \beta^{(t+1)} = Q^\top \tilde{z}^{(t)}.$$
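A sketch of the unregularized IRLS loop using the square-root-of-$W$ trick and a QR solve; the names and iteration count are ours, and a small floor on the weights (our addition) guards against division by zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(X, y, n_iter=25, w_floor=1e-10):
    """Iteratively reweighted least squares for logistic regression."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        p = sigmoid(eta)
        w = np.maximum(p * (1.0 - p), w_floor)    # diagonal of W
        z = eta + (y - p) / w                     # working response z = X beta + W^{-1}(y - p)
        sw = np.sqrt(w)
        Q, R = np.linalg.qr(sw[:, None] * X)      # QR of W^{1/2} X
        beta = np.linalg.solve(R, Q.T @ (sw * z)) # solve R beta = Q^T W^{1/2} z
    return beta
```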
Logistic Regression
Notice that in the previous chapter we used labels $\tilde{y}_i \in \{-1, +1\}$ with
$$p(\tilde{y}_i \mid x_i, \beta) = \sigma\!\left( \tilde{y}_i\, \beta^\top x_i \right).$$
In particular
$$p(\tilde{y}_i = 1 \mid x_i, \beta) = \sigma(\beta^\top x_i) \quad \text{and} \quad p(\tilde{y}_i = -1 \mid x_i, \beta) = \sigma(-\beta^\top x_i) = 1 - \sigma(\beta^\top x_i).$$
This gave us the loss function
$$L(\beta; x_i, \tilde{y}_i) = -\log \sigma\!\left( \tilde{y}_i\, \beta^\top x_i \right) = \log\left( 1 + \exp\!\left( -\tilde{y}_i\, \beta^\top x_i \right) \right).$$
Now the key point is to notice that $2y_i - 1$ takes the value $+1$ when $y_i = 1$ and $-1$ when $y_i = 0$. So that actually the mapping that makes $\{-1, +1\}$-Logistic Regression and $\{0, 1\}$-Logistic Regression equivalent is $\tilde{y}_i = 2y_i - 1$ and $y_i = \frac{\tilde{y}_i + 1}{2}$. Now instead we have $y_i \in \{0, 1\}$, with loss
$$L(\beta; x_i, y_i) = \log\left( 1 + \exp\!\left( -(2y_i - 1)\, \beta^\top x_i \right) \right) = -\left[ y_i \log \sigma(\beta^\top x_i) + (1 - y_i)\log\left(1 - \sigma(\beta^\top x_i)\right) \right],$$
which is exactly the cross-entropy loss derived above.
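A quick numerical check of this equivalence, using an entirely made-up toy example of our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
beta = rng.normal(size=3)
y01 = rng.integers(0, 2, size=5)   # labels in {0, 1}
ypm = 2 * y01 - 1                  # same labels mapped to {-1, +1}

eta = X @ beta
loss_01 = -(y01 * np.log(sigmoid(eta)) + (1 - y01) * np.log(1.0 - sigmoid(eta)))
loss_pm = np.log1p(np.exp(-ypm * eta))
print(np.allclose(loss_01, loss_pm))  # True: the two losses agree sample by sample
```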