Fine-Tuning

In the context of Supervised Fine-Tuning (SFT) it can be very expensive to update all the parameters $\theta$, most of which are large matrices. The key idea of LoRA is to update these matrices with a low-rank correction. In the paper, LoRA updates are applied only to the matrices $W_k$, $W_q$, $W_v$ and $W_o$ (key, query, value, output) in the MHSA layers, while everything else stays frozen; in principle the same adaptation could also be applied to the FFNN layers.

Notice that these matrices $W$ (which we assume to be of shape $d\times k$) are used in a linear fashion: an input $x\in\mathbb{R}^k$ is multiplied by them, so that during the forward pass one obtains $Wx\in\mathbb{R}^d$. In LoRA, we add a trainable low-rank correction, so that during the forward pass we compute $$ (W + BA)x = \underbrace{Wx}_{\text{pre-trained}} + \underbrace{BAx}_{\text{SFT}}, $$ where $B\in\mathbb{R}^{d\times r}$ and $A\in\mathbb{R}^{r\times k}$ with $r \ll \min(d, k)$. The low-rank matrices are initialized with $A$ drawn from a Gaussian and $B$ as a zero matrix, so that $BA = 0$ at the start of training. During SFT the loss is minimized only over the low-rank matrices $A$ and $B$; the pre-trained (UPT) parameters $\theta$ are kept frozen:
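As a concrete illustration, here is a minimal sketch of such a LoRA-adapted linear layer in PyTorch. The class name `LoRALinear`, the random placeholder for the pre-trained weight, and the Gaussian scale for $A$ are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear map W plus a trainable low-rank update BA."""

    def __init__(self, d: int, k: int, r: int):
        super().__init__()
        # Placeholder for the frozen pre-trained weight W of shape (d, k).
        self.W = nn.Parameter(torch.randn(d, k), requires_grad=False)
        # A ~ Gaussian, shape (r, k); B = 0, shape (d, r), so BA = 0 at the start of SFT.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (W + BA) x = W x + B (A x); only A and B receive gradients.
        return x @ self.W.T + (x @ self.A.T) @ self.B.T
```

Computing $B(Ax)$ instead of $(BA)x$ keeps the extra cost proportional to $r$, and an optimizer built over `requires_grad=True` parameters will only ever see $A$ and $B$.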

$$ \min_{\Delta} \left\{ -\frac{1}{n_{\text{batch}}}\sum_{n=1}^{n_{\text{batch}}} \frac{1}{|y_n|} \sum_{t=1}^{|y_n|} \log\left(\text{softmax}\left(\text{LLM}_{\theta + \Delta}(x_n, y_{n, <t})\right)_{y_{n, t}}\right) \right\} $$
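In code, this objective is an ordinary per-token cross-entropy, averaged first over the tokens of each target $y_n$ and then over the batch; the only difference from full fine-tuning is that the optimizer sees just the LoRA parameters. A sketch, assuming a PyTorch-style `llm` that maps a sequence of token ids to next-token logits (the batch layout `(prompt_ids, target_ids)` is hypothetical):

```python
import torch
import torch.nn.functional as F

def sft_loss(llm, batch):
    """Token-averaged NLL per example, then averaged over the batch."""
    losses = []
    for prompt_ids, target_ids in batch:                    # x_n, y_n
        input_ids = torch.cat([prompt_ids, target_ids])     # condition on x_n and y_{n,<t}
        logits = llm(input_ids.unsqueeze(0)).squeeze(0)      # (seq_len, vocab)
        # logits at position i predict token i+1, so the predictions for y_n
        # start at position |x_n| - 1 and span |y_n| positions.
        start = prompt_ids.numel() - 1
        pred = logits[start : start + target_ids.numel()]
        nll = F.cross_entropy(pred, target_ids)              # mean over the |y_n| tokens
        losses.append(nll)
    return torch.stack(losses).mean()                        # mean over the batch
```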

Rule of thumb: for a fixed computational budget, it is better to apply LoRA to many matrices with a small rank $r$ than to a few matrices with a large $r$. The paper suggests updating $W_k$, $W_q$, $W_v$ and $W_o$ with a very small rank (e.g. from $1$ to $4$) and keeping the rest of the parameters in the network fixed; the parameter count below makes this concrete.
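Each adapted $d\times k$ matrix adds $r(d+k)$ trainable parameters, so spreading a budget over many matrices with small $r$ costs the same as concentrating it on few matrices with large $r$. The hidden size below is a hypothetical example, not a value from the paper.

```python
def lora_params(d: int, k: int, r: int) -> int:
    # Trainable parameters of one LoRA update: B (d x r) plus A (r x k).
    return r * (d + k)

d = k = 4096  # hypothetical hidden size

# Same budget, spread thin vs. concentrated:
many_small = 4 * lora_params(d, k, r=2)   # W_q, W_k, W_v, W_o with r = 2
few_large  = 1 * lora_params(d, k, r=8)   # only W_q with r = 8
print(many_small, few_large)              # both 65536 trainable parameters
```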
