In the context of Supervised Fine-Tuning (SFT) it can be very expensive to update all the parameters, most of which are stored in large weight matrices. The key idea of LoRA is that one can adapt these matrices with a low-rank update. In the paper, they freeze all parameters except the matrices $W_k$, $W_q$, $W_v$, and $W_o$ (key, query, value, output) in the MHSA layers, but in principle one could apply the same adaptation to the FFNN layers as well.
Notice that these matrices (which we assume to be of shape $d \times d$) are used in a linear fashion: an input $x$ is multiplied by them, so that during the forward pass one would get $h = Wx$. In LoRA, we add an extra low-rank correction $\Delta W = BA$ which is trainable, so that during the forward pass we compute
$$h = Wx + \Delta W\,x = Wx + BA\,x = (W + BA)\,x,$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with $r \ll d$. The low-rank matrices are initialized with $A$ drawn from a Gaussian and $B$ as a zero matrix, so that $\Delta W = 0$ at the start of training and the adapted model initially coincides with the pre-trained one. During SFT the loss is minimized only with respect to the low-rank matrices $A$ and $B$; the pre-trained (UPT) parameters are kept frozen.
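To make the mechanics concrete, here is a minimal sketch of such an adapted linear map in PyTorch. The class name `LoRALinear`, the dimension $d = 768$, and the scale of the Gaussian initialization are illustrative assumptions, not details from the paper (which also rescales the update by a factor $\alpha / r$, omitted here for simplicity).

```python
# Minimal sketch of a LoRA-adapted linear layer (illustrative, simplified).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen pre-trained linear map W with a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        # Freeze the pre-trained weight W (and bias, if any).
        for p in self.base.parameters():
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        # A ~ Gaussian, B = 0, so BA = 0 at initialization and the adapted
        # model starts out identical to the pre-trained one.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + B A x
        return self.base(x) + x @ self.A.T @ self.B.T


# Example: adapt a d x d projection (d = 768 assumed) with rank r = 4.
layer = LoRALinear(nn.Linear(768, 768, bias=False), rank=4)
h = layer(torch.randn(2, 768))
```

Only `A` and `B` appear in the optimizer's parameter list, so the number of trainable parameters per adapted matrix drops from $d^2$ to $2dr$.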
Rule-of-Thumb: Given a fixed computational budget, it is better to apply LoRA to many matrices with a small rank $r$ than to a few matrices with a large $r$. They suggest updating $W_q$, $W_k$, $W_v$, and $W_o$ with a very small rank (e.g. $r$ from $1$ to $4$) and keeping the rest of the parameters in the network fixed.
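As a rough illustration of this rule of thumb, the following back-of-the-envelope snippet compares trainable-parameter counts under assumed dimensions ($d = 768$, $12$ layers): adapting all four attention matrices with $r = 2$ costs exactly as many parameters as adapting $W_q$ alone with $r = 8$, and the paper reports better results for the former.

```python
# Back-of-the-envelope parameter budgets (assumed dims: d = 768, 12 layers).
d, layers = 768, 12

def lora_params(matrices_per_layer: int, r: int) -> int:
    # Each adapted d x d matrix adds B (d x r) and A (r x d) -> 2 * d * r params.
    return layers * matrices_per_layer * 2 * d * r

# Many matrices, small rank: adapt W_q, W_k, W_v, W_o with r = 2.
print(lora_params(4, 2))  # 147456 trainable parameters
# Few matrices, large rank: adapt only W_q with r = 8.
print(lora_params(1, 8))  # 147456 trainable parameters -- same budget
```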