Fine-Tuning

In the context of Supervised Fine-Tuning (SFT) it can be very expensive to update all the parameters $\theta$, most of which are large matrices. The key idea of LoRA is that one can update these matrices with a low-rank correction. In the paper, all parameters are frozen except the matrices $W_k$, $W_q$, $W_v$ and $W_o$ (key, query, value, output) in the MHSA layers, but in principle the same adaptation could also be applied to the FFNN layers.

Notice that these matrices $W$ (which we assume to be of shape $d \times k$) are used linearly: an input $x \in \mathbb{R}^k$ is multiplied by them, so that the forward pass produces $Wx \in \mathbb{R}^d$. In LoRA, we add an extra trainable low-rank correction so that the forward pass computes
$$(W + BA)\,x = \underbrace{Wx}_{\text{pre-trained}} + \underbrace{BAx}_{\text{SFT}},$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$. The low-rank matrices are initialized with $A$ drawn from a Gaussian and $B$ as a zero matrix, so that the correction $BA$ vanishes at the start of training. During SFT the loss is minimized only over the low-rank updates; the pre-trained (UPT) parameters $\theta$ are kept frozen.
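
To make the mechanics concrete, here is a minimal sketch of a LoRA-adapted linear layer in PyTorch (an assumed dependency). The class name `LoRALinear`, the Gaussian scale, and the dimensions are illustrative choices, not taken from the paper.

```python
# A minimal sketch of a LoRA-adapted linear layer (assumes PyTorch).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen pre-trained weight W with a trainable low-rank update BA."""

    def __init__(self, d: int, k: int, r: int):
        super().__init__()
        # Frozen pre-trained weight W of shape d x k (stands in for W_k, W_q, W_v or W_o).
        self.W = nn.Parameter(torch.randn(d, k), requires_grad=False)
        # Low-rank factors: A ~ Gaussian (r x k), B = 0 (d x r),
        # so BA is zero at initialization and the first forward pass equals Wx.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (W + BA) x = Wx (pre-trained) + BAx (SFT correction)
        return x @ self.W.T + x @ self.A.T @ self.B.T


layer = LoRALinear(d=768, k=768, r=4)
x = torch.randn(2, 768)                # batch of 2 inputs in R^k
print(layer(x).shape)                  # torch.Size([2, 768])
# Only the low-rank factors receive gradients; W stays frozen.
print([n for n, p in layer.named_parameters() if p.requires_grad])  # ['A', 'B']
```

Because $B$ starts at zero, the adapted model reproduces the pre-trained model exactly before any SFT step, which is why this initialization is used.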

$$\min_{\Delta}\;\left\{-\frac{1}{n_{\text{batch}}}\sum_{n=1}^{n_{\text{batch}}}\frac{1}{|y_n|}\sum_{t=1}^{|y_n|}\log\Big(\operatorname{softmax}\big(\mathrm{LLM}_{\theta+\Delta}(x_n, y_{n,<t})\big)_{y_{n,t}}\Big)\right\}$$
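
As an illustration of this objective, the following hedged sketch computes the per-token negative log-likelihood for a causal LM whose projections carry LoRA adapters. The names `model`, `input_ids`, and `response_mask`, as well as the Hugging Face-style `.logits` attribute, are assumptions made for the example.

```python
# A hedged sketch of the SFT objective above (assumes a Hugging Face-style causal LM).
import torch
import torch.nn.functional as F


def sft_loss(model, input_ids, response_mask):
    # logits: (batch, seq_len, vocab); the frozen base parameters theta receive no
    # gradients, so optimization only touches the low-rank updates Delta = {A, B}.
    logits = model(input_ids).logits
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    # Position t predicts token t+1; gather the log-probability of the true next token.
    targets = input_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Average only over response tokens y_n (response_mask is 1 on y_n, 0 on the prompt x_n).
    mask = response_mask[:, 1:].float()
    per_seq = -(token_ll * mask).sum(dim=1) / mask.sum(dim=1)
    return per_seq.mean()


# Optimize only the trainable (LoRA) parameters:
# optimizer = torch.optim.AdamW(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```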

Rule-of-Thumb: Given a computational budget, it is better to apply LoRA to many matrices with small $r$ than to a few matrices with large $r$. They suggest updating $W_k$, $W_q$, $W_v$ and $W_o$ with a very small rank (e.g. from 1 to 4) and keeping the rest of the parameters in the network fixed. One possible way to do this in practice is sketched below.
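
The sketch below uses the `peft` library (an assumed dependency) to attach rank-4 adapters to all four attention projections; the `target_modules` names are architecture-specific assumptions (LLaMA-style naming), and the checkpoint name is a placeholder.

```python
# A hedged sketch of the rule-of-thumb using peft: small rank, all attention projections.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder checkpoint
config = LoraConfig(
    r=4,                            # small rank, as suggested above
    lora_alpha=8,                   # scaling applied to the BA update
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # W_q, W_k, W_v, W_o
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the low-rank A, B matrices are trainable
```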
