Decision Trees
If the response is continuous, we use Gradient Boost for regression.
Gradient Boost vs AdaBoost
- AdaBoost: creates stumps sequentially, where each stump learns from the errors of the previous stumps by using different sample weights.
- Gradient Boost: starts by making a single leaf, not a tree. This leaf is an initial guess for the response of all of the samples; e.g. when predicting a continuous response, it is the average response across all observations. Gradient Boost then builds a tree (usually larger than a stump, but still restricted in size; typically people set the maximum number of leaves to somewhere between $8$ and $32$). Trees are again built by learning from the errors of the previous trees. An important difference is that Gradient Boost scales all trees by the same amount (the learning rate), whereas AdaBoost gives each stump its own weight; see the sketch below.
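A minimal sketch of this contrast using scikit-learn (assuming its `AdaBoostRegressor` and `GradientBoostingRegressor`; the dataset and hyperparameter values are illustrative, not from the notes):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Illustrative regression data (not from the notes).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# AdaBoost: a sequence of stumps (depth-1 trees); each stump gets its own weight
# based on how well it did on the reweighted samples.
ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=1),
                        n_estimators=100, random_state=0).fit(X, y)

# Gradient Boost: starts from a single leaf (the mean response), then adds
# size-restricted trees (here at most 8 leaves), all scaled by the same learning rate.
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                               max_leaf_nodes=8, random_state=0).fit(X, y)
```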
Steps for Gradient Boost
- Compute the average response. This is the initial leaf, i.e. the model's first prediction for every observation.
- Build a tree: first compute the pseudo-residuals, i.e. the difference between each observed response and the current prediction (initially the average response).
- Fit a tree to predict these residuals. We restrict the number of leaves of this tree to somewhere between $8$ and $32$, so there will typically be fewer leaves than residuals. Associate each leaf with the average pseudo-residual of the observations that land in it.
- To make a prediction, add the initial leaf (the average response) and the predicted average residual from the tree.
This gives low bias but high variance, i.e. overfitting. To avoid this, multiply the predicted average residual by a learning rate before adding it.
- Now create a new tree that predicts the pseudo-residuals of the updated predictions (average response plus the scaled prediction of the first tree). If you like, these are residuals of the residuals.
Basically we repeat this many times; the contributions of all trees are scaled by the same learning rate. A from-scratch sketch of these steps is given below.
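To make the steps concrete, here is a minimal from-scratch sketch for squared-error regression, assuming scikit-learn's `DecisionTreeRegressor` as the base learner; the class name `SimpleGradientBoost` and the hyperparameter values are my own choices for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class SimpleGradientBoost:
    def __init__(self, n_trees=100, learning_rate=0.1, max_leaf_nodes=8):
        self.n_trees = n_trees
        self.learning_rate = learning_rate
        self.max_leaf_nodes = max_leaf_nodes

    def fit(self, X, y):
        # Step 1: the initial "leaf" is just the average response.
        self.initial_prediction = np.mean(y)
        current_pred = np.full(len(y), self.initial_prediction)
        self.trees = []
        for _ in range(self.n_trees):
            # Step 2: pseudo-residuals = observed - current prediction
            # (for squared error this is the negative gradient).
            residuals = y - current_pred
            # Step 3: fit a size-restricted tree to the residuals; each leaf
            # ends up predicting the average residual of its observations.
            tree = DecisionTreeRegressor(max_leaf_nodes=self.max_leaf_nodes)
            tree.fit(X, residuals)
            self.trees.append(tree)
            # Step 4: update predictions, scaling every tree by the same learning rate.
            current_pred += self.learning_rate * tree.predict(X)
        return self

    def predict(self, X):
        # Prediction = average response + learning_rate * (sum of all trees' residual predictions).
        pred = np.full(X.shape[0], self.initial_prediction)
        for tree in self.trees:
            pred += self.learning_rate * tree.predict(X)
        return pred


if __name__ == "__main__":
    from sklearn.datasets import make_regression
    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
    model = SimpleGradientBoost(n_trees=100, learning_rate=0.1).fit(X, y)
    print(model.predict(X[:3]), y[:3])
```

Note the trade-off implied above: a smaller learning rate reduces variance but usually requires more trees to reach the same training fit.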