Also known as GBDT.
Etsy recently published an article documenting how they migrated their search service from its old ranking model to one based on deep learning. The majority of that article made little sense to me, so I wanted to take a moment to document what the heck it actually meant.
Originally, Etsy’s search service was based on a gradient-boosted decision tree model, which drove personalized search. They had to manually engineer the features of the tree model, and their relevancy gains began to plateau. They were excited about the potential improvements of deep learning models, especially because those could embed features directly, incorporate multi-modal data, and alter the network architecture itself. Their ranking ran in two passes, and this work primarily targeted the second pass.
Starting from the beginning, a decision tree is a supervised machine learning algorithm (definition).
Each node of the graph is either a root (the starting point), an internal node (also known as a decision node), or a leaf node (aka a terminal node). The intent is to find a path from the root to a leaf node by evaluating a predicate at each node which tells you which outgoing edge to follow. By the time we get into leaf nodes, we want the supervised results to be homogeneous.
As an example, we may have a simple question like “Should I wear a hat today?”. There are clear cases when you should definitely wear a hat: it’s sunny at noon in the desert, or it’s going to be rainy. But the question “What’s the day’s weather forecast?” isn’t adequate on its own to firmly bucket you into “yes, wear a hat” or “no, don’t wear a hat”. If it’s going to be sunny today but you’re going out at 9pm, the answer is unlikely to be the same as if you’re going out at noon.
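To make that concrete, here’s a tiny hand-rolled sketch of the hat tree. It’s purely illustrative: the features, thresholds, and hat-worthiness rules are my own made-up assumptions, not anything from Etsy’s article.

```python
# A toy decision tree for "Should I wear a hat today?".
# Internal nodes ask a question about a feature; leaves return the answer.
# The features and thresholds here are invented for illustration.

def should_wear_hat(forecast: str, hour: int) -> bool:
    # Root node: what's the day's weather forecast?
    if forecast == "rainy":
        return True              # leaf: keep the rain off your head
    if forecast == "sunny":
        # Internal (decision) node: the forecast alone isn't enough,
        # so split again on the time of day you're heading out.
        return 10 <= hour <= 16  # leaf: midday sun -> hat, evening -> no hat
    return False                 # leaf: cloudy, no strong reason for a hat


print(should_wear_hat("sunny", hour=12))  # True
print(should_wear_hat("sunny", hour=21))  # False
print(should_wear_hat("rainy", hour=9))   # True
```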
Figuring out these decision points is a key aspect of decision tree models. The algorithm determines where to make each decision through a process called “node splitting”. There are a variety of algorithms for this, all rooted in math and statistics. The general idea, however, is that you split the known examples up based on their features (i.e. properties), aiming for groups whose labels are as homogeneous as possible. A single split often only gets you part of the way there, so you keep splitting each group until you reach some final homogeneous (or close enough) grouping.
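One common way to score a candidate split is Gini impurity: a group that’s all “wear a hat” or all “don’t” scores 0, and a 50/50 mix scores 0.5. Here’s a minimal sketch of picking the best split on a single numeric feature this way. The data is made up, and real libraries do this far more efficiently.

```python
# Minimal node-splitting sketch using Gini impurity on one numeric feature.
# Labels: True = "wear a hat", False = "don't". Data is invented for illustration.

def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for a 50/50 binary mix."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - (p ** 2 + (1 - p) ** 2)

def best_split(values, labels):
    """Try every threshold on a single feature and return the one whose
    two child groups have the lowest size-weighted impurity."""
    best = None
    for threshold in sorted(set(values)):
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Feature: hour of day you're going out. Label: did a hat make sense?
hours    = [9,     11,   12,   14,   16,   19,    21,    23]
wore_hat = [False, True, True, True, True, False, False, False]
print(best_split(hours, wore_hat))  # (16, 0.2): the first split peels off the evenings
```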
There’s a real risk of overfitting with this approach. There are a few ways to address it, including one called “random forests”. Overfitting feels out of scope for understanding the original article though.
The “boosted” adjective isn’t specific to decision trees. It’s a general machine learning term which seems to mean “take a bunch of weak guesses and apply them one after another, each correcting the previous ones, until you get the right answer”. In the algorithm’s case, it means “take a bunch of crappy learning algorithms and apply them in series”. The result is a better algorithm than any of them alone.
“Gradient”, in this context, refers to gradient descent. It’s a mechanism for finding the minimum value of a function by repeatedly stepping downhill along its slope. In our case, the function is something like an error-rate (loss) function: we want to find the minimum error rate that we can within the boosting process.
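As a minimal sketch of the “stepping downhill” idea (not the exact math any particular library uses), here’s gradient descent minimizing a simple squared-error loss for a one-parameter model on toy data I made up:

```python
# Minimal gradient descent sketch: find the value `m` that minimizes
# the squared error of predicting y = m * x on some toy data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x, with a little noise

def loss(m):
    return sum((m * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def loss_gradient(m):
    # derivative of the mean squared error above with respect to m
    return sum(2 * x * (m * x - y) for x, y in zip(xs, ys)) / len(xs)

m = 0.0                     # start with a bad guess
learning_rate = 0.01
for step in range(200):
    m -= learning_rate * loss_gradient(m)   # step downhill along the slope

print(round(m, 3), round(loss(m), 4))        # m ends up near 2.0
```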
For the “gradient-boosted decision tree” case, you take the output of one kinda janky decision tree and then build another decision tree which corrects the errors the first one made. You continue this process until you have a better model than you started with. The errors that are left over from a decision tree are called “residuals”, and we use a loss function to measure them (similar to other classification tasks). If you chain too many of these decision trees in series, you similarly risk overfitting.
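Putting that together, here’s a rough sketch of the residual-fitting loop using scikit-learn’s `DecisionTreeRegressor` as the “kinda janky” weak learner. The toy data, tree depth, and learning rate are arbitrary choices of mine for illustration, not Etsy’s setup.

```python
# Rough gradient boosting sketch with squared-error loss:
# each new shallow tree is trained on the residuals of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   # toy target

prediction = np.zeros_like(y)      # start from a trivial model (predict 0)
learning_rate = 0.1
trees = []

for _ in range(100):
    residuals = y - prediction                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)  # a deliberately weak tree
    tree.fit(X, residuals)                     # learn to predict the leftover error
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("mean squared error:", round(float(np.mean((y - prediction) ** 2)), 4))

def predict(x_new):
    # The final model is just the sum of all the small corrections.
    return sum(learning_rate * t.predict(x_new) for t in trees)
```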
Okay, so we know that a gradient-boosted decision tree is a bunch of smaller decision trees applied in series, each correcting the last, to drive the error rate down. In our algorithm’s terms, that means a bunch of leaf nodes which will correctly determine the output of something. Awesome.