Computer scientist Tin Kam Ho introduced Random Forests in a 1995 article. Over the years, the algorithm has become one of the most popular choices for classification and regression problems in machine learning, and it consistently ranks among the best-performing off-the-shelf methods. Widely used by data scientists to predict all sorts of scenarios, the combination of multiple decision trees yields a powerful approach to estimating and classifying different data samples.
In this article, Random Forest will be the topic of discussion as we go through each step of its implementation.
Preparing the Training Data
The first step is to draw a new batch of samples from the primary dataset. This process of randomly sampling the data is called Bootstrapping.
Let us use a database as an example. The database below is completely fictional and very short. It indicates whether a certain user enjoys a Netflix title or not. In the example, binary values represent the words ‘yes’ and ‘no’.
Bootstrapping randomly draws rows from the initial data to create a new batch, usually the same size as the original, that will train the model and generate the decision trees. Because the rows are drawn at random, the batch typically contains about two-thirds of the distinct rows of the initial data. The new dataset will look something like the following picture.
Note that some of the rows repeat, because this is a case of Sampling with Replacement: each drawn row is returned to the pool before the next draw, so the same row can be selected more than once while the new dataset stays the same size as the initial one.
The remaining rows, roughly one-third of the data, never appear in the training batch and compose the test data, fundamental for evaluating the model's accuracy.
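The bootstrap draw and the resulting test split can be sketched in a few lines of NumPy. The dataset size and random seed below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
n = 12  # number of rows in the (hypothetical) original dataset

# Bootstrapping: draw n row indices with replacement,
# so the training batch has the same size as the original
boot_idx = rng.integers(0, n, size=n)

# Out-of-bag rows: indices never drawn, kept aside as test data
oob_idx = np.setdiff1d(np.arange(n), boot_idx)
```

Because repeats are allowed, some indices appear several times in `boot_idx`, and the rows that were never drawn end up in `oob_idx`, the test set described above.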
Creating Decision Trees
Decision trees are what compose the algorithm itself. Each tree is built from random features of the dataset, so the splits are arranged differently in each tree, resulting in varied decision structures. However, individual decision trees tend to overfit the data that trained them. For this reason, this step relies on Bootstrap Aggregation (Bagging), a highly efficient way of reducing the variance of the model.
In order to execute said method, we resample the training data into many different samples drawn with replacement. Each sample trains its own decision tree. If dealing with regression, the final model outputs the average of the trees' predictions. Similarly, for classification, the model counts how many times each label was assigned to a specific example, and the label with the highest count will be the output.
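The two aggregation rules can be shown directly. The per-tree predictions below are invented for illustration; in practice they would come from trees trained on different bootstrap samples.

```python
from collections import Counter

# Hypothetical predictions for one sample from a forest of 5 trees

# Regression: the forest outputs the average of the trees' predictions
regression_preds = [3.1, 2.8, 3.4, 3.0, 2.9]
forest_regression_output = sum(regression_preds) / len(regression_preds)

# Classification: each tree votes for a label (1 = 'yes', 0 = 'no'),
# and the label with the most votes wins
classification_preds = [1, 0, 1, 1, 0]
forest_class_output = Counter(classification_preds).most_common(1)[0][0]
```

Here the regression output is 3.04, and the classification output is 1, since 'yes' received three of the five votes.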
Improving the Accuracy
Testing the model is extremely valuable for adjusting the algorithm and improving its performance. As described in previous sections, we divide the initial dataset, and about one-third of the data becomes the testing dataset.
After running the algorithm on the test data, it is possible to calculate the accuracy of the model. Random Forests allow precise tuning, as the user can adjust the hyperparameters and choose more specific settings.
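Accuracy itself is just the fraction of test examples the model labels correctly. A minimal calculation, with hypothetical test labels and predictions, looks like this:

```python
# Hypothetical true labels and model predictions (1 = 'yes', 0 = 'no')
y_test = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Accuracy = correct predictions / total predictions
accuracy = sum(t == p for t, p in zip(y_test, y_pred)) / len(y_test)
```

With four of the six predictions correct, the accuracy comes out to roughly 0.67.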
Some of the hyperparameters that can be modified include the number of trees in the forest, the maximum number of levels in a tree, and the minimum number of data points in a node, among others.
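As a sketch of how these hyperparameters are set in practice, assuming scikit-learn is installed, the tiny dataset below is made up for illustration and real tuning should score against held-out data:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features (e.g. age, genre flag) and labels (1 = 'yes', 0 = 'no')
X = [[20, 1], [35, 0], [26, 1], [44, 0], [31, 1], [52, 0]]
y = [1, 0, 1, 0, 1, 0]

model = RandomForestClassifier(
    n_estimators=50,     # number of trees in the forest
    max_depth=3,         # maximum number of levels in a tree
    min_samples_leaf=1,  # minimum number of data points in a leaf node
    random_state=0,      # fixed seed so the result is reproducible
)
model.fit(X, y)
train_accuracy = model.score(X, y)  # accuracy on the data used here
```

Each of these settings trades flexibility against overfitting: more trees stabilize the vote, while shallower trees and larger leaves constrain each individual tree.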
At 4tune, we are passionate about solving clients’ problems to deliver profit. If you have a business idea or need help with anything related to artificial intelligence, we are glad to help. Please contact us at email@example.com for more details.