Decision Tree

A decision tree is a type of supervised learning algorithm that is mostly used for classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population into two or more homogeneous sets based on the most significant splitter among the input variables.

A major advantage of decision tree analysis is its ability to assign specific values to problems, decisions, and the outcomes of each decision. This reduces ambiguity in decision-making. Every possible scenario from a decision is represented by a clear fork and node, so all possible solutions can be viewed clearly in a single view.

A decision tree is a tree-shaped diagram used to determine a course of action. Each branch of the tree represents a possible decision, occurrence, or reaction.
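
As a concrete illustration, here is a minimal sketch of training a decision tree classifier with scikit-learn (assuming scikit-learn is available; the Iris data set is used only as a stand-in example):

```python
# Minimal sketch: training a decision tree classifier with scikit-learn.
# Assumes scikit-learn is installed; the Iris data set is just an example.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```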

Entropy and Gini impurity overview

Entropy is a measure of the randomness or unpredictability in the data set. Information gain is the measure of the decrease in entropy when the data set is split on an attribute.

Entropy = -(p/(p+n)) * log2(p/(p+n)) - (n/(p+n)) * log2(n/(p+n)), where p and n are the numbers of positive and negative examples in the data set.

Information gain is calculated from entropy, and the attribute with the largest information gain is chosen for the root node (and, recursively, for each subsequent split).
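
As a sketch of how these quantities could be computed for a binary (positive/negative) problem, here are two illustrative helper functions (the function names and two-class setup are assumptions for the example, not a fixed API):

```python
import math

def entropy(p, n):
    """Entropy of a node with p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:
            frac = count / total
            result -= frac * math.log2(frac)
    return result

def information_gain(parent, children):
    """Information gain of splitting `parent` (p, n) into `children` [(p, n), ...]."""
    total = sum(p + n for p, n in children)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(*parent) - weighted

# Example: splitting 10 positives / 10 negatives into two purer branches.
print(information_gain((10, 10), [(8, 2), (2, 8)]))
```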

Gini Impurity is a measurement of the likelihood of an incorrect classification of a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels from the data set. Gini impurity is lower bounded by 0, with 0 occurring if the data set contains only one class.

Gini = Σ P(i) * (1 - P(i)), where P(i) is the probability of class i in the node.

Gini gain is calculated when building a decision tree to help determine which attribute gives us the most information about which class a new data point belongs to.
This is done in a similar way to how information gain is calculated for entropy, except that instead of taking a weighted sum of the entropies of each branch of a split, we take a weighted sum of the Gini impurities.
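
A comparable sketch for Gini impurity and Gini gain (again with illustrative helper names; per-class counts are assumed as input):

```python
def gini_impurity(counts):
    """Gini impurity of a node given the count of examples in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_gain(parent_counts, child_counts_list):
    """Gini gain = parent impurity minus the weighted impurity of the children."""
    total = sum(sum(c) for c in child_counts_list)
    weighted = sum(sum(c) / total * gini_impurity(c) for c in child_counts_list)
    return gini_impurity(parent_counts) - weighted

# Example: splitting a node with 10 examples of each class.
print(gini_gain([10, 10], [[8, 2], [2, 8]]))
```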

Advantages of decision tree:

• Decision trees implicitly perform variable screening or feature selection
• Simple to understand, interpret and visualize
• Minimum effort required for data preparation
• Nonlinear relationships between parameters do not affect tree performance

Disadvantages of decision tree

• Overfitting occurs when the algorithm captures noise in the data.
• The model can become unstable due to small variations in the data
• A highly complex decision tree tends to have low bias (and high variance), which makes it difficult for the model to generalize to new data

How to improve the Decision Tree model without Overfitting?

Overfitting happens when the learning algorithm continues to develop hypotheses that reduce training set error at the cost of an increased test set error. There are several approaches to avoiding overfitting in building decision trees.

• Pre-pruning: stop growing the tree earlier, before it perfectly classifies the training set.
• Post-pruning: allow the tree to perfectly classify the training set, and then prune it back.

In practice, the second approach (post-pruning overfit trees) is more successful, because it is not easy to estimate precisely when to stop growing the tree.

A common approach for deciding how far to prune is to separate the available data into two sets: a training set, which is used to build the decision tree, and a validation set, which is used to evaluate the impact of pruning the tree.

Post-pruning is also known as backward pruning: first generate the complete decision tree, and then remove non-significant branches. Post-pruning a decision tree means that we begin by generating the (complete) tree and then adjust it with the aim of improving classification accuracy on unseen instances.

Pre-pruning is also called forward pruning or online pruning. Pre-pruning prevents the generation of non-significant branches. Pre-pruning a decision tree involves using a 'termination condition' to decide when it is desirable to terminate some of the branches prematurely as the tree is generated.
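
As a rough sketch of both ideas in scikit-learn, which supports pre-pruning through parameters such as max_depth and min_samples_leaf, and post-pruning through cost-complexity pruning via ccp_alpha (the specific values below are arbitrary examples, not tuned recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early via depth / leaf-size limits (values are arbitrary).
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune via cost-complexity pruning;
# ccp_alpha would normally be chosen using the validation set.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post_pruned.fit(X_train, y_train)

print("Pre-pruned validation accuracy: ", pre_pruned.score(X_val, y_val))
print("Post-pruned validation accuracy:", post_pruned.score(X_val, y_val))
```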

Bagging and Boosting

Bagging and boosting are both ensemble methods. Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree. The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner.

Bagging (Bootstrap Aggregation) is used when our goal is to reduce the variance of a decision tree. The idea is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset is then used to train its own decision tree. As a result, we end up with an ensemble of different models. The average of all the predictions from the different trees is used, which is more robust than a single decision tree.
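
A minimal sketch of bagging with scikit-learn's BaggingClassifier wrapping decision trees (parameter values are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each of the 100 trees is trained on a bootstrap sample (drawn with replacement);
# their predictions are combined by majority vote.
# Note: older scikit-learn versions use base_estimator instead of estimator.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=0,
)
print("Mean CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
```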

Boosting is another ensemble technique for creating a collection of predictors. In this technique, learners are trained sequentially, with early learners fitting simple models to the data and later learners analyzing the data for errors. In other words, we fit consecutive trees (each on a random sample), and at every step the goal is to correct the net error from the prior trees.
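
To make the "fit the next tree to the previous trees' error" idea concrete, here is a toy sketch of sequential boosting for regression, where each new tree is fit to the residuals of the current ensemble (a simplified illustration on synthetic data, not a production boosting implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

learning_rate = 0.1
trees = []
prediction = np.zeros_like(y)

for _ in range(50):
    residual = y - prediction          # error of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)              # next tree focuses on the remaining error
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Training MSE after boosting:", np.mean((y - prediction) ** 2))
```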

Random forest

The random forest algorithm is a supervised classification algorithm. As the name suggests, it creates a forest with a number of trees. Random forest is an ensemble model, specifically a bagging model: each tree is trained on a bootstrap sample and their predictions are combined.
In general, the more trees in the forest, the more robust the model; in the random forest classifier, a higher number of trees tends to give higher accuracy.
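
A minimal sketch with scikit-learn's RandomForestClassifier (the number of trees and the data set are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 200 trees, each grown on a bootstrap sample with a random subset of features per split.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print("Mean CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```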

Advantages of random forest

• The same random forest algorithm (the random forest classifier) can be used for both classification and regression tasks.
• The random forest classifier can handle missing values.
• Adding more trees to the forest does not cause the random forest classifier to overfit the model.
• The random forest classifier can also model categorical values.

AdaBoost and Gradient Boosting

AdaBoost was one of the first boosting algorithms for binary classification. AdaBoost is best used to boost the performance of decision trees on binary classification problems. The algorithm most commonly paired with AdaBoost is a decision tree with one level; because these trees are so short and contain only one decision for classification, they are often called decision stumps. Initially, in the first tree, all instance weights are equal; for each subsequent tree, AdaBoost puts more weight on instances that are difficult to classify and less on those that are already handled well. This is repeated iteratively for a specified number of rounds. The prediction of the final ensemble model is therefore the weighted sum of the predictions made by the individual tree models.
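
A minimal sketch of AdaBoost with decision stumps in scikit-learn (max_depth=1 makes each base learner a stump; the other values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Decision stumps (depth-1 trees) as weak learners; 100 boosting rounds.
# Note: older scikit-learn versions use base_estimator instead of estimator.
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=0)
print("Mean CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```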

Gradient Boosting trains many models in a gradual, additive, and sequential manner. The major difference between AdaBoost and the Gradient Boosting algorithm is how the two algorithms identify the shortcomings of the weak learners (decision trees). While AdaBoost identifies shortcomings by up-weighting hard-to-classify data points, gradient boosting does so by using the gradients of a loss function (for a simple model y = ax + b + e, e is the error term, and it deserves special mention). The loss function is a measure of how well the model's coefficients fit the underlying data. The gradient boosting algorithm uses gradient descent to minimize the loss function, and it is an iterative algorithm.
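
A corresponding sketch with scikit-learn's GradientBoostingClassifier (the hyperparameter values are illustrative assumptions, not tuned recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each new tree is fit to the gradient of the loss with respect to the
# current ensemble's predictions; the learning rate shrinks each step.
gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
print("Mean CV accuracy:", cross_val_score(gbm, X, y, cv=5).mean())
```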