In the last section, we discussed the k-nearest neighbors (KNN) algorithm and where it is useful. Now we will see how to implement KNN in Python in practice. For this, we are going to use scikit-learn (Sklearn), a widely used Python library for machine learning.
So without wasting any time, let's dig into the code.
First of all, we import the modules we need. The first is the datasets module. Sklearn ships with several built-in datasets, which makes it easy for a new learner to get started with machine learning. The second is train_test_split, which we use to split our data into training and test sets. The advantage is that we do not have to write this step by hand, and it can also shuffle the data before splitting, so the order of the rows does not bias the training and test sets.
The third is KNeighborsClassifier, the KNN classifier we are going to use. Then we import Matplotlib for plotting the data, and finally the metrics module for calculating the accuracy of the model.
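The imports described above might look like the following. This is a minimal sketch that assumes a recent scikit-learn release and Matplotlib are installed; the exact import lines of the original tutorial are not shown in the text.

```python
# Imports for the KNN walkthrough (assumes scikit-learn and Matplotlib are installed).
from sklearn import datasets                          # built-in example datasets
from sklearn.model_selection import train_test_split  # shuffled train/test split
from sklearn.neighbors import KNeighborsClassifier    # the KNN classifier
from sklearn import metrics                           # accuracy and other scores
import matplotlib.pyplot as plt                       # plotting
```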
Then we load the iris dataset. The iris dataset contains measurements of different types of iris flowers. We feed these measurements into the ML algorithm so that, given the attributes of a new flower, the model can tell us which type of flower it is. There are three classes: 'setosa', 'versicolor', and 'virginica'. The class names are stored in the target_names field of the dataset.
Now we separate the features from the labels. The measurements are in the data field, and the class label of each row is in the target field.
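As a small sketch of this step, the dataset can be loaded with `datasets.load_iris()` and the class names read from its `target_names` attribute:

```python
from sklearn import datasets

# Load the built-in iris dataset.
iris = datasets.load_iris()

# Names of the three classes the model will learn to distinguish.
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']

# Names of the four measurements recorded for each flower.
print(iris.feature_names)
```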
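A short sketch of the separation step follows. The variable names `X` and `y` are a common convention chosen here for illustration, not names taken from the original code.

```python
from sklearn import datasets

iris = datasets.load_iris()

X = iris.data     # feature matrix: one row per flower, one column per measurement
y = iris.target   # integer class label (0, 1, or 2) for each row

print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```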
We then split the data. Now the question arises: how many neighbors K are good enough? There are two approaches to selecting K for our model. The first is to take the square root of the number of rows in the training data. Here there are approximately 110 rows of training data, and the square root of 110 is approximately 10. The second is trial and error: we try a range of values for K and compare the results.
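The split and the square-root rule can be sketched as below. The split ratio is not stated in the text, so this example assumes 40 of the 150 rows are held out for testing, which leaves 110 training rows to match the figure quoted above. The `random_state` value is also an arbitrary choice made here for reproducibility.

```python
import math

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Assumption: hold out 40 rows for testing, leaving 110 rows for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=40, shuffle=True, random_state=42
)

# Square-root rule of thumb: K is roughly sqrt(number of training rows).
k_guess = round(math.sqrt(len(X_train)))
print(len(X_train), k_guess)  # 110 10
```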
We let K range from 1 to 40, train a model for each value, measure its accuracy on the test data, and then plot the accuracy against K.
In the plot, the first high peak is near K = 10, which agrees with the square-root estimate, so we will use K = 10.
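The trial-and-error search might be written as follows. This is a sketch under assumptions: the 110/40 split and `random_state` are choices made here, K runs from 1 to 40 inclusive, and the helper `plot_accuracy` is a hypothetical name introduced for this example.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
# Assumption: 40 test rows, so 110 rows remain for training.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=40, random_state=42
)

k_values = list(range(1, 41))  # K = 1, 2, ..., 40
accuracies = []
for k in k_values:
    # Train a fresh classifier for each K and score it on the test set.
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies.append(metrics.accuracy_score(y_test, y_pred))


def plot_accuracy(k_values, accuracies):
    """Plot test accuracy against K (hypothetical helper; needs Matplotlib)."""
    import matplotlib.pyplot as plt

    plt.plot(k_values, accuracies, marker="o")
    plt.xlabel("Number of neighbors K")
    plt.ylabel("Test accuracy")
    plt.title("KNN accuracy on iris for K = 1 to 40")
    plt.show()
```

Calling `plot_accuracy(k_values, accuracies)` draws the curve. The exact shape of the curve depends on how the data was split, so the location of the peak can vary from run to run unless the random seed is fixed.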
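Training the final model with K = 10 and measuring its accuracy can be sketched as follows. The split and `random_state` are again assumptions made for this example, so the printed accuracy may differ from the figure reported in the text.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
# Assumption: 40 test rows, so 110 rows remain for training.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=40, random_state=42
)

# Fit KNN with the chosen number of neighbors.
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)

# Predict the held-out flowers and compare against the true labels.
y_pred = knn.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```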
With K = 10, the model reaches an accuracy of about 97% on the test data, which is pretty good.
Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Learned trees can also be represented as sets of if-then rules to improve human readability. These learning methods are among the most popular inductive inference algorithms and have been successfully applied to a broad range of tasks, from learning to diagnose medical cases to learning to assess the credit risk of loan applicants.
Decision Tree Representation:
Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node.
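The classification procedure described above can be sketched in a few lines. The tree below is a hypothetical weather example written by hand for illustration. Internal nodes are dictionaries naming the attribute to test, and leaves are plain class labels.

```python
# Hypothetical hand-written tree: internal nodes are dicts, leaves are class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rain": {
            "attribute": "wind",
            "branches": {"strong": "no", "weak": "yes"},
        },
    },
}


def classify(node, instance):
    """Sort an instance down the tree from the root to a leaf."""
    if not isinstance(node, dict):
        return node  # reached a leaf: this is the classification
    # Test the attribute named by this node and follow the matching branch.
    value = instance[node["attribute"]]
    # Repeat the process for the subtree rooted at the new node.
    return classify(node["branches"][value], instance)


example = {"outlook": "sunny", "humidity": "normal", "wind": "weak"}
print(classify(tree, example))  # yes
```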
Why the Decision Tree is Called Inductive Learning:
In a decision tree, we make a series of Boolean decisions and follow the corresponding branch. For example:
- Did we leave at 10 AM?
- Did a car stall on the road?
- Is there an accident on the road?
By answering each of these yes/no questions, we reach a conclusion about how long our commute might take.
Appropriate Problems for Decision Tree Algorithm:
A decision tree can be applied to a number of problems, depending on the type of data we have. Decision trees are best suited to problems with the characteristics listed below:
- Instances are represented by attribute-value pairs.
- The target function has discrete output values.
- Disjunctive descriptions may be required.
- Training data may contain errors or missing values.
How Does a Tree Decide Where To Split:
The following methods can be used to decide where to split the data and which attribute becomes the root node of the tree. The split is based on the impurity of the data set, that is, how homogeneous or heterogeneous the classes are for a given feature. A set is said to be pure if it contains only one class, and impure if it contains multiple classes.
The methods are the following:
- Gini Index: It is a measure of impurity used in building a decision tree. For a set D containing P positive and N negative examples, the Gini index is:
$$\mathrm{gini}(D) = 1 - \left[ \left( \frac{P}{P+N} \right)^2 + \left( \frac{N}{P+N} \right)^2 \right]$$
- Information Gain: The information gain is the decrease in entropy after a dataset is split based on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain.
- Reduction in Variance: It is a criterion used when the target variable is continuous (regression problems). The split with the lower variance is selected as the criterion to split the population.
- Chi-Square: It is an algorithm to find out the statistical significance of the differences between sub-nodes and the parent node.
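To make the first three criteria concrete, here is a minimal pure-Python sketch. It assumes the usual definitions: the Gini index is one minus the sum of squared class proportions, entropy is measured in bits, and a split is scored by comparing the parent set with the size-weighted average over its child groups. The function names and the toy data are illustrative and do not come from any particular library.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def weighted(measure, groups):
    """Average a measure over the child groups, weighted by group size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * measure(g) for g in groups)


def information_gain(parent, groups):
    """Decrease in entropy after splitting parent into groups."""
    return entropy(parent) - weighted(entropy, groups)


def variance_reduction(parent, groups):
    """Decrease in variance after splitting parent into groups."""
    return variance(parent) - weighted(variance, groups)


# Toy classification data: a perfect split versus an uninformative one.
parent = ["yes", "yes", "yes", "no", "no", "no"]
perfect = [["yes", "yes", "yes"], ["no", "no", "no"]]
mixed = [["yes", "no", "yes"], ["no", "yes", "no"]]

print(gini(parent))                        # 0.5
print(information_gain(parent, perfect))   # 1.0
print(information_gain(parent, mixed))     # small positive value

# Toy regression data: splitting low values from high values.
targets = [1, 2, 3, 10, 11, 12]
print(variance_reduction(targets, [[1, 2, 3], [10, 11, 12]]))
```

The perfect split produces pure child groups and therefore the largest information gain, while the mixed split barely reduces the entropy. A tree builder would evaluate each candidate attribute this way and choose the one with the best score.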