Unsupervised Learning in Python

In this part of Learning Python, we cover unsupervised learning in Python.
Written by Paayi Tech | 09-May-2019

Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.

The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or grouping in data. The clusters are modeled using a measure of similarity which is defined upon metrics such as Euclidean or probabilistic distance.
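As a small illustration of distance as a measure of similarity, the sketch below computes the Euclidean distance between points. The function name `euclidean_distance` and the sample points are made up for this example; they are not taken from any particular library.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Hypothetical 2-D points: p and q lie close together, r lies far away.
p, q, r = (1.0, 2.0), (2.0, 3.0), (9.0, 9.0)
print(euclidean_distance(p, q))  # smaller value: p and q are similar
print(euclidean_distance(p, r))  # larger value: p and r are dissimilar
```

The smaller the distance, the more similar two points are considered to be, which is the notion of similarity that clustering relies on.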

The K-means clustering algorithm is the unsupervised learning algorithm that we are going to discuss here.

 

K-Means Clustering:

K-means clustering is a method of vector quantization. It groups data points by similarity, which is why it is widely used in data mining and performs well in cluster analysis. K-means partitions the data into k clusters, where k is a number chosen in advance. Similarity is measured by distance: points that lie close to one another end up in the same cluster.

 

Description:

K-means is an iterative algorithm: it repeats until the centroids settle into stable positions, and then clusters the data according to those positions. At each iteration, every data point is assigned to the centroid at minimum distance from it, and the points closest to a given centroid form one cluster.

Here the ‘k’ in k-means is the number of centroids, much as the ‘k’ in k-nearest neighbors (KNN) is the number of neighbors. The two algorithms are otherwise unrelated: KNN is a supervised learning algorithm, whereas k-means is unsupervised.

K-means clustering is used when the data has no labels, so prediction is based on the clusters themselves. A new input is assigned to the cluster whose centroid is closest to it.
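A minimal sketch of that prediction step, assuming the centroids have already been found. The helper `assign_to_cluster` and the centroid values are hypothetical, chosen only for illustration.

```python
import math

def assign_to_cluster(point, centroids):
    """Return the index of the centroid nearest to `point`."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids, as if already learned by K-means.
centroids = [(1.0, 1.0), (8.0, 8.0)]
new_point = (7.0, 9.0)
print(assign_to_cluster(new_point, centroids))  # 1, the second centroid is nearer
```

Note that `math.dist` requires Python 3.8 or later.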

 

Mathematics:

Inputs:

Parameter K = number of clusters.

Training set = $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$

where each of the m data points $x^{(i)} \in \mathbb{R}^D$ is a D-dimensional feature vector. We denote the cluster assignment of each data point by $c^{(1)}, \ldots, c^{(m)}$: if data point $x^{(i)}$ belongs to cluster k, we write $c^{(i)} = k$.

 

Initialization of K:

Suppose we have a data set of t-shirt sizes that we want to cluster into small, medium, and large. Here we already know the number of clusters to form, so we set k = 3. Similarly, if we want five clusters (extra small, small, medium, large, and extra large), we set k = 5. Whenever the number of groups is known in advance, we simply give k that value.

 

Random Initialization of Centroid:

Once K is chosen, we do not yet know the best positions for the centroids. We therefore start from randomly initialized centroids.

The algorithm then iterates, moving each centroid toward a suitable position so that clusters can form.

Let the centroids be $\mu_1, \mu_2, \ldots, \mu_K$. Repeat until the assignments stop changing:

for i = 1 to m

    $c^{(i)}$ := index (from 1 to K) of the cluster centroid closest to $x^{(i)}$

for k = 1 to K

    $\mu_k$ := mean of the points assigned to cluster k
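The two loops above can be sketched in plain Python as follows. This is only an illustrative implementation under a few assumptions that do not come from the text: the initial centroids are K randomly chosen data points, the loop stops when no assignment changes or after `max_iters` passes, and an empty cluster simply keeps its previous centroid. The names `k_means` and `mean_point` and the sample data are hypothetical.

```python
import math
import random

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` into `k` groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Random initialization: pick k data points as the starting centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: c(i) := index of the centroid closest to x(i).
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: mu_k := mean of the points assigned to cluster k.
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    return centroids, assignments

# Hypothetical data: two visibly separate groups of three points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(data, k=2)
print(centroids)
print(assignments)
```

In practice a library implementation such as scikit-learn's `KMeans` is usually preferable; the sketch is meant only to make the assignment and update steps concrete.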

 

Optimization Objective:

The optimization objective (cost function) is

$$J = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

where $x^{(i)}$ is a training example and $\mu_{c^{(i)}}$ is the location of the centroid of the cluster to which $x^{(i)}$ is assigned.

The objective uses the squared Euclidean distance from each feature vector to its centroid.
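To make the objective concrete, here is a short sketch that evaluates it for a given set of points, centroids, and assignments. The function name `kmeans_cost` and the numbers are hypothetical; assignments are 0-based indices into the centroid list.

```python
import math

def kmeans_cost(points, centroids, assignments):
    """Mean squared distance from each point to its assigned centroid."""
    total = sum(
        math.dist(p, centroids[c]) ** 2
        for p, c in zip(points, assignments)
    )
    return total / len(points)

# Hypothetical data: two points per cluster, centroids at the cluster means.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centroids = [(1.0, 0.0), (11.0, 0.0)]
assignments = [0, 0, 1, 1]
print(kmeans_cost(points, centroids, assignments))  # 1.0
```

Every point here lies at distance 1 from its centroid, so the mean of the squared distances is 1.0. A worse assignment, such as `[0, 1, 0, 1]`, gives a much larger cost.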

By use of the Euclidean distance, K-means treats the data space as isotropic (distances unchanged by translations and rotations). This means that data points in each cluster are modeled as lying within a sphere around the cluster centroid. A sphere has the same radius in each dimension.

 

Random Initialization of K:

Suppose we have a large dataset and do not know what the value of K should be. Without a value for K we cannot specify the centroids, and so the centroids cannot optimize their locations. In that case, we run the algorithm for each candidate value of K from 1 to n; each run optimizes the positions of its centroids. The best K is then deduced by computing the cost function for every run and comparing the results. Because the cost generally keeps decreasing as K grows, we look for the K beyond which it stops improving noticeably rather than simply taking the lowest cost.

One method used to choose K in this way is the elbow method.

 

Elbow Method:

The elbow method is a way of finding and visualizing how many clusters are needed for a good model. It helps find an appropriate k for a dataset, which is otherwise very difficult to judge by eye. First, we choose a range in which we expect the number of clusters to lie. For each k in that range, we run K-means and compute the mean squared error (the within-cluster variance). Plotted against k, the error drops steeply at first and then flattens into a nearly straight line. The turning point is the elbow, and its position on the x-axis tells us how many clusters to use.
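A rough sketch of the procedure, printing the cost for each k instead of plotting it. Several things here are assumptions for illustration only: the compact `k_means` helper, the use of a few random restarts per k (keeping the lowest cost), and the made-up data of three tight groups.

```python
import math
import random

def k_means(points, k, max_iters=100, seed=0):
    """Plain K-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    assignments = None
    for _ in range(max_iters):
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments

def cost(points, centroids, assignments):
    """Mean squared distance of each point to its assigned centroid."""
    return sum(
        math.dist(p, centroids[c]) ** 2 for p, c in zip(points, assignments)
    ) / len(points)

def elbow_costs(points, k_values, restarts=10):
    """Lowest cost found for each k over several random restarts."""
    return {
        k: min(cost(points, *k_means(points, k, seed=s)) for s in range(restarts))
        for k in k_values
    }

# Hypothetical data: three tight groups of five points each.
offsets = [(0.0, 0.0), (0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
data = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

costs = elbow_costs(data, range(1, 7))
for k, c in costs.items():
    print(f"K={k}: cost={c:.3f}")
```

With well-separated groups like these, one would expect the printed cost to fall sharply up to K = 3 and to change little afterwards, which is the elbow. Plotting `costs` with a charting library would show the same shape visually.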

 

Uses of K-Means Clustering:

  • Document Classification
  • Links Classification
  • Images Classification
  • Call Record Data Analysis

 

Conclusion:

With K-means clustering, we can cluster data that has no labels, which a supervised model cannot do. K-means is widely used today. Google News, for example, is said to use K-means clustering to group similar news stories whenever a user searches for a keyword. The algorithm is useful when we have no prior knowledge about the data beyond the data itself.








© Copyright 2019, All Rights Reserved. paayi.com
