# Fine grained analysis of K- mean clustering and where we are using it

K-means is a centroid based algorithm that means points are grouped in a cluster according to the distance(mostly Euclidean) from centroid.

### Centroid-based Clustering

Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to hierarchical clustering defined below. k-means is the most widely-used centroid-based clustering algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers. This course focuses on k-means because it is an efficient, effective, and simple clustering algorithm.

# Clustering Workflow

1. Prepare data.
2. Create similarity metric.
3. Run clustering algorithm.

This page briefly introduces the steps. We’ll go into depth in subsequent sections.

• in case of clustering Yours rows should be your observations and columns should be your Variables.
• You can’t afford missing values in your dataset if you have too many missing value in any observation remove it like your bad habits and if its few replace with estimated values like you estimate your future salary.
• You must scale your data for easy comparable observation with zero mean and one standard variation.

let’s show you a snippet in R how it is done (remember i always choose simple examples that is not practical)

I will choose USArrests dataset from R to show you this

summary(df)
Murder Assault UrbanPop Rape
Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
Median : 7.250 Median :159.0 Median :66.00 Median :20.10
Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00

summary of the data

summary(df)
Murder Assault UrbanPop Rape
Min. :-1.6044 Min. :-1.5090 Min. :-2.31714 Min. :-1.4874
1st Qu.:-0.8525 1st Qu.:-0.7411 1st Qu.:-0.76271 1st Qu.:-0.6574
Median :-0.1235 Median :-0.1411 Median : 0.03178 Median :-0.1209
Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
3rd Qu.: 0.7949 3rd Qu.: 0.9388 3rd Qu.: 0.84354 3rd Qu.: 0.5277
Max. : 2.2069 Max. : 1.9948 Max. : 1.75892 Max. : 2.6444

preprocessed data

In addition to this you have to make sure that your data can easily let you calculate you the similarity

This preprocessing consists of

Donation

# Supervised Similarity Measure

Instead of comparing manually-combined feature data, you can reduce the feature data to representations calledembeddings, and then compare the embeddings. Embeddings are generated by training a supervised deep neural network (DNN) on the feature data itself. The embeddings map the feature data to a vector in an embedding space. Typically, the embedding space has fewer dimensions than the feature data in a way that captures some latent structure of the feature data set. The embedding vectors for similar examples, such as YouTube videos watched by the same users, end up close together in the embedding space. We will see how the similarity measure uses this “closeness” to quantify the similarity for pairs of examples.

Remember, we’re discussing supervised learning only to create our similarity measure. The similarity measure, whether manual or supervised, is then used by an algorithm to perform unsupervised clustering.

## Comparison of Manual and Supervised Measures

This table describes when to use a manual or supervised similarity measure depending on your requirements.

RequirementManualSupervised
Eliminate redundant information in correlated features.No, you need to separately investigate correlations between features.Yes, DNN eliminates redundant information.
Provide insight into calculated similarities.YesNo, embeddings cannot be deciphered.
Suitable for small datasets with few features.Yes, designing a manual measure with a few features is easy.No, small datasets do not provide enough training data for a DNN.
Suitable for large datasets with many features.No, manually eliminating redundant information from multiple features and then combining them is very difficult.Yes, the DNN automatically eliminates redundant information and combines features.

## Process for Supervised Similarity Measure

The following figure shows how to create a supervised similarity measure:

# Measuring Similarity from Embeddings

You now have embeddings for any pair of examples. A similarity measure takes these embeddings and returns a number measuring their similarity. Remember that embeddings are simply vectors of numbers. To find the similarity between two vectors A=[a1,a2,…,an] and B=[b1,b2,…,bn], you have three similarity measures to choose from, as listed in the table below.

How to work in real life dataset:

## Methods for measuring distances

The choice of distance measures is a critical step in clustering. It defines how the similarity of two elements (x, y) is calculated and it will influence the shape of the clusters.

The classical methods for distance measures are Euclidean and Manhattan distances, which are defined as follow:

1. Euclidean distance:
1. Manhattan distance:

Where, x and y are two vectors of length n.

Other dissimilarity measures exist such as correlation-based distances, which is widely used for gene expression data analyses. Correlation-based distance is defined by subtracting the correlation coefficient from 1. Different types of correlation methods can be used such as:

1. Pearson correlation distance:

Pearson correlation measures the degree of a linear relationship between two profiles.

1. Eisen cosine correlation distance (Eisen et al., 1998):
1. Spearman correlation distance:

The spearman correlation method computes the correlation between the rank of x and the rank of y variables

Where x′i=rank(xi)xi′=rank(xi) and y′i=rank(y)yi′=rank(y).

1. Kendall correlation distance:

Kendall correlation method measures the correspondence between the ranking of x and y variables. The total number of possible pairings of x with y observations is n(n−1)/2n(n−1)/2, where n is the size of x and y. Begin by ordering the pairs by the x values. If x and y are correlated, then they would have the same relative rank orders. Now, for each yiyi, count the number of yj>yiyj>yi (concordant pairs (c)) and the number of yj<yiyj<yi (discordant pairs (d)).

Kendall correlation distance is defined as follow:

Where,

• ncnc: total number of concordant pairs
• ndnd: total number of discordant pairs
• nn: size of x and y

Note that,

• Pearson correlation analysis is the most commonly used method. It is also known as a parametric correlation which depends on the distribution of the data.
• Kendall and Spearman correlations are non-parametric and they are used to perform rank-based correlation analysis.

# silhouette analysis:

Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually. This measure has a range of [-1, 1].

Silhouette coefficients (as these values are referred to as) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters and negative values indicate that those samples might have been assigned to the wrong cluster.

with the example given in github you will get a result like this

## elbow method

We will also understand how to use the elbow method as a way to estimate the value k

The dataset
The dataset we will study refers to clients of a wholesale distributor. It contains information such as clients annual spend on fresh product, milk products, grocery products etc. Below is some more information an each feature:

1. FRESH: annual spending (m.u.) on fresh products (Continuous)
2. MILK: annual spending (m.u.) on milk products (Continuous)
3. GROCERY: annual spending (m.u.) on grocery products (Continuous)
4. FROZEN: annual spending (m.u.) on frozen products (Continuous)
5. DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
6. DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
7. CHANNEL: customer channels – Horeca (Hotel/Restaurant/Cafe) or Retail channel (Nominal)
8. REGION: customer regions – Lisnon, Oporto or Other (Nominal)

The end result of the elbow method on the dataset is this:

final treat for you

• full code on a real world clustering implementation on python
• full code on a real world clustering implementation on R
• full code on a real world clustering implementation on tensorflow framework

github code: