K-means clustering and its real use case in the security domain

Mayank Sharma 191_27
5 min read · Jul 19, 2021

The activity of Internet users increases from year to year and has affected the behavior of the users themselves. Assessment of user behavior is often based only on interactions across the Internet, without knowledge of any other activities. Activity logs offer another way to study user behavior. Internet activity logs are a type of big data, so data mining with the K-means technique can be used to analyze user behavior. In the study described here, clustering with the K-means algorithm divided the data into three clusters: high, medium, and low. The results from a higher education institution show that, in each of these clusters, the most frequented websites were, in order: search engines, social media, news, and information. The study also showed that cyber profiling is strongly influenced by environmental factors and daily activities.

K-means Clustering

K-means is a centroid-based, or distance-based, algorithm: we calculate distances to assign each point to a cluster. In K-means, each cluster is associated with a centroid.
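
As a minimal sketch of this distance-based assignment, assuming Euclidean distance and made-up 2-D points and centroids (the function name `nearest_centroid` is illustrative and not taken from any library):

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids and points, chosen only for illustration.
centroids = [(1.0, 1.0), (8.0, 8.0)]
points = [(0.5, 1.5), (9.0, 7.0), (2.0, 0.0)]

# Each point receives the index of its nearest centroid.
labels = [nearest_centroid(p, centroids) for p in points]
print(labels)  # [0, 1, 0]
```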

How to choose the value of K (the number of clusters) in K-means clustering?

The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, and choosing the optimal number of clusters is a difficult task. There are several ways to find the optimal number of clusters; here we discuss one of the most widely used methods for finding the value of K:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of clusters. It is based on the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters. The formula for WCSS (for 3 clusters) is given below:

$$\mathrm{WCSS} = \sum_{P_i \in \mathrm{Cluster}_1} \mathrm{distance}(P_i, C_1)^2 + \sum_{P_i \in \mathrm{Cluster}_2} \mathrm{distance}(P_i, C_2)^2 + \sum_{P_i \in \mathrm{Cluster}_3} \mathrm{distance}(P_i, C_3)^2$$

In the above formula of WCSS,

$\sum_{P_i \in \mathrm{Cluster}_1} \mathrm{distance}(P_i, C_1)^2$ is the sum of the squared distances between each data point in Cluster 1 and its centroid $C_1$; the other two terms are the same for Clusters 2 and 3.

To measure the distance between a data point and its centroid, we can use a metric such as Euclidean distance or Manhattan distance. Standard K-means uses the Euclidean distance.
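
The WCSS formula above can be computed directly. The sketch below assumes three small, made-up clusters whose centroids are the cluster means; the helper names `euclidean`, `manhattan`, and `wcss` are illustrative only.

```python
import math

def euclidean(p, q):
    return math.dist(p, q)

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def wcss(clusters, centroids, distance=euclidean):
    """Sum, over all clusters, of the squared distance from each point to its centroid."""
    total = 0.0
    for points, centroid in zip(clusters, centroids):
        total += sum(distance(p, centroid) ** 2 for p in points)
    return total

# Hypothetical data: three clusters and their centroids (the mean of each cluster).
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(10.0, 10.0), (10.0, 12.0)],
    [(20.0, 0.0), (22.0, 2.0)],
]
centroids = [(1.0, 0.0), (10.0, 11.0), (21.0, 1.0)]

wcss_euclidean = wcss(clusters, centroids)             # uses Euclidean distance
wcss_manhattan = wcss(clusters, centroids, manhattan)  # uses Manhattan distance
print(wcss_euclidean, wcss_manhattan)
```

The two metrics give different totals for the same clustering, so WCSS values are only comparable when the same distance is used throughout.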

To find the optimal value of clusters, the elbow method follows the below steps:

  • Run K-means clustering on the given dataset for different values of K (for example, from 1 to 10).
  • For each value of K, calculate the WCSS value.
  • Plot a curve of the calculated WCSS values against the number of clusters K.
  • The point where the curve bends sharply, like the elbow of an arm, is considered the best value of K.
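
The steps above can be sketched in plain Python. This is only an illustration under several assumptions that are not from the article: the data are three synthetic, well-separated blobs; the centroids are initialized with a deterministic farthest-point rule (instead of random initialization) so the run is reproducible; and the plot is replaced by a simple numeric heuristic that picks the K with the largest second difference of the WCSS curve.

```python
import math
import random

def farthest_first(points, k):
    """Assumed deterministic initialization: start from the first point, then
    repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans_wcss(points, k, max_iter=100):
    """Run a basic K-means and return the final WCSS (squared Euclidean distance)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def elbow(ks, wcss):
    """Heuristic (an assumption): the K where the curve bends most sharply,
    measured by the second difference of the WCSS values."""
    bends = {
        ks[i]: wcss[i - 1] - 2 * wcss[i] + wcss[i + 1]
        for i in range(1, len(ks) - 1)
    }
    return max(bends, key=bends.get)

# Synthetic data: 30 points around each of three made-up blob centers.
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (20.0, 0.0), (10.0, 18.0)]
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in blob_centers
    for _ in range(30)
]

ks = list(range(1, 11))                      # step 1: K from 1 to 10
wcss = [kmeans_wcss(points, k) for k in ks]  # step 2: WCSS for each K
for k, value in zip(ks, wcss):               # step 3: print instead of plotting
    print(k, round(value, 1))
best_k = elbow(ks, wcss)                     # step 4: locate the bend
print("elbow at K =", best_k)
```

For data like this, with three clearly separated groups, the bend is expected at K = 3. On real data the elbow is often less pronounced, and inspecting the plotted curve by eye remains the usual approach.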

Because the graph shows a sharp bend that looks like an elbow, this is known as the elbow method. The graph for the elbow method looks like the image below:

[Image: WCSS plotted against the number of clusters K]

Use the following steps for cluster analysis of crime data:

  • Sort the records: the first sort is done on the most important characteristics, based on the detective’s experience.
  • Use data mining to detect more complex patterns, since in real life many attributes are associated with a crime and often only partial information is available.
  • Identify the significant attributes for clustering.
  • Place different weights on different attributes dynamically, based on the crime types being clustered.
  • Cluster the dataset for crime patterns and present the results to the detective or the domain expert along with the statistics of the important attributes.
  • The detective looks at the clusters and gives recommendations.
  • Unsolved crimes are clustered based on significant attributes and the result is given to detectives for inspection.

In this article, we will use the K-means approach for generating clusters. The K-means algorithm consists of the following steps:

  • Decide the number of clusters, K. K-means cluster analysis requires you to know how many clusters to generate before the algorithm starts.
  • Initialize the K cluster centers, for example by generating them randomly. Different starting points may yield different results.
  • Assign each observation to the nearest cluster center. This is an iterative technique that builds the clusters as it progresses.
  • Re-compute the cluster centers. Note that you need to specify how the distance between an observation and a cluster center is measured.
  • Repeat the assignment and re-computation steps until no observation changed its cluster membership in the last iteration.

An example of the K-means cluster analysis is shown in the figure below. In this example, we show the creation of 3 clusters (each in a different color).

Analyzing patterns and drawing conclusions: this involves the analysis of each cluster formed. The computer cannot understand what is unique about each cluster; this is where human expertise comes into play. For example, all the crimes shown in red may have been committed using a similar gun, and all the crimes shown in blue may be jewelry thefts in which the victims were walking on the road and the assailants were traveling on a motorbike. This helps to find crime patterns and trend correlations. Once a specific pattern is detected, law enforcement officers can deploy additional, suitable resources for the detection and suppression of criminal activities.
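
The K-means steps listed earlier (decide K, initialize, assign, re-compute, repeat) can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in any particular crime-analysis system: the records are made-up numeric attribute pairs, the per-attribute `weights` are fixed by the caller as a simplified stand-in for the dynamic weighting mentioned above, and the names `kmeans` and `weighted_sq_dist` are hypothetical.

```python
import random

def weighted_sq_dist(a, b, weights):
    """Squared Euclidean distance with one weight per attribute."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))

def kmeans(observations, k, weights=None, seed=0, max_iter=100):
    # Step 1: the number of clusters K is decided by the caller.
    weights = weights or [1.0] * len(observations[0])
    rng = random.Random(seed)
    # Step 2: initialize the K cluster centers by picking K observations at random.
    centers = rng.sample(observations, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each observation to the nearest cluster center.
        new_labels = [
            min(range(k), key=lambda j: weighted_sq_dist(obs, centers[j], weights))
            for obs in observations
        ]
        # Step 5: stop once no observation changed its membership.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: re-compute each center as the mean of its members.
        for j in range(k):
            members = [o for o, lab in zip(observations, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical records: two numeric attributes per incident (e.g. map coordinates in km).
incidents = [
    (1.0, 2.0), (2.0, 1.5), (1.5, 1.0),
    (22.0, 9.0), (23.0, 8.5), (21.5, 10.0),
    (12.0, 30.0), (13.0, 29.0), (11.5, 31.0),
]
weights = [1.0, 1.0]  # equal importance for both attributes (an assumption)

labels, centers = kmeans(incidents, k=3, weights=weights)
for incident, label in zip(incidents, labels):
    print(label, incident)
```

Because the starting centers are random, a different `seed` may produce a different grouping, which is the sensitivity to starting points noted in the steps. Interpreting what each resulting group means is still left to the detective or domain expert.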

Thanks for reading :)
