Kmeans_python package

Submodules

Kmeans_python.cluster_summary module

Kmeans_python.cluster_summary.cluster_summary(X, centroids, cluster_assignments)

Provides a summary of the groups created by k-means clustering, including centroid coordinates, the number of training data points assigned to each cluster, and within-cluster distance metrics.

Parameters:
  • X (array-like, shape=(n_samples, n_features)) – data on which Kmeans was fit
  • centroids (numpy.ndarray) – N-dimensional array containing cluster center locations
  • cluster_assignments (array-like) – clusters assigned to each data point in training set
Returns:

data frame displaying, for each cluster: centroid coordinates, number of data points in training data assigned to each cluster, within-cluster distance metrics

Return type:

pandas.DataFrame

Examples

>>> from Kmeans_python.fit import fit
>>> from Kmeans_python.cluster_summary import cluster_summary
>>> import numpy as np
>>> import pandas as pd
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> centers, cluster_ass = fit(X, 2)
>>> cluster_summary(X, centers, cluster_ass)
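
The kind of table described above can be sketched with NumPy and pandas. This is a minimal illustration, not the package's implementation: the helper name summarize_clusters, the column names, and the choice of mean and maximum distance as the "within-cluster distance metrics" are all assumptions, and labels are assumed to be integers that index rows of centroids.

```python
import numpy as np
import pandas as pd


def summarize_clusters(X, centroids, cluster_assignments):
    """Hypothetical helper: one row per cluster with centroid, size, distances."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(cluster_assignments)

    # Distance from every sample to the centroid of its own cluster.
    dist_to_own = np.linalg.norm(X - centroids[labels], axis=1)

    rows = []
    for j, center in enumerate(centroids):
        in_cluster = dist_to_own[labels == j]
        row = {"cluster": j}
        # One column per centroid coordinate (assumed naming scheme).
        row.update({f"centroid_{d}": c for d, c in enumerate(center)})
        row["n_points"] = int(in_cluster.size)
        # Assumed metrics: mean and maximum distance to the centroid.
        row["mean_distance"] = float(in_cluster.mean()) if in_cluster.size else float("nan")
        row["max_distance"] = float(in_cluster.max()) if in_cluster.size else float("nan")
        rows.append(row)
    return pd.DataFrame(rows).set_index("cluster")


X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
centroids = np.array([[1.0, 2.0], [10.0, 2.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
summary = summarize_clusters(X, centroids, labels)
```

Each row of summary describes one cluster; the real cluster_summary may report different columns or metrics.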

Kmeans_python.elbow module

Kmeans_python.elbow.elbow(X, centers_list)

Creates a plot of inertia against the number of cluster centers, as used in the elbow method. Calculates and returns the inertia value for each number of cluster centers in centers_list. Useful for identifying the optimal number of clusters when using the k-means clustering algorithm.

Parameters:
  • X (array-like, shape=(n_samples, n_features)) – Input data that is to be clustered.
  • centers_list (list or 1-d array-like) – A list of all possible numbers of cluster centers
Returns:

A tuple containing an Altair plot object (a line plot of k, the number of cluster centers, against inertia) and the inertia values for all k.

Return type:

tuple

Examples

>>> from Kmeans_python.elbow import elbow
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> centers = [2, 3, 4, 5]
>>> elbow(X, centers)
(alt.Chart(...),
 [2.8284271247461903, 2.8284271247461903, 1.4142135623730951, 0.0])
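
For reference, the quantity plotted in an elbow chart can be sketched as follows. This assumes the conventional definition of inertia (the sum of squared distances from each sample to its assigned center); the package's exact definition is not stated here, so its values may differ. The helper name inertia is hypothetical.

```python
import numpy as np


def inertia(X, centers, labels):
    """Hypothetical helper: sum of squared distances to the assigned center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # centers[labels] picks, for every sample, the center it is assigned to.
    return float(((X - centers[labels]) ** 2).sum())


X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
centers = np.array([[1.0, 2.0], [10.0, 2.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
value = inertia(X, centers, labels)
```

Computing this value for each candidate k and plotting it against k gives the curve whose bend suggests a reasonable number of clusters.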

Kmeans_python.fit module

Kmeans_python.fit.compute_distance(samples, centers)

Computes the Euclidean distance from each sample to every cluster center.

Parameters:
  • samples (numpy.ndarray) –
    • all the data points in the sample
  • centers (numpy.ndarray) –
    • the centroids of the clusters already selected
Returns:

an array with all the samples and their distances from each of the cluster centers

Return type:

numpy.ndarray
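
A sample-to-center distance computation of this kind can be sketched with NumPy broadcasting. The helper name pairwise_euclidean and the assumed output layout (one row per sample, one column per center) are illustrative and may not match what compute_distance actually returns.

```python
import numpy as np


def pairwise_euclidean(samples, centers):
    """Hypothetical helper: distances with shape (n_samples, n_centers)."""
    samples = np.asarray(samples, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Broadcast to shape (n_samples, n_centers, n_features), then reduce.
    diff = samples[:, np.newaxis, :] - centers[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


samples = np.array([[1, 2], [10, 2]])
centers = np.array([[1, 2], [10, 2]])
distances = pairwise_euclidean(samples, centers)
```

Here distances[i, j] is the distance from sample i to center j.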

Kmeans_python.fit.fit(X_train, k, n_init=10, max_iter=200)

Groups unlabeled data into a given number of clusters k using a simple k-means algorithm. Returns the cluster centers and a label for each data point indicating the cluster it belongs to. K-means is an unsupervised learning method.

Parameters:
  • X_train (numpy.ndarray or pandas.DataFrame, shape=(n_samples, n_features)) –
    • Input data that is to be clustered with features in the columns and samples in rows
  • k (int) –
    • The number of clusters we need.
Returns:

  • list – A list of the centers of each cluster.
  • list – A list of labels for cluster assignment for all samples in the data

Examples

>>> from Kmeans_python.fit import fit
>>> import numpy as np
>>> import pandas as pd
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> centers, labels = fit(X, 2)
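
The core of a simple k-means procedure can be sketched as a single run of Lloyd's algorithm from given starting centers. This is an illustration under stated assumptions, not the package's code: the helper name lloyd_kmeans and its init_centers argument are hypothetical, and how fit chooses starting centers or uses n_init is not documented here.

```python
import numpy as np


def lloyd_kmeans(X, init_centers, max_iter=200):
    """Hypothetical helper: one Lloyd's-algorithm run from init_centers."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()

    def nearest(c):
        # Index of the closest center for every sample.
        dists = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(max_iter):
        labels = nearest(centers)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return centers, nearest(centers)


X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
centers, labels = lloyd_kmeans(X, init_centers=X[[0, 3]])
```

A single run can settle in a poor local optimum for unlucky starting centers, which is the usual reason k-means implementations repeat the run from several starts and keep the best result.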

Kmeans_python.predict module

Kmeans_python.predict.predict(X_new, centroids)

Assigns new data points to clusters based on closest centroid.

Parameters:
  • X_new (array-like, shape=(n_samples, n_features)) – New data to assign to clusters
  • centroids (numpy.ndarray) – array containing cluster center locations
Returns:

assigned clusters for each point in X_new

Return type:

numpy.ndarray, shape=(n_samples,)

Examples

>>> from Kmeans_python.fit import fit
>>> from Kmeans_python.predict import predict
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> centers, cluster_ass = fit(X, 2)
>>> X_test = np.array([[1, 0], [2, 4], [8, 1],
...                    [9, 3], [8, 8], [0, 0]])
>>> predict(X_test, centers)
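
Nearest-centroid assignment can be sketched as below. The helper name assign_to_nearest is hypothetical, and the centroids are written out by hand rather than taken from fit, so the label numbering here is an assumption and may differ from what the package returns.

```python
import numpy as np


def assign_to_nearest(X_new, centroids):
    """Hypothetical helper: index of the closest centroid for each row."""
    X_new = np.asarray(X_new, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Distances with shape (n_samples, n_centroids).
    dists = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


centroids = np.array([[1.0, 2.0], [10.0, 2.0]])
X_test = np.array([[1, 0], [2, 4], [8, 1],
                   [9, 3], [8, 8], [0, 0]])
assigned = assign_to_nearest(X_test, centroids)
# assigned -> array([0, 0, 1, 1, 1, 0])
```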

Kmeans_python.silhouette module

Kmeans_python.silhouette.sil_score(X, labels)

Returns the average silhouette score over all samples in a given 2-d array, based on their clustering labels.

Parameters:
  • X (2-d array, shape=(n_samples, n_features)) –
    • The data to be clustered.
  • labels (array) –
    • An array of all the labels.
Returns:

  • The average silhouette score of all points

Return type:

float

Examples

>>> from Kmeans_python.silhouette import sil_score
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> labels = np.array([0, 0, 0, 1, 1, 1])
>>> sil_score(X, labels)
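
The average silhouette score can be sketched from its standard definition: for each sample, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the score is (b - a) / max(a, b). The helper name mean_silhouette is hypothetical, Euclidean distance is assumed, and at least two clusters are required.

```python
import numpy as np


def mean_silhouette(X, labels):
    """Hypothetical helper: mean silhouette score with Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix, shape (n_samples, n_samples).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        # Mean distance to own cluster, excluding the sample itself.
        a = dists[i, same].sum() / (n_same - 1)
        # Smallest mean distance to any other cluster.
        b = min(dists[i, labels == other].mean()
                for other in unique if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())


X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
labels = np.array([0, 0, 0, 1, 1, 1])
score = mean_silhouette(X, labels)
```

Scores close to 1 indicate compact, well-separated clusters; for this data the sketch gives roughly 0.713.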

Kmeans_python.silhouette.silhouette(X, k_array)

Plots a graph of silhouette scores for each k value in the given array using fit. Returns a list of each k value in k_array paired with its corresponding silhouette score.

Parameters:
  • X (2-d array, shape=(n_samples, n_features)) –
    • The data to be clustered.
  • k_array (array) –
    • An array of all contending k values.
Returns:

  • 1-d array
    • An array containing silhouette scores in the same order as k_array.
  • Altair chart object
    • An Altair chart displaying silhouette scores with their corresponding k values.

Examples

>>> from Kmeans_python.silhouette import silhouette
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> k_array = [2, 3, 4, 5]
>>> silhouette(X, k_array)

Module contents