Kmeans_python package
Submodules
Kmeans_python.cluster_summary module
Kmeans_python.cluster_summary.cluster_summary(X, centroids, cluster_assignments)
Provides a summary of the groups created by Kmeans clustering, including centroid coordinates, the number of training data points assigned to each cluster, and within-cluster distance metrics.
Parameters: - X (array-like, shape=(n_samples, n_features)) – data on which Kmeans was fit
- centroids (numpy.ndarray) – N-dimensional array containing cluster center locations
- cluster_assignments (array-like) – clusters assigned to each data point in training set
Returns: data frame displaying, for each cluster: centroid coordinates, number of data points in training data assigned to each cluster, within-cluster distance metrics
Return type: pandas.DataFrame
Examples
>>> from Kmeans_python.fit import fit
>>> from Kmeans_python.cluster_summary import cluster_summary
>>> import numpy as np
>>> import pandas as pd
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> centers, cluster_ass = fit(X, 2)
>>> cluster_summary(X, centers, cluster_ass)
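The kind of per-cluster summary described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name `summarize_clusters`, the column names, and the choice of mean and maximum distance as the "within-cluster distance metrics" are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def summarize_clusters(X, centroids, cluster_assignments):
    """Hypothetical stand-in for cluster_summary; not the package's code."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(cluster_assignments)
    rows = []
    for c, center in enumerate(centroids):
        members = X[labels == c]
        # Euclidean distance of every member to its own centroid
        dists = np.linalg.norm(members - center, axis=1)
        rows.append({
            "cluster": c,
            "centroid": tuple(center),
            "n_points": len(members),
            # Assumed metrics: mean and max distance to the centroid
            "mean_distance": dists.mean() if len(members) else np.nan,
            "max_distance": dists.max() if len(members) else np.nan,
        })
    return pd.DataFrame(rows)


X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
centroids = np.array([[1.0, 2.0], [10.0, 2.0]])  # assumed centers for the demo
labels = np.array([0, 0, 0, 1, 1, 1])
summary = summarize_clusters(X, centroids, labels)
```

The result has one row per cluster, which matches the shape of output the Returns entry describes.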
Kmeans_python.elbow module
Kmeans_python.elbow.elbow(X, centers_list)
Creates a plot of inertia against the number of cluster centers, as used in the elbow method. Calculates and returns the inertia value for each number of cluster centers in centers_list. Useful for identifying the optimal number of clusters when using the k-means clustering algorithm.
Parameters: - X (array-like, shape=(n_samples, n_features)) – Input data that is to be clustered.
- centers_list (list or 1-d array-like) – The candidate numbers of cluster centers to evaluate
Returns: A tuple containing an Altair chart object (a line plot of inertia against k, the number of cluster centers) and the inertia values for all k.
Return type: tuple
Examples
>>> from Kmeans_python.elbow import elbow
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> centers = [2, 3, 4, 5]
>>> elbow(X, centers)
(alt.Chart(...), [2.8284271247461903, 2.8284271247461903, 1.4142135623730951, 0.0])
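For orientation, inertia is commonly defined as the sum of squared distances from each sample to its assigned cluster center. The sketch below computes that common definition for one fixed clustering. Whether elbow uses exactly this formula is not stated on this page, so treat the helper `inertia` (a hypothetical name) as an illustration of the concept and not as a reproduction of the package's values.

```python
import numpy as np


def inertia(X, centers, labels):
    """Sum of squared distances from each sample to its assigned center.

    Hypothetical helper using the common textbook definition.
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # centers[labels] picks, for every sample, the center it is assigned to
    diffs = X - centers[labels]
    return float((diffs ** 2).sum())


X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
centers = np.array([[1.0, 2.0], [10.0, 2.0]])  # assumed centers for k = 2
labels = np.array([0, 0, 0, 1, 1, 1])
value = inertia(X, centers, labels)
```

In the elbow method this quantity is evaluated once per candidate k, and the k at which the curve stops dropping sharply is taken as a reasonable choice.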
Kmeans_python.fit module
Kmeans_python.fit.compute_distance(samples, centers)
Computes the Euclidean distance of each sample from every cluster center.
Parameters: - samples (numpy.ndarray) – all the data points in the sample
- centers (numpy.ndarray) – the centroids of the clusters already selected
Returns: an array with all the samples and their distances from each of the cluster centers
Return type: numpy.ndarray
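A sample-to-center distance computation of this kind can be sketched with NumPy broadcasting. The helper name `pairwise_euclidean` and the output layout, shape (n_samples, n_centers), are assumptions for this example; the exact layout of the array returned by compute_distance is not specified above.

```python
import numpy as np


def pairwise_euclidean(samples, centers):
    """Hypothetical helper: distance from every sample to every center."""
    samples = np.asarray(samples, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # (n_samples, 1, n_features) - (1, n_centers, n_features)
    # broadcasts to (n_samples, n_centers, n_features)
    diff = samples[:, np.newaxis, :] - centers[np.newaxis, :, :]
    # Sum squared differences over the feature axis, then take the root
    return np.sqrt((diff ** 2).sum(axis=2))


samples = np.array([[0, 0], [3, 4]])
centers = np.array([[0, 0], [3, 0]])
distances = pairwise_euclidean(samples, centers)
```

Row i of `distances` holds the distances from sample i to each center, so taking the position of the smallest value in a row gives the index of that sample's nearest center.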
Kmeans_python.fit.fit(X_train, k, n_init=10, max_iter=200)
Groups unlabeled data into a given number of clusters k using a simple KMeans algorithm, an unsupervised learning method. Returns the cluster centers and a label for each data point indicating the cluster it belongs to.
Parameters: - X_train (numpy.ndarray or pandas.DataFrame, shape=(n_samples, n_features)) – Input data to be clustered, with features in the columns and samples in the rows
- k (int) – The number of clusters to form.
Returns: - list – A list of the centers of each cluster.
- list – A list of cluster-assignment labels for all samples in the data
Examples
>>> from Kmeans_python.fit import fit
>>> import numpy as np
>>> import pandas as pd
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> centers, labels = fit(X, 2)
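For readers unfamiliar with k-means, the standard iteration (Lloyd's algorithm) alternates between assigning samples to the nearest center and moving each center to the mean of its samples. The sketch below shows a single run of that loop. It is not the package's code: `kmeans_sketch`, the `seed` parameter, and the random-sample initialization are assumptions, and the restart logic implied by `n_init` is left out for brevity.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=200, seed=0):
    """Hypothetical single-run Lloyd's algorithm; not the package's fit."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct samples chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each sample
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (keep old center if empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return centers, labels


X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
centers, labels = kmeans_sketch(X, 2)
```

A single run can settle in a local optimum that depends on the starting centers, which is the usual motivation for repeating the procedure from several initializations and keeping the best result.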
Kmeans_python.predict module
Kmeans_python.predict.predict(X_new, centroids)
Assigns new data points to clusters based on the closest centroid.
Parameters: - X_new (array-like, shape=(n_samples, n_features)) – New data to assign to clusters
- centroids (numpy.ndarray) – array containing cluster center locations
Returns: assigned clusters for each point in X_new
Return type: numpy.ndarray, shape=(n_samples, )
Examples
>>> from Kmeans_python.fit import fit
>>> from Kmeans_python.predict import predict
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> centers, cluster_ass = fit(X, 2)
>>> X_test = np.array([[1, 0], [2, 4], [8, 1],
...                    [9, 3], [8, 8], [0, 0]])
>>> predict(X_test, centers)
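Nearest-centroid assignment, as described above, amounts to picking the index of the smallest distance for each new point. A minimal sketch follows, assuming Euclidean distance and fixed example centroids; `assign_to_nearest` is a hypothetical name, and the cluster numbering produced by fit on real data may differ from the one assumed here.

```python
import numpy as np


def assign_to_nearest(X_new, centroids):
    """Hypothetical stand-in for predict: index of the closest centroid."""
    X_new = np.asarray(X_new, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Distance matrix of shape (n_samples, n_centroids)
    dists = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    # Ties resolve to the lowest centroid index
    return dists.argmin(axis=1)


centroids = np.array([[1.0, 2.0], [10.0, 2.0]])  # assumed centers for the demo
X_test = np.array([[1, 0], [2, 4], [8, 1], [9, 3], [8, 8], [0, 0]])
assigned = assign_to_nearest(X_test, centroids)
```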
Kmeans_python.silhouette module
Kmeans_python.silhouette.sil_score(X, labels)
Returns the average silhouette score over all samples, given a 2-d data array and its clustering labels.
Parameters: - X (2-d array, shape=(n_samples, n_features)) – The data to be clustered.
- labels (array) – An array of the cluster labels, one per sample.
Returns: The average silhouette score of all points
Return type: float
Examples
>>> from Kmeans_python.silhouette import sil_score
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> labels = np.array([0, 0, 0, 1, 1, 1])
>>> sil_score(X, labels)
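The silhouette score of a sample is conventionally (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below follows that conventional definition with Euclidean distance. `mean_silhouette` is a hypothetical name, and scoring single-member clusters as 0 is an assumption borrowed from common practice, not something this page states about sil_score.

```python
import numpy as np


def mean_silhouette(X, labels):
    """Hypothetical helper: average silhouette score over all samples.

    Requires at least two distinct labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix, shape (n_samples, n_samples)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # assumed convention: singleton clusters score 0
        # a: mean distance to the other members of the same cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of another cluster
        b = min(dists[i, labels == c].mean() for c in unique if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())


X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
labels = np.array([0, 0, 0, 1, 1, 1])
score = mean_silhouette(X, labels)
```

Scores lie between -1 and 1, and values close to 1 indicate compact, well-separated clusters.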
Kmeans_python.silhouette.silhouette(X, k_array)
Plots the silhouette score for each k value in the given array, using fit to cluster the data for each k. Returns the silhouette score for each k value in k_array together with the chart.
Parameters: - X (2-d array, shape=(n_samples, n_features)) – The data to be clustered.
- k_array (array) – An array of all contending k values.
Returns: - 1-d array – An array containing silhouette scores in the same order as k_array.
- Altair chart object – An Altair chart displaying silhouette scores with their corresponding k values.
Examples
>>> from Kmeans_python.silhouette import silhouette
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> k_array = [2, 3, 4, 5]
>>> silhouette(X, k_array)