ComScan.clustering module¶
-
class
ComScan.clustering.KElbowVisualizer(estimator, ax=None, k=10, metric='distortion', timings=True, locate_elbow=True, n_jobs=None, verbose=0, pre_dispatch='2*n_jobs', **kwargs)[source]¶ Bases:
yellowbrick.cluster.base.ClusteringScoreVisualizerThe K-Elbow Visualizer implements the “elbow” method of selecting the optimal number of clusters for K-means clustering. K-means is a simple unsupervised machine learning algorithm that groups data into a specified number (k) of clusters. Because the user must specify in advance what k to choose, the algorithm is somewhat naive – it assigns all members to k clusters even if that is not the right k for the dataset.
The elbow method runs k-means clustering on the dataset for a range of values for k (say from 1-10) and then for each value of k computes an average score for all clusters. By default, the
distortionscore is computed, the sum of square distances from each point to its assigned center. Other metrics can also be used such as thesilhouettescore, the mean silhouette coefficient for all samples or thecalinski_harabaszscore, which computes the ratio of dispersion between and within clusters.When these overall metrics for each model are plotted, it is possible to visually determine the best value for k. If the line chart looks like an arm, then the “elbow” (the point of inflection on the curve) is the best value of k. The “arm” can be either up or down, but if there is a strong inflection point, it is a good indication that the underlying model fits best at that point.
Note
yellowbrick.cluster.elbowImplements the elbow method for determining the optimal number of clusters.Author: Benjamin BengfortCreated: Thu Mar 23 22:36:31 2017 -0400Copyright (C) 2016 The scikit-yb developersFor license information, see LICENSE.txtID: elbow.py [5a370c8] benjamin@bengfort.com- Parameters
estimator (a scikit-learn clusterer) – Should be an instance of an unfitted clusterer, specifically
KMeansorMiniBatchKMeans. If it is not a clusterer, an exception is raised.ax (matplotlib Axes, default: None) – The axes to plot the figure on. If None is passed in the current axes will be used (or generated if required).
k (integer, tuple, or iterable) – The k values to compute silhouette scores for. If a single integer is specified, then will compute the range (2,k). If a tuple of 2 integers is specified, then k will be in np.arange(k[0], k[1]). Otherwise, specify an iterable of integers to use as values for k.
metric (string, default:
"distortion") –Select the scoring metric to evaluate the clusters. The default is the mean distortion, defined by the sum of squared distances between each observation and its closest centroid. Other metrics include:
distortion: mean sum of squared distances to centers
silhouette: mean ratio of intra-cluster and nearest-cluster distance
calinski_harabasz: ratio of within to between cluster dispersion
timings (bool, default: True) – Display the fitting time per k to evaluate the amount of time required to train the clustering model.
locate_elbow (bool, default: True) – Automatically find the “elbow” or “knee” which likely corresponds to the optimal value of k using the “knee point detection algorithm”. The knee point detection algorithm finds the point of maximum curvature, which in a well-behaved clustering problem also represents the pivot of the elbow curve. The point is labeled with a dashed line and annotated with the score and k values.
n_jobs (int, default: None) – Number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the cross-validation splits.
Nonemeans 1 unless in ajoblib.parallel_backendcontext.-1means using all processors.verbose (int, default: 0) – The verbosity level.
pre_dispatch (int or str, default: '2*n_jobs') – Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be: - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs - An int, giving the exact number of total jobs that are spawned - A str, giving an expression as a function of n_jobs, as in ‘2*n_jobs’
kwargs (dict) – Keyword arguments that are passed to the base class and may influence the visualization as defined in other Visualizers.
-
k_scores_¶ The silhouette score corresponding to each k value.
- Type
array of shape (n,) where n is no. of k values
-
k_timers_¶ The time taken to fit n KMeans model corresponding to each k value.
- Type
array of shape (n,) where n is no. of k values
-
elbow_value_¶ The optimal value of k.
- Type
integer
-
elbow_score_¶ The silhouette score corresponding to the optimal value of k.
- Type
float
-
estimators_¶ a scikit-learn fitted estimator
- Type
BaseEstimator
Examples
>>> from yellowbrick.cluster import KElbowVisualizer >>> from sklearn.cluster import KMeans >>> model = KElbowVisualizer(KMeans(), k=10) >>> X = np.array([[1, 2], [1, 4], [1, 0], ... [4, 2], [4, 4], [4, 0]]) >>> model.fit(X) >>> model.show()
Notes
Modification from yellowbrick consist of get the best_estimator based on the finded elbow_value
If you get a visualizer that doesn’t have an elbow or inflection point, then this method may not be working. The elbow method does not work well if the data is not very clustered; in this case, you might see a smooth curve and the value of k is unclear. Other scoring methods, such as BIC or SSE, also can be used to explore if clustering is a correct choice.
For a discussion on the Elbow method, read more at Robert Gove’s Block website. For more on the knee point detection algorithm see the paper “Finding a “kneedle” in a Haystack”.
See also
The scikit-learn documentation for the silhouette_score and calinski_harabasz_score. The default,
distortion_score, is implemented inyellowbrick.cluster.elbow.-
finalize()[source]¶ Prepare the figure for rendering by setting the title as well as the X and Y axis labels and adding the legend.
-
fit(X, y=None, **kwargs)[source]¶ Fits n KMeans models where n is the length of
self.k_values_, storing the silhouette scores in theself.k_scores_attribute. The “elbow” and silhouette score corresponding to it are stored inself.elbow_valueandself.elbow_scorerespectively. This method finishes up by calling draw to create the plot.
-
class
ComScan.clustering.KMeansConstrainedMissing(n_clusters=8, size_min=None, size_max=None, em_iter=10, n_init=10, max_iter=300, features_reduction=None, n_components=2, tol=0.0001, verbose=False, random_state=None, copy_x=True, n_jobs=1)[source]¶ Bases:
sklearn.base.TransformerMixin,sklearn.base.ClusterMixin,sklearn.base.BaseEstimatorK-Means clustering with minimum and maximum cluster size constraints with possible missing values
Note
inspired of https://stackoverflow.com/questions/35611465/python-scikit-learn-clustering-with-missing-data
- Parameters
n_clusters (int, optional, default: 8) – The number of clusters to form as well as the number of centroids to generate.
size_min (int, optional, default: None) – Constrain the label assignment so that each cluster has a minimum size of size_min. If None, no constrains will be applied
size_max (int, optional, default: None) – Constrain the label assignment so that each cluster has a maximum size of size_max. If None, no constrains will be applied
em_iter (int, default: 10) – expectation–maximization (EM) iteration for convergence of missing values. Use when no features reduction is applied and missing values.
n_init (int, default: 10) – Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
max_iter (int, default: 300) – Maximum number of iterations of the k-means algorithm for a single run.
features_reduction (str, default: None) – Method for reduction of the embedded space with n_components. Can be pca or umap.
n_components (int, default: 2) – Dimension of the embedded space for features reduction.
tol (float, default: 1e-4) – Relative tolerance with regards to inertia to declare convergence
verbose (int, default: 0) – Verbosity mode.
random_state (int, RandomState instance or None, optional, default: None) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
copy_x (boolean, default: True) – When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True, then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean.
n_jobs (int, default: 1) –
The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
-
cluster_centers_¶ Coordinates of cluster centers
- Type
array, [n_clusters, n_features]
-
labels_¶ Labels of each point
-
inertia_¶ Sum of squared distances of samples to their closest cluster center.
- Type
float
-
cls_¶ - Type
KMeansConstrained classifier object
-
cls_features_reduction_¶ - Type
PCA or UMAP reduction object
-
centroids_¶ Centroids found at the last iteration of k-means.
- Type
array
-
X_hat_¶ Copy of X with the missing values filled in.
- Type
array
-
mu_¶ - Type
Columns means
Examples
>>> X = np.array([[1, 2], [1, 4], [1, 0], ... [4, 2], [4, 4], [4, 0]]) >>> clf = KMeansConstrainedMissing( ... n_clusters=2, ... size_min=2, ... size_max=5, ... random_state=0 ... ) >>> clf.fit_predict(X) array([0, 0, 0, 1, 1, 1], dtype=int32) >>> clf.cluster_centers_ array([[ 1., 2.], [ 4., 2.]]) >>> clf.labels_ array([0, 0, 0, 1, 1, 1], dtype=int32)
Notes
K-means problem constrained with a minimum and/or maximum size for each cluster.
The constrained assignment is formulated as a Minimum Cost Flow (MCF) linear network optimisation problem. This is then solved using a cost-scaling push-relabel algorithm. The implementation used is Google’s Operations Research tools’s SimpleMinCostFlow.
- Ref:
1. Bradley, P. S., K. P. Bennett, and Ayhan Demiriz. “Constrained k-means clustering.” Microsoft Research, Redmond (2000): 1-8. 2. Google’s SimpleMinCostFlow implementation: https://github.com/google/or-tools/blob/master/ortools/graph/min_cost_flow.h
-
fit(X, y=None)[source]¶ Compute k-means clustering with given constants.
- Parameters
X (array-like, shape=(n_samples, n_features)) – Training instances to cluster.
y (Ignored) –
-
fit_predict(X, y=None)[source]¶ Compute cluster centers and predict cluster index for each sample.
Equivalent to calling fit(X) followed by predict(X) but also more efficient.
- Parameters
X ({array-like, sparse matrix}, shape = [n_samples, n_features]) – New data to transform.
- Returns
labels – Index of the cluster each sample belongs to.
- Return type
array, shape [n_samples,]
-
predict(X, size_min='init', size_max='init')[source]¶ Predict the closest cluster each sample in X belongs to given the provided constraints. The constraints can be temporally overridden when determining which cluster each datapoint is assigned to.
Only computes the assignment step. It does not re-fit the cluster positions.
- Parameters
X (array-like, shape = [n_samples, n_features]) – New data to predict.
size_min (int, optional, default: size_min provided with initialisation) – Constrain the label assignment so that each cluster has a minimum size of size_min. If None, no constrains will be applied. If ‘init’ the value provided during initialisation of the class will be used.
size_max (int, optional, default: size_max provided with initialisation) – Constrain the label assignment so that each cluster has a maximum size of size_max. If None, no constrains will be applied. If ‘init’ the value provided during initialisation of the class will be used.
- Returns
labels – Index of the cluster each sample belongs to.
- Return type
array, shape [n_samples,]
-
ComScan.clustering.optimal_clustering(X, size_min=10, metric='distortion', features_reduction=None, n_components=2, n_jobs=1, random_state=None, visualize=False)[source]¶ Function to find the optimal clustering using a constrained k means. Two method are available to find the optimal number of cluster
silhouetteorelbow.- Parameters
X (
Union[DataFrame,ndarray]) – array-like or DataFrame of floats, shape (n_samples, n_features) The observations to cluster.size_min (
int) – Constrain the label assignment so that each cluster has a minimum size of size_min. If None, no constrains will be applied. default: Nonemetric (
str) – Select the scoring metric to evaluate the clusters. The default is the mean distortion, defined by the sum of squared distances between each observation and its closest centroid. Other metrics include: - distortion: mean sum of squared distances to centers - silhouette: mean ratio of intra-cluster and nearest-cluster distance - calinski_harabasz: ratio of within to between cluster dispersionfeatures_reduction (
Optional[str]) – Method for reduction of the embedded space with n_components. Can be pca or umap. Default: Nonen_components (
int) – Dimension of the embedded space for features reduction. Default 2.n_jobs (
int) – int The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.random_state (
Optional[int]) – int, RandomState instance or None, optional, default: None If int, random_state is the seed used by the random number generator; If None, the random number generator is the RandomState instance used by np.random.visualize (
bool) – bool, default: False If True, callsshow()
- Return type
Tuple[KMeansConstrained,Union[UMAP,PCA],int,ndarray,int,Sequence[float],float,ndarray,ndarray,ndarray]- Returns
cls: KMeansConstrained classifier object
cls_features_reduction: PCA or UMAP reduction object
cluster_nb: optimal number of cluster
labels: label[i] is the code or index of the centroid the i’th observation is closest to.
ref_label: cluster label with the minimal within-cluster sum-of-squares.
wicss_clusters: within-cluster sum-of-squares for each cluster
best_wicss_cluster: minimal wicss.
centroid: Centroids found at the last iteration of k-means.
X_hat: Copy of X with the missing values filled in.