clustpy.partition package

Submodules

clustpy.partition.dipext module

@authors: Benjamin Schelling and Sam Maurus (original R implementation), Collin Leiber

class clustpy.partition.dipext.DipExt(n_components: int | None = None, do_dip_scaling: bool = True, step_size: float = 0.1, momentum: float = 0.95, dip_threshold: float = 0.5, n_starting_vectors: int | None = None, ambiguous_triangle_strategy: str = 'ignore', random_state: RandomState | None = None)[source]

Bases: TransformerMixin, BaseEstimator

Execute the DipExt algorithm to reduce the number of features. To do so, it utilizes the gradient of the Dip-test of unimodality to perform gradient descent. The output features should show a high degree of multimodality.

Parameters:
  • n_components (int) – The number of components to extract. Can be None, in that case dip_threshold will be used to define the number of components (default: None)

  • do_dip_scaling (bool) – If true, the resulting feature space will be scaled by performing a min-max normalization for each feature and multiplying this feature by its dip-value (default: True)

  • step_size (float) – Step size used for gradient descent (default: 0.1)

  • momentum (float) – Momentum used for gradient descent (default: 0.95)

  • dip_threshold (float) – Defines the number of components if n_components is None. If an identified feature has a dip-value below the maximum dip-value times dip_threshold, DipExt will terminate (default: 0.5)

  • n_starting_vectors (int) – The number of starting vectors for gradient descent. Can be None, in that case it will be equal to log(data dimensionality) + 1 (default: None)

  • ambiguous_triangle_strategy (str) – The strategy with which to handle an ambiguous modal triangle. Can be ‘ignore’, ‘random’ or ‘all’. In the case of ‘random’, a valid triangle is created at random. In the case of ‘all’, the gradient is calculated for each possible triangle and the gradient whose subsequent result looks most promising is used - this strategy can increase the runtime noticeably (default: ‘ignore’)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int. Only used if ambiguous_triangle_strategy is ‘random’ (default: None)

dip_values_

The dip-value of each resulting feature

Type:

np.ndarray

projection_axes_

List containing the axes used for the transformation

Type:

list

argsorted_dips_

The indices of the sorted original dip-values (decreasing), needed to adjust the order of the features when using transform()

Type:

np.ndarray
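
Examples

A minimal usage sketch in the style of the other examples in this package (the make_blobs data is illustrative):

>>> from clustpy.partition import DipExt
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 10, centers=3)
>>> de = DipExt()
>>> subspace = de.fit_transform(X)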

References

Schelling, Benjamin, et al. “Utilizing Structure-rich Features to improve Clustering.” (2020). The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2020

fit(X: ndarray, y: ndarray | None = None) DipExt[source]

Retrieve the necessary projection axes to apply DipExt to any given data set.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – This instance of the DipExt algorithm

Return type:

DipExt

fit_transform(X: ndarray, y: ndarray = None) ndarray[source]

Initiate the actual dimensionality-reduction process on the input data set.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

subspace – The transformed feature space (Number of samples x number of components)

Return type:

np.ndarray

transform(X: ndarray) ndarray[source]

Apply the transformation to the given data set using the projection axes found by DipExt.

Parameters:

X (np.ndarray) – the given data set

Returns:

subspace – The transformed feature space (Number of samples x number of components)

Return type:

np.ndarray

class clustpy.partition.dipext.DipInit(n_clusters: int, n_components: int | None = None, do_dip_scaling: bool = True, step_size: float = 0.1, momentum: float = 0.95, dip_threshold: float = 0.5, n_starting_vectors: int | None = None, ambiguous_triangle_strategy: str = 'ignore', random_state: RandomState | None = None)[source]

Bases: DipExt, BaseEstimator, ClusterMixin

Execute the DipInit clustering procedure. Initially, DipExt is executed to identify relevant features. Next, KMeans with initial cluster centers is used to identify high-quality cluster labels. To get the coordinates of the initial cluster centers, DipInit uses the features identified by DipExt one after another. In the first iteration the centers will be equally distributed in a one-dimensional space by using the ids of the objects. Thereafter, the algorithm adds additional features and uses the current cluster labels to also add another coordinate to the centers.

Parameters:
  • n_clusters (int) – The number of clusters

  • n_components (int) – The number of components to extract. Can be None, in that case dip_threshold will be used to define the number of components (default: None)

  • do_dip_scaling (bool) – If true, the resulting feature space will be scaled by performing a min-max normalization for each feature and multiplying this feature by its dip-value (default: True)

  • step_size (float) – Step size used for gradient descent (default: 0.1)

  • momentum (float) – Momentum used for gradient descent (default: 0.95)

  • dip_threshold (float) – Defines the number of components if n_components is None. If an identified feature has a dip-value below the maximum dip-value times dip_threshold, DipExt will terminate (default: 0.5)

  • n_starting_vectors (int) – The number of starting vectors for gradient descent. Can be None, in that case it will be equal to log(data dimensionality) + 1 (default: None)

  • ambiguous_triangle_strategy (str) – The strategy with which to handle an ambiguous modal triangle. Can be ‘ignore’, ‘random’ or ‘all’. In the case of ‘random’, a valid triangle is created at random. In the case of ‘all’, the gradient is calculated for each possible triangle and the gradient whose subsequent result looks most promising is used - this strategy can increase the runtime noticeably (default: ‘ignore’)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int. Only used if ambiguous_triangle_strategy is ‘random’ (default: None)

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray
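
Examples

A minimal usage sketch (the blob data is illustrative):

>>> from clustpy.partition import DipInit
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 10, centers=3)
>>> di = DipInit(3)
>>> di.fit(X)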

References

Schelling, Benjamin, et al. “Utilizing Structure-rich Features to improve Clustering.” (2020). The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2020

fit(X: ndarray, y: ndarray | None = None) DipInit[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DipInit algorithm

Return type:

DipInit

clustpy.partition.dipmeans module

@authors: Collin Leiber

class clustpy.partition.dipmeans.DipMeans(significance: float = 0.001, split_viewers_threshold: float = 0.01, pval_strategy: str = 'table', n_boots: int = 1000, n_split_trials: int = 10, n_clusters_init: int = 1, max_n_clusters: int = inf, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Execute the DipMeans clustering procedure. In contrast to other algorithms (e.g. KMeans), DipMeans is able to identify the number of clusters by itself. To do so, it uses the dip-dist criterion. It calculates the dip-value of the distances from each point within a cluster to all other points in that cluster and checks how many points are assigned a dip-value below the threshold. If the amount of these so-called split viewers is above the split_viewers_threshold, the cluster will be split using 2-Means. The algorithm terminates once all clusters show a unimodal behaviour.

Parameters:
  • significance (float) – Threshold to decide if the result of the dip-test is unimodal or multimodal (default: 0.001)

  • split_viewers_threshold (float) – Threshold to decide whether a cluster has a unimodal or multimodal structure. Must be within [0, 1] (default: 0.01)

  • pval_strategy (str) – Defines which strategy to use to receive dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)

  • n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)

  • n_split_trials (int) – Number of tries to split a cluster. For each try, 2-KMeans is executed with different cluster centers (default: 10)

  • n_clusters_init (int) – The initial number of clusters. Can also be of type np.ndarray if initial cluster centers are specified (default: 1)

  • max_n_clusters (int) – Maximum number of clusters. Must be larger than n_clusters_init (default: np.inf)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray
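
Examples

A minimal usage sketch (the blob data is illustrative):

>>> from clustpy.partition import DipMeans
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 2, centers=5)
>>> dm = DipMeans()
>>> dm.fit(X)
>>> print(dm.n_clusters_)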

References

Kalogeratos, Argyris, and Aristidis Likas. “Dip-means: an incremental clustering method for estimating the number of clusters.” Advances in neural information processing systems. 2012.

fit(X: ndarray, y: ndarray | None = None) DipMeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DipMeans algorithm

Return type:

DipMeans

clustpy.partition.dipnsub module

@authors: Collin Leiber

class clustpy.partition.dipnsub.DipNSub(significance: float = 0.01, threshold: float = 0.15, step_size: float = 0.1, momentum: float = 0.95, n_starting_vectors: int | None = None, add_tails=True, outliers=False, consider_duplicates: bool = False, random_state: RandomState | None = None, debug=False)[source]

Bases: object

Execute the Dip’n’Sub clustering procedure. It searches for projection axes on which as many samples as possible are part of multimodal clusters. To do so, it uses the gradient of the p-value of the Dip-test to perform stochastic gradient descent. On the identified axes TailoredDip, an extension of UniDip, will be executed to assign cluster labels. Note that clusters will always form a hypercube in the resulting subspace.

Parameters:
  • significance (float) – Threshold to decide if the result of the dip-test is unimodal or multimodal (default: 0.01)

  • threshold (float) – Defines the fraction of objects that must be contained in multimodal clusters for the algorithm to continue (default: 0.15)

  • step_size (float) – Step size used for gradient descent (default: 0.1)

  • momentum (float) – Momentum used for gradient descent (default: 0.95)

  • n_starting_vectors (int) – The number of starting vectors for gradient descent. Can be None, in that case it will be equal to np.log(data dimensionality) + 1 (default: None)

  • add_tails (bool) – Defines if TailoredDip should try to add tails to the surrounding clusters (default: True)

  • outliers (bool) – Defines if outliers should be identified as described by UniDip (default: False)

  • consider_duplicates (bool) – If multiple instances on the projection axis share a value, the gradient is ambiguous. If those duplicate values should be considered, a random instance will be chosen for further calculations. Beware: The calculation will no longer be deterministic (default: False)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • debug (bool) – If true, additional information will be printed to the console (default: False)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

subspace_

The resulting subspace

Type:

np.ndarray
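
Examples

A minimal usage sketch (the blob data is illustrative):

>>> from clustpy.partition import DipNSub
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 5, centers=3)
>>> dns = DipNSub()
>>> dns.fit(X)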

References

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.

fit(X: ndarray, y: ndarray | None = None) DipNSub[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DipNSub algorithm

Return type:

DipNSub

clustpy.partition.gapstatistic module

@authors: Collin Leiber

class clustpy.partition.gapstatistic.GapStatistic(min_n_clusters: int = 1, max_n_clusters: int = 10, n_boots: int = 10, use_principal_components: bool = True, use_log: bool = True, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Estimate the number of clusters for KMeans using the Gap Statistic. Calculate the Gap Statistic by comparing the within-cluster dispersion of the input data set with that of randomly sampled data. The Gap Statistic is evaluated for multiple numbers of clusters. The first clustering result that fulfills the Gap condition ‘Gap(k) >= Gap(k+1)-s_{k+1}’ will be returned. Beware: The result can be None if no clustering result fulfills that condition!

Parameters:
  • min_n_clusters (int) – Minimum number of clusters. Must be smaller than max_n_clusters (default: 1)

  • max_n_clusters (int) – Maximum number of clusters. Must be larger than min_n_clusters (default: 10)

  • n_boots (int) – Number of random data sets that should be created to calculate Gap Statistic (default: 10)

  • use_principal_components (bool) – True, if the random data sets should be created using the feature-wise minimum and maximum value of the Principal Components. Else, the minimum and maximum value of the regular data set will be used (default: True)

  • use_log (bool) – True, if the logarithm of the within cluster dispersion should be used. For more information see Mohajer et al. (default: True)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The first number of clusters that fulfills the Gap condition (can be None)

Type:

int

labels_

The labels as identified by the Gap Statistic (can be None)

Type:

np.ndarray

cluster_centers_

The cluster centers as identified by the Gap Statistic (Can be None)

Type:

np.ndarray

gaps_

The Gap values

Type:

np.ndarray

sks_

The sk values

Type:

np.ndarray

Examples

>>> from clustpy.partition import GapStatistic
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 2, centers=7, cluster_std=0.7)
>>> gs = GapStatistic()
>>> gs.fit(X)
>>> print(gs.n_clusters_)
>>> gs.plot()

References

Tibshirani, Robert, Guenther Walther, and Trevor Hastie. “Estimating the number of clusters in a data set via the gap statistic.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63.2 (2001): 411-423.

and

Mohajer, Mojgan, Karl-Hans Englmeier, and Volker J. Schmid. “A comparison of Gap statistic definitions with and without logarithm function.” arXiv preprint arXiv:1103.4767 (2011).

fit(X: ndarray, y: ndarray | None = None) GapStatistic[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the GapStatistic algorithm

Return type:

GapStatistic

plot() None[source]

Plot the result of the Gap Statistic. Shows the number of clusters on the x-axis and the Gap values on the y-axis.

clustpy.partition.gmeans module

@authors: Collin Leiber

class clustpy.partition.gmeans.GMeans(significance: float = 0.001, n_clusters_init: int = 1, max_n_clusters: int = inf, n_split_trials: int = 10, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Execute the GMeans clustering procedure. Determines the number of clusters by repeatedly trying to split a cluster into two clusters. To test a split, the data is projected onto the axis connecting the two resulting centers. If the Anderson Darling test does not indicate a Gaussian distribution for this projection, the new clusters are retained. Otherwise, the cluster remains as it was originally. This is repeated until no cluster changes.

Parameters:
  • significance (float) – Threshold to decide if the result of the Anderson Darling Test indicates a Gaussian distribution (default: 0.001)

  • n_clusters_init (int) – The initial number of clusters. Can also be of type np.ndarray if initial cluster centers are specified (default: 1)

  • max_n_clusters (int) – Maximum number of clusters. Must be larger than n_clusters_init (default: np.inf)

  • n_split_trials (int) – Number of tries to split a cluster. For each try, 2-KMeans is executed with different cluster centers (default: 10)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray
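
Examples

A minimal usage sketch (the blob data is illustrative):

>>> from clustpy.partition import GMeans
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 2, centers=4)
>>> gm = GMeans()
>>> gm.fit(X)
>>> print(gm.n_clusters_)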

References

Hamerly, Greg, and Charles Elkan. “Learning the k in k-means.” Advances in neural information processing systems. 2004.

and

D’Agostino, Ralph B., and Michael A. Stephens. “Goodness-of-fit techniques.” Statistics: Textbooks and Monographs (1986).

fit(X: ndarray, y: ndarray | None = None) GMeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the GMeans algorithm

Return type:

GMeans

clustpy.partition.ldakmeans module

authors: Collin Leiber

class clustpy.partition.ldakmeans.LDAKmeans(n_clusters: int, n_dims: int | None = None, max_iter: int = 300, n_init: int = 1, kmeans_repetitions: int = 10, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Execute the LDA-Kmeans clustering procedure. The initial rotation is normally given by the (n_clusters-1) components of a PCA. Afterward, Kmeans and LDA are executed one after the other until the labels no longer change. Kmeans always takes place in the rotated subspace.

Parameters:
  • n_clusters (int) – the number of clusters

  • n_dims (int) – The number of features in the resulting subspace. If None this will be equal to n_clusters - 1 (default: None)

  • max_iter (int) – the maximum number of iterations (default: 300)

  • n_init (int) – number of times LDAKmeans is executed using different seeds. The final result will be the one with lowest costs (default: 1)

  • kmeans_repetitions (int) – Number of repetitions when executing KMeans. For more information see sklearn.cluster.KMeans (default: 10)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels

Type:

np.ndarray

rotation_

The final rotation matrix

Type:

np.ndarray

cluster_centers_

The cluster centers in the subspace

Type:

np.ndarray

error_

The final error (KMeans error in the subspace)

Type:

float
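
Examples

A minimal usage sketch (the blob data is illustrative):

>>> from clustpy.partition.ldakmeans import LDAKmeans
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 10, centers=5)
>>> lk = LDAKmeans(5)
>>> lk.fit(X)
>>> reduced = lk.transform_subspace(X)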

References

Ding, Chris, and Tao Li. “Adaptive dimension reduction using discriminant analysis and k-means clustering.” Proceedings of the 24th international conference on Machine learning. 2007.

fit(X: ndarray, y: ndarray | None = None) LDAKmeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels are contained in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the LDAKmeans algorithm

Return type:

LDAKmeans

transform_subspace(X: ndarray) ndarray[source]

Transform the input dataset with the rotation matrix identified by the fit function.

Parameters:

X (np.ndarray) – the given data set

Returns:

rotated_data – The rotated data set

Return type:

np.ndarray

clustpy.partition.pgmeans module

@authors: Collin Leiber

class clustpy.partition.pgmeans.PGMeans(significance: float = 0.001, n_projections: int | None = None, n_samples: int | None = None, n_new_centers: int = 10, amount_random_centers: float = 0.5, n_clusters_init: int = 1, max_n_clusters: int = inf, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Execute the PGMeans clustering procedure. Determines the number of clusters by executing the EM algorithm multiple times to create different Gaussian Mixture Models (GMMs). For each GMM, it projects the model onto random projection axes and uses the Kolmogorov Smirnov Test to decide whether the data matches the fitted model. If this is not the case, a new center will be added and the next GMM will be fitted. This is repeated until no clusters are added anymore.

Parameters:
  • significance (float) – Threshold to decide if the result of the Kolmogorov Smirnov Test indicates a Gaussian Mixture Model (default: 0.001)

  • n_projections (int) – Number of projection axes to test different projected GMMs on. Can be None, in that case it will be set to: -2.6198 * log(significance) (default: None)

  • n_samples (int) – Number of samples generated from the fitted GMM and used to execute the Kolmogorov Smirnov Test. If it is chosen larger than the number of data samples, it will be equal to this value. Can be None, in that case it will be set to: 3 / significance (default: None)

  • n_new_centers (int) – Number of centers to test when a new center should be added to the current GMM model. The additional center producing the best new GMM will be used for subsequent iterations (default: 10)

  • amount_random_centers (float) – Fraction of random centers tested. Must be a value in the range [0, 1]. In total, (n_new_centers * amount_random_centers) random centers will be tested. The other possible centers will be chosen based on the probability densities of the current GMM model (default: 0.5)

  • n_clusters_init (int) – The initial number of clusters. Can also be of type np.ndarray if initial cluster centers are specified (default: 1)

  • max_n_clusters (int) – Maximum number of clusters. Must be larger than n_clusters_init (default: np.inf)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray
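
Examples

A minimal usage sketch (the blob data is illustrative):

>>> from clustpy.partition.pgmeans import PGMeans
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 2, centers=4)
>>> pgm = PGMeans()
>>> pgm.fit(X)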

References

Feng, Yu, and Greg Hamerly. “PG-means: learning the number of clusters in data.” Advances in neural information processing systems. 2007.

fit(X: ndarray, y: ndarray | None = None) PGMeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the PGMeans algorithm

Return type:

PGMeans

clustpy.partition.projected_dipmeans module

@authors: Collin Leiber

class clustpy.partition.projected_dipmeans.ProjectedDipMeans(significance: float = 0.001, n_random_projections: int = 0, pval_strategy: str = 'table', n_boots: int = 1000, n_split_trials: int = 10, n_clusters_init: int = 1, max_n_clusters: int = inf, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Execute the Projected DipMeans clustering procedure. It repeatedly creates random projection axes for each cluster and tests whether the data projected onto a projection axis is unimodal. If the probability of unimodality is below significance, the cluster will be split using 2-Means. The algorithm terminates when all clusters show unimodal behaviour on all projection axes.

Parameters:
  • significance (float) – Threshold to decide if the result of the dip-test is unimodal or multimodal (default: 0.001)

  • n_random_projections (int) – Number of random projections that should be applied in addition to the original features and the components from a PCA (default: 0)

  • pval_strategy (str) – Defines which strategy to use to receive dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)

  • n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)

  • n_split_trials (int) – Number of tries to split a cluster. For each try, 2-KMeans is executed with different cluster centers (default: 10)

  • n_clusters_init (int) – The initial number of clusters. Can also be of type np.ndarray if initial cluster centers are specified (default: 1)

  • max_n_clusters (int) – Maximum number of clusters. Must be larger than n_clusters_init (default: np.inf)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray
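
Examples

A minimal usage sketch (the blob data is illustrative):

>>> from clustpy.partition.projected_dipmeans import ProjectedDipMeans
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 3, centers=4)
>>> pdm = ProjectedDipMeans()
>>> pdm.fit(X)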

References

Chamalis, Theofilos, and Aristidis Likas. “The projected dip-means clustering algorithm.” Proceedings of the 10th Hellenic Conference on Artificial Intelligence. 2018.

fit(X: ndarray, y: ndarray | None = None) ProjectedDipMeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the ProjectedDipMeans algorithm

Return type:

ProjectedDipMeans

clustpy.partition.skinnydip module

@authors: Collin Leiber

class clustpy.partition.skinnydip.SkinnyDip(significance: float = 0.05, pval_strategy: str = 'table', n_boots: int = 1000, add_tails: bool = False, outliers: bool = True, max_cluster_size_diff_factor: float = 2, random_state: RandomState | None = None, debug: bool = False)[source]

Bases: BaseEstimator, ClusterMixin

Execute the SkinnyDip clustering procedure. This approach iteratively executes the univariate clustering algorithm UniDip on each feature. The resulting clusters form hypercubes, which have a much higher density than the surrounding noise. See UniDip for more information.

Parameters:
  • significance (float) – Threshold to decide if the result of the dip-test is unimodal or multimodal (default: 0.05)

  • pval_strategy (str) – Defines which strategy to use to receive dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)

  • n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)

  • add_tails (bool) – Defines if TailoredDip should try to add tails to the surrounding clusters (default: False)

  • outliers (bool) – Defines if outliers should be identified as described by UniDip (default: True)

  • max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for merging and assigning tails of distributions if ‘add_tails’ is True (default: 2)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int. Only relevant if pval_strategy is ‘bootstrap’ (default: None)

  • debug (bool) – If true, additional information will be printed to the console (default: False)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

Examples

>>> from clustpy.partition.skinnydip import SkinnyDip
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 3, centers=5)
>>> sk = SkinnyDip()
>>> sk.fit(X)

References

Maurus, Samuel, and Claudia Plant. “Skinny-dip: clustering in a sea of noise.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.

and

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.

fit(X: ndarray, y: ndarray | None = None) SkinnyDip[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the SkinnyDip algorithm

Return type:

SkinnyDip

class clustpy.partition.skinnydip.UniDip(significance: float = 0.05, pval_strategy: str = 'table', n_boots: int = 1000, add_tails: bool = False, outliers: bool = True, max_cluster_size_diff_factor: float = 2, random_state: RandomState | None = None, debug: bool = False)[source]

Bases: BaseEstimator, ClusterMixin

Execute the UniDip clustering procedure. This univariate clustering algorithm recursively uses the Dip-test of unimodality to check if a set of samples is distributed uni- or multimodally. If the test indicates multimodality, the data will be split into three sets: left of the main mode, the main mode itself and right of the main mode. Each set will itself be tested by the Dip-test to see if it has to be further subdivided.

This implementation includes the extensions made by TailoredDip (see Bauer et al.).

Parameters:
  • significance (float) – Threshold to decide if the result of the dip-test is unimodal or multimodal (default: 0.05)

  • pval_strategy (str) – Defines which strategy to use to receive dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)

  • n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)

  • add_tails (bool) – Defines if TailoredDip should try to add tails to the surrounding clusters (default: False)

  • outliers (bool) – Defines if outliers should be identified as described by UniDip (default: True)

  • max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for merging and assigning tails of distributions if ‘add_tails’ is True (default: 2)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int. Only relevant if pval_strategy is ‘bootstrap’ (default: None)

  • debug (bool) – If true, additional information will be printed to the console (default: False)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_boundaries_

List of tuples containing the id of the first sample in a cluster and the first sample that is not part of the cluster anymore

Type:

list
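
Examples

A minimal usage sketch on synthetic bimodal data (illustrative; assumes a one-dimensional input array, as UniDip is a univariate procedure):

>>> import numpy as np
>>> from clustpy.partition.skinnydip import UniDip
>>> rs = np.random.RandomState(1)
>>> X = np.r_[rs.randn(500) - 5, rs.randn(500) + 5]
>>> ud = UniDip()
>>> ud.fit(X)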

References

Maurus, Samuel, and Claudia Plant. “Skinny-dip: clustering in a sea of noise.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.

and

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.

fit(X: ndarray, y: ndarray | None = None) UniDip[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the UniDip algorithm

Return type:

UniDip

clustpy.partition.specialk module

@authors: Collin Leiber

Based on the original implementation, available at: https://github.com/Sibylse/SpecialK/blob/master/SpecialK.ipynb

class clustpy.partition.specialk.SpecialK(significance: float = 0.01, n_dimensions: int = 200, similarity_matrix: str = 'NAM', n_neighbors: int = 10, percentage: float = 0.99, n_cluster_pairs_to_consider: int = 10, max_n_clusters: int = inf, random_state: RandomState | None = None, debug: bool = False)[source]

Bases: BaseEstimator, ClusterMixin

Execute the SpecialK clustering procedure. SpecialK is able to autonomously identify a suitable number of clusters for spectral clustering procedures. To do so, it uses probability bounds on the operator norm of centered, symmetric decompositions based on the matrix Bernstein concentration inequality. It iteratively increases the number of clusters until the probability that the objects originate from two instead of one distribution does not increase further. The procedure can be based on an epsilon-neighborhood adjacency matrix or the symmetrically normalized adjacency matrix of the kNN graph.

Parameters:
  • significance (float) – Threshold to decide if the samples originate from a single distribution or two distributions (default: 0.01)

  • n_dimensions (int) – Dimensionality of the embedding (default: 200)

  • similarity_matrix (str) – Defines the similarity matrix to use. Can be one of the following strings or a numpy array / scipy sparse csr matrix. If ‘NAM’, a neighborhood adjacency matrix is used. If ‘SAM’ a symmetrically normalized adjacency matrix is used (default: ‘NAM’)

  • n_neighbors (int) – Number of neighbors for the construction of the similarity matrix. Does not include the object itself (default: 10)

  • percentage (float) – The amount of data points that should have at least n_neighbors neighbors. Only relevant if similarity_matrix is ‘NAM’ (default: 0.99)

  • n_cluster_pairs_to_consider (int) – The number of cluster pairs to consider when calculating the probability boundary; only the cluster pairs responsible for the highest cut are considered. If None, all pairs will be considered. Smaller values for n_cluster_pairs_to_consider will decrease the computing time (default: 10)

  • max_n_clusters (int) – Maximum number of clusters (default: np.inf)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution (default: None)

  • debug (bool) – If true, additional information will be printed to the console (default: False)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray
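
Examples

A minimal usage sketch (the blob data is illustrative):

>>> from clustpy.partition.specialk import SpecialK
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 2, centers=3)
>>> spk = SpecialK()
>>> spk.fit(X)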

References

Hess, Sibylle, and Wouter Duivesteijn. “k is the magic number—inferring the number of clusters through nonparametric concentration inequalities.” Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I. Springer International Publishing, 2020.

fit(X: ndarray, y: ndarray | None = None) SpecialK[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the SpecialK algorithm

Return type:

SpecialK

clustpy.partition.subkmeans module

@authors: Collin Leiber

class clustpy.partition.subkmeans.SubKmeans(n_clusters: int, V: ndarray | None = None, m: int | None = None, cluster_centers: ndarray | None = None, mdl_for_noisespace: bool = False, outliers: bool = False, max_iter: int = 300, n_init: int = 1, cost_type: str = 'default', threshold_negative_eigenvalue: float = -1e-07, max_distance: float | None = None, precision: float | None = None, random_state: RandomState | None = None, debug: bool = False)[source]

Bases: BaseEstimator, ClusterMixin

The Subspace Kmeans (SubKmeans) algorithm. The algorithm simultaneously searches for cluster assignments and an optimal subspace regarding those assignments. Here, the feature space is divided into a clustered space containing the actual clustering result and a noise space, which is a special subspace containing a single cluster. The rotation and dimensionality of the clustered space are defined through an eigenvalue decomposition.

This implementation includes some extensions from ‘Automatic Parameter Selection for Non-Redundant Clustering’.

Parameters:
  • n_clusters (int) – the number of clusters

  • V (np.ndarray) – the orthonormal rotation matrix (default: None)

  • m (int) – the initial dimensionality of the clustered space (default: None)

  • cluster_centers (np.ndarray) – list containing the initial cluster centers (default: None)

  • mdl_for_noisespace (bool) – defines if MDL should be used to identify noise space dimensions instead of only considering negative eigenvalues (default: False)

  • outliers (bool) – defines if outliers should be identified through MDL (default: False)

  • max_iter (int) – maximum number of iterations for the algorithm (default: 300)

  • n_init (int) – number of times SubKmeans is executed using different seeds. The final result will be the one with lowest costs. Costs can be the standard SubKmeans costs or MDL costs (defined by the cost_type parameter) (default: 1)

  • cost_type (str) – Can be “default” or “mdl” and defines whether the standard SubKmeans cost function or MDL costs should be considered to identify the best result. Only relevant if n_init is larger than 1 (default: “default”)

  • threshold_negative_eigenvalue (float) – threshold to consider an eigenvalue as negative. Used for the update of the subspace dimensions (default: -1e-7)

  • max_distance (float) – distance used to encode cluster centers and outliers. Only relevant if a MDL strategy is used (default: None)

  • precision (float) – precision used to convert probability densities to actual probabilities. Only relevant if a MDL strategy is used (default: None)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • debug (bool) – If true, additional information will be printed to the console (default: False)

labels_

The final labels

Type:

np.ndarray

scatter_matrix_

The final scatter matrix

Type:

np.ndarray
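
Examples

A minimal usage sketch (the blob data is illustrative):

>>> from clustpy.partition.subkmeans import SubKmeans
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 5, centers=3)
>>> sk = SubKmeans(3)
>>> sk.fit(X)
>>> clustered = sk.transform_clustered_space(X)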

References

Mautz, Dominik, et al. “Towards an optimal subspace for k-means.” Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017.

and

Leiber, Collin, et al. “Automatic Parameter Selection for Non-Redundant Clustering.” Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2022.

calculate_cost_function(X: ndarray) float[source]

Calculate the result of the SubKmeans cost function. Depends on the rotation and the scatter matrix. For each of the two subspaces j, it calculates P_j^T*V^T*S_j*V*P_j

Parameters:

X (np.ndarray) – the given data set

Returns:

costs – The total loss of this SubKmeans object

Return type:

float

calculate_mdl_costs(X: ndarray) (float, float, list)[source]

Calculate the MDL costs of this SubKmeans result.

Parameters:

X (np.ndarray) – the given data set

Returns:

tuple – The total costs (global costs + sum of subspace costs), The global costs, The subspace specific costs (one entry for each subspace)

Return type:

(float, float, list)

fit(X: ndarray, y: ndarray | None = None) SubKmeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the SubKmeans algorithm

Return type:

SubKmeans

plot_clustered_space(X: ndarray, labels: ndarray | None = None, plot_centers: bool = False, gt: ndarray | None = None, equal_axis=False) None[source]

Plot the clustered space identified by SubKmeans as a scatter matrix plot.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – the cluster labels used for coloring the plot. If none, the labels identified by the fit() function will be used (default: None)

  • plot_centers (bool) – defines whether the cluster centers should be plotted (default: False)

  • gt (np.ndarray) – the ground truth labels. In contrast to the labels parameter this will be displayed using different markers instead of colors (default: None)

  • equal_axis (bool) – defines whether the axes should be scaled equally (default: False)

transform_clustered_space(X: ndarray) ndarray[source]

Transform the input dataset with the orthonormal rotation matrix identified by the fit function and project it into the clustered space.

Parameters:

X (np.ndarray) – the given data set

Returns:

rotated_data – The rotated and projected dataset

Return type:

np.ndarray

transform_full_space(X: ndarray) ndarray[source]

Transform the input dataset with the orthonormal rotation matrix identified by the fit function.

Parameters:

X (np.ndarray) – the given data set

Returns:

rotated_data – The rotated dataset

Return type:

np.ndarray

clustpy.partition.xmeans module

@authors: Collin Leiber

class clustpy.partition.xmeans.XMeans(n_clusters_init: int = 2, max_n_clusters: int = inf, check_global_score: bool = True, allow_merging: bool = False, n_split_trials: int = 10, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Execute the XMeans clustering procedure. Determines the number of clusters by repeatedly splitting clusters using 2-Means and comparing the Bayesian Information Criterion (BIC) of the resulting clustering with that of the original cluster. A split is only kept if it improves the BIC score. See Pelleg and Moore for more information.

Parameters:
  • n_clusters_init (int) – The initial number of clusters. Can also be of type np.ndarray if initial cluster centers are specified (default: 2)

  • max_n_clusters (int) – Maximum number of clusters. Must be larger than n_clusters_init (default: np.inf)

  • check_global_score (bool) – Defines whether the global BIC score should be checked after the ‘Improve-Params’ step. Some implementations skip this step (default: True)

  • allow_merging (bool) –

    Try to merge clusters after the regular XMeans algorithm has terminated. See Ishioka et al. for more information.

    Normally, if allow_merging is True, check_global_score should be False (default: False)

  • n_split_trials (int) – Number of tries to split a cluster. For each try, 2-KMeans is executed with different cluster centers (default: 10)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

Examples

>>> import numpy as np
>>> from clustpy.partition import XMeans
>>> from sklearn.datasets import make_blobs
>>> from clustpy.utils import plot_with_transformation
>>> rs = np.random.RandomState(11)
>>> X, L = make_blobs(500, 2, centers=1, cluster_std=2, random_state=rs)
>>> X2, L2 = make_blobs(1000, 2, centers=4, cluster_std=0.5, random_state=rs)
>>> X = np.r_[X, X2]
>>> for b in [False, True]:
>>>     xm = XMeans(allow_merging=b, random_state=rs)
>>>     xm.fit(X)
>>>     plot_with_transformation(X, xm.labels_, xm.cluster_centers_)

References

Pelleg, Dan, and Andrew W. Moore. “X-means: Extending k-means with efficient estimation of the number of clusters.” Icml. Vol. 1. 2000.

and

Ishioka, Tsunenori. “An expansion of X-means for automatically determining the optimal number of clusters.” Proceedings of International Conference on Computational Intelligence. Vol. 2. 2005.

fit(X: ndarray, y: ndarray | None = None) XMeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the XMeans algorithm

Return type:

XMeans

Module contents

class clustpy.partition.DipExt(n_components: int | None = None, do_dip_scaling: bool = True, step_size: float = 0.1, momentum: float = 0.95, dip_threshold: float = 0.5, n_starting_vectors: int | None = None, ambiguous_triangle_strategy: str = 'ignore', random_state: RandomState | None = None)[source]

Bases: TransformerMixin, BaseEstimator

Execute the DipExt algorithm to reduce the number of features. To do so, it utilizes the gradient of the Dip-test of unimodality to perform gradient descent. The output features should show a high degree of multimodality.

Parameters:
  • n_components (int) – The number of components to extract. Can be None, in that case dip_threshold will be used to define the number of components (default: None)

  • do_dip_scaling (bool) – If true, the resulting feature space will be scaled by performing a min-max normalization for each feature and multiplying this feature by its dip-value (default: True)

  • step_size (float) – Step size used for gradient descent (default: 0.1)

  • momentum (float) – Momentum used for gradient descent (default: 0.95)

  • dip_threshold (float) – Defines the number of components if n_components is None. If an identified feature has a dip-value below the maximum dip-value times dip_threshold, DipExt will terminate (default: 0.5)

  • n_starting_vectors (int) – The number of starting vectors for gradient descent. Can be None, in that case it will be equal to log(data dimensionality) + 1 (default: None)

  • ambiguous_triangle_strategy (str) – The strategy with which to handle an ambiguous modal triangle. Can be ‘ignore’, ‘random’ or ‘all’. In the case of ‘random’, a valid triangle is created at random. In the case of ‘all’, the gradient is calculated for each possible triangle and the gradient whose subsequent result looks most promising is used - this strategy can increase the runtime noticeably (default: ‘ignore’)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int. Only used if ambiguous_triangle_strategy is ‘random’ (default: None)

dip_values_

The dip-value of each resulting feature

Type:

np.ndarray

projection_axes_

List containing the axes used for the transformation

Type:

list

argsorted_dips_

The indices of the sorted original dip-values (decreasing), needed to adjust the order of the features when using transform()

Type:

np.ndarray

References

Schelling, Benjamin, et al. “Utilizing Structure-rich Features to improve Clustering.” (2020). The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2020

fit(X: ndarray, y: ndarray | None = None) DipExt[source]

Retrieve the necessary projection axes to apply DipExt to any given data set.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – This instance of the DipExt algorithm

Return type:

DipExt

fit_transform(X: ndarray, y: ndarray = None) ndarray[source]

Initiate the actual dimensionality-reduction process on the input data set.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

subspace – The transformed feature space (Number of samples x number of components)

Return type:

np.ndarray

transform(X: ndarray) ndarray[source]

Apply the transformation to the given data set using the projection axes found by DipExt.

Parameters:

X (np.ndarray) – the given data set

Returns:

subspace – The transformed feature space (Number of samples x number of components)

Return type:

np.ndarray

class clustpy.partition.DipInit(n_clusters: int, n_components: int | None = None, do_dip_scaling: bool = True, step_size: float = 0.1, momentum: float = 0.95, dip_threshold: float = 0.5, n_starting_vectors: int | None = None, ambiguous_triangle_strategy: str = 'ignore', random_state: RandomState | None = None)[source]

Bases: DipExt, BaseEstimator, ClusterMixin

Execute the DipInit clustering procedure. Initially, DipExt is executed to identify relevant features. Next, KMeans with initial cluster centers is used to identify high-quality cluster labels. To get the coordinates of the initial cluster centers, DipInit uses the features identified by DipExt one after another. In the first iteration the centers will be equally distributed in a one-dimensional space by using the ids of the objects. Thereafter, the algorithm adds additional features and uses the current cluster labels to also add another coordinate to the centers.

Parameters:
  • n_clusters (int) – The number of clusters

  • n_components (int) – The number of components to extract. Can be None, in that case dip_threshold will be used to define the number of components (default: None)

  • do_dip_scaling (bool) – If true, the resulting feature space will be scaled by performing a min-max normalization for each feature and multiplying this feature by its dip-value (default: True)

  • step_size (float) – Step size used for gradient descent (default: 0.1)

  • momentum (float) – Momentum used for gradient descent (default: 0.95)

  • dip_threshold (float) – Defines the number of components if n_components is None. If an identified feature has a dip-value below the maximum dip-value times dip_threshold, DipExt will terminate (default: 0.5)

  • n_starting_vectors (int) – The number of starting vectors for gradient descent. Can be None, in that case it will be equal to log(data dimensionality) + 1 (default: None)

  • ambiguous_triangle_strategy (str) – The strategy with which to handle an ambiguous modal triangle. Can be ‘ignore’, ‘random’ or ‘all’. In the case of ‘random’, a valid triangle is created at random. In the case of ‘all’, the gradient is calculated for each possible triangle and the gradient whose subsequent result looks most promising is used - this strategy can increase the runtime noticeably (default: ‘ignore’)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int. Only used if ambiguous_triangle_strategy is ‘random’ (default: None)

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

References

Schelling, Benjamin, et al. “Utilizing Structure-rich Features to improve Clustering.” (2020). The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2020

fit(X: ndarray, y: ndarray | None = None) DipInit[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DipInit algorithm

Return type:

DipInit

class clustpy.partition.DipMeans(significance: float = 0.001, split_viewers_threshold: float = 0.01, pval_strategy: str = 'table', n_boots: int = 1000, n_split_trials: int = 10, n_clusters_init: int = 1, max_n_clusters: int = inf, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Execute the DipMeans clustering procedure. In contrast to other algorithms (e.g. KMeans), DipMeans is able to identify the number of clusters by itself. To do so, it uses the dip-dist criterion. It calculates the dip-value of the distances from each point within a cluster to all other points in that cluster and checks how many points are assigned a dip-value below the threshold. If the amount of these so-called split viewers is above the split_viewers_threshold, the cluster will be split using 2-Means. The algorithm terminates once all clusters show a unimodal behaviour.

Parameters:
  • significance (float) – Threshold to decide if the result of the dip-test is unimodal or multimodal (default: 0.001)

  • split_viewers_threshold (float) – Threshold to decide whether a cluster has a unimodal or multimodal structure. Must be within [0, 1] (default: 0.01)

  • pval_strategy (str) – Defines which strategy to use to receive dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)

  • n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)

  • n_split_trials (int) – Number of tries to split a cluster. For each try, 2-KMeans is executed with different cluster centers (default: 10)

  • n_clusters_init (int) – The initial number of clusters. Can also be of type np.ndarray if initial cluster centers are specified (default: 1)

  • max_n_clusters (int) – Maximum number of clusters. Must be larger than n_clusters_init (default: np.inf)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

References

Kalogeratos, Argyris, and Aristidis Likas. “Dip-means: an incremental clustering method for estimating the number of clusters.” Advances in neural information processing systems. 2012.

fit(X: ndarray, y: ndarray | None = None) DipMeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DipMeans algorithm

Return type:

DipMeans

class clustpy.partition.DipNSub(significance: float = 0.01, threshold: float = 0.15, step_size: float = 0.1, momentum: float = 0.95, n_starting_vectors: int | None = None, add_tails=True, outliers=False, consider_duplicates: bool = False, random_state: RandomState | None = None, debug=False)[source]

Bases: object

Execute the Dip’n’Sub clustering procedure. It searches for projection axes on which as many samples as possible are part of multimodal clusters. To do so, it uses the gradient of the p-value of the Dip-test to perform stochastic gradient descent. On the identified axes TailoredDip, an extension of UniDip, will be executed to assign cluster labels. Note that clusters will always form a hypercube in the resulting subspace.

Parameters:
  • significance (float) – Threshold to decide if the result of the dip-test is unimodal or multimodal (default: 0.01)

  • threshold (float) – Defines the fraction of objects that must be contained in multimodal clusters for the algorithm to continue (default: 0.15)

  • step_size (float) – Step size used for gradient descent (default: 0.1)

  • momentum (float) – Momentum used for gradient descent (default: 0.95)

  • n_starting_vectors (int) – The number of starting vectors for gradient descent. Can be None, in that case it will be equal to np.log(data dimensionality) + 1 (default: None)

  • add_tails (bool) – Defines if TailoredDip should try to add tails to the surrounding clusters (default: True)

  • outliers (bool) – Defines if outliers should be identified as described by UniDip (default: False)

  • consider_duplicates (bool) – If multiple instances on the projection axis share a value, the gradient is ambiguous. If those duplicate values should be considered, a random instance will be chosen for further calculations. Beware: The calculation will no longer be deterministic (default: False)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • debug (bool) – If true, additional information will be printed to the console (default: False)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

subspace_

The resulting subspace

Type:

np.ndarray

References

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.

fit(X: ndarray, y: ndarray | None = None) DipNSub[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DipNSub algorithm

Return type:

DipNSub

class clustpy.partition.GMeans(significance: float = 0.001, n_clusters_init: int = 1, max_n_clusters: int = inf, n_split_trials: int = 10, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Execute the GMeans clustering procedure. Determines the number of clusters by repeatedly trying to split a cluster into two clusters. To do so, the data is projected onto the axis connecting the two resulting centers. If the Anderson-Darling test does not indicate a Gaussian distribution for this projection, the new clusters are retained. Otherwise, the cluster remains as it was originally. This is repeated until no cluster changes anymore.

Parameters:
  • significance (float) – Threshold to decide if the result of the Anderson-Darling test indicates a Gaussian distribution (default: 0.001)

  • n_clusters_init (int) – The initial number of clusters. Can also be of type np.ndarray if initial cluster centers are specified (default: 1)

  • max_n_clusters (int) – Maximum number of clusters. Must be larger than n_clusters_init (default: np.inf)

  • n_split_trials (int) – Number of tries to split a cluster. For each try, 2-KMeans is executed with different cluster centers (default: 10)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray
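
Examples

A minimal usage sketch; the synthetic blob parameters are illustrative assumptions:

>>> from clustpy.partition import GMeans
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 2, centers=4, cluster_std=0.7)
>>> gm = GMeans()
>>> gm.fit(X)
>>> print(gm.n_clusters_)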

References

Hamerly, Greg, and Charles Elkan. “Learning the k in k-means.” Advances in neural information processing systems. 2004.

and

D’Agostino, Ralph B., and Michael A. Stephens. “Goodness-of-fit techniques.” Statistics: Textbooks and Monographs (1986).

fit(X: ndarray, y: ndarray | None = None) GMeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the GMeans algorithm

Return type:

GMeans

class clustpy.partition.GapStatistic(min_n_clusters: int = 1, max_n_clusters: int = 10, n_boots: int = 10, use_principal_components: bool = True, use_log: bool = True, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Estimate the number of clusters for KMeans using the Gap Statistic. The Gap Statistic is calculated by comparing the within-cluster dispersion of the input data set with that of randomly sampled data, and is evaluated for multiple numbers of clusters. The first clustering result that fulfills the Gap condition ‘Gap(k) >= Gap(k+1)-s_{k+1}’ will be returned. Beware: the result can be None if no clustering result fulfills that condition!

Parameters:
  • min_n_clusters (int) – Minimum number of clusters. Must be smaller than max_n_clusters (default: 1)

  • max_n_clusters (int) – Maximum number of clusters. Must be larger than min_n_clusters (default: 10)

  • n_boots (int) – Number of random data sets that should be created to calculate Gap Statistic (default: 10)

  • use_principal_components (bool) – True, if the random data sets should be created using the feature-wise minimum and maximum value of the Principal Components. Else, the minimum and maximum value of the regular data set will be used (default: True)

  • use_log (bool) – True, if the logarithm of the within cluster dispersion should be used. For more information see Mohajer et al. (default: True)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The first number of clusters that fulfills the Gap condition (can be None)

Type:

int

labels_

The labels as identified by the Gap Statistic (can be None)

Type:

np.ndarray

cluster_centers_

The cluster centers as identified by the Gap Statistic (can be None)

Type:

np.ndarray

gaps_

The Gap values

Type:

np.ndarray

sks_

The sk values

Type:

np.ndarray

Examples

>>> from clustpy.partition import GapStatistic
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 2, centers=7, cluster_std=0.7)
>>> gs = GapStatistic()
>>> gs.fit(X)
>>> print(gs.n_clusters_)
>>> gs.plot()

References

Tibshirani, Robert, Guenther Walther, and Trevor Hastie. “Estimating the number of clusters in a data set via the gap statistic.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63.2 (2001): 411-423.

and

Mohajer, Mojgan, Karl-Hans Englmeier, and Volker J. Schmid. “A comparison of Gap statistic definitions with and without logarithm function.” arXiv preprint arXiv:1103.4767 (2011).

fit(X: ndarray, y: ndarray | None = None) GapStatistic[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the GapStatistic algorithm

Return type:

GapStatistic

plot() None[source]

Plot the result of the Gap Statistic. Shows the number of clusters on the x-axis and the Gap values on the y-axis.

class clustpy.partition.LDAKmeans(n_clusters: int, n_dims: int | None = None, max_iter: int = 300, n_init: int = 1, kmeans_repetitions: int = 10, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Execute the LDA-Kmeans clustering procedure. The initial rotation is normally given by the (n_clusters-1) components of a PCA. Afterward, Kmeans and LDA are executed one after the other until the labels do not change anymore. Kmeans always takes place in the rotated subspace.

Parameters:
  • n_clusters (int) – the number of clusters

  • n_dims (int) – The number of features in the resulting subspace. If None this will be equal to n_clusters - 1 (default: None)

  • max_iter (int) – the maximum number of iterations (default: 300)

  • n_init (int) – number of times LDAKmeans is executed using different seeds. The final result will be the one with the lowest costs (default: 1)

  • kmeans_repetitions (int) – Number of repetitions when executing KMeans. For more information see sklearn.cluster.KMeans (default: 10)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels

Type:

np.ndarray

rotation_

The final rotation matrix

Type:

np.ndarray

cluster_centers_

The cluster centers in the subspace

Type:

np.ndarray

error_

The final error (KMeans error in the subspace)

Type:

float
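
Examples

A minimal sketch, assuming synthetic blobs; with the default n_dims, transform_subspace() should return a subspace with n_clusters - 1 features:

>>> from clustpy.partition import LDAKmeans
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 10, centers=3)
>>> lk = LDAKmeans(n_clusters=3)
>>> lk.fit(X)
>>> X_sub = lk.transform_subspace(X)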

References

Ding, Chris, and Tao Li. “Adaptive dimension reduction using discriminant analysis and k-means clustering.” Proceedings of the 24th international conference on Machine learning. 2007.

fit(X: ndarray, y: ndarray | None = None) LDAKmeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels are contained in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the LDAKmeans algorithm

Return type:

LDAKmeans

transform_subspace(X: ndarray) ndarray[source]

Transform the input dataset with the rotation matrix identified by the fit function.

Parameters:

X (np.ndarray) – the given data set

Returns:

rotated_data – The rotated data set

Return type:

np.ndarray

class clustpy.partition.PGMeans(significance: float = 0.001, n_projections: int | None = None, n_samples: int | None = None, n_new_centers: int = 10, amount_random_centers: float = 0.5, n_clusters_init: int = 1, max_n_clusters: int = inf, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Execute the PGMeans clustering procedure. Determines the number of clusters by executing the EM algorithm multiple times to create different Gaussian Mixture Models (GMMs). Each GMM is projected onto random projection axes, and the Kolmogorov-Smirnov test is used to decide whether the projected data matches the fitted model. If this is not the case, a new center will be added and the next GMM will be fitted. This is repeated until no clusters are added anymore.

Parameters:
  • significance (float) – Threshold to decide if the result of the Kolmogorov-Smirnov test indicates a Gaussian Mixture Model (default: 0.001)

  • n_projections (int) – Number of projection axes to test different projected GMMs on. Can be None, in that case it will be set to: -2.6198 * log(significance) (default: None)

  • n_samples (int) – Number of samples generated from the fitted GMM and used to execute the Kolmogorov-Smirnov test. If it is chosen larger than the number of data samples, it will be equal to this value. Can be None, in that case it will be set to: 3 / significance (default: None)

  • n_new_centers (int) – Number of centers to test when a new center should be added to the current GMM model. The additional center producing the best new GMM will be used for subsequent iterations (default: 10)

  • amount_random_centers (float) – Fraction of random centers tested. Must be a value in the range [0, 1]. In total, (n_new_centers * amount_random_centers) random centers will be tested. The other possible centers will be chosen based on the probability densities of the current GMM model (default: 0.5)

  • n_clusters_init (int) – The initial number of clusters. Can also be of type np.ndarray if initial cluster centers are specified (default: 1)

  • max_n_clusters (int) – Maximum number of clusters. Must be larger than n_clusters_init (default: np.inf)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray
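
Examples

A minimal usage sketch on synthetic data; the chosen blob layout is an illustrative assumption:

>>> from clustpy.partition import PGMeans
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 2, centers=5)
>>> pg = PGMeans()
>>> pg.fit(X)
>>> print(pg.n_clusters_)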

References

Feng, Yu, and Greg Hamerly. “PG-means: learning the number of clusters in data.” Advances in neural information processing systems. 2007.

fit(X: ndarray, y: ndarray | None = None) PGMeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the PGMeans algorithm

Return type:

PGMeans

class clustpy.partition.ProjectedDipMeans(significance: float = 0.001, n_random_projections: int = 0, pval_strategy: str = 'table', n_boots: int = 1000, n_split_trials: int = 10, n_clusters_init: int = 1, max_n_clusters: int = inf, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin

Execute the Projected DipMeans clustering procedure. It repeatedly creates random projection axes for each cluster and tests whether the data projected onto each axis is unimodal. If the probability of unimodality is below significance, the cluster will be split using 2-Means. The algorithm terminates when all clusters show a unimodal behaviour on all projection axes.

Parameters:
  • significance (float) – Threshold to decide if the result of the dip-test is unimodal or multimodal (default: 0.001)

  • n_random_projections (int) – Number of random projections that should be applied in addition to the original features and the components from a PCA (default: 0)

  • pval_strategy (str) – Defines which strategy to use to receive dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)

  • n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)

  • n_split_trials (int) – Number of tries to split a cluster. For each try, 2-KMeans is executed with different cluster centers (default: 10)

  • n_clusters_init (int) – The initial number of clusters. Can also be of type np.ndarray if initial cluster centers are specified (default: 1)

  • max_n_clusters (int) – Maximum number of clusters. Must be larger than n_clusters_init (default: np.inf)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray
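
Examples

A minimal usage sketch; the synthetic data set is an illustrative assumption:

>>> from clustpy.partition import ProjectedDipMeans
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 3, centers=4)
>>> pdm = ProjectedDipMeans()
>>> pdm.fit(X)
>>> print(pdm.n_clusters_)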

References

Chamalis, Theofilos, and Aristidis Likas. “The projected dip-means clustering algorithm.” Proceedings of the 10th Hellenic Conference on Artificial Intelligence. 2018.

fit(X: ndarray, y: ndarray | None = None) ProjectedDipMeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the ProjectedDipMeans algorithm

Return type:

ProjectedDipMeans

class clustpy.partition.SkinnyDip(significance: float = 0.05, pval_strategy: str = 'table', n_boots: int = 1000, add_tails: bool = False, outliers: bool = True, max_cluster_size_diff_factor: float = 2, random_state: RandomState | None = None, debug: bool = False)[source]

Bases: BaseEstimator, ClusterMixin

Execute the SkinnyDip clustering procedure. This approach iteratively executes the univariate clustering algorithm UniDip on each feature. The result consists of clusters shaped as hypercubes, which have a much higher density than the surrounding noise. See UniDip for more information.

Parameters:
  • significance (float) – Threshold to decide if the result of the dip-test is unimodal or multimodal (default: 0.05)

  • pval_strategy (str) – Defines which strategy to use to receive dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)

  • n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)

  • add_tails (bool) – Defines if TailoredDip should try to add tails to the surrounding clusters (default: False)

  • outliers (bool) – Defines if outliers should be identified as described by UniDip (default: True)

  • max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for merging and assigning tails of distributions if ‘add_tails’ is True (default: 2)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int. Only relevant if pval_strategy is ‘bootstrap’ (default: None)

  • debug (bool) – If true, additional information will be printed to the console (default: False)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

Examples

>>> from clustpy.partition import SkinnyDip
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 3, centers=5)
>>> sk = SkinnyDip()
>>> sk.fit(X)

References

Maurus, Samuel, and Claudia Plant. “Skinny-dip: clustering in a sea of noise.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.

and

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.

fit(X: ndarray, y: ndarray | None = None) SkinnyDip[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the SkinnyDip algorithm

Return type:

SkinnyDip

class clustpy.partition.SpecialK(significance: float = 0.01, n_dimensions: int = 200, similarity_matrix: str = 'NAM', n_neighbors: int = 10, percentage: float = 0.99, n_cluster_pairs_to_consider: int = 10, max_n_clusters: int = inf, random_state: RandomState | None = None, debug: bool = False)[source]

Bases: BaseEstimator, ClusterMixin

Execute the SpecialK clustering procedure. SpecialK is able to autonomously identify a suitable number of clusters for spectral clustering procedures. To this end, it uses probability bounds on the operator norm of centered, symmetric decompositions based on the matrix Bernstein concentration inequality. It iteratively increases the number of clusters until the probability that the objects originate from two instead of one distribution does not increase further. The procedure can be based on an epsilon-neighborhood adjacency matrix or the symmetrically normalized adjacency matrix of the kNN graph.

Parameters:
  • significance (float) – Threshold to decide if the samples originate from a single distribution or two distributions (default: 0.01)

  • n_dimensions (int) – Dimensionality of the embedding (default: 200)

  • similarity_matrix (str) – Defines the similarity matrix to use. Can be one of the following strings or a numpy array / scipy sparse csr matrix. If ‘NAM’, a neighborhood adjacency matrix is used. If ‘SAM’ a symmetrically normalized adjacency matrix is used (default: ‘NAM’)

  • n_neighbors (int) – Number of neighbors for the construction of the similarity matrix. Does not include the object itself (default: 10)

  • percentage (float) – The amount of data points that should have at least n_neighbors neighbors. Only relevant if similarity_matrix is ‘NAM’ (default: 0.99)

  • n_cluster_pairs_to_consider (int) – The number of cluster pairs to consider when calculating the probability boundary; the cluster pairs responsible for the highest cut are used. If None, all pairs will be considered. Smaller values of n_cluster_pairs_to_consider will decrease the computing time (default: 10)

  • max_n_clusters (int) – Maximum number of clusters (default: np.inf)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution (default: None)

  • debug (bool) – If true, additional information will be printed to the console (default: False)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray
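
Examples

A minimal sketch, assuming synthetic blobs and the default ‘NAM’ similarity matrix:

>>> from clustpy.partition import SpecialK
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 2, centers=3)
>>> spk = SpecialK()
>>> spk.fit(X)
>>> print(spk.n_clusters_)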

References

Hess, Sibylle, and Wouter Duivesteijn. “k is the magic number—inferring the number of clusters through nonparametric concentration inequalities.” Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I. Springer International Publishing, 2020.

fit(X: ndarray, y: ndarray | None = None) SpecialK[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the SpecialK algorithm

Return type:

SpecialK

class clustpy.partition.SubKmeans(n_clusters: int, V: ndarray | None = None, m: int | None = None, cluster_centers: ndarray | None = None, mdl_for_noisespace: bool = False, outliers: bool = False, max_iter: int = 300, n_init: int = 1, cost_type: str = 'default', threshold_negative_eigenvalue: float = -1e-07, max_distance: float | None = None, precision: float | None = None, random_state: RandomState | None = None, debug: bool = False)[source]

Bases: BaseEstimator, ClusterMixin

The Subspace Kmeans (SubKmeans) algorithm. The algorithm simultaneously searches for cluster assignments and an optimal subspace regarding those assignments. Here, the feature space is divided into a clustered space containing the actual clustering result and a noise space, which is a special subspace containing a single cluster. The rotation and dimensions of the clustered space are defined through an eigenvalue decomposition.

This implementation includes some extensions from ‘Automatic Parameter Selection for Non-Redundant Clustering’.

Parameters:
  • n_clusters (int) – the number of clusters

  • V (np.ndarray) – the orthonormal rotation matrix (default: None)

  • m (int) – the initial dimensionality of the clustered space (default: None)

  • cluster_centers (np.ndarray) – list containing the initial cluster centers (default: None)

  • mdl_for_noisespace (bool) – defines if MDL should be used to identify noise space dimensions instead of only considering negative eigenvalues (default: False)

  • outliers (bool) – defines if outliers should be identified through MDL (default: False)

  • max_iter (int) – maximum number of iterations for the algorithm (default: 300)

  • n_init (int) – number of times SubKmeans is executed using different seeds. The final result will be the one with the lowest costs. Costs can be the standard SubKmeans costs or MDL costs (defined by the cost_type parameter) (default: 1)

  • cost_type (str) – Can be “default” or “mdl” and defines whether the standard SubKmeans cost function or MDL costs should be considered to identify the best result. Only relevant if n_init is larger than 1 (default: “default”)

  • threshold_negative_eigenvalue (float) – threshold to consider an eigenvalue as negative. Used for the update of the subspace dimensions (default: -1e-7)

  • max_distance (float) – distance used to encode cluster centers and outliers. Only relevant if an MDL strategy is used (default: None)

  • precision (float) – precision used to convert probability densities to actual probabilities. Only relevant if an MDL strategy is used (default: None)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • debug (bool) – If true, additional information will be printed to the console (default: False)

labels_

The final labels

Type:

np.ndarray

scatter_matrix_

The final scatter matrix

Type:

np.ndarray
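
Examples

A minimal sketch, assuming synthetic blobs; transform_clustered_space() projects the data into the clustered space identified during fit():

>>> from clustpy.partition import SubKmeans
>>> from sklearn.datasets import make_blobs
>>> X, L = make_blobs(1000, 4, centers=3)
>>> sub_km = SubKmeans(n_clusters=3)
>>> sub_km.fit(X)
>>> X_clustered = sub_km.transform_clustered_space(X)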

References

Mautz, Dominik, et al. “Towards an optimal subspace for k-means.” Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017.

and

Leiber, Collin, et al. “Automatic Parameter Selection for Non-Redundant Clustering.” Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2022.

calculate_cost_function(X: ndarray) float[source]

Calculate the result of the SubKmeans cost function. Depends on the rotation and the scatter matrix. Calculates for both subspaces j: P_j^T*V^T*S_j*V*P_j

Parameters:

X (np.ndarray) – the given data set

Returns:

costs – The total loss of this SubKmeans object

Return type:

float

calculate_mdl_costs(X: ndarray) (float, float, list)[source]

Calculate the MDL costs of this SubKmeans result.

Parameters:

X (np.ndarray) – the given data set

Returns:

tuple – The total costs (global costs + sum of subspace costs), the global costs, the subspace-specific costs (one entry for each subspace)

Return type:

(float, float, list)

fit(X: ndarray, y: ndarray | None = None) SubKmeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the SubKmeans algorithm

Return type:

SubKmeans

plot_clustered_space(X: ndarray, labels: ndarray | None = None, plot_centers: bool = False, gt: ndarray | None = None, equal_axis=False) None[source]

Plot the clustered space identified by SubKmeans as a scatter matrix plot.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – the cluster labels used for coloring the plot. If None, the labels identified by the fit() function will be used (default: None)

  • plot_centers (bool) – defines whether the cluster centers should be plotted (default: False)

  • gt (np.ndarray) – the ground truth labels. In contrast to the labels parameter, this will be displayed using different markers instead of colors (default: None)

  • equal_axis (bool) – defines whether the axes should be scaled equally (default: False)

transform_clustered_space(X: ndarray) ndarray[source]

Transform the input dataset with the orthonormal rotation matrix identified by the fit function and project it into the clustered space.

Parameters:

X (np.ndarray) – the given data set

Returns:

rotated_data – The rotated and projected dataset

Return type:

np.ndarray

transform_full_space(X: ndarray) ndarray[source]

Transform the input dataset with the orthonormal rotation matrix identified by the fit function.

Parameters:

X (np.ndarray) – the given data set

Returns:

rotated_data – The rotated dataset

Return type:

np.ndarray

class clustpy.partition.UniDip(significance: float = 0.05, pval_strategy: str = 'table', n_boots: int = 1000, add_tails: bool = False, outliers: bool = True, max_cluster_size_diff_factor: float = 2, random_state: RandomState | None = None, debug: bool = False)[source]

Bases: BaseEstimator, ClusterMixin

Execute the UniDip clustering procedure. This univariate clustering algorithm recursively uses the Dip-test of unimodality to check if a set of samples is distributed uni- or multimodally. If the test indicates multimodality, the data will be split into three sets: left of the main mode, the main mode itself and right of the main mode. Each set will itself be tested by the Dip-test to see if it has to be further subdivided.

This implementation includes the extensions made by TailoredDip (see Bauer et al.).

Parameters:
  • significance (float) – Threshold to decide if the result of the dip-test is unimodal or multimodal (default: 0.05)

  • pval_strategy (str) – Defines which strategy to use to receive dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)

  • n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)

  • add_tails (bool) – Defines if TailoredDip should try to add tails to the surrounding clusters (default: False)

  • outliers (bool) – Defines if outliers should be identified as described by UniDip (default: True)

  • max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for merging and assigning tails of distributions if ‘add_tails’ is True (default: 2)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int. Only relevant if pval_strategy is ‘bootstrap’ (default: None)

  • debug (bool) – If true, additional information will be printed to the console (default: False)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_boundaries_

List of tuples containing the index of the first sample in a cluster and the index of the first sample that is no longer part of that cluster

Type:

list
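
Examples

A minimal sketch on a one-dimensional sample; the two-component Gaussian mixture is an illustrative assumption:

>>> import numpy as np
>>> from clustpy.partition import UniDip
>>> rs = np.random.RandomState(1)
>>> X = np.r_[rs.normal(-4, 1, 300), rs.normal(4, 1, 300)]
>>> ud = UniDip()
>>> ud.fit(X)
>>> print(ud.n_clusters_)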

References

Maurus, Samuel, and Claudia Plant. “Skinny-dip: clustering in a sea of noise.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.

and

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.

fit(X: ndarray, y: ndarray | None = None) UniDip[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the UniDip algorithm

Return type:

UniDip

class clustpy.partition.XMeans(n_clusters_init: int = 2, max_n_clusters: int = inf, check_global_score: bool = True, allow_merging: bool = False, n_split_trials: int = 10, random_state: RandomState | None = None)[source]

Bases: BaseEstimator, ClusterMixin
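
Execute the XMeans clustering procedure. Determines the number of clusters by repeatedly trying to split a cluster into two clusters using 2-Means and keeping the split if it improves the Bayesian Information Criterion (BIC). See Pelleg and Moore for more information.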

Parameters:
  • n_clusters_init (int) – The initial number of clusters. Can also be of type np.ndarray if initial cluster centers are specified (default: 2)

  • max_n_clusters (int) – Maximum number of clusters. Must be larger than n_clusters_init (default: np.inf)

  • check_global_score (bool) – Defines whether the global BIC score should be checked after the ‘Improve-Params’ step. Some implementations skip this step (default: True)

  • allow_merging (bool) –

    Try to merge clusters after the regular XMeans algorithm terminated. See Ishioka et al. for more information.

    Normally, if allow_merging is True, check_global_score should be False (default: False)

  • n_split_trials (int) – Number of tries to split a cluster. For each try, 2-KMeans is executed with different cluster centers (default: 10)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

Examples

>>> import numpy as np
>>> from clustpy.partition import XMeans
>>> from sklearn.datasets import make_blobs
>>> from clustpy.utils import plot_with_transformation
>>> rs = np.random.RandomState(11)
>>> X, L = make_blobs(500, 2, centers=1, cluster_std=2, random_state=rs)
>>> X2, L2 = make_blobs(1000, 2, centers=4, cluster_std=0.5, random_state=rs)
>>> X = np.r_[X, X2]
>>> for b in [False, True]:
>>>     xm = XMeans(allow_merging=b, random_state=rs)
>>>     xm.fit(X)
>>>     plot_with_transformation(X, xm.labels_, xm.cluster_centers_)

References

Pelleg, Dan, and Andrew W. Moore. “X-means: Extending k-means with efficient estimation of the number of clusters.” ICML. Vol. 1. 2000.

and

Ishioka, Tsunenori. “An expansion of X-means for automatically determining the optimal number of clusters.” Proceedings of International Conference on Computational Intelligence. Vol. 2. 2005.

fit(X: ndarray, y: ndarray | None = None) XMeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the XMeans algorithm

Return type:

XMeans