clustpy.deep package

Submodules

clustpy.deep.aec module

@authors: Collin Leiber

class clustpy.deep.aec.AEC(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: collections.abc.Callable | torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, neural_network: torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: sklearn.base.ClusterMixin = None, initial_clustering_params: dict = None, device: torch.device = None, random_state: numpy.random.mtrand.RandomState | int = None)

Bases: _AbstractDeepClusteringAlgo

The Auto-encoder Based Data Clustering (AEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, the initial cluster labels are obtained from initial_clustering_class (random labels are used if it is None, which is the default). Last, the network will be optimized using the AEC loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if an initial_clustering_class is given that can determine the number of clusters itself, e.g. DBSCAN (default: 8)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {"lr": 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {"lr": 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining. If this is None, random labels will be used (default: None)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import AEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> aec = AEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> aec.fit(data)

References

Song, Chunfeng, et al. “Auto-encoder based data clustering.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 18th Iberoamerican Congress, CIARP 2013, Havana, Cuba, November 20-23, 2013, Proceedings, Part I 18. Springer Berlin Heidelberg, 2013.

fit(X: ndarray, y: ndarray = None) → AEC

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the AEC algorithm

Return type:

AEC

set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') → AEC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.

Returns:

self – The updated object.

Return type:

object

clustpy.deep.dcn module

@authors: Lukas Miklautz, Dominik Mautz

class clustpy.deep.dcn.DCN(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: collections.abc.Callable | torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, neural_network: torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: torch.device = None, random_state: numpy.random.mtrand.RandomState | int = None)

Bases: _AbstractDeepClusteringAlgo

The Deep Clustering Network (DCN) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DCN loss function.
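As a rough illustration of the kind of objective the description above refers to, the following sketch combines a reconstruction term with a k-means-style term that pulls each embedded sample toward its assigned cluster center, in the spirit of Yang et al. (2017). It is plain Python and not clustpy's implementation: the helper names squared_distance and dcn_style_loss are hypothetical, and both the mean reduction and the weighted sum of the two terms via ssl_loss_weight and clustering_loss_weight are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dcn_style_loss(inputs, reconstructions, embeddings, centers, labels,
                   ssl_loss_weight=1.0, clustering_loss_weight=0.1):
    """Weighted sum of a reconstruction term and a k-means-style term.

    Hypothetical helper: the reduction and weighting actually used by the
    library are assumptions in this sketch.
    """
    n = len(inputs)
    # Self-supervised term: mean squared reconstruction error per sample.
    ssl_loss = sum(
        squared_distance(x, r) / len(x) for x, r in zip(inputs, reconstructions)
    ) / n
    # Clustering term: squared distance of each embedded sample to the
    # center of the cluster it is currently assigned to.
    clustering_loss = sum(
        squared_distance(z, centers[k]) for z, k in zip(embeddings, labels)
    ) / n
    return ssl_loss_weight * ssl_loss + clustering_loss_weight * clustering_loss


# Toy data: two samples, a one-dimensional embedding, two cluster centers.
inputs = [[0.0, 0.0], [2.0, 2.0]]
reconstructions = [[0.0, 0.0], [2.0, 0.0]]
embeddings = [[0.0], [3.0]]
centers = [[0.0], [2.0]]
labels = [0, 1]
loss = dcn_style_loss(inputs, reconstructions, embeddings, centers, labels)
print(loss)
```

With the default weights, the reconstruction term dominates, so the clustering term acts as a regularizer on the embedding rather than replacing the autoencoder objective.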

Parameters:
  • n_clusters (int) – number of clusters. Can be None if an initial_clustering_class is given that can determine the number of clusters itself, e.g. DBSCAN (default: 8)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {"lr": 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {"lr": 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dcn_labels_

The final DCN labels

Type:

np.ndarray

dcn_cluster_centers_

The final DCN cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DCN
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dcn = DCN(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dcn.fit(data)

References

Yang, Bo, et al. “Towards k-means-friendly spaces: Simultaneous deep learning and clustering.” International conference on machine learning. PMLR, 2017.

fit(X: ndarray, y: ndarray = None) → DCN

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DCN algorithm

Return type:

DCN

set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') → DCN

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.

Returns:

self – The updated object.

Return type:

object

clustpy.deep.ddc_n2d module

@authors: Collin Leiber

class clustpy.deep.ddc_n2d.DDC(ratio: float = 0.1, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: collections.abc.Callable | torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, tsne_params: dict = None, device: torch.device = None, random_state: numpy.random.mtrand.RandomState | int = None)

Bases: _AbstractDeepClusteringAlgo

The Deep Density-based Image Clustering (DDC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, t-SNE is executed on the embedded data and a variant of the Density Peak Clustering algorithm is executed.

Parameters:
  • ratio (float) – The ratio parameter, defining the cutoff distance d_c by calculating: average pairwise distance * ratio (default: 0.1)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {"lr": 1e-3} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • tsne_params (dict) – Parameters for the t-SNE execution. For example, perplexity can be changed by setting tsne_params to {"n_components": 2, "perplexity": 25}. Check out sklearn.manifold.TSNE for more information. If None, it will be set to {"n_components": 2} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels (obtained by a variant of Density Peak Clustering)

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

tsne_

The t-SNE object

Type:

TSNE

n_features_in_

the number of features used for the fitting

Type:

int

cluster_centers_

The final cluster centers defined as the mean of assigned samples within the AE embedding

Type:

np.ndarray

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DDC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> ddc = DDC(pretrain_epochs=3)
>>> ddc.fit(data)

References

Ren, Yazhou, et al. “Deep density-based image clustering.” Knowledge-Based Systems 197 (2020): 105841.

fit(X: ndarray, y: ndarray = None) → DDC

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DDC algorithm

Return type:

DDC

predict(X: ndarray) → ndarray

Predicts the labels of the input data. Note that this is only a rough estimate, because the manifold technique does not learn a function f() that maps new data into the final embedding. The prediction is therefore obtained by assigning each sample to the closest cluster mean within the embedding of the AE.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray
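The nearest-mean rule described for predict can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's code: the helper names cluster_means and predict_nearest_mean are hypothetical, the inputs are assumed to be samples that were already passed through the autoencoder, and Euclidean distance is assumed.

```python
import math


def cluster_means(embedded, labels):
    """Mean of the embedded samples assigned to each cluster label."""
    sums, counts = {}, {}
    for z, k in zip(embedded, labels):
        if k not in sums:
            sums[k] = [0.0] * len(z)
            counts[k] = 0
        sums[k] = [s + v for s, v in zip(sums[k], z)]
        counts[k] += 1
    return {k: [s / counts[k] for s in sums[k]] for k in sums}


def predict_nearest_mean(embedded_new, means):
    """Assign each new embedded sample to the label of the closest mean."""
    return [
        min(means, key=lambda k: math.dist(z, means[k]))
        for z in embedded_new
    ]


# Toy embedding with two well-separated clusters.
embedded = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
labels = [0, 0, 1, 1]
means = cluster_means(embedded, labels)
predicted = predict_nearest_mean([[1.0, 1.0], [9.0, 9.0]], means)
print(means, predicted)
```

Because the rule works in the autoencoder embedding and not in the t-SNE space where the clusters were actually found, predictions for points near cluster borders can disagree with labels_.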

set_predict_request() → DDC

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.ddc_n2d.DDC_density_peak_clustering(ratio: float)

Bases: ClusterMixin, BaseEstimator

A variant of the Density Peak Algorithm as proposed in the DDC paper.

Parameters:

ratio (float) – The ratio parameter, defining the cutoff distance d_c by calculating: average pairwise distance * ratio
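To make the role of ratio concrete, here is a small sketch of the cutoff computation stated above (average pairwise distance * ratio). It is not taken from the library: the helper name cutoff_distance is hypothetical, Euclidean distance is assumed, and the average is assumed to run over all unordered pairs of distinct points.

```python
import math
from itertools import combinations


def cutoff_distance(points, ratio):
    """Cutoff d_c = (average pairwise Euclidean distance) * ratio.

    Assumption: the average is taken over all unordered pairs of
    distinct points.
    """
    pairs = list(combinations(points, 2))
    average = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return average * ratio


# Three collinear points with pairwise distances 5, 10 and 5.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d_c = cutoff_distance(points, ratio=0.1)
print(d_c)
```

A larger ratio yields a larger d_c, so more points count as neighbors when densities are estimated, which tends to produce fewer and coarser clusters.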

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

n_features_in_

the number of features used for the fitting

Type:

int

References

Ren, Yazhou, et al. “Deep density-based image clustering.” Knowledge-Based Systems 197 (2020): 105841.

fit(X: ndarray, y: ndarray = None) → DDC_density_peak_clustering

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DDC variant of the Density Peak Clustering algorithm

Return type:

DDC_density_peak_clustering

class clustpy.deep.ddc_n2d.N2D(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: collections.abc.Callable | torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, manifold_class: sklearn.base.TransformerMixin = <class 'sklearn.manifold._t_sne.TSNE'>, manifold_params: dict = None, initial_clustering_params: dict = None, device: torch.device = None, random_state: numpy.random.mtrand.RandomState | int = None)

Bases: _AbstractDeepClusteringAlgo

The Not Too Deep (N2D) clustering algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, a manifold learning technique such as t-SNE, UMAP or ISOMAP is executed on the embedded data, and the EM algorithm (GMM) is executed on the result.

Parameters:
  • n_clusters (int) – number of clusters (default: 8)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {"lr": 1e-3} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • manifold_class (TransformerMixin) – the manifold technique class (default: TSNE)

  • manifold_params (dict) – Parameters for the manifold execution. For example, perplexity can be changed for TSNE by setting manifold_params to {"n_components": 2, "perplexity": 25}. Check out e.g. sklearn.manifold.TSNE for more information. If None, it will be set to {"n_components": n_clusters} (default: None)

  • initial_clustering_params (dict) – parameters for the GMM clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels

Type:

np.ndarray

cluster_centers_manifold_

The final cluster centers within the embedding of the manifold

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

manifold_

The manifold object

Type:

TransformerMixin

n_features_in_

the number of features used for the fitting

Type:

int

cluster_centers_

The final cluster centers defined as the mean of assigned samples within the AE embedding

Type:

np.ndarray

References

McConville, Ryan, et al. “N2D: (Not too) deep clustering via clustering the local manifold of an autoencoded embedding.” 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021.

fit(X: ndarray, y: ndarray = None) → N2D

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the N2D algorithm

Return type:

N2D

predict(X: ndarray) → ndarray

Predicts the labels of the input data. Note that this is only a rough estimate, because the manifold technique does not learn a function f() that maps new data into the final embedding. The prediction is therefore obtained by assigning each sample to the closest cluster mean within the embedding of the AE.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

set_predict_request() → N2D

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

clustpy.deep.dec module

@authors: Lukas Miklautz, Dominik Mautz, Collin Leiber

class clustpy.deep.dec.DEC(n_clusters: int = 8, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: collections.abc.Callable | torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: torch.device = None, random_state: numpy.random.mtrand.RandomState | int = None)

Bases: _AbstractDeepClusteringAlgo

The Deep Embedded Clustering (DEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DEC loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if an initial_clustering_class is given that can determine the number of clusters itself, e.g. DBSCAN (default: 8)

  • alpha (float) – alpha value for the prediction (default: 1.0)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {"lr": 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {"lr": 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
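For readers unfamiliar with the alpha parameter, the sketch below shows the Student's t soft assignment from Xie et al. (2016), where alpha is the degrees of freedom of the kernel. It assumes that alpha plays the same role here; the helper name soft_assignments is hypothetical and the code is plain Python, not the library's implementation.

```python
def soft_assignments(embeddings, centers, alpha=1.0):
    """Student's t soft assignment of embedded samples to cluster centers.

    q_ij is proportional to (1 + ||z_i - mu_j||^2 / alpha) ** (-(alpha + 1) / 2)
    and each row is normalized to sum to 1.
    """
    result = []
    for z in embeddings:
        weights = []
        for mu in centers:
            squared = sum((a - b) ** 2 for a, b in zip(z, mu))
            weights.append((1.0 + squared / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        result.append([w / total for w in weights])
    return result


# Two embedded samples that coincide with the two cluster centers.
q = soft_assignments([[0.0], [1.0]], [[0.0], [1.0]], alpha=1.0)
print(q)
```

A hard label can be read off each row by taking the index of the largest entry; larger alpha values make the kernel tails lighter, so assignments become more peaked around the nearest center.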

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dec_labels_

The final DEC labels

Type:

np.ndarray

dec_cluster_centers_

The final DEC cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dec = DEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dec.fit(data)

References

Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis.” International conference on machine learning. 2016.

fit(X: ndarray, y: ndarray = None) → DEC

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DEC algorithm

Return type:

DEC

set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') → DEC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.dec.IDEC(n_clusters: int = 8, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: DEC

The Improved Deep Embedded Clustering (IDEC) algorithm. It is identical to the DEC algorithm, except that the self-supervised learning loss is also used during the clustering optimization. In addition, clustering_loss_weight defaults to 0.1 instead of 1.
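
The interplay of the two loss weights can be sketched as a weighted sum. This is a minimal sketch, assuming the objective has the form ssl_loss_weight * ssl_loss + clustering_loss_weight * clustering_loss; the helper name combined_loss and the loss values are hypothetical and not part of clustpy.

```python
def combined_loss(ssl_loss: float, clustering_loss: float,
                  ssl_loss_weight: float = 1.0,
                  clustering_loss_weight: float = 0.1) -> float:
    """Weighted sum of both loss terms (assumed form, not clustpy's code)."""
    return ssl_loss_weight * ssl_loss + clustering_loss_weight * clustering_loss


# IDEC defaults: the ssl loss is kept and the clustering loss is down-weighted.
idec_total = combined_loss(ssl_loss=0.5, clustering_loss=2.0)

# DEC-like setting: no ssl term during clustering, clustering weight of 1.
dec_total = combined_loss(0.5, 2.0, ssl_loss_weight=0.0,
                          clustering_loss_weight=1.0)
```

With the IDEC defaults the clustering term contributes only a tenth of its raw value, so the reconstruction term keeps shaping the embedding during the clustering phase.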

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)

  • alpha (float) – alpha value for the prediction (default: 1.0)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 0.1)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
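
The role of alpha can be illustrated with the Student's t kernel from the DEC paper by Xie et al. The following is a minimal sketch, assuming alpha acts as the degrees of freedom of that kernel; soft_assignments is a hypothetical helper written in plain Python, not the clustpy implementation.

```python
def soft_assignments(embedded, centers, alpha=1.0):
    """Student's t soft assignment of each embedded sample to each center."""
    result = []
    for z in embedded:
        # Unnormalised kernel value for every cluster center.
        kernel = []
        for mu in centers:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            kernel.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(kernel)
        # Normalise so that the assignments of one sample sum to 1.
        result.append([k / total for k in kernel])
    return result


embedded = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
centers = [[0.0, 0.0], [4.0, 4.0]]
q = soft_assignments(embedded, centers, alpha=1.0)

# A hard label is the center with the highest soft assignment.
hard_labels = [row.index(max(row)) for row in q]
```

Larger alpha values make the kernel tails lighter, so distant centers receive less probability mass.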

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dec_labels_

The final DEC labels

Type:

np.ndarray

dec_cluster_centers_

The final DEC cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import IDEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> idec = IDEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> idec.fit(data)

References

Guo, Xifeng, et al. “Improved deep embedded clustering with local structure preservation.” IJCAI. 2017.

set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') IDEC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.

Returns:

self – The updated object.

Return type:

object

clustpy.deep.deepect module

@authors: Collin Leiber, Julian Schilcher

class clustpy.deep.deepect.DeepECT(max_n_leaf_nodes: int = 20, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, grow_interval: int = 2, pruning_threshold: float = 0.1, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedded Cluster Tree (DeepECT) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, a cluster tree will be grown and the network will be optimized using the DeepECT loss function.

Parameters:
  • max_n_leaf_nodes (int) – Maximum number of leaf nodes in the cluster tree (default: 20)

  • batch_size (int) – Size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – Number of epochs for the actual clustering procedure (default: 150)

  • grow_interval (int) – Number of epochs after which the tree is grown (default: 2)

  • pruning_threshold (float) – The threshold for pruning the tree (default: 0.1)

  • optimizer_class (torch.optim.Optimizer) – The optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – Size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – Use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

tree_

The prediction cluster tree after training

Type:

PredictionClusterTree

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

References

Mautz, Dominik, Claudia Plant, and Christian Böhm. “Deep embedded cluster tree.” 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 2019.

fit(X: ndarray, y: ndarray = None) DeepECT[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – This instance of the DeepECT algorithm

Return type:

DeepECT

flat_clustering(n_leaf_nodes_to_keep: int) ndarray[source]

Transform the predicted labels into a flat clustering result by only keeping n_leaf_nodes_to_keep leaf nodes in the tree. Returns labels as if the clustering procedure would have stopped at the specified number of nodes. Note that each leaf node corresponds to a cluster.

Parameters:

n_leaf_nodes_to_keep (int) – The number of leaf nodes to keep in the cluster tree

Returns:

labels_pruned – The new cluster labels

Return type:

np.ndarray
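
The idea of cutting the tree back to a given number of leaves can be sketched as follows. This is a minimal sketch, assuming that the order in which nodes were split is known; the helper flatten_labels and the dictionaries parent and split_order are hypothetical and do not mirror the PredictionClusterTree API.

```python
def flatten_labels(leaf_labels, parent, split_order, n_leaf_nodes_to_keep):
    """Map final leaf ids to the leaves that existed when the tree had
    n_leaf_nodes_to_keep leaves (hypothetical helper, not clustpy's code)."""
    # A binary tree with k leaves results from k - 1 splits.
    n_splits_kept = n_leaf_nodes_to_keep - 1
    mapped = []
    for node in leaf_labels:
        # Climb while the parent was split after the cut-off.
        while parent[node] is not None and split_order[parent[node]] >= n_splits_kept:
            node = parent[node]
        mapped.append(node)
    # Relabel the surviving node ids as consecutive integers.
    relabel = {node: i for i, node in enumerate(sorted(set(mapped)))}
    return [relabel[node] for node in mapped]


# Toy tree: root 0 is split first (into 1 and 2), then 1 (into 3 and 4),
# then 2 (into 5 and 6). The final leaves are 3, 4, 5 and 6.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
split_order = {0: 0, 1: 1, 2: 2}
leaf_labels = [3, 4, 5, 6, 5, 3]

labels_three = flatten_labels(leaf_labels, parent, split_order, 3)
labels_two = flatten_labels(leaf_labels, parent, split_order, 2)
```

In the toy tree, keeping three leaves undoes only the last split (node 2), while keeping two leaves collapses both subtrees below the root.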

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

set_predict_request() DeepECT

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

clustpy.deep.den module

@authors: Collin Leiber

class clustpy.deep.den.DEN(n_clusters: int = 8, group_size: int | list | None = 2, n_neighbors: int = 5, weight_locality_constraint: float = 0.5, weight_sparsity_constraint: float = 1.0, heat_kernel_t_parameter: float = 1.0, group_lasso_lambda_parameter: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int | None = None, custom_dataloaders: tuple = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedding Network (DEN) algorithm. It trains a neural network by optimizing a loss function consisting of three components. These are (1) the standard loss function of the neural network (e.g. reconstruction loss for autoencoders), (2) the locality-preserving constraint and (3) the group sparsity constraint. Finally, k-Means is executed in the resulting embedding.

Parameters:
  • n_clusters (int) – number of clusters (default: 8)

  • group_size (int | list) – the number of features in each group. Can also be a list, specifying the size of each group separately. Can be None if embedding_size is specified (default: 2)

  • n_neighbors (int) – the number of nearest-neighbors (including itself) for the locality-preserving constraint. Nearest-neighbors will be calculated by using the Euclidean distance. If another distance should be used to define the nearest-neighbors, the neighbors can be included in the custom_dataloader as additional_inputs. In this case, it is expected that the trainloader is composed of: (sample_ids, original_samples, 1st-NNs, 2nd-NNs, …, (n_neighbors-1)-NNs) (default: 5)

  • weight_locality_constraint (float) – weight alpha for the locality-preserving constraint (default: 0.5)

  • weight_sparsity_constraint (float) – weight beta for the group sparsity constraint (default: 1.)

  • heat_kernel_t_parameter (float) – the t parameter for the heat kernel included in the locality-preserving constraint (default: 1.)

  • group_lasso_lambda_parameter (float) – the lambda parameter for the group lasso included in the group sparsity constraint (default: 1.)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: None)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
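
The two constraints can be sketched in plain Python. This is a minimal sketch, assuming the usual formulation from the cited paper: a heat kernel exp(-||x_i - x_j||^2 / t) weights squared embedding distances to the neighbors, and a group lasso sums lambda * sqrt(n_g) * ||h_g|| over the feature groups. All function names are hypothetical, and the normalisation used by clustpy may differ.

```python
import math


def heat_kernel(x_i, x_j, t=1.0):
    """Similarity weight exp(-||x_i - x_j||^2 / t) between two input samples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / t)


def locality_loss(inputs, embedded, neighbors, t=1.0):
    """Heat-kernel-weighted squared embedding distances to the neighbors."""
    loss = 0.0
    for i, neighbor_ids in enumerate(neighbors):
        for j in neighbor_ids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(embedded[i], embedded[j]))
            loss += heat_kernel(inputs[i], inputs[j], t) * sq_dist
    return loss


def group_sparsity_loss(embedded, group_sizes, lam=1.0):
    """Group lasso: sum of lam * sqrt(n_g) * ||h_g||_2 over samples and groups."""
    loss = 0.0
    for h in embedded:
        start = 0
        for n_g in group_sizes:
            group = h[start:start + n_g]
            loss += lam * math.sqrt(n_g) * math.sqrt(sum(v * v for v in group))
            start += n_g
    return loss


inputs = [[0.0, 0.0], [1.0, 0.0]]
embedded = [[3.0, 4.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
neighbors = [[1], [0]]  # neighbor indices per sample, excluding the sample itself

loc = locality_loss(inputs, embedded, neighbors, t=1.0)
gs = group_sparsity_loss(embedded, group_sizes=[2, 2], lam=1.0)

# Assumed combination with the default weights (ssl loss omitted here).
weighted = 0.5 * loc + 1.0 * gs
```

A sample is its own nearest neighbor at distance zero, so including it (as n_neighbors does) would not change the locality term in this sketch.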

labels_

The final labels (obtained by KMeans)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by KMeans)

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DEN
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> den = DEN(n_clusters=3, pretrain_epochs=3)
>>> den.fit(data)

References

Huang, Peihao, et al. “Deep embedding network for clustering.” 2014 22nd International conference on pattern recognition. IEEE, 2014.

fit(X: ndarray, y: ndarray = None) DEN[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DEN algorithm

Return type:

DEN

set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') DEN

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.

Returns:

self – The updated object.

Return type:

object

clustpy.deep.dipdeck module

@authors: Collin Leiber

class clustpy.deep.dipdeck.DipDECK(n_clusters_init: int = 35, dip_merge_threshold: float = 0.9, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, max_n_clusters: int = inf, min_n_clusters: int = 1, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 5, max_cluster_size_diff_factor: float = 2, pval_strategy: str = 'table', n_boots: int = 1000, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None, debug: bool = False)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedded Clustering with k-Estimation (DipDECK) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters using an overestimated number of clusters. Last, the network will be optimized using the DipDECK loss function. If any Dip-value exceeds the dip_merge_threshold, the corresponding clusters will be merged.

Parameters:
  • n_clusters_init (int) – initial number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 35)

  • dip_merge_threshold (float) – threshold regarding the Dip-p-value that defines if two clusters should be merged. Must be between 0 and 1 (default: 0.9)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • max_n_clusters (int) – maximum number of clusters. Must be larger than min_n_clusters. If the result has more clusters, a merge will be forced (default: np.inf)

  • min_n_clusters (int) – minimum number of clusters. Must be larger than 0, smaller than max_n_clusters and smaller than n_clusters_init. When this number of clusters is reached, all further merges will be prevented (default: 1)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure. Will reset after each merge (default: 50)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 5)

  • max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for the Dip calculation (default: 2)

  • pval_strategy (str) – Defines which strategy to use to obtain dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)

  • n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • debug (bool) – If true, additional information will be printed to the console (default: False)
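
How dip_merge_threshold, min_n_clusters and max_n_clusters interact can be sketched as a single decision step. This is a minimal sketch, assuming that the pair with the highest Dip-p-value is the merge candidate; choose_merge and the p-value matrix are hypothetical and not taken from the clustpy source.

```python
def choose_merge(dip_p_values, n_clusters, dip_merge_threshold=0.9,
                 min_n_clusters=1, max_n_clusters=float("inf")):
    """Return the pair of clusters to merge next, or None if no merge is due."""
    if n_clusters <= min_n_clusters:
        return None  # the minimum is reached, further merges are blocked
    # Find the pair with the largest Dip-p-value (upper triangle only).
    best_pair, best_p = None, -1.0
    for i in range(n_clusters):
        for j in range(i + 1, n_clusters):
            if dip_p_values[i][j] > best_p:
                best_pair, best_p = (i, j), dip_p_values[i][j]
    # Too many clusters force a merge even below the threshold.
    forced = n_clusters > max_n_clusters
    if best_pair is not None and (best_p > dip_merge_threshold or forced):
        return best_pair
    return None


p_values = [
    [0.0, 0.2, 0.95],
    [0.2, 0.0, 0.4],
    [0.95, 0.4, 0.0],
]

regular = choose_merge(p_values, n_clusters=3)
strict = choose_merge(p_values, n_clusters=3, dip_merge_threshold=0.99)
forced = choose_merge(p_values, n_clusters=3, dip_merge_threshold=0.99,
                      max_n_clusters=2)
blocked = choose_merge(p_values, n_clusters=3, min_n_clusters=3)
```

In a full run, such a step would be repeated after each merge, since clustering_epochs is reset whenever two clusters are merged.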

labels_

The final labels

Type:

np.ndarray

n_clusters_

The final number of clusters

Type:

int

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipDECK
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipdeck = DipDECK(pretrain_epochs=3, clustering_epochs=3)
>>> dipdeck.fit(data)

References

Leiber, Collin, et al. “Dip-based deep embedded clustering with k-estimation.” Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021.

fit(X: ndarray, y: ndarray = None) DipDECK[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DipDECK algorithm

Return type:

DipDECK

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

set_predict_request() DipDECK

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

clustpy.deep.dipencoder module

@authors: Collin Leiber

class clustpy.deep.dipencoder.DipEncoder(n_clusters: int = 8, batch_size: int = None, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, max_cluster_size_diff_factor: float = 3, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The DipEncoder. Can be used either as a clustering procedure if no ground truth labels are given or as a supervised dimensionality reduction technique. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DipEncoder loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)

  • batch_size (int) – size of the data batches for the actual training of the DipEncoder. Should increase with the number of clusters. If None, it will be set to 25 * n_clusters (default: None)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used (default: 3)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss. If None, it will be equal to 1/(4L), where L is the reconstruction loss of the first batch of an untrained neural network (default: None)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
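
The two None defaults described above, batch_size and ssl_loss_weight, can be written out as small helper functions. This is a minimal sketch; resolve_batch_size and resolve_ssl_loss_weight are hypothetical names, and the first-batch loss of 0.5 is a made-up value.

```python
def resolve_batch_size(batch_size, n_clusters):
    """Use 25 samples per cluster when no batch size is given."""
    return 25 * n_clusters if batch_size is None else batch_size


def resolve_ssl_loss_weight(ssl_loss_weight, first_batch_loss):
    """Use 1 / (4 * L) when no weight is given, L being the first-batch loss."""
    if ssl_loss_weight is None:
        return 1.0 / (4.0 * first_batch_loss)
    return ssl_loss_weight


batch_size = resolve_batch_size(None, n_clusters=8)
weight = resolve_ssl_loss_weight(None, first_batch_loss=0.5)

# With this weight, the weighted loss of that first batch equals 1/4.
initial_weighted_loss = weight * 0.5
```

Scaling by 1/(4L) makes the weighted ssl loss independent of the raw magnitude of the reconstruction error of the untrained network.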

labels_

The final labels

Type:

np.ndarray

projection_axes_

The final projection axes between the clusters

Type:

np.ndarray

index_dict_

A dictionary to match the indices of two clusters to a projection axis

Type:

dict

projection_thresholds_

A list containing the thresholds for each projection axis and a tuple indicating which cluster is left and right of the threshold

Type:

list

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipEncoder
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipencoder = DipEncoder(3, pretrain_epochs=3, clustering_epochs=3)
>>> dipencoder.fit(data)
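The effect of max_cluster_size_diff_factor can be sketched in isolation. The function below is a hypothetical illustration, not part of clustpy: it assumes that "closest" is measured by a precomputed distance per sample of the larger cluster (this section does not state the reference point) and returns the indices of the samples that would be kept.

```python
def select_closest_samples(distances_large, n_small, max_cluster_size_diff_factor=3.0):
    """Keep at most factor * n_small samples of the larger cluster.

    distances_large: one (assumed) distance value per sample of the larger cluster.
    n_small: number of samples in the smaller cluster.
    """
    limit = int(max_cluster_size_diff_factor * n_small)
    if len(distances_large) <= limit:
        # The size difference is within the allowed factor: keep everything
        return list(range(len(distances_large)))
    # Otherwise keep only the `limit` samples with the smallest distance
    order = sorted(range(len(distances_large)), key=lambda i: distances_large[i])
    return sorted(order[:limit])


# Larger cluster with 7 samples, smaller cluster with 2 samples, factor 3 -> keep 6
kept = select_closest_samples([5.0, 1.0, 4.0, 2.0, 3.0, 0.5, 6.0], n_small=2)
```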

References

Leiber, Collin and Bauer, Lena G. M. and Neumayr, Michael and Plant, Claudia and Böhm, Christian “The DipEncoder: Enforcing Multimodality in Autoencoders.” Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2022.

fit(X: ndarray, y: ndarray = None) DipEncoder[source]

Initiate the actual clustering/dimensionality reduction process on the input data set. If no ground truth labels are given, the resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – The given (training) data set

  • y (np.ndarray) – The ground truth labels. If None, the DipEncoder will be used for clustering (default: None)

Returns:

self – This instance of the DipEncoder

Return type:

DipEncoder

plot(X: ndarray, edge_width: float = 0.2, show_legend: bool = True) None[source]

Plot the current state of the DipEncoder. First the data set will be encoded using the neural network, afterwards the plot will be created. Uses the plot_scatter_matrix as a basis and adds projection axes in red.

Parameters:
  • X (np.ndarray) – The data set

  • edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots

  • show_legend (bool) – Specifies whether a legend should be added to the plot

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

set_predict_request() DipEncoder

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

clustpy.deep.dipencoder.plot_dipencoder_embedding(X_embed: ndarray, n_clusters: int, labels: ndarray, projection_axes: ndarray, index_dict: dict, edge_width: float = 0.1, show_legend: bool = False, show_plot: bool = True) None[source]

Plot the current state of the DipEncoder. Uses the plot_scatter_matrix as a basis and adds projection axes in red.

Parameters:
  • X_embed (np.ndarray) – The embedded data set

  • n_clusters (int) – Number of clusters

  • labels (np.ndarray) – The cluster labels

  • projection_axes (np.ndarray) – The projection axes between the clusters

  • index_dict (dict) – A dictionary to match the indices of two clusters to a projection axis

  • edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots

  • show_legend (bool) – Specifies whether a legend should be added to the plot

  • show_plot (bool) – Specifies whether the plot should be plotted, i.e. if plt.show() should be executed (default: True)

clustpy.deep.dkm module

@authors: Collin Leiber

class clustpy.deep.dkm.DKM(n_clusters: int = 8, alphas: tuple = 1000, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep k-Means (DKM) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DKM loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)

  • alphas (tuple) – tuple of different alpha values used for the prediction. Small values close to 0 are equivalent to homogeneous assignments to all clusters. Large values simulate a clear assignment as with kMeans. If None, the default calculation of the paper will be used. This is equal to \alpha_{i+1}=2^{1/log(i)^2}*\alpha_i with \alpha_1=0.1 and maximum i=40. Alpha can also be a tuple with (None, \alpha_1, maximum i) (default: (1000))

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for each alpha value for the actual clustering procedure. The total number of clustering epochs therefore corresponds to: len(alphas)*clustering_epochs (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
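The default alpha schedule mentioned for the alphas parameter can be sketched as follows. This is an illustration under stated assumptions, not clustpy's code: the function name default_alpha_schedule is hypothetical, the natural logarithm is assumed, and the recursion is started at i=2 because log(1)=0 would cause a division by zero.

```python
import math


def default_alpha_schedule(alpha_1=0.1, max_i=40):
    """Sketch of alpha_{i+1} = 2^(1 / log(i)^2) * alpha_i with max_i values in total."""
    alphas = [alpha_1]
    for i in range(2, max_i + 1):  # assumption: start at i=2 to avoid log(1) = 0
        factor = 2 ** (1 / math.log(i) ** 2)
        alphas.append(factor * alphas[-1])
    return alphas


alphas = default_alpha_schedule()
# clustering_epochs is applied once per alpha value (see the parameter description)
total_clustering_epochs = len(alphas) * 150
```

With 40 alpha values and clustering_epochs=150, the clustering phase would run for 6000 epochs in total, which is why the single-value default of alphas is considerably cheaper.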

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dkm_labels_

The final DKM labels

Type:

np.ndarray

dkm_cluster_centers_

The final DKM cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DKM
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dkm = DKM(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dkm.fit(data)
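To see why small alpha values give homogeneous assignments and large values behave like KMeans, the following sketch applies a softmax to the negative, alpha-scaled squared distances of one sample to three centers. The function soft_assignments is a hypothetical helper written for this illustration and is not taken from clustpy.

```python
import math


def soft_assignments(squared_distances, alpha):
    """Softmax over -alpha * squared distance, one weight per cluster center."""
    # Subtract the minimum distance for numerical stability (does not change the result)
    d_min = min(squared_distances)
    weights = [math.exp(-alpha * (d - d_min)) for d in squared_distances]
    total = sum(weights)
    return [w / total for w in weights]


squared_distances = [0.5, 1.0, 2.0]
uniform = soft_assignments(squared_distances, alpha=0.0)   # equal weight for all clusters
hard = soft_assignments(squared_distances, alpha=1000.0)   # nearly all weight on the closest
```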

References

Fard, Maziar Moradi, Thibaut Thonet, and Eric Gaussier. “Deep k-means: Jointly clustering with k-means and learning representations.” Pattern Recognition Letters 138 (2020): 185-192.

fit(X: ndarray, y: ndarray = None) DKM[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DKM algorithm

Return type:

DKM

set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') DKM

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.

Returns:

self – The updated object.

Return type:

object

clustpy.deep.enrc module

@authors: Lukas Miklautz

class clustpy.deep.enrc.ACeDeC(n_clusters: int, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 20, init: str = 'acedec', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]

Bases: ENRC

Autoencoder Centroid-based Deep Cluster (ACeDeC) can be seen as a special case of ENRC where we have one cluster space and one shared space with a single cluster.

Parameters:
  • n_clusters (int) – number of clusters

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • P (list) – list containing projections for clusters in clustered space and cluster in shared space (optional) (default: None)

  • input_centers (list) – list containing the cluster centers for clusters in clustered space and cluster in shared space (optional) (default: None)

  • batch_size (int) – size of the data batches (default: 128)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)

  • tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1-tolerance_threshold) for all clusterings, the training will stop before clustering_epochs is reached. The higher the value, the earlier the training will stop. If set to 0 or None, the training will continue until the labels no longer change (default: None)

  • optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)

  • init (str) – choose which initialization strategy should be used. Has to be one of ‘acedec’, ‘subkmeans’, ‘random’ or ‘sgd’ (default: ‘acedec’)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)

  • scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)

  • init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)

  • init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10,000)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy. (default: True)

  • debug (bool) – if True additional information during the training will be printed (default: False)
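The stopping rule behind tolerance_threshold can be sketched with a small, self-contained NMI computation. This is only an illustration: the functions nmi and should_stop are hypothetical, the arithmetic-mean normalization of the NMI is an assumption, and a tiny slack of 1e-12 is added to guard against floating point rounding.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information (assumed arithmetic-mean normalization)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both labelings consist of a single cluster
    mi = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    return mi / ((h_a + h_b) / 2)


def should_stop(old_labels, new_labels, tolerance_threshold=None):
    """True if NMI(old, new) >= 1 - tolerance_threshold holds for all clusterings."""
    tol = 0.0 if tolerance_threshold is None else tolerance_threshold
    return all(
        nmi(old, new) >= 1.0 - tol - 1e-12  # small slack for rounding errors
        for old, new in zip(old_labels, new_labels)
    )


# One clustering whose labels are unchanged up to a renaming of the clusters
unchanged = should_stop([[0, 0, 1, 1]], [[1, 1, 0, 0]])
# One clustering where half of the samples moved to another cluster
old, new = [[0, 0, 1, 1, 2, 2]], [[0, 1, 1, 2, 2, 0]]
strict = should_stop(old, new, tolerance_threshold=0.01)
loose = should_stop(old, new, tolerance_threshold=0.7)
```

In this sketch a large tolerance_threshold accepts substantial label changes and thus stops early, while None only stops once the labels agree completely.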

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Raises:

ValueError – if init is not one of ‘acedec’, ‘subkmeans’, ‘random’, ‘auto’ or ‘sgd’

References

Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Böhm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832

fit(X: ndarray, y: ndarray = None) ACeDeC[source]

Cluster the input dataset with the ACeDeC algorithm. Saves the labels, centers, V, m, Betas, and P in the ACeDeC object. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – input data

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – returns the ACeDeC object

Return type:

ACeDeC

predict(X: ndarray, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]

Predicts the labels of the input data.

Parameters:
  • X (np.ndarray) – input data

  • use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)

  • dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ACeDeC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:
  • dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.

  • use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.enrc.ENRC(n_clusters: list, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 20, init: str = 'nrkmeans', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]

Bases: _AbstractDeepClusteringAlgo

The Embedded Non-Redundant Clustering (ENRC) algorithm.

Parameters:
  • n_clusters (list) – list containing number of clusters for each clustering

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • P (list) – list containing projections for each clustering (optional) (default: None)

  • input_centers (list) – list containing the cluster centers for each clustering (optional) (default: None)

  • batch_size (int) – size of the data batches (default: 128)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)

  • tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1-tolerance_threshold) for all clusterings, the training will stop before clustering_epochs is reached. The higher the value, the earlier the training will stop. If set to 0 or None, the training will continue until the labels no longer change (default: None)

  • optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)

  • init (str) – choose which initialization strategy should be used. Has to be one of ‘nrkmeans’, ‘random’ or ‘sgd’ (default: ‘nrkmeans’)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)

  • scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)

  • init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)

  • init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10,000)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy. (default: True)

  • debug (bool) – if True additional information during the training will be printed (default: False)

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Raises:

ValueError – if init is not one of ‘nrkmeans’, ‘random’, ‘auto’ or ‘sgd’

References

Miklautz, Lukas & Dominik Mautz et al. “Deep embedded non-redundant clustering.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.

fit(X: ndarray, y: ndarray = None) ENRC[source]

Cluster the input dataset with the ENRC algorithm. Saves the labels, centers, V, m, Betas, and P in the ENRC object. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – input data

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – returns the ENRC object

Return type:

ENRC

plot_subspace(X: ndarray, subspace_index: int = 0, labels: ndarray = None, plot_centers: bool = False, gt: ndarray = None, equal_axis: bool = False) None[source]

Plot the subspace with the specified index as a scatter matrix plot.

Parameters:
  • X (np.ndarray) – input data

  • subspace_index (int) – index of the subspace_nr (default: 0)

  • labels (np.ndarray) – the labels to use for the plot. If None, the labels found by Nr-Kmeans are used (default: None)

  • plot_centers (bool) – plot centers if True (default: False)

  • gt (np.ndarray) – the ground truth labels (default: None)

  • equal_axis (bool) – equalize axis if True (default: False)

Return type:

scatter matrix plot of the input data

predict(X: ndarray = None, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]

Predicts the labels for each clustering of X in a mini-batch manner.

Parameters:
  • X (np.ndarray) – input data

  • use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)

  • dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)

Returns:

predicted_labels – n x c matrix, where n is the number of data points in X and c is the number of clusterings.

Return type:

np.ndarray
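The shape of the returned n x c label matrix and the role of use_P can be illustrated with a simplified sketch. It is not the ENRC implementation: predict_sketch is a hypothetical function that assumes the data has already been embedded and rotated, assigns each point to its nearest center per clustering, and either hard-selects the dimensions listed in P or weights all dimensions with soft beta values.

```python
def predict_sketch(Z, centers, P=None, betas=None):
    """Return an n x c list of labels (n points, c clusterings).

    Z: rotated embedded points, centers[k]: centers of clustering k (full dimensionality).
    P: list of dimension indices per clustering (hard selection, like use_P=True).
    betas: one weight per dimension and clustering (soft selection, like use_P=False).
    """
    labels = []
    for z in Z:
        row = []
        for k, centers_k in enumerate(centers):
            if P is not None:
                weights = [1.0 if d in P[k] else 0.0 for d in range(len(z))]
            else:
                weights = betas[k]
            # Weighted squared distance to each center of clustering k
            distances = [
                sum(w * (z_d - c_d) ** 2 for w, z_d, c_d in zip(weights, z, c))
                for c in centers_k
            ]
            row.append(distances.index(min(distances)))
        labels.append(row)
    return labels


Z = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]
centers = [
    [[0.0, 0.0], [10.0, 0.0]],  # clustering 0 separates along dimension 0
    [[0.0, 0.0], [0.0, 10.0]],  # clustering 1 separates along dimension 1
]
hard_labels = predict_sketch(Z, centers, P=[[0], [1]])
soft_labels = predict_sketch(Z, centers, betas=[[0.9, 0.1], [0.1, 0.9]])
```

Each row holds one label per clustering, so four points and two clusterings yield a 4 x 2 result.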

reconstruct_subspace_centroids(subspace_index: int = 0) ndarray[source]

Reconstructs the centroids in the specified subspace_nr.

Parameters:

subspace_index (int) – index of the subspace_nr (default: 0)

Returns:

centers_rec – the reconstructed centers

Return type:

np.ndarray

set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ENRC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:
  • dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.

  • use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.

Returns:

self – The updated object.

Return type:

object

transform_full_space(X: ndarray, embedded=False) ndarray[source]

Embeds the input dataset with the neural network and the matrix V from the ENRC object.

Parameters:
  • X (np.ndarray) – input data

  • embedded (bool) – if True, then X is assumed to be already embedded (default: False)

Returns:

rotated – The transformed data

Return type:

np.ndarray

transform_subspace(X: ndarray, subspace_index: int = 0, embedded: bool = False) ndarray[source]

Embeds the input dataset with the neural network and with the matrix V projected onto the cluster space specified by subspace_index.

Parameters:
  • X (np.ndarray) – input data

  • subspace_index (int) – index of the subspace_nr (default: 0)

  • embedded (bool) – if True, then X is assumed to be already embedded (default: False)

Returns:

subspace – The transformed subspace

Return type:

np.ndarray
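The relation between transform_full_space and transform_subspace can be sketched on already embedded data (the embedded=True case). The two functions below are hypothetical stand-ins that assume the rotation is a plain matrix product with V and that the projection selects the columns listed in P for the chosen subspace. Plain Python lists are used instead of NumPy arrays.

```python
def transform_full_space_sketch(Z, V):
    """Rotate embedded points: returns Z multiplied by V (lists of lists)."""
    n_out = len(V[0])
    return [
        [sum(z[i] * V[i][j] for i in range(len(z))) for j in range(n_out)]
        for z in Z
    ]


def transform_subspace_sketch(Z, V, P, subspace_index=0):
    """Rotate with V, then keep only the dimensions of the chosen subspace."""
    rotated = transform_full_space_sketch(Z, V)
    return [[row[d] for d in P[subspace_index]] for row in rotated]


Z = [[1.0, 2.0], [3.0, 4.0]]
V = [[0.0, 1.0], [1.0, 0.0]]  # orthogonal matrix that swaps the two dimensions
P = [[0], [1]]                # one dimension per subspace
rotated = transform_full_space_sketch(Z, V)
subspace_1 = transform_subspace_sketch(Z, V, P, subspace_index=1)
```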

clustpy.deep.enrc.acedec_init(data: ~numpy.ndarray, n_clusters: list, optimizer_params: dict, batch_size: int = 128, optimizer_class: ~torch.optim.optimizer.Optimizer = None, rounds: int = None, epochs: int = 10, random_state: ~numpy.random.mtrand.RandomState = None, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, device: ~torch.device = device(type='cpu'), debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Initialization strategy based on optimizing ACeDeC’s parameters V and beta in isolation from the neural network using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than the nrkmeans_init and only constrains V using the reconstruction error (mean_squared_error), which can be more flexible than the orthogonality constraint of NrKmeans. A problem of the sgd_init strategy is that it can be less stable for small data sets.

Parameters:
  • data (np.ndarray) – input data

  • n_clusters (list) – list of ints, number of clusters for each clustering

  • optimizer_params (dict) – parameters of the optimizer used to optimize V and beta, includes the learning rate

  • batch_size (int) – size of the data batches (default: 128)

  • optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used (default: None)

  • rounds (int) – not used here (default: None)

  • epochs (int) – epochs is automatically set to be close to 20,000 mini-batch iterations, as in the ACeDeC paper. If this determined value is smaller than the passed epochs, then the passed epochs is used (default: 10)

  • random_state (np.random.RandomState) – random state for reproducible results (default: None)

  • input_centers (list) – list of np.ndarray, initial cluster centers if they should be set manually (optional) (default: None)

  • P (list) – list containing projections for each subspace (optional) (default: None)

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • device (torch.device) – device on which to train (default: torch.device(‘cpu’))

  • debug (bool) – if True then the cost of each round will be printed (default: True)

Returns:

tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.

Return type:

(list, list, np.ndarray, np.ndarray)
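The epochs rule described above can be sketched as follows. This is only an illustration under stated assumptions: the function name effective_epochs, the use of ceil for the number of mini-batches per epoch, and the rounding of the result are not taken from the library.

```python
import math


def effective_epochs(n_samples, batch_size=128, epochs=10, target_iterations=20000):
    """Hypothetical helper: choose the number of epochs so that training runs
    for roughly target_iterations mini-batch iterations, but never fewer than
    the passed epochs."""
    # Assumption: the last, possibly smaller, batch is not dropped.
    iterations_per_epoch = math.ceil(n_samples / batch_size)
    auto_epochs = round(target_iterations / iterations_per_epoch)
    return max(epochs, auto_epochs)


small = effective_epochs(12_800)      # 100 iterations per epoch
large = effective_epochs(1_000_000)   # 7813 iterations per epoch
```

For the small data set the automatically determined value dominates, while for the large one the passed epochs value is used.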

clustpy.deep.enrc.available_init_strategies() list[source]

Returns a list of strings of available initialization strategies for ENRC and ACeDeC. Currently, the following strategies are supported: nrkmeans, random, sgd, auto

clustpy.deep.enrc.beta_weights_init(P: list, n_dims: int, high_value: float = 0.9) Tensor[source]

Initializes parameters of the softmax such that betas will be set to high_value in dimensions which form a cluster subspace according to P and set to (1 - high_value)/(len(P) - 1) for the other clusterings.

Parameters:
  • P (list) – list containing projections for each subspace

  • n_dims (int) – dimensionality of the embedded data

  • high_value (float) – value that should be initially used to indicate strength of assignment of a specific dimension to the clustering (default: 0.9)

Returns:

beta_weights – initialized weights that are input in the softmax to get the betas.

Return type:

torch.Tensor
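To make the relation between the weights and the resulting betas concrete, here is a minimal pure-Python sketch. It assumes that P partitions the dimensions among at least two clusterings, that the softmax is taken over the clusterings for each dimension, and that the weights are stored as logarithms of the desired betas; the library may parameterize the weights differently. Both function names are hypothetical.

```python
import math


def init_beta_weights(P, n_dims, high_value=0.9):
    """Hypothetical helper: weights whose softmax over clusterings yields
    high_value for assigned dimensions and (1 - high_value) / (len(P) - 1)
    for all other clusterings. Requires len(P) >= 2."""
    low_value = (1.0 - high_value) / (len(P) - 1)
    weights = []
    for dims in P:
        assigned = set(dims)
        weights.append(
            [math.log(high_value if d in assigned else low_value) for d in range(n_dims)]
        )
    return weights


def softmax_over_clusterings(weights):
    """Column-wise softmax: for every dimension, normalize across clusterings."""
    n_dims = len(weights[0])
    betas = [[0.0] * n_dims for _ in weights]
    for d in range(n_dims):
        exps = [math.exp(row[d]) for row in weights]
        total = sum(exps)
        for c, e in enumerate(exps):
            betas[c][d] = e / total
    return betas


P = [[0, 1], [2]]  # clustering 0 owns dimensions 0 and 1, clustering 1 owns dimension 2
weights = init_beta_weights(P, n_dims=3)
betas = softmax_over_clusterings(weights)
```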

clustpy.deep.enrc.calculate_beta_weight(data: Tensor, centers: list, V: Tensor, P: list, high_beta_value: float = 0.9) Tensor[source]

The beta weights have a closed form solution if we have two subspaces, so the optimal values given the data, centers and V can be computed. See the supplement (https://gitlab.cs.univie.ac.at/lukas/acedec_public/-/blob/master/supplement.pdf) of Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Boehm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832. For more than two subspaces, the beta weights are calculated assuming that an assigned subspace should have a weight of 0.9.

Parameters:
  • data (torch.Tensor) – input data

  • centers (list) – list of torch.Tensor, cluster centers for each clustering

  • V (torch.Tensor) – orthogonal rotation matrix

  • P (list) – list containing projections for each subspace

  • high_beta_value (float) – value that should be initially used to indicate strength of assignment of a specific dimension to the clustering (default: 0.9)

Returns:

beta_weights – a c x d vector containing the weights for the softmax to indicate which dimensions d are important for each clustering c.

Return type:

torch.Tensor

Raises:

ValueError – if the number of clusterings is smaller than 2

clustpy.deep.enrc.calculate_optimal_beta_weights_special_case(data: Tensor, centers: list, V: Tensor, batch_size: int = 32) Tensor[source]

The beta weights have a closed form solution if we have two subspaces, so the optimal values given the data, centers and V can be computed. See supplement of Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Boehm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832 here: https://gitlab.cs.univie.ac.at/lukas/acedec_public/-/blob/master/supplement.pdf

Parameters:
  • data (torch.Tensor) – input data

  • centers (list) – list of torch.Tensor, cluster centers for each clustering

  • V (torch.Tensor) – orthogonal rotation matrix

  • batch_size (int) – size of the data batches (default: 32)

Returns:

optimal_beta_weights – a c x d vector containing the optimal weights for the softmax to indicate which dimensions d are important for each clustering c.

Return type:

torch.Tensor

clustpy.deep.enrc.enrc_encode_decode_batchwise_with_loss(V: Tensor, centers: list, model: Module, dataloader: DataLoader, device: device = device(type='cpu'), ssl_loss_fn: Callable | _Loss = None) ndarray[source]

Encodes and decodes the input data of a dataloader in a mini-batch manner with ENRC.

Parameters:
  • V (torch.Tensor) – orthogonal rotation matrix

  • centers (list) – list of torch.Tensor, cluster centers for each clustering

  • model (torch.nn.Module) – the input model for encoding the data

  • dataloader (torch.utils.data.DataLoader) – dataloader to be used for prediction

  • device (torch.device) – device to be predicted on (default: torch.device(‘cpu’))

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: None)

Returns:

  • enrc_encoding (np.ndarray) – n x d matrix, where n is the number of data points and d is the number of dimensions of z.

  • enrc_decoding (np.ndarray) – n x D matrix, where n is the number of data points and D is the data dimensionality.

  • reconstruction_error (float) – reconstruction error (will be None if ssl_loss_fn is not specified)

clustpy.deep.enrc.enrc_init(data: ~numpy.ndarray, n_clusters: list, init: str = 'auto', rounds: int = 10, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, random_state: ~numpy.random.mtrand.RandomState = None, max_iter: int = 100, optimizer_params: dict = None, optimizer_class: ~torch.optim.optimizer.Optimizer = None, batch_size: int = 128, epochs: int = 10, device: ~torch.device = device(type='cpu'), debug: bool = True, init_kwargs: dict = None) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Initialization strategy for the ENRC algorithm.

Parameters:
  • data (np.ndarray) – input data

  • n_clusters (list) – list of ints, number of clusters for each clustering

  • init (str) –

    {‘nrkmeans’, ‘random’, ‘sgd’, ‘auto’} or callable. Initialization strategies for parameters cluster_centers, V and beta of ENRC. (default=’auto’)

    ’nrkmeans’ : Performs the NrKmeans algorithm to get initial parameters. This strategy is preferred for small data sets, but the orthogonality constraint on V, and subsequently on the clustered subspaces, can sometimes be too limiting in practice, e.g., if clusterings in the data are not perfectly non-redundant.

    ’random’ : Same as ‘nrkmeans’, but max_iter is set to 10, so the performance is faster, but also less optimized, thus more random.

    ’sgd’ : Initialization strategy based on optimizing ENRC’s parameters V and beta in isolation from the neural network using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than the ‘nrkmeans’ option and only constrains V using the reconstruction error (mean_squared_error), which can be more flexible than the orthogonality constraint of NrKmeans. A drawback of the ‘sgd’ strategy is that it can be less stable for small data sets.

    ’auto’ : Selects ‘sgd’ init if data.shape[0] > 100,000 or data.shape[1] > 1,000. For smaller data sets ‘nrkmeans’ init is used.

    If a callable is passed, it should take arguments data and n_clusters (additional parameters can be provided via the dictionary init_kwargs) and return an initialization (centers, P, V and beta_weights).

  • rounds (int) – number of repetitions of the initialization procedure (default: 10)

  • input_centers (list) – list of np.ndarray, initial cluster centers if they should be set manually (optional) (default: None)

  • P (list) – list containing projections for each subspace (optional) (default: None)

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • random_state (np.random.RandomState) – random state for reproducible results (default: None)

  • max_iter (int) – maximum number of iterations of NrKmeans. Only used for init=’nrkmeans’ (default: 100)

  • optimizer_params (dict) – parameters of the optimizer used to optimize V and beta, includes the learning rate. Only used for init=’sgd’

  • optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used. Only used for init=’sgd’ (default: None)

  • batch_size (int) – size of the data batches. Only used for init=’sgd’ (default: 128)

  • epochs (int) – number of epochs for the actual clustering procedure. Only used for init=’sgd’ (default: 10)

  • device (torch.device) – device on which to train. Only used for init=’sgd’ (default: torch.device(‘cpu’))

  • debug (bool) – if True then the cost of each round will be printed (default: True)

  • init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)

Returns:

tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.

Return type:

(list, list, np.ndarray, np.ndarray)

Raises:

ValueError – if an init value is passed that is not implemented
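The selection rule documented for init=’auto’ can be written down in a few lines. The function name resolve_auto_init is hypothetical and only mirrors the thresholds stated in the parameter description; it is not the library’s code.

```python
def resolve_auto_init(n_samples, n_features):
    """Hypothetical helper mirroring the documented 'auto' rule:
    'sgd' for large or high-dimensional data, otherwise 'nrkmeans'."""
    if n_samples > 100_000 or n_features > 1_000:
        return "sgd"
    return "nrkmeans"


strategy_small = resolve_auto_init(1_500, 53)
strategy_many_rows = resolve_auto_init(200_000, 10)
strategy_many_cols = resolve_auto_init(500, 2_000)
```

Note that both comparisons are strict, so a data set with exactly 100,000 samples and 1,000 features would still use ‘nrkmeans’ under this reading.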

clustpy.deep.enrc.enrc_predict(z: Tensor, V: Tensor, centers: list, subspace_betas: Tensor, use_P: bool = False) ndarray[source]

Predicts the labels for each clustering of an input z.

Parameters:
  • z (torch.Tensor) – embedded input data point, can also be a mini-batch of embedded points

  • V (torch.Tensor) – orthogonal rotation matrix

  • centers (list) – list of torch.Tensor, cluster centers for each clustering

  • subspace_betas (torch.Tensor) – weights for each dimension per clustering. Calculated via softmax(beta_weights).

  • use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft subspace_beta weights are used (default: False)

Returns:

predicted_labels – n x c matrix, where n is the number of data points in z and c is the number of clusterings.

Return type:

np.ndarray
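A minimal sketch of the soft assignment (use_P=False) might look as follows. It assumes that points and centers are both rotated with V and that each clustering picks the center with the smallest beta-weighted squared distance; this is an illustration of the idea in pure Python, not the library’s tensor implementation, and the name predict_soft is hypothetical.

```python
def predict_soft(z, V, centers, subspace_betas):
    """Hypothetical helper: return an n x c list of labels, one label per
    data point and clustering, using beta-weighted squared distances."""

    def rotate(point):
        d = len(point)
        return [sum(point[i] * V[i][j] for i in range(d)) for j in range(d)]

    labels = []
    for point in z:
        z_rot = rotate(point)
        row = []
        for c, clustering_centers in enumerate(centers):
            costs = []
            for center in clustering_centers:
                center_rot = rotate(center)
                # Each dimension contributes according to its beta weight.
                costs.append(
                    sum(b * (a - m) ** 2
                        for b, a, m in zip(subspace_betas[c], z_rot, center_rot))
                )
            row.append(costs.index(min(costs)))
        labels.append(row)
    return labels


V = [[1.0, 0.0], [0.0, 1.0]]                 # identity rotation for the toy example
centers = [[[0.0, 0.0], [10.0, 0.0]],        # clustering 0 separates along dimension 0
           [[0.0, 0.0], [0.0, 10.0]]]        # clustering 1 separates along dimension 1
subspace_betas = [[0.9, 0.1], [0.1, 0.9]]
labels = predict_soft([[9.0, 1.0], [1.0, 9.0]], V, centers, subspace_betas)
```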

clustpy.deep.enrc.enrc_predict_batchwise(V: Tensor, centers: list, subspace_betas: Tensor, model: Module, dataloader: DataLoader, device: device = device(type='cpu'), use_P: bool = False) ndarray[source]

Predicts the labels for each clustering of a dataloader in a mini-batch manner.

Parameters:
  • V (torch.Tensor) – orthogonal rotation matrix

  • centers (list) – list of torch.Tensor, cluster centers for each clustering

  • subspace_betas (torch.Tensor) – weights for each dimension per clustering. Calculated via softmax(beta_weights).

  • model (torch.nn.Module) – the input model for encoding the data

  • dataloader (torch.utils.data.DataLoader) – dataloader to be used for prediction

  • device (torch.device) – device to be predicted on (default: torch.device(‘cpu’))

  • use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: False)

Returns:

predicted_labels – n x c matrix, where n is the number of data points in the dataloader and c is the number of clusterings.

Return type:

np.ndarray

clustpy.deep.enrc.nrkmeans_init(data: ~numpy.ndarray, n_clusters: list, rounds: int = 10, max_iter: int = 100, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, random_state: ~numpy.random.mtrand.RandomState = None, debug=True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Initialization strategy based on the NrKmeans algorithm. This strategy is preferred for small data sets, but the orthogonality constraint on V, and subsequently on the clustered subspaces, can sometimes be too limiting in practice, e.g., if clusterings are not perfectly non-redundant.

Parameters:
  • data (np.ndarray) – input data

  • n_clusters (list) – list of ints, number of clusters for each clustering

  • rounds (int) – number of repetitions of the NrKmeans algorithm (default: 10)

  • max_iter (int) – maximum number of iterations of NrKmeans (default: 100)

  • input_centers (list) – list of np.ndarray, initial cluster centers if they should be set manually (optional) (default: None)

  • P (list) – list containing projections for each subspace (optional) (default: None)

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • debug (bool) – if True then the cost of each round will be printed (default: True)

Returns:

tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.

Return type:

(list, list, np.ndarray, np.ndarray)

clustpy.deep.enrc.optimal_beta(kmeans_loss: Tensor, other_losses_mean_sum: Tensor) Tensor[source]

Calculate optimal values for the beta weight for each dimension.

Parameters:
  • kmeans_loss (torch.Tensor) – a 1 x d vector of the kmeans losses per dimension.

  • other_losses_mean_sum (torch.Tensor) – a 1 x d vector of the kmeans losses of all other clusterings except the one in ‘kmeans_loss’.

Returns:

optimal_beta_weights – a 1 x d vector containing the optimal weights for the softmax to indicate which dimensions are important for each clustering. Calculated via -torch.log(kmeans_loss/other_losses_mean_sum)

Return type:

torch.Tensor
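The formula stated above, -log(kmeans_loss / other_losses_mean_sum), can be tried out on plain Python lists. The function name optimal_beta_sketch is hypothetical, and the example uses math.log per dimension instead of torch.log on tensors.

```python
import math


def optimal_beta_sketch(kmeans_loss, other_losses_mean_sum):
    """Per-dimension weight: -log(kmeans_loss / other_losses_mean_sum).
    A dimension where this clustering has the lower loss gets a positive weight."""
    return [-math.log(k / o) for k, o in zip(kmeans_loss, other_losses_mean_sum)]


# Dimension 0: this clustering has the lower loss; dimension 1: the others do.
weights = optimal_beta_sketch([1.0, 4.0], [4.0, 1.0])
```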

clustpy.deep.enrc.random_nrkmeans_init(data: ~numpy.ndarray, n_clusters: list, rounds: int = 10, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, random_state: ~numpy.random.mtrand.RandomState = None, debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Initialization strategy based on the NrKmeans Algorithm. For documentation see nrkmeans_init function. Same as nrkmeans_init, but max_iter is set to 1, so the results will be faster and more random.

Parameters:
  • data (np.ndarray) – input data

  • n_clusters (list) – list of ints, number of clusters for each clustering

  • rounds (int) – number of repetitions of the NrKmeans algorithm (default: 10)

  • input_centers (list) – list of np.ndarray, initial cluster centers if they should be set manually (optional) (default: None)

  • P (list) – list containing projections for each subspace (optional) (default: None)

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • debug (bool) – if True then the cost of each round will be printed (default: True)

Returns:

tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.

Return type:

(list, list, np.ndarray, np.ndarray)

clustpy.deep.enrc.reinit_centers(enrc: _ENRC_Module, subspace_id: int, dataloader: DataLoader, model: Module, n_samples: int = 512, kmeans_steps: int = 10, split: str = 'random', debug: bool = False) None[source]

Reinitializes centers that have been lost, i.e., centers that did not get any data points assigned. Before a center is reinitialized, this method checks whether the center has gone without assigned points over several mini-batch iterations; if this count is higher than enrc.reinit_threshold, the center will be reinitialized.

Parameters:
  • enrc (_ENRC_Module) – torch.nn.Module instance for the ENRC algorithm

  • subspace_id (int) – index of the subspace that contains the clusters to be checked

  • dataloader (torch.utils.data.DataLoader) – dataloader from which data is randomly sampled. Important: shuffle=True must be set, because n_samples random samples are drawn.

  • model (torch.nn.Module) – neural network used for the embedding

  • n_samples (int) – number of samples that should be used for the reclustering (default: 512)

  • kmeans_steps (int) – number of mini-batch kmeans steps that should be conducted with the new centroid (default: 10)

  • split (str) – {‘random’, ‘cost’}, selects how clusters should be split for reinitialization. ‘random’ : split a random point from the random sample of size=n_samples. ‘cost’ : split the cluster with max kmeans cost (default: ‘random’)

  • debug (bool) – if True then training errors will be printed (default: False)
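The bookkeeping behind the "lost center" check can be sketched as below. The helper names and the choice to reset a counter as soon as a center receives a point are assumptions for illustration; only the comparison against a reinit threshold is taken from the description above.

```python
def update_lost_counts(lost_counts, assignments, n_centers):
    """Hypothetical helper: increase the counter of every center that received
    no point in this mini-batch and reset it for all others (assumption)."""
    assigned = set(assignments)
    return [0 if k in assigned else lost_counts[k] + 1 for k in range(n_centers)]


def centers_to_reinit(lost_counts, reinit_threshold):
    """Centers whose counter is higher than the threshold are reinitialized."""
    return [k for k, count in enumerate(lost_counts) if count > reinit_threshold]


lost_counts = [0, 0, 0]
# Cluster assignments of three consecutive mini-batches; center 2 is never used.
for batch_assignments in ([0, 0, 1], [0, 1, 1], [1, 0, 0]):
    lost_counts = update_lost_counts(lost_counts, batch_assignments, n_centers=3)
lost = centers_to_reinit(lost_counts, reinit_threshold=2)
```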

clustpy.deep.enrc.sgd_init(data: ~numpy.ndarray, n_clusters: list, optimizer_params: dict, batch_size: int = 128, optimizer_class: ~torch.optim.optimizer.Optimizer = None, rounds: int = 2, epochs: int = 10, random_state: ~numpy.random.mtrand.RandomState = None, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, device: ~torch.device = device(type='cpu'), debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Initialization strategy based on optimizing ENRC’s parameters V and beta in isolation from the neural network using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than nrkmeans_init and only constrains V using the reconstruction error (mean_squared_error), which can be more flexible than the orthogonality constraint of NrKmeans. A drawback of the sgd_init strategy is that it can be less stable for small data sets.

Parameters:
  • data (np.ndarray) – input data

  • n_clusters (list) – list of ints, number of clusters for each clustering

  • optimizer_params (dict) – parameters of the optimizer used to optimize V and beta, includes the learning rate

  • batch_size (int) – size of the data batches (default: 128)

  • optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used (default: None)

  • rounds (int) – number of repetitions of the initialization procedure (default: 2)

  • epochs (int) – number of epochs for the actual clustering procedure (default: 10)

  • random_state (np.random.RandomState) – random state for reproducible results (default: None)

  • input_centers (list) – list of np.ndarray, initial cluster centers if they should be set manually (optional) (default: None)

  • P (list) – list containing projections for each subspace (optional) (default: None)

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • device (torch.device) – device on which to train (default: torch.device(‘cpu’))

  • debug (bool) – if True then the cost of each round will be printed (default: True)

Returns:

tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.

Return type:

(list, list, np.ndarray, np.ndarray)

clustpy.deep.shade module

@authors: Pascal Weber

class clustpy.deep.shade.SHADE(clustering_class: ~sklearn.base.ClusterMixin | None = <class 'clustpy.hierarchical.dctree_clusterer.DCTree_Clusterer'>, clustering_params: dict = None, min_points: int = 5, use_complete_dc_tree: bool = True, use_matrix_dc_distance: bool = True, use_less_memory: bool = False, batch_size: int = 500, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 0, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~typing.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, density_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Structure-preserving High-dimensional Analysis with Density-based Exploration (SHADE) algorithm. A neural network (autoencoder AE) will be trained with the reconstruction loss and the d_dc loss function. Afterward, KMeans or HDBSCAN identifies the initial clusters.

Parameters:
  • clustering_class (ClusterMixin) – clustering class to obtain the cluster labels after getting the embedding (default: DCTree_Clusterer)

  • clustering_params (dict) – parameters for the clustering class. If None, it will be set to {“min_points”: min_points} (default: None)

  • min_points (int) – the minimum number of points (default: 5)

  • use_complete_dc_tree (bool) – Defines whether the complete DC Tree should be used instead of a batch-wise version (default: True)

  • use_matrix_dc_distance (bool) – Defines whether the matrix DC distance should be stored - can cause memory issues (default: True)

  • use_less_memory (bool) – Use less memory when constructing the DCTree. This will, however, increase the runtime (default: False)

  • batch_size (int) – Size of the data batches. (default: 500)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3}. (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network. (default: 0)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • density_loss_weight (float) – weight of the density loss compared to the reconstruction loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers defined as the mean of assigned samples within the AE embedding

Type:

np.ndarray

dc_tree_

The dc tree

Type:

DCTree

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep.shade import SHADE
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> shade = SHADE()
>>> shade.fit(data)

References

Anna Beer, Pascal Weber, Lukas Miklautz, Collin Leiber, Walid Durani, Christian Böhm: SHADE: Deep Density-based Clustering. IEEE International Conference on Data Mining (ICDM), Abu Dhabi, United Arab Emirates, 2024, pp. 675-680, doi: 10.1109/ICDM59182.2024.

fit(X: ndarray, y: ndarray = None) SHADE[source]

Cluster the input dataset with the SHADE algorithm. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – The given data set.

  • y (np.ndarray) – The labels. (can be ignored)

Returns:

self – This instance of the SHADE algorithm.

Return type:

SHADE

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data. Note that this is only a very imprecise estimate, as the DC Tree is not used to predict the labels. The prediction assigns each sample to the closest cluster mean within the embedding of the AE.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray
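The nearest-mean rule described for predict can be illustrated with a small pure-Python sketch. It operates on points that are assumed to be embedded already and uses squared Euclidean distances; the function name nearest_mean_labels is hypothetical and not part of the SHADE class.

```python
def nearest_mean_labels(embedded, cluster_means):
    """Hypothetical helper: assign each embedded point to the index of the
    closest cluster mean (squared Euclidean distance)."""
    labels = []
    for point in embedded:
        dists = [sum((a - b) ** 2 for a, b in zip(point, mean)) for mean in cluster_means]
        labels.append(dists.index(min(dists)))
    return labels


means = [[0.0, 0.0], [5.0, 5.0]]
labels = nearest_mean_labels([[0.5, -0.2], [4.0, 6.0], [2.0, 2.0]], means)
```

Because only the cluster means are consulted, points in low-density regions between clusters still receive a label, which is one reason the estimate can differ from the density-based labels found during fit.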

set_predict_request() SHADE

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

clustpy.deep.vade module

@authors: Donatella Novakovic, Lukas Miklautz, Collin Leiber

class clustpy.deep.vade.VaDE(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = BCELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Variational Deep Embedding (VaDE) algorithm. First, a variational autoencoder (VAE) will be trained (will be skipped if input neural network is given). Afterward, a GMM will be fit to identify the initial clustering structures. Last, the VAE will be optimized using the VaDE loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.BCELoss(reduction=’sum’))

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new VariationalAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (central layer with mean and variance) (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: GaussianMixture)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {“n_init”: 10, “covariance_type”: “diag”} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The labels as identified by a final Gaussian Mixture Model

Type:

np.ndarray

cluster_centers_

The cluster centers as identified by a final Gaussian Mixture Model

Type:

np.ndarray

covariances_

The covariance matrices as identified by a final Gaussian Mixture Model

Type:

np.ndarray

weights_

The weights as identified by a final Gaussian Mixture Model

Type:

np.ndarray

vade_labels_

The labels as identified by VaDE after the training terminated

Type:

np.ndarray

vade_cluster_centers_

The cluster centers as identified by VaDE after the training terminated

Type:

np.ndarray

vade_covariances_

The covariance matrices as identified by VaDE after the training terminated

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> import numpy as np
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep.vade import VaDE
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> data = (data - np.mean(data)) / np.std(data)
>>> vade = VaDE(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> vade.fit(data)

References

Jiang, Zhuxi, et al. “Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering.” IJCAI. 2017.

fit(X: ndarray, y: ndarray = None) VaDE[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the VaDE algorithm

Return type:

VaDE

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

set_predict_request() VaDE

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

Module contents

class clustpy.deep.ACeDeC(n_clusters: int, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 20, init: str = 'acedec', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]

Bases: ENRC

Autoencoder Centroid-based Deep Cluster (ACeDeC) can be seen as a special case of ENRC where we have one cluster space and one shared space with a single cluster.

Parameters:
  • n_clusters (int) – number of clusters

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • P (list) – list containing projections for clusters in clustered space and cluster in shared space (optional) (default: None)

  • input_centers (list) – list containing the cluster centers for clusters in clustered space and cluster in shared space (optional) (default: None)

  • batch_size (int) – size of the data batches (default: 128)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)

  • tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1 - tolerance_threshold) for all clusterings, the training will stop before the maximum number of epochs (clustering_epochs) is reached. A higher value stops the training earlier; if set to 0 or None, the training continues until the labels no longer change (default: None)

  • optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)

  • init (str) – choose which initialization strategy should be used. Has to be one of ‘acedec’, ‘subkmeans’, ‘random’ or ‘sgd’ (default: ‘acedec’)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)

  • scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)

  • init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)

  • init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10,000)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy. (default: True)

  • debug (bool) – if True additional information during the training will be printed (default: False)
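The tolerance_threshold stopping rule can be sketched in plain Python. This is an illustration only, not clustpy's implementation: the helper names nmi and should_stop are hypothetical, the NMI is normalized by the arithmetic mean of the two entropies (an assumption), and only a single clustering is checked rather than all clusterings.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information (assumed: arithmetic-mean normalization)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    entropy_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual_info = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    if entropy_a == 0.0 and entropy_b == 0.0:
        return 1.0  # both labelings consist of a single cluster
    return mutual_info / ((entropy_a + entropy_b) / 2)


def should_stop(old_labels, new_labels, tolerance_threshold=None):
    """Hypothetical helper: True if NMI(old, new) >= 1 - tolerance_threshold."""
    # None and 0 both demand that the partition no longer changes.
    threshold = tolerance_threshold or 0.0
    # The small epsilon guards against rounding when the NMI should be exactly 1.
    return nmi(old_labels, new_labels) >= 1.0 - threshold - 1e-12
```

Because NMI ignores how clusters are numbered, a pure relabeling such as [0, 0, 1, 1] to [1, 1, 0, 0] counts as "labels not changing" in this sketch.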

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural_network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Raises:

ValueError – if init is not one of ‘acedec’, ‘subkmeans’, ‘random’, ‘auto’ or ‘sgd’

References

Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Böhm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832

fit(X: ndarray, y: ndarray = None) ACeDeC[source]

Cluster the input dataset with the ACeDeC algorithm. Saves the labels, centers, V, m, Betas, and P in the ACeDeC object. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – input data

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – returns the ACeDeC object

Return type:

ACeDeC

predict(X: ndarray, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]

Predicts the labels of the input data.

Parameters:
  • X (np.ndarray) – input data

  • use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)

  • dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray
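The difference between the two use_P modes can be pictured with a small plain-Python sketch. It rests on assumptions that this page does not confirm: the input point is taken to be already embedded (and rotated), the soft mode is modelled as a beta-weighted squared distance, and predict_label and squared_distance are hypothetical helpers rather than clustpy functions.

```python
def squared_distance(z, center, weights):
    """Weighted squared Euclidean distance between an embedded point and a center."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, z, center))


def predict_label(z, centers, betas, selected_dims=None):
    """Return the index of the closest center.

    selected_dims mimics the hard selection via P (use_P=True);
    if it is None, the soft beta weights are used instead (use_P=False).
    """
    if selected_dims is not None:
        # Hard selection: a dimension either counts fully or not at all.
        weights = [1.0 if d in selected_dims else 0.0 for d in range(len(z))]
    else:
        # Soft selection: every dimension contributes according to its beta weight.
        weights = betas
    distances = [squared_distance(z, c, weights) for c in centers]
    return distances.index(min(distances))


z = [0.0, 1.0]
centers = [[0.5, 1.0], [0.0, 3.0]]
betas = [0.9, 0.1]
hard_label = predict_label(z, centers, betas, selected_dims=[0])
soft_label = predict_label(z, centers, betas)
```

In this toy setup the two modes disagree: the hard selection only looks at dimension 0, while the soft weights still let dimension 1 influence the result.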

set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ACeDeC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:
  • dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.

  • use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.AEC(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = None, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Auto-encoder Based Data Clustering (AEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, the initial clusters are identified by initial_clustering_class (random labels are used if it is None). Last, the network will be optimized using the AEC loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining. If this is None, random labels will be used (default: None)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import AEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> aec = AEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> aec.fit(data)

References

Song, Chunfeng, et al. “Auto-encoder based data clustering.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 18th Iberoamerican Congress, CIARP 2013, Havana, Cuba, November 20-23, 2013, Proceedings, Part I 18. Springer Berlin Heidelberg, 2013.

fit(X: ndarray, y: ndarray = None) AEC[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the AEC algorithm

Return type:

AEC

set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') AEC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.DCN(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Clustering Network (DCN) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DCN loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
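How clustering_loss_weight and ssl_loss_weight interact can be pictured as a weighted sum of two loss terms. The sketch below is an assumption made for illustration, not the DCN implementation: combined_loss is a hypothetical helper, the self-supervised term is a plain reconstruction MSE, and the clustering term is the mean squared distance between each embedded sample and its assigned center.

```python
def mean_squared_error(batch_a, batch_b):
    """Mean of the squared element-wise differences of two equally shaped batches."""
    total, count = 0.0, 0
    for row_a, row_b in zip(batch_a, batch_b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def combined_loss(batch, reconstruction, embedded, centers, assignments,
                  clustering_loss_weight=0.1, ssl_loss_weight=1.0):
    """Hypothetical weighted sum of a reconstruction term and a clustering term."""
    ssl_loss = mean_squared_error(batch, reconstruction)
    # Pull every embedded sample towards the center it is currently assigned to.
    assigned_centers = [centers[k] for k in assignments]
    clustering_loss = mean_squared_error(embedded, assigned_centers)
    return ssl_loss_weight * ssl_loss + clustering_loss_weight * clustering_loss


loss = combined_loss(
    batch=[[1.0, 2.0]],
    reconstruction=[[1.0, 4.0]],
    embedded=[[0.0]],
    centers=[[1.0], [5.0]],
    assignments=[0],
)
```

With the default weights, the reconstruction term dominates; raising clustering_loss_weight shifts the balance towards compact clusters in the embedding.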

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dcn_labels_

The final DCN labels

Type:

np.ndarray

dcn_cluster_centers_

The final DCN cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DCN
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dcn = DCN(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dcn.fit(data)

References

Yang, Bo, et al. “Towards k-means-friendly spaces: Simultaneous deep learning and clustering.” International Conference on Machine Learning. PMLR, 2017.

fit(X: ndarray, y: ndarray = None) DCN[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DCN algorithm

Return type:

DCN

set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') DCN

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.DDC(ratio: float = 0.1, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, tsne_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Density-based Image Clustering (DDC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, t-SNE is executed on the embedded data and a variant of the Density Peak Clustering algorithm is executed.

Parameters:
  • ratio (float) – The ratio parameter, defining the cutoff distance d_c by calculating: average pairwise distance * ratio (default: 0.1)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • tsne_params (dict) – Parameters for the t-SNE execution. For example, perplexity can be changed by setting tsne_params to {“n_components”: 2, “perplexity”: 25}. Check out sklearn.manifold.TSNE for more information. If None, it will be set to {“n_components”: 2} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
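The role of ratio can be illustrated with a few lines of plain Python that compute d_c as the average pairwise distance times ratio. This is a sketch under assumptions: cutoff_distance and euclidean are hypothetical helpers, each unordered pair is counted once, and self-distances are excluded from the average.

```python
import itertools
import math


def euclidean(p, q):
    """Euclidean distance between two points given as sequences of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cutoff_distance(points, ratio=0.1):
    """d_c = (average pairwise distance) * ratio."""
    pair_distances = [euclidean(p, q) for p, q in itertools.combinations(points, 2)]
    return sum(pair_distances) / len(pair_distances) * ratio


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d_c = cutoff_distance(points, ratio=0.1)
```

For these three points the pairwise distances are 5, 10 and 5, so the average is 20/3 and a ratio of 0.1 yields a cutoff of roughly 0.667. A larger ratio widens the neighborhood used for the density estimate.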

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels (obtained by a variant of Density Peak Clustering)

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

tsne_

The t-SNE object

Type:

TSNE

n_features_in_

the number of features used for the fitting

Type:

int

cluster_centers_

The final cluster centers defined as the mean of assigned samples within the AE embedding

Type:

np.ndarray

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DDC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> ddc = DDC(pretrain_epochs=3)
>>> ddc.fit(data)

References

Ren, Yazhou, et al. “Deep density-based image clustering.” Knowledge-Based Systems 197 (2020): 105841.

fit(X: ndarray, y: ndarray = None) DDC[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DDC algorithm

Return type:

DDC

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data. Note that this is only a rough estimate, because the manifold learner does not learn a function f() that maps new data into the final embedding. The prediction is therefore obtained by assigning each sample to the cluster whose mean, computed over the cluster's samples within the embedding of the AE, is closest.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray
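The nearest-mean rule that predict describes can be sketched in plain Python. The helpers cluster_means and predict_nearest_mean are hypothetical, and the inputs are assumed to be samples that have already been passed through the encoder of the AE.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as sequences of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_means(embedded, labels):
    """Mean of the embedded samples assigned to each cluster label."""
    sums, counts = {}, {}
    for point, label in zip(embedded, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for d, value in enumerate(point):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_nearest_mean(embedded_new, means):
    """Assign each new embedded sample to the label of the closest cluster mean."""
    return [
        min(means, key=lambda label: euclidean(point, means[label]))
        for point in embedded_new
    ]


embedded = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
labels = [0, 0, 1, 1]
means = cluster_means(embedded, labels)
predicted = predict_nearest_mean([[1.0, 1.0], [9.0, 9.0]], means)
```

Since density-based clusters need not be convex, a sample can be closer to the mean of a neighboring cluster than to its own, which is one reason the estimate is imprecise.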

set_predict_request() DDC

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.DEC(n_clusters: int = 8, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedded Clustering (DEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DEC loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)

  • alpha (float) – alpha value for the prediction (default: 1.0)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
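The parameter alpha is described here only as the “alpha value for the prediction”. In the DEC paper listed under References, alpha is the degrees of freedom of the Student's t kernel that turns distances between embedded samples and cluster centers into soft assignments. A minimal plain-Python sketch of that kernel, assuming this is how alpha is used (soft_assignments is a hypothetical helper, not a clustpy function):

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def soft_assignments(embedded, centers, alpha=1.0):
    """Student's t soft assignment q_ij of every embedded sample i to every center j."""
    assignments = []
    for point in embedded:
        kernel = [
            (1.0 + squared_distance(point, center) / alpha) ** (-(alpha + 1.0) / 2.0)
            for center in centers
        ]
        total = sum(kernel)
        # Normalize so that the assignments of one sample sum to 1.
        assignments.append([k / total for k in kernel])
    return assignments


q = soft_assignments([(0.0, 0.0)], [(0.0, 0.0), (1.0, 0.0)], alpha=1.0)
hard_label = q[0].index(max(q[0]))
```

For alpha = 1.0 the sample sitting on the first center gets kernel values 1 and 0.5, which normalize to 2/3 and 1/3; taking the argmax of a row gives a hard label.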

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dec_labels_

The final DEC labels

Type:

np.ndarray

dec_cluster_centers_

The final DEC cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dec = DEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dec.fit(data)

References

Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis.” International conference on machine learning. 2016.

fit(X: ndarray, y: ndarray = None) DEC[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DEC algorithm

Return type:

DEC

set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') DEC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.DEN(n_clusters: int = 8, group_size: int | list | None = 2, n_neighbors: int = 5, weight_locality_constraint: float = 0.5, weight_sparsity_constraint: float = 1.0, heat_kernel_t_parameter: float = 1.0, group_lasso_lambda_parameter: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int | None = None, custom_dataloaders: tuple = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedding Network (DEN) algorithm. It trains a neural network by optimizing a loss function consisting of three components. These are (1) the standard loss function of the neural network (e.g. reconstruction loss for autoencoders), (2) the locality-preserving constraint and (3) the group sparsity constraint. Finally, k-Means is executed in the resulting embedding.

Parameters:
  • n_clusters (int) – number of clusters (default: 8)

  • group_size (int | list) – the number of features in each group. Can also be a list, specifying the size of each group separately. Can be None if embedding_size is specified (default: 2)

  • n_neighbors (int) – the number of nearest-neighbors (including itself) for the locality-preserving constraint. Nearest-neighbors will be calculated by using the Euclidean distance. If another distance should be used to define the nearest-neighbors, the neighbors can be included in the custom_dataloader as additional_inputs. In this case, it is expected that the trainloader is composed of: (sample_ids, original_samples, 1st-NNs, 2nd-NNs, …, (n_neighbors-1)-NNs) (default: 5)

  • weight_locality_constraint (float) – weight alpha for the locality-preserving constraint (default: 0.5)

  • weight_sparsity_constraint (float) – weight beta for the group sparsity constraint (default: 1.)

  • heat_kernel_t_parameter (float) – the t parameter for the heat kernel included in the locality-preserving constraint (default: 1.)

  • group_lasso_lambda_parameter (float) – the lambda parameter for the group lasso included in the group sparsity constraint (default: 1.)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: None)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
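
The two constraint terms weighted by weight_locality_constraint and weight_sparsity_constraint can be illustrated with a small sketch. It assumes the formulation of the referenced Huang et al. paper as commonly stated: heat-kernel weights exp(-||x_i - x_j||^2 / t) for the locality-preserving constraint and a group lasso with per-group weight lambda * sqrt(group size) for the group sparsity constraint. All function names are hypothetical, and this is not the clustpy implementation.

```python
import math


def heat_kernel_weight(x_i, x_j, t=1.0):
    # Similarity of two input samples: exp(-||x_i - x_j||^2 / t)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / t)


def locality_term(x, z, neighbors, t=1.0):
    # Sum over samples i and their neighbors j of S_ij * ||z_i - z_j||^2,
    # where x holds the input samples and z the embedded samples
    total = 0.0
    for i, neighbor_ids in enumerate(neighbors):
        for j in neighbor_ids:
            sq_dist_emb = sum((a - b) ** 2 for a, b in zip(z[i], z[j]))
            total += heat_kernel_weight(x[i], x[j], t) * sq_dist_emb
    return total


def group_sparsity_term(z, group_size=2, lam=1.0):
    # Group lasso: sum over samples and groups of lam * sqrt(n_g) * ||z_g||_2
    total = 0.0
    for z_i in z:
        for start in range(0, len(z_i), group_size):
            group = z_i[start:start + group_size]
            norm = math.sqrt(sum(v ** 2 for v in group))
            total += lam * math.sqrt(len(group)) * norm
    return total


# Toy inputs and toy embeddings (embedding size 4, i.e. two groups of size 2)
x = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
z = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0], [3.0, 4.0, 0.0, 0.0]]
# One neighbor per sample; the sample itself is left out because it adds zero
neighbors = [[1], [0], [1]]

e_locality = locality_term(x, z, neighbors, t=1.0)
e_sparsity = group_sparsity_term(z, group_size=2, lam=1.0)
# Weighted sum with alpha = 0.5 and beta = 1.0 (the documented defaults)
loss_constraints = 0.5 * e_locality + 1.0 * e_sparsity
```

In this sketch, a larger heat_kernel_t_parameter flattens the neighbor weights, while a larger group_lasso_lambda_parameter pushes whole groups of embedding features toward zero.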

labels_

The final labels (obtained by KMeans)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by KMeans)

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DEN
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> den = DEN(n_clusters=3, pretrain_epochs=3)
>>> den.fit(data)

References

Huang, Peihao, et al. “Deep embedding network for clustering.” 2014 22nd International conference on pattern recognition. IEEE, 2014.

fit(X: ndarray, y: ndarray = None) DEN[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DEN algorithm

Return type:

DEN

set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') DEN

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.DKM(n_clusters: int = 8, alphas: tuple = 1000, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep k-Means (DKM) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DKM loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)

  • alphas (tuple) – tuple of different alpha values used for the prediction. Small values close to 0 are equivalent to homogeneous assignments to all clusters. Large values simulate a clear assignment as with kMeans. If None, the default calculation of the paper will be used. This is equal to alpha_{i+1}=2^{1/log(i)^2}*alpha_i with alpha_1=0.1 and maximum i=40. Alpha can also be a tuple with (None, alpha_1, maximum i) (default: (1000))

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for each alpha value for the actual clustering procedure. The total number of clustering epochs therefore corresponds to: len(alphas)*clustering_epochs (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
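
The role of alphas can be sketched as follows. The first function reproduces the schedule alpha_{i+1}=2^{1/log(i)^2}*alpha_i that is used when alphas is None; it assumes a natural logarithm and shifts the index by one so that the first step does not divide by log(1)=0. The second function assumes the soft assignment is a softmax over the negative, alpha-scaled squared distances to the cluster centers, as in the referenced paper. Both function names are hypothetical and this is not the clustpy implementation.

```python
import math


def default_alphas(init_alpha=0.1, n_alphas=40):
    # alpha_{i+1} = 2^(1 / log(i)^2) * alpha_i; log(i + 1) is used here so the
    # first step avoids log(1) = 0 (an assumption of this sketch)
    alphas = [init_alpha]
    for i in range(1, n_alphas):
        alphas.append(2 ** (1 / math.log(i + 1) ** 2) * alphas[-1])
    return alphas


def soft_assignments(sq_distances, alpha):
    # Softmax over -alpha * squared distance to each cluster center;
    # subtracting the minimum distance keeps the exponentials stable
    closest = min(sq_distances)
    exps = [math.exp(-alpha * (d - closest)) for d in sq_distances]
    total = sum(exps)
    return [e / total for e in exps]


alphas = default_alphas()
# Each alpha value is trained for clustering_epochs epochs (default: 150)
total_clustering_epochs = len(alphas) * 150

sq_distances = [1.0, 2.0, 4.0]  # one sample, three cluster centers
near_uniform = soft_assignments(sq_distances, alpha=1e-6)  # roughly 1/3 each
near_hard = soft_assignments(sq_distances, alpha=1000)     # closest center dominates
```

With the default alphas=(1000) only a single alpha value is used, so the total number of clustering epochs equals clustering_epochs.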

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dkm_labels_

The final DKM labels

Type:

np.ndarray

dkm_cluster_centers_

The final DKM cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DKM
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dkm = DKM(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dkm.fit(data)

References

Fard, Maziar Moradi, Thibaut Thonet, and Eric Gaussier. “Deep k-means: Jointly clustering with k-means and learning representations.” Pattern Recognition Letters 138 (2020): 185-192.

fit(X: ndarray, y: ndarray = None) DKM[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DKM algorithm

Return type:

DKM

set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') DKM

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.DeepECT(max_n_leaf_nodes: int = 20, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, grow_interval: int = 2, pruning_threshold: float = 0.1, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedded Cluster Tree (DeepECT) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, a cluster tree will be grown and the network will be optimized using the DeepECT loss function.

Parameters:
  • max_n_leaf_nodes (int) – Maximum number of leaf nodes in the cluster tree (default: 20)

  • batch_size (int) – Size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – Number of epochs for the actual clustering procedure (default: 150)

  • grow_interval (int) – Number of epochs after which the tree is grown (default: 2)

  • pruning_threshold (float) – The threshold for pruning the tree (default: 0.1)

  • optimizer_class (torch.optim.Optimizer) – The optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – Size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – Use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

tree_

The prediction cluster tree after training

Type:

PredictionClusterTree

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

References

Mautz, Dominik, Claudia Plant, and Christian Böhm. “Deep embedded cluster tree.” 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 2019.

fit(X: ndarray, y: ndarray = None) DeepECT[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – This instance of the DeepECT algorithm

Return type:

DeepECT

flat_clustering(n_leaf_nodes_to_keep: int) ndarray[source]

Transform the predicted labels into a flat clustering result by only keeping n_leaf_nodes_to_keep leaf nodes in the tree. Returns labels as if the clustering procedure had stopped at the specified number of leaf nodes. Note that each leaf node corresponds to a cluster.

Parameters:

n_leaf_nodes_to_keep (int) – The number of leaf nodes to keep in the cluster tree

Returns:

labels_pruned – The new cluster labels

Return type:

np.ndarray

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

set_predict_request() DeepECT

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.DipDECK(n_clusters_init: int = 35, dip_merge_threshold: float = 0.9, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, max_n_clusters: int = inf, min_n_clusters: int = 1, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 5, max_cluster_size_diff_factor: float = 2, pval_strategy: str = 'table', n_boots: int = 1000, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None, debug: bool = False)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedded Clustering with k-Estimation (DipDECK) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters using an overestimated number of clusters. Last, the network will be optimized using the DipDECK loss function. If any Dip-value exceeds the dip_merge_threshold, the corresponding clusters will be merged.

Parameters:
  • n_clusters_init (int) – initial number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 35)

  • dip_merge_threshold (float) – threshold regarding the Dip-p-value that defines if two clusters should be merged. Must be between 0 and 1 (default: 0.9)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • max_n_clusters (int) – maximum number of clusters. Must be larger than min_n_clusters. If the result has more clusters, a merge will be forced (default: np.inf)

  • min_n_clusters (int) – minimum number of clusters. Must be larger than 0, smaller than max_n_clusters and smaller than n_clusters_init. When this number of clusters is reached, all further merge processes will be hindered (default: 1)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure. Will reset after each merge (default: 50)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 5)

  • max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for the Dip calculation (default: 2)

  • pval_strategy (str) – Defines which strategy to use to receive dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)

  • n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • debug (bool) – If True, additional information will be printed to the console (default: False)
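
How dip_merge_threshold, min_n_clusters and max_n_clusters interact can be sketched with a simplified merge decision. The sketch assumes that the pair of clusters with the highest Dip-p-value is the merge candidate; the function name and the matrix layout are hypothetical, and the actual implementation may differ in its details.

```python
import math


def next_merge(dip_p_values, dip_merge_threshold=0.9,
               min_n_clusters=1, max_n_clusters=math.inf):
    # dip_p_values: symmetric matrix (list of lists) of pairwise Dip-p-values
    n_clusters = len(dip_p_values)
    if n_clusters <= min_n_clusters or n_clusters < 2:
        return None  # minimum reached: further merges are hindered
    # Find the pair of clusters with the highest Dip-p-value
    best_pair, best_p = None, -1.0
    for i in range(n_clusters):
        for j in range(i + 1, n_clusters):
            if dip_p_values[i][j] > best_p:
                best_pair, best_p = (i, j), dip_p_values[i][j]
    # Merge if the threshold is exceeded, or force a merge above max_n_clusters
    if best_p > dip_merge_threshold or n_clusters > max_n_clusters:
        return best_pair
    return None


p_values = [[1.0, 0.95, 0.1],
            [0.95, 1.0, 0.3],
            [0.1, 0.3, 1.0]]

regular_merge = next_merge(p_values)                           # (0, 1)
no_merge = next_merge(p_values, dip_merge_threshold=0.99)      # None
forced_merge = next_merge(p_values, dip_merge_threshold=0.99,
                          max_n_clusters=2)                    # (0, 1)
blocked_merge = next_merge(p_values, min_n_clusters=3)         # None
```

In the sketch, a lower dip_merge_threshold leads to more merges and therefore fewer final clusters, while min_n_clusters and max_n_clusters bound the result from both sides.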

labels_

The final labels

Type:

np.ndarray

n_clusters_

The final number of clusters

Type:

int

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipDECK
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipdeck = DipDECK(pretrain_epochs=3, clustering_epochs=3)
>>> dipdeck.fit(data)

References

Leiber, Collin, et al. “Dip-based deep embedded clustering with k-estimation.” Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021.

fit(X: ndarray, y: ndarray = None) DipDECK[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DipDECK algorithm

Return type:

DipDECK

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

set_predict_request() DipDECK

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.DipEncoder(n_clusters: int = 8, batch_size: int = None, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, max_cluster_size_diff_factor: float = 3, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The DipEncoder. Can be used either as a clustering procedure if no ground truth labels are given or as a supervised dimensionality reduction technique. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DipEncoder loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)

  • batch_size (int) – size of the data batches for the actual training of the DipEncoder. Should be larger the more clusters we have. If it is None, it will be set to (25 x n_clusters) (default: None)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used (default: 3)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss. If None, it will be equal to 1/(4L), where L is the reconstruction loss of the first batch of an untrained neural network (default: None)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
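The default rules stated in the parameter list for batch_size, ssl_loss_weight and max_cluster_size_diff_factor can be sketched in plain Python. This is only an illustration of the documented rules, not the clustpy implementation; the helper names resolve_batch_size, resolve_ssl_loss_weight and max_samples_from_larger_cluster are hypothetical.

```python
def resolve_batch_size(batch_size, n_clusters):
    """Return 25 * n_clusters when batch_size is None (documented default rule)."""
    return 25 * n_clusters if batch_size is None else batch_size


def resolve_ssl_loss_weight(ssl_loss_weight, first_batch_loss):
    """Return 1 / (4 * L) when ssl_loss_weight is None.

    first_batch_loss plays the role of L, the reconstruction loss of the
    first batch of an untrained network (here simply passed in as a number).
    """
    if ssl_loss_weight is None:
        return 1.0 / (4.0 * first_batch_loss)
    return ssl_loss_weight


def max_samples_from_larger_cluster(size_a, size_b, max_cluster_size_diff_factor=3):
    """Cap how many samples of the larger cluster are used in a pairwise comparison."""
    smaller, larger = sorted((size_a, size_b))
    return min(larger, int(max_cluster_size_diff_factor * smaller))


batch_size = resolve_batch_size(None, n_clusters=3)
ssl_weight = resolve_ssl_loss_weight(None, first_batch_loss=0.5)
capped = max_samples_from_larger_cluster(100, 1000)
```

With three clusters the batch size resolves to 75, and a cluster of 1000 samples compared against one of 100 samples is capped at 300 samples under the default factor of 3.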

labels_

The final labels

Type:

np.ndarray

projection_axes_

The final projection axes between the clusters

Type:

np.ndarray

index_dict_

A dictionary to match the indices of two clusters to a projection axis

Type:

dict

projection_thresholds_

A list containing the thresholds for each projection axis and a tuple indicating which cluster is left and right of the threshold

Type:

list

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipEncoder
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipencoder = DipEncoder(3, pretrain_epochs=3, clustering_epochs=3)
>>> dipencoder.fit(data)

References

Leiber, Collin and Bauer, Lena G. M. and Neumayr, Michael and Plant, Claudia and Böhm, Christian “The DipEncoder: Enforcing Multimodality in Autoencoders.” Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2022.

fit(X: ndarray, y: ndarray = None) DipEncoder[source]

Initiate the actual clustering/dimensionality reduction process on the input data set. If no ground truth labels are given, the resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – The given (training) data set

  • y (np.ndarray) – The ground truth labels. If None, the DipEncoder will be used for clustering (default: None)

Returns:

self – This instance of the DipEncoder

Return type:

DipEncoder

plot(X: ndarray, edge_width: float = 0.2, show_legend: bool = True) None[source]

Plot the current state of the DipEncoder. First the data set will be encoded using the neural network, afterwards the plot will be created. Uses the plot_scatter_matrix as a basis and adds projection axes in red.

Parameters:
  • X (np.ndarray) – The data set

  • edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots

  • show_legend (bool) – Specifies whether a legend should be added to the plot

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

set_predict_request() DipEncoder

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.ENRC(n_clusters: list, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 20, init: str = 'nrkmeans', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]

Bases: _AbstractDeepClusteringAlgo

The Embedded Non-Redundant Clustering (ENRC) algorithm.

Parameters:
  • n_clusters (list) – list containing number of clusters for each clustering

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • P (list) – list containing projections for each clustering (optional) (default: None)

  • input_centers (list) – list containing the cluster centers for each clustering (optional) (default: None)

  • batch_size (int) – size of the data batches (default: 128)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)

  • tolerance_threshold (float) – tolerance threshold that determines when training stops. If NMI(old_labels, new_labels) >= (1 - tolerance_threshold) for all clusterings, training stops before clustering_epochs is reached. A higher value stops training earlier; if set to 0 or None, training continues until the labels no longer change or clustering_epochs is reached (default: None)

  • optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)

  • init (str) – choose which initialization strategy should be used. Has to be one of ‘nrkmeans’, ‘random’ or ‘sgd’ (default: ‘nrkmeans’)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)

  • scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)

  • init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)

  • init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used (default: 10000)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy (default: True)

  • debug (bool) – if True additional information during the training will be printed (default: False)
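The stopping rule described for tolerance_threshold can be sketched as follows. This is a minimal illustration in plain Python, not the ENRC code: the function names are hypothetical, the NMI is normalized by the arithmetic mean of the two entropies (an assumption), and a small epsilon absorbs floating-point rounding.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same samples."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denom = 0.5 * (entropy(labels_a) + entropy(labels_b))
    return 1.0 if denom == 0 else mutual_info / denom


def should_stop(old_labels, new_labels, tolerance_threshold, eps=1e-12):
    """Return True if every clustering satisfies NMI >= 1 - tolerance_threshold.

    old_labels and new_labels each hold one label list per clustering.
    A threshold of None is treated like 0.
    """
    tol = 0.0 if tolerance_threshold is None else tolerance_threshold
    return all(
        nmi(old, new) >= 1.0 - tol - eps
        for old, new in zip(old_labels, new_labels)
    )


# Same partition with swapped label names: NMI is 1, so training would stop.
stop_same = should_stop([[0, 0, 1, 1]], [[1, 1, 0, 0]], None)
# Unrelated partitions: NMI is 0, so training would continue.
stop_diff = should_stop([[0, 0, 1, 1]], [[0, 1, 0, 1]], 0.1)
```

Because NMI is invariant to renaming the cluster ids, only genuine changes in the partition keep the training running.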

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Raises:

ValueError – if init is not one of ‘nrkmeans’, ‘random’, ‘auto’ or ‘sgd’.

References

Miklautz, Lukas & Dominik Mautz et al. “Deep embedded non-redundant clustering.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.

fit(X: ndarray, y: ndarray = None) ENRC[source]

Cluster the input dataset with the ENRC algorithm. Saves the labels, centers, V, m, Betas, and P in the ENRC object. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – input data

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – returns the ENRC object

Return type:

ENRC

plot_subspace(X: ndarray, subspace_index: int = 0, labels: ndarray = None, plot_centers: bool = False, gt: ndarray = None, equal_axis: bool = False) None[source]

Plot the specified subspace as a scatter matrix plot.

Parameters:
  • X (np.ndarray) – input data

  • subspace_index (int) – index of the subspace (default: 0)

  • labels (np.ndarray) – the labels to use for the plot. If None, the labels found by Nr-Kmeans are used (default: None)

  • plot_centers (bool) – plot centers if True (default: False)

  • gt (np.ndarray) – ground truth labels (default: None)

  • equal_axis (bool) – equalize axis if True (default: False)

Return type:

None

predict(X: ndarray = None, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]

Predicts the labels for each clustering of X in a mini-batch manner.

Parameters:
  • X (np.ndarray) – input data

  • use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)

  • dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)

Returns:

predicted_labels – n x c matrix, where n is the number of data points in X and c is the number of clusterings.

Return type:

np.ndarray
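To illustrate the shape of the returned matrix and the difference between hard selection with P and soft beta weights, here is a plain-Python sketch. It assumes the data has already been embedded and rotated, that P[c] lists the dimensions belonging to clustering c, and that betas[c] holds one weight per dimension; the function predict_labels and its arguments are hypothetical and do not mirror the internals of ENRC.predict.

```python
def predict_labels(rotated, centers, P=None, betas=None):
    """Return an n x c label matrix (n samples, c clusterings).

    rotated: n points in the rotated embedded space (list of lists)
    centers: centers[c][k] is the k-th center of clustering c (full dimensionality)
    P:       P[c] lists the dimensions assigned to clustering c (hard selection)
    betas:   betas[c][d] is the soft weight of dimension d for clustering c
    """
    n_dims = len(rotated[0])
    labels = []
    for point in rotated:
        row = []
        for c, clustering_centers in enumerate(centers):
            if P is not None:
                # Hard selection: only dimensions listed in P[c] count.
                weights = [1.0 if d in P[c] else 0.0 for d in range(n_dims)]
            else:
                # Soft selection: every dimension counts according to its beta.
                weights = betas[c]
            dists = [
                sum(w * (x - m) ** 2 for w, x, m in zip(weights, point, center))
                for center in clustering_centers
            ]
            row.append(dists.index(min(dists)))
        labels.append(row)
    return labels


rotated = [[0.5, 3.0, 4.0], [9.0, -2.0, -6.0], [1.0, 0.0, -4.0]]
centers = [
    [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]],   # clustering 0 separates along dim 0
    [[0.0, 0.0, 5.0], [0.0, 0.0, -5.0]],   # clustering 1 separates along dim 2
]
hard = predict_labels(rotated, centers, P=[[0], [2]])
soft = predict_labels(rotated, centers, betas=[[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]])
```

Each row of the result holds one label per clustering, so three samples and two clusterings give a 3 x 2 matrix.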

reconstruct_subspace_centroids(subspace_index: int = 0) ndarray[source]

Reconstructs the centroids in the specified subspace.

Parameters:

subspace_index (int) – index of the subspace (default: 0)

Returns:

centers_rec – The reconstructed centers

Return type:

np.ndarray

set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ENRC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:
  • dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.

  • use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.

Returns:

self – The updated object.

Return type:

object

transform_full_space(X: ndarray, embedded=False) ndarray[source]

Embeds the input dataset with the neural network and the matrix V from the ENRC object.

Parameters:
  • X (np.ndarray) – input data

  • embedded (bool) – if True, then X is assumed to be already embedded (default: False)

Returns:

rotated – The transformed data

Return type:

np.ndarray

transform_subspace(X: ndarray, subspace_index: int = 0, embedded: bool = False) ndarray[source]

Embeds the input dataset with the neural network and the matrix V, projected onto the specified subspace.

Parameters:
  • X (np.ndarray) – input data

  • subspace_index (int) – index of the subspace (default: 0)

  • embedded (bool) – if True, then X is assumed to be already embedded (default: False)

Returns:

subspace – The transformed subspace

Return type:

np.ndarray
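The relationship between transform_full_space and transform_subspace can be pictured with a small plain-Python sketch. It assumes that the full-space transform multiplies the embedded rows by V and that the subspace transform then keeps only the dimensions listed in P[subspace_index]; the helpers rotate and project are hypothetical, and the exact conventions used inside ENRC may differ.

```python
def rotate(embedded, V):
    """Multiply each embedded row vector by the rotation matrix V (Z @ V)."""
    n_cols = len(V[0])
    return [
        [sum(z[k] * V[k][j] for k in range(len(V))) for j in range(n_cols)]
        for z in embedded
    ]


def project(rotated, dims):
    """Keep only the dimensions that belong to one subspace."""
    return [[row[d] for d in dims] for row in rotated]


# Orthogonal matrix rotating the first two axes by 90 degrees.
V = [[0, 1, 0],
     [-1, 0, 0],
     [0, 0, 1]]
P = [[0, 1], [2]]           # dimensions assigned to subspace 0 and subspace 1
embedded = [[1, 2, 3]]      # stands in for the output of the encoder

full_space = rotate(embedded, V)
subspace_0 = project(full_space, P[0])
subspace_1 = project(full_space, P[1])
```

Passing embedded=True to the documented methods corresponds to starting from the already encoded data, as the sketch does.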

class clustpy.deep.IDEC(n_clusters: int = 8, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: DEC

The Improved Deep Embedded Clustering (IDEC) algorithm. Identical to the DEC algorithm, except that the self-supervised learning loss is also used during the clustering optimization. In addition, the default clustering_loss_weight is 0.1 instead of 1.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)

  • alpha (float) – alpha value for the prediction (default: 1.0)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 0.1)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
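The roles of alpha, clustering_loss_weight and ssl_loss_weight can be sketched with the formulation from the DEC and IDEC papers: a Student's t kernel yields soft assignments, a sharpened target distribution is derived from them, and the KL divergence between both is combined with the self-supervised loss. The plain-Python functions below are illustrative, their names are hypothetical, and averaging the KL divergence over samples is an assumption; they are not taken from clustpy.

```python
from math import log


def soft_assignments(embedded, centers, alpha=1.0):
    """Student's t soft assignment q_ij of each embedded point to each center."""
    q = []
    for z in embedded:
        kernel = [
            (1.0 + sum((a - b) ** 2 for a, b in zip(z, mu)) / alpha)
            ** (-(alpha + 1.0) / 2.0)
            for mu in centers
        ]
        total = sum(kernel)
        q.append([k / total for k in kernel])
    return q


def target_distribution(q):
    """Sharpened targets p_ij proportional to q_ij^2 / (soft cluster frequency)."""
    freq = [sum(col) for col in zip(*q)]
    p = []
    for row in q:
        weighted = [v ** 2 / f for v, f in zip(row, freq)]
        total = sum(weighted)
        p.append([w / total for w in weighted])
    return p


def kl_divergence(p, q):
    """KL(P || Q), averaged over samples."""
    return sum(
        pv * log(pv / qv) for pr, qr in zip(p, q) for pv, qv in zip(pr, qr)
    ) / len(p)


def idec_loss(ssl_loss, clustering_loss, ssl_loss_weight=1.0, clustering_loss_weight=0.1):
    """Weighted sum of the self-supervised loss and the clustering loss."""
    return ssl_loss_weight * ssl_loss + clustering_loss_weight * clustering_loss


embedded = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
centers = [[0.0, 0.0], [4.0, 4.0]]
q = soft_assignments(embedded, centers, alpha=1.0)
p = target_distribution(q)
total = idec_loss(ssl_loss=2.0, clustering_loss=kl_divergence(p, q))
```

Setting ssl_loss_weight to 0 in this sketch leaves only the clustering term, which corresponds to the DEC objective.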

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dec_labels_

The final DEC labels

Type:

np.ndarray

dec_cluster_centers_

The final DEC cluster centers

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import IDEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> idec = IDEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> idec.fit(data)

References

Guo, Xifeng, et al. “Improved deep embedded clustering with local structure preservation.” IJCAI. 2017.

set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') IDEC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.N2D(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, manifold_class: ~sklearn.base.TransformerMixin = <class 'sklearn.manifold._t_sne.TSNE'>, manifold_params: dict = None, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Not 2 Deep (N2D) clustering algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, a manifold learner (e.g. t-SNE, UMAP or ISOMAP) is applied to the embedded data, and the resulting representation is clustered with the EM algorithm (GMM).

Parameters:
  • n_clusters (int) – number of clusters (default: 8)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • manifold_class (TransformerMixin) – the manifold technique class (default: TSNE)

  • manifold_params (dict) – Parameters for the manifold execution. For example, perplexity can be changed for TSNE by setting manifold_params to {“n_components”: 2, “perplexity”: 25}. Check out e.g. sklearn.manifold.TSNE for more information. If None, it will be set to {“n_components”: n_clusters} (default: None)

  • initial_clustering_params (dict) – parameters for the GMM clustering class. If None, it will be set to {} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels

Type:

np.ndarray

cluster_centers_manifold_

The final cluster centers within the embedding of the manifold

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

manifold_

The manifold object

Type:

TransformerMixin

n_features_in_

the number of features used for the fitting

Type:

int

cluster_centers_

The final cluster centers defined as the mean of assigned samples within the AE embedding

Type:

np.ndarray

References

McConville, Ryan, et al. “N2d:(not too) deep clustering via clustering the local manifold of an autoencoded embedding.” 2020 25th international conference on pattern recognition (ICPR). IEEE, 2021.

fit(X: ndarray, y: ndarray = None) N2D[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the N2D algorithm

Return type:

N2D

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data. Note that this is just a very imprecise estimation as the manifold does not learn a function f() to map the data into the final embedding. Therefore, the prediction is calculated by checking the distance to the closest mean of samples in a cluster within the embedding of the AE.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray
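The nearest-mean rule described for predict can be sketched in plain Python. The example assumes the samples are already encoded by the autoencoder; the helper names cluster_means and predict_nearest_mean are hypothetical and only illustrate the idea, not the N2D implementation.

```python
def cluster_means(embedded, labels):
    """Mean of the embedded samples assigned to each cluster label."""
    sums, counts = {}, {}
    for z, label in zip(embedded, labels):
        acc = sums.setdefault(label, [0.0] * len(z))
        for d, value in enumerate(z):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_nearest_mean(embedded, means):
    """Assign each embedded sample to the label of the closest cluster mean."""
    predictions = []
    for z in embedded:
        nearest = min(
            means,
            key=lambda label: sum((a - b) ** 2 for a, b in zip(z, means[label])),
        )
        predictions.append(nearest)
    return predictions


train_embedded = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
train_labels = [0, 0, 1, 1]
means = cluster_means(train_embedded, train_labels)
predicted = predict_nearest_mean([[1.0, 1.0], [9.0, 9.0]], means)
```

Since the means live in the autoencoder embedding rather than in the manifold embedding, the predictions may disagree with labels_ for points near cluster boundaries.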

set_predict_request() N2D

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.SHADE(clustering_class: ~sklearn.base.ClusterMixin | None = <class 'clustpy.hierarchical.dctree_clusterer.DCTree_Clusterer'>, clustering_params: dict = None, min_points: int = 5, use_complete_dc_tree: bool = True, use_matrix_dc_distance: bool = True, use_less_memory: bool = False, batch_size: int = 500, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 0, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~typing.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, density_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Structure-preserving High-dimensional Analysis with Density-based Exploration (SHADE) algorithm. A neural network (autoencoder AE) will be trained with the reconstruction loss and the d_dc loss function. Afterward, KMeans or HDBSCAN identifies the initial clusters.

Parameters:
  • clustering_class (ClusterMixin) – clustering class to obtain the cluster labels after getting the embedding (default: DCTree_Clusterer)

  • clustering_params (dict) – parameters for the clustering class. If None, it will be set to {“min_points”: min_points} (default: None)

  • min_points (int) – the minimum number of points (default: 5)

  • use_complete_dc_tree (bool) – Defines whether the complete DC Tree should be used instead of a batch-wise version (default: True)

  • use_matrix_dc_distance (bool) – Defines whether the matrix DC distance should be stored - can cause memory issues (default: True)

  • use_less_memory (bool) – Use less memory when constructing the DCTree. This will, however, increase the runtime (default: False)

  • batch_size (int) – Size of the data batches. (default: 500)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3}. (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network. (default: 0)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • density_loss_weight (float) – weight of the density loss compared to the reconstruction loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers defined as the mean of assigned samples within the AE embedding

Type:

np.ndarray

dc_tree_

The dc tree

Type:

DCTree

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import SHADE
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> shade = SHADE()
>>> shade.fit(data)

References

Beer, Anna and Weber, Pascal and Miklautz, Lukas and Leiber, Collin and Durani, Walid and Böhm, Christian “SHADE: Deep Density-based Clustering.” IEEE International Conference on Data Mining (ICDM), Abu Dhabi, United Arab Emirates. 2024, pp. 675-680, doi: 10.1109/ICDM59182.2024.

fit(X: ndarray, y: ndarray = None) SHADE[source]

Cluster the input dataset with the SHADE algorithm. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – The given data set.

  • y (np.ndarray) – The labels. (can be ignored)

Returns:

self – This instance of the SHADE algorithm.

Return type:

SHADE

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data. Note that this is only a rough estimate because the DC Tree is not used to predict the labels. Instead, each sample is assigned to the cluster whose mean of assigned samples is closest within the embedding of the AE.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray
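
The nearest-mean rule described for predict can be sketched as follows. This is a minimal illustration and not SHADE's implementation: predict_nearest_mean is a hypothetical helper, and the embedded samples and cluster means are made-up arrays standing in for the AE embedding and cluster_centers_.

```python
import numpy as np

def predict_nearest_mean(embedded: np.ndarray, cluster_means: np.ndarray) -> np.ndarray:
    """Assign each embedded sample to the cluster whose mean is closest (Euclidean)."""
    # Pairwise squared distances, shape (n_samples, n_clusters)
    diffs = embedded[:, None, :] - cluster_means[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

# Toy embedding with two clusters (illustrative values only)
cluster_means = np.array([[0.0, 0.0], [5.0, 5.0]])
embedded = np.array([[0.2, -0.1], [4.8, 5.3], [1.0, 1.0]])
predicted = predict_nearest_mean(embedded, cluster_means)
```

Because only the means are consulted, density-connected clusters with non-convex shapes may be labelled differently than during fit.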

set_predict_request() SHADE

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.VaDE(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = BCELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Variational Deep Embedding (VaDE) algorithm. First, a variational autoencoder (VAE) will be trained (will be skipped if input neural network is given). Afterward, a GMM will be fit to identify the initial clustering structures. Last, the VAE will be optimized using the VaDE loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {"lr": 1e-3} (default: None)

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {"lr": 1e-4} (default: None)

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.BCELoss(reduction='sum'))

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new VariationalAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (central layer with mean and variance) (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: GaussianMixture)

  • initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {"n_init": 10, "covariance_type": "diag"} (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The labels as identified by a final Gaussian Mixture Model

Type:

np.ndarray

cluster_centers_

The cluster centers as identified by a final Gaussian Mixture Model

Type:

np.ndarray

covariances_

The covariance matrices as identified by a final Gaussian Mixture Model

Type:

np.ndarray

weights_

The weights as identified by a final Gaussian Mixture Model

Type:

np.ndarray

vade_labels_

The labels as identified by VaDE after the training terminated

Type:

np.ndarray

vade_cluster_centers_

The cluster centers as identified by VaDE after the training terminated

Type:

np.ndarray

vade_covariances_

The covariance matrices as identified by VaDE after the training terminated

Type:

np.ndarray

neural_network_trained_

The final neural network

Type:

torch.nn.Module

n_features_in_

the number of features used for the fitting

Type:

int

Examples

>>> import numpy as np
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import VaDE
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> data = (data - np.mean(data)) / np.std(data)
>>> vade = VaDE(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> vade.fit(data)

References

Jiang, Zhuxi, et al. “Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering.” IJCAI. 2017.

fit(X: ndarray, y: ndarray = None) VaDE[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the VaDE algorithm

Return type:

VaDE

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray
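
The documentation does not spell out how the labels are derived. One plausible reading, shown here purely as an assumption, is a hard assignment to the most probable component of a Gaussian mixture with diagonal covariances in the embedding. The helper gmm_hard_labels and all values are hypothetical; the arrays stand in for the embedded samples and for attributes such as cluster_centers_, covariances_ and weights_.

```python
import numpy as np

def gmm_hard_labels(z, means, variances, weights):
    """Return the index of the most probable diagonal-Gaussian component per sample."""
    # log N(z | mean_c, diag(var_c)) for every sample/component pair
    diff = z[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :]
    ).sum(axis=2)
    # Add the log mixture weights and take the argmax over components
    return (log_pdf + np.log(weights)[None, :]).argmax(axis=1)

# Two components in a 2-dimensional embedding (illustrative values only)
means = np.array([[0.0, 0.0], [4.0, 4.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])
weights = np.array([0.5, 0.5])
z = np.array([[0.1, 0.3], [3.9, 4.2], [-1.0, 0.5]])
labels = gmm_hard_labels(z, means, variances, weights)
```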

set_predict_request() VaDE

No-op.

Calling this method has no effect.

Returns:

self – The updated object.

Return type:

object

clustpy.deep.decode_batchwise(dataloader: DataLoader, neural_network: Module) ndarray[source]

Utility function for decoding the whole data set in a mini-batch fashion, e.g., with an autoencoder. Note: this assumes that the neural network implements a decode function.

Parameters:
  • dataloader (torch.utils.data.DataLoader) – data to decode

  • neural_network (torch.nn.Module) – the neural network that is used for the decoding (e.g. an autoencoder)

Returns:

decodings_numpy – The decoded data set

Return type:

np.ndarray

clustpy.deep.detect_device(device: device | int | str = None) device[source]

Automatically detects whether a CUDA-enabled GPU is available. The device can also be read from the environment variable "CLUSTPY_DEVICE", which can be set using, e.g., os.environ["CLUSTPY_DEVICE"] = "cuda:1"

Parameters:

device (torch.device | int | str) – the input device. Will be returned if it is not None (default: None)

Returns:

device – device on which the prediction should take place

Return type:

torch.device
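
The resolution order can be sketched roughly as below. resolve_device is a hypothetical helper, not the library function: returning an explicitly passed device first follows the parameter description, while consulting the environment variable before falling back to CUDA or the CPU is an assumption. The sketch also does not select the GPU with the most free memory.

```python
import os
import torch

def resolve_device(device=None) -> torch.device:
    """Hypothetical helper mirroring the documented behaviour of detect_device."""
    # 1. An explicitly passed device is returned (converted to torch.device)
    if device is not None:
        return torch.device(device)
    # 2. Assumption: the environment variable is consulted next
    env_device = os.environ.get("CLUSTPY_DEVICE")
    if env_device is not None:
        return torch.device(env_device)
    # 3. Otherwise use CUDA if it is available, else the CPU
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

os.environ["CLUSTPY_DEVICE"] = "cpu"
from_env = resolve_device()        # taken from the environment variable
explicit = resolve_device("cpu")   # explicit argument is returned directly
```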

clustpy.deep.encode_batchwise(dataloader: DataLoader, neural_network: Module) ndarray[source]

Utility function for embedding the whole data set in a mini-batch fashion

Parameters:
  • dataloader (torch.utils.data.DataLoader) – data to embed

  • neural_network (torch.nn.Module) – the neural network that is used for the encoding (e.g. an autoencoder)

Returns:

embeddings_numpy – The embedded data set

Return type:

np.ndarray

clustpy.deep.encode_decode_batchwise(dataloader: ~torch.utils.data.dataloader.DataLoader, neural_network: ~torch.nn.modules.module.Module) -> (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Utility function for encoding and decoding the whole data set in a mini-batch fashion, e.g., with an autoencoder. Note: this assumes that the neural network implements a decode function.

Parameters:
  • dataloader (torch.utils.data.DataLoader) – dataloader to be used

  • neural_network (torch.nn.Module) – the neural network that is used for the encoding and decoding (e.g. an autoencoder)

Returns:

tuple – The embedded data set, The decoded data set

Return type:

(np.ndarray, np.ndarray)
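
A minimal sketch of the batch-wise pattern behind the encode/decode utilities, assuming batches follow the [index, data, …] convention and the network exposes encode and decode methods. TinyAutoencoder and encode_decode_in_batches are hypothetical names used only for illustration.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

class TinyAutoencoder(torch.nn.Module):
    """Toy autoencoder exposing encode and decode (illustrative only)."""
    def __init__(self, n_features: int, embedding_size: int):
        super().__init__()
        self.encoder = torch.nn.Linear(n_features, embedding_size)
        self.decoder = torch.nn.Linear(embedding_size, n_features)

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

def encode_decode_in_batches(dataloader, network):
    """Embed and reconstruct every batch without tracking gradients."""
    embeddings, reconstructions = [], []
    network.eval()
    with torch.no_grad():
        for _, batch in dataloader:  # first entry: indices, second entry: data
            z = network.encode(batch)
            embeddings.append(z.cpu().numpy())
            reconstructions.append(network.decode(z).cpu().numpy())
    return np.concatenate(embeddings), np.concatenate(reconstructions)

X = torch.randn(10, 6)
# Non-shuffled loader so that the output rows match the input order
loader = DataLoader(TensorDataset(torch.arange(10), X), batch_size=4, shuffle=False)
embedded, decoded = encode_decode_in_batches(loader, TinyAutoencoder(6, 2))
```

Processing the data in batches keeps memory usage bounded by the batch size rather than by the size of the data set.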

clustpy.deep.get_dataloader(X: ~numpy.ndarray | ~torch.Tensor, batch_size: int = 256, shuffle: bool = True, drop_last: bool = False, additional_inputs: list | ~numpy.ndarray | ~torch.Tensor = None, dataset_class: ~torch.utils.data.dataset.Dataset = <class 'clustpy.deep._data_utils._ClustpyDataset'>, ds_kwargs: dict = None, dl_kwargs: dict = None) DataLoader[source]

Create a dataloader for Deep Clustering algorithms. First entry always contains the indices of the data samples. Second entry always contains the actual data samples. If for example labels are desired, they can be passed through the additional_inputs parameter (should be a list). Other customizations (e.g. augmentation) can be implemented using a custom dataset_class. This custom class should stick to the conventions, [index, data, …].

Parameters:
  • X (np.ndarray | torch.Tensor) – the actual data set (can be np.ndarray or torch.Tensor)

  • batch_size (int) – the batch size (default: 256)

  • shuffle (bool) – boolean that defines if the data set should be shuffled (default: True)

  • drop_last (bool) – boolean that defines if the last batch should be ignored (default: False)

  • additional_inputs (list | np.ndarray | torch.Tensor) – additional inputs for the dataloader, e.g. labels or neighbors. Can be None, np.ndarray, torch.Tensor or a list containing np.ndarrays/torch.Tensors (default: None)

  • dataset_class (torch.utils.data.Dataset) – defines the class of the tensor dataset that is contained in the dataloader (default: _ClustpyDataset)

  • ds_kwargs (dict) –

    other arguments for dataset_class. For example, augmentation or preprocessing transforms can be added to the _ClustpyDataset by passing ds_kwargs={"aug_transforms_list":[aug_transforms], "orig_transforms_list":[orig_transforms]}, where aug_transforms and orig_transforms transform the input X, e.g., using torchvision.transforms.Compose to combine multiple transformations.

    Important: If aug_transforms_list is passed via ds_kwargs, the values returned by the dataloader change. The first entry will still be the indices of the data samples, but the second entry will be the transformed version of the actual data samples and the third entry will be the original data samples. If orig_transforms_list is passed as well, the third entry will be transformed accordingly; this might be needed for preprocessing the data. An example for MNIST is shown below.

  • dl_kwargs (dict) – other arguments for torch.utils.data.DataLoader

Examples

>>> # Example for usage of data transformations with get_dataloader
>>> from clustpy.data import load_mnist
>>> from clustpy.deep import get_dataloader
>>> import torch
>>> import torchvision
>>> # load and prepare data for torchvision.transforms
>>> data, labels = load_mnist()
>>> data = data.reshape(-1, 1, 28, 28)
>>> # out-of-place division also works if the loaded array has an integer dtype
>>> data = data / 255.0
>>> data = torch.from_numpy(data).float()
>>> #
>>> # preprocessing functions
>>> mean = data.mean()
>>> std = data.std()
>>> normalize_fn = torchvision.transforms.Normalize([mean], [std])
>>> # flatten is only needed if a FeedForward network is used, otherwise this can be skipped.
>>> flatten_fn = torchvision.transforms.Lambda(torch.flatten)
>>> #
>>> # list of augmentation transforms
>>> transform_list = [
...     # transform input tensor to PIL image for augmentation
...     torchvision.transforms.ToPILImage(),
...     # apply transformations
...     torchvision.transforms.RandomAffine(degrees=(-16, +16),
...                                         translate=(0.1, 0.1),
...                                         shear=(-8, 8),
...                                         fill=0),
...     # transform back to torch.tensor
...     torchvision.transforms.ToTensor(),
...     # preprocess and flatten
...     normalize_fn,
...     flatten_fn,
... ]
>>> #
>>> # compose the augmentation transforms
>>> aug_transforms = torchvision.transforms.Compose(transform_list)
>>> # preprocessing transforms without augmentation
>>> orig_transforms = torchvision.transforms.Compose([normalize_fn, flatten_fn])
>>> #
>>> # pass transforms to dataloader
>>> aug_dl = get_dataloader(data, batch_size=32, shuffle=True,
...                         ds_kwargs={"aug_transforms_list":[aug_transforms], "orig_transforms_list":[orig_transforms]},
...                         )

Returns:

dataloader – The final dataloader

Return type:

torch.utils.data.DataLoader
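
The [index, data, …] convention for a custom dataset_class can be sketched as follows. IndexedDataset is a hypothetical class written for illustration; whether its constructor signature matches what get_dataloader passes to dataset_class is an assumption, so it is wrapped in a plain torch DataLoader here.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class IndexedDataset(Dataset):
    """Hypothetical dataset returning (index, sample, *additional inputs)."""
    def __init__(self, X, additional_inputs=None):
        self.X = X
        self.additional_inputs = additional_inputs or []

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        # The index comes first, the sample second, any extras afterwards
        extras = [extra[idx] for extra in self.additional_inputs]
        return (idx, self.X[idx], *extras)

X = torch.randn(8, 3)
y = torch.arange(8) % 2  # example labels passed as an additional input
loader = DataLoader(IndexedDataset(X, [y]), batch_size=4, shuffle=False)
indices, samples, batch_labels = next(iter(loader))
```

Returning the indices with every batch lets an algorithm look up per-sample state (for example, current cluster assignments) even when the loader shuffles the data.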

clustpy.deep.get_default_augmented_dataloaders(X: ~numpy.ndarray | ~torch.Tensor, batch_size: int = 256, conv_used: bool = False, flatten: bool = True) -> (<class 'torch.utils.data.dataloader.DataLoader'>, <class 'torch.utils.data.dataloader.DataLoader'>)[source]

Returns a train and a test dataloader using default augmentations. These transformations correspond to a min-max normalization followed by torchvision.transforms.RandomAffine(degrees=(-16, +16), translate=(0.1, 0.1), shear=(-8, 8), fill=0) and a channel-wise z-transformation. Optionally, the images can be flattened afterward.

Parameters:
  • X (np.ndarray | torch.Tensor) – the actual data set (can be np.ndarray or torch.Tensor)

  • batch_size (int) – the batch size (default: 256)

  • conv_used (bool) – defines whether a convolutional network will be used afterward. In this case, grayscale images will be transformed to receive three color channels by copying the grayscale channel three times (default: False)

  • flatten (bool) – defines whether the augmented images should be flattened afterward. Must be False if conv_used is True (default: True)

Returns:

tuple – The trainloader (with augmentations), The testloader (without augmentations)

Return type:

(torch.utils.data.DataLoader, torch.utils.data.DataLoader)
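
The deterministic parts of this pipeline (min-max normalization, channel-wise z-transformation, channel copying and flattening) can be illustrated with NumPy alone. This is a sketch under assumptions: the helper names are hypothetical, the random affine augmentation is omitted, and the images are random values in an (N, C, H, W) layout.

```python
import numpy as np

def min_max_normalize(images):
    """Scale all values into the range [0, 1]."""
    return (images - images.min()) / (images.max() - images.min())

def channelwise_z_transform(images):
    """Standardize each channel of an (N, C, H, W) array to zero mean and unit variance."""
    mean = images.mean(axis=(0, 2, 3), keepdims=True)
    std = images.std(axis=(0, 2, 3), keepdims=True)
    return (images - mean) / std

rng = np.random.default_rng(0)
gray = rng.integers(0, 256, size=(5, 1, 8, 8)).astype(np.float64)

# Convolutional case: copy the grayscale channel three times, do not flatten
rgb_like = np.repeat(gray, 3, axis=1)
conv_input = channelwise_z_transform(min_max_normalize(rgb_like))

# Feed-forward case: keep one channel and flatten each image to a vector
flat_input = channelwise_z_transform(min_max_normalize(gray)).reshape(len(gray), -1)
```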

clustpy.deep.get_device_from_module(neural_network: Module) device[source]

Get the device from a given module.

Parameters:

neural_network (torch.nn.Module) – the neural network that is used for the encoding (e.g. an autoencoder)

Returns:

device – device of the module

Return type:

torch.device
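
A common way to obtain a module's device in PyTorch is to inspect one of its parameters. The helper device_of below is a hypothetical sketch of that idiom and assumes the module has at least one parameter; it is not necessarily how get_device_from_module is implemented.

```python
import torch

def device_of(module: torch.nn.Module) -> torch.device:
    """Return the device of the module's first parameter."""
    return next(module.parameters()).device

layer = torch.nn.Linear(4, 2)
layer_device = device_of(layer)
```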

clustpy.deep.mean_squared_error(tensor1: Tensor, tensor2: Tensor, weights: Tensor = None) Tensor[source]

Calculate the mean squared error between two tensors. Each row in the tensors is interpreted as a separate object, while each column represents its features. Optionally, features can be individually weighted. The default behavior is that all features are weighted by 1.

Parameters:
  • tensor1 (torch.Tensor) – the first tensor

  • tensor2 (torch.Tensor) – the second tensor

  • weights (torch.Tensor) – tensor containing the weights of the features (default: None)

Returns:

mse – the mean squared error

Return type:

torch.Tensor
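
As an illustration of feature weighting, the following sketch multiplies the squared differences by per-feature weights before averaging. weighted_mse is a hypothetical helper; in particular, averaging over all entries is an assumption, and the library function may normalize differently.

```python
import torch

def weighted_mse(tensor1, tensor2, weights=None):
    """Mean of the (optionally feature-weighted) squared differences."""
    squared_diff = (tensor1 - tensor2) ** 2
    if weights is not None:
        # weights has one entry per feature (column) and broadcasts over the rows
        squared_diff = squared_diff * weights
    return squared_diff.mean()

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[1.0, 0.0], [1.0, 4.0]])
plain = weighted_mse(a, b)
# A weight of 0 removes the second feature from the error
weighted = weighted_mse(a, b, weights=torch.tensor([1.0, 0.0]))
```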

clustpy.deep.predict_batchwise(dataloader: DataLoader, neural_network: Module, cluster_module: Module) ndarray[source]

Utility function for predicting the cluster labels over the whole data set in a mini-batch fashion. The function calls the predict_hard method of the cluster_module for each batch of data.

Parameters:
  • dataloader (torch.utils.data.DataLoader) – dataloader to be used

  • neural_network (torch.nn.Module) – the neural network that is used for the encoding (e.g. an autoencoder)

  • cluster_module (torch.nn.Module) – the cluster module that is used for the prediction (e.g. DEC). Must provide the predict_hard method, which is called for each batch.

Returns:

predictions_numpy – The predictions of the cluster_module for the data set

Return type:

np.ndarray
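
The interplay between the network and a cluster module with a predict_hard method might look like the sketch below. NearestCenterModule and predict_in_batches are hypothetical, the batches are assumed to follow the [index, data, …] convention, and an identity network stands in for the encoder.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

class NearestCenterModule(torch.nn.Module):
    """Toy cluster module: predict_hard returns the index of the closest center."""
    def __init__(self, centers):
        super().__init__()
        self.centers = torch.nn.Parameter(centers)

    def predict_hard(self, embedded):
        return torch.cdist(embedded, self.centers).argmin(dim=1)

def predict_in_batches(dataloader, network, cluster_module):
    """Embed each batch and collect the hard cluster assignments."""
    predictions = []
    with torch.no_grad():
        for _, batch in dataloader:
            embedded = network(batch)  # stand-in for the encoding step
            predictions.append(cluster_module.predict_hard(embedded).cpu().numpy())
    return np.concatenate(predictions)

X = torch.tensor([[0.0, 0.1], [5.0, 5.2], [0.3, -0.2], [4.9, 5.1]])
loader = DataLoader(TensorDataset(torch.arange(4), X), batch_size=2, shuffle=False)
centers = torch.tensor([[0.0, 0.0], [5.0, 5.0]])
labels = predict_in_batches(loader, torch.nn.Identity(), NearestCenterModule(centers))
```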

clustpy.deep.set_torch_seed(random_state: RandomState | int) None[source]

Set the random state for torch applications.

Parameters:

random_state (np.random.RandomState | int) – use a fixed random state or an integer to get a repeatable solution
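
A minimal sketch of seeding torch from either input type, assuming that an integer seed is drawn from a given RandomState. seed_torch is a hypothetical helper and may differ from what set_torch_seed does internally.

```python
import numpy as np
import torch

def seed_torch(random_state) -> None:
    """Hypothetical helper: seed torch from an int or a np.random.RandomState."""
    if isinstance(random_state, np.random.RandomState):
        # Draw an integer seed from the given random state
        seed = int(random_state.randint(0, 2**31 - 1))
    else:
        seed = int(random_state)
    torch.manual_seed(seed)

seed_torch(42)
first = torch.rand(3)
seed_torch(42)
second = torch.rand(3)  # same values as first, because the seed was reset

seed_torch(np.random.RandomState(1))
from_state_a = torch.rand(3)
seed_torch(np.random.RandomState(1))
from_state_b = torch.rand(3)
```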