clustpy.deep package

Subpackages

Submodules

clustpy.deep.aec module

@authors: Collin Leiber

class clustpy.deep.aec.AEC(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = None, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Auto-encoder Based Data Clustering (AEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, the initial cluster labels are obtained from initial_clustering_class, or assigned randomly if it is None. Last, the network will be optimized using the AEC loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if an initial_clustering_class is given that determines the number of clusters itself, e.g. DBSCAN

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 50)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining. If this is None, random labels will be used (default: None)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
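How ssl_loss_weight and clustering_loss_weight might combine into a single training objective can be sketched as follows. This is a minimal illustration in plain Python, not clustpy's implementation: the helper names squared_distance and combined_loss are hypothetical, and taking the clustering loss to be the squared distance between an embedded sample and its assigned cluster center is an assumption. The default weights mirror the class signature.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def combined_loss(x, reconstruction, embedding, center,
                  ssl_loss_weight=1.0, clustering_loss_weight=0.1):
    """Weighted sum of a reconstruction (ssl) loss and a clustering loss.

    Assumption: the ssl loss is the mean squared error between the input and
    its reconstruction, and the clustering loss is the squared distance
    between the embedded sample and its assigned cluster center.
    """
    ssl_loss = squared_distance(x, reconstruction) / len(x)
    clustering_loss = squared_distance(embedding, center)
    return ssl_loss_weight * ssl_loss + clustering_loss_weight * clustering_loss


x = [1.0, 2.0, 3.0, 4.0]
reconstruction = [1.0, 2.0, 3.0, 2.0]  # mean squared error: 4 / 4 = 1.0
embedding = [0.0, 0.0]
center = [1.0, 1.0]                    # squared distance: 2.0
loss = combined_loss(x, reconstruction, embedding, center)
print(loss)
```

Raising clustering_loss_weight pulls embedded samples more strongly toward their centers, while raising ssl_loss_weight favors faithful reconstruction.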

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import AEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> aec = AEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> aec.fit(data)

References

Song, Chunfeng, et al. “Auto-encoder based data clustering.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 18th Iberoamerican Congress, CIARP 2013, Havana, Cuba, November 20-23, 2013, Proceedings, Part I 18. Springer Berlin Heidelberg, 2013.

fit(X: ndarray, y: ndarray = None) → AEC[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the AEC algorithm

Return type:

AEC

predict(X: ndarray) → ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray
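For a centroid-based method, prediction can be pictured as assigning each embedded sample to its nearest cluster center. The sketch below illustrates that idea in plain Python. It assumes the samples have already been passed through the encoder, and predict_labels is a hypothetical helper, not part of clustpy; the library's actual predict may differ in detail.

```python
def predict_labels(embedded, centers):
    """Assign each embedded sample to the index of its nearest center."""
    labels = []
    for z in embedded:
        # Squared Euclidean distance from z to every center
        distances = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in centers]
        labels.append(distances.index(min(distances)))
    return labels


centers = [[0.0, 0.0], [5.0, 5.0]]
embedded = [[0.2, -0.1], [4.8, 5.3], [1.0, 1.0]]
predicted_labels = predict_labels(embedded, centers)
print(predicted_labels)  # [0, 1, 0]
```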

clustpy.deep.dcn module

@authors: Lukas Miklautz, Dominik Mautz

class clustpy.deep.dcn.DCN(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 50, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 0.05, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Clustering Network (DCN) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DCN loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if an initial_clustering_class is given that determines the number of clusters itself, e.g. DBSCAN

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 50)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 50)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • clustering_loss_weight (float) – weight of the clustering loss (default: 0.05)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
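In the DCN paper referenced below (Yang et al.), cluster centers are updated alongside the network with a count-based step size, so that a center drifts more slowly the more samples it has absorbed. The sketch below shows the running-mean form of that update in plain Python. The helper update_center is hypothetical, and the exact bookkeeping in the paper and in clustpy may differ.

```python
def update_center(center, count, embedded_sample):
    """Move a center toward a newly assigned sample using step size 1 / count.

    `count` is the number of samples already averaged into `center`;
    it is incremented before the step size is computed.
    """
    count += 1
    step = 1.0 / count
    new_center = [c - step * (c - z) for c, z in zip(center, embedded_sample)]
    return new_center, count


center, count = [0.0, 0.0], 1        # center currently represents one sample
for z in ([2.0, 4.0], [4.0, 0.0]):   # two further samples assigned to it
    center, count = update_center(center, count, z)
print(center, count)
```

With this step size the center always equals the mean of the samples assigned to it so far, which is the property that makes the embedding k-means-friendly.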

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dcn_labels_

The final DCN labels

Type:

np.ndarray

dcn_cluster_centers_

The final DCN cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DCN
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dcn = DCN(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dcn.fit(data)

References

Yang, Bo, et al. “Towards k-means-friendly spaces: Simultaneous deep learning and clustering.” International Conference on Machine Learning. PMLR, 2017.

fit(X: ndarray, y: ndarray = None) → DCN[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DCN algorithm

Return type:

DCN

predict(X: ndarray) → ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

clustpy.deep.ddc_n2d module

@authors: Collin Leiber

class clustpy.deep.ddc_n2d.DDC(ratio: float = 0.1, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, tsne_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Density-based Image Clustering (DDC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, t-SNE is executed on the embedded data and a variant of the Density Peak Clustering algorithm is executed.

Parameters:
  • ratio (float) – The ratio parameter, defining the cutoff distance d_c by calculating: average pairwise distance * ratio (default: 0.1)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • tsne_params (dict) – Parameters for the t-SNE execution. For example, perplexity can be changed by setting tsne_params to {“n_components”: 2, “perplexity”: 25}. Check out sklearn.manifold.TSNE for more information (default: {“n_components”: 2})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels (obtained by a variant of Density Peak Clustering)

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

tsne_

The t-SNE object

Type:

TSNE

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DDC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> ddc = DDC(pretrain_epochs=3)
>>> ddc.fit(data)

References

Ren, Yazhou, et al. “Deep density-based image clustering.” Knowledge-Based Systems 197 (2020): 105841.

fit(X: ndarray, y: ndarray = None) → DDC[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DDC algorithm

Return type:

DDC

class clustpy.deep.ddc_n2d.DDC_density_peak_clustering(ratio: float)[source]

Bases: BaseEstimator, ClusterMixin

A variant of the Density Peak Algorithm as proposed in the DDC paper.

Parameters:

ratio (float) – The ratio parameter, defining the cutoff distance d_c by calculating: average pairwise distance * ratio
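The role of ratio can be illustrated with a small plain-Python sketch: the cutoff distance d_c is the average pairwise distance scaled by ratio, and in the classic Density Peak formulation the local density of a point is the number of other points closer than d_c. The function names are hypothetical, and the DDC variant may define density differently; only the d_c formula is taken from the parameter description above.

```python
import math
from itertools import combinations


def cutoff_distance(points, ratio):
    """d_c = average pairwise Euclidean distance * ratio."""
    pair_distances = [math.dist(p, q) for p, q in combinations(points, 2)]
    return ratio * sum(pair_distances) / len(pair_distances)


def local_density(points, d_c):
    """Classic density peak density: number of other points closer than d_c."""
    densities = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points) if i != j and math.dist(p, q) < d_c
        )
        densities.append(neighbors)
    return densities


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
d_c = cutoff_distance(points, ratio=0.2)
densities = local_density(points, d_c)
print(round(d_c, 3), densities)
```

A larger ratio widens the neighborhood, so more points count toward each density; a very small ratio leaves most points with a density of zero.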

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

References

Ren, Yazhou, et al. “Deep density-based image clustering.” Knowledge-Based Systems 197 (2020): 105841.

fit(X: ndarray, y: ndarray = None) → DDC_density_peak_clustering[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DDC variant of the Density Peak Clustering algorithm

Return type:

DDC_density_peak_clustering

class clustpy.deep.ddc_n2d.N2D(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, manifold_class: ~sklearn.base.TransformerMixin = <class 'sklearn.manifold._t_sne.TSNE'>, manifold_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Not Too Deep (N2D) clustering algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, t-SNE/UMAP/ISOMAP is executed on the embedded data and the EM algorithm is executed.

Parameters:
  • n_clusters (int) – number of clusters

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • manifold_class (TransformerMixin) – the manifold technique class (default: TSNE)

  • manifold_params (dict) – Parameters for the manifold execution. For example, perplexity can be changed for TSNE by setting manifold_params to {“n_components”: 2, “perplexity”: 25}. Check out e.g. sklearn.manifold.TSNE for more information (default: {“n_components”: 2})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

manifold_

The manifold object

Type:

TransformerMixin

References

McConville, Ryan, et al. “N2D: (Not too) deep clustering via clustering the local manifold of an autoencoded embedding.” 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021.

fit(X: ndarray, y: ndarray = None) → N2D[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the N2D algorithm

Return type:

N2D

clustpy.deep.dec module

@authors: Lukas Miklautz, Dominik Mautz, Collin Leiber

class clustpy.deep.dec.DEC(n_clusters: int, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedded Clustering (DEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DEC loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if an initial_clustering_class is given that determines the number of clusters itself, e.g. DBSCAN

  • alpha (float) – alpha value for the prediction (default: 1.0)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
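The alpha parameter corresponds to the degrees of freedom of the Student's t kernel that the DEC paper (Xie et al.) uses to turn distances between an embedded sample and the cluster centers into soft assignments. The plain-Python sketch below follows the formula from the paper. The function name soft_assignments is hypothetical, and whether clustpy evaluates it in exactly this form is an assumption.

```python
def soft_assignments(z, centers, alpha=1.0):
    """Student's t kernel similarity of sample z to each center, normalized to sum to 1."""
    kernel = []
    for c in centers:
        sq_dist = sum((zi - ci) ** 2 for zi, ci in zip(z, c))
        kernel.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
    total = sum(kernel)
    return [k / total for k in kernel]


centers = [[0.0, 0.0], [3.0, 0.0]]
q = soft_assignments([1.0, 0.0], centers, alpha=1.0)
label = q.index(max(q))  # hard label: cluster with the highest soft assignment
print(q, label)
```

With alpha = 1.0 the kernel values here are 1/2 and 1/5, so the sample belongs to the first cluster with probability 5/7.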

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dec_labels_

The final DEC labels

Type:

np.ndarray

dec_cluster_centers_

The final DEC cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dec = DEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dec.fit(data)

References

Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis.” International conference on machine learning. 2016.

fit(X: ndarray, y: ndarray = None) → DEC[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DEC algorithm

Return type:

DEC

predict(X: ndarray) → ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

class clustpy.deep.dec.IDEC(n_clusters: int, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: DEC

The Improved Deep Embedded Clustering (IDEC) algorithm. It is identical to the DEC algorithm but also uses the self-supervised learning loss during the clustering optimization. Further, clustering_loss_weight defaults to 0.1 instead of 1.0.
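The difference to DEC can be sketched in plain Python. The clustering term is the KL divergence between a sharpened target distribution P and the soft assignments Q, as in the DEC paper; IDEC adds the weighted self-supervised loss on top. The function names are hypothetical and the code is an illustration of the published formulas under these assumptions, not clustpy's implementation.

```python
import math


def target_distribution(q):
    """Sharpen soft assignments: p_ij is proportional to q_ij^2 / f_j, with f_j = sum_i q_ij."""
    n_clusters = len(q[0])
    cluster_freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / cluster_freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0
    )


def idec_loss(ssl_loss, p, q, ssl_loss_weight=1.0, clustering_loss_weight=0.1):
    """Weighted ssl loss plus weighted clustering loss."""
    return ssl_loss_weight * ssl_loss + clustering_loss_weight * kl_divergence(p, q)


q = [[0.8, 0.2], [0.3, 0.7]]  # soft assignments of two samples to two clusters
p = target_distribution(q)
loss = idec_loss(ssl_loss=0.5, p=p, q=q)
print(p, loss)
```

Setting ssl_loss_weight to 0.0 and clustering_loss_weight to 1.0 in this sketch recovers a DEC-style objective that uses the clustering loss alone.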

Parameters:
  • n_clusters (int) – number of clusters. Can be None if an initial_clustering_class is given that determines the number of clusters itself, e.g. DBSCAN

  • alpha (float) – alpha value for the prediction (default: 1.0)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 0.1)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dec_labels_

The final DEC labels

Type:

np.ndarray

dec_cluster_centers_

The final DEC cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import IDEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> idec = IDEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> idec.fit(data)

References

Guo, Xifeng, et al. “Improved deep embedded clustering with local structure preservation.” IJCAI. 2017.

clustpy.deep.deepect module

@authors: Collin Leiber, Julian Schilcher

class clustpy.deep.deepect.DeepECT(max_n_leaf_nodes: int = 20, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 50, clustering_epochs: int = 200, grow_interval: int = 2, pruning_threshold: float = 0.1, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedded Cluster Tree (DeepECT) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, a cluster tree will be grown and the network will be optimized using the DeepECT loss function.

Parameters:
  • max_n_leaf_nodes (int) – Maximum number of leaf nodes in the cluster tree (default: 20)

  • batch_size (int) – Size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 50)

  • clustering_epochs (int) – Number of epochs for the actual clustering procedure (default: 200)

  • grow_interval (int) – Number of epochs after which the tree is grown (default: 2)

  • pruning_threshold (float) – The threshold for pruning the tree (default: 0.1)

  • optimizer_class (torch.optim.Optimizer) – The optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – Size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – Use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

tree_

The prediction cluster tree after training

Type:

PredictionClusterTree

neural_network

The final neural network

Type:

torch.nn.Module

fit(X: ndarray, y: ndarray = None) DeepECT[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – This instance of the DeepECT algorithm

Return type:

DeepECT

flat_clustering(n_leaf_nodes_to_keep: int) ndarray[source]

Transform the predicted labels into a flat clustering result by only keeping n_leaf_nodes_to_keep leaf nodes in the tree. Returns labels as if the clustering procedure had stopped at the specified number of leaf nodes. Note that each leaf node corresponds to a cluster.

Parameters:

n_leaf_nodes_to_keep (int) – The number of leaf nodes to keep in the cluster tree

Returns:

labels_pruned – The new cluster labels

Return type:

np.ndarray
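
The idea behind flat_clustering can be illustrated with a small, self-contained sketch. It does not use clustpy's PredictionClusterTree; the function name, the dictionary-based tree and the rule "replay the earliest splits first" are all assumptions made for illustration, and the library may choose which leaf nodes to keep differently.

```python
def flat_clustering_sketch(sample_leaves, children, split_order, n_leaf_nodes_to_keep):
    """Toy illustration only; not clustpy's PredictionClusterTree.

    sample_leaves: leaf node id assigned to each sample
    children: dict mapping each inner node id to its (left, right) child ids
    split_order: inner node ids in the order they were split (root first)
    """
    if n_leaf_nodes_to_keep < 1:
        raise ValueError("at least one leaf node must be kept")
    # Start with the root as the only leaf and replay the first k-1 splits,
    # which yields exactly k leaf nodes (assumed pruning order).
    kept = {split_order[0]}
    for node in split_order[: n_leaf_nodes_to_keep - 1]:
        kept.remove(node)
        kept.update(children[node])
    # Child -> parent lookup, used to walk from an original leaf upwards
    parent = {child: node for node, pair in children.items() for child in pair}
    # Relabel the kept nodes as 0..k-1
    new_label = {node: label for label, node in enumerate(sorted(kept))}
    labels = []
    for node in sample_leaves:
        while node not in kept:
            node = parent[node]
        labels.append(new_label[node])
    return labels


# Hypothetical tree: 0 -> (1, 2), 1 -> (3, 4), 2 -> (5, 6)
children = {0: (1, 2), 1: (3, 4), 2: (5, 6)}
split_order = [0, 1, 2]
sample_leaves = [3, 4, 5, 6, 3]
two_clusters = flat_clustering_sketch(sample_leaves, children, split_order, 2)
three_clusters = flat_clustering_sketch(sample_leaves, children, split_order, 3)
```

In this toy tree, keeping two leaf nodes collapses the samples onto nodes 1 and 2, while keeping three leaf nodes additionally restores the split of node 1.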

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

clustpy.deep.dipdeck module

@authors: Collin Leiber

class clustpy.deep.dipdeck.DipDECK(n_clusters_init: int = 35, dip_merge_threshold: float = 0.9, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, max_n_clusters: int = inf, min_n_clusters: int = 1, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 5, max_cluster_size_diff_factor: float = 2, pval_strategy: str = 'table', n_boots: int = 1000, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedded Clustering with k-Estimation (DipDECK) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters using an overestimated number of clusters. Last, the network will be optimized using the DipDECK loss function. If any Dip-value exceeds the dip_merge_threshold, the corresponding clusters will be merged.

Parameters:
  • n_clusters_init (int) – initial number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 35)

  • dip_merge_threshold (float) – threshold regarding the Dip-p-value that defines if two clusters should be merged. Must be between 0 and 1 (default: 0.9)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • max_n_clusters (int) – maximum number of clusters. Must be larger than min_n_clusters. If the result has more clusters, a merge will be forced (default: np.inf)

  • min_n_clusters (int) – minimum number of clusters. Must be larger than 0, smaller than max_n_clusters and smaller than n_clusters_init. When this number of clusters is reached, all further merge processes will be hindered (default: 1)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure. Will reset after each merge (default: 50)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 5)

  • max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for the Dip calculation (default: 2)

  • pval_strategy (str) – Defines which strategy to use to obtain Dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)

  • n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

labels_

The final labels

Type:

np.ndarray

n_clusters_

The final number of clusters

Type:

int

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipDECK
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipdeck = DipDECK(pretrain_epochs=3, clustering_epochs=3)
>>> dipdeck.fit(data)
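
The interplay of dip_merge_threshold, min_n_clusters and max_n_clusters described in the parameter list can be sketched in plain Python. This is a hedged illustration, not the library's implementation: the function choose_merge and the rule of always picking the pair with the highest Dip-p-value are assumptions.

```python
def choose_merge(dip_p_values, dip_merge_threshold=0.9,
                 min_n_clusters=1, max_n_clusters=float("inf")):
    """Return the pair of cluster indices to merge, or None (illustrative only).

    dip_p_values: symmetric square matrix (list of lists) of pairwise Dip-p-values
    """
    n_clusters = len(dip_p_values)
    # No merge is possible with fewer than two clusters, and no further
    # merges are allowed once min_n_clusters is reached.
    if n_clusters < 2 or n_clusters <= min_n_clusters:
        return None
    # Find the pair with the highest Dip-p-value (assumed selection rule)
    best_pair, best_p = None, -1.0
    for i in range(n_clusters):
        for j in range(i + 1, n_clusters):
            if dip_p_values[i][j] > best_p:
                best_pair, best_p = (i, j), dip_p_values[i][j]
    # A merge is forced while there are more clusters than max_n_clusters
    forced = n_clusters > max_n_clusters
    if forced or best_p > dip_merge_threshold:
        return best_pair
    return None


# Hypothetical Dip-p-values for three clusters
p_values = [
    [1.0, 0.95, 0.2],
    [0.95, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
merge_pair = choose_merge(p_values)
```

With these made-up values, clusters 0 and 1 would be merged because their Dip-p-value of 0.95 exceeds the default threshold of 0.9.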

References

Leiber, Collin, et al. “Dip-based deep embedded clustering with k-estimation.” Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021.

fit(X: ndarray, y: ndarray = None) DipDECK[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DipDECK algorithm

Return type:

DipDECK

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

clustpy.deep.dipencoder module

@authors: Collin Leiber

class clustpy.deep.dipencoder.DipEncoder(n_clusters: int, batch_size: int = None, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, max_cluster_size_diff_factor: float = 3, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The DipEncoder. Can be used either as a clustering procedure if no ground truth labels are given or as a supervised dimensionality reduction technique. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DipEncoder loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN

  • batch_size (int) – size of the data batches for the actual training of the DipEncoder. Should be larger the more clusters we have. If it is None, it will be set to (25 x n_clusters) (default: None)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used (default: 3)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss. If None, it will be equal to 1/(4L), where L is the reconstruction loss of the first batch of an untrained neural network (default: None)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels

Type:

np.ndarray

projection_axes_

The final projection axes between the clusters

Type:

np.ndarray

index_dict_

A dictionary to match the indices of two clusters to a projection axis

Type:

dict

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipEncoder
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipencoder = DipEncoder(3, pretrain_epochs=3, clustering_epochs=3)
>>> dipencoder.fit(data)
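
Several DipEncoder defaults are derived from other quantities. The following sketch restates the rules from the parameter list (batch_size of 25 x n_clusters, ssl_loss_weight of 1/(4L), and the size cap given by max_cluster_size_diff_factor) as plain arithmetic. The helper names and the first_batch_ssl_loss argument are hypothetical and not part of clustpy.

```python
def resolve_dipencoder_defaults(n_clusters, batch_size=None, ssl_loss_weight=None,
                                first_batch_ssl_loss=None):
    """Fill in the documented defaults (illustrative helper, not clustpy API)."""
    if batch_size is None:
        # Documented default: 25 x n_clusters
        batch_size = 25 * n_clusters
    if ssl_loss_weight is None:
        if first_batch_ssl_loss is None:
            raise ValueError("the loss L of the first batch is needed to derive the weight")
        # Documented default: 1/(4L), with L the reconstruction loss of the
        # first batch of an untrained neural network
        ssl_loss_weight = 1 / (4 * first_batch_ssl_loss)
    return batch_size, ssl_loss_weight


def samples_used_for_comparison(size_a, size_b, max_cluster_size_diff_factor=3):
    """Number of samples taken from the smaller and the larger cluster."""
    smaller, larger = sorted((size_a, size_b))
    # The larger cluster contributes at most factor * (size of smaller cluster) samples
    cap = int(max_cluster_size_diff_factor * smaller)
    return smaller, min(larger, cap)


batch_size, ssl_loss_weight = resolve_dipencoder_defaults(3, first_batch_ssl_loss=0.5)
used = samples_used_for_comparison(100, 1000)
```

For three clusters this gives a batch size of 75, and a cluster with 1000 samples compared against one with 100 samples would contribute only its 300 closest samples.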

References

Leiber, Collin and Bauer, Lena G. M. and Neumayr, Michael and Plant, Claudia and Böhm, Christian “The DipEncoder: Enforcing Multimodality in Autoencoders.” Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2022.

fit(X: ndarray, y: ndarray = None) DipEncoder[source]

Initiate the actual clustering/dimensionality reduction process on the input data set. If no ground truth labels are given, the resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – The given (training) data set

  • y (np.ndarray) – The ground truth labels. If None, the DipEncoder will be used for clustering (default: None)

Returns:

self – This instance of the DipEncoder

Return type:

DipEncoder

plot(X: ndarray, edge_width: float = 0.2, show_legend: bool = True) None[source]

Plot the current state of the DipEncoder. First the data set will be encoded using the neural network, afterwards the plot will be created. Uses the plot_scatter_matrix as a basis and adds projection axes in red.

Parameters:
  • X (np.ndarray) – The data set

  • edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots

  • show_legend (bool) – Specifies whether a legend should be added to the plot

predict(X_train: ndarray, X_test: ndarray) ndarray[source]

Predict the labels of the X_test dataset using the information gained by the fit function and the X_train dataset. Beware that the current labels influence the labels obtained by predict(). Therefore, it can occur that the outcome of dipencoder.fit(X) does not match dipencoder.predict(X).

Parameters:
  • X_train (np.ndarray) – The data set used to train the DipEncoder (i.e. to retrieve the projection axes, modal intervals, …)

  • X_test (np.ndarray) – The data set for which we want to retrieve the labels

Returns:

labels_pred – The predicted labels for X_test

Return type:

np.ndarray

set_predict_request(*, X_test: bool | None | str = '$UNCHANGED$', X_train: bool | None | str = '$UNCHANGED$') DipEncoder

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in predict.

  • X_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_train parameter in predict.

Returns:

self – The updated object.

Return type:

object

clustpy.deep.dipencoder.plot_dipencoder_embedding(X_embed: ndarray, n_clusters: int, labels: ndarray, projection_axes: ndarray, index_dict: dict, edge_width: float = 0.1, show_legend: bool = False, show_plot: bool = True) None[source]

Plot the current state of the DipEncoder. Uses the plot_scatter_matrix as a basis and adds projection axes in red.

Parameters:
  • X_embed (np.ndarray) – The embedded data set

  • n_clusters (int) – Number of clusters

  • labels (np.ndarray) – The cluster labels

  • projection_axes (np.ndarray) – The projection axes between the clusters

  • index_dict (dict) – A dictionary to match the indices of two clusters to a projection axis

  • edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots

  • show_legend (bool) – Specifies whether a legend should be added to the plot

  • show_plot (bool) – Specifies whether the plot should be plotted, i.e. if plt.show() should be executed (default: True)

clustpy.deep.dkm module

@authors: Collin Leiber

class clustpy.deep.dkm.DKM(n_clusters: int, alphas: tuple = 1000, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 50, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep k-Means (DKM) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DKM loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN

  • alphas (tuple) – tuple of different alpha values used for the prediction. Small values close to 0 are equivalent to homogeneous assignments to all clusters. Large values simulate a clear assignment as with kMeans. If None, the default calculation of the paper will be used. This is equal to \alpha_{i+1}=2^{1/log(i)^2}*\alpha_i with \alpha_1=0.1 and maximum i=40. Alpha can also be a tuple with (None, \alpha_1, maximum i) (default: (1000))

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 50)

  • clustering_epochs (int) – number of epochs for each alpha value for the actual clustering procedure. The total number of clustering epochs therefore corresponds to: len(alphas)*clustering_epochs (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dkm_labels_

The final DKM labels

Type:

np.ndarray

dkm_cluster_centers_

The final DKM cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DKM
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dkm = DKM(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dkm.fit(data)
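
The alpha schedule that is used when alphas is None can be written out as a short loop. This is a sketch of the recurrence stated in the parameter list, not clustpy's own code: the function name default_alphas is hypothetical, and the logarithm is evaluated at i + 1 because log(1) = 0 would otherwise cause a division by zero in the first step (an assumption about how the index is meant to be read).

```python
import math


def default_alphas(init_alpha=0.1, n_alphas=40):
    """Annealing schedule alpha_{i+1} = 2^(1/log(i)^2) * alpha_i (illustrative)."""
    alphas = [init_alpha]
    for i in range(1, n_alphas):
        # Assumption: log(i + 1) instead of log(i) to avoid log(1) = 0
        factor = 2 ** (1 / math.log(i + 1) ** 2)
        alphas.append(factor * alphas[-1])
    return alphas


alphas = default_alphas()
# Per the clustering_epochs description: total epochs = len(alphas) * clustering_epochs
total_clustering_epochs = len(alphas) * 100
```

Every factor is larger than 1, so the schedule grows monotonically from soft assignments towards the hard, KMeans-like assignments described above. With 40 alpha values and clustering_epochs=100, the clustering phase would run for 4000 epochs in total.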

References

Fard, Maziar Moradi, Thibaut Thonet, and Eric Gaussier. “Deep k-means: Jointly clustering with k-means and learning representations.” Pattern Recognition Letters 138 (2020): 185-192.

fit(X: ndarray, y: ndarray = None) DKM[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DKM algorithm

Return type:

DKM

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

clustpy.deep.enrc module

@authors: Lukas Miklautz

class clustpy.deep.enrc.ACeDeC(n_clusters: int, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 20, init: str = 'acedec', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]

Bases: ENRC

Autoencoder Centroid-based Deep Cluster (ACeDeC) can be seen as a special case of ENRC where we have one cluster space and one shared space with a single cluster.

Parameters:
  • n_clusters (int) – number of clusters

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • P (list) – list containing projections for clusters in clustered space and cluster in shared space (optional) (default: None)

  • input_centers (list) – list containing the cluster centers for clusters in clustered space and cluster in shared space (optional) (default: None)

  • batch_size (int) – size of the data batches (default: 128)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)

  • tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1-tolerance_threshold) for all clusterings, the training will stop before clustering_epochs is reached. A higher value therefore stops the training earlier. If set to 0 or None, the training will continue until the labels no longer change or clustering_epochs is reached (default: None)

  • optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)

  • init (str) – choose which initialization strategy should be used. Has to be one of ‘acedec’, ‘subkmeans’, ‘random’ or ‘sgd’ (default: ‘acedec’)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)

  • scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)

  • init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)

  • init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10,000)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy. (default: True)

  • debug (bool) – if True additional information during the training will be printed (default: False)

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network

The final neural_network

Type:

torch.nn.Module

Raises:

ValueError – if init is not one of ‘acedec’, ‘subkmeans’, ‘random’, ‘auto’ or ‘sgd’

References

Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Böhm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832

fit(X: ndarray, y: ndarray = None) ACeDeC[source]

Cluster the input dataset with the ACeDeC algorithm. Saves the labels, centers, V, m, Betas, and P in the ACeDeC object. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – input data

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – returns the ACeDeC object

Return type:

ACeDeC

predict(X: ndarray, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]

Predicts the labels of the input data.

Parameters:
  • X (np.ndarray) – input data

  • use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)

  • dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ACeDeC

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.

  • use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.enrc.ENRC(n_clusters: list, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 20, init: str = 'nrkmeans', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]

Bases: _AbstractDeepClusteringAlgo

The Embedded Non-Redundant Clustering (ENRC) algorithm.

Parameters:
  • n_clusters (list) – list containing number of clusters for each clustering

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • P (list) – list containing projections for each clustering (optional) (default: None)

  • input_centers (list) – list containing the cluster centers for each clustering (optional) (default: None)

  • batch_size (int) – size of the data batches (default: 128)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)

  • tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1 - tolerance_threshold) for all clusterings, the training will stop before max_epochs is reached. A high value stops the training earlier than max_epochs; if set to 0 or None, the training continues until the labels no longer change (default: None)

  • optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)

  • init (str) – choose which initialization strategy should be used. Has to be one of ‘nrkmeans’, ‘random’, ‘sgd’ or ‘auto’ (default: ‘nrkmeans’)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)

  • scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)

  • init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)

  • init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10,000)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy. (default: True)

  • debug (bool) – if True additional information during the training will be printed (default: False)

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Raises:

ValueError – if init is not one of ‘nrkmeans’, ‘random’, ‘auto’ or ‘sgd’

References

Miklautz, Lukas & Dominik Mautz et al. “Deep embedded non-redundant clustering.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.

fit(X: ndarray, y: ndarray = None) ENRC[source]

Cluster the input dataset with the ENRC algorithm. Saves the labels, centers, V, m, Betas, and P in the ENRC object. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – input data

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – returns the ENRC object

Return type:

ENRC
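The stopping rule described for the tolerance_threshold parameter can be illustrated with a small, self-contained sketch. The helper names (entropy, nmi, should_stop) are hypothetical and not part of clustpy, and the NMI here uses arithmetic-mean normalization, which is an assumption; the library may rely on a different NMI implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Normalized mutual information with arithmetic-mean normalization (assumption)."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum((c / n) * math.log(c * n / (count_a[x] * count_b[y]))
             for (x, y), c in joint.items())
    mean_entropy = (entropy(a) + entropy(b)) / 2
    return 1.0 if mean_entropy == 0 else mi / mean_entropy


def should_stop(old_labelings, new_labelings, tolerance_threshold=None):
    """Hypothetical check: stop once every clustering satisfies
    NMI(old, new) >= 1 - tolerance_threshold (None is treated like 0)."""
    tol = tolerance_threshold or 0.0
    eps = 1e-12  # guards against floating point round-off when NMI should be exactly 1
    return all(nmi(o, n) >= 1 - tol - eps
               for o, n in zip(old_labelings, new_labelings))


old = [[0] * 5 + [1] * 5]
unchanged = [[1] * 5 + [0] * 5]        # same partition, labels permuted
one_flip = [[0] * 5 + [1] * 4 + [0]]   # one point changed its cluster

print(should_stop(old, unchanged))       # identical partitions
print(should_stop(old, one_flip, 0.1))   # strict tolerance
print(should_stop(old, one_flip, 0.5))   # loose tolerance
```

With a tolerance of 0 or None the check only succeeds once the partitions are identical up to a relabelling, which matches the "labels no longer change" behaviour described above.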

plot_subspace(X: ndarray, subspace_index: int = 0, labels: ndarray = None, plot_centers: bool = False, gt: ndarray = None, equal_axis: bool = False) None[source]

Plot the specified subspace as a scatter matrix plot.

Parameters:
  • X (np.ndarray) – input data

  • subspace_index (int) – index of the subspace (default: 0)

  • labels (np.ndarray) – the labels to use for the plot. If None, the labels found by Nr-Kmeans are used (default: None)

  • plot_centers (bool) – plot centers if True (default: False)

  • gt (np.ndarray) – ground truth labels (default: None)

  • equal_axis (bool) – equalize axis if True (default: False)

Return type:

scatter matrix plot of the input data

predict(X: ndarray = None, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]

Predicts the labels for each clustering of X in a mini-batch manner.

Parameters:
  • X (np.ndarray) – input data

  • use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)

  • dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)

Returns:

predicted_labels – n x c matrix, where n is the number of data points in X and c is the number of clusterings.

Return type:

np.ndarray

reconstruct_subspace_centroids(subspace_index: int = 0) ndarray[source]

Reconstructs the centroids in the specified subspace.

Parameters:

subspace_index (int) – index of the subspace (default: 0)

Returns:

centers_rec – reconstructed centers as np.ndarray

Return type:

np.ndarray

set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ENRC

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.

  • use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.

Returns:

self – The updated object.

Return type:

object

transform_full_space(X: ndarray, embedded=False) ndarray[source]

Embeds the input dataset with the neural network and the matrix V from the ENRC object.

Parameters:
  • X (np.ndarray) – input data

  • embedded (bool) – if True, then X is assumed to be already embedded (default: False)

Returns:

rotated – The transformed data

Return type:

np.ndarray

transform_subspace(X: ndarray, subspace_index: int = 0, embedded: bool = False) ndarray[source]

Embeds the input dataset with the neural network and with the matrix V projected onto the specified cluster subspace.

Parameters:
  • X (np.ndarray) – input data

  • subspace_index (int) – index of the subspace (default: 0)

  • embedded (bool) – if True, then X is assumed to be already embedded (default: False)

Returns:

subspace – The transformed subspace

Return type:

np.ndarray
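As a rough illustration of what transform_full_space and transform_subspace describe, the following NumPy sketch assumes that the data is already embedded (the embedded=True case), that V is a d x d rotation matrix, and that P holds the dimension indices of each subspace. The function names are hypothetical stand-ins, not the library implementation.

```python
import numpy as np


def transform_full_space_sketch(Z, V):
    """Rotate already embedded data Z (n x d) with the matrix V (d x d)."""
    return Z @ V


def transform_subspace_sketch(Z, V, P, subspace_index=0):
    """Rotate Z and keep only the dimensions assigned to one subspace.

    Selecting columns of V first is equivalent to rotating and then
    selecting columns of the rotated data."""
    return Z @ V[:, P[subspace_index]]


Z = np.arange(12, dtype=float).reshape(4, 3)   # 4 embedded points, d = 3
V = np.eye(3)[:, [2, 1, 0]]                    # a simple orthogonal matrix (permutation)
P = [np.array([0]), np.array([1, 2])]          # assumed layout: dimension indices per subspace

full = transform_full_space_sketch(Z, V)
sub = transform_subspace_sketch(Z, V, P, subspace_index=1)
print(full.shape, sub.shape)
```

For raw input data, the neural network would first have to encode X into Z; that step is omitted here.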

clustpy.deep.enrc.acedec_init(data: ~numpy.ndarray, n_clusters: list, optimizer_params: dict, batch_size: int = 128, optimizer_class: ~torch.optim.optimizer.Optimizer = None, rounds: int = None, epochs: int = 10, random_state: ~numpy.random.mtrand.RandomState = None, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, device: ~torch.device = device(type='cpu'), debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Initialization strategy based on optimizing ACeDeC’s parameters V and beta in isolation from the neural network using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than nrkmeans_init and only constrains V using the reconstruction error (torch.nn.MSELoss), which can be more flexible than the orthogonality constraint of NrKmeans. A problem of the sgd_init strategy is that it can be less stable for small data sets.

Parameters:
  • data (np.ndarray) – input data

  • n_clusters (list) – list of ints, number of clusters for each clustering

  • optimizer_params (dict) – parameters of the optimizer used to optimize V and beta, includes the learning rate

  • batch_size (int) – size of the data batches (default: 128)

  • optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used (default: None)

  • rounds (int) – not used here (default: None)

  • epochs (int) – the number of epochs is automatically chosen so that the total number of mini-batch iterations is close to 20,000, as in the ACeDeC paper. If this computed value is smaller than the passed epochs, the passed value is used instead (default: 10)

  • random_state (np.random.RandomState) – random state for reproducible results (default: None)

  • input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)

  • P (list) – list containing projections for each subspace (optional) (default: None)

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • device (torch.device) – device on which the training is performed (default: torch.device(‘cpu’))

  • debug (bool) – if True then the cost of each round will be printed (default: True)

Returns:

tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.

Return type:

(list, list, np.ndarray, np.ndarray)
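The epochs rule described for acedec_init can be sketched in a few lines. The helper name resolve_init_epochs, the use of ceil for the number of batches, and the rounding are assumptions made for illustration; the library may compute the value slightly differently.

```python
import math


def resolve_init_epochs(n_samples: int, batch_size: int = 128, epochs: int = 10,
                        target_iterations: int = 20_000) -> int:
    """Hypothetical helper: choose an epoch count that gives roughly
    `target_iterations` mini-batch updates, but never fewer than `epochs`."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    auto_epochs = round(target_iterations / batches_per_epoch)
    # The passed value acts as a lower bound on the automatically chosen one.
    return max(auto_epochs, epochs)


small = resolve_init_epochs(n_samples=1_280)       # 10 batches per epoch
large = resolve_init_epochs(n_samples=1_280_000)   # 10,000 batches per epoch
print(small, large)
```

Under these assumptions a small data set is trained for many epochs, while a very large one falls back to the passed epochs value.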

clustpy.deep.enrc.available_init_strategies() list[source]

Returns a list of strings of available initialization strategies for ENRC and ACeDeC. At the moment, the following strategies are supported: nrkmeans, random, sgd, auto

clustpy.deep.enrc.beta_weights_init(P: list, n_dims: int, high_value: float = 0.9) Tensor[source]

Initializes parameters of the softmax such that betas will be set to high_value in dimensions which form a cluster subspace according to P and set to (1 - high_value)/(len(P) - 1) for the other clusterings.

Parameters:
  • P (list) – list containing projections for each subspace

  • n_dims (int) – dimensionality of the embedded data

  • high_value (float) – value that should be initially used to indicate strength of assignment of a specific dimension to the clustering (default: 0.9)

Returns:

beta_weights – initialized weights that are input in the softmax to get the betas.

Return type:

torch.Tensor
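The initialization described for beta_weights_init can be illustrated with NumPy instead of torch. This sketch assumes that P is a partition of the dimension indices, that there are at least two clusterings, and that the softmax is taken across clusterings; the function names are hypothetical.

```python
import numpy as np


def beta_weights_sketch(P, n_dims, high_value=0.9):
    """Build softmax inputs so that the resulting betas equal `high_value`
    on the dimensions listed in P[i] and (1 - high_value) / (len(P) - 1) elsewhere."""
    n_clusterings = len(P)
    low_value = (1.0 - high_value) / (n_clusterings - 1)
    betas = np.full((n_clusterings, n_dims), low_value)
    for i, dims in enumerate(P):
        betas[i, dims] = high_value
    # Each column of `betas` sums to 1, so log(betas) is recovered exactly by a softmax.
    return np.log(betas)


def softmax_over_clusterings(weights):
    """Numerically stable softmax along axis 0 (assumed to index the clusterings)."""
    shifted = weights - weights.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


P = [np.array([0, 1]), np.array([2, 3, 4])]
weights = beta_weights_sketch(P, n_dims=5)
betas = softmax_over_clusterings(weights)
print(np.round(betas, 2))
```

Because every dimension belongs to exactly one subspace, each column contains one high value and len(P) - 1 low values, so the columns sum to one.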

clustpy.deep.enrc.calculate_beta_weight(data: Tensor, centers: list, V: Tensor, P: list, high_beta_value: float = 0.9) Tensor[source]

The beta weights have a closed form solution if we have two subspaces, so the optimal values given the data, centers and V can be computed. See the supplement of Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Boehm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832, available here: https://gitlab.cs.univie.ac.at/lukas/acedec_public/-/blob/master/supplement.pdf

For more than two subspaces, the beta weights are calculated assuming that an assigned subspace should have a weight of 0.9.

Parameters:
  • data (torch.Tensor) – input data

  • centers (list) – list of torch.Tensor, cluster centers for each clustering

  • V (torch.Tensor) – orthogonal rotation matrix

  • P (list) – list containing projections for each subspace

  • high_beta_value (float) – value that should be initially used to indicate strength of assignment of a specific dimension to the clustering (default: 0.9)

Returns:

beta_weights – a c x d vector containing the weights for the softmax to indicate which dimensions d are important for each clustering c.

Return type:

torch.Tensor

Raises:

ValueError – if the number of clusterings is smaller than 2

clustpy.deep.enrc.calculate_optimal_beta_weights_special_case(data: Tensor, centers: list, V: Tensor, batch_size: int = 32) Tensor[source]

The beta weights have a closed form solution if we have two subspaces, so the optimal values given the data, centers and V can be computed. See supplement of Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Boehm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832 here: https://gitlab.cs.univie.ac.at/lukas/acedec_public/-/blob/master/supplement.pdf

Parameters:
  • data (torch.Tensor) – input data

  • centers (list) – list of torch.Tensor, cluster centers for each clustering

  • V (torch.Tensor) – orthogonal rotation matrix

  • batch_size (int) – size of the data batches (default: 32)

Returns:

optimal_beta_weights – a c x d vector containing the optimal weights for the softmax to indicate which dimensions d are important for each clustering c.

Return type:

torch.Tensor

clustpy.deep.enrc.enrc_encode_decode_batchwise_with_loss(V: Tensor, centers: list, model: Module, dataloader: DataLoader, device: device = device(type='cpu'), ssl_loss_fn: _Loss = None) ndarray[source]

Encode and Decode input data of a dataloader in a mini-batch manner with ENRC.

Parameters:
  • V (torch.Tensor) – orthogonal rotation matrix

  • centers (list) – list of torch.Tensor, cluster centers for each clustering

  • model (torch.nn.Module) – the input model for encoding the data

  • dataloader (torch.utils.data.DataLoader) – dataloader to be used for prediction

  • device (torch.device) – device to be predicted on (default: torch.device(‘cpu’))

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: None)

Returns:

  • enrc_encoding (np.ndarray) – n x d matrix, where n is the number of data points and d is the number of dimensions of z.

  • enrc_decoding (np.ndarray) – n x D matrix, where n is the number of data points and D is the data dimensionality.

  • reconstruction_error (float) – reconstruction error (will be None if ssl_loss_fn is not specified)

clustpy.deep.enrc.enrc_init(data: ~numpy.ndarray, n_clusters: list, init: str = 'auto', rounds: int = 10, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, random_state: ~numpy.random.mtrand.RandomState = None, max_iter: int = 100, optimizer_params: dict = None, optimizer_class: ~torch.optim.optimizer.Optimizer = None, batch_size: int = 128, epochs: int = 10, device: ~torch.device = device(type='cpu'), debug: bool = True, init_kwargs: dict = None) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Initialization strategy for the ENRC algorithm.

Parameters:
  • data (np.ndarray) – input data

  • n_clusters (list) – list of ints, number of clusters for each clustering

  • init (str) –

    {‘nrkmeans’, ‘random’, ‘sgd’, ‘auto’} or callable. Initialization strategies for parameters cluster_centers, V and beta of ENRC. (default: ‘auto’)

    ‘nrkmeans’ : Performs the NrKmeans algorithm to get initial parameters. This strategy is preferred for small data sets, but the orthogonality constraint on V, and subsequently on the clustered subspaces, can sometimes be too limiting in practice, e.g., if clusterings in the data are not perfectly non-redundant.

    ‘random’ : Same as ‘nrkmeans’, but max_iter is set to 10, so the performance is faster, but also less optimized, thus more random.

    ‘sgd’ : Initialization strategy based on optimizing ENRC’s parameters V and beta in isolation from the neural network using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than the ‘nrkmeans’ option and only constrains V using the reconstruction error (torch.nn.MSELoss), which can be more flexible than the orthogonality constraint of NrKmeans. A problem of the ‘sgd’ strategy is that it can be less stable for small data sets.

    ‘auto’ : Selects ‘sgd’ init if data.shape[0] > 100,000 or data.shape[1] > 1,000. For smaller data sets ‘nrkmeans’ init is used.

    If a callable is passed, it should take arguments data and n_clusters (additional parameters can be provided via the dictionary init_kwargs) and return an initialization (centers, P, V and beta_weights).

  • rounds (int) – number of repetitions of the initialization procedure (default: 10)

  • input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)

  • P (list) – list containing projections for each subspace (optional) (default: None)

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • random_state (np.random.RandomState) – random state for reproducible results (default: None)

  • max_iter (int) – maximum number of iterations of NrKmeans. Only used for init=’nrkmeans’ (default: 100)

  • optimizer_params (dict) – parameters of the optimizer used to optimize V and beta, includes the learning rate. Only used for init=’sgd’

  • optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used. Only used for init=’sgd’ (default: None)

  • batch_size (int) – size of the data batches. Only used for init=’sgd’ (default: 128)

  • epochs (int) – number of epochs for the actual clustering procedure. Only used for init=’sgd’ (default: 10)

  • device (torch.device) – device on which the training is performed. Only used for init=’sgd’ (default: torch.device(‘cpu’))

  • debug (bool) – if True then the cost of each round will be printed (default: True)

  • init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)

Returns:

tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.

Return type:

(list, list, np.ndarray, np.ndarray)

Raises:

ValueError – if an init value is passed that is not implemented
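The selection rule documented for init=‘auto’ amounts to a simple size check. The function below is a hypothetical stand-alone restatement of that rule using the thresholds given in the parameter description; it is not the library's own code.

```python
def resolve_auto_init(n_samples: int, n_features: int) -> str:
    """Return the strategy that 'auto' would pick according to the documented rule:
    'sgd' for more than 100,000 samples or more than 1,000 features, else 'nrkmeans'."""
    if n_samples > 100_000 or n_features > 1_000:
        return "sgd"
    return "nrkmeans"


print(resolve_auto_init(50_000, 20))     # small data set
print(resolve_auto_init(200_000, 20))    # many samples
print(resolve_auto_init(500, 2_000))     # many features
```

Both comparisons are strict, so a data set with exactly 100,000 samples and 1,000 features would still use ‘nrkmeans’ under this reading.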

clustpy.deep.enrc.enrc_predict(z: Tensor, V: Tensor, centers: list, subspace_betas: Tensor, use_P: bool = False) ndarray[source]

Predicts the labels for each clustering of an input z.

Parameters:
  • z (torch.Tensor) – embedded input data point, can also be a mini-batch of embedded points

  • V (torch.Tensor) – orthogonal rotation matrix

  • centers (list) – list of torch.Tensor, cluster centers for each clustering

  • subspace_betas (torch.Tensor) – weights for each dimension per clustering. Calculated via softmax(beta_weights).

  • use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft subspace_beta weights are used (default: False)

Returns:

predicted_labels – n x c matrix, where n is the number of data points in z and c is the number of clusterings.

Return type:

np.ndarray
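A simplified NumPy sketch of the soft-weight prediction (use_P=False) may help to picture the n x c output. It assumes that the centers are expressed in the rotated space and that each point is assigned to the center with the smallest beta-weighted squared distance; both are assumptions for illustration, and predict_sketch is a hypothetical name.

```python
import numpy as np


def predict_sketch(z, V, centers, subspace_betas):
    """Assign every embedded point to its nearest center in each clustering,
    weighting each rotated dimension by the betas of that clustering."""
    z_rot = z @ V
    labels = []
    for betas_i, centers_i in zip(subspace_betas, centers):
        diff = z_rot[:, None, :] - centers_i[None, :, :]    # shape (n, k, d)
        weighted = (betas_i * diff ** 2).sum(axis=2)        # shape (n, k)
        labels.append(weighted.argmin(axis=1))
    return np.stack(labels, axis=1)                         # shape (n, c)


z = np.array([[0.0, 0.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]])
V = np.eye(2)
centers = [np.array([[0.0, 2.5], [5.0, 2.5]]),   # clustering 0 separates along dimension 0
           np.array([[2.5, 0.0], [2.5, 5.0]])]   # clustering 1 separates along dimension 1
subspace_betas = np.array([[0.9, 0.1], [0.1, 0.9]])

labels = predict_sketch(z, V, centers, subspace_betas)
print(labels)
```

Each column of the result is one clustering, so the same four points receive two different, non-redundant groupings.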

clustpy.deep.enrc.enrc_predict_batchwise(V: Tensor, centers: list, subspace_betas: Tensor, model: Module, dataloader: DataLoader, device: device = device(type='cpu'), use_P: bool = False) ndarray[source]

Predicts the labels for each clustering of a dataloader in a mini-batch manner.

Parameters:
  • V (torch.Tensor) – orthogonal rotation matrix

  • centers (list) – list of torch.Tensor, cluster centers for each clustering

  • subspace_betas (torch.Tensor) – weights for each dimension per clustering. Calculated via softmax(beta_weights).

  • model (torch.nn.Module) – the input model for encoding the data

  • dataloader (torch.utils.data.DataLoader) – dataloader to be used for prediction

  • device (torch.device) – device to be predicted on (default: torch.device(‘cpu’))

  • use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: False)

Returns:

predicted_labels – n x c matrix, where n is the number of data points in z and c is the number of clusterings.

Return type:

np.ndarray

clustpy.deep.enrc.nrkmeans_init(data: ~numpy.ndarray, n_clusters: list, rounds: int = 10, max_iter: int = 100, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, random_state: ~numpy.random.mtrand.RandomState = None, debug=True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Initialization strategy based on the NrKmeans algorithm. This strategy is preferred for small data sets, but the orthogonality constraint on V, and subsequently on the clustered subspaces, can sometimes be too limiting in practice, e.g., if clusterings are not perfectly non-redundant.

Parameters:
  • data (np.ndarray) – input data

  • n_clusters (list) – list of ints, number of clusters for each clustering

  • rounds (int) – number of repetitions of the NrKmeans algorithm (default: 10)

  • max_iter (int) – maximum number of iterations of NrKmeans (default: 100)

  • input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)

  • P (list) – list containing projections for each subspace (optional) (default: None)

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • debug (bool) – if True then the cost of each round will be printed (default: True)

Returns:

tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.

Return type:

(list, list, np.ndarray, np.ndarray)

clustpy.deep.enrc.optimal_beta(kmeans_loss: Tensor, other_losses_mean_sum: Tensor) Tensor[source]

Calculate optimal values for the beta weight for each dimension.

Parameters:
  • kmeans_loss (torch.Tensor) – a 1 x d vector of the kmeans losses per dimension.

  • other_losses_mean_sum (torch.Tensor) – a 1 x d vector of the kmeans losses of all other clusterings except the one in ‘kmeans_loss’.

Returns:

optimal_beta_weights – a 1 x d vector containing the optimal weights for the softmax to indicate which dimensions are important for each clustering. Calculated via -torch.log(kmeans_loss/other_losses_mean_sum)

Return type:

torch.Tensor
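The formula given for optimal_beta can be tried out on small numbers. The sketch below mirrors -torch.log(kmeans_loss/other_losses_mean_sum) with NumPy so that it runs without torch; the function name and the example values are made up for illustration.

```python
import numpy as np


def optimal_beta_sketch(kmeans_loss, other_losses_mean_sum):
    """NumPy mirror of the documented formula: -log(kmeans_loss / other_losses_mean_sum)."""
    return -np.log(kmeans_loss / other_losses_mean_sum)


# 1 x d vectors of per-dimension losses (d = 3), values chosen for illustration
kmeans_loss = np.array([[0.5, 2.0, 1.0]])
other_losses_mean_sum = np.array([[2.0, 0.5, 1.0]])

weights = optimal_beta_sketch(kmeans_loss, other_losses_mean_sum)
print(np.round(weights, 3))
```

A dimension in which this clustering has a lower loss than the others receives a positive weight, a dimension with a higher loss receives a negative weight, and equal losses give a weight of zero.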

clustpy.deep.enrc.random_nrkmeans_init(data: ~numpy.ndarray, n_clusters: list, rounds: int = 10, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, random_state: ~numpy.random.mtrand.RandomState = None, debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Initialization strategy based on the NrKmeans Algorithm. For documentation see nrkmeans_init function. Same as nrkmeans_init, but max_iter is set to 1, so the results will be faster and more random.

Parameters:
  • data (np.ndarray) – input data

  • n_clusters (list) – list of ints, number of clusters for each clustering

  • rounds (int) – number of repetitions of the NrKmeans algorithm (default: 10)

  • input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)

  • P (list) – list containing projections for each subspace (optional) (default: None)

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • debug (bool) – if True then the cost of each round will be printed (default: True)

Returns:

tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.

Return type:

(list, list, np.ndarray, np.ndarray)

clustpy.deep.enrc.reinit_centers(enrc: _ENRC_Module, subspace_id: int, dataloader: DataLoader, model: Module, n_samples: int = 512, kmeans_steps: int = 10, split: str = 'random', debug: bool = False) None[source]

Reinitializes centers that have been lost, i.e. centers that did not get any data points assigned. Before a center is reinitialized, this method checks for how many mini-batch iterations the center has not been assigned any points; if this count is higher than enrc.reinit_threshold, the center will be reinitialized.

Parameters:
  • enrc (_ENRC_Module) – torch.nn.Module instance for the ENRC algorithm

  • subspace_id (int) – index of the subspace that contains the clusters to be checked

  • dataloader (torch.utils.data.DataLoader) – dataloader from which data is randomly sampled. Important: shuffle=True needs to be set, because n_samples random samples are drawn

  • model (torch.nn.Module) – neural network used for the embedding

  • n_samples (int) – number of samples that should be used for the reclustering (default: 512)

  • kmeans_steps (int) – number of mini-batch kmeans steps that should be conducted with the new centroid (default: 10)

  • split (str) – {‘random’, ‘cost’}, selects how clusters should be split for reinitialization. ‘random’: split a random point from the random sample of size n_samples. ‘cost’: split the cluster with max kmeans cost (default: ‘random’)

  • debug (bool) – if True then training errors will be printed (default: False)
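The bookkeeping behind the lost-center check can be sketched in plain Python. This only illustrates the counting rule described above (reinitialize when the count is higher than the threshold); the helper names update_lost_counts and centers_to_reinit are hypothetical, and the actual reclustering step is left out.

```python
def update_lost_counts(assignments, lost_counts):
    """Increase the counter of every center that received no point in this
    mini-batch and reset the counter of every center that did."""
    assigned = set(assignments)
    return [0 if center in assigned else count + 1
            for center, count in enumerate(lost_counts)]


def centers_to_reinit(lost_counts, reinit_threshold):
    """Return the ids of centers whose counter is higher than the threshold."""
    return [center for center, count in enumerate(lost_counts)
            if count > reinit_threshold]


# Hypothetical run: 3 centers, center 2 never receives a point.
lost_counts = [0, 0, 0]
batches = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
for batch_assignments in batches:
    lost_counts = update_lost_counts(batch_assignments, lost_counts)
lost = centers_to_reinit(lost_counts, reinit_threshold=2)
```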

clustpy.deep.enrc.sgd_init(data: ~numpy.ndarray, n_clusters: list, optimizer_params: dict, batch_size: int = 128, optimizer_class: ~torch.optim.optimizer.Optimizer = None, rounds: int = 2, epochs: int = 10, random_state: ~numpy.random.mtrand.RandomState = None, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, device: ~torch.device = device(type='cpu'), debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Initialization strategy based on optimizing ENRC’s parameters V and beta in isolation from the neural network using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than nrkmeans_init and only constrains V using the reconstruction error (torch.nn.MSELoss), which can be more flexible than the orthogonality constraint of NrKmeans. A problem of the sgd_init strategy is that it can be less stable for small data sets.

Parameters:
  • data (np.ndarray) – input data

  • n_clusters (list) – list of ints, number of clusters for each clustering

  • optimizer_params (dict) – parameters of the optimizer used to optimize V and beta, includes the learning rate

  • batch_size (int) – size of the data batches (default: 128)

  • optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used (default: None)

  • rounds (int) – number of repetitions of the initialization procedure (default: 2)

  • epochs (int) – number of epochs for the actual clustering procedure (default: 10)

  • random_state (np.random.RandomState) – random state for reproducible results (default: None)

  • input_centers (list) – list of np.ndarray, initial cluster centers to use (optional) (default: None)

  • P (list) – list containing projections for each subspace (optional) (default: None)

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • device (torch.device) – device on which the training should be performed (default: torch.device(‘cpu’))

  • debug (bool) – if True then the cost of each round will be printed (default: True)

Returns:

tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.

Return type:

(list, list, np.ndarray, np.ndarray)
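To illustrate why a reconstruction error is a softer constraint on V than strict orthogonality, the following plain-Python sketch compares an orthogonal and a non-orthogonal matrix. It assumes the reconstruction is computed as (z V) V^T and measured with a mean squared error; this form and all helper names are illustrative assumptions, not clustpy code.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def reconstruction_error(z, v):
    """Mean squared error between z and (z V) V^T."""
    z_back = matmul(matmul(z, v), transpose(v))
    n_values = len(z) * len(z[0])
    return sum((a - b) ** 2
               for row, row_back in zip(z, z_back)
               for a, b in zip(row, row_back)) / n_values


angle = math.radians(30.0)
rotation = [[math.cos(angle), -math.sin(angle)],
            [math.sin(angle), math.cos(angle)]]  # orthogonal
scaling = [[2.0, 0.0], [0.0, 1.0]]               # not orthogonal
z = [[1.0, 2.0], [3.0, 4.0]]
err_rotation = reconstruction_error(z, rotation)
err_scaling = reconstruction_error(z, scaling)
```

An orthogonal V gives an error of (numerically) zero, while other matrices are penalized in proportion to how badly they reconstruct the input, so the optimizer is pushed toward orthogonality instead of being forced onto it.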

clustpy.deep.vade module

@authors: Donatella Novakovic, Lukas Miklautz, Collin Leiber

class clustpy.deep.vade.VaDE(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 10, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = BCELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Variational Deep Embedding (VaDE) algorithm. First, a variational autoencoder (VAE) will be trained (will be skipped if input neural network is given). Afterward, a GMM will be fit to identify the initial clustering structures. Last, the VAE will be optimized using the VaDE loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 10)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.BCELoss(reduction=’sum’))

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new VariationalAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (central layer with mean and variance) (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: GaussianMixture)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {“n_init”: 10, “covariance_type”: “diag”})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The labels as identified by a final Gaussian Mixture Model

Type:

np.ndarray

cluster_centers_

The cluster centers as identified by a final Gaussian Mixture Model

Type:

np.ndarray

covariances_

The covariance matrices as identified by a final Gaussian Mixture Model

Type:

np.ndarray

weights_

The weights as identified by a final Gaussian Mixture Model

Type:

np.ndarray

vade_labels_

The labels as identified by VaDE after the training terminated

Type:

np.ndarray

vade_cluster_centers_

The cluster centers as identified by VaDE after the training terminated

Type:

np.ndarray

vade_covariances_

The covariance matrices as identified by VaDE after the training terminated

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module
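The attributes cluster_centers_, covariances_ and weights_ describe a Gaussian mixture in the embedding. The following plain-Python sketch shows how hard labels follow from such a mixture, assuming diagonal covariances (the default initial_clustering_params use covariance_type “diag”). The function diag_gmm_labels is a hypothetical helper for illustration and not how clustpy computes labels_ internally.

```python
import math


def diag_gmm_labels(points, weights, means, variances):
    """Assign each point to the component with the highest posterior.

    Component k is a Gaussian with mean means[k] and a diagonal covariance
    whose diagonal entries are variances[k].
    """
    labels = []
    for x in points:
        scores = []
        for w, mu, var in zip(weights, means, variances):
            log_density = sum(
                -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
                for xi, m, v in zip(x, mu, var)
            )
            # log(weight) + log(density) is the unnormalized log posterior
            scores.append(math.log(w) + log_density)
        labels.append(scores.index(max(scores)))
    return labels


# Hypothetical two-component model in a 2-D embedding.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
points = [[0.2, -0.1], [4.8, 5.3], [1.0, 1.0]]
labels = diag_gmm_labels(points, weights, means, variances)
```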

Examples

>>> import numpy as np
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import VaDE
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> data = (data - np.mean(data)) / np.std(data)
>>> vade = VaDE(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> vade.fit(data)

References

Jiang, Zhuxi, et al. “Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering.” IJCAI. 2017.

fit(X: ndarray, y: ndarray = None) VaDE[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the VaDE algorithm

Return type:

VaDE

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

Module contents

class clustpy.deep.ACeDeC(n_clusters: int, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 20, init: str = 'acedec', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]

Bases: ENRC

Autoencoder Centroid-based Deep Cluster (ACeDeC) can be seen as a special case of ENRC where we have one cluster space and one shared space with a single cluster.

Parameters:
  • n_clusters (int) – number of clusters

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • P (list) – list containing projections for clusters in clustered space and cluster in shared space (optional) (default: None)

  • input_centers (list) – list containing the cluster centers for clusters in clustered space and cluster in shared space (optional) (default: None)

  • batch_size (int) – size of the data batches (default: 128)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)

  • tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1-tolerance_threshold) for all clusterings, the training will stop before the maximum number of epochs (clustering_epochs) is reached. A high value stops the training earlier; if set to 0 or None, the training continues until the labels stop changing (default: None)

  • optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)

  • init (str) – choose which initialization strategy should be used. Has to be one of ‘acedec’, ‘subkmeans’, ‘random’ or ‘sgd’ (default: ‘acedec’)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)

  • scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)

  • init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)

  • init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10000)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy. (default: True)

  • debug (bool) – if True additional information during the training will be printed (default: False)
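The tolerance_threshold stopping rule compares the labels of two consecutive epochs via NMI. Below is a minimal plain-Python sketch of that check. It assumes NMI is normalized by the arithmetic mean of the two entropies; the functions nmi and should_stop are illustrative helpers, not clustpy functions.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Normalized mutual information with arithmetic-mean normalization."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum((c / n) * math.log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    mean_entropy = 0.5 * (entropy(labels_a) + entropy(labels_b))
    # Two constant labelings carry no information; treat them as identical.
    return 1.0 if mean_entropy == 0.0 else mi / mean_entropy


def should_stop(old_labels, new_labels, tolerance_threshold):
    """Stop when NMI(old_labels, new_labels) >= 1 - tolerance_threshold."""
    return nmi(old_labels, new_labels) >= 1.0 - tolerance_threshold


old_labels = [0, 0, 1, 1, 2, 2]
same_partition = [1, 1, 2, 2, 0, 0]   # same clusters, different label ids
other_partition = [0, 1, 0, 1, 0, 1]  # unrelated clustering
```

Because NMI ignores the label ids, a pure renaming of the clusters counts as "no change", whereas an unrelated partition yields an NMI of 0 and training would continue.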

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network

The final neural_network

Type:

torch.nn.Module

Raises:

ValueError – if init is not one of ‘acedec’, ‘subkmeans’, ‘random’, ‘auto’ or ‘sgd’.

References

Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Böhm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832

fit(X: ndarray, y: ndarray = None) ACeDeC[source]

Cluster the input dataset with the ACeDeC algorithm. Saves the labels, centers, V, m, Betas, and P in the ACeDeC object. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – input data

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – returns the ACeDeC object

Return type:

ACeDeC

predict(X: ndarray, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]

Predicts the labels of the input data.

Parameters:
  • X (np.ndarray) – input data

  • use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)

  • dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray
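The difference between use_P=True and use_P=False can be sketched in plain Python. The sketch assumes that samples and centers are already in the rotated space, that P is a list of dimension indices, and that the beta weights scale the squared differences per dimension; all names are hypothetical and the real implementation may differ.

```python
def hard_distance(x_rot, center_rot, dims):
    """Squared distance restricted to the dimensions listed in P."""
    return sum((x_rot[d] - center_rot[d]) ** 2 for d in dims)


def soft_distance(x_rot, center_rot, betas):
    """Squared distance with every dimension weighted by its beta value."""
    return sum(b * (x - c) ** 2 for b, x, c in zip(betas, x_rot, center_rot))


def predict_label(x_rot, centers_rot, dims=None, betas=None):
    """Pick the nearest center using P (dims) if given, else the betas."""
    if dims is not None:
        dists = [hard_distance(x_rot, c, dims) for c in centers_rot]
    else:
        dists = [soft_distance(x_rot, c, betas) for c in centers_rot]
    return dists.index(min(dists))


# Hypothetical rotated sample and centers in a 3-D embedding.
x_rot = [1.0, 0.0, 9.0]
centers_rot = [[1.0, 0.0, 0.0], [3.0, 0.0, 9.0]]
label_hard = predict_label(x_rot, centers_rot, dims=[0, 1])
label_soft = predict_label(x_rot, centers_rot, betas=[0.6, 0.3, 0.1])
```

With the hard selection the third dimension is ignored entirely and the sample falls to the first center. With soft weights that dimension still contributes a little, which is enough here to move the sample to the second center.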

set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ACeDeC

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.

  • use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.AEC(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = None, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Auto-encoder Based Data Clustering (AEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, the initial clusters are identified by the initial_clustering_class (random labels are used if it is None). Last, the network will be optimized using the AEC loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 50)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining. If this is None, random labels will be used (default: None)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
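When initial_clustering_class is None, random labels serve as the starting point. The following plain-Python sketch shows one way such a seeded random initialization could look, with cluster centers taken as the mean of the assigned embedded points. The helper random_initial_clustering and its balanced label assignment are assumptions made for illustration and not clustpy's implementation.

```python
import random


def random_initial_clustering(embedded, n_clusters, seed):
    """Draw random labels and compute each center as the mean of its points.

    Hypothetical helper; requires len(embedded) >= n_clusters.
    """
    rng = random.Random(seed)
    n = len(embedded)
    # Cycle through the cluster ids so that every cluster gets a point,
    # then shuffle to randomize which point receives which label.
    labels = [i % n_clusters for i in range(n)]
    rng.shuffle(labels)
    dim = len(embedded[0])
    centers = []
    for k in range(n_clusters):
        members = [x for x, label in zip(embedded, labels) if label == k]
        centers.append([sum(x[d] for x in members) / len(members)
                        for d in range(dim)])
    return labels, centers


# Hypothetical embedded data: 9 points in 2 dimensions.
embedded = [[float(i), float(-i)] for i in range(9)]
labels, centers = random_initial_clustering(embedded, n_clusters=3, seed=0)
```

Fixing the seed makes the initialization repeatable, which mirrors the purpose of the random_state parameter.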

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import AEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> aec = AEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> aec.fit(data)

References

Song, Chunfeng, et al. “Auto-encoder based data clustering.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 18th Iberoamerican Congress, CIARP 2013, Havana, Cuba, November 20-23, 2013, Proceedings, Part I 18. Springer Berlin Heidelberg, 2013.

fit(X: ndarray, y: ndarray = None) AEC[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the AEC algorithm

Return type:

AEC

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

class clustpy.deep.DCN(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 50, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 0.05, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Clustering Network (DCN) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DCN loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 50)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 50)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • clustering_loss_weight (float) – weight of the clustering loss (default: 0.05)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
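The interplay of ssl_loss_weight and clustering_loss_weight can be sketched in plain Python. The sketch assumes that the total loss is a weighted sum of a reconstruction loss (MSE) and a k-means-style loss (squared distance between the embedded sample and its assigned center); the function combined_loss is hypothetical and only illustrates the weighting.

```python
def mse(a, b):
    """Mean squared error between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def combined_loss(x, reconstruction, embedded, center,
                  ssl_loss_weight=1.0, clustering_loss_weight=0.05):
    """Weighted sum of the reconstruction loss and the clustering loss."""
    ssl_loss = mse(x, reconstruction)
    clustering_loss = sum((e - c) ** 2 for e, c in zip(embedded, center))
    return ssl_loss_weight * ssl_loss + clustering_loss_weight * clustering_loss


# Hypothetical sample: reconstruction loss 2.0, clustering loss 4.0.
loss = combined_loss(x=[1.0, 2.0], reconstruction=[1.0, 4.0],
                     embedded=[0.0, 0.0], center=[2.0, 0.0])
```

With the default weights the clustering term contributes only 0.05 * 4.0 = 0.2 to the total of 2.2, so the reconstruction loss dominates unless clustering_loss_weight is raised.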

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dcn_labels_

The final DCN labels

Type:

np.ndarray

dcn_cluster_centers_

The final DCN cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DCN
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dcn = DCN(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dcn.fit(data)

References

Yang, Bo, et al. “Towards k-means-friendly spaces: Simultaneous deep learning and clustering.” International Conference on Machine Learning. PMLR, 2017.

fit(X: ndarray, y: ndarray = None) DCN[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DCN algorithm

Return type:

DCN

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

class clustpy.deep.DDC(ratio: float = 0.1, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, tsne_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Density-based Image Clustering (DDC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, t-SNE is executed on the embedded data and a variant of the Density Peak Clustering algorithm is executed.

Parameters:
  • ratio (float) – The ratio parameter, defining the cutoff distance d_c by calculating: average pairwise distance * ratio (default: 0.1)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • tsne_params (dict) – Parameters for the t-SNE execution. For example, perplexity can be changed by setting tsne_params to {“n_components”: 2, “perplexity”: 25}. Check out sklearn.manifold.TSNE for more information (default: {“n_components”: 2})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
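The ratio parameter defines the cutoff distance d_c as the average pairwise distance times ratio. A minimal plain-Python sketch follows, assuming Euclidean distances on the 2-D t-SNE output. The local_density function shows the classic Density Peak Clustering density for context; the variant used by DDC may compute it differently, and both helper names are hypothetical.

```python
import math
from itertools import combinations


def cutoff_distance(points, ratio=0.1):
    """d_c = average pairwise Euclidean distance * ratio."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return ratio * sum(dists) / len(dists)


def local_density(points, d_c):
    """Number of other points closer than d_c (classic Density Peak
    Clustering density; the DDC variant may differ)."""
    return [sum(1 for j, q in enumerate(points)
                if j != i and math.dist(p, q) < d_c)
            for i, p in enumerate(points)]


# Hypothetical 2-D points: three close together and one far away.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]]
d_c = cutoff_distance(points, ratio=0.5)
rho = local_density(points, d_c)
```

A larger ratio widens the neighborhood and therefore tends to merge nearby density peaks, while a smaller ratio leads to more, smaller clusters.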

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels (obtained by a variant of Density Peak Clustering)

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

tsne_

The t-SNE object

Type:

TSNE

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DDC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> ddc = DDC(pretrain_epochs=3)
>>> ddc.fit(data)

References

Ren, Yazhou, et al. “Deep density-based image clustering.” Knowledge-Based Systems 197 (2020): 105841.

fit(X: ndarray, y: ndarray = None) DDC[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DDC algorithm

Return type:

DDC

class clustpy.deep.DEC(n_clusters: int, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedded Clustering (DEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DEC loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN

  • alpha (float) – alpha value for the prediction (default: 1.0)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dec_labels_

The final DEC labels

Type:

np.ndarray

dec_cluster_centers_

The final DEC cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dec = DEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dec.fit(data)
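
To make the role of alpha more concrete, the following sketch follows the formulation in the DEC paper cited under References: soft assignments come from a Student's t kernel with alpha degrees of freedom, and a sharpened target distribution is derived from them. It is a plain-Python illustration under that assumption, not clustpy's implementation; the helper names soft_assignments and target_distribution are hypothetical.

```python
def soft_assignments(embedded, centers, alpha=1.0):
    """Student's t kernel between embedded samples and cluster centers, normalized per sample."""
    q = []
    for z in embedded:
        weights = [
            (1.0 + sum((a - b) ** 2 for a, b in zip(z, c)) / alpha) ** (-(alpha + 1.0) / 2.0)
            for c in centers
        ]
        total = sum(weights)
        q.append([w / total for w in weights])
    return q

def target_distribution(q):
    """Square the soft assignments, divide by the soft cluster frequencies, renormalize."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

# Toy embedding with two obvious groups and one center per group
embedded = [(0.0, 0.0), (0.5, 0.0), (4.0, 4.0), (4.5, 4.0)]
centers = [(0.0, 0.0), (4.0, 4.0)]
q = soft_assignments(embedded, centers, alpha=1.0)
p = target_distribution(q)
hard_labels = [row.index(max(row)) for row in q]
```

In the paper's formulation, the clustering loss is the KL divergence between p and q, which pulls the soft assignments toward the sharper targets.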

References

Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis.” International conference on machine learning. 2016.

fit(X: ndarray, y: ndarray = None) DEC[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DEC algorithm

Return type:

DEC

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

class clustpy.deep.DKM(n_clusters: int, alphas: tuple = 1000, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 50, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep k-Means (DKM) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DKM loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN

  • alphas (tuple) – tuple of different alpha values used for the prediction. Small values close to 0 are equivalent to homogeneous assignments to all clusters. Large values simulate a clear assignment as with KMeans. If None, the default calculation of the paper will be used. This is equal to alpha_{i+1} = 2^{1/log(i)^2} * alpha_i with alpha_1 = 0.1 and maximum i = 40. Alpha can also be a tuple with (None, alpha_1, maximum i) (default: (1000))

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 50)

  • clustering_epochs (int) – number of epochs for each alpha value for the actual clustering procedure. The total number of clustering epochs therefore corresponds to: len(alphas)*clustering_epochs (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dkm_labels_

The final DKM labels

Type:

np.ndarray

dkm_cluster_centers_

The final DKM cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DKM
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dkm = DKM(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dkm.fit(data)
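
The effect of the alphas parameter can be illustrated without any deep learning machinery. The sketch below generates the schedule from the recursion given in the parameter description and shows how alpha turns a softmax over negative squared distances from a uniform assignment into an almost hard one. Both helper names are hypothetical. Using log(i + 1) inside the recursion is an assumption of this sketch, made so that the first step does not divide by log(1) = 0; clustpy may index the schedule differently.

```python
import math

def default_alphas(init_alpha=0.1, n_alphas=40):
    """Annealing schedule alpha_{i+1} = 2^(1 / log(i)^2) * alpha_i (index shift is an assumption)."""
    alphas = [init_alpha]
    for i in range(1, n_alphas):
        # log(i + 1) avoids a division by zero for the first step
        alphas.append(2 ** (1 / math.log(i + 1) ** 2) * alphas[-1])
    return alphas

def soft_assignment(sq_distances, alpha):
    """Softmax over -alpha * squared distance to each cluster center."""
    d_min = min(sq_distances)  # shift for numerical stability, does not change the result
    weights = [math.exp(-alpha * (d - d_min)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]

alphas = default_alphas()
sq_distances = [1.0, 4.0]  # squared distances of one sample to two centers (toy values)
nearly_uniform = soft_assignment(sq_distances, alpha=1e-6)
nearly_hard = soft_assignment(sq_distances, alpha=1000)
```

With a tiny alpha both clusters receive roughly the same weight, while alpha=1000 (the documented default) puts practically all weight on the closest center.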

References

Fard, Maziar Moradi, Thibaut Thonet, and Eric Gaussier. “Deep k-means: Jointly clustering with k-means and learning representations.” Pattern Recognition Letters 138 (2020): 185-192.

fit(X: ndarray, y: ndarray = None) DKM[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DKM algorithm

Return type:

DKM

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

class clustpy.deep.DeepECT(max_n_leaf_nodes: int = 20, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 50, clustering_epochs: int = 200, grow_interval: int = 2, pruning_threshold: float = 0.1, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedded Cluster Tree (DeepECT) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, a cluster tree will be grown and the network will be optimized using the DeepECT loss function.

Parameters:
  • max_n_leaf_nodes (int) – Maximum number of leaf nodes in the cluster tree (default: 20)

  • batch_size (int) – Size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 50)

  • clustering_epochs (int) – Number of epochs for the actual clustering procedure (default: 200)

  • grow_interval (int) – Number of epochs after which the tree is grown (default: 2)

  • pruning_threshold (float) – The threshold for pruning the tree (default: 0.1)

  • optimizer_class (torch.optim.Optimizer) – The optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – Size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – Use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

tree_

The prediction cluster tree after training

Type:

PredictionClusterTree

neural_network

The final neural network

Type:

torch.nn.Module

fit(X: ndarray, y: ndarray = None) DeepECT[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – This instance of the DeepECT algorithm

Return type:

DeepECT

flat_clustering(n_leaf_nodes_to_keep: int) ndarray[source]

Transform the predicted labels into a flat clustering result by keeping only n_leaf_nodes_to_keep leaf nodes in the tree. Returns labels as if the clustering procedure had stopped at the specified number of leaf nodes. Note that each leaf node corresponds to a cluster.

Parameters:

n_leaf_nodes_to_keep (int) – The number of leaf nodes to keep in the cluster tree

Returns:

labels_pruned – The new cluster labels

Return type:

np.ndarray
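
The idea behind flat_clustering can be sketched on a toy tree. The example assumes that the order in which nodes were split is known, so that "stopping at k leaf nodes" means keeping only the first k - 1 splits and assigning every sample to its deepest surviving ancestor. The dictionaries and the function name flatten_tree_labels are hypothetical stand-ins for this illustration and do not reflect the PredictionClusterTree API.

```python
def flatten_tree_labels(parent, split_step, leaf_per_sample, n_leaf_nodes_to_keep):
    """Map each sample's leaf to the node it would belong to with only k leaves."""
    # A tree with k leaves results from the first (k - 1) splits; later splits are undone
    kept_splits = {node for node, step in split_step.items() if step < n_leaf_nodes_to_keep - 1}
    cluster_nodes = []
    for leaf in leaf_per_sample:
        # Collect the path leaf -> root
        path = [leaf]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        # Walk down from the root until reaching a node whose split was undone
        node = path[-1]
        for child in reversed(path[:-1]):
            if node not in kept_splits:
                break
            node = child
        cluster_nodes.append(node)
    # Relabel nodes as consecutive integers in order of first appearance
    ids = {}
    return [ids.setdefault(node, len(ids)) for node in cluster_nodes]

# Toy tree: root was split first, then A, then B
parent = {"root": None, "A": "root", "B": "root",
          "A1": "A", "A2": "A", "B1": "B", "B2": "B"}
split_step = {"root": 0, "A": 1, "B": 2}
leaf_per_sample = ["A1", "A2", "B1", "B2", "A1"]

labels_2 = flatten_tree_labels(parent, split_step, leaf_per_sample, 2)
labels_3 = flatten_tree_labels(parent, split_step, leaf_per_sample, 3)
labels_4 = flatten_tree_labels(parent, split_step, leaf_per_sample, 4)
```

With two leaf nodes the samples collapse into the subtrees A and B; with three, only the later split of B is undone.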

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

class clustpy.deep.DipDECK(n_clusters_init: int = 35, dip_merge_threshold: float = 0.9, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, max_n_clusters: int = inf, min_n_clusters: int = 1, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 5, max_cluster_size_diff_factor: float = 2, pval_strategy: str = 'table', n_boots: int = 1000, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Deep Embedded Clustering with k-Estimation (DipDECK) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters using an overestimated number of clusters. Last, the network will be optimized using the DipDECK loss function. If any Dip-p-value exceeds the dip_merge_threshold, the corresponding clusters will be merged.

Parameters:
  • n_clusters_init (int) – initial number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 35)

  • dip_merge_threshold (float) – threshold regarding the Dip-p-value that defines if two clusters should be merged. Must be between 0 and 1 (default: 0.9)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • max_n_clusters (int) – maximum number of clusters. Must be larger than min_n_clusters. If the result has more clusters, a merge will be forced (default: np.inf)

  • min_n_clusters (int) – minimum number of clusters. Must be larger than 0, smaller than max_n_clusters and smaller than n_clusters_init. When this number of clusters is reached, no further merges will be performed (default: 1)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure. Will reset after each merge (default: 50)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 5)

  • max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for the Dip calculation (default: 2)

  • pval_strategy (str) – Defines which strategy to use to obtain dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)

  • n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

labels_

The final labels

Type:

np.ndarray

n_clusters_

The final number of clusters

Type:

int

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipDECK
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipdeck = DipDECK(pretrain_epochs=3, clustering_epochs=3)
>>> dipdeck.fit(data)
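
How dip_merge_threshold, min_n_clusters and max_n_clusters interact can be sketched as a small decision function. It assumes that a symmetric matrix of pairwise Dip-p-values between the current clusters is already available (computing it requires the Dip test and is out of scope here). The function name next_merge is hypothetical, and the logic only mirrors the parameter descriptions above rather than clustpy's internal code.

```python
def next_merge(p_values, dip_merge_threshold=0.9, min_n_clusters=1, max_n_clusters=float("inf")):
    """Return the pair of cluster ids to merge next, or None if no merge should happen."""
    n_clusters = len(p_values)
    # No merge once the minimum number of clusters is reached
    if n_clusters < 2 or n_clusters <= min_n_clusters:
        return None
    # Find the pair with the highest Dip-p-value (upper triangle of the symmetric matrix)
    best_pair, best_p = None, -1.0
    for i in range(n_clusters):
        for j in range(i + 1, n_clusters):
            if p_values[i][j] > best_p:
                best_pair, best_p = (i, j), p_values[i][j]
    # Merge if the threshold is exceeded, or force a merge if there are too many clusters
    if best_p > dip_merge_threshold or n_clusters > max_n_clusters:
        return best_pair
    return None

# Made-up Dip-p-values for three clusters
p_values = [[0.0, 0.95, 0.1],
            [0.95, 0.0, 0.3],
            [0.1, 0.3, 0.0]]

regular_merge = next_merge(p_values)                                          # 0.95 > 0.9
no_merge = next_merge(p_values, dip_merge_threshold=0.99)                     # nothing exceeds 0.99
forced_merge = next_merge(p_values, dip_merge_threshold=0.99, max_n_clusters=2)
blocked_merge = next_merge(p_values, min_n_clusters=3)
```

After each merge the p-value matrix would be recomputed for the reduced set of clusters, and according to the clustering_epochs description the epoch counter starts over.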

References

Leiber, Collin, et al. “Dip-based deep embedded clustering with k-estimation.” Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021.

fit(X: ndarray, y: ndarray = None) DipDECK[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DipDECK algorithm

Return type:

DipDECK

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray

class clustpy.deep.DipEncoder(n_clusters: int, batch_size: int = None, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, max_cluster_size_diff_factor: float = 3, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The DipEncoder. Can be used either as a clustering procedure if no ground truth labels are given or as a supervised dimensionality reduction technique. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DipEncoder loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN

  • batch_size (int) – size of the data batches for the actual training of the DipEncoder. Should grow with the number of clusters. If it is None, it will be set to (25 x n_clusters) (default: None)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used (default: 3)

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss. If None, it will be equal to 1/(4L), where L is the reconstruction loss of the first batch of an untrained neural network (default: None)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels

Type:

np.ndarray

projection_axes_

The final projection axes between the clusters

Type:

np.ndarray

index_dict_

A dictionary to match the indices of two clusters to a projection axis

Type:

dict

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipEncoder
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipencoder = DipEncoder(3, pretrain_epochs=3, clustering_epochs=3)
>>> dipencoder.fit(data)
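
The relationship between index_dict_ and projection_axes_ can be pictured with a small stand-alone sketch. It assumes one projection axis per unordered pair of clusters and a dictionary keyed by (i, j) with i < j; the exact key layout used by clustpy may differ, and the helper names and axis values below are made up for illustration.

```python
from itertools import combinations

def build_index_dict(n_clusters):
    """Map each cluster pair (i, j) with i < j to the index of its projection axis."""
    return {pair: axis for axis, pair in enumerate(combinations(range(n_clusters), 2))}

def project(samples, axis):
    """Project embedded samples onto a projection axis (dot product)."""
    return [sum(s * a for s, a in zip(sample, axis)) for sample in samples]

index_dict = build_index_dict(3)
# One made-up axis per cluster pair in a 2-d embedding
projection_axes = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
embedded = [(0.5, 2.0), (3.0, -1.0)]

# Look up the axis that separates clusters 0 and 2 and project the samples onto it
axis_id = index_dict[(0, 2)]
projected = project(embedded, projection_axes[axis_id])
```

For k clusters this layout yields k*(k-1)/2 axes, which is one reason why the batch_size description above recommends larger batches when there are more clusters.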

References

Leiber, Collin and Bauer, Lena G. M. and Neumayr, Michael and Plant, Claudia and Böhm, Christian “The DipEncoder: Enforcing Multimodality in Autoencoders.” Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2022.

fit(X: ndarray, y: ndarray = None) DipEncoder[source]

Initiate the actual clustering/dimensionality reduction process on the input data set. If no ground truth labels are given, the resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – The given (training) data set

  • y (np.ndarray) – The ground truth labels. If None, the DipEncoder will be used for clustering (default: None)

Returns:

self – This instance of the DipEncoder

Return type:

DipEncoder

plot(X: ndarray, edge_width: float = 0.2, show_legend: bool = True) None[source]

Plot the current state of the DipEncoder. First the data set will be encoded using the neural network, afterwards the plot will be created. Uses the plot_scatter_matrix as a basis and adds projection axes in red.

Parameters:
  • X (np.ndarray) – The data set

  • edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots

  • show_legend (bool) – Specifies whether a legend should be added to the plot

predict(X_train: ndarray, X_test: ndarray) ndarray[source]

Predict the labels of the X_test dataset using the information gained by the fit function and the X_train dataset. Beware that the current labels influence the labels obtained by predict(). Therefore, it can occur that the outcome of dipencoder.fit(X) does not match dipencoder.predict(X).

Parameters:
  • X_train (np.ndarray) – The data set used to train the DipEncoder (i.e. to retrieve the projection axes, modal intervals, …)

  • X_test (np.ndarray) – The data set for which we want to retrieve the labels

Returns:

labels_pred – The predicted labels for X_test

Return type:

np.ndarray

set_predict_request(*, X_test: bool | None | str = '$UNCHANGED$', X_train: bool | None | str = '$UNCHANGED$') DipEncoder

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in predict.

  • X_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_train parameter in predict.

Returns:

self – The updated object.

Return type:

object

class clustpy.deep.ENRC(n_clusters: list, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 20, init: str = 'nrkmeans', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]

Bases: _AbstractDeepClusteringAlgo

The Embedded Non-Redundant Clustering (ENRC) algorithm.

Parameters:
  • n_clusters (list) – list containing number of clusters for each clustering

  • V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)

  • P (list) – list containing projections for each clustering (optional) (default: None)

  • input_centers (list) – list containing the cluster centers for each clustering (optional) (default: None)

  • batch_size (int) – size of the data batches (default: 128)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)

  • tolerance_threshold (float) – tolerance threshold that determines when the training should stop. If NMI(old_labels, new_labels) >= (1 - tolerance_threshold) for all clusterings, the training stops before the maximum number of epochs (clustering_epochs) is reached. A higher value therefore stops the training earlier. If set to 0 or None, the training continues until the labels no longer change or the maximum number of epochs is reached (default: None)

  • optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)

  • init (str) – choose which initialization strategy should be used. Has to be one of ‘nrkmeans’, ‘random’ or ‘sgd’ (default: ‘nrkmeans’)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)

  • scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)

  • init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)

  • init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10,000)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy (default: True)

  • debug (bool) – if True additional information during the training will be printed (default: False)
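The stopping rule described for tolerance_threshold can be sketched with the standard library alone. This is an illustration under stated assumptions, not clustpy's implementation: the helper names nmi and should_stop are hypothetical, and normalizing the mutual information by the mean of the two entropies is an assumption.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information, normalized by the mean entropy (assumption)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both labelings consist of a single cluster
    mi = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    return mi / ((h_a + h_b) / 2)


def should_stop(old_labels, new_labels, tolerance_threshold=None):
    """old_labels and new_labels hold one label sequence per clustering."""
    tol = tolerance_threshold or 0.0  # None and 0 both demand unchanged labels
    eps = 1e-12  # guards against floating-point round-off when NMI is exactly 1
    return all(
        nmi(old, new) >= 1.0 - tol - eps
        for old, new in zip(old_labels, new_labels)
    )
```

Because NMI is invariant to a renaming of the cluster ids, a pure permutation of the labels between two epochs counts as "unchanged" in this sketch.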

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Raises:

ValueError – if init is not one of ‘nrkmeans’, ‘random’, ‘auto’ or ‘sgd’.

References

Miklautz, Lukas & Dominik Mautz et al. “Deep embedded non-redundant clustering.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.

fit(X: ndarray, y: ndarray = None) ENRC[source]

Cluster the input dataset with the ENRC algorithm. Saves the labels, centers, V, m, Betas, and P in the ENRC object. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – input data

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – returns the ENRC object

Return type:

ENRC

plot_subspace(X: ndarray, subspace_index: int = 0, labels: ndarray = None, plot_centers: bool = False, gt: ndarray = None, equal_axis: bool = False) None[source]

Plot the subspace specified by subspace_index as a scatter matrix plot.

Parameters:
  • X (np.ndarray) – input data

  • subspace_index (int) – index of the subspace (default: 0)

  • labels (np.ndarray) – the labels to use for the plot. If None, the labels found by Nr-Kmeans are used (default: None)

  • plot_centers (bool) – plot centers if True (default: False)

  • gt (np.ndarray) – ground truth labels (default: None)

  • equal_axis (bool) – equalize axis if True (default: False)

Return type:

scatter matrix plot of the input data

predict(X: ndarray = None, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]

Predicts the labels for each clustering of X in a mini-batch manner.

Parameters:
  • X (np.ndarray) – input data

  • use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)

  • dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)

Returns:

predicted_labels – n x c matrix, where n is the number of data points in X and c is the number of clusterings.

Return type:

np.ndarray
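The shape of the returned n x c label matrix and the hard selection of dimensions via P can be illustrated with a small sketch. It assumes that each clustering assigns a point to the nearest center, measured only in the dimensions listed in P for that clustering. The function predict_hard and its arguments are hypothetical, the soft beta-weighted variant is not covered, and none of this is taken from clustpy's code.

```python
def predict_hard(rotated, P, centers):
    """Nearest-center labels per clustering.

    rotated: points that are already embedded and rotated (list of lists)
    P:       one list of dimension indices per clustering
    centers: per clustering, a list of centers given in the full rotated space
    """
    labels = []
    for point in rotated:
        row = []
        for dims, cluster_centers in zip(P, centers):
            # squared Euclidean distance restricted to the selected dimensions
            dists = [
                sum((point[d] - c[d]) ** 2 for d in dims)
                for c in cluster_centers
            ]
            row.append(dists.index(min(dists)))
        labels.append(row)
    return labels  # n rows, one column per clustering


rotated = [[0.0, 0.0, 5.0], [4.0, 4.0, 0.0]]
P = [[0, 1], [2]]
centers = [
    [[0.0, 0.0, 0.0], [4.0, 4.0, 0.0]],  # centers of clustering 0
    [[0.0, 0.0, 0.0], [0.0, 0.0, 5.0]],  # centers of clustering 1
]
labels = predict_hard(rotated, P, centers)
```

Column j of the result holds the labels of clustering j, so the two clusterings can group the same points differently.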

reconstruct_subspace_centroids(subspace_index: int = 0) ndarray[source]

Reconstructs the centroids in the subspace specified by subspace_index.

Parameters:

subspace_index (int) – index of the subspace (default: 0)

Returns:

centers_rec – reconstructed centers as np.ndarray

Return type:

np.ndarray

set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ENRC

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.

  • use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.

Returns:

self – The updated object.

Return type:

object

transform_full_space(X: ndarray, embedded=False) ndarray[source]

Embeds the input dataset with the neural network and the matrix V from the ENRC object.

Parameters:
  • X (np.ndarray) – input data

  • embedded (bool) – if True, then X is assumed to be already embedded (default: False)

Returns:

rotated – The transformed data

Return type:

np.ndarray

transform_subspace(X: ndarray, subspace_index: int = 0, embedded: bool = False) ndarray[source]

Embeds the input dataset with the neural network and with the matrix V projected onto the subspace specified by subspace_index.

Parameters:
  • X (np.ndarray) – input data

  • subspace_index (int) – index of the subspace (default: 0)

  • embedded (bool) – if True, then X is assumed to be already embedded (default: False)

Returns:

subspace – The transformed subspace

Return type:

np.ndarray
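The relation between the full-space and the subspace transformation can be sketched as a rotation followed by a selection of dimensions. The sketch assumes that the embedded data is multiplied by V and that a subspace is described by a list of dimension indices. The helper names rotate and project are hypothetical, and clustpy's actual computation may differ.

```python
def rotate(embedded, V):
    """Multiply every embedded row vector by the rotation matrix V."""
    n_out = len(V[0])
    return [
        [sum(z[k] * V[k][j] for k in range(len(V))) for j in range(n_out)]
        for z in embedded
    ]


def project(embedded, V, dims):
    """Rotate, then keep only the dimensions that belong to one subspace."""
    return [[row[d] for d in dims] for row in rotate(embedded, V)]


V = [[0.0, 1.0], [1.0, 0.0]]  # a simple orthogonal matrix that swaps both axes
embedded = [[1.0, 2.0]]
full_space = rotate(embedded, V)
subspace = project(embedded, V, dims=[1])
```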

class clustpy.deep.IDEC(n_clusters: int, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: DEC

The Improved Deep Embedded Clustering (IDEC) algorithm. It is identical to the DEC algorithm, except that it also uses the self-supervised learning loss during the clustering optimization. Further, clustering_loss_weight is set to 0.1 instead of 1 when using the default settings.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN

  • alpha (float) – alpha value for the prediction (default: 1.0)

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 0.1)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The final labels (obtained by a final KMeans execution)

Type:

np.ndarray

cluster_centers_

The final cluster centers (obtained by a final KMeans execution)

Type:

np.ndarray

dec_labels_

The final DEC labels

Type:

np.ndarray

dec_cluster_centers_

The final DEC cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import IDEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> idec = IDEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> idec.fit(data)

References

Guo, Xifeng, et al. “Improved deep embedded clustering with local structure preservation.” IJCAI. 2017.
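The roles of alpha, clustering_loss_weight and ssl_loss_weight in IDEC can be sketched as follows. The soft assignments and the target distribution follow the formulas of the DEC and IDEC papers. Combining the two losses as a weighted sum is an assumption, and all function names are hypothetical rather than part of clustpy.

```python
import math


def soft_assignments(z, centers, alpha=1.0):
    """Student's t kernel between embedded points and cluster centers."""
    q = []
    for point in z:
        weights = []
        for center in centers:
            sq_dist = sum((a - b) ** 2 for a, b in zip(point, center))
            weights.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        q.append([w / total for w in weights])
    return q


def target_distribution(q):
    """Sharpen q by squaring it and dividing by the soft cluster frequencies."""
    freq = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(len(row))]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """Clustering loss KL(P || Q), summed over all points and clusters."""
    return sum(
        p_ij * math.log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
    )


def total_loss(clustering_loss, ssl_loss,
               clustering_loss_weight=0.1, ssl_loss_weight=1.0):
    """Weighted sum of both terms (assumed form of the combined loss)."""
    return clustering_loss_weight * clustering_loss + ssl_loss_weight * ssl_loss


z = [[0.0, 0.0], [3.0, 0.0]]
centers = [[0.0, 0.0], [3.0, 0.0]]
q = soft_assignments(z, centers)
p = target_distribution(q)
```

With ssl_loss_weight set to 0 and clustering_loss_weight set to 1, the weighted sum reduces to the clustering loss alone, which matches the description of DEC above.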

class clustpy.deep.N2D(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, manifold_class: ~sklearn.base.TransformerMixin = <class 'sklearn.manifold._t_sne.TSNE'>, manifold_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Not Too Deep (N2D) clustering algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, a manifold learning technique (e.g. t-SNE, UMAP or ISOMAP) is applied to the embedded data and the EM algorithm is executed on the result.

Parameters:
  • n_clusters (int) – number of clusters

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • manifold_class (TransformerMixin) – the manifold technique class (default: TSNE)

  • manifold_params (dict) – Parameters for the manifold execution. For example, perplexity can be changed for TSNE by setting manifold_params to {“n_components”: 2, “perplexity”: 25}. Check out e.g. sklearn.manifold.TSNE for more information (default: {“n_components”: 2})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

n_clusters

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

cluster_centers_

The final cluster centers

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

manifold_

The manifold object

Type:

TransformerMixin

References

McConville, Ryan, et al. “N2d:(not too) deep clustering via clustering the local manifold of an autoencoded embedding.” 2020 25th international conference on pattern recognition (ICPR). IEEE, 2021.

fit(X: ndarray, y: ndarray = None) N2D[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the N2D algorithm

Return type:

N2D

class clustpy.deep.VaDE(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 10, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = BCELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]

Bases: _AbstractDeepClusteringAlgo

The Variational Deep Embedding (VaDE) algorithm. First, a variational autoencoder (VAE) will be trained (will be skipped if input neural network is given). Afterward, a GMM will be fit to identify the initial clustering structures. Last, the VAE will be optimized using the VaDE loss function.

Parameters:
  • n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN

  • batch_size (int) – size of the data batches (default: 256)

  • pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})

  • clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})

  • pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 10)

  • clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)

  • optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.BCELoss(reduction=’sum’))

  • clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)

  • ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)

  • neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new VariationalAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • embedding_size (int) – size of the embedding within the neural network (central layer with mean and variance) (default: 10)

  • custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)

  • initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: GaussianMixture)

  • initial_clustering_params (dict) – parameters for the initial clustering class (default: {“n_init”: 10, “covariance_type”: “diag”})

  • device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

labels_

The labels as identified by a final Gaussian Mixture Model

Type:

np.ndarray

cluster_centers_

The cluster centers as identified by a final Gaussian Mixture Model

Type:

np.ndarray

covariances_

The covariance matrices as identified by a final Gaussian Mixture Model

Type:

np.ndarray

weights_

The weights as identified by a final Gaussian Mixture Model

Type:

np.ndarray

vade_labels_

The labels as identified by VaDE after the training terminated

Type:

np.ndarray

vade_cluster_centers_

The cluster centers as identified by VaDE after the training terminated

Type:

np.ndarray

vade_covariances_

The covariance matrices as identified by VaDE after the training terminated

Type:

np.ndarray

neural_network

The final neural network

Type:

torch.nn.Module

Examples

>>> import numpy as np
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import VaDE
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> data = (data - np.mean(data)) / np.std(data)
>>> vade = VaDE(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> vade.fit(data)

References

Jiang, Zhuxi, et al. “Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering.” IJCAI. 2017.

fit(X: ndarray, y: ndarray = None) VaDE[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the VaDE algorithm

Return type:

VaDE

predict(X: ndarray) ndarray[source]

Predicts the labels of the input data.

Parameters:

X (np.ndarray) – input data

Returns:

predicted_labels – The predicted labels

Return type:

np.ndarray
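How labels can be derived from Gaussian mixture parameters in the embedded space is sketched below. The sketch assumes diagonal covariances and assigns each embedded point to the component with the highest weighted log-density. The function gmm_predict is hypothetical and is not claimed to mirror what VaDE.predict does internally.

```python
import math


def gmm_predict(z, weights, means, variances):
    """Assign each embedded point to its most probable mixture component.

    weights:   mixture weight per component
    means:     mean vector per component
    variances: diagonal of the covariance matrix per component
    """
    labels = []
    for point in z:
        scores = []
        for weight, mean, var in zip(weights, means, variances):
            # log-density of a Gaussian with diagonal covariance
            log_density = sum(
                -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
                for x, m, v in zip(point, mean, var)
            )
            scores.append(math.log(weight) + log_density)
        labels.append(scores.index(max(scores)))
    return labels


embedded = [[0.1, -0.2], [4.8, 5.1]]
predicted = gmm_predict(
    embedded,
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [5.0, 5.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
```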

clustpy.deep.decode_batchwise(dataloader: DataLoader, neural_network: Module) ndarray[source]

Utility function for decoding the whole data set in a mini-batch fashion, e.g., with an autoencoder. Note: Assumes an implemented decode function

Parameters:
  • dataloader (torch.utils.data.DataLoader) – dataloader to be used

  • neural_network (torch.nn.Module) – the neural network that is used for the decoding (e.g. an autoencoder)

  • device (torch.device) – device to be trained on

Returns:

reconstructions_numpy – The reconstructed data set

Return type:

np.ndarray

clustpy.deep.detect_device(device: device | int | str = None) device[source]

Automatically detects if you have a cuda enabled GPU. Device can also be read from environment variable “CLUSTPY_DEVICE”. It can be set using, e.g., os.environ[“CLUSTPY_DEVICE”] = “cuda:1”

Parameters:

device (torch.device | int | str) – the input device. Will be returned if it is not None (default: None)

Returns:

device – device on which the prediction should take place

Return type:

torch.device
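The lookup order suggested by this description can be sketched without torch. The precedence used here (explicit argument first, then the CLUSTPY_DEVICE environment variable, then automatic detection) is an assumption. The function resolve_device_name is hypothetical and returns plain strings instead of torch.device objects.

```python
import os


def resolve_device_name(device=None, cuda_available=False):
    """Return a device name following the assumed precedence."""
    if device is not None:
        return str(device)  # an explicitly passed device is returned as-is
    env_device = os.environ.get("CLUSTPY_DEVICE")
    if env_device is not None:
        return env_device  # e.g. "cuda:1"
    # fallback: automatic detection, reduced here to a boolean flag
    return "cuda" if cuda_available else "cpu"
```

In the real function the automatic detection would query torch for a CUDA-enabled GPU. The boolean flag only stands in for that check.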

clustpy.deep.encode_batchwise(dataloader: DataLoader, neural_network: Module) ndarray[source]

Utility function for embedding the whole data set in a mini-batch fashion

Parameters:
  • dataloader (torch.utils.data.DataLoader) – dataloader to be used

  • neural_network (torch.nn.Module) – the neural network that is used for the encoding (e.g. an autoencoder)

Returns:

embeddings_numpy – The embedded data set

Return type:

np.ndarray

clustpy.deep.encode_decode_batchwise(dataloader: ~torch.utils.data.dataloader.DataLoader, neural_network: ~torch.nn.modules.module.Module) -> (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Utility function for encoding and decoding the whole data set in a mini-batch fashion, e.g., with an autoencoder. Note: Assumes an implemented decode function

Parameters:
  • dataloader (torch.utils.data.DataLoader) – dataloader to be used

  • neural_network (torch.nn.Module) – the neural network that is used for the encoding and decoding (e.g. an autoencoder)

Returns:

tuple – The embedded data set, The reconstructed data set

Return type:

(np.ndarray, np.ndarray)
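The mini-batch pattern behind the encode and decode utilities can be sketched in plain Python. ToyAutoencoder and encode_decode_in_batches are hypothetical stand-ins. The real functions operate on torch tensors and return NumPy arrays, which this sketch does not attempt to reproduce.

```python
class ToyAutoencoder:
    """Hypothetical stand-in for a network offering encode() and decode()."""

    def encode(self, batch):
        return [[value / 2.0 for value in sample] for sample in batch]

    def decode(self, embedded):
        return [[value * 2.0 for value in sample] for sample in embedded]


def encode_decode_in_batches(batches, network):
    """Process every batch in order and concatenate the results."""
    embeddings, reconstructions = [], []
    for batch in batches:
        data = batch[1]  # entry 0 holds the sample indices, entry 1 the data
        embedded = network.encode(data)
        embeddings.extend(embedded)
        reconstructions.extend(network.decode(embedded))
    return embeddings, reconstructions


batches = [([0, 1], [[2.0], [4.0]]), ([2], [[6.0]])]
embedded, reconstructed = encode_decode_in_batches(batches, ToyAutoencoder())
```

The outputs line up with the original sample order only if the batches are iterated in non-random order, which is why a test loader is the natural input here.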

clustpy.deep.get_dataloader(X: ~numpy.ndarray | ~torch.Tensor, batch_size: int, shuffle: bool = True, drop_last: bool = False, additional_inputs: list | ~numpy.ndarray | ~torch.Tensor = None, dataset_class: ~torch.utils.data.dataset.Dataset = <class 'clustpy.deep._data_utils._ClustpyDataset'>, ds_kwargs: dict = None, dl_kwargs: dict = None) DataLoader[source]

Create a dataloader for Deep Clustering algorithms. First entry always contains the indices of the data samples. Second entry always contains the actual data samples. If for example labels are desired, they can be passed through the additional_inputs parameter (should be a list). Other customizations (e.g. augmentation) can be implemented using a custom dataset_class. This custom class should stick to the conventions, [index, data, …].

Parameters:
  • X (np.ndarray | torch.Tensor) – the actual data set (can be np.ndarray or torch.Tensor)

  • batch_size (int) – the batch size

  • shuffle (bool) – boolean that defines if the data set should be shuffled (default: True)

  • drop_last (bool) – boolean that defines if the last batch should be ignored (default: False)

  • additional_inputs (list | np.ndarray | torch.Tensor) – additional inputs for the dataloader, e.g. labels. Can be None, np.ndarray, torch.Tensor or a list containing np.ndarrays/torch.Tensors (default: None)

  • dataset_class (torch.utils.data.Dataset) – defines the class of the tensor dataset that is contained in the dataloader (default: _ClustpyDataset)

  • ds_kwargs (dict) –

    other arguments for dataset_class. An example usage would be to include augmentation or preprocessing transforms to the _ClustpyDataset by passing ds_kwargs={“aug_transforms_list”:[aug_transforms], “orig_transforms_list”:[orig_transforms]}, where aug_transforms and orig_transforms are transforming the input X, e.g., using torchvision.transforms.Compose to combine multiple transformations.

    Important: If aug_transforms_list is passed via ds_kwargs, the values returned by the dataloader change. The first entry will still be the indices of the data samples, but the second entry will be the transformed version of the actual data samples and the third entry will be the original data samples. If orig_transforms_list is passed as well, the third entry will be transformed accordingly, which might be needed for preprocessing the data. An example for MNIST is shown below.

  • dl_kwargs (dict) – other arguments for torch.utils.data.DataLoader
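The entry order described for ds_kwargs can be illustrated with a toy dataset that follows the [index, data, …] convention. ToyDataset is a hypothetical class written for this illustration only. It is not _ClustpyDataset, and it ignores the case where only an original-data transform is given.

```python
class ToyDataset:
    """Hypothetical dataset following the [index, data, ...] convention."""

    def __init__(self, data, aug_transform=None, orig_transform=None):
        self.data = data
        self.aug_transform = aug_transform
        self.orig_transform = orig_transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sample = self.data[index]
        if self.aug_transform is None:
            # default layout: index first, data second
            return index, sample
        original = sample
        if self.orig_transform is not None:
            original = self.orig_transform(sample)
        # augmented layout: index, augmented sample, original sample
        return index, self.aug_transform(sample), original


plain = ToyDataset([1.0, 2.0])
augmented = ToyDataset(
    [1.0, 2.0],
    aug_transform=lambda x: x + 0.5,   # stands in for a random augmentation
    orig_transform=lambda x: x * 10.0,  # stands in for preprocessing
)
```

A training loop that unpacks batches by position therefore has to know whether augmentation is active, because the meaning of the second entry changes.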

Examples

>>> # Example for usage of data transformations with get_dataloader
>>> from clustpy.data import load_mnist
>>> from clustpy.deep import get_dataloader
>>> import torch
>>> import torchvision
>>> # load and prepare data for torchvision.transforms
>>> data, labels = load_mnist()
>>> data = data.reshape(-1, 1, 28, 28)
>>> data /= 255.0
>>> data = torch.from_numpy(data).float()
>>> #
>>> # preprocessing functions
>>> mean = data.mean()
>>> std = data.std()
>>> normalize_fn = torchvision.transforms.Normalize([mean], [std])
>>> # flatten is only needed if a FeedForward network is used, otherwise this can be skipped.
>>> flatten_fn = torchvision.transforms.Lambda(torch.flatten)
>>> #
>>> # augmentation transforms
>>> transform_list = [
>>>     # transform input tensor to PIL image for augmentation
>>>     torchvision.transforms.ToPILImage(),
>>>     # apply transformations
>>>     torchvision.transforms.RandomAffine(degrees=(-16,+16),
>>>                                                 translate=(0.1, 0.1),
>>>                                                 shear=(-8, 8),
>>>                                                 fill=0),
>>>     # transform back to torch.tensor
>>>     torchvision.transforms.ToTensor(),
>>>     # preprocess and flatten
>>>     normalize_fn,
>>>     flatten_fn,
>>> ]
>>> #
>>> # augmentation transforms
>>> aug_transforms = torchvision.transforms.Compose(transform_list)
>>> # preprocessing transforms without augmentation
>>> orig_transforms = torchvision.transforms.Compose([normalize_fn, flatten_fn])
>>> #
>>> # pass transforms to dataloader
>>> aug_dl = get_dataloader(data, batch_size=32, shuffle=True,
>>>                         ds_kwargs={"aug_transforms_list":[aug_transforms], "orig_transforms_list":[orig_transforms]},
>>>                         )
Returns:

dataloader – The final dataloader

Return type:

torch.utils.data.DataLoader

clustpy.deep.get_default_augmented_dataloaders(X: ~numpy.ndarray | ~torch.Tensor, batch_size: int = 256, conv_used: bool = False, flatten: bool = True) -> (<class 'torch.utils.data.dataloader.DataLoader'>, <class 'torch.utils.data.dataloader.DataLoader'>)[source]

Get a train and a test dataloader using default augmentations. The transformations correspond to a min-max normalization followed by torchvision.transforms.RandomAffine(degrees=(-16, +16), translate=(0.1, 0.1), shear=(-8, 8), fill=0) and a channel-wise z-transformation. Optionally, the images can be flattened afterward.

Parameters:
  • X (np.ndarray | torch.Tensor) – the actual data set (can be np.ndarray or torch.Tensor)

  • batch_size (int) – the batch size (default: 256)

  • conv_used (bool) – defines whether a convolutional network will be used afterward. In this case, grayscale images will be transformed to receive three color channels by copying the grayscale channel three times (default: False)

  • flatten (bool) – defines whether the augmented images should be flattened afterward. Must be False if conv_used is True (default: True)

Returns:

tuple – The trainloader (with augmentations) and the testloader (without augmentations)

Return type:

(torch.utils.data.DataLoader, torch.utils.data.DataLoader)
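
The deterministic part of the preprocessing described above (min-max normalization, channel-wise z-transformation, optional flattening) can be sketched with NumPy. This is only an illustration and not clustpy's implementation: the helper name preprocess_sketch is hypothetical, the random affine augmentation is left out, and computing the statistics over the whole data set is an assumption.

```python
import numpy as np


def preprocess_sketch(X, flatten=True):
    """Hypothetical helper: min-max normalize, then z-transform each channel."""
    X = X.astype(np.float64)
    # min-max normalization to [0, 1] (assumption: computed over the whole data set)
    X = (X - X.min()) / (X.max() - X.min())
    # channel-wise z-transformation; X is expected to have shape (N, C, H, W)
    mean = X.mean(axis=(0, 2, 3), keepdims=True)
    std = X.std(axis=(0, 2, 3), keepdims=True)
    X = (X - mean) / std
    if flatten:
        # one feature vector per image
        X = X.reshape(X.shape[0], -1)
    return X


rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(8, 1, 28, 28))  # toy grayscale images
flat = preprocess_sketch(images)
print(flat.shape)  # (8, 784)
```

With flatten=False the output keeps its (N, C, H, W) shape, which is the layout a convolutional network would expect.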

clustpy.deep.get_device_from_module(neural_network: Module) -> device[source]

Get the device from a given module.

Parameters:

neural_network (torch.nn.Module) – the neural network that is used for the encoding (e.g. an autoencoder)

Returns:

device – device of the module

Return type:

torch.device
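
A common PyTorch idiom for looking up the device of a module is to inspect one of its parameters. The sketch below shows that idiom; the helper name device_of is hypothetical, and it is an assumption that get_device_from_module behaves similarly.

```python
import torch


def device_of(module: torch.nn.Module) -> torch.device:
    # Assumes the module has at least one parameter and that all
    # parameters live on the same device.
    return next(module.parameters()).device


layer = torch.nn.Linear(4, 2)
print(device_of(layer))
```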

clustpy.deep.get_trained_network(trainloader: ~torch.utils.data.dataloader.DataLoader = None, data: ~numpy.ndarray = None, n_epochs: int = 100, batch_size: int = 128, optimizer_params: dict = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, device=None, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), embedding_size: int = 10, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_class: ~torch.nn.modules.module.Module = <class 'clustpy.deep.neural_networks.feedforward_autoencoder.FeedforwardAutoencoder'>, neural_network_params: dict = None, neural_network_weights: str = None, random_state: ~numpy.random.mtrand.RandomState | int = None) -> Module[source]

This function returns a trained neural network. The following cases are considered:
  • If the neural network is initialized and trained (neural_network.fitted==True), then return input neural network without training it again.

  • If the neural network is initialized and not trained (neural_network.fitted==False), it will be fitted (neural_network.fitted will be set to True) using default parameters.

  • If the neural network is None, a new neural network is created using neural_network_class, and it will be fitted as described above.

Note that the given neural_network_class or neural_network object must provide both a fit() function and a fitted attribute. See clustpy.deep.neural_networks.feedforward_autoencoder.FeedforwardAutoencoder for an example.

Parameters:
  • trainloader (torch.utils.data.DataLoader) – dataloader used to train neural_network (default: None)

  • data (np.ndarray) – train data set. If data is passed then trainloader can remain empty (default: None)

  • n_epochs (int) – number of training epochs (default: 100)

  • batch_size (int) – size of the data batches (default: 128)

  • optimizer_params (dict) – parameters of the optimizer for the neural network training, includes the learning rate (default: {"lr": 1e-3})

  • optimizer_class (torch.optim.Optimizer) – optimizer for training (default: torch.optim.Adam)

  • device (torch.device) – The device on which to perform the computations. If device is None, it will be chosen automatically: if a GPU is available, the GPU with the highest amount of free memory will be chosen (default: None)

  • ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())

  • embedding_size (int) – dimension of the innermost layer of the neural network (default: 10)

  • neural_network (torch.nn.Module | tuple) – neural network object to be trained (optional). Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)

  • neural_network_class (torch.nn.Module) – The neural network class that should be used (default: FeedforwardAutoencoder)

  • neural_network_params (dict) – Parameters to be used when creating a new neural network using the neural_network_class (default: None)

  • neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)

  • random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

neural_network – The fitted neural network

Return type:

torch.nn.Module
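
The three cases listed for get_trained_network can be sketched in plain Python. This is a simplified illustration, not the library's code: ToyNetwork and get_trained_network_sketch are hypothetical names, the toy fit() does no real training, and the tuple form of neural_network is not handled.

```python
class ToyNetwork:
    """Hypothetical stand-in offering the required fit() function and fitted attribute."""

    def __init__(self):
        self.fitted = False
        self.n_fit_calls = 0

    def fit(self, data, n_epochs=100):
        # a real network would run its training loop here
        self.n_fit_calls += 1
        self.fitted = True
        return self


def get_trained_network_sketch(data, neural_network=None,
                               neural_network_class=ToyNetwork,
                               neural_network_params=None, n_epochs=100):
    if neural_network is None:
        # no network given: create a new one from the class
        params = neural_network_params or {}
        neural_network = neural_network_class(**params)
    if not neural_network.fitted:
        # network exists but is untrained: fit it
        neural_network.fit(data, n_epochs=n_epochs)
    # an already fitted network is returned unchanged
    return neural_network


data = [[0.0, 1.0], [1.0, 0.0]]
network = get_trained_network_sketch(data)                            # created and fitted
same = get_trained_network_sketch(data, neural_network=network)       # not trained again
print(same is network, network.n_fit_calls)  # True 1
```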

clustpy.deep.predict_batchwise(dataloader: DataLoader, neural_network: Module, cluster_module: Module) -> ndarray[source]

Utility function for predicting the cluster labels over the whole data set in a mini-batch fashion. It calls the predict_hard method of the cluster_module for each batch of data.

Parameters:
  • dataloader (torch.utils.data.DataLoader) – dataloader to be used

  • neural_network (torch.nn.Module) – the neural network that is used for the encoding (e.g. an autoencoder)

  • cluster_module (torch.nn.Module) – the cluster module that is used for the clustering (e.g. DEC). It must provide the predict_hard method

Returns:

predictions_numpy – The predictions of the cluster_module for the data set

Return type:

np.ndarray
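
The batch-wise prediction described above can be sketched in a few lines of PyTorch. The names predict_batchwise_sketch and NearestCenterModule are hypothetical; the sketch assumes batches of the form (indices, samples) and calls the network directly to embed a batch, which may differ from what clustpy does internally.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


class NearestCenterModule(torch.nn.Module):
    """Hypothetical cluster module: assigns each embedding to its closest center."""

    def __init__(self, centers):
        super().__init__()
        self.centers = torch.nn.Parameter(centers)

    def predict_hard(self, embedded):
        return torch.cdist(embedded, self.centers).argmin(dim=1)


def predict_batchwise_sketch(dataloader, neural_network, cluster_module):
    predictions = []
    with torch.no_grad():
        for batch in dataloader:
            # assumed batch layout: batch[0] holds indices, batch[1] the samples
            embedded = neural_network(batch[1])
            predictions.append(cluster_module.predict_hard(embedded))
    # concatenate the per-batch labels into one array for the whole data set
    return torch.cat(predictions).numpy()


X = torch.tensor([[0.0, 0.1], [5.0, 5.2], [0.2, 0.0], [4.9, 5.1]])
loader = DataLoader(TensorDataset(torch.arange(len(X)), X), batch_size=2, shuffle=False)
encoder = torch.nn.Identity()  # stand-in for a trained encoder
clusters = NearestCenterModule(torch.tensor([[0.0, 0.0], [5.0, 5.0]]))
labels = predict_batchwise_sketch(loader, encoder, clusters)
print(labels)  # [0 1 0 1]
```

The dataloader should not shuffle here, otherwise the returned labels no longer line up with the order of the data set.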

clustpy.deep.set_torch_seed(random_state: RandomState | int) -> None[source]

Set the random state for torch applications.

Parameters:

random_state (np.random.RandomState | int) – use a fixed random state or an integer to get a repeatable solution
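
A minimal sketch of how a seed could be derived from either input type and passed to torch.manual_seed is given below. The helper name set_torch_seed_sketch is hypothetical, and drawing an integer from the RandomState is an assumption, not necessarily what set_torch_seed does.

```python
import numpy as np
import torch


def set_torch_seed_sketch(random_state):
    if isinstance(random_state, np.random.RandomState):
        # assumption: draw an integer seed from the given RandomState
        seed = int(random_state.randint(0, 2**31 - 1))
    else:
        seed = int(random_state)
    torch.manual_seed(seed)


set_torch_seed_sketch(42)
first = torch.rand(3)
set_torch_seed_sketch(42)
second = torch.rand(3)
print(torch.equal(first, second))  # True
```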