clustpy.deep package
Subpackages
- clustpy.deep.neural_networks package
- Submodules
- clustpy.deep.neural_networks.convolutional_autoencoder module
- clustpy.deep.neural_networks.feedforward_autoencoder module
- clustpy.deep.neural_networks.neighbor_encoder module
- clustpy.deep.neural_networks.stacked_autoencoder module
- clustpy.deep.neural_networks.variational_autoencoder module
- VariationalAutoencoder
- VariationalAutoencoder.encoder
- VariationalAutoencoder.decoder
- VariationalAutoencoder.mean
- VariationalAutoencoder.log_variance
- VariationalAutoencoder.fitted
- VariationalAutoencoder.work_on_copy
- VariationalAutoencoder.encode()
- VariationalAutoencoder.forward()
- VariationalAutoencoder.loss()
- VariationalAutoencoder.transform()
- Module contents
- ConvolutionalAutoencoder
- FeedforwardAutoencoder
- NeighborEncoder
- StackedAutoencoder
- VariationalAutoencoder
- VariationalAutoencoder.encoder
- VariationalAutoencoder.decoder
- VariationalAutoencoder.mean
- VariationalAutoencoder.log_variance
- VariationalAutoencoder.fitted
- VariationalAutoencoder.work_on_copy
- VariationalAutoencoder.encode()
- VariationalAutoencoder.forward()
- VariationalAutoencoder.loss()
- VariationalAutoencoder.transform()
Submodules
clustpy.deep.aec module
@authors: Collin Leiber
- class clustpy.deep.aec.AEC(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: collections.abc.Callable | torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, neural_network: torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: sklearn.base.ClusterMixin = None, initial_clustering_params: dict = None, device: torch.device = None, random_state: numpy.random.mtrand.RandomState | int = None)
Bases: _AbstractDeepClusteringAlgo
The Auto-encoder Based Data Clustering (AEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the AEC loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining. If this is None, random labels will be used (default: None)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import AEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> aec = AEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> aec.fit(data)
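Several of the parameters listed above accept None and are then replaced by a documented default. The following minimal sketch mirrors that convention in plain Python. Note that `resolve_defaults` is a hypothetical helper written for illustration; it is not part of clustpy, and the library's internal handling may differ.

```python
from typing import Optional


def resolve_defaults(pretrain_optimizer_params: Optional[dict] = None,
                     clustering_optimizer_params: Optional[dict] = None,
                     initial_clustering_params: Optional[dict] = None) -> dict:
    """Hypothetical helper: replace None by the defaults documented above."""
    if pretrain_optimizer_params is None:
        pretrain_optimizer_params = {"lr": 1e-3}    # documented pretraining default
    if clustering_optimizer_params is None:
        clustering_optimizer_params = {"lr": 1e-4}  # documented clustering default
    if initial_clustering_params is None:
        initial_clustering_params = {}              # no extra arguments
    return {
        "pretrain_optimizer_params": pretrain_optimizer_params,
        "clustering_optimizer_params": clustering_optimizer_params,
        "initial_clustering_params": initial_clustering_params,
    }


# Only the clustering learning rate is overridden; the rest falls back to the defaults
resolved = resolve_defaults(clustering_optimizer_params={"lr": 5e-4})
print(resolved["pretrain_optimizer_params"])    # {'lr': 0.001}
print(resolved["clustering_optimizer_params"])  # {'lr': 0.0005}
```

Passing a dictionary replaces the default as a whole in this sketch, so further optimizer arguments would have to be given together with "lr".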
References
Song, Chunfeng, et al. “Auto-encoder based data clustering.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 18th Iberoamerican Congress, CIARP 2013, Havana, Cuba, November 20-23, 2013, Proceedings, Part I 18. Springer Berlin Heidelberg, 2013.
- fit(X: ndarray, y: ndarray = None) → AEC
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the AEC algorithm
- Return type:
AEC
- set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') → AEC
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
clustpy.deep.dcn module
@authors: Lukas Miklautz, Dominik Mautz
- class clustpy.deep.dcn.DCN(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: collections.abc.Callable | torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, neural_network: torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: torch.device = None, random_state: numpy.random.mtrand.RandomState | int = None)
Bases: _AbstractDeepClusteringAlgo
The Deep Clustering Network (DCN) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DCN loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dcn_labels_
The final DCN labels
- Type:
np.ndarray
- dcn_cluster_centers_
The final DCN cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DCN
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dcn = DCN(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dcn.fit(data)
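To illustrate how ssl_loss_weight and clustering_loss_weight interact, the sketch below combines a reconstruction loss with a k-means-style clustering loss in NumPy, following the general idea of Yang et al. It is an assumption-laden illustration: `dcn_style_loss` is a hypothetical function, and the exact reductions (mean versus sum, constant factors) used inside clustpy may differ.

```python
import numpy as np


def dcn_style_loss(x, x_rec, z, centers, ssl_loss_weight=1.0, clustering_loss_weight=0.1):
    """Weighted sum of a reconstruction loss and a k-means-style clustering loss."""
    # Self-supervised part: mean squared reconstruction error
    ssl_loss = np.mean((x - x_rec) ** 2)
    # Squared Euclidean distances from every embedded sample to every center
    sq_dists = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Hard assignment: each sample is pulled toward its closest center
    labels = sq_dists.argmin(axis=1)
    clustering_loss = sq_dists[np.arange(len(z)), labels].mean()
    total = ssl_loss_weight * ssl_loss + clustering_loss_weight * clustering_loss
    return total, labels


rng = np.random.RandomState(1)
x = rng.rand(6, 4)
x_rec = x + 0.1  # stand-in for the decoder output
z = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])  # stand-in for the embedding
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
loss, labels = dcn_style_loss(x, x_rec, z, centers)
print(labels)  # [0 0 0 1 1 1]
```

With the default weights, the reconstruction term dominates and the clustering term acts as a mild pull of the embedded samples toward their assigned centers.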
References
Yang, Bo, et al. “Towards k-means-friendly spaces: Simultaneous deep learning and clustering.” International Conference on Machine Learning. PMLR, 2017.
- fit(X: ndarray, y: ndarray = None) → DCN
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DCN algorithm
- Return type:
DCN
- set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') → DCN
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
clustpy.deep.ddc_n2d module
@authors: Collin Leiber
- class clustpy.deep.ddc_n2d.DDC(ratio: float = 0.1, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: collections.abc.Callable | torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, tsne_params: dict = None, device: torch.device = None, random_state: numpy.random.mtrand.RandomState | int = None)
Bases: _AbstractDeepClusteringAlgo
The Deep Density-based Image Clustering (DDC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, t-SNE is executed on the embedded data and a variant of the Density Peak Clustering algorithm is executed.
- Parameters:
ratio (float) – The ratio parameter, defining the cutoff distance d_c by calculating: average pairwise distance * ratio (default: 0.1)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
tsne_params (dict) – Parameters for the t-SNE execution. For example, perplexity can be changed by setting tsne_params to {“n_components”: 2, “perplexity”: 25}. Check out sklearn.manifold.TSNE for more information. If None, it will be set to {“n_components”: 2} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- n_clusters_
The final number of clusters
- Type:
int
- labels_
The final labels (obtained by a variant of Density Peak Clustering)
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- tsne_
The t-SNE object
- Type:
TSNE
- n_features_in_
the number of features used for the fitting
- Type:
int
- cluster_centers_
The final cluster centers defined as the mean of assigned samples within the AE embedding
- Type:
np.ndarray
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DDC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> ddc = DDC(pretrain_epochs=3)
>>> ddc.fit(data)
References
Ren, Yazhou, et al. “Deep density-based image clustering.” Knowledge-Based Systems 197 (2020): 105841.
- fit(X: ndarray, y: ndarray = None) → DDC
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DDC algorithm
- Return type:
DDC
- predict(X: ndarray) → ndarray
Predicts the labels of the input data. Note that this is only a rough estimate, as the manifold does not learn a function f() that maps the data into the final embedding. Therefore, each sample is assigned to the cluster whose mean of assigned samples is closest within the embedding of the AE.
- Parameters:
X (np.ndarray) – input data
- Returns:
predicted_labels – The predicted labels
- Return type:
np.ndarray
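The nearest-mean rule described for predict can be sketched in NumPy as follows. This is a minimal illustration that assumes the samples have already been embedded by the autoencoder and that labels run from 0 to k-1; `cluster_means` and `predict_nearest_mean` are hypothetical names, not clustpy functions.

```python
import numpy as np


def cluster_means(embedded, labels):
    """Mean of the embedded samples assigned to each cluster (one row per cluster)."""
    cluster_ids = np.unique(labels)  # sorted, so row i belongs to label i for labels 0..k-1
    return np.array([embedded[labels == c].mean(axis=0) for c in cluster_ids])


def predict_nearest_mean(embedded_new, centers):
    """Assign each embedded sample to the index of its closest cluster mean."""
    dists = np.linalg.norm(embedded_new[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


# Stand-in for the AE embedding of the training data and the labels found by fit
embedded_train = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
train_labels = np.array([0, 0, 1, 1])
centers = cluster_means(embedded_train, train_labels)

# Stand-in for the AE embedding of new samples
embedded_new = np.array([[1.0, 1.0], [9.0, 9.0]])
predicted = predict_nearest_mean(embedded_new, centers)
print(predicted)  # [0 1]
```

Because the clusters were found in the t-SNE space rather than in the AE embedding, this rule can disagree with the labels assigned during fitting, which is why the prediction is only an estimate.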
- class clustpy.deep.ddc_n2d.DDC_density_peak_clustering(ratio: float)
Bases: ClusterMixin, BaseEstimator
A variant of the Density Peak Algorithm as proposed in the DDC paper.
- Parameters:
ratio (float) – The ratio parameter, defining the cutoff distance d_c by calculating: average pairwise distance * ratio
- n_clusters_
The final number of clusters
- Type:
int
- labels_
The final labels
- Type:
np.ndarray
- n_features_in_
the number of features used for the fitting
- Type:
int
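The role of the ratio parameter can be made concrete with a small NumPy sketch that computes the cutoff distance d_c as the average pairwise distance multiplied by ratio. Two points are assumptions rather than documented behavior: the average is taken over distinct pairs only, and the local density shown afterwards uses the neighbor count of the classic Density Peak formulation, which the DDC variant may replace with a different kernel. All function names are hypothetical.

```python
import numpy as np


def pairwise_distances(X):
    """Full matrix of Euclidean distances between all rows of X."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2))


def cutoff_distance(X, ratio):
    """d_c = average pairwise distance * ratio."""
    dists = pairwise_distances(X)
    # Assumption: average over distinct pairs (i < j), excluding self-distances
    upper = np.triu_indices(len(X), k=1)
    return dists[upper].mean() * ratio


def local_density(X, d_c):
    """Number of other points closer than d_c (classic Density Peak definition)."""
    dists = pairwise_distances(X)
    return (dists < d_c).sum(axis=1) - 1  # subtract 1 to ignore the point itself


X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
d_c = cutoff_distance(X, ratio=0.9)
density = local_density(X, d_c)
print(density)  # [1 2 1]
```

A larger ratio widens the neighborhood and therefore smooths the density estimate, while a smaller ratio makes it more local.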
References
Ren, Yazhou, et al. “Deep density-based image clustering.” Knowledge-Based Systems 197 (2020): 105841.
- fit(X: ndarray, y: ndarray = None) → DDC_density_peak_clustering
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DDC variant of the Density Peak Clustering algorithm
- Return type:
DDC_density_peak_clustering
- class clustpy.deep.ddc_n2d.N2D(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: collections.abc.Callable | torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, manifold_class: sklearn.base.TransformerMixin = <class 'sklearn.manifold._t_sne.TSNE'>, manifold_params: dict = None, initial_clustering_params: dict = None, device: torch.device = None, random_state: numpy.random.mtrand.RandomState | int = None)
Bases: _AbstractDeepClusteringAlgo
The Not Too Deep (N2D) clustering algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, t-SNE/UMAP/ISOMAP is executed on the embedded data and the EM algorithm is executed.
- Parameters:
n_clusters (int) – number of clusters (default: 8)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
manifold_class (TransformerMixin) – the manifold technique class (default: TSNE)
manifold_params (dict) – Parameters for the manifold execution. For example, perplexity can be changed for TSNE by setting manifold_params to {“n_components”: 2, “perplexity”: 25}. Check out e.g. sklearn.manifold.TSNE for more information. If None, it will be set to {“n_components”: n_clusters} (default: None)
initial_clustering_params (dict) – parameters for the GMM clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_manifold_
The final cluster centers within the embedding of the manifold
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- manifold_
The manifold object
- Type:
TransformerMixin
- n_features_in_
the number of features used for the fitting
- Type:
int
- cluster_centers_
The final cluster centers defined as the mean of assigned samples within the AE embedding
- Type:
np.ndarray
References
McConville, Ryan, et al. “N2D: (Not too) deep clustering via clustering the local manifold of an autoencoded embedding.” 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021.
- fit(X: ndarray, y: ndarray = None) → N2D
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the N2D algorithm
- Return type:
N2D
- predict(X: ndarray) → ndarray
Predicts the labels of the input data. Note that this is only a rough estimate, as the manifold does not learn a function f() that maps the data into the final embedding. Therefore, each sample is assigned to the cluster whose mean of assigned samples is closest within the embedding of the AE.
- Parameters:
X (np.ndarray) – input data
- Returns:
predicted_labels – The predicted labels
- Return type:
np.ndarray
clustpy.deep.dec module
@authors: Lukas Miklautz, Dominik Mautz, Collin Leiber
- class clustpy.deep.dec.DEC(n_clusters: int = 8, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: collections.abc.Callable | torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: torch.device = None, random_state: numpy.random.mtrand.RandomState | int = None)
Bases: _AbstractDeepClusteringAlgo
The Deep Embedded Clustering (DEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DEC loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
alpha (float) – alpha value for the prediction (default: 1.0)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dec_labels_
The final DEC labels
- Type:
np.ndarray
- dec_cluster_centers_
The final DEC cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dec = DEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dec.fit(data)
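The alpha parameter can be understood through the soft cluster assignments defined in the DEC paper by Xie et al., where a Student's t kernel with alpha degrees of freedom measures the similarity between embedded samples and cluster centers. The NumPy sketch below follows the formulas of the paper; the function names are hypothetical and the code is not taken from clustpy, whose implementation may differ in detail.

```python
import numpy as np


def soft_assignments(z, centers, alpha=1.0):
    """Student's t kernel between embedded samples and centers, normalized per sample."""
    sq_dists = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq_dists / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """Sharpened targets: square q, divide by the soft cluster sizes, renormalize."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)


def kl_divergence(p, q):
    """KL(P || Q), the clustering loss minimized in the DEC paper."""
    return float(np.sum(p * np.log(p / q)))


z = np.array([[0.0, 0.0], [2.0, 0.0]])        # stand-in for embedded samples
centers = np.array([[0.0, 0.0], [2.0, 0.0]])  # stand-in for cluster centers
q = soft_assignments(z, centers, alpha=1.0)
p = target_distribution(q)
clustering_loss = kl_divergence(p, q)
print(q.argmax(axis=1))  # [0 1]
```

The target distribution is more confident than the soft assignments (25/26 versus 5/6 for the closest center in this toy case), so minimizing the divergence pushes samples closer to their most likely center.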
References
Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis.” International conference on machine learning. 2016.
- fit(X: ndarray, y: ndarray = None) → DEC
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DEC algorithm
- Return type:
DEC
- set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') → DEC
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.dec.IDEC(n_clusters: int = 8, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases:
DEC
The Improved Deep Embedded Clustering (IDEC) algorithm. It is identical to the DEC algorithm but also uses the self-supervised learning loss during the clustering optimization. Further, clustering_loss_weight is set to 0.1 instead of 1 when using the default settings.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
alpha (float) – alpha value for the prediction (default: 1.0)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 0.1)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dec_labels_
The final DEC labels
- Type:
np.ndarray
- dec_cluster_centers_
The final DEC cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import IDEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> idec = IDEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> idec.fit(data)
References
Guo, Xifeng, et al. “Improved deep embedded clustering with local structure preservation.” IJCAI. 2017.
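To make the role of ssl_loss_weight and clustering_loss_weight more concrete, the following sketch combines a reconstruction loss and a KL-divergence clustering loss into one weighted objective. It assumes that the two losses are simply added after weighting; the exact reduction used inside clustpy (sum or mean) may differ, and the helper names and toy values are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    # KL(P || Q) summed over samples and clusters; assumes strictly positive entries
    return float(np.sum(p * np.log(p / q)))

def combined_loss(ssl_loss, clustering_loss, ssl_loss_weight=1.0, clustering_loss_weight=0.1):
    # Weighted sum of both terms, using the IDEC default weights
    return ssl_loss_weight * ssl_loss + clustering_loss_weight * clustering_loss

# Toy batch and its reconstruction (illustrative values only)
x = np.array([[1.0, 2.0], [3.0, 4.0]])
x_rec = np.array([[1.0, 2.5], [2.5, 4.0]])
ssl_loss = float(np.mean((x - x_rec) ** 2))  # mean squared error

# Toy soft assignments q and target distribution p
q = np.array([[0.8, 0.2], [0.4, 0.6]])
p = np.array([[0.9, 0.1], [0.3, 0.7]])
clustering_loss = kl_divergence(p, q)

total_loss = combined_loss(ssl_loss, clustering_loss)
```

With clustering_loss_weight=0.1 the reconstruction term dominates, which matches the stated intent of keeping the self-supervised loss active during clustering.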
- set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') IDEC
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
clustpy.deep.deepect module
@authors: Collin Leiber, Julian Schilcher
- class clustpy.deep.deepect.DeepECT(max_n_leaf_nodes: int = 20, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, grow_interval: int = 2, pruning_threshold: float = 0.1, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases:
_AbstractDeepClusteringAlgo
The Deep Embedded Cluster Tree (DeepECT) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, a cluster tree will be grown and the network will be optimized using the DeepECT loss function.
- Parameters:
max_n_leaf_nodes (int) – Maximum number of leaf nodes in the cluster tree (default: 20)
batch_size (int) – Size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – Number of epochs for the actual clustering procedure (default: 150)
grow_interval (int) – Number of epochs after which the tree is grown (default: 2)
pruning_threshold (float) – The threshold for pruning the tree (default: 0.1)
optimizer_class (torch.optim.Optimizer) – The optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – Size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – Use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- tree_
The prediction cluster tree after training
- Type:
PredictionClusterTree
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
References
Mautz, Dominik, Claudia Plant, and Christian Böhm. “Deep embedded cluster tree.” 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 2019.
- fit(X: ndarray, y: ndarray = None) DeepECT[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – This instance of the DeepECT algorithm
- Return type:
DeepECT
- flat_clustering(n_leaf_nodes_to_keep: int) ndarray[source]
Transform the predicted labels into a flat clustering result by keeping only n_leaf_nodes_to_keep leaf nodes in the tree. Returns labels as if the clustering procedure had stopped at the specified number of leaf nodes. Note that each leaf node corresponds to a cluster.
- Parameters:
n_leaf_nodes_to_keep (int) – The number of leaf nodes to keep in the cluster tree
- Returns:
labels_pruned – The new cluster labels
- Return type:
np.ndarray
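The idea behind flat_clustering can be illustrated with a toy tree in plain Python. The sketch assumes that stopping at a given number of leaf nodes is equivalent to undoing the most recent splits first; how clustpy stores and prunes its cluster tree is not shown here, and flatten_labels as well as the split_history layout are hypothetical.

```python
def flatten_labels(leaf_labels, split_history, n_leaf_nodes_to_keep):
    # split_history holds (parent, left_child, right_child) in the order the splits happened
    n_leaves = len(split_history) + 1
    parent_of = {}
    # Undo the most recent splits until the requested number of leaves remains
    for parent, left, right in reversed(split_history):
        if n_leaves <= n_leaf_nodes_to_keep:
            break
        parent_of[left] = parent
        parent_of[right] = parent
        n_leaves -= 1

    def resolve(node):
        # Walk upwards until a node is reached that is still a leaf
        while node in parent_of:
            node = parent_of[node]
        return node

    merged = [resolve(label) for label in leaf_labels]
    # Relabel to consecutive ids in order of first appearance
    ids = {}
    return [ids.setdefault(node, len(ids)) for node in merged]

# Root 0 is split into 1 and 2, then 2 into 3 and 4, then 1 into 5 and 6
split_history = [(0, 1, 2), (2, 3, 4), (1, 5, 6)]
leaf_labels = [5, 6, 3, 4, 3]

two_clusters = flatten_labels(leaf_labels, split_history, 2)
three_clusters = flatten_labels(leaf_labels, split_history, 3)
```

Requesting as many leaves as the tree already has leaves every sample in its own leaf and only renumbers the labels.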
clustpy.deep.den module
@authors: Collin Leiber
- class clustpy.deep.den.DEN(n_clusters: int = 8, group_size: int | list | None = 2, n_neighbors: int = 5, weight_locality_constraint: float = 0.5, weight_sparsity_constraint: float = 1.0, heat_kernel_t_parameter: float = 1.0, group_lasso_lambda_parameter: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int | None = None, custom_dataloaders: tuple = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases:
_AbstractDeepClusteringAlgo
The Deep Embedding Network (DEN) algorithm. It trains a neural network by optimizing a loss function consisting of three components. These are (1) the standard loss function of the neural network (e.g. reconstruction loss for autoencoders), (2) the locality-preserving constraint and (3) the group sparsity constraint. Finally, k-Means is executed in the resulting embedding.
- Parameters:
n_clusters (int) – number of clusters (default: 8)
group_size (int | list) – the number of features in each group. Can also be a list, specifying the size of each group separately. Can be None if embedding_size is specified (default: 2)
n_neighbors (int) – the number of nearest-neighbors (including itself) for the locality-preserving constraint. Nearest-neighbors will be calculated by using the Euclidean distance. If another distance should be used to define the nearest-neighbors, the neighbors can be included in the custom_dataloader as additional_inputs. In this case, it is expected that the trainloader is composed of: (sample_ids, original_samples, 1st-NNs, 2nd-NNs, …, (n_neighbors-1)-NNs) (default: 5)
weight_locality_constraint (float) – weight alpha for the locality-preserving constraint (default: 0.5)
weight_sparsity_constraint (float) – weight beta for the group sparsity constraint (default: 1.)
heat_kernel_t_parameter (float) – the t parameter for the heat kernel included in the locality-preserving constraint (default: 1.)
group_lasso_lambda_parameter (float) – the lambda parameter for the group lasso included in the group sparsity constraint (default: 1.)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: None)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by KMeans)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by KMeans)
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DEN
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> den = DEN(n_clusters=3, pretrain_epochs=3)
>>> den.fit(data)
References
Huang, Peihao, et al. “Deep embedding network for clustering.” 2014 22nd International conference on pattern recognition. IEEE, 2014.
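The two DEN constraints can be sketched in NumPy as follows. The formulas follow a reading of the cited paper: a heat-kernel-weighted distance between embedded neighbors for the locality-preserving constraint, and a group lasso weighted by lambda times the square root of the group size for the group sparsity constraint. Normalization and batching in clustpy may differ, and the function names and toy arrays are assumptions.

```python
import numpy as np

def locality_loss(x, z, neighbor_ids, t=1.0):
    # x: original samples (n, d); z: embedded samples (n, m)
    # neighbor_ids: for each sample, the indices of its nearest neighbors
    loss = 0.0
    for i, neighbors in enumerate(neighbor_ids):
        for j in neighbors:
            # Heat kernel weight computed in the original space
            heat = np.exp(-np.sum((x[i] - x[j]) ** 2) / t)
            loss += heat * np.sum((z[i] - z[j]) ** 2)
    return float(loss)

def group_sparsity_loss(z, group_sizes, lam=1.0):
    loss = 0.0
    start = 0
    for size in group_sizes:
        group = z[:, start:start + size]
        # Group lasso: L2 norm per sample and group, weighted by lam * sqrt(group size)
        loss += lam * np.sqrt(size) * np.sum(np.linalg.norm(group, axis=1))
        start += size
    return float(loss)

# Toy data: three samples, embedding of size 4 split into two groups of size 2
x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z = np.array([[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
neighbor_ids = [[1], [0], [0]]

locality = locality_loss(x, z, neighbor_ids, t=1.0)
sparsity = group_sparsity_loss(z, group_sizes=[2, 2], lam=1.0)
# Weighted with the default weight_locality_constraint and weight_sparsity_constraint
constraint_loss = 0.5 * locality + 1.0 * sparsity
```

In this sketch the constraint terms would be added to the self-supervised loss of the network, for example the reconstruction loss of an autoencoder.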
- fit(X: ndarray, y: ndarray = None) DEN[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DEN algorithm
- Return type:
DEN
- set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') DEN
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
clustpy.deep.dipdeck module
@authors: Collin Leiber
- class clustpy.deep.dipdeck.DipDECK(n_clusters_init: int = 35, dip_merge_threshold: float = 0.9, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, max_n_clusters: int = inf, min_n_clusters: int = 1, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 5, max_cluster_size_diff_factor: float = 2, pval_strategy: str = 'table', n_boots: int = 1000, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None, debug: bool = False)[source]
Bases:
_AbstractDeepClusteringAlgo
The Deep Embedded Clustering with k-Estimation (DipDECK) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters using an overestimated number of clusters. Last, the network will be optimized using the DipDECK loss function. If any Dip-value exceeds the dip_merge_threshold, the corresponding clusters will be merged.
- Parameters:
n_clusters_init (int) – initial number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 35)
dip_merge_threshold (float) – threshold regarding the Dip-p-value that defines if two clusters should be merged. Must be between 0 and 1 (default: 0.9)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
max_n_clusters (int) – maximum number of clusters. Must be larger than min_n_clusters. If the result has more clusters, a merge will be forced (default: np.inf)
min_n_clusters (int) – minimum number of clusters. Must be larger than 0, smaller than max_n_clusters and smaller than n_clusters_init. When this number of clusters is reached, all further merge processes will be hindered (default: 1)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure. Will reset after each merge (default: 50)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 5)
max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for the Dip calculation (default: 2)
pval_strategy (str) – Defines which strategy to use to obtain dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)
n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – If true, additional information will be printed to the console (default: False)
- labels_
The final labels
- Type:
np.ndarray
- n_clusters_
The final number of clusters
- Type:
int
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipDECK
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipdeck = DipDECK(pretrain_epochs=3, clustering_epochs=3)
>>> dipdeck.fit(data)
References
Leiber, Collin, et al. “Dip-based deep embedded clustering with k-estimation.” Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021.
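The constraints stated in the parameter descriptions above, and the effect of max_cluster_size_diff_factor, can be checked with a few lines of plain Python. This is only a sketch of the documented rules; check_settings and samples_used_for_dip are hypothetical helpers and not part of clustpy, and rounding the capped size down to an integer is an assumption.

```python
import math

def check_settings(n_clusters_init, dip_merge_threshold=0.9, min_n_clusters=1, max_n_clusters=math.inf):
    # Constraints as stated in the parameter descriptions
    if not 0 <= dip_merge_threshold <= 1:
        raise ValueError("dip_merge_threshold must be between 0 and 1")
    if min_n_clusters <= 0:
        raise ValueError("min_n_clusters must be larger than 0")
    if min_n_clusters >= max_n_clusters:
        raise ValueError("min_n_clusters must be smaller than max_n_clusters")
    if min_n_clusters >= n_clusters_init:
        raise ValueError("min_n_clusters must be smaller than n_clusters_init")

def samples_used_for_dip(size_a, size_b, max_cluster_size_diff_factor=2):
    # The larger cluster contributes at most factor * (size of smaller cluster) samples
    smaller, larger = sorted((size_a, size_b))
    cap = int(max_cluster_size_diff_factor * smaller)
    return smaller, min(larger, cap)

check_settings(n_clusters_init=35)  # the default settings pass all checks

balanced = samples_used_for_dip(100, 150)    # size difference below the factor
unbalanced = samples_used_for_dip(100, 450)  # larger cluster is capped
```

In the unbalanced case only 200 of the 450 samples of the larger cluster would enter the Dip calculation under these assumptions.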
- fit(X: ndarray, y: ndarray = None) DipDECK[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DipDECK algorithm
- Return type:
DipDECK
clustpy.deep.dipencoder module
@authors: Collin Leiber
- class clustpy.deep.dipencoder.DipEncoder(n_clusters: int = 8, batch_size: int = None, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, max_cluster_size_diff_factor: float = 3, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases:
_AbstractDeepClusteringAlgo
The DipEncoder. Can be used either as a clustering procedure if no ground truth labels are given or as a supervised dimensionality reduction technique. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DipEncoder loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
batch_size (int) – size of the data batches for the actual training of the DipEncoder. Should be larger the more clusters we have. If it is None, it will be set to (25 x n_clusters) (default: None)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used (default: 3)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss. If None, it will be equal to 1/(4L), where L is the reconstruction loss of the first batch of an untrained neural network (default: None)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels
- Type:
np.ndarray
- projection_axes_
The final projection axes between the clusters
- Type:
np.ndarray
- index_dict_
A dictionary to match the indices of two clusters to a projection axis
- Type:
dict
- projection_thresholds_
A list containing the thresholds for each projection axis and a tuple indicating which cluster is left and right of the threshold
- Type:
list
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipEncoder
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipencoder = DipEncoder(3, pretrain_epochs=3, clustering_epochs=3)
>>> dipencoder.fit(data)
References
Leiber, Collin and Bauer, Lena G. M. and Neumayr, Michael and Plant, Claudia and Böhm, Christian “The DipEncoder: Enforcing Multimodality in Autoencoders.” Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2022.
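The documented fallbacks for batch_size and ssl_loss_weight, and the pairing of clusters with projection axes, can be written down in a few lines. The sketch assumes one projection axis per unordered pair of clusters and a tuple-keyed dictionary; the actual layout of index_dict_ in clustpy may differ, and all helper names are hypothetical.

```python
from itertools import combinations

def default_batch_size(n_clusters):
    # Documented fallback when batch_size is None: 25 x n_clusters
    return 25 * n_clusters

def default_ssl_loss_weight(first_batch_loss):
    # Documented fallback when ssl_loss_weight is None: 1 / (4L)
    return 1.0 / (4.0 * first_batch_loss)

def build_index_dict(n_clusters):
    # Assumed layout: one projection axis id per unordered pair of cluster ids
    pairs = combinations(range(n_clusters), 2)
    return {pair: axis_id for axis_id, pair in enumerate(pairs)}

batch_size = default_batch_size(3)
ssl_loss_weight = default_ssl_loss_weight(0.5)
index_dict = build_index_dict(3)
```

Under this assumption the number of projection axes grows quadratically with the number of clusters, which is consistent with the advice to use larger batches when there are more clusters.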
- fit(X: ndarray, y: ndarray = None) DipEncoder[source]
Initiate the actual clustering/dimensionality reduction process on the input data set. If no ground truth labels are given, the resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – The given (training) data set
y (np.ndarray) – The ground truth labels. If None, the DipEncoder will be used for clustering (default: None)
- Returns:
self – This instance of the DipEncoder
- Return type:
DipEncoder
- plot(X: ndarray, edge_width: float = 0.2, show_legend: bool = True) None[source]
Plot the current state of the DipEncoder. First the data set will be encoded using the neural network, afterwards the plot will be created. Uses the plot_scatter_matrix as a basis and adds projection axes in red.
- Parameters:
X (np.ndarray) – The data set
edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots
show_legend (bool) – Specifies whether a legend should be added to the plot
- predict(X: ndarray) ndarray[source]
Predicts the labels of the input data.
- Parameters:
X (np.ndarray) – input data
- Returns:
predicted_labels – The predicted labels
- Return type:
np.ndarray
- set_predict_request() DipEncoder
No-op.
Calling this method has no effect.
- Returns:
self – The updated object.
- Return type:
object
- clustpy.deep.dipencoder.plot_dipencoder_embedding(X_embed: ndarray, n_clusters: int, labels: ndarray, projection_axes: ndarray, index_dict: dict, edge_width: float = 0.1, show_legend: bool = False, show_plot: bool = True) None[source]
Plot the current state of the DipEncoder. Uses the plot_scatter_matrix as a basis and adds projection axes in red.
- Parameters:
X_embed (np.ndarray) – The embedded data set
n_clusters (int) – Number of clusters
labels (np.ndarray) – The cluster labels
projection_axes (np.ndarray) – The projection axes between the clusters
index_dict (dict) – A dictionary to match the indices of two clusters to a projection axis
edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots
show_legend (bool) – Specifies whether a legend should be added to the plot
show_plot (bool) – Specifies whether the plot should be plotted, i.e. if plt.show() should be executed (default: True)
clustpy.deep.dkm module
@authors: Collin Leiber
- class clustpy.deep.dkm.DKM(n_clusters: int = 8, alphas: tuple = 1000, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep k-Means (DKM) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DKM loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
alphas (tuple) – tuple of different alpha values used for the prediction. Small values close to 0 are equivalent to homogeneous assignments to all clusters. Large values simulate a clear assignment as with KMeans. If None, the default calculation of the paper will be used. This is equal to \alpha_{i+1}=2^{1/\log(i)^2}*\alpha_i with \alpha_1=0.1 and maximum i=40. alphas can also be a tuple with (None, \alpha_1, maximum i) (default: (1000))
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for each alpha value for the actual clustering procedure. The total number of clustering epochs therefore corresponds to: len(alphas)*clustering_epochs (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dkm_labels_
The final DKM labels
- Type:
np.ndarray
- dkm_cluster_centers_
The final DKM cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DKM
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dkm = DKM(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dkm.fit(data)
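The default alpha schedule described for alphas, and the total epoch count described for clustering_epochs, can be sketched in plain Python. This is an illustration only. default_alpha_schedule is a hypothetical helper and not part of clustpy. It assumes a natural logarithm whose argument is shifted by one (log(i + 1)) so that the first step avoids log(1) = 0, and it assumes that "maximum i=40" means 40 values in total.

```python
import math


def default_alpha_schedule(alpha_1=0.1, max_i=40):
    """Hypothetical helper: alpha_{i+1} = 2^(1 / log(i)^2) * alpha_i.

    Assumption: log is the natural logarithm, applied to i + 1 so that
    the first step does not divide by log(1) = 0.
    """
    alphas = [alpha_1]
    for i in range(1, max_i):
        factor = 2 ** (1 / math.log(i + 1) ** 2)
        alphas.append(factor * alphas[-1])
    return alphas


alphas = default_alpha_schedule()
clustering_epochs = 150  # documented default, applied once per alpha value
total_clustering_epochs = len(alphas) * clustering_epochs
print(len(alphas), total_clustering_epochs)
```

Under these assumptions the schedule grows monotonically, so later alpha values push the soft assignments toward hard, k-means-like assignments.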
References
Fard, Maziar Moradi, Thibaut Thonet, and Eric Gaussier. “Deep k-means: Jointly clustering with k-means and learning representations.” Pattern Recognition Letters 138 (2020): 185-192.
- fit(X: ndarray, y: ndarray = None) DKM[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DKM algorithm
- Return type:
DKM
- set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') DKM
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
clustpy.deep.enrc module
@authors: Lukas Miklautz
- class clustpy.deep.enrc.ACeDeC(n_clusters: int, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 20, init: str = 'acedec', device: ~torch.device = None, scheduler: <module 'torch.optim.lr_scheduler' from '/home/docs/checkouts/readthedocs.org/user_builds/clustpy/envs/latest/lib/python3.12/site-packages/torch/optim/lr_scheduler.py'> = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]
Bases: ENRC
Autoencoder Centroid-based Deep Cluster (ACeDeC) can be seen as a special case of ENRC where we have one cluster space and one shared space with a single cluster.
- Parameters:
n_clusters (int) – number of clusters
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
P (list) – list containing projections for clusters in clustered space and cluster in shared space (optional) (default: None)
input_centers (list) – list containing the cluster centers for clusters in clustered space and cluster in shared space (optional) (default: None)
batch_size (int) – size of the data batches (default: 128)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)
tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1-tolerance_threshold) for all clusterings, the training will stop before max_epochs is reached. If set high, the training will stop earlier than max_epochs; if set to 0 or None, the training will continue until the labels no longer change (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)
init (str) – choose which initialization strategy should be used. Has to be one of ‘acedec’, ‘subkmeans’, ‘random’ or ‘sgd’ (default: ‘acedec’)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10,000)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy. (default: True)
debug (bool) – if True additional information during the training will be printed (default: False)
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
- Raises:
ValueError – if init is not one of ‘acedec’, ‘subkmeans’, ‘random’, ‘auto’ or ‘sgd’.
References
Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Böhm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832
- fit(X: ndarray, y: ndarray = None) ACeDeC[source]
Cluster the input dataset with the ACeDeC algorithm. Saves the labels, centers, V, m, Betas, and P in the ACeDeC object. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – input data
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – returns the ACeDeC object
- Return type:
ACeDeC
- predict(X: ndarray, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]
Predicts the labels of the input data.
- Parameters:
X (np.ndarray) – input data
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)
dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)
- Returns:
predicted_labels – The predicted labels
- Return type:
np.ndarray
- set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ACeDeC
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.
use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.enrc.ENRC(n_clusters: list, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 20, init: str = 'nrkmeans', device: ~torch.device = None, scheduler: <module 'torch.optim.lr_scheduler' from '/home/docs/checkouts/readthedocs.org/user_builds/clustpy/envs/latest/lib/python3.12/site-packages/torch/optim/lr_scheduler.py'> = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]
Bases: _AbstractDeepClusteringAlgo
The Embedded Non-Redundant Clustering (ENRC) algorithm.
- Parameters:
n_clusters (list) – list containing number of clusters for each clustering
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
P (list) – list containing projections for each clustering (optional) (default: None)
input_centers (list) – list containing the cluster centers for each clustering (optional) (default: None)
batch_size (int) – size of the data batches (default: 128)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)
tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1-tolerance_threshold) for all clusterings, the training will stop before max_epochs is reached. If set high, the training will stop earlier than max_epochs; if set to 0 or None, the training will continue until the labels no longer change (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)
init (str) – choose which initialization strategy should be used. Has to be one of ‘nrkmeans’, ‘random’ or ‘sgd’ (default: ‘nrkmeans’)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10,000)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy. (default: True)
debug (bool) – if True additional information during the training will be printed (default: False)
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
- Raises:
ValueError – if init is not one of ‘nrkmeans’, ‘random’, ‘auto’ or ‘sgd’.
References
Miklautz, Lukas & Dominik Mautz et al. “Deep embedded non-redundant clustering.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.
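The stopping rule described for tolerance_threshold can be sketched in plain Python. This is a rough illustration and not the library's code. nmi and should_stop are hypothetical helpers, the NMI uses arithmetic-mean normalization (an assumption), and None is treated like a threshold of 0.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a labeling."""
    n = len(labels)
    return sum((c / n) * math.log(n / c) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Normalized mutual information (arithmetic-mean normalization assumed)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    mean_entropy = (entropy(labels_a) + entropy(labels_b)) / 2
    return 1.0 if mean_entropy == 0 else mutual_info / mean_entropy


def should_stop(old_labels, new_labels, tolerance_threshold):
    """old_labels and new_labels hold one label list per clustering."""
    if tolerance_threshold is None:
        tolerance_threshold = 0.0  # assumption: None behaves like 0
    return all(
        nmi(old, new) >= 1 - tolerance_threshold
        for old, new in zip(old_labels, new_labels)
    )


unchanged = [[0, 0, 1, 1, 2, 2]]
print(should_stop(unchanged, unchanged, 0.05))
print(should_stop([[0, 0, 0, 1, 1, 1]], [[0, 1, 0, 1, 0, 1]], 0.05))
```

A larger threshold accepts more label changes between epochs and therefore ends the training earlier.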
- fit(X: ndarray, y: ndarray = None) ENRC[source]
Cluster the input dataset with the ENRC algorithm. Saves the labels, centers, V, m, Betas, and P in the ENRC object. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – input data
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – returns the ENRC object
- Return type:
ENRC
- plot_subspace(X: ndarray, subspace_index: int = 0, labels: ndarray = None, plot_centers: bool = False, gt: ndarray = None, equal_axis: bool = False) None[source]
Plot the specified subspace as a scatter matrix plot.
- Parameters:
X (np.ndarray) – input data
subspace_index (int) – index of the subspace (default: 0)
labels (np.ndarray) – the labels to use for the plot. If None, the labels found by Nr-Kmeans are used (default: None)
plot_centers (bool) – plot centers if True (default: False)
gt (np.ndarray) – ground truth labels (default: None)
equal_axis (bool) – equalize axis if True (default: False)
- Return type:
scatter matrix plot of the input data
- predict(X: ndarray = None, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]
Predicts the labels for each clustering of X in a mini-batch manner.
- Parameters:
X (np.ndarray) – input data
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)
dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)
- Returns:
predicted_labels – n x c matrix, where n is the number of data points in X and c is the number of clusterings.
- Return type:
np.ndarray
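The difference between the hard selection (use_P=True) and the soft beta weights, and the n x c shape of the result, can be illustrated with a small pure-Python sketch. predict_sketch is a hypothetical function and not the library's implementation. It assumes the samples are already embedded and rotated with V, and that labels are assigned to the nearest center under a per-dimension weighted squared distance.

```python
def predict_sketch(rotated, centers, P, betas=None):
    """Return an n x c list of labels (n samples, c clusterings).

    rotated: samples that are already embedded and rotated (assumption)
    centers: centers[c] holds the rotated cluster centers of clustering c
    P:       P[c] lists the dimensions assigned to clustering c
    betas:   optional c x d soft weights; if None, P selects dimensions hard
    """
    labels = []
    for x in rotated:
        row = []
        for c, cluster_centers in enumerate(centers):
            if betas is None:
                weights = [1.0 if d in P[c] else 0.0 for d in range(len(x))]
            else:
                weights = betas[c]
            dists = [
                sum(w * (xi - mi) ** 2 for w, xi, mi in zip(weights, x, mu))
                for mu in cluster_centers
            ]
            row.append(dists.index(min(dists)))
        labels.append(row)
    return labels


rotated = [[1.0, 9.0], [9.0, 1.0]]
centers = [
    [[0.0, 0.0], [10.0, 0.0]],  # clustering 0 separates along dimension 0
    [[0.0, 0.0], [0.0, 10.0]],  # clustering 1 separates along dimension 1
]
P = [[0], [1]]
print(predict_sketch(rotated, centers, P))
print(predict_sketch(rotated, centers, P, betas=[[0.9, 0.1], [0.1, 0.9]]))
```

Each row of the result belongs to one sample and each column to one clustering, which matches the n x c layout described above.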
- reconstruct_subspace_centroids(subspace_index: int = 0) ndarray[source]
Reconstructs the centroids in the specified subspace.
- Parameters:
subspace_index (int) – index of the subspace (default: 0)
- Returns:
centers_rec – The reconstructed centers
- Return type:
np.ndarray
- set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ENRC
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.
use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- transform_full_space(X: ndarray, embedded=False) ndarray[source]
Embeds the input dataset with the neural network and the matrix V from the ENRC object.
- Parameters:
X (np.ndarray) – input data
embedded (bool) – if True, then X is assumed to be already embedded (default: False)
- Returns:
rotated – The transformed data
- Return type:
np.ndarray
- transform_subspace(X: ndarray, subspace_index: int = 0, embedded: bool = False) ndarray[source]
Embeds the input dataset with the neural network and with the matrix V projected onto the specified cluster space.
- Parameters:
X (np.ndarray) – input data
subspace_index (int) – index of the subspace (default: 0)
embedded (bool) – if True, then X is assumed to be already embedded (default: False)
- Returns:
subspace – The transformed subspace
- Return type:
np.ndarray
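The relation between transform_full_space and transform_subspace can be sketched without the neural network. The sketch starts from data that is already embedded (as with embedded=True) and assumes that the full-space transform is a multiplication with V and that the subspace transform keeps only the dimensions listed in P[subspace_index]. rotate and transform_subspace_sketch are hypothetical names, not clustpy functions.

```python
def rotate(X_embedded, V):
    """Multiply each embedded sample (a row of X_embedded) with the matrix V."""
    n_in, n_out = len(V), len(V[0])
    return [
        [sum(x[k] * V[k][j] for k in range(n_in)) for j in range(n_out)]
        for x in X_embedded
    ]


def transform_subspace_sketch(X_embedded, V, P, subspace_index=0):
    """Rotate with V, then keep the dimensions assigned to one subspace."""
    rotated = rotate(X_embedded, V)
    return [[row[d] for d in P[subspace_index]] for row in rotated]


X_embedded = [[1, 2, 3], [4, 5, 6]]
V = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]  # a permutation matrix is orthogonal
P = [[0, 1], [2]]                      # dimensions per subspace
print(transform_subspace_sketch(X_embedded, V, P, 0))
print(transform_subspace_sketch(X_embedded, V, P, 1))
```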
- clustpy.deep.enrc.acedec_init(data: ~numpy.ndarray, n_clusters: list, optimizer_params: dict, batch_size: int = 128, optimizer_class: ~torch.optim.optimizer.Optimizer = None, rounds: int = None, epochs: int = 10, random_state: ~numpy.random.mtrand.RandomState = None, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, device: ~torch.device = device(type='cpu'), debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy based on optimizing ACeDeC’s parameters V and beta in isolation from the neural network using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than the nrkmeans_init and only constrains V using the reconstruction error (mean_squared_error), which can be more flexible than the orthogonality constraint of NrKmeans. A problem of the sgd_init strategy is that it can be less stable for small data sets.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
optimizer_params (dict) – parameters of the optimizer used to optimize V and beta, includes the learning rate
batch_size (int) – size of the data batches (default: 128)
optimizer_params – parameters of the optimizer for the actual clustering procedure, includes the learning rate
optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used (default: None)
rounds (int) – not used here (default: None)
epochs (int) – number of epochs. It is automatically set so that the training runs for close to 20,000 mini-batch iterations, as in the ACeDeC paper. If this determined value is smaller than the passed epochs, then the passed epochs is used (default: 10)
random_state (np.random.RandomState) – random state for reproducible results (default: None)
input_centers (list) – list of np.ndarray containing the initial cluster centers, if they should be set manually (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
device (torch.device) – device on which to train (default: torch.device(‘cpu’))
debug (bool) – if True then the cost of each round will be printed (default: True)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
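The way the epochs parameter interacts with the target of roughly 20,000 mini-batch iterations can be sketched as follows. effective_epochs is a hypothetical helper. The rounding, and the count of mini-batches per epoch as ceil(n_samples / batch_size), are assumptions made for this illustration and may differ from the actual implementation.

```python
import math


def effective_epochs(n_samples, batch_size=128, epochs=10, target_iterations=20000):
    """Pick the number of epochs so that roughly target_iterations batches are seen.

    The passed epochs value acts as a lower bound (as described above).
    """
    batches_per_epoch = math.ceil(n_samples / batch_size)
    auto_epochs = round(target_iterations / batches_per_epoch)
    return max(auto_epochs, epochs)


print(effective_epochs(10_000))     # small data set: many epochs are needed
print(effective_epochs(1_000_000))  # large data set: falls back to epochs
```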
- clustpy.deep.enrc.available_init_strategies() list[source]
Returns a list of strings of available initialization strategies for ENRC and ACeDeC. At the moment, the following strategies are supported: nrkmeans, random, sgd, auto.
- clustpy.deep.enrc.beta_weights_init(P: list, n_dims: int, high_value: float = 0.9) Tensor[source]
Initializes parameters of the softmax such that betas will be set to high_value in dimensions which form a cluster subspace according to P and set to (1 - high_value)/(len(P) - 1) for the other clusterings.
- Parameters:
P (list) – list containing projections for each subspace
n_dims (int) – dimensionality of the embedded data
high_value (float) – value that should be initially used to indicate strength of assignment of a specific dimension to the clustering (default: 0.9)
- Returns:
beta_weights – initialized weights that are input in the softmax to get the betas.
- Return type:
torch.Tensor
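The initialization described for beta_weights_init can be illustrated in plain Python. beta_weights_sketch is a hypothetical stand-in that returns nested lists instead of a torch.Tensor. It assumes that the softmax inputs are the logarithms of the target beta values and that every dimension belongs to exactly one entry of P, so each column of target betas sums to one.

```python
import math


def beta_weights_sketch(P, n_dims, high_value=0.9):
    """Softmax inputs that reproduce the target betas (requires len(P) >= 2)."""
    low_value = (1 - high_value) / (len(P) - 1)
    # Target betas: one row per clustering, one column per embedded dimension.
    betas = [
        [high_value if d in dims else low_value for d in range(n_dims)]
        for dims in P
    ]
    # Assumption: taking the log lets a softmax over clusterings recover the betas.
    return [[math.log(b) for b in row] for row in betas]


def softmax_over_clusterings(weights):
    """Apply a softmax per dimension, across the clusterings."""
    n_clusterings, n_dims = len(weights), len(weights[0])
    result = [[0.0] * n_dims for _ in range(n_clusterings)]
    for d in range(n_dims):
        exps = [math.exp(weights[c][d]) for c in range(n_clusterings)]
        total = sum(exps)
        for c in range(n_clusterings):
            result[c][d] = exps[c] / total
    return result


weights = beta_weights_sketch([[0, 1], [2, 3, 4]], n_dims=5)
betas = softmax_over_clusterings(weights)
print([round(b, 3) for b in betas[0]])
```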
- clustpy.deep.enrc.calculate_beta_weight(data: Tensor, centers: list, V: Tensor, P: list, high_beta_value: float = 0.9) Tensor[source]
The beta weights have a closed-form solution if we have two subspaces, so the optimal values given the data, centers and V can be computed. See the supplement of Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Boehm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832 (https://gitlab.cs.univie.ac.at/lukas/acedec_public/-/blob/master/supplement.pdf). For more than two subspaces, the beta weights are calculated assuming that an assigned subspace should have a weight of high_beta_value (default: 0.9).
- Parameters:
data (torch.Tensor) – input data
centers (list) – list of torch.Tensor, cluster centers for each clustering
V (torch.Tensor) – orthogonal rotation matrix
P (list) – list containing projections for each subspace
high_beta_value (float) – value that should be initially used to indicate strength of assignment of a specific dimension to the clustering (default: 0.9)
- Returns:
beta_weights – a c x d vector containing the weights for the softmax to indicate which dimensions d are important for each clustering c.
- Return type:
torch.Tensor
- Raises:
ValueError – if the number of clusterings is smaller than 2
- clustpy.deep.enrc.calculate_optimal_beta_weights_special_case(data: Tensor, centers: list, V: Tensor, batch_size: int = 32) Tensor[source]
The beta weights have a closed form solution if we have two subspaces, so the optimal values given the data, centers and V can be computed. See supplement of Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Boehm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832 here: https://gitlab.cs.univie.ac.at/lukas/acedec_public/-/blob/master/supplement.pdf
- Parameters:
data (torch.Tensor) – input data
centers (list) – list of torch.Tensor, cluster centers for each clustering
V (torch.Tensor) – orthogonal rotation matrix
batch_size (int) – size of the data batches (default: 32)
- Returns:
optimal_beta_weights – a c x d vector containing the optimal weights for the softmax to indicate which dimensions d are important for each clustering c.
- Return type:
torch.Tensor
- clustpy.deep.enrc.enrc_encode_decode_batchwise_with_loss(V: Tensor, centers: list, model: Module, dataloader: DataLoader, device: device = device(type='cpu'), ssl_loss_fn: Callable | _Loss = None) ndarray[source]
Encode and Decode input data of a dataloader in a mini-batch manner with ENRC.
- Parameters:
V (torch.Tensor) – orthogonal rotation matrix
centers (list) – list of torch.Tensor, cluster centers for each clustering
model (torch.nn.Module) – the input model for encoding the data
dataloader (torch.utils.data.DataLoader) – dataloader to be used for prediction
device (torch.device) – device to be predicted on (default: torch.device(‘cpu’))
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: None)
- Returns:
enrc_encoding (np.ndarray) – n x d matrix, where n is the number of data points and d is the number of dimensions of z.
enrc_decoding (np.ndarray) – n x D matrix, where n is the number of data points and D is the data dimensionality.
reconstruction_error (float) – reconstruction error (will be None if ssl_loss_fn is not specified)
- clustpy.deep.enrc.enrc_init(data: ~numpy.ndarray, n_clusters: list, init: str = 'auto', rounds: int = 10, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, random_state: ~numpy.random.mtrand.RandomState = None, max_iter: int = 100, optimizer_params: dict = None, optimizer_class: ~torch.optim.optimizer.Optimizer = None, batch_size: int = 128, epochs: int = 10, device: ~torch.device = device(type='cpu'), debug: bool = True, init_kwargs: dict = None) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy for the ENRC algorithm.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
init (str) –
{‘nrkmeans’, ‘random’, ‘sgd’, ‘auto’} or callable. Initialization strategies for parameters cluster_centers, V and beta of ENRC. (default=’auto’)
’nrkmeans’ : Performs the NrKmeans algorithm to get initial parameters. This strategy is preferred for small data sets, but the orthogonality constraint on V, and subsequently on the clustered subspaces, can sometimes be too limiting in practice, e.g., if clusterings in the data are not perfectly non-redundant.
’random’ : Same as ‘nrkmeans’, but max_iter is set to 10, so the performance is faster, but also less optimized, thus more random.
’sgd’ : Initialization strategy based on optimizing ENRC’s parameters V and beta in isolation from the neural network using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than the ‘nrkmeans’ option and only constrains V using the reconstruction error (mean_squared_error), which can be more flexible than the orthogonality constraint of NrKmeans. A drawback of the ‘sgd’ strategy is that it can be less stable for small data sets.
’auto’ : Selects ‘sgd’ init if data.shape[0] > 100,000 or data.shape[1] > 1,000. For smaller data sets ‘nrkmeans’ init is used.
If a callable is passed, it should take arguments data and n_clusters (additional parameters can be provided via the dictionary init_kwargs) and return an initialization (centers, P, V and beta_weights).
rounds (int) – number of repetitions of the initialization procedure (default: 10)
input_centers (list) – list of np.ndarray, initial cluster centers, if they should be set manually (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
random_state (np.random.RandomState) – random state for reproducible results (default: None)
max_iter (int) – maximum number of iterations of NrKmeans. Only used for init=’nrkmeans’ (default: 100)
optimizer_params (dict) – parameters of the optimizer used to optimize V and beta, includes the learning rate. Only used for init=’sgd’
optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used. Only used for init=’sgd’ (default: None)
batch_size (int) – size of the data batches. Only used for init=’sgd’ (default: 128)
epochs (int) – number of epochs for the actual clustering procedure. Only used for init=’sgd’ (default: 10)
device (torch.device) – device on which to train. Only used for init=’sgd’ (default: torch.device(‘cpu’))
debug (bool) – if True then the cost of each round will be printed (default: True)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
- Raises:
ValueError – if an init value is passed that is not implemented
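The selection rule of the ‘auto’ strategy described above can be written down directly. The helper below is a hypothetical sketch, not part of clustpy; only the two thresholds and the strategy names come from the parameter description.

```python
def sketch_select_init(n_samples, n_features, init="auto"):
    """Hypothetical helper: resolve 'auto' to a concrete initialization strategy."""
    if init != "auto":
        # Explicit strategies are passed through unchanged
        return init
    # 'sgd' is chosen for large data sets, 'nrkmeans' otherwise
    if n_samples > 100_000 or n_features > 1_000:
        return "sgd"
    return "nrkmeans"


small_data = sketch_select_init(n_samples=5_000, n_features=50)
many_samples = sketch_select_init(n_samples=200_000, n_features=50)
many_features = sketch_select_init(n_samples=5_000, n_features=2_000)
```

Both comparisons are strict, so a data set with exactly 100,000 samples and 1,000 features still resolves to ‘nrkmeans’ in this sketch.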
- clustpy.deep.enrc.enrc_predict(z: Tensor, V: Tensor, centers: list, subspace_betas: Tensor, use_P: bool = False) ndarray[source]
Predicts the labels for each clustering of an input z.
- Parameters:
z (torch.Tensor) – embedded input data point, can also be a mini-batch of embedded points
V (torch.Tensor) – orthogonal rotation matrix
centers (list) – list of torch.Tensor, cluster centers for each clustering
subspace_betas (torch.Tensor) – weights for each dimension per clustering. Calculated via softmax(beta_weights).
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft subspace_beta weights are used (default: False)
- Returns:
predicted_labels – n x c matrix, where n is the number of data points in z and c is the number of clusterings.
- Return type:
np.ndarray
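One plausible reading of the soft assignment (use_P=False) is sketched below in plain Python: each point is rotated with V, and for every clustering the nearest center is chosen under a squared Euclidean distance whose dimensions are weighted by that clustering's betas. This is an assumption-laden sketch, not the library's code: `rotate` and `sketch_enrc_predict` are hypothetical names, and the centers are assumed to be given in the rotated space already.

```python
def rotate(point, V):
    """Multiply a row vector with the rotation matrix V (point @ V)."""
    n_out = len(V[0])
    return [sum(point[i] * V[i][j] for i in range(len(point))) for j in range(n_out)]


def sketch_enrc_predict(Z, V, centers, subspace_betas):
    """Hypothetical helper: return an n x c nested list of labels (one column per clustering)."""
    labels = []
    for z in Z:
        z_rot = rotate(z, V)
        row = []
        for betas, subspace_centers in zip(subspace_betas, centers):
            # Beta-weighted squared distance to every center of this clustering
            dists = [
                sum(b * (a - c) ** 2 for b, a, c in zip(betas, z_rot, center))
                for center in subspace_centers
            ]
            row.append(dists.index(min(dists)))
        labels.append(row)
    return labels


# Assumed toy setup: V swaps the two embedded dimensions
V = [[0.0, 1.0], [1.0, 0.0]]
Z = [[0.0, 5.0], [4.0, 0.0]]
centers = [
    [[0.0, 0.0], [5.0, 0.0]],  # clustering 0 separates along rotated dimension 0
    [[0.0, 0.0], [0.0, 4.0]],  # clustering 1 separates along rotated dimension 1
]
subspace_betas = [[0.9, 0.1], [0.1, 0.9]]
labels = sketch_enrc_predict(Z, V, centers, subspace_betas)
```

Because each clustering weights the dimensions differently, the same point can be close to a center in one clustering and far from every non-trivial center in another, which is what yields one label column per clustering.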
- clustpy.deep.enrc.enrc_predict_batchwise(V: Tensor, centers: list, subspace_betas: Tensor, model: Module, dataloader: DataLoader, device: device = device(type='cpu'), use_P: bool = False) ndarray[source]
Predicts the labels for each clustering of a dataloader in a mini-batch manner.
- Parameters:
V (torch.Tensor) – orthogonal rotation matrix
centers (list) – list of torch.Tensor, cluster centers for each clustering
subspace_betas (torch.Tensor) – weights for each dimension per clustering. Calculated via softmax(beta_weights).
model (torch.nn.Module) – the input model for encoding the data
dataloader (torch.utils.data.DataLoader) – dataloader to be used for prediction
device (torch.device) – device to be predicted on (default: torch.device(‘cpu’))
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: False)
- Returns:
predicted_labels – n x c matrix, where n is the number of data points in the dataloader and c is the number of clusterings.
- Return type:
np.ndarray
- clustpy.deep.enrc.nrkmeans_init(data: ~numpy.ndarray, n_clusters: list, rounds: int = 10, max_iter: int = 100, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, random_state: ~numpy.random.mtrand.RandomState = None, debug=True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy based on the NrKmeans algorithm. This strategy is preferred for small data sets, but the orthogonality constraint on V, and subsequently on the clustered subspaces, can sometimes be too limiting in practice, e.g., if clusterings are not perfectly non-redundant.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
rounds (int) – number of repetitions of the NrKmeans algorithm (default: 10)
max_iter (int) – maximum number of iterations of NrKmeans (default: 100)
input_centers (list) – list of np.ndarray, initial cluster centers, if they should be set manually (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – if True then the cost of each round will be printed (default: True)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
- clustpy.deep.enrc.optimal_beta(kmeans_loss: Tensor, other_losses_mean_sum: Tensor) Tensor[source]
Calculate optimal values for the beta weight for each dimension.
- Parameters:
kmeans_loss (torch.Tensor) – a 1 x d vector of the kmeans losses per dimension.
other_losses_mean_sum (torch.Tensor) – a 1 x d vector of the kmeans losses of all other clusterings except the one in ‘kmeans_loss’.
- Returns:
optimal_beta_weights – a 1 x d vector containing the optimal weights for the softmax to indicate which dimensions are important for each clustering. Calculated via -torch.log(kmeans_loss/other_losses_mean_sum)
- Return type:
torch.Tensor
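The formula quoted in the return description, -log(kmeans_loss / other_losses_mean_sum), can be tried out per dimension with a few lines of plain Python. The helper name and the input values are assumptions made for illustration; the library itself operates on torch tensors.

```python
import math


def sketch_optimal_beta(kmeans_loss, other_losses_mean_sum):
    """Hypothetical helper: apply -log(kmeans_loss / other_losses_mean_sum) per dimension."""
    return [-math.log(k / o) for k, o in zip(kmeans_loss, other_losses_mean_sum)]


# Assumed per-dimension losses for one clustering and for all other clusterings
kmeans_loss = [0.5, 2.0, 1.0]
other_losses_mean_sum = [2.0, 0.5, 1.0]
weights = sketch_optimal_beta(kmeans_loss, other_losses_mean_sum)
```

A dimension in which the clustering has a lower k-means loss than the others gets a positive weight, a dimension with a higher loss gets a negative weight, and equal losses give a weight of zero.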
- clustpy.deep.enrc.random_nrkmeans_init(data: ~numpy.ndarray, n_clusters: list, rounds: int = 10, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, random_state: ~numpy.random.mtrand.RandomState = None, debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy based on the NrKmeans Algorithm. For documentation see nrkmeans_init function. Same as nrkmeans_init, but max_iter is set to 1, so the results will be faster and more random.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
rounds (int) – number of repetitions of the NrKmeans algorithm (default: 10)
input_centers (list) – list of np.ndarray, initial cluster centers, if they should be set manually (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – if True then the cost of each round will be printed (default: True)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
- clustpy.deep.enrc.reinit_centers(enrc: _ENRC_Module, subspace_id: int, dataloader: DataLoader, model: Module, n_samples: int = 512, kmeans_steps: int = 10, split: str = 'random', debug: bool = False) None[source]
Reinitializes centers that have been lost, i.e., that did not get any data points assigned. Before a center is reinitialized, this method checks for how many mini-batch iterations the center has not had any points assigned; if this count is higher than enrc.reinit_threshold, the center is reinitialized.
- Parameters:
enrc (_ENRC_Module) – torch.nn.Module instance for the ENRC algorithm
subspace_id (int) – integer indicating which subspace the clusters to be checked are in.
dataloader (torch.utils.data.DataLoader) – dataloader from which data is randomly sampled. Important: shuffle=True needs to be set, because n_samples random samples are drawn.
model (torch.nn.Module) – neural network used for the embedding
n_samples (int) – number of samples that should be used for the reclustering (default: 512)
kmeans_steps (int) – number of mini-batch kmeans steps that should be conducted with the new centroid (default: 10)
split (str) – {‘random’, ‘cost’}, selects how clusters should be split for reinitialization. ‘random’ : split a random point from the random sample of size=n_samples. ‘cost’ : split the cluster with max kmeans cost (default: ‘random’)
debug (bool) – if True then training errors will be printed (default: False)
- clustpy.deep.enrc.sgd_init(data: ~numpy.ndarray, n_clusters: list, optimizer_params: dict, batch_size: int = 128, optimizer_class: ~torch.optim.optimizer.Optimizer = None, rounds: int = 2, epochs: int = 10, random_state: ~numpy.random.mtrand.RandomState = None, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, device: ~torch.device = device(type='cpu'), debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy based on optimizing ENRC’s parameters V and beta in isolation from the neural network using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than nrkmeans_init and only constrains V using the reconstruction error (mean_squared_error), which can be more flexible than the orthogonality constraint of NrKmeans. A drawback of the sgd_init strategy is that it can be less stable for small data sets.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
optimizer_params (dict) – parameters of the optimizer used to optimize V and beta, includes the learning rate
batch_size (int) – size of the data batches (default: 128)
optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used (default: None)
rounds (int) – number of repetitions of the initialization procedure (default: 2)
epochs (int) – number of epochs for the actual clustering procedure (default: 10)
random_state (np.random.RandomState) – random state for reproducible results (default: None)
input_centers (list) – list of np.ndarray, initial cluster centers, if they should be set manually (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
device (torch.device) – device on which to train (default: torch.device(‘cpu’))
debug (bool) – if True then the cost of each round will be printed (default: True)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
clustpy.deep.shade module
@authors: Pascal Weber
- class clustpy.deep.shade.SHADE(clustering_class: ~sklearn.base.ClusterMixin | None = <class 'clustpy.hierarchical.dctree_clusterer.DCTree_Clusterer'>, clustering_params: dict = None, min_points: int = 5, use_complete_dc_tree: bool = True, use_matrix_dc_distance: bool = True, use_less_memory: bool = False, batch_size: int = 500, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 0, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~typing.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, density_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases:
_AbstractDeepClusteringAlgo
The Structure-preserving High-dimensional Analysis with Density-based Exploration (SHADE) algorithm. A neural network (autoencoder, AE) will be trained with the reconstruction loss and the d_dc loss function. Afterward, KMeans or HDBSCAN identifies the initial clusters.
- Parameters:
clustering_class (ClusterMixin) – clustering class to obtain the cluster labels after getting the embedding (default: DCTree_Clusterer)
clustering_params (dict) – parameters for the clustering class. If None, it will be set to {“min_points”: min_points} (default: None)
min_points (int) – the minimum number of points (default: 5)
use_complete_dc_tree (bool) – Defines whether the complete DC Tree should be used instead of a batch-wise version (default: True)
use_matrix_dc_distance (bool) – Defines whether the matrix DC distance should be stored - can cause memory issues (default: True)
use_less_memory (bool) – Use less memory when constructing the DCTree. This will, however, increase the runtime (default: False)
batch_size (int) – Size of the data batches. (default: 500)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3}. (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network. (default: 0)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
density_loss_weight (float) – weight of the density loss compared to the reconstruction loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- n_clusters_
The final number of clusters
- Type:
int
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers defined as the mean of assigned samples within the AE embedding
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep.shade import SHADE
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> shade = SHADE()
>>> shade.fit(data)
References
SHADE: Deep Density-based Clustering Anna Beer; Pascal Weber; Lukas Miklautz; Collin Leiber; Walid Durani; Christian Böhm IEEE International Conference on Data Mining (ICDM), Abu Dhabi, United Arab Emirates, 2024, pp. 675-680, doi: 10.1109/ICDM59182.2024.
- fit(X: ndarray, y: ndarray = None) SHADE[source]
Cluster the input dataset with the SHADE algorithm. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – The given data set.
y (np.ndarray) – The labels. (can be ignored)
- Returns:
self – This instance of the SHADE algorithm.
- Return type:
SHADE
- predict(X: ndarray) ndarray[source]
Predicts the labels of the input data. Note that this is only a very imprecise estimate, as the DC Tree is not used to predict the labels. The prediction is calculated by finding the closest mean of samples in a cluster within the embedding of the AE.
- Parameters:
X (np.ndarray) – input data
- Returns:
predicted_labels – The predicted labels
- Return type:
np.ndarray
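The nearest-mean rule described for predict can be illustrated as follows. This is a minimal plain-Python sketch under the assumption that the points have already been embedded by the AE and that the centers are the per-cluster means in that embedding; `sketch_nearest_center_labels` is a hypothetical helper, not the SHADE implementation.

```python
def sketch_nearest_center_labels(embedded, cluster_centers):
    """Hypothetical helper: assign each embedded point to its closest cluster mean."""
    labels = []
    for point in embedded:
        # Squared Euclidean distance to every cluster mean
        sq_dists = [
            sum((p - c) ** 2 for p, c in zip(point, center))
            for center in cluster_centers
        ]
        labels.append(sq_dists.index(min(sq_dists)))
    return labels


# Assumed toy embedding with two cluster means
cluster_centers = [[0.0, 0.0], [10.0, 10.0]]
embedded = [[1.0, -1.0], [9.0, 11.0], [4.0, 4.0]]
predicted = sketch_nearest_center_labels(embedded, cluster_centers)
```

Such a rule ignores the density structure captured by the DC Tree, which is why the description calls the prediction imprecise.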
clustpy.deep.vade module
@authors: Donatella Novakovic, Lukas Miklautz, Collin Leiber
- class clustpy.deep.vade.VaDE(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = BCELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases:
_AbstractDeepClusteringAlgo
The Variational Deep Embedding (VaDE) algorithm. First, a variational autoencoder (VAE) will be trained (will be skipped if input neural network is given). Afterward, a GMM will be fit to identify the initial clustering structures. Last, the VAE will be optimized using the VaDE loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.BCELoss(reduction=’sum’))
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new VariationalAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (central layer with mean and variance) (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: GaussianMixture)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {“n_init”: 10, “covariance_type”: “diag”} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The labels as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- cluster_centers_
The cluster centers as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- covariances_
The covariance matrices as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- weights_
The weights as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- vade_labels_
The labels as identified by VaDE after the training terminated
- Type:
np.ndarray
- vade_cluster_centers_
The cluster centers as identified by VaDE after the training terminated
- Type:
np.ndarray
- vade_covariances_
The covariance matrices as identified by VaDE after the training terminated
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> import numpy as np
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep.vade import VaDE
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> data = (data - np.mean(data)) / np.std(data)
>>> vade = VaDE(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> vade.fit(data)
References
Jiang, Zhuxi, et al. “Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering.” IJCAI. 2017.
- fit(X: ndarray, y: ndarray = None) VaDE[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the VaDE algorithm
- Return type:
VaDE
Module contents
- class clustpy.deep.ACeDeC(n_clusters: int, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 20, init: str = 'acedec', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]
Bases:
ENRC
Autoencoder Centroid-based Deep Cluster (ACeDeC) can be seen as a special case of ENRC where we have one cluster space and one shared space with a single cluster.
- Parameters:
n_clusters (int) – number of clusters
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
P (list) – list containing projections for clusters in clustered space and cluster in shared space (optional) (default: None)
input_centers (list) – list containing the cluster centers for clusters in clustered space and cluster in shared space (optional) (default: None)
batch_size (int) – size of the data batches (default: 128)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)
tolerance_threshold (float) – tolerance threshold that determines when the training should stop. If NMI(old_labels, new_labels) >= (1 - tolerance_threshold) for all clusterings, the training stops before the maximum number of epochs (clustering_epochs) is reached. A higher value makes the training stop earlier; if set to 0 or None, the training continues until the labels no longer change (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)
init (str) – choose which initialization strategy should be used. Has to be one of ‘acedec’, ‘subkmeans’, ‘random’ or ‘sgd’ (default: ‘acedec’)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10,000)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy. (default: True)
debug (bool) – if True additional information during the training will be printed (default: False)
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural_network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
- Raises:
ValueError – if init is not one of ‘acedec’, ‘subkmeans’, ‘random’, ‘auto’ or ‘sgd’
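The stopping rule behind tolerance_threshold can be sketched in plain Python. This is only an illustration under stated assumptions, not the clustpy implementation: the helper names normalized_mutual_info and should_stop are hypothetical, the NMI is normalised by the arithmetic mean of both entropies, and None is treated like 0.

```python
import math
from collections import Counter


def normalized_mutual_info(labels_a, labels_b):
    """NMI of two labelings, normalised by the mean of both entropies (assumption)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in count_ab.items()
    )
    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    mean_entropy = (entropy_a + entropy_b) / 2
    if mean_entropy == 0.0:
        # Both labelings consist of a single cluster, so they are identical.
        return 1.0
    return mutual_info / mean_entropy


def should_stop(old_labelings, new_labelings, tolerance_threshold=None):
    """Hypothetical check: stop once every clustering is stable enough."""
    tol = 0.0 if tolerance_threshold is None else tolerance_threshold
    eps = 1e-12  # guard against floating-point rounding when NMI is exactly 1
    return all(
        normalized_mutual_info(old, new) >= 1.0 - tol - eps
        for old, new in zip(old_labelings, new_labelings)
    )


# One clustering, compared between two consecutive epochs.
old = [[0, 0, 1, 1, 2, 2]]
relabelled = [[1, 1, 0, 0, 2, 2]]   # same partition, different label ids
changed = [[0, 0, 1, 1, 2, 1]]      # one sample moved to another cluster
print(should_stop(old, relabelled))                       # partition unchanged
print(should_stop(old, changed))                          # labels still changing
print(should_stop(old, changed, tolerance_threshold=0.3))
```

Because NMI is invariant to a renaming of the cluster ids, a pure permutation of the labels counts as "not changing" in this sketch.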
References
Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Böhm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832
- fit(X: ndarray, y: ndarray = None) ACeDeC[source]
Cluster the input dataset with the ACeDeC algorithm. Saves the labels, centers, V, m, Betas, and P in the ACeDeC object. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – input data
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – returns the ACeDeC object
- Return type:
ACeDeC
- predict(X: ndarray, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]
Predicts the labels of the input data.
- Parameters:
X (np.ndarray) – input data
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)
dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)
- Returns:
predicted_labels – The predicted labels
- Return type:
np.ndarray
- set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ACeDeC
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.
use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.AEC(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = None, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Auto-encoder Based Data Clustering (AEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the AEC loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining. If this is None, random labels will be used (default: None)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import AEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> aec = AEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> aec.fit(data)
References
Song, Chunfeng, et al. “Auto-encoder based data clustering.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 18th Iberoamerican Congress, CIARP 2013, Havana, Cuba, November 20-23, 2013, Proceedings, Part I 18. Springer Berlin Heidelberg, 2013.
- fit(X: ndarray, y: ndarray = None) AEC[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the AEC algorithm
- Return type:
AEC
- set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') AEC
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.DCN(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Clustering Network (DCN) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DCN loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dcn_labels_
The final DCN labels
- Type:
np.ndarray
- dcn_cluster_centers_
The final DCN cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DCN
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dcn = DCN(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dcn.fit(data)
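How ssl_loss_weight and clustering_loss_weight trade off the two loss terms can be illustrated with a small pure-Python sketch. It assumes a k-means-style clustering term (mean squared distance between each embedded sample and its assigned center); the function names are hypothetical, and the exact reduction and scaling used inside clustpy may differ.

```python
def mean_squared_error(inputs, outputs):
    """Mean of the squared differences over all samples and features."""
    squared = [(x - y) ** 2 for a, b in zip(inputs, outputs) for x, y in zip(a, b)]
    return sum(squared) / len(squared)


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def combined_loss(batch, reconstructions, embeddings, centers, labels,
                  ssl_loss_weight=1.0, clustering_loss_weight=0.1):
    """Weighted sum of the ssl (reconstruction) loss and a k-means-style cluster loss."""
    ssl_loss = mean_squared_error(batch, reconstructions)
    cluster_loss = sum(
        squared_distance(z, centers[k]) for z, k in zip(embeddings, labels)
    ) / len(embeddings)
    return ssl_loss_weight * ssl_loss + clustering_loss_weight * cluster_loss


batch = [[1.0, 2.0], [3.0, 4.0]]
reconstructions = [[1.0, 2.0], [3.0, 2.0]]   # hypothetical decoder output
embeddings = [[0.0], [2.0]]                  # hypothetical encoder output
centers = [[0.0], [1.0]]
labels = [0, 1]
loss = combined_loss(batch, reconstructions, embeddings, centers, labels)
print(loss)
```

With the default weights, the reconstruction term dominates; raising clustering_loss_weight pulls the embedded samples more strongly toward their centers.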
References
Yang, Bo, et al. “Towards k-means-friendly spaces: Simultaneous deep learning and clustering.” international conference on machine learning. PMLR, 2017.
- fit(X: ndarray, y: ndarray = None) DCN[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DCN algorithm
- Return type:
DCN
- set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') DCN
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.DDC(ratio: float = 0.1, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, tsne_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Density-based Image Clustering (DDC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, t-SNE is executed on the embedded data and a variant of the Density Peak Clustering algorithm is executed.
- Parameters:
ratio (float) – The ratio parameter, defining the cutoff distance d_c by calculating: average pairwise distance * ratio (default: 0.1)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
tsne_params (dict) – Parameters for the t-SNE execution. For example, perplexity can be changed by setting tsne_params to {“n_components”: 2, “perplexity”: 25}. Check out sklearn.manifold.TSNE for more information. If None, it will be set to {“n_components”: 2} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- n_clusters_
The final number of clusters
- Type:
int
- labels_
The final labels (obtained by a variant of Density Peak Clustering)
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- tsne_
The t-SNE object
- Type:
TSNE
- n_features_in_
the number of features used for the fitting
- Type:
int
- cluster_centers_
The final cluster centers defined as the mean of assigned samples within the AE embedding
- Type:
np.ndarray
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DDC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> ddc = DDC(pretrain_epochs=3)
>>> ddc.fit(data)
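The role of the ratio parameter can be made concrete with a short sketch that derives the cutoff distance d_c from the average pairwise distance. The function names are hypothetical, the average is taken over all unordered pairs of distinct points (an assumption), and the neighbor count shown is the classic Density Peak Clustering density, which is not necessarily the variant DDC uses.

```python
import math
from itertools import combinations


def cutoff_distance(points, ratio=0.1):
    """d_c = average pairwise distance * ratio."""
    pair_distances = [math.dist(p, q) for p, q in combinations(points, 2)]
    return ratio * sum(pair_distances) / len(pair_distances)


def local_density(points, d_c):
    """Classic density estimate: number of other points closer than d_c."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) < d_c)
        for i, p in enumerate(points)
    ]


# Hypothetical 2-dimensional t-SNE coordinates.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d_c = cutoff_distance(points, ratio=0.1)
print(d_c)
print(local_density(points, 6.0))
```

A larger ratio yields a larger d_c, so more points fall inside each neighborhood and the density estimate becomes smoother.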
References
Ren, Yazhou, et al. “Deep density-based image clustering.” Knowledge-Based Systems 197 (2020): 105841.
- fit(X: ndarray, y: ndarray = None) DDC[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DDC algorithm
- Return type:
DDC
- predict(X: ndarray) ndarray[source]
Predicts the labels of the input data. Note that this is only a very imprecise estimation, because t-SNE does not learn a function f() that maps new data into the final embedding. Therefore, each sample is assigned to the cluster whose mean (computed from the assigned samples within the embedding of the AE) is closest.
- Parameters:
X (np.ndarray) – input data
- Returns:
predicted_labels – The predicted labels
- Return type:
np.ndarray
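The nearest-mean prediction described above can be sketched as follows. This is a simplified illustration, not the clustpy code: the function names are hypothetical, the encoding step of the autoencoder is omitted (the inputs are assumed to be already embedded), and labels are assumed to be the integers 0, 1, ..., k-1.

```python
import math


def cluster_means(embedded, labels):
    """Mean of the embedded samples assigned to each cluster."""
    sums, counts = {}, {}
    for z, k in zip(embedded, labels):
        if k not in sums:
            sums[k] = [0.0] * len(z)
            counts[k] = 0
        sums[k] = [s + v for s, v in zip(sums[k], z)]
        counts[k] += 1
    return [[s / counts[k] for s in sums[k]] for k in sorted(sums)]


def predict_by_nearest_center(embedded, cluster_centers):
    """Assign each embedded sample to the index of its closest center."""
    labels = []
    for z in embedded:
        distances = [math.dist(z, c) for c in cluster_centers]
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical autoencoder embeddings of the training data and their labels.
train_embedded = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
train_labels = [0, 0, 1, 1]
centers = cluster_means(train_embedded, train_labels)

new_embedded = [[1.0, 1.0], [9.0, 8.0], [4.0, 4.0]]
print(predict_by_nearest_center(new_embedded, centers))
```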
- class clustpy.deep.DEC(n_clusters: int = 8, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Embedded Clustering (DEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DEC loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
alpha (float) – alpha value for the prediction (default: 1.0)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dec_labels_
The final DEC labels
- Type:
np.ndarray
- dec_cluster_centers_
The final DEC cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dec = DEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dec.fit(data)
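To show what the alpha parameter controls, the following sketch computes the Student's t soft assignments from the DEC paper, q_ij proportional to (1 + ||z_i - mu_j||^2 / alpha)^(-(alpha + 1) / 2), and derives hard labels from them. The function names are hypothetical and the code is a plain-Python illustration rather than the clustpy implementation.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def soft_assignments(embeddings, centers, alpha=1.0):
    """Row-normalised Student's t similarities between samples and centers."""
    q = []
    for z in embeddings:
        kernel = [
            (1.0 + squared_distance(z, c) / alpha) ** (-(alpha + 1.0) / 2.0)
            for c in centers
        ]
        total = sum(kernel)
        q.append([k / total for k in kernel])
    return q


def predict_labels(embeddings, centers, alpha=1.0):
    """Hard label = index of the center with the largest soft assignment."""
    return [row.index(max(row)) for row in soft_assignments(embeddings, centers, alpha)]


centers = [[0.0, 0.0], [1.0, 0.0]]           # hypothetical cluster centers
q = soft_assignments([[0.0, 0.0]], centers)  # sample sitting on the first center
print(q)
print(predict_labels([[0.9, 0.0]], centers))
```

Smaller alpha values give the kernel heavier tails, so distant centers keep a larger share of the assignment.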
References
Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis.” International conference on machine learning. 2016.
- fit(X: ndarray, y: ndarray = None) DEC[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DEC algorithm
- Return type:
DEC
- set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') DEC
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.DEN(n_clusters: int = 8, group_size: int | list | None = 2, n_neighbors: int = 5, weight_locality_constraint: float = 0.5, weight_sparsity_constraint: float = 1.0, heat_kernel_t_parameter: float = 1.0, group_lasso_lambda_parameter: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int | None = None, custom_dataloaders: tuple = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Embedding Network (DEN) algorithm. It trains a neural network by optimizing a loss function consisting of three components. These are (1) the standard loss function of the neural network (e.g. reconstruction loss for autoencoders), (2) the locality-preserving constraint and (3) the group sparsity constraint. Finally, k-Means is executed in the resulting embedding.
- Parameters:
n_clusters (int) – number of clusters (default: 8)
group_size (int | list) – the number of features in each group. Can also be a list, specifying the size of each group separately. Can be None if embedding_size is specified (default: 2)
n_neighbors (int) – the number of nearest-neighbors (including itself) for the locality-preserving constraint. Nearest-neighbors will be calculated by using the Euclidean distance. If another distance should be used to define the nearest-neighbors, the neighbors can be included in the custom_dataloader as additional_inputs. In this case, it is expected that the trainloader is composed of: (sample_ids, original_samples, 1st-NNs, 2nd-NNs, …, (n_neighbors-1)-NNs) (default: 5)
weight_locality_constraint (float) – weight alpha for the locality-preserving constraint (default: 0.5)
weight_sparsity_constraint (float) – weight beta for the group sparsity constraint (default: 1.)
heat_kernel_t_parameter (float) – the t parameter for the heat kernel included in the locality-preserving constraint (default: 1.)
group_lasso_lambda_parameter (float) – the lambda parameter for the group lasso included in the group sparsity constraint (default: 1.)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: None)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by KMeans)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by KMeans)
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DEN
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> den = DEN(n_clusters=3, pretrain_epochs=3)
>>> den.fit(data)
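The three loss components named in the class description can be sketched in plain Python. This is a minimal illustration and not clustpy's implementation: all function names are hypothetical, the heat-kernel weight exp(-||x_i - x_j||^2 / t) and the group-lasso weight lambda * sqrt(group size) follow the common formulation of the DEN paper, and the plain weighted sum is an assumption about how the parts are combined.

```python
import math


def heat_kernel_weight(x_i, x_j, t=1.0):
    """Similarity exp(-||x_i - x_j||^2 / t) between two input samples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / t)


def locality_term(x_i, x_j, z_i, z_j, t=1.0):
    """Heat-kernel-weighted squared distance between two embedded neighbors."""
    sq_dist_embedded = sum((a - b) ** 2 for a, b in zip(z_i, z_j))
    return heat_kernel_weight(x_i, x_j, t) * sq_dist_embedded


def group_sparsity_term(z, group_sizes, lam=1.0):
    """Group lasso: sum over groups of lam * sqrt(n_g) * ||z_g||_2."""
    total, start = 0.0, 0
    for n_g in group_sizes:
        group = z[start:start + n_g]
        total += lam * math.sqrt(n_g) * math.sqrt(sum(v * v for v in group))
        start += n_g
    return total


def den_loss(ssl_loss, locality, sparsity, alpha=0.5, beta=1.0):
    """Assumed combination: ssl loss + alpha * locality + beta * sparsity."""
    return ssl_loss + alpha * locality + beta * sparsity


# Toy usage: two neighboring inputs and their 4-dimensional embeddings,
# split into two groups of size 2 (group_size=2).
x_i, x_j = [0.0, 0.0], [1.0, 0.0]
z_i, z_j = [3.0, 4.0, 0.0, 0.0], [3.0, 3.0, 0.0, 0.0]
loc = locality_term(x_i, x_j, z_i, z_j, t=1.0)
spa = group_sparsity_term(z_i, group_sizes=[2, 2], lam=1.0)
print(den_loss(ssl_loss=0.1, locality=loc, sparsity=spa))
```

Here alpha and beta play the roles of weight_locality_constraint and weight_sparsity_constraint, t corresponds to heat_kernel_t_parameter and lam to group_lasso_lambda_parameter.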
References
Huang, Peihao, et al. “Deep embedding network for clustering.” 2014 22nd International conference on pattern recognition. IEEE, 2014.
- fit(X: ndarray, y: ndarray = None) DEN[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DEN algorithm
- Return type:
DEN
- set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') DEN
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.DKM(n_clusters: int = 8, alphas: tuple = 1000, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep k-Means (DKM) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DKM loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
alphas (tuple) – tuple of different alpha values used for the prediction. Small values close to 0 are equivalent to homogeneous assignments to all clusters. Large values simulate a clear assignment as with kMeans. If None, the default calculation of the paper will be used. This is equal to \alpha_{i+1} = 2^{1/\log(i)^2} \cdot \alpha_i with \alpha_1 = 0.1 and maximum i = 40. Alpha can also be a tuple with (None, \alpha_1, maximum i) (default: (1000))
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for each alpha value for the actual clustering procedure. The total number of clustering epochs therefore corresponds to: len(alphas)*clustering_epochs (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dkm_labels_
The final DKM labels
- Type:
np.ndarray
- dkm_cluster_centers_
The final DKM cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DKM
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dkm = DKM(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dkm.fit(data)
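The role of the alphas parameter can be illustrated with a small sketch. It is not taken from clustpy's source: default_alphas and soft_assignments are hypothetical helper names, the recurrence is started at i = 2 (an assumption, chosen because log(1) = 0 would divide by zero), and the softmax over negative alpha-scaled squared distances is the usual DKM-style soft assignment.

```python
import math


def default_alphas(init_alpha=0.1, n_alphas=40):
    """Annealing schedule: next alpha = 2^(1 / log(i)^2) * previous alpha."""
    alphas = [init_alpha]
    # Assumption: start at i = 2, since log(1) = 0 would divide by zero.
    for i in range(2, n_alphas + 1):
        alphas.append(2 ** (1 / math.log(i) ** 2) * alphas[-1])
    return alphas


def soft_assignments(sq_distances, alpha):
    """Softmax over -alpha * squared distance to each cluster center."""
    d_min = min(sq_distances)  # shift by the minimum for numerical stability
    weights = [math.exp(-alpha * (d - d_min)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]


schedule = default_alphas()
sq_distances = [1.0, 4.0, 9.0]  # squared distances of one sample to 3 centers
print(soft_assignments(sq_distances, alpha=0.0))           # homogeneous
print(soft_assignments(sq_distances, alpha=schedule[-1]))  # close to hard
print(soft_assignments(sq_distances, alpha=1000.0))        # kMeans-like
```

With alpha close to 0 every cluster receives the same weight, while a large value such as the default 1000 puts practically all weight on the nearest center.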
References
Fard, Maziar Moradi, Thibaut Thonet, and Eric Gaussier. “Deep k-means: Jointly clustering with k-means and learning representations.” Pattern Recognition Letters 138 (2020): 185-192.
- fit(X: ndarray, y: ndarray = None) DKM[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DKM algorithm
- Return type:
DKM
- set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') DKM
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.DeepECT(max_n_leaf_nodes: int = 20, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, grow_interval: int = 2, pruning_threshold: float = 0.1, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Embedded Cluster Tree (DeepECT) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, a cluster tree will be grown and the network will be optimized using the DeepECT loss function.
- Parameters:
max_n_leaf_nodes (int) – Maximum number of leaf nodes in the cluster tree (default: 20)
batch_size (int) – Size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – Number of epochs for the actual clustering procedure (default: 150)
grow_interval (int) – Number of epochs after which the tree is grown (default: 2)
pruning_threshold (float) – The threshold for pruning the tree (default: 0.1)
optimizer_class (torch.optim.Optimizer) – The optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – Size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – Use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- tree_
The prediction cluster tree after training
- Type:
PredictionClusterTree
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
References
Mautz, Dominik, Claudia Plant, and Christian Böhm. “Deep embedded cluster tree.” 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 2019.
- fit(X: ndarray, y: ndarray = None) DeepECT[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – This instance of the DeepECT algorithm
- Return type:
DeepECT
- flat_clustering(n_leaf_nodes_to_keep: int) ndarray[source]
Transform the predicted labels into a flat clustering result by only keeping n_leaf_nodes_to_keep leaf nodes in the tree. Returns labels as if the clustering procedure would have stopped at the specified number of nodes. Note that each leaf node corresponds to a cluster.
- Parameters:
n_leaf_nodes_to_keep (int) – The number of leaf nodes to keep in the cluster tree
- Returns:
labels_pruned – The new cluster labels
- Return type:
np.ndarray
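The idea behind flat_clustering, returning labels as if the tree had stopped growing at a given number of leaf nodes, can be sketched without clustpy. The sketch below is an assumption about one way to realize that description, not the library's implementation: the tree is a plain parent map, created_at_split (a hypothetical bookkeeping dict) records which split produced each node, and flat_labels is a hypothetical helper.

```python
def flat_labels(leaf_of_sample, parent, created_at_split, n_leaf_nodes_to_keep):
    """Relabel samples as if only the first (k - 1) splits had happened."""
    # A binary tree with k leaf nodes is the result of k - 1 splits.
    last_kept_split = n_leaf_nodes_to_keep - 1
    surviving = []
    for leaf in leaf_of_sample:
        node = leaf
        # Walk up until we reach a node that already existed at that point.
        while created_at_split[node] > last_kept_split:
            node = parent[node]
        surviving.append(node)
    # Map the surviving node ids to consecutive cluster labels 0..k-1.
    label_of_node = {n: i for i, n in enumerate(sorted(set(surviving)))}
    return [label_of_node[n] for n in surviving]


# Toy tree: split 1 turns root 0 into nodes 1 and 2, split 2 turns node 2
# into nodes 3 and 4, split 3 turns node 1 into nodes 5 and 6.
parent = {0: None, 1: 0, 2: 0, 3: 2, 4: 2, 5: 1, 6: 1}
created_at_split = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}
leaf_of_sample = [3, 4, 5, 6, 3]  # leaf node reached by each sample

print(flat_labels(leaf_of_sample, parent, created_at_split, 2))
print(flat_labels(leaf_of_sample, parent, created_at_split, 3))
```

With a fitted DeepECT instance the corresponding call would be flat_clustering(n_leaf_nodes_to_keep), which returns the pruned labels as an np.ndarray.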
- class clustpy.deep.DipDECK(n_clusters_init: int = 35, dip_merge_threshold: float = 0.9, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, max_n_clusters: int = inf, min_n_clusters: int = 1, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 5, max_cluster_size_diff_factor: float = 2, pval_strategy: str = 'table', n_boots: int = 1000, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None, debug: bool = False)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Embedded Clustering with k-Estimation (DipDECK) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters using an overestimated number of clusters. Last, the network will be optimized using the DipDECK loss function. If any Dip-value exceeds the dip_merge_threshold, the corresponding clusters will be merged.
- Parameters:
n_clusters_init (int) – initial number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 35)
dip_merge_threshold (float) – threshold regarding the Dip-p-value that defines if two clusters should be merged. Must be between 0 and 1 (default: 0.9)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
max_n_clusters (int) – maximum number of clusters. Must be larger than min_n_clusters. If the result has more clusters, a merge will be forced (default: np.inf)
min_n_clusters (int) – minimum number of clusters. Must be larger than 0, smaller than max_n_clusters and smaller than n_clusters_init. When this number of clusters is reached, all further merge processes will be hindered (default: 1)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure. Will reset after each merge (default: 50)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 5)
max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for the Dip calculation (default: 2)
pval_strategy (str) – Defines which strategy to use to receive dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)
n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – If true, additional information will be printed to the console (default: False)
- labels_
The final labels
- Type:
np.ndarray
- n_clusters_
The final number of clusters
- Type:
int
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipDECK
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipdeck = DipDECK(pretrain_epochs=3, clustering_epochs=3)
>>> dipdeck.fit(data)
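How dip_merge_threshold, min_n_clusters and max_n_clusters interact can be shown with a short decision sketch. It only mirrors the parameter descriptions and is not clustpy's code: choose_merge is a hypothetical helper, the pairwise Dip-p-values are assumed to be precomputed, and picking the pair with the highest p-value (also for a forced merge) is an assumption.

```python
def choose_merge(p_values, n_clusters, dip_merge_threshold=0.9,
                 min_n_clusters=1, max_n_clusters=float("inf")):
    """Return the pair of cluster ids to merge next, or None for no merge."""
    # No further merges once the minimum number of clusters is reached.
    if n_clusters <= min_n_clusters or not p_values:
        return None
    # Candidate: the pair of clusters with the highest Dip-p-value.
    best_pair = max(p_values, key=p_values.get)
    exceeds_threshold = p_values[best_pair] > dip_merge_threshold
    too_many_clusters = n_clusters > max_n_clusters  # forces a merge
    if exceeds_threshold or too_many_clusters:
        return best_pair
    return None


# Toy usage with three clusters and made-up p-values.
p_values = {(0, 1): 0.95, (0, 2): 0.20, (1, 2): 0.50}
print(choose_merge(p_values, n_clusters=3))
print(choose_merge(p_values, n_clusters=3, dip_merge_threshold=0.99))
```

In the actual algorithm the p-values would be recomputed after every merge, and according to the clustering_epochs description the epoch counter is reset as well.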
References
Leiber, Collin, et al. “Dip-based deep embedded clustering with k-estimation.” Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021.
- fit(X: ndarray, y: ndarray = None) DipDECK[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DipDECK algorithm
- Return type:
DipDECK
- class clustpy.deep.DipEncoder(n_clusters: int = 8, batch_size: int = None, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, max_cluster_size_diff_factor: float = 3, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The DipEncoder. Can be used either as a clustering procedure if no ground truth labels are given or as a supervised dimensionality reduction technique. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DipEncoder loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
batch_size (int) – size of the data batches for the actual training of the DipEncoder. Should be larger the more clusters we have. If it is None, it will be set to (25 x n_clusters) (default: None)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used (default: 3)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss. If None, it will be equal to 1/(4L), where L is the reconstruction loss of the first batch of an untrained neural network (default: None)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels
- Type:
np.ndarray
- projection_axes_
The final projection axes between the clusters
- Type:
np.ndarray
- index_dict_
A dictionary to match the indices of two clusters to a projection axis
- Type:
dict
- projection_thresholds_
A list containing the thresholds for each projection axis and a tuple indicating which cluster is left and right of the threshold
- Type:
list
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipEncoder
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipencoder = DipEncoder(3, pretrain_epochs=3, clustering_epochs=3)
>>> dipencoder.fit(data)
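The None defaults of batch_size and ssl_loss_weight and the effect of max_cluster_size_diff_factor can be sketched as follows. The helper names are hypothetical and the code is not taken from clustpy. In particular, measuring "closest" as the distance of the one-dimensional projected values to the center of the smaller cluster is an assumption.

```python
def resolve_defaults(n_clusters, first_batch_loss, batch_size=None,
                     ssl_loss_weight=None):
    """Fill in the documented defaults for parameters left at None."""
    if batch_size is None:
        batch_size = 25 * n_clusters  # (25 x n_clusters)
    if ssl_loss_weight is None:
        # 1 / (4L), with L the reconstruction loss of the first batch
        ssl_loss_weight = 1 / (4 * first_batch_loss)
    return batch_size, ssl_loss_weight


def limit_cluster_size(values_large, n_small, center_small,
                       max_cluster_size_diff_factor=3):
    """Keep at most factor * n_small projected values of the larger cluster."""
    max_size = int(max_cluster_size_diff_factor * n_small)
    if len(values_large) <= max_size:
        return list(values_large)
    # Assumption: keep the values closest to the smaller cluster's center.
    by_distance = sorted(values_large, key=lambda v: abs(v - center_small))
    return by_distance[:max_size]


print(resolve_defaults(n_clusters=3, first_batch_loss=0.5))
projected = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(limit_cluster_size(projected, n_small=2, center_small=10.0))
```

Limiting the larger cluster in this way keeps the comparison of two clusters from being dominated by the cluster that has far more samples.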
References
Leiber, Collin and Bauer, Lena G. M. and Neumayr, Michael and Plant, Claudia and Böhm, Christian “The DipEncoder: Enforcing Multimodality in Autoencoders.” Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2022.
- fit(X: ndarray, y: ndarray = None) DipEncoder[source]
Initiate the actual clustering/dimensionality reduction process on the input data set. If no ground truth labels are given, the resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – The given (training) data set
y (np.ndarray) – The ground truth labels. If None, the DipEncoder will be used for clustering (default: None)
- Returns:
self – This instance of the DipEncoder
- Return type:
DipEncoder
- plot(X: ndarray, edge_width: float = 0.2, show_legend: bool = True) None[source]
Plot the current state of the DipEncoder. First the data set will be encoded using the neural network, afterwards the plot will be created. Uses the plot_scatter_matrix as a basis and adds projection axes in red.
- Parameters:
X (np.ndarray) – The data set
edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots
show_legend (bool) – Specifies whether a legend should be added to the plot
- predict(X: ndarray) ndarray[source]
Predicts the labels of the input data.
- Parameters:
X (np.ndarray) – input data
- Returns:
predicted_labels – The predicted labels
- Return type:
np.ndarray
- set_predict_request() DipEncoder
No-op.
Calling this method has no effect.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.ENRC(n_clusters: list, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 20, init: str = 'nrkmeans', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]
Bases: _AbstractDeepClusteringAlgo
The Embedded Non-Redundant Clustering (ENRC) algorithm.
- Parameters:
n_clusters (list) – list containing number of clusters for each clustering
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
P (list) – list containing projections for each clustering (optional) (default: None)
input_centers (list) – list containing the cluster centers for each clustering (optional) (default: None)
batch_size (int) – size of the data batches (default: 128)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)
tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1 - tolerance_threshold) for all clusterings, the training will stop before max_epochs is reached. A high value makes the training stop earlier than max_epochs; if set to 0 or None, the training continues until the labels no longer change (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)
init (str) – choose which initialization strategy should be used. Has to be one of ‘nrkmeans’, ‘random’ or ‘sgd’ (default: ‘nrkmeans’)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10,000)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy (default: True)
debug (bool) – if True additional information during the training will be printed (default: False)
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
- Raises:
ValueError – if init is not one of ‘nrkmeans’, ‘random’, ‘auto’ or ‘sgd’.
References
Miklautz, Lukas & Dominik Mautz et al. “Deep embedded non-redundant clustering.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.
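The tolerance_threshold stopping rule from the parameter list above can be sketched in plain Python. This is only an illustration under stated assumptions, not clustpy's implementation: the helper names nmi and should_stop are hypothetical, and NMI is normalized here by the arithmetic mean of the two entropies.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information (arithmetic-mean normalization, an assumption)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    # Entropy of each labeling
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both labelings consist of a single cluster
    # Mutual information from the joint label counts
    mi = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in count_ab.items())
    return mi / (0.5 * (h_a + h_b))


def should_stop(old_labels, new_labels, tolerance_threshold=None):
    """old_labels / new_labels hold one label sequence per clustering."""
    tol = tolerance_threshold or 0.0  # None is treated like 0
    eps = 1e-12  # guards against floating-point round-off when tol is 0
    return all(nmi(old, new) >= 1.0 - tol - eps
               for old, new in zip(old_labels, new_labels))
```

Because NMI ignores how clusters are numbered, a pure relabeling between two epochs counts as "no change" in this sketch.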
- fit(X: ndarray, y: ndarray = None) ENRC[source]
Cluster the input dataset with the ENRC algorithm. Saves the labels, centers, V, m, Betas, and P in the ENRC object. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – input data
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – returns the ENRC object
- Return type:
ENRC
- plot_subspace(X: ndarray, subspace_index: int = 0, labels: ndarray = None, plot_centers: bool = False, gt: ndarray = None, equal_axis: bool = False) None[source]
Plot the subspace specified by subspace_index as a scatter matrix plot.
- Parameters:
X (np.ndarray) – input data
subspace_index (int) – index of the subspace (default: 0)
labels (np.ndarray) – the labels to use for the plot. If None, the labels found by Nr-Kmeans are used (default: None)
plot_centers (bool) – plot centers if True (default: False)
gt (np.ndarray) – ground truth labels (default: None)
equal_axis (bool) – equalize axis if True (default: False)
- Return type:
scatter matrix plot of the input data
- predict(X: ndarray = None, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]
Predicts the labels for each clustering of X in a mini-batch manner.
- Parameters:
X (np.ndarray) – input data
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)
dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)
- Returns:
predicted_labels – n x c matrix, where n is the number of data points in X and c is the number of clusterings.
- Return type:
np.ndarray
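To make the n x c label matrix concrete, here is a minimal NumPy sketch of a hard assignment per clustering, in the spirit of use_P=True. It is not clustpy's code: the function name predict_hard is hypothetical, the data is assumed to be embedded already, and the centers are assumed to live in the embedded (unrotated) space so that they are rotated together with the data.

```python
import numpy as np


def predict_hard(Z, V, P, centers):
    """Z: (n, d) embedded data, V: (d, d) rotation matrix,
    P: list of index arrays (one per clustering),
    centers: list of (k_i, d) arrays (one per clustering)."""
    labels = np.zeros((Z.shape[0], len(P)), dtype=int)
    for i, (dims, mu) in enumerate(zip(P, centers)):
        V_sub = V[:, dims]   # columns of V that span subspace i
        Z_sub = Z @ V_sub    # data projected onto subspace i
        mu_sub = mu @ V_sub  # centers projected onto subspace i
        # Squared Euclidean distance of every point to every center
        sq_dist = ((Z_sub[:, None, :] - mu_sub[None, :, :]) ** 2).sum(axis=2)
        labels[:, i] = sq_dist.argmin(axis=1)
    return labels
```

Column i of the result holds the labels of clustering i, so labels[:, 0] and labels[:, 1] can be evaluated as two independent partitions of the same points.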
- reconstruct_subspace_centroids(subspace_index: int = 0) ndarray[source]
Reconstructs the centroids in the subspace specified by subspace_index.
- Parameters:
subspace_index (int) – index of the subspace (default: 0)
- Returns:
centers_rec – reconstructed centers as np.ndarray
- Return type:
np.ndarray
- set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ENRC
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.
use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- transform_full_space(X: ndarray, embedded=False) ndarray[source]
Embeds the input dataset with the neural network and rotates it with the matrix V from the ENRC object.
- Parameters:
X (np.ndarray) – input data
embedded (bool) – if True, then X is assumed to be already embedded (default: False)
- Returns:
rotated – The transformed data
- Return type:
np.ndarray
- transform_subspace(X: ndarray, subspace_index: int = 0, embedded: bool = False) ndarray[source]
Embeds the input dataset with the neural network and projects it with the matrix V onto the subspace specified by subspace_index.
- Parameters:
X (np.ndarray) – input data
subspace_index (int) – index of the subspace (default: 0)
embedded (bool) – if True, then X is assumed to be already embedded (default: False)
- Returns:
subspace – The transformed subspace
- Return type:
np.ndarray
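The relation between the two transformations can be sketched with NumPy. This assumes the data has already been embedded (embedded=True) and that a subspace corresponds to the columns of V selected by P; the function names and the random stand-in data are hypothetical and not taken from clustpy.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 4))                   # stand-in for embedded data
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
P = [np.array([0, 1]), np.array([2, 3])]      # dimensions of each clustering


def transform_full_space_sketch(Z, V):
    # Rotate the whole embedding
    return Z @ V


def transform_subspace_sketch(Z, V, P, subspace_index=0):
    # Keep only the rotated dimensions that belong to one clustering
    return Z @ V[:, P[subspace_index]]


rotated = transform_full_space_sketch(Z, V)
subspace = transform_subspace_sketch(Z, V, P, subspace_index=1)
```

Under these assumptions the subspace output is a column selection of the full-space output, and because V is orthogonal the rotation leaves pairwise distances in the full space unchanged.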
- class clustpy.deep.IDEC(n_clusters: int = 8, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: DEC
The Improved Deep Embedded Clustering (IDEC) algorithm. It is equivalent to the DEC algorithm but also uses the self-supervised learning loss during the clustering optimization. Further, clustering_loss_weight is set to 0.1 instead of 1 when using the default settings.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
alpha (float) – alpha value for the prediction (default: 1.0)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 0.1)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dec_labels_
The final DEC labels
- Type:
np.ndarray
- dec_cluster_centers_
The final DEC cluster centers
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import IDEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> idec = IDEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> idec.fit(data)
References
Guo, Xifeng, et al. “Improved deep embedded clustering with local structure preservation.” IJCAI. 2017.
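How alpha, clustering_loss_weight and ssl_loss_weight interact can be illustrated with a NumPy sketch of the objective from the DEC and IDEC papers. It is not clustpy's training code: the function names are hypothetical, and combining the two terms as a weighted sum with a per-sample average of the KL divergence is an assumption.

```python
import numpy as np


def soft_assignments(Z, centers, alpha=1.0):
    """Student's t kernel between embedded points and cluster centers."""
    sq_dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """Sharpened targets: square q and normalize by the soft cluster sizes."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)


def idec_loss(X, X_rec, Z, centers, alpha=1.0,
              clustering_loss_weight=0.1, ssl_loss_weight=1.0):
    q = soft_assignments(Z, centers, alpha)
    p = target_distribution(q)
    # KL(P || Q), averaged over the samples (averaging is an assumption)
    clustering_loss = np.sum(p * np.log(p / q)) / Z.shape[0]
    # Reconstruction term, here the mean squared error
    ssl_loss = np.mean((X - X_rec) ** 2)
    return ssl_loss_weight * ssl_loss + clustering_loss_weight * clustering_loss
```

With clustering_loss_weight at 0.1, the reconstruction term dominates in this sketch, which matches the description of IDEC as DEC plus a reconstruction loss that is kept during clustering.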
- set_predict_request(*, cluster_centers: bool | None | str = '$UNCHANGED$') IDEC
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cluster_centers (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cluster_centers parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.N2D(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, manifold_class: ~sklearn.base.TransformerMixin = <class 'sklearn.manifold._t_sne.TSNE'>, manifold_params: dict = None, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Not Too Deep (N2D) clustering algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, t-SNE/UMAP/ISOMAP is executed on the embedded data and the EM algorithm is executed.
- Parameters:
n_clusters (int) – number of clusters (default: 8)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
manifold_class (TransformerMixin) – the manifold technique class (default: TSNE)
manifold_params (dict) – Parameters for the manifold execution. For example, perplexity can be changed for TSNE by setting manifold_params to {“n_components”: 2, “perplexity”: 25}. Check out e.g. sklearn.manifold.TSNE for more information. If None, it will be set to {“n_components”: n_clusters} (default: None)
initial_clustering_params (dict) – parameters for the GMM clustering class. If None, it will be set to {} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_manifold_
The final cluster centers within the embedding of the manifold
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- manifold_
The manifold object
- Type:
TransformerMixin
- n_features_in_
the number of features used for the fitting
- Type:
int
- cluster_centers_
The final cluster centers defined as the mean of assigned samples within the AE embedding
- Type:
np.ndarray
References
McConville, Ryan, et al. “N2D: (Not too) deep clustering via clustering the local manifold of an autoencoded embedding.” 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021.
- fit(X: ndarray, y: ndarray = None) N2D[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the N2D algorithm
- Return type:
N2D
- predict(X: ndarray) ndarray[source]
Predicts the labels of the input data. Note that this is only a rough estimate, as the manifold does not learn a function f() to map the data into the final embedding. Therefore, the prediction is calculated by checking the distance to the closest mean of samples in a cluster within the embedding of the AE.
- Parameters:
X (np.ndarray) – input data
- Returns:
predicted_labels – The predicted labels
- Return type:
np.ndarray
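The fallback described for predict, together with the definition of cluster_centers_ as the mean of assigned samples, can be sketched with NumPy. The function names are hypothetical, the input is assumed to be the autoencoder embedding, and labels are assumed to be the integers 0 to k-1 without noise labels.

```python
import numpy as np


def centers_from_labels(Z, labels):
    """Mean of the embedded samples assigned to each cluster.
    Assumes labels are 0..k-1, so row k of the result belongs to cluster k."""
    return np.array([Z[labels == k].mean(axis=0) for k in np.unique(labels)])


def predict_closest_center(Z_new, centers):
    """Assign each embedded sample to the closest center (squared Euclidean distance)."""
    sq_dist = ((Z_new[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)


Z_train = np.array([[0., 0.], [0., 2.], [10., 10.], [10., 12.]])
train_labels = np.array([0, 0, 1, 1])
centers = centers_from_labels(Z_train, train_labels)
new_labels = predict_closest_center(np.array([[1., 1.], [9., 9.]]), centers)
```

Since the labels were originally found in the manifold embedding rather than in the autoencoder embedding, such a nearest-mean rule can disagree with labels_ for points near cluster boundaries.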
- class clustpy.deep.SHADE(clustering_class: ~sklearn.base.ClusterMixin | None = <class 'clustpy.hierarchical.dctree_clusterer.DCTree_Clusterer'>, clustering_params: dict = None, min_points: int = 5, use_complete_dc_tree: bool = True, use_matrix_dc_distance: bool = True, use_less_memory: bool = False, batch_size: int = 500, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 0, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~typing.Callable | ~torch.nn.modules.loss._Loss = <function mean_squared_error>, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, density_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Structure-preserving High-dimensional Analysis with Density-based Exploration (SHADE) algorithm. A neural network (autoencoder AE) will be trained with the reconstruction loss and the d_dc loss function. Afterward, KMeans or HDBSCAN identifies the initial clusters.
- Parameters:
clustering_class (ClusterMixin) – clustering class to obtain the cluster labels after getting the embedding (default: DCTree_Clusterer)
clustering_params (dict) – parameters for the clustering class. If None, it will be set to {“min_points”: min_points} (default: None)
min_points (int) – the minimum number of points (default: 5)
use_complete_dc_tree (bool) – Defines whether the complete DC Tree should be used instead of a batch-wise version (default: True)
use_matrix_dc_distance (bool) – Defines whether the matrix DC distance should be stored - can cause memory issues (default: True)
use_less_memory (bool) – Use less memory when constructing the DCTree. This will, however, increase the runtime (default: False)
batch_size (int) – Size of the data batches. (default: 500)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3}. (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network. (default: 0)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: mean_squared_error)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
density_loss_weight (float) – weight of the density loss compared to the reconstruction loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- n_clusters_
The final number of clusters
- Type:
int
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers defined as the mean of assigned samples within the AE embedding
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import SHADE
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> shade = SHADE()
>>> shade.fit(data)
References
Beer, Anna, Pascal Weber, Lukas Miklautz, Collin Leiber, Walid Durani, and Christian Böhm. “SHADE: Deep Density-based Clustering.” IEEE International Conference on Data Mining (ICDM), Abu Dhabi, United Arab Emirates, 2024, pp. 675-680, doi: 10.1109/ICDM59182.2024.
- fit(X: ndarray, y: ndarray = None) SHADE[source]
Cluster the input dataset with the SHADE algorithm. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – The given data set.
y (np.ndarray) – The labels. (can be ignored)
- Returns:
self – This instance of the SHADE algorithm.
- Return type:
SHADE
- predict(X: ndarray) ndarray[source]
Predicts the labels of the input data. Note that this is only a rough estimate, as the DC Tree is not used to predict the labels. The prediction is calculated by checking the distance to the closest mean of samples in a cluster within the embedding of the AE.
- Parameters:
X (np.ndarray) – input data
- Returns:
predicted_labels – The predicted labels
- Return type:
np.ndarray
- class clustpy.deep.VaDE(n_clusters: int = 8, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~collections.abc.Callable | ~torch.nn.modules.loss._Loss = BCELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str | ~pathlib.Path = None, embedding_size: int = 10, custom_dataloaders: tuple = None, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Variational Deep Embedding (VaDE) algorithm. First, a variational autoencoder (VAE) will be trained (will be skipped if input neural network is given). Afterward, a GMM will be fit to identify the initial clustering structures. Last, the VAE will be optimized using the VaDE loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 8)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate. If None, it will be set to {“lr”: 1e-3} (default: None)
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate. If None, it will be set to {“lr”: 1e-4} (default: None)
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (Callable | torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.BCELoss(reduction=’sum’))
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new VariationalAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str | Path) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (central layer with mean and variance) (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: GaussianMixture)
initial_clustering_params (dict) – parameters for the initial clustering class. If None, it will be set to {“n_init”: 10, “covariance_type”: “diag”} (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The labels as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- cluster_centers_
The cluster centers as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- covariances_
The covariance matrices as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- weights_
The weights as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- vade_labels_
The labels as identified by VaDE after the training terminated
- Type:
np.ndarray
- vade_cluster_centers_
The cluster centers as identified by VaDE after the training terminated
- Type:
np.ndarray
- vade_covariances_
The covariance matrices as identified by VaDE after the training terminated
- Type:
np.ndarray
- neural_network_trained_
The final neural network
- Type:
torch.nn.Module
- n_features_in_
the number of features used for the fitting
- Type:
int
Examples
>>> import numpy as np
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import VaDE
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> data = (data - np.mean(data)) / np.std(data)
>>> vade = VaDE(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> vade.fit(data)
References
Jiang, Zhuxi, et al. “Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering.” IJCAI. 2017.
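In the VaDE paper, cluster membership follows from the posterior responsibility of each Gaussian component for an embedded point. The NumPy sketch below shows that computation for diagonal covariances, matching the default covariance_type of “diag”. The function name is hypothetical, and clustpy may differ in details such as how the embedding is sampled.

```python
import numpy as np


def gmm_responsibilities(Z, weights, means, variances):
    """Z: (n, d) embedded data, weights: (k,) mixture weights,
    means and variances: (k, d) parameters of diagonal Gaussians."""
    diff = Z[:, None, :] - means[None, :, :]
    # Log density of every point under every diagonal Gaussian
    log_pdf = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_joint = np.log(weights)[None, :] + log_pdf
    # Subtract the row-wise maximum for numerical stability
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)
    gamma = np.exp(log_joint)
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Taking the argmax over each row of the returned matrix yields hard labels of the kind stored in attributes such as vade_labels_.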
- fit(X: ndarray, y: ndarray = None) VaDE[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the VaDE algorithm
- Return type:
VaDE
- clustpy.deep.decode_batchwise(dataloader: DataLoader, neural_network: Module) ndarray[source]
Utility function for decoding the whole data set in a mini-batch fashion, e.g., with an autoencoder. Note: Assumes an implemented decode function
- Parameters:
dataloader (torch.utils.data.DataLoader) – data to decode
neural_network (torch.nn.Module) – the neural network that is used for the decoding (e.g. an autoencoder)
- Returns:
decodings_numpy – The decoded data set
- Return type:
np.ndarray
- clustpy.deep.detect_device(device: device | int | str = None) device[source]
Automatically detects if you have a CUDA-enabled GPU. The device can also be read from the environment variable "CLUSTPY_DEVICE". It can be set using, e.g., os.environ["CLUSTPY_DEVICE"] = "cuda:1"
- Parameters:
device (torch.device | int | str) – the input device. Will be returned if it is not None (default: None)
- Returns:
device – device on which the prediction should take place
- Return type:
torch.device
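The resolution order described above can be sketched as follows. This is a minimal illustration, not clustpy's implementation: resolve_device_sketch is a hypothetical helper, and the precedence of the environment variable over automatic detection is an assumption. Only the rule that a non-None device is returned comes from the documentation.

```python
import os

import torch


def resolve_device_sketch(device=None):
    """Hypothetical helper, not clustpy's implementation."""
    # 1) Documented behavior: an explicitly passed device is returned.
    if device is not None:
        return torch.device(device)
    # 2) Assumption: the environment variable is consulted next.
    env_device = os.environ.get("CLUSTPY_DEVICE")
    if env_device is not None:
        return torch.device(env_device)
    # 3) Assumption: simplified fallback; selecting the GPU with the most
    #    free memory is omitted here.
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


os.environ["CLUSTPY_DEVICE"] = "cpu"
print(resolve_device_sketch())       # device taken from the environment variable
print(resolve_device_sketch("cpu"))  # explicit argument wins
```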
- clustpy.deep.encode_batchwise(dataloader: DataLoader, neural_network: Module) ndarray[source]
Utility function for embedding the whole data set in a mini-batch fashion
- Parameters:
dataloader (torch.utils.data.DataLoader) – data to embed
neural_network (torch.nn.Module) – the neural network that is used for the encoding (e.g. an autoencoder)
- Returns:
embeddings_numpy – The embedded data set
- Return type:
np.ndarray
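A batch-wise embedding loop of this kind might look like the sketch below. TinyAutoencoder and encode_batchwise_sketch are hypothetical names introduced for illustration. The sketch assumes that the network exposes an encode method and that each batch follows the [index, data, …] layout, so it should be read as an approximation of the idea rather than the library's code.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


class TinyAutoencoder(torch.nn.Module):
    """Hypothetical toy network exposing encode() and decode()."""

    def __init__(self, n_features, embedding_size):
        super().__init__()
        self.encoder = torch.nn.Linear(n_features, embedding_size)
        self.decoder = torch.nn.Linear(embedding_size, n_features)

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)


def encode_batchwise_sketch(dataloader, neural_network):
    # Assumption: all parameters of the network live on one device.
    device = next(neural_network.parameters()).device
    embeddings = []
    with torch.no_grad():  # no gradients are needed for inference
        for batch in dataloader:
            # batch[0] holds the sample indices, batch[1] the data samples
            batch_data = batch[1].to(device)
            embeddings.append(neural_network.encode(batch_data).cpu().numpy())
    return np.concatenate(embeddings, axis=0)


X = torch.rand(100, 8)
indices = torch.arange(X.shape[0])
# shuffle=False keeps the embeddings aligned with the rows of X
loader = DataLoader(TensorDataset(indices, X), batch_size=32, shuffle=False)
embedded = encode_batchwise_sketch(loader, TinyAutoencoder(8, 2))
print(embedded.shape)  # (100, 2)
```

Processing the data in mini-batches keeps memory usage bounded by the batch size, which matters when the data set does not fit on the device at once.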
- clustpy.deep.encode_decode_batchwise(dataloader: ~torch.utils.data.dataloader.DataLoader, neural_network: ~torch.nn.modules.module.Module) -> (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Utility function for encoding and decoding the whole data set in a mini-batch fashion, e.g., with an autoencoder. Note: Assumes an implemented decode function
- Parameters:
dataloader (torch.utils.data.DataLoader) – dataloader to be used
neural_network (torch.nn.Module) – the neural network that is used for the encoding and decoding (e.g. an autoencoder)
- Returns:
tuple – The embedded data set, The decoded data set
- Return type:
(np.ndarray, np.ndarray)
- clustpy.deep.get_dataloader(X: ~numpy.ndarray | ~torch.Tensor, batch_size: int = 256, shuffle: bool = True, drop_last: bool = False, additional_inputs: list | ~numpy.ndarray | ~torch.Tensor = None, dataset_class: ~torch.utils.data.dataset.Dataset = <class 'clustpy.deep._data_utils._ClustpyDataset'>, ds_kwargs: dict = None, dl_kwargs: dict = None) DataLoader[source]
Create a dataloader for Deep Clustering algorithms. First entry always contains the indices of the data samples. Second entry always contains the actual data samples. If for example labels are desired, they can be passed through the additional_inputs parameter (should be a list). Other customizations (e.g. augmentation) can be implemented using a custom dataset_class. This custom class should stick to the conventions, [index, data, …].
- Parameters:
X (np.ndarray | torch.Tensor) – the actual data set (can be np.ndarray or torch.Tensor)
batch_size (int) – the batch size (default: 256)
shuffle (bool) – boolean that defines if the data set should be shuffled (default: True)
drop_last (bool) – boolean that defines if the last batch should be ignored (default: False)
additional_inputs (list | np.ndarray | torch.Tensor) – additional inputs for the dataloader, e.g. labels or neighbors. Can be None, np.ndarray, torch.Tensor or a list containing np.ndarrays/torch.Tensors (default: None)
dataset_class (torch.utils.data.Dataset) – defines the class of the tensor dataset that is contained in the dataloader (default: _ClustpyDataset)
ds_kwargs (dict) –
other arguments for dataset_class. An example usage would be to add augmentation or preprocessing transforms to the _ClustpyDataset by passing ds_kwargs={"aug_transforms_list":[aug_transforms], "orig_transforms_list":[orig_transforms]}, where aug_transforms and orig_transforms transform the input X, e.g., using torchvision.transforms.Compose to combine multiple transformations.
- Important: If aug_transforms_list is passed via ds_kwargs, the values returned by the dataloader change. The first entry will still be the indices of the data samples, but the second entry will be the augmented version of the data samples and the third entry will be the original data samples. If orig_transforms_list is passed as well, the third entry will be transformed accordingly, which might be needed for preprocessing the data. An example for MNIST is shown below.
dl_kwargs (dict) – other arguments for torch.utils.data.DataLoader
Examples
>>> # Example for usage of data transformations with get_dataloader
>>> from clustpy.data import load_mnist
>>> from clustpy.deep import get_dataloader
>>> import torch
>>> import torchvision
>>> # load and prepare data for torchvision.transforms
>>> data, labels = load_mnist()
>>> data = data.reshape(-1, 1, 28, 28)
>>> data /= 255.0
>>> data = torch.from_numpy(data).float()
>>> #
>>> # preprocessing functions
>>> mean = data.mean()
>>> std = data.std()
>>> normalize_fn = torchvision.transforms.Normalize([mean], [std])
>>> # flatten is only needed if a FeedForward network is used, otherwise this can be skipped.
>>> flatten_fn = torchvision.transforms.Lambda(torch.flatten)
>>> #
>>> # augmentation transforms
>>> transform_list = [
...     # transform input tensor to PIL image for augmentation
...     torchvision.transforms.ToPILImage(),
...     # apply transformations
...     torchvision.transforms.RandomAffine(degrees=(-16, +16),
...                                         translate=(0.1, 0.1),
...                                         shear=(-8, 8),
...                                         fill=0),
...     # transform back to torch.tensor
...     torchvision.transforms.ToTensor(),
...     # preprocess and flatten
...     normalize_fn,
...     flatten_fn,
... ]
>>> #
>>> # augmentation transforms
>>> aug_transforms = torchvision.transforms.Compose(transform_list)
>>> # preprocessing transforms without augmentation
>>> orig_transforms = torchvision.transforms.Compose([normalize_fn, flatten_fn])
>>> #
>>> # pass transforms to dataloader
>>> aug_dl = get_dataloader(data, batch_size=32, shuffle=True,
...                         ds_kwargs={"aug_transforms_list": [aug_transforms],
...                                    "orig_transforms_list": [orig_transforms]})
- Returns:
dataloader – The final dataloader
- Return type:
torch.utils.data.DataLoader
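The [index, data, …] convention for custom dataset classes can be illustrated with a small stand-alone dataset. IndexedDatasetSketch is a hypothetical class. The constructor signature that get_dataloader expects from a dataset_class is not stated in this section, so the sketch uses torch.utils.data.DataLoader directly and only demonstrates the layout of the returned entries.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class IndexedDatasetSketch(Dataset):
    """Hypothetical dataset following the [index, data, ...] convention."""

    def __init__(self, X, *additional_inputs):
        self.X = torch.as_tensor(X)
        self.additional_inputs = [torch.as_tensor(a) for a in additional_inputs]

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        # first entry: sample index, second entry: sample, then any extras
        return (idx, self.X[idx], *[a[idx] for a in self.additional_inputs])


X = torch.rand(10, 4)
labels = torch.arange(10) % 2
loader = DataLoader(IndexedDatasetSketch(X, labels), batch_size=5, shuffle=True)
for batch_indices, batch_data, batch_labels in loader:
    # the indices map each row of a shuffled batch back to the full data set
    assert torch.equal(batch_data, X[batch_indices])
    assert torch.equal(batch_labels, labels[batch_indices])
```

Returning the indices as the first entry is what allows per-sample state (e.g. cluster assignments) to be looked up even when shuffle is True.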
- clustpy.deep.get_default_augmented_dataloaders(X: ~numpy.ndarray | ~torch.Tensor, batch_size: int = 256, conv_used: bool = False, flatten: bool = True) -> (<class 'torch.utils.data.dataloader.DataLoader'>, <class 'torch.utils.data.dataloader.DataLoader'>)[source]
Get a train and a test dataloader using default augmentations. These transformations correspond to a min-max normalization followed by torchvision.transforms.RandomAffine(degrees=(-16, +16), translate=(0.1, 0.1), shear=(-8, 8), fill=0) and a channel-wise z-transformation. Optionally, the images can be flattened afterward.
- Parameters:
X (np.ndarray | torch.Tensor) – the actual data set (can be np.ndarray or torch.Tensor)
batch_size (int) – the batch size (default: 256)
conv_used (bool) – defines whether a convolutional network will be used afterward. In this case, grayscale images will be expanded to three color channels by copying the grayscale channel three times (default: False)
flatten (bool) – defines whether the augmented images should be flattened afterward. Must be False if conv_used is True (default: True)
- Returns:
tuple – The trainloader (with augmentations), The testloader (without augmentations)
- Return type:
(torch.utils.data.DataLoader, torch.utils.data.DataLoader)
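The deterministic part of this pipeline (everything except the random affine augmentation) could be sketched as below. preprocess_sketch is a hypothetical helper. The exact point at which the library copies the grayscale channel, and how it computes the normalization statistics, are assumptions made for this illustration.

```python
import torch


def preprocess_sketch(X, conv_used=False, flatten=True):
    """Hypothetical helper; the random augmentation step is omitted."""
    if conv_used and flatten:
        raise ValueError("flatten must be False if conv_used is True")
    X = torch.as_tensor(X, dtype=torch.float32)
    if X.dim() == 3:
        X = X.unsqueeze(1)  # (N, H, W) -> (N, 1, H, W)
    # min-max normalization to [0, 1]
    X = (X - X.min()) / (X.max() - X.min())
    if conv_used and X.shape[1] == 1:
        # Assumption: the grayscale channel is copied before the z-transformation.
        X = X.repeat(1, 3, 1, 1)
    # channel-wise z-transformation (statistics over samples, height and width)
    mean = X.mean(dim=(0, 2, 3), keepdim=True)
    std = X.std(dim=(0, 2, 3), keepdim=True)
    X = (X - mean) / std
    return X.flatten(start_dim=1) if flatten else X


images = torch.rand(16, 28, 28) * 255
print(preprocess_sketch(images).shape)  # torch.Size([16, 784])
print(preprocess_sketch(images, conv_used=True, flatten=False).shape)  # torch.Size([16, 3, 28, 28])
```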
- clustpy.deep.get_device_from_module(neural_network: Module) device[source]
Get the device from a given module.
- Parameters:
neural_network (torch.nn.Module) – the neural network whose device should be determined (e.g. an autoencoder)
- Returns:
device – device of the module
- Return type:
torch.device
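A common PyTorch idiom for this lookup is to inspect the first parameter of the module. The sketch below uses a hypothetical helper name and assumes that the module has at least one parameter and that all parameters share one device. Whether clustpy uses exactly this approach is not stated here.

```python
import torch


def device_of_module_sketch(neural_network):
    # Assumption: at least one parameter exists and all share the same device.
    return next(neural_network.parameters()).device


layer = torch.nn.Linear(4, 2)
print(device_of_module_sketch(layer))  # the device the layer was created on
```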
- clustpy.deep.mean_squared_error(tensor1: Tensor, tensor2: Tensor, weights: Tensor = None) Tensor[source]
Calculate the mean squared error between two tensors. Each row in the tensors is interpreted as a separate object, while each column represents its features. Optionally, features can be individually weighted. The default behavior is that all features are weighted by 1.
- Parameters:
tensor1 (torch.Tensor) – the first tensor
tensor2 (torch.Tensor) – the second tensor
weights (torch.Tensor) – tensor containing the weights of the features (default: None)
- Returns:
mse – the mean squared error
- Return type:
torch.Tensor
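One plausible reading of the feature-weighted mean squared error is sketched below. weighted_mse_sketch is a hypothetical function. In particular, scaling the squared differences by the weights is an assumption: the library might apply the weights at a different point.

```python
import torch


def weighted_mse_sketch(tensor1, tensor2, weights=None):
    # rows are objects, columns are features
    squared_diffs = (tensor1 - tensor2) ** 2
    if weights is not None:
        # Assumption: each feature's squared difference is scaled by its weight.
        squared_diffs = squared_diffs * weights
    return squared_diffs.mean()


a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[1.0, 0.0], [1.0, 4.0]])
print(weighted_mse_sketch(a, b))                            # tensor(2.)
print(weighted_mse_sketch(a, b, torch.tensor([1.0, 0.0])))  # tensor(1.)
```

In the second call the weight 0 removes the second feature from the error, so only the difference in the first column contributes.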
- clustpy.deep.predict_batchwise(dataloader: DataLoader, neural_network: Module, cluster_module: Module) ndarray[source]
Utility function for predicting the cluster labels over the whole data set in a mini-batch fashion. Method calls the predict_hard method of the cluster_module for each batch of data.
- Parameters:
dataloader (torch.utils.data.DataLoader) – dataloader to be used
neural_network (torch.nn.Module) – the neural network that is used for the encoding (e.g. an autoencoder)
cluster_module (torch.nn.Module) – the cluster module that is used for the prediction (e.g. DEC). Must provide the predict_hard method, which is called for each batch.
- Returns:
predictions_numpy – The predictions of the cluster_module for the data set
- Return type:
np.ndarray
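The interplay between the network and the cluster module might be sketched as follows. ToyEncoder, NearestCenterModule and predict_batchwise_sketch are hypothetical names. The sketch also assumes that predict_hard receives the embedded batch and returns one label per sample, so the real signature should be checked in the library.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


class ToyEncoder(torch.nn.Module):
    """Hypothetical network whose encode() is a single linear layer."""

    def __init__(self, n_features, embedding_size):
        super().__init__()
        self.layer = torch.nn.Linear(n_features, embedding_size)

    def encode(self, x):
        return self.layer(x)


class NearestCenterModule(torch.nn.Module):
    """Hypothetical cluster module; predict_hard picks the closest center."""

    def __init__(self, centers):
        super().__init__()
        self.centers = torch.nn.Parameter(centers)

    def predict_hard(self, embedded):
        return torch.cdist(embedded, self.centers).argmin(dim=1)


def predict_batchwise_sketch(dataloader, neural_network, cluster_module):
    device = next(neural_network.parameters()).device
    predictions = []
    with torch.no_grad():
        for batch in dataloader:
            # batch[1] holds the data samples (batch[0] the indices)
            embedded = neural_network.encode(batch[1].to(device))
            predictions.append(cluster_module.predict_hard(embedded).cpu().numpy())
    return np.concatenate(predictions)


X = torch.rand(50, 6)
loader = DataLoader(TensorDataset(torch.arange(50), X), batch_size=16, shuffle=False)
predicted = predict_batchwise_sketch(loader, ToyEncoder(6, 2), NearestCenterModule(torch.rand(3, 2)))
print(predicted.shape)  # (50,)
```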