clustpy.deep package
Subpackages
- clustpy.deep.neural_networks package
- Submodules
- clustpy.deep.neural_networks.convolutional_autoencoder module
- clustpy.deep.neural_networks.feedforward_autoencoder module
- clustpy.deep.neural_networks.neighbor_encoder module
- clustpy.deep.neural_networks.stacked_autoencoder module
- clustpy.deep.neural_networks.variational_autoencoder module
- Module contents
Submodules
clustpy.deep.aec module
@authors: Collin Leiber
- class clustpy.deep.aec.AEC(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = None, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Auto-encoder Based Data Clustering (AEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, initial_clustering_class identifies the initial clusters (random labels are used if it is None). Last, the network will be optimized using the AEC loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 50)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining. If this is None, random labels will be used (default: None)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import AEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> aec = AEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> aec.fit(data)
References
Song, Chunfeng, et al. “Auto-encoder based data clustering.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 18th Iberoamerican Congress, CIARP 2013, Havana, Cuba, November 20-23, 2013, Proceedings, Part I 18. Springer Berlin Heidelberg, 2013.
- fit(X: ndarray, y: ndarray = None) → AEC
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the AEC algorithm
- Return type:
AEC
clustpy.deep.dcn module
@authors: Lukas Miklautz, Dominik Mautz
- class clustpy.deep.dcn.DCN(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 50, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 0.05, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Clustering Network (DCN) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DCN loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 50)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 50)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
clustering_loss_weight (float) – weight of the clustering loss (default: 0.05)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dcn_labels_
The final DCN labels
- Type:
np.ndarray
- dcn_cluster_centers_
The final DCN cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DCN
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dcn = DCN(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dcn.fit(data)
References
Yang, Bo, et al. “Towards k-means-friendly spaces: Simultaneous deep learning and clustering.” international conference on machine learning. PMLR, 2017.
- fit(X: ndarray, y: ndarray = None) → DCN
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DCN algorithm
- Return type:
DCN
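The roles of clustering_loss_weight and ssl_loss_weight can be illustrated with a small, library-independent sketch. It assumes, following Yang et al., that the clustering term is the squared distance between an embedded sample and its nearest cluster center, and that the total loss is a weighted sum of this term and a reconstruction error. The helper names (squared_distance, nearest_center, dcn_style_loss) are hypothetical and not part of clustpy; the library's own loss may differ in details such as scaling or batching.

```python
# Hypothetical sketch of a DCN-style objective; not clustpy's implementation.

def squared_distance(a, b):
    """Squared Euclidean distance between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(embedded, centers):
    """Return the index of the center closest to the embedded sample."""
    return min(range(len(centers)),
               key=lambda j: squared_distance(embedded, centers[j]))


def dcn_style_loss(sample, reconstruction, embedded, centers,
                   clustering_loss_weight=0.05, ssl_loss_weight=1.0):
    """Weighted sum of reconstruction error and distance to the assigned center."""
    # Self-supervised part: mean squared reconstruction error.
    ssl_loss = squared_distance(sample, reconstruction) / len(sample)
    # Clustering part: squared distance to the nearest cluster center.
    assigned = nearest_center(embedded, centers)
    clustering_loss = squared_distance(embedded, centers[assigned])
    total = ssl_loss_weight * ssl_loss + clustering_loss_weight * clustering_loss
    return total, assigned


centers = [[0.0, 0.0], [4.0, 4.0]]
loss, label = dcn_style_loss(
    sample=[1.0, 2.0, 3.0],
    reconstruction=[1.0, 2.0, 2.0],
    embedded=[3.0, 4.0],
    centers=centers,
)
print(label, loss)
```

With the default weights shown in the signature, the reconstruction term dominates and the clustering term only gently pulls embedded samples towards their centers.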
clustpy.deep.ddc_n2d module
@authors: Collin Leiber
- class clustpy.deep.ddc_n2d.DDC(ratio: float = 0.1, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, tsne_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Density-based Image Clustering (DDC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, t-SNE is executed on the embedded data and a variant of the Density Peak Clustering algorithm is executed.
- Parameters:
ratio (float) – The ratio parameter, defining the cutoff distance d_c by calculating: average pairwise distance * ratio (default: 0.1)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
tsne_params (dict) – Parameters for the t-SNE execution. For example, perplexity can be changed by setting tsne_params to {“n_components”: 2, “perplexity”: 25}. Check out sklearn.manifold.TSNE for more information (default: {“n_components”: 2})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- n_clusters_
The final number of clusters
- Type:
int
- labels_
The final labels (obtained by a variant of Density Peak Clustering)
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
- tsne_
The t-SNE object
- Type:
TSNE
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DDC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> ddc = DDC(pretrain_epochs=3)
>>> ddc.fit(data)
References
Ren, Yazhou, et al. “Deep density-based image clustering.” Knowledge-Based Systems 197 (2020): 105841.
- fit(X: ndarray, y: ndarray = None) → DDC
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DDC algorithm
- Return type:
DDC
- class clustpy.deep.ddc_n2d.DDC_density_peak_clustering(ratio: float)[source]
Bases: BaseEstimator, ClusterMixin
A variant of the Density Peak Algorithm as proposed in the DDC paper.
- Parameters:
ratio (float) – The ratio parameter, defining the cutoff distance d_c by calculating: average pairwise distance * ratio
- n_clusters_
The final number of clusters
- Type:
int
- labels_
The final labels
- Type:
np.ndarray
References
Ren, Yazhou, et al. “Deep density-based image clustering.” Knowledge-Based Systems 197 (2020): 105841.
- fit(X: ndarray, y: ndarray = None) → DDC_density_peak_clustering
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DDC variant of the Density Peak Clustering algorithm
- Return type:
DDC_density_peak_clustering
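The effect of the ratio parameter can be sketched without the library. The snippet below computes the cutoff distance d_c as the average pairwise distance multiplied by ratio, exactly as described for the parameter. The neighbor-count density that follows is the cutoff kernel of the original Density Peak Clustering algorithm and is only assumed here for illustration; the DDC variant may compute densities differently. The function names cutoff_distance and local_density are hypothetical.

```python
# Hypothetical sketch of the cutoff distance d_c; not clustpy's implementation.
from itertools import combinations
from math import dist


def cutoff_distance(points, ratio=0.1):
    """d_c = average pairwise distance * ratio."""
    pairwise = [dist(p, q) for p, q in combinations(points, 2)]
    return ratio * sum(pairwise) / len(pairwise)


def local_density(points, d_c):
    """Count, for every point, the other points closer than d_c (assumed kernel)."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and dist(p, q) < d_c)
        for i, p in enumerate(points)
    ]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
d_c = cutoff_distance(points, ratio=0.2)
densities = local_density(points, d_c)
print(d_c, densities)
```

A larger ratio widens the neighborhood, so more points count as dense; a smaller ratio makes isolated points such as the last one above stand out more clearly.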
- class clustpy.deep.ddc_n2d.N2D(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, manifold_class: ~sklearn.base.TransformerMixin = <class 'sklearn.manifold._t_sne.TSNE'>, manifold_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Not Too Deep (N2D) clustering algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, a manifold technique (e.g. t-SNE, UMAP or ISOMAP) is executed on the embedded data and the result is clustered using the EM algorithm.
- Parameters:
n_clusters (int) – number of clusters
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
manifold_class (TransformerMixin) – the manifold technique class (default: TSNE)
manifold_params (dict) – Parameters for the manifold execution. For example, perplexity can be changed for TSNE by setting manifold_params to {“n_components”: 2, “perplexity”: 25}. Check out e.g. sklearn.manifold.TSNE for more information (default: {“n_components”: 2})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- n_clusters
The final number of clusters
- Type:
int
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
- manifold_
The manifold object
- Type:
TransformerMixin
References
McConville, Ryan, et al. “N2d:(not too) deep clustering via clustering the local manifold of an autoencoded embedding.” 2020 25th international conference on pattern recognition (ICPR). IEEE, 2021.
- fit(X: ndarray, y: ndarray = None) → N2D
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the N2D algorithm
- Return type:
N2D
clustpy.deep.dec module
@authors: Lukas Miklautz, Dominik Mautz, Collin Leiber
- class clustpy.deep.dec.DEC(n_clusters: int, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Embedded Clustering (DEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DEC loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
alpha (float) – alpha value for the prediction (default: 1.0)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dec_labels_
The final DEC labels
- Type:
np.ndarray
- dec_cluster_centers_
The final DEC cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dec = DEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dec.fit(data)
References
Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis.” International conference on machine learning. 2016.
- fit(X: ndarray, y: ndarray = None) → DEC
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DEC algorithm
- Return type:
DEC
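The alpha parameter can be made concrete with a short sketch of the soft assignment described by Xie et al.: each embedded sample is compared to every cluster center with a Student's t kernel whose degrees of freedom are alpha, and the kernel values are normalized to sum to 1. This is a plain-Python illustration under that assumption; the function name soft_assignment is hypothetical, and clustpy's own prediction code may be organized differently.

```python
# Hypothetical sketch of a DEC-style soft assignment; not clustpy's implementation.

def squared_distance(a, b):
    """Squared Euclidean distance between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_assignment(embedded, centers, alpha=1.0):
    """Student's t kernel between one sample and all centers, normalized to sum to 1."""
    power = -(alpha + 1.0) / 2.0
    kernel = [(1.0 + squared_distance(embedded, c) / alpha) ** power
              for c in centers]
    norm = sum(kernel)
    return [k / norm for k in kernel]


centers = [[0.0, 0.0], [3.0, 0.0]]
q = soft_assignment([1.0, 0.0], centers, alpha=1.0)
print(q)
```

The index of the largest entry of q corresponds to the hard cluster label of the sample.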
- class clustpy.deep.dec.IDEC(n_clusters: int, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: DEC
The Improved Deep Embedded Clustering (IDEC) algorithm. It is identical to the DEC algorithm, except that the self-supervised learning loss is also used during the clustering optimization. In addition, the default clustering_loss_weight is 0.1 instead of 1.0.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
alpha (float) – alpha value for the prediction (default: 1.0)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 0.1)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dec_labels_
The final DEC labels
- Type:
np.ndarray
- dec_cluster_centers_
The final DEC cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import IDEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> idec = IDEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> idec.fit(data)
References
Guo, Xifeng, et al. “Improved deep embedded clustering with local structure preservation.” IJCAI. 2017.
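How the self-supervised loss enters the clustering optimization can be sketched as follows. The snippet assumes the formulation of the cited papers: a target distribution P is derived from the soft assignments Q by squaring and renormalizing, the clustering loss is KL(P || Q), and the total loss is a weighted sum controlled by ssl_loss_weight and clustering_loss_weight. All function names are hypothetical and the soft assignments are made-up numbers; this is not clustpy's implementation.

```python
# Hypothetical sketch of an IDEC-style objective; not clustpy's implementation.
from math import log


def target_distribution(q):
    """p_ij is proportional to q_ij^2 / f_j, where f_j = sum_i q_ij."""
    n_clusters = len(q[0])
    cluster_freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / cluster_freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q), summed over all samples and clusters."""
    return sum(
        p_ij * log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
        if p_ij > 0
    )


def idec_style_loss(q, ssl_loss, clustering_loss_weight=0.1, ssl_loss_weight=1.0):
    """Weighted sum of the self-supervised loss and the KL clustering loss."""
    p = target_distribution(q)
    return ssl_loss_weight * ssl_loss + clustering_loss_weight * kl_divergence(p, q)


# Illustrative soft assignments of two samples to two clusters.
q = [[0.8, 0.2], [0.3, 0.7]]
p = target_distribution(q)
loss = idec_style_loss(q, ssl_loss=0.25)
print(p, loss)
```

Setting ssl_loss_weight to 0 in this sketch leaves only the scaled clustering term, which corresponds to the DEC-style objective.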
clustpy.deep.deepect module
@authors: Collin Leiber, Julian Schilcher
- class clustpy.deep.deepect.DeepECT(max_n_leaf_nodes: int = 20, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 50, clustering_epochs: int = 200, grow_interval: int = 2, pruning_threshold: float = 0.1, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Embedded Cluster Tree (DeepECT) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, a cluster tree will be grown and the network will be optimized using the DeepECT loss function.
- Parameters:
max_n_leaf_nodes (int) – Maximum number of leaf nodes in the cluster tree (default: 20)
batch_size (int) – Size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 50)
clustering_epochs (int) – Number of epochs for the actual clustering procedure (default: 200)
grow_interval (int) – Number of epochs after which the tree is grown (default: 2)
pruning_threshold (float) – The threshold for pruning the tree (default: 0.1)
optimizer_class (torch.optim.Optimizer) – The optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – Size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState) – Use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- tree_
The prediction cluster tree after training
- Type:
PredictionClusterTree
- neural_network
The final neural network
- Type:
torch.nn.Module
- fit(X: ndarray, y: ndarray = None) DeepECT[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – This instance of the DeepECT algorithm
- Return type:
DeepECT
- flat_clustering(n_leaf_nodes_to_keep: int) ndarray[source]
Transform the predicted labels into a flat clustering result by only keeping n_leaf_nodes_to_keep leaf nodes in the tree. Returns labels as if the clustering procedure would have stopped at the specified number of nodes. Note that each leaf node corresponds to a cluster.
- Parameters:
n_leaf_nodes_to_keep (int) – The number of leaf nodes to keep in the cluster tree
- Returns:
labels_pruned – The new cluster labels
- Return type:
np.ndarray
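The idea behind flat_clustering can be illustrated with a small toy tree. The sketch below is not clustpy's implementation: the Node class, the split_order field and the flat_label helper are hypothetical names introduced for illustration, and pruning is ignored. It assumes that each split is numbered in the order it happened, so that keeping n leaf nodes means honouring only the first n - 1 splits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a cluster tree node (not clustpy's class)."""
    node_id: int
    split_order: Optional[int] = None  # 1 = first split; None = leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def flat_label(root: Node, path: str, n_leaf_nodes_to_keep: int) -> int:
    """Walk `path` ('L'/'R' moves) from the root, ignoring late splits."""
    node = root
    for move in path:
        # A binary tree with k splits has k + 1 leaves, so only the
        # first n_leaf_nodes_to_keep - 1 splits are honoured.
        if node.split_order is None or node.split_order > n_leaf_nodes_to_keep - 1:
            break
        node = node.left if move == "L" else node.right
    return node.node_id


# Toy tree: the root was split first, then its right child, then its left child.
root = Node(0, split_order=1,
            left=Node(1, split_order=3, left=Node(3), right=Node(4)),
            right=Node(2, split_order=2, left=Node(5), right=Node(6)))
paths = ["LL", "LR", "RL", "RR"]  # one root-to-leaf path per sample

labels_two = [flat_label(root, p, 2) for p in paths]    # [1, 1, 2, 2]
labels_three = [flat_label(root, p, 3) for p in paths]  # [1, 1, 5, 6]
```

The returned values are node ids of the toy tree, not consecutive cluster labels; they only show which samples end up in the same cluster when fewer leaf nodes are kept.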
clustpy.deep.dipdeck module
@authors: Collin Leiber
- class clustpy.deep.dipdeck.DipDECK(n_clusters_init: int = 35, dip_merge_threshold: float = 0.9, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, max_n_clusters: int = inf, min_n_clusters: int = 1, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 5, max_cluster_size_diff_factor: float = 2, pval_strategy: str = 'table', n_boots: int = 1000, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Embedded Clustering with k-Estimation (DipDECK) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters using an overestimated number of clusters. Last, the network will be optimized using the DipDECK loss function. If any Dip-value exceeds the dip_merge_threshold, the corresponding clusters will be merged.
- Parameters:
n_clusters_init (int) – initial number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 35)
dip_merge_threshold (float) – threshold regarding the Dip-p-value that defines if two clusters should be merged. Must be between 0 and 1 (default: 0.9)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
max_n_clusters (int) – maximum number of clusters. Must be larger than min_n_clusters. If the result has more clusters, a merge will be forced (default: np.inf)
min_n_clusters (int) – minimum number of clusters. Must be larger than 0, smaller than max_n_clusters and smaller than n_clusters_init. When this number of clusters is reached, all further merge processes will be hindered (default: 1)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure. Will reset after each merge (default: 50)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 5)
max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for the Dip calculation (default: 2)
pval_strategy (str) – Defines which strategy to use to obtain Dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)
n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
- labels_
The final labels
- Type:
np.ndarray
- n_clusters_
The final number of clusters
- Type:
int
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipDECK
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipdeck = DipDECK(pretrain_epochs=3, clustering_epochs=3)
>>> dipdeck.fit(data)
References
Leiber, Collin, et al. “Dip-based deep embedded clustering with k-estimation.” Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021.
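How dip_merge_threshold, min_n_clusters and max_n_clusters interact can be sketched as a small decision function. This is a simplified illustration based only on the parameter descriptions above, not the library's code: choose_merge is a hypothetical helper, the Dip-p-values are passed in as a plain dict, and the choice of the pair with the highest p-value for a forced merge is an assumption.

```python
import math


def choose_merge(dip_pvalues: dict, n_clusters: int,
                 dip_merge_threshold: float = 0.9,
                 min_n_clusters: int = 1,
                 max_n_clusters: float = math.inf):
    """Return the cluster pair (i, j) to merge next, or None.

    dip_pvalues maps a pair (i, j) with i < j to its Dip-p-value.
    """
    # Once min_n_clusters is reached, no further merges are allowed.
    if n_clusters <= min_n_clusters or not dip_pvalues:
        return None
    # Assumption: the pair with the highest p-value is the merge candidate.
    best_pair = max(dip_pvalues, key=dip_pvalues.get)
    exceeds_threshold = dip_pvalues[best_pair] > dip_merge_threshold
    too_many_clusters = n_clusters > max_n_clusters  # forces a merge
    return best_pair if exceeds_threshold or too_many_clusters else None


pvalues = {(0, 1): 0.95, (0, 2): 0.20, (1, 2): 0.50}
merge = choose_merge(pvalues, n_clusters=3)
```

With the default threshold of 0.9, only the pair (0, 1) qualifies in this toy input; raising the threshold above 0.95 would leave all three clusters untouched unless max_n_clusters forces a merge.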
- fit(X: ndarray, y: ndarray = None) DipDECK[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DipDECK algorithm
- Return type:
DipDECK
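The effect of the max_cluster_size_diff_factor parameter of DipDECK can be shown with plain arithmetic. The helper below is a hypothetical sketch that only computes how many samples of each cluster would enter the Dip calculation; which samples count as the closest ones is not covered here, and rounding the cap down to an integer is an assumption.

```python
def samples_used_for_dip(size_a: int, size_b: int,
                         max_cluster_size_diff_factor: float = 2) -> tuple:
    """Return (samples used from smaller cluster, samples used from larger cluster)."""
    smaller, larger = sorted((size_a, size_b))
    # The larger cluster contributes at most factor * (size of smaller cluster).
    cap = int(max_cluster_size_diff_factor * smaller)
    return smaller, min(larger, cap)


unbalanced = samples_used_for_dip(100, 350)  # larger cluster is capped at 200
balanced = samples_used_for_dip(100, 150)    # below the cap, all samples used
```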
clustpy.deep.dipencoder module
@authors: Collin Leiber
- class clustpy.deep.dipencoder.DipEncoder(n_clusters: int, batch_size: int = None, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, max_cluster_size_diff_factor: float = 3, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The DipEncoder. Can be used either as a clustering procedure if no ground truth labels are given or as a supervised dimensionality reduction technique. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DipEncoder loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
batch_size (int) – size of the data batches for the actual training of the DipEncoder. Should be larger the more clusters we have. If it is None, it will be set to (25 x n_clusters) (default: None)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used (default: 3)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss. If None, it will be equal to 1/(4L), where L is the reconstruction loss of the first batch of an untrained neural network (default: None)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels
- Type:
np.ndarray
- projection_axes_
The final projection axes between the clusters
- Type:
np.ndarray
- index_dict_
A dictionary to match the indices of two clusters to a projection axis
- Type:
dict
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipEncoder
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipencoder = DipEncoder(3, pretrain_epochs=3, clustering_epochs=3)
>>> dipencoder.fit(data)
References
Leiber, Collin and Bauer, Lena G. M. and Neumayr, Michael and Plant, Claudia and Böhm, Christian “The DipEncoder: Enforcing Multimodality in Autoencoders.” Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2022.
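Two DipEncoder parameters fall back to computed defaults when they are None: batch_size becomes 25 x n_clusters and ssl_loss_weight becomes 1/(4L). The following sketch restates these two rules as plain functions. The function names are hypothetical, and first_batch_loss stands for L, the reconstruction loss of the first batch of an untrained network, which is assumed to be supplied by the caller.

```python
def default_batch_size(n_clusters: int) -> int:
    """Documented fallback when batch_size is None: 25 x n_clusters."""
    return 25 * n_clusters


def default_ssl_loss_weight(first_batch_loss: float) -> float:
    """Documented fallback when ssl_loss_weight is None: 1 / (4L)."""
    return 1.0 / (4.0 * first_batch_loss)


batch_size = default_batch_size(3)               # 75 samples per batch
ssl_loss_weight = default_ssl_loss_weight(0.5)   # 1 / (4 * 0.5) = 0.5
```

The batch size grows with the number of clusters, which matches the note that it should be larger the more clusters there are.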
- fit(X: ndarray, y: ndarray = None) DipEncoder[source]
Initiate the actual clustering/dimensionality reduction process on the input data set. If no ground truth labels are given, the resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – The given (training) data set
y (np.ndarray) – The ground truth labels. If None, the DipEncoder will be used for clustering (default: None)
- Returns:
self – This instance of the DipEncoder
- Return type:
DipEncoder
- plot(X: ndarray, edge_width: float = 0.2, show_legend: bool = True) None[source]
Plot the current state of the DipEncoder. First the data set will be encoded using the neural network, afterwards the plot will be created. Uses the plot_scatter_matrix as a basis and adds projection axes in red.
- Parameters:
X (np.ndarray) – The data set
edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots
show_legend (bool) – Specifies whether a legend should be added to the plot
- predict(X_train: ndarray, X_test: ndarray) ndarray[source]
Predict the labels of the X_test dataset using the information gained by the fit function and the X_train dataset. Beware that the current labels influence the labels obtained by predict(). Therefore, it can occur that the outcome of dipencoder.fit(X) does not match dipencoder.predict(X).
- Parameters:
X_train (np.ndarray) – The data set used to train the DipEncoder (i.e. to retrieve the projection axes, modal intervals, …)
X_test (np.ndarray) – The data set for which we want to retrieve the labels
- Returns:
labels_pred – The predicted labels for X_test
- Return type:
np.ndarray
- set_predict_request(*, X_test: bool | None | str = '$UNCHANGED$', X_train: bool | None | str = '$UNCHANGED$') DipEncoder
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the scikit-learn User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in predict.
X_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_train parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- clustpy.deep.dipencoder.plot_dipencoder_embedding(X_embed: ndarray, n_clusters: int, labels: ndarray, projection_axes: ndarray, index_dict: dict, edge_width: float = 0.1, show_legend: bool = False, show_plot: bool = True) None[source]
Plot the current state of the DipEncoder. Uses the plot_scatter_matrix as a basis and adds projection axes in red.
- Parameters:
X_embed (np.ndarray) – The embedded data set
n_clusters (int) – Number of clusters
labels (np.ndarray) – The cluster labels
projection_axes (np.ndarray) – The projection axes between the clusters
index_dict (dict) – A dictionary to match the indices of two clusters to a projection axis
edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots
show_legend (bool) – Specifies whether a legend should be added to the plot
show_plot (bool) – Specifies whether the plot should be plotted, i.e. if plt.show() should be executed (default: True)
clustpy.deep.dkm module
@authors: Collin Leiber
- class clustpy.deep.dkm.DKM(n_clusters: int, alphas: tuple = 1000, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 50, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep k-Means (DKM) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DKM loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
alphas (tuple) – tuple of different alpha values used for the prediction. Small values close to 0 are equivalent to homogeneous assignments to all clusters. Large values simulate a clear assignment as with KMeans. If None, the default calculation of the paper will be used. This is equal to \alpha_{i+1} = 2^{1/log(i)^2} * \alpha_i with \alpha_1 = 0.1 and maximum i = 40. Alpha can also be a tuple with (None, \alpha_1, maximum i) (default: (1000))
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 50)
clustering_epochs (int) – number of epochs for each alpha value for the actual clustering procedure. The total number of clustering epochs therefore corresponds to: len(alphas)*clustering_epochs (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dkm_labels_
The final DKM labels
- Type:
np.ndarray
- dkm_cluster_centers_
The final DKM cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DKM
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dkm = DKM(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dkm.fit(data)
References
Fard, Maziar Moradi, Thibaut Thonet, and Eric Gaussier. “Deep k-means: Jointly clustering with k-means and learning representations.” Pattern Recognition Letters 138 (2020): 185-192.
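The default alpha schedule from the paper, which is used when alphas is None, can be reproduced with a few lines of standard-library Python. This is a sketch under one stated assumption: the growth factor 2^{1/log(i)^2} is undefined for i = 1 because log(1) = 0, so the loop below starts at i = 2 and may therefore be shifted by one index relative to the library's own calculation. The function name default_alphas is hypothetical.

```python
import math


def default_alphas(init_alpha: float = 0.1, n_alphas: int = 40) -> list:
    """Build an increasing alpha schedule starting at init_alpha."""
    alphas = [init_alpha]
    # Assumption: start at i = 2, since log(1) = 0 would divide by zero.
    for i in range(2, n_alphas + 1):
        growth = 2 ** (1 / math.log(i) ** 2)
        alphas.append(growth * alphas[-1])
    return alphas


alphas = default_alphas()
# clustering_epochs counts epochs per alpha value, so the total is a product.
total_clustering_epochs = len(alphas) * 100
```

Because every growth factor is larger than 1, the schedule moves from soft assignments (small alpha) towards hard, KMeans-like assignments (large alpha). With 40 alpha values and clustering_epochs=100 the clustering phase runs for 4000 epochs in total.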
- fit(X: ndarray, y: ndarray = None) DKM[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DKM algorithm
- Return type:
DKM
clustpy.deep.enrc module
@authors: Lukas Miklautz
- class clustpy.deep.enrc.ACeDeC(n_clusters: int, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 20, init: str = 'acedec', device: ~torch.device = None, scheduler: torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]
Bases: ENRC
Autoencoder Centroid-based Deep Cluster (ACeDeC) can be seen as a special case of ENRC where we have one cluster space and one shared space with a single cluster.
- Parameters:
n_clusters (int) – number of clusters
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
P (list) – list containing projections for clusters in clustered space and cluster in shared space (optional) (default: None)
input_centers (list) – list containing the cluster centers for clusters in clustered space and cluster in shared space (optional) (default: None)
batch_size (int) – size of the data batches (default: 128)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)
tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1-tolerance_threshold) for all clusterings, the training will stop before the maximum number of epochs (clustering_epochs) is reached. A higher value stops the training earlier; if set to 0 or None, the training continues until the labels no longer change (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)
init (str) – choose which initialization strategy should be used. Has to be one of ‘acedec’, ‘subkmeans’, ‘random’ or ‘sgd’ (default: ‘acedec’)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10,000)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy. (default: True)
debug (bool) – if True additional information during the training will be printed (default: False)
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network
The final neural_network
- Type:
torch.nn.Module
- Raises:
ValueError – if init is not one of ‘acedec’, ‘subkmeans’, ‘random’, ‘auto’ or ‘sgd’
References
Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Böhm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832
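The stopping rule controlled by tolerance_threshold can be sketched without any dependencies. The code below is an illustration, not the library's implementation: the helper names are hypothetical, the NMI is normalized by the arithmetic mean of the two entropies (an assumption, other normalizations exist), and a small epsilon guards against floating point rounding.

```python
from collections import Counter
from math import log


def entropy(labels) -> float:
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(a, b) -> float:
    """Normalized mutual information (arithmetic-mean normalization)."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mutual_info = sum(c / n * log(c * n / (count_a[x] * count_b[y]))
                      for (x, y), c in count_ab.items())
    denom = (entropy(a) + entropy(b)) / 2
    return 1.0 if denom == 0 else mutual_info / denom


def should_stop(old_labelings, new_labelings, tolerance_threshold=None) -> bool:
    """True if every clustering changed less than the tolerance allows."""
    tol = 0.0 if tolerance_threshold is None else tolerance_threshold
    return all(nmi(old, new) >= 1 - tol - 1e-12
               for old, new in zip(old_labelings, new_labelings))


# Same partition with swapped label names: NMI is 1, so training may stop.
unchanged = should_stop([[0, 0, 1, 1]], [[1, 1, 0, 0]])
# Completely reshuffled partition: NMI is 0, so training continues.
changed = should_stop([[0, 0, 1, 1]], [[0, 1, 0, 1]], tolerance_threshold=0.1)
```

For ACeDeC there is a single clustering, so the lists contain one labeling each; the list form mirrors the phrase "for all clusterings" in the parameter description.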
- fit(X: ndarray, y: ndarray = None) ACeDeC[source]
Cluster the input dataset with the ACeDeC algorithm. Saves the labels, centers, V, m, Betas, and P in the ACeDeC object. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – input data
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the ACeDeC algorithm
- Return type:
ACeDeC
- predict(X: ndarray, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]
Predicts the labels of the input data.
- Parameters:
X (np.ndarray) – input data
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)
dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)
- Returns:
predicted_labels – The predicted labels
- Return type:
np.ndarray
- set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ACeDeC
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.
use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.enrc.ENRC(n_clusters: list, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 20, init: str = 'nrkmeans', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]
Bases:
_AbstractDeepClusteringAlgo
The Embedded Non-Redundant Clustering (ENRC) algorithm.
- Parameters:
n_clusters (list) – list containing number of clusters for each clustering
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
P (list) – list containing projections for each clustering (optional) (default: None)
input_centers (list) – list containing the cluster centers for each clustering (optional) (default: None)
batch_size (int) – size of the data batches (default: 128)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)
tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1 - tolerance_threshold) for all clusterings, the training will stop before max_epochs is reached. A high value stops the training earlier than max_epochs; if set to 0 or None, the training continues until the labels no longer change (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)
init (str) – choose which initialization strategy should be used. Has to be one of ‘nrkmeans’, ‘random’, ‘sgd’ or ‘auto’ (default: ‘nrkmeans’)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
init_subsample_size (int) – if specified, only a subsample of size init_subsample_size of the data will be used for the initialization. If None, all data will be used (default: 10000)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy (default: True)
debug (bool) – if True additional information during the training will be printed (default: False)
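The stopping rule described for tolerance_threshold can be illustrated with a small sketch. The helper should_stop_early and its argument nmi_per_clustering are hypothetical names and not part of clustpy. The NMI values between the previous and the current labels are assumed to be computed elsewhere, one value per clustering.

```python
def should_stop_early(nmi_per_clustering, tolerance_threshold=None):
    """Hypothetical helper illustrating the documented stopping rule.

    nmi_per_clustering: NMI(old_labels, new_labels), one value per clustering.
    """
    # A threshold of 0 or None means: stop only once the labels no longer
    # change, i.e. NMI == 1 for every clustering.
    tol = 0.0 if tolerance_threshold is None else tolerance_threshold
    return all(nmi >= 1.0 - tol for nmi in nmi_per_clustering)


# The labels of both clusterings changed only slightly between two epochs.
nmis = [0.97, 0.99]
stop_default = should_stop_early(nmis)          # no tolerance: keep training
stop_tolerant = should_stop_early(nmis, 0.05)   # tolerance of 0.05: stop early
```

With the default of None, training in this sketch only stops once every clustering reaches an NMI of 1, which matches the "labels are not changing anymore" case.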
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
- Raises:
ValueError – if init is not one of ‘nrkmeans’, ‘random’, ‘auto’ or ‘sgd’
References
Miklautz, Lukas & Dominik Mautz et al. “Deep embedded non-redundant clustering.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.
- fit(X: ndarray, y: ndarray = None) ENRC[source]
Cluster the input dataset with the ENRC algorithm. Saves the labels, centers, V, m, Betas, and P in the ENRC object. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – input data
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – returns the ENRC object
- Return type:
ENRC
- plot_subspace(X: ndarray, subspace_index: int = 0, labels: ndarray = None, plot_centers: bool = False, gt: ndarray = None, equal_axis: bool = False) None[source]
Plot the specified subspace as a scatter matrix plot.
- Parameters:
X (np.ndarray) – input data
subspace_index (int) – index of the subspace (default: 0)
labels (np.ndarray) – the labels to use for the plot. If None, the labels found by Nr-Kmeans are used (default: None)
plot_centers (bool) – plot centers if True (default: False)
gt (np.ndarray) – ground truth labels (default: None)
equal_axis (bool) – equalize axis if True (default: False)
- Return type:
scatter matrix plot of the input data
- predict(X: ndarray = None, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]
Predicts the labels for each clustering of X in a mini-batch manner.
- Parameters:
X (np.ndarray) – input data
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)
dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)
- Returns:
predicted_labels – n x c matrix, where n is the number of data points in X and c is the number of clusterings.
- Return type:
np.ndarray
- reconstruct_subspace_centroids(subspace_index: int = 0) ndarray[source]
Reconstructs the centroids in the specified subspace.
- Parameters:
subspace_index (int) – index of the subspace (default: 0)
- Returns:
centers_rec – reconstructed centers as np.ndarray
- Return type:
np.ndarray
- set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ENRC
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.
use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- transform_full_space(X: ndarray, embedded=False) ndarray[source]
Embeds the input dataset with the neural network and the matrix V from the ENRC object.
- Parameters:
X (np.ndarray) – input data
embedded (bool) – if True, then X is assumed to be already embedded (default: False)
- Returns:
rotated – The transformed data
- Return type:
np.ndarray
- transform_subspace(X: ndarray, subspace_index: int = 0, embedded: bool = False) ndarray[source]
Embeds the input dataset with the neural network and with the matrix V projected onto the specified cluster subspace.
- Parameters:
X (np.ndarray) – input data
subspace_index (int) – index of the subspace (default: 0)
embedded (bool) – if True, then X is assumed to be already embedded (default: False)
- Returns:
subspace – The transformed subspace
- Return type:
np.ndarray
- clustpy.deep.enrc.acedec_init(data: ~numpy.ndarray, n_clusters: list, optimizer_params: dict, batch_size: int = 128, optimizer_class: ~torch.optim.optimizer.Optimizer = None, rounds: int = None, epochs: int = 10, random_state: ~numpy.random.mtrand.RandomState = None, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, device: ~torch.device = device(type='cpu'), debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy based on optimizing ACeDeC’s parameters V and beta in isolation from the neural network using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than nrkmeans_init and only constrains V using the reconstruction error (torch.nn.MSELoss), which can be more flexible than the orthogonality constraint of NrKmeans. A drawback of the sgd_init strategy is that it can be less stable for small data sets.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
optimizer_params (dict) – parameters of the optimizer used to optimize V and beta, includes the learning rate
batch_size (int) – size of the data batches (default: 128)
optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used (default: None)
rounds (int) – not used here (default: None)
epochs (int) – number of epochs. The value is automatically set so that the optimization runs for close to 20,000 mini-batch iterations, as in the ACeDeC paper. If this determined value is smaller than the passed epochs, then the passed epochs is used (default: 10)
random_state (np.random.RandomState) – random state for reproducible results (default: None)
input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
device (torch.device) – device on which the training is performed (default: torch.device(‘cpu’))
debug (bool) – if True then the cost of each round will be printed (default: True)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
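The rule for the epochs parameter of acedec_init can be sketched as follows. This is only an illustration of the description above: the function name effective_epochs is hypothetical, and the exact rounding used by clustpy is not documented here, so the use of ceil and round is an assumption.

```python
import math


def effective_epochs(n_samples, batch_size=128, epochs=10, target_iterations=20000):
    """Hypothetical sketch of the documented rule for the epochs parameter."""
    # Number of mini-batch iterations in one pass over the data.
    iterations_per_epoch = math.ceil(n_samples / batch_size)
    # Epochs needed to come close to the target number of iterations
    # (the rounding is an assumption of this sketch).
    determined = max(1, round(target_iterations / iterations_per_epoch))
    # If the determined value is smaller than the passed epochs, the passed
    # value is used instead.
    return max(determined, epochs)


small = effective_epochs(n_samples=1280)        # 10 iterations per epoch
large = effective_epochs(n_samples=1_000_000)   # 7813 iterations per epoch
```

Under these assumptions a small data set is trained for many epochs, while a large data set falls back to the passed epochs value.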
- clustpy.deep.enrc.available_init_strategies() list[source]
Returns a list of strings of available initialization strategies for ENRC and ACeDeC. At the moment, the following strategies are supported: nrkmeans, random, sgd, auto.
- clustpy.deep.enrc.beta_weights_init(P: list, n_dims: int, high_value: float = 0.9) Tensor[source]
Initializes parameters of the softmax such that betas will be set to high_value in dimensions which form a cluster subspace according to P and set to (1 - high_value)/(len(P) - 1) for the other clusterings.
- Parameters:
P (list) – list containing projections for each subspace
n_dims (int) – dimensionality of the embedded data
high_value (float) – value that should be initially used to indicate strength of assignment of a specific dimension to the clustering (default: 0.9)
- Returns:
beta_weights – initialized weights that are input in the softmax to get the betas.
- Return type:
torch.Tensor
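The target betas described for beta_weights_init can be worked through in plain Python. This is a sketch under stated assumptions, not the clustpy implementation: P is assumed to be a list of dimension-index lists in which every dimension belongs to exactly one clustering, the helper names are hypothetical, and taking the logarithm is just one way to obtain softmax inputs that reproduce the betas.

```python
import math


def target_betas(P, n_dims, high_value=0.9):
    """Betas as described above: one row per clustering, one column per dimension."""
    low_value = (1.0 - high_value) / (len(P) - 1)
    betas = [[low_value] * n_dims for _ in P]
    for clustering, dims in enumerate(P):
        for d in dims:
            betas[clustering][d] = high_value
    return betas


def softmax_over_clusterings(weights):
    """Column-wise softmax, so the betas of each dimension sum to 1."""
    n_dims = len(weights[0])
    result = [[0.0] * n_dims for _ in weights]
    for d in range(n_dims):
        exps = [math.exp(row[d]) for row in weights]
        total = sum(exps)
        for c, e in enumerate(exps):
            result[c][d] = e / total
    return result


# Two clusterings on a 3-dimensional embedding (illustrative values).
P = [[0, 1], [2]]
betas = target_betas(P, n_dims=3)
# The logarithm of the betas is a valid softmax input that reproduces them.
beta_weights = [[math.log(b) for b in row] for row in betas]
recovered = softmax_over_clusterings(beta_weights)
```

Because the high value and the (len(P) - 1) low values of each dimension add up to 1, the softmax of the logarithms returns the target betas unchanged.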
- clustpy.deep.enrc.calculate_beta_weight(data: Tensor, centers: list, V: Tensor, P: list, high_beta_value: float = 0.9) Tensor[source]
The beta weights have a closed form solution if we have two subspaces, so the optimal values given the data, centers and V can be computed. See the supplement of Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Boehm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832, available at https://gitlab.cs.univie.ac.at/lukas/acedec_public/-/blob/master/supplement.pdf
For more than two subspaces, the beta weights are calculated under the assumption that an assigned subspace should have a weight of 0.9.
- Parameters:
data (torch.Tensor) – input data
centers (list) – list of torch.Tensor, cluster centers for each clustering
V (torch.Tensor) – orthogonal rotation matrix
P (list) – list containing projections for each subspace
high_beta_value (float) – value that should be initially used to indicate strength of assignment of a specific dimension to the clustering (default: 0.9)
- Returns:
beta_weights – a c x d vector containing the weights for the softmax to indicate which dimensions d are important for each clustering c.
- Return type:
torch.Tensor
- Raises:
ValueError – if the number of clusterings is smaller than 2
- clustpy.deep.enrc.calculate_optimal_beta_weights_special_case(data: Tensor, centers: list, V: Tensor, batch_size: int = 32) Tensor[source]
The beta weights have a closed form solution if we have two subspaces, so the optimal values given the data, centers and V can be computed. See supplement of Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Boehm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832 here: https://gitlab.cs.univie.ac.at/lukas/acedec_public/-/blob/master/supplement.pdf
- Parameters:
data (torch.Tensor) – input data
centers (list) – list of torch.Tensor, cluster centers for each clustering
V (torch.Tensor) – orthogonal rotation matrix
batch_size (int) – size of the data batches (default: 32)
- Returns:
optimal_beta_weights – a c x d vector containing the optimal weights for the softmax to indicate which dimensions d are important for each clustering c.
- Return type:
torch.Tensor
- clustpy.deep.enrc.enrc_encode_decode_batchwise_with_loss(V: Tensor, centers: list, model: Module, dataloader: DataLoader, device: device = device(type='cpu'), ssl_loss_fn: _Loss = None) ndarray[source]
Encode and Decode input data of a dataloader in a mini-batch manner with ENRC.
- Parameters:
V (torch.Tensor) – orthogonal rotation matrix
centers (list) – list of torch.Tensor, cluster centers for each clustering
model (torch.nn.Module) – the input model for encoding the data
dataloader (torch.utils.data.DataLoader) – dataloader to be used for prediction
device (torch.device) – device to be predicted on (default: torch.device(‘cpu’))
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: None)
- Returns:
enrc_encoding (np.ndarray) – n x d matrix, where n is the number of data points and d is the number of dimensions of z.
enrc_decoding (np.ndarray) – n x D matrix, where n is the number of data points and D is the data dimensionality.
reconstruction_error (float) – reconstruction error (will be None if ssl_loss_fn is not specified)
- clustpy.deep.enrc.enrc_init(data: ~numpy.ndarray, n_clusters: list, init: str = 'auto', rounds: int = 10, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, random_state: ~numpy.random.mtrand.RandomState = None, max_iter: int = 100, optimizer_params: dict = None, optimizer_class: ~torch.optim.optimizer.Optimizer = None, batch_size: int = 128, epochs: int = 10, device: ~torch.device = device(type='cpu'), debug: bool = True, init_kwargs: dict = None) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy for the ENRC algorithm.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
init (str) –
{‘nrkmeans’, ‘random’, ‘sgd’, ‘auto’} or callable. Initialization strategies for parameters cluster_centers, V and beta of ENRC. (default=’auto’)
’nrkmeans’ : Performs the NrKmeans algorithm to get initial parameters. This strategy is preferred for small data sets, but the orthogonality constraint on V and subsequently for the clustered subspaces can sometimes be too limiting in practice, e.g., if clusterings in the data are not perfectly non-redundant.
’random’ : Same as ‘nrkmeans’, but max_iter is set to 10, so the performance is faster, but also less optimized, thus more random.
’sgd’ : Initialization strategy based on optimizing ENRC’s parameters V and beta in isolation from the neural network using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than the ‘nrkmeans’ option and only constrains V using the reconstruction error (torch.nn.MSELoss), which can be more flexible than the orthogonality constraint of NrKmeans. A drawback of the ‘sgd’ strategy is that it can be less stable for small data sets.
’auto’ : Selects ‘sgd’ init if data.shape[0] > 100,000 or data.shape[1] > 1,000. For smaller data sets ‘nrkmeans’ init is used.
If a callable is passed, it should take arguments data and n_clusters (additional parameters can be provided via the dictionary init_kwargs) and return an initialization (centers, P, V and beta_weights).
rounds (int) – number of repetitions of the initialization procedure (default: 10)
input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
random_state (np.random.RandomState) – random state for reproducible results (default: None)
max_iter (int) – maximum number of iterations of NrKmeans. Only used for init=’nrkmeans’ (default: 100)
optimizer_params (dict) – parameters of the optimizer used to optimize V and beta, includes the learning rate. Only used for init=’sgd’
optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used. Only used for init=’sgd’ (default: None)
batch_size (int) – size of the data batches. Only used for init=’sgd’ (default: 128)
epochs (int) – number of epochs for the actual clustering procedure. Only used for init=’sgd’ (default: 10)
device (torch.device) – device on which the training is performed. Only used for init=’sgd’ (default: torch.device(‘cpu’))
debug (bool) – if True then the cost of each round will be printed (default: True)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
- Raises:
ValueError – if an init variable is passed that is not implemented
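The selection rule documented for init=’auto’ can be written down in a few lines. The helper name resolve_auto_init is hypothetical and only mirrors the thresholds stated in the parameter description above.

```python
def resolve_auto_init(n_samples, n_features):
    """Hypothetical helper mirroring the documented 'auto' rule."""
    # 'sgd' for many samples or many features, 'nrkmeans' otherwise.
    if n_samples > 100_000 or n_features > 1_000:
        return "sgd"
    return "nrkmeans"


choice_small = resolve_auto_init(n_samples=5_000, n_features=20)
choice_many_rows = resolve_auto_init(n_samples=250_000, n_features=20)
choice_wide = resolve_auto_init(n_samples=5_000, n_features=4_096)
```

Here n_samples and n_features stand for data.shape[0] and data.shape[1].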
- clustpy.deep.enrc.enrc_predict(z: Tensor, V: Tensor, centers: list, subspace_betas: Tensor, use_P: bool = False) ndarray[source]
Predicts the labels for each clustering of an input z.
- Parameters:
z (torch.Tensor) – embedded input data point, can also be a mini-batch of embedded points
V (torch.Tensor) – orthogonal rotation matrix
centers (list) – list of torch.Tensor, cluster centers for each clustering
subspace_betas (torch.Tensor) – weights for each dimension per clustering. Calculated via softmax(beta_weights).
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft subspace_beta weights are used (default: False)
- Returns:
predicted_labels – n x c matrix, where n is the number of data points in z and c is the number of clusterings.
- Return type:
np.ndarray
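The soft assignment (use_P=False) can be pictured as a beta-weighted nearest-center search in the rotated space. The following plain-Python sketch is an illustration and not clustpy's implementation: the function name soft_predict is hypothetical, nested lists stand in for tensors, and the centers are assumed to be given in the rotated space already.

```python
def soft_predict(z, V, centers, subspace_betas):
    """Illustrative soft assignment on nested lists.

    z: n x d embedded points, V: d x d rotation matrix,
    centers: one k_c x d list of centers per clustering (assumed rotated),
    subspace_betas: c x d per-dimension weights.
    """
    d = len(V)
    # Rotate every point: z_rot = z @ V
    z_rot = [[sum(p[i] * V[i][j] for i in range(d)) for j in range(d)] for p in z]
    labels = []
    for p in z_rot:
        row = []
        for betas, clustering_centers in zip(subspace_betas, centers):
            # Beta-weighted squared distance to every center of this clustering.
            dists = [sum(b * (x - m) ** 2 for b, x, m in zip(betas, p, c))
                     for c in clustering_centers]
            row.append(dists.index(min(dists)))
        labels.append(row)
    return labels  # n x c: one label per point and clustering


V = [[1.0, 0.0], [0.0, 1.0]]                 # identity rotation
centers = [[[0.0, 0.0], [4.0, 0.0]],         # clustering 0 separates along dim 0
           [[0.0, 0.0], [0.0, 4.0]]]         # clustering 1 separates along dim 1
betas = [[0.9, 0.1], [0.1, 0.9]]
labels = soft_predict([[3.5, 0.2], [0.3, 3.9]], V, centers, betas)
```

The result has one row per data point and one column per clustering, which corresponds to the n x c layout of predicted_labels.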
- clustpy.deep.enrc.enrc_predict_batchwise(V: Tensor, centers: list, subspace_betas: Tensor, model: Module, dataloader: DataLoader, device: device = device(type='cpu'), use_P: bool = False) ndarray[source]
Predicts the labels for each clustering of a dataloader in a mini-batch manner.
- Parameters:
V (torch.Tensor) – orthogonal rotation matrix
centers (list) – list of torch.Tensor, cluster centers for each clustering
subspace_betas (torch.Tensor) – weights for each dimension per clustering. Calculated via softmax(beta_weights).
model (torch.nn.Module) – the input model for encoding the data
dataloader (torch.utils.data.DataLoader) – dataloader to be used for prediction
device (torch.device) – device to be predicted on (default: torch.device(‘cpu’))
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: False)
- Returns:
predicted_labels – n x c matrix, where n is the number of data points in z and c is the number of clusterings.
- Return type:
np.ndarray
- clustpy.deep.enrc.nrkmeans_init(data: ~numpy.ndarray, n_clusters: list, rounds: int = 10, max_iter: int = 100, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, random_state: ~numpy.random.mtrand.RandomState = None, debug=True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy based on the NrKmeans Algorithm. This strategy is preferred for small data sets, but the orthogonality constraint on V and subsequently for the clustered subspaces can sometimes be too limiting in practice, e.g., if clusterings are not perfectly non-redundant.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
rounds (int) – number of repetitions of the NrKmeans algorithm (default: 10)
max_iter (int) – maximum number of iterations of NrKmeans (default: 100)
input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – if True then the cost of each round will be printed (default: True)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
- clustpy.deep.enrc.optimal_beta(kmeans_loss: Tensor, other_losses_mean_sum: Tensor) Tensor[source]
Calculate optimal values for the beta weight for each dimension.
- Parameters:
kmeans_loss (torch.Tensor) – a 1 x d vector of the kmeans losses per dimension.
other_losses_mean_sum (torch.Tensor) – a 1 x d vector of the kmeans losses of all other clusterings except the one in ‘kmeans_loss’.
- Returns:
optimal_beta_weights – a 1 x d vector containing the optimal weights for the softmax to indicate which dimensions are important for each clustering. Calculated via -torch.log(kmeans_loss/other_losses_mean_sum)
- Return type:
torch.Tensor
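The formula -torch.log(kmeans_loss/other_losses_mean_sum) can be tried on plain numbers. The sketch below uses Python lists and math.log instead of tensors; the function name and the loss values are made up for illustration.

```python
import math


def optimal_beta_weights(kmeans_loss, other_losses_mean_sum):
    """Per-dimension -log(kmeans_loss / other_losses_mean_sum) on plain lists."""
    return [-math.log(k / o) for k, o in zip(kmeans_loss, other_losses_mean_sum)]


# Dimension 0 is compact for this clustering, dimension 1 is not (made-up losses).
weights = optimal_beta_weights([0.5, 4.0], [4.0, 0.5])
```

A dimension whose kmeans loss is small compared to the other clusterings receives a positive weight and therefore a larger beta after the softmax; a dimension with a comparatively large loss receives a negative weight.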
- clustpy.deep.enrc.random_nrkmeans_init(data: ~numpy.ndarray, n_clusters: list, rounds: int = 10, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, random_state: ~numpy.random.mtrand.RandomState = None, debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy based on the NrKmeans Algorithm. For documentation see nrkmeans_init function. Same as nrkmeans_init, but max_iter is set to 1, so the results will be faster and more random.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
rounds (int) – number of repetitions of the NrKmeans algorithm (default: 10)
input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – if True then the cost of each round will be printed (default: True)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
- clustpy.deep.enrc.reinit_centers(enrc: _ENRC_Module, subspace_id: int, dataloader: DataLoader, model: Module, n_samples: int = 512, kmeans_steps: int = 10, split: str = 'random', debug: bool = False) None[source]
Reinitializes centers that have been lost, i.e. centers that did not get any data point assigned. Before a center is reinitialized, this method checks whether the center has not been assigned any points over several mini-batch iterations. If this count is higher than enrc.reinit_threshold, the center will be reinitialized.
- Parameters:
enrc (_ENRC_Module) – torch.nn.Module instance for the ENRC algorithm
subspace_id (int) – index of the subspace that contains the clusters to be checked
dataloader (torch.utils.data.DataLoader) – dataloader from which data is randomly sampled. Important: shuffle=True needs to be set, because n_samples random samples are drawn
model (torch.nn.Module) – neural network used for the embedding
n_samples (int) – number of samples that should be used for the reclustering (default: 512)
kmeans_steps (int) – number of mini-batch kmeans steps that should be conducted with the new centroid (default: 10)
split (str) – {‘random’, ‘cost’}, selects how clusters should be split for reinitialization. ‘random’ : split a random point from the random sample of size=n_samples. ‘cost’ : split the cluster with max kmeans cost (default: ‘random’)
debug (bool) – if True then training errors will be printed (default: False)
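The bookkeeping behind the reinitialization check can be sketched as a per-center counter. This is a hypothetical illustration, not the code used by clustpy: the function name is made up, and resetting the counter as soon as a center receives a point is an assumption of this sketch.

```python
def update_lost_center_counts(counts, assigned_labels, reinit_threshold):
    """Hypothetical bookkeeping for centers without assignments.

    counts: per-center number of mini-batches without any assigned point.
    Returns the indices of centers whose count exceeds reinit_threshold.
    """
    assigned = set(assigned_labels)
    for center in range(len(counts)):
        # Reset the counter when the center received a point, else increment it.
        counts[center] = 0 if center in assigned else counts[center] + 1
    return [center for center, count in enumerate(counts) if count > reinit_threshold]


counts = [0, 0, 0]
to_reinit = []
# Center 2 never receives a point in three consecutive mini-batches.
for batch_labels in ([0, 1, 1], [0, 0, 1], [1, 1, 0]):
    to_reinit = update_lost_center_counts(counts, batch_labels, reinit_threshold=2)
```

In this sketch only center 2 exceeds the threshold and would be handed to the reinitialization step.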
- clustpy.deep.enrc.sgd_init(data: ~numpy.ndarray, n_clusters: list, optimizer_params: dict, batch_size: int = 128, optimizer_class: ~torch.optim.optimizer.Optimizer = None, rounds: int = 2, epochs: int = 10, random_state: ~numpy.random.mtrand.RandomState = None, input_centers: list = None, P: list = None, V: ~numpy.ndarray = None, device: ~torch.device = device(type='cpu'), debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy based on optimizing ENRC’s parameters V and beta in isolation from the neural network using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than nrkmeans_init and only constrains V using the reconstruction error (torch.nn.MSELoss), which can be more flexible than the orthogonality constraint of NrKmeans. A drawback of the sgd_init strategy is that it can be less stable for small data sets.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
optimizer_params (dict) – parameters of the optimizer used to optimize V and beta, includes the learning rate
batch_size (int) – size of the data batches (default: 128)
optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used (default: None)
rounds (int) – number of repetitions of the initialization procedure (default: 2)
epochs (int) – number of epochs for the actual clustering procedure (default: 10)
random_state (np.random.RandomState) – random state for reproducible results (default: None)
input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
device (torch.device) – device on which the training is performed (default: torch.device(‘cpu’))
debug (bool) – if True then the cost of each round will be printed (default: True)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
clustpy.deep.vade module
@authors: Donatella Novakovic, Lukas Miklautz, Collin Leiber
- class clustpy.deep.vade.VaDE(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 10, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = BCELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Variational Deep Embedding (VaDE) algorithm. First, a variational autoencoder (VAE) will be trained (will be skipped if input neural network is given). Afterward, a GMM will be fit to identify the initial clustering structures. Last, the VAE will be optimized using the VaDE loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 10)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.BCELoss(reduction=’sum’))
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new VariationalAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (central layer with mean and variance) (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: GaussianMixture)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {“n_init”: 10, “covariance_type”: “diag”})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The labels as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- cluster_centers_
The cluster centers as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- covariances_
The covariance matrices as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- weights_
The weights as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- vade_labels_
The labels as identified by VaDE after the training terminated
- Type:
np.ndarray
- vade_cluster_centers_
The cluster centers as identified by VaDE after the training terminated
- Type:
np.ndarray
- vade_covariances_
The covariance matrices as identified by VaDE after the training terminated
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> import numpy as np
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import VaDE
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> data = (data - np.mean(data)) / np.std(data)
>>> vade = VaDE(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> vade.fit(data)
References
Jiang, Zhuxi, et al. “Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering.” IJCAI. 2017.
- fit(X: ndarray, y: ndarray = None) VaDE[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the VaDE algorithm
- Return type:
VaDE
Module contents
- class clustpy.deep.ACeDeC(n_clusters: int, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 20, init: str = 'acedec', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]
Bases: ENRC
Autoencoder Centroid-based Deep Cluster (ACeDeC) can be seen as a special case of ENRC where we have one cluster space and one shared space with a single cluster.
- Parameters:
n_clusters (int) – number of clusters
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
P (list) – list containing projections for clusters in clustered space and cluster in shared space (optional) (default: None)
input_centers (list) – list containing the cluster centers for clusters in clustered space and cluster in shared space (optional) (default: None)
batch_size (int) – size of the data batches (default: 128)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)
tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1-tolerance_threshold) for all clusterings, the training will stop before the maximum number of epochs (clustering_epochs) is reached. A high value stops the training earlier; if set to 0 or None, the training continues until the labels no longer change (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)
init (str) – choose which initialization strategy should be used. Has to be one of ‘acedec’, ‘subkmeans’, ‘random’ or ‘sgd’ (default: ‘acedec’)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
init_subsample_size (int) – if set, only a subsample of size init_subsample_size of the data is used for the initialization. If None, all data will be used (default: 10000)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy. (default: True)
debug (bool) – if True additional information during the training will be printed (default: False)
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network
The final neural_network
- Type:
torch.nn.Module
- Raises:
ValueError – if init is not one of ‘acedec’, ‘subkmeans’, ‘random’, ‘auto’ or ‘sgd’.
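The stopping rule described for tolerance_threshold can be sketched in plain Python. This is only an illustration and not the library's code: the helper names nmi and should_stop are hypothetical, the NMI is assumed to use arithmetic-mean normalization, and only a single clustering is checked instead of all clusterings.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information with arithmetic-mean normalization (assumed)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    entropy_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual_info = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    mean_entropy = (entropy_a + entropy_b) / 2
    if mean_entropy == 0.0:
        # Both labelings consist of a single cluster, so they agree trivially.
        return 1.0
    return mutual_info / mean_entropy


def should_stop(old_labels, new_labels, tolerance_threshold=None):
    """Hypothetical helper: stop once NMI(old, new) >= 1 - tolerance_threshold."""
    tolerance = 0.0 if tolerance_threshold is None else tolerance_threshold
    # Small epsilon guards against floating-point round-off when the NMI is 1.
    return nmi(old_labels, new_labels) >= 1.0 - tolerance - 1e-12


old = [0, 0, 1, 1, 2, 2]
print(should_stop(old, [1, 1, 0, 0, 2, 2]))        # same partition, label ids permuted
print(should_stop(old, [0, 1, 1, 1, 2, 2], 0.05))  # one sample changed its cluster
```

Because the NMI is invariant to a renaming of the cluster ids, the check reacts only to changes in the partition itself.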
References
Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Böhm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832
- fit(X: ndarray, y: ndarray = None) ACeDeC[source]
Cluster the input dataset with the ACeDeC algorithm. Saves the labels, centers, V, m, Betas, and P in the ACeDeC object. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – input data
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – returns the ACeDeC object
- Return type:
ACeDeC
- predict(X: ndarray, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]
Predicts the labels of the input data.
- Parameters:
X (np.ndarray) – input data
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)
dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)
- Returns:
predicted_labels – The predicted labels
- Return type:
np.ndarray
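The difference between the two use_P modes can be illustrated with a small sketch. It is not the library's implementation: the function assign_label is hypothetical, the sample and the centers are assumed to be already rotated by V, P is assumed to be the list of dimension indices belonging to the clustering, and betas is assumed to hold one soft weight per dimension.

```python
def assign_label(z_rot, centers, P=None, betas=None):
    """Return the index of the nearest center under hard (P) or soft (betas) weighting."""
    if P is not None:
        # Hard selection: only the dimensions listed in P contribute.
        selected = set(P)
        weights = [1.0 if d in selected else 0.0 for d in range(len(z_rot))]
    else:
        # Soft selection: every dimension contributes according to its beta weight.
        weights = betas
    distances = [
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, z_rot, center))
        for center in centers
    ]
    return distances.index(min(distances))


z = [0.0, 5.0]
centers = [[1.0, 5.0], [0.2, 0.0]]
print(assign_label(z, centers, P=[0]))            # only dimension 0 is compared
print(assign_label(z, centers, betas=[0.5, 0.5])) # both dimensions are compared
```

In this toy setting the two modes pick different centers, because the second dimension is ignored under hard selection but dominates the distance under soft weighting.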
- set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ACeDeC
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.
use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.AEC(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = None, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Auto-encoder Based Data Clustering (AEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the AEC loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 50)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
clustering_loss_weight (float) – weight of the clustering loss (default: 0.1)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining. If this is None, random labels will be used (default: None)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import AEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> aec = AEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> aec.fit(data)
References
Song, Chunfeng, et al. “Auto-encoder based data clustering.” Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 18th Iberoamerican Congress, CIARP 2013, Havana, Cuba, November 20-23, 2013, Proceedings, Part I 18. Springer Berlin Heidelberg, 2013.
- fit(X: ndarray, y: ndarray = None) AEC[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the AEC algorithm
- Return type:
AEC
- class clustpy.deep.DCN(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 50, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 0.05, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Clustering Network (DCN) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DCN loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 50)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 50)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
clustering_loss_weight (float) – weight of the clustering loss (default: 0.05)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dcn_labels_
The final DCN labels
- Type:
np.ndarray
- dcn_cluster_centers_
The final DCN cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DCN
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dcn = DCN(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dcn.fit(data)
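How ssl_loss_weight and clustering_loss_weight might interact can be sketched as follows. This is an assumption-laden illustration, not the library's loss: the total loss is assumed to be the weighted sum of both terms, the clustering term is taken as the squared distance between an embedded sample and its assigned center (in the spirit of the k-means objective), and the function name weighted_loss is hypothetical.

```python
def weighted_loss(x, x_rec, z, center, ssl_loss_weight=1.0, clustering_loss_weight=0.05):
    """Combine a reconstruction term and a k-means-style term for one sample."""
    # Self-supervised part: mean squared reconstruction error.
    ssl_loss = sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)
    # Clustering part: squared distance of the embedding to its assigned center.
    clustering_loss = sum((a - b) ** 2 for a, b in zip(z, center))
    return ssl_loss_weight * ssl_loss + clustering_loss_weight * clustering_loss


print(weighted_loss(x=[1.0, 2.0], x_rec=[1.0, 4.0], z=[0.0, 0.0], center=[3.0, 4.0]))
```

Under this assumption, a small clustering_loss_weight keeps the reconstruction term dominant, so the embedding is pulled toward the centers only gently.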
References
Yang, Bo, et al. “Towards k-means-friendly spaces: Simultaneous deep learning and clustering.” International Conference on Machine Learning. PMLR, 2017.
- fit(X: ndarray, y: ndarray = None) DCN[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DCN algorithm
- Return type:
DCN
- class clustpy.deep.DDC(ratio: float = 0.1, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, tsne_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Density-based Image Clustering (DDC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, t-SNE is executed on the embedded data and a variant of the Density Peak Clustering algorithm is executed.
- Parameters:
ratio (float) – The ratio parameter, defining the cutoff distance d_c by calculating: average pairwise distance * ratio (default: 0.1)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
tsne_params (dict) – Parameters for the t-SNE execution. For example, perplexity can be changed by setting tsne_params to {“n_components”: 2, “perplexity”: 25}. Check out sklearn.manifold.TSNE for more information (default: {“n_components”: 2})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- n_clusters_
The final number of clusters
- Type:
int
- labels_
The final labels (obtained by a variant of Density Peak Clustering)
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
- tsne_
The t-SNE object
- Type:
TSNE
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DDC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> ddc = DDC(pretrain_epochs=3)
>>> ddc.fit(data)
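The role of the ratio parameter can be made concrete with a short sketch of the cutoff distance d_c = average pairwise distance * ratio. The function name cutoff_distance is hypothetical, and it is an assumption that the average runs over each unordered pair of points exactly once using Euclidean distances.

```python
import math
from itertools import combinations


def cutoff_distance(points, ratio=0.1):
    """Cutoff distance d_c as ratio times the average pairwise Euclidean distance."""
    # Every unordered pair of points is counted exactly once (assumption).
    distances = [math.dist(p, q) for p, q in combinations(points, 2)]
    return ratio * sum(distances) / len(distances)


# Three 2-dimensional points, e.g. coordinates in a low-dimensional embedding.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(cutoff_distance(points, ratio=0.1))
```

A larger ratio yields a larger d_c, so more neighbors fall inside the cutoff when local densities are estimated.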
References
Ren, Yazhou, et al. “Deep density-based image clustering.” Knowledge-Based Systems 197 (2020): 105841.
- fit(X: ndarray, y: ndarray = None) DDC[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DDC algorithm
- Return type:
DDC
- class clustpy.deep.DEC(n_clusters: int, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Embedded Clustering (DEC) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DEC loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
alpha (float) – alpha value for the prediction (default: 1.0)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dec_labels_
The final DEC labels
- Type:
np.ndarray
- dec_cluster_centers_
The final DEC cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dec = DEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dec.fit(data)
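The effect of alpha can be illustrated with the soft assignment of the DEC paper cited under References, where the similarity between an embedded sample and a cluster center follows a Student's t kernel with alpha degrees of freedom. The sketch below is a plain-Python illustration under that assumption; the function name soft_assignments is hypothetical and the library's own computation may differ in details.

```python
def soft_assignments(embedded, centers, alpha=1.0):
    """Student's t soft assignment of each embedded sample to each center."""
    assignments = []
    for point in embedded:
        kernel = []
        for center in centers:
            squared_dist = sum((a - b) ** 2 for a, b in zip(point, center))
            kernel.append((1.0 + squared_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(kernel)
        # Normalize so that the assignments of one sample sum to 1.
        assignments.append([k / total for k in kernel])
    return assignments


centers = [[0.0, 0.0], [1.0, 0.0]]
print(soft_assignments([[0.0, 0.0], [0.5, 0.0]], centers, alpha=1.0))
```

A sample lying on a center receives the largest share for that center, while a sample halfway between two centers is split evenly.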
References
Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis.” International conference on machine learning. 2016.
- fit(X: ndarray, y: ndarray = None) → DEC[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DEC algorithm
- Return type:
DEC
- class clustpy.deep.DKM(n_clusters: int, alphas: tuple = 1000, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 50, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep k-Means (DKM) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DKM loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
alphas (tuple) – tuple of different alpha values used for the prediction. Small values close to 0 are equivalent to homogeneous assignments to all clusters. Large values simulate a clear assignment as with KMeans. If None, the default calculation of the paper will be used. This is equal to alpha_{i+1} = 2^{1/log(i)^2} * alpha_i with alpha_1 = 0.1 and maximum i = 40. Alpha can also be a tuple with (None, alpha_1, maximum i) (default: (1000))
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 50)
clustering_epochs (int) – number of epochs for each alpha value for the actual clustering procedure. The total number of clustering epochs therefore corresponds to: len(alphas)*clustering_epochs (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dkm_labels_
The final DKM labels
- Type:
np.ndarray
- dkm_cluster_centers_
The final DKM cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DKM
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dkm = DKM(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> dkm.fit(data)
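The default alpha schedule described for the alphas parameter can be sketched in plain Python. This is an illustration only, not the library's code. Two assumptions are made here: log is the natural logarithm, and the index is shifted by one (log(i + 1)), because the recurrence as written would divide by log(1) = 0 in its first step. The helper names default_alphas and total_clustering_epochs are hypothetical.

```python
import math


def default_alphas(alpha_1=0.1, max_i=40):
    """Build an increasing alpha schedule from the quoted recurrence.

    Assumption: the first factor uses log(2) instead of log(1), since the
    literal formula alpha_{i+1} = 2^{1/log(i)^2} * alpha_i is undefined at i = 1.
    """
    alphas = [alpha_1]
    for i in range(1, max_i):
        factor = 2 ** (1 / math.log(i + 1) ** 2)
        alphas.append(factor * alphas[-1])
    return alphas


def total_clustering_epochs(alphas, clustering_epochs):
    # Each alpha value is trained for `clustering_epochs` epochs.
    return len(alphas) * clustering_epochs


alphas = default_alphas()
print(len(alphas), alphas[0], alphas[-1])
print(total_clustering_epochs(alphas, clustering_epochs=100))
```

Because every factor is larger than 1, the schedule grows monotonically, moving from soft assignments towards KMeans-like hard assignments.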
References
Fard, Maziar Moradi, Thibaut Thonet, and Eric Gaussier. “Deep k-means: Jointly clustering with k-means and learning representations.” Pattern Recognition Letters 138 (2020): 185-192.
- fit(X: ndarray, y: ndarray = None) → DKM[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DKM algorithm
- Return type:
DKM
- class clustpy.deep.DeepECT(max_n_leaf_nodes: int = 20, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 50, clustering_epochs: int = 200, grow_interval: int = 2, pruning_threshold: float = 0.1, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Embedded Cluster Tree (DeepECT) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, a cluster tree will be grown and the network will be optimized using the DeepECT loss function.
- Parameters:
max_n_leaf_nodes (int) – Maximum number of leaf nodes in the cluster tree (default: 20)
batch_size (int) – Size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 50)
clustering_epochs (int) – Number of epochs for the actual clustering procedure (default: 200)
grow_interval (int) – Number of epochs after which the tree is grown (default: 2)
pruning_threshold (float) – The threshold for pruning the tree (default: 0.1)
optimizer_class (torch.optim.Optimizer) – The optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – Size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – Use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- tree_
The prediction cluster tree after training
- Type:
PredictionClusterTree
- neural_network
The final neural network
- Type:
torch.nn.Module
- fit(X: ndarray, y: ndarray = None) → DeepECT[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – This instance of the DeepECT algorithm
- Return type:
DeepECT
- flat_clustering(n_leaf_nodes_to_keep: int) → ndarray[source]
Transform the predicted labels into a flat clustering result by only keeping n_leaf_nodes_to_keep leaf nodes in the tree. Returns labels as if the clustering procedure would have stopped at the specified number of nodes. Note that each leaf node corresponds to a cluster.
- Parameters:
n_leaf_nodes_to_keep (int) – The number of leaf nodes to keep in the cluster tree
- Returns:
labels_pruned – The new cluster labels
- Return type:
np.ndarray
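The idea behind flat_clustering can be illustrated with a small stand-alone sketch. It does not use clustpy's PredictionClusterTree. Instead it assumes a hypothetical encoding in which the root has id 0 and the j-th split creates the children 2j+1 and 2j+2, so that node ids record the order in which the tree was grown. Under that assumption, keeping k leaf nodes means undoing every split after the first k-1.

```python
def flat_clustering(sample_leaf_ids, parent, n_leaf_nodes_to_keep):
    """Collapse a binary cluster tree to `n_leaf_nodes_to_keep` leaves.

    Hypothetical encoding (not clustpy's data structure):
    - `parent` maps each non-root node id to the id of its parent,
    - the j-th split (j = 0, 1, ...) created the children 2j+1 and 2j+2.
    """
    # A tree with k leaves results from k-1 splits, so its largest id is 2(k-1).
    max_id = 2 * (n_leaf_nodes_to_keep - 1)
    kept_nodes = []
    for node in sample_leaf_ids:
        # Walk up until we reach a node that already existed at that stage.
        while node > max_id:
            node = parent[node]
        kept_nodes.append(node)
    # Map the remaining node ids to consecutive cluster labels.
    relabel = {node: label for label, node in enumerate(sorted(set(kept_nodes)))}
    return [relabel[node] for node in kept_nodes]


# Root 0 was split into 1 and 2, node 1 into 3 and 4, node 4 into 5 and 6.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 4, 6: 4}
sample_leaf_ids = [3, 5, 6, 2, 3]
for k in (2, 3, 4):
    print(k, flat_clustering(sample_leaf_ids, parent, k))
```

Each remaining leaf node corresponds to one cluster, so the number of distinct labels never exceeds n_leaf_nodes_to_keep.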
- class clustpy.deep.DipDECK(n_clusters_init: int = 35, dip_merge_threshold: float = 0.9, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, max_n_clusters: int = inf, min_n_clusters: int = 1, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 5, max_cluster_size_diff_factor: float = 2, pval_strategy: str = 'table', n_boots: int = 1000, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The Deep Embedded Clustering with k-Estimation (DipDECK) algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters using an overestimated number of clusters. Last, the network will be optimized using the DipDECK loss function. If any Dip-value exceeds the dip_merge_threshold, the corresponding clusters will be merged.
- Parameters:
n_clusters_init (int) – initial number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN (default: 35)
dip_merge_threshold (float) – threshold regarding the Dip-p-value that defines if two clusters should be merged. Must be between 0 and 1 (default: 0.9)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
max_n_clusters (int) – maximum number of clusters. Must be larger than min_n_clusters. If the result has more clusters, a merge will be forced (default: np.inf)
min_n_clusters (int) – minimum number of clusters. Must be larger than 0, smaller than max_n_clusters and smaller than n_clusters_init. When this number of clusters is reached, all further merges will be prevented (default: 1)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure. Will reset after each merge (default: 50)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 5)
max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for the Dip calculation (default: 2)
pval_strategy (str) – Defines which strategy to use to receive dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)
n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
- labels_
The final labels
- Type:
np.ndarray
- n_clusters_
The final number of clusters
- Type:
int
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
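The effect of max_cluster_size_diff_factor can be pictured with a small sketch on one-dimensional projected values. This is not the library's implementation. In particular, the text above does not say how "closest" is measured; the sketch assumes distance to the mean of the smaller cluster. The function name limit_cluster_size is hypothetical.

```python
def limit_cluster_size(larger, smaller, max_cluster_size_diff_factor=2):
    """Select the samples of the larger cluster used in a pairwise comparison.

    `larger` and `smaller` are lists of 1D projected values.
    Assumption: closeness is the distance to the mean of the smaller cluster.
    """
    max_size = int(max_cluster_size_diff_factor * len(smaller))
    if len(larger) <= max_size:
        # The size difference is within the allowed factor: keep everything.
        return list(larger)
    center = sum(smaller) / len(smaller)
    # Keep only the `max_size` samples closest to the smaller cluster.
    return sorted(larger, key=lambda value: abs(value - center))[:max_size]


larger = [float(v) for v in range(10)]   # 10 samples
smaller = [10.0, 11.0]                   # 2 samples
print(limit_cluster_size(larger, smaller, max_cluster_size_diff_factor=2))
```

With a factor of 2 and a smaller cluster of two samples, at most four samples of the larger cluster enter the comparison.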
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipDECK
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipdeck = DipDECK(pretrain_epochs=3, clustering_epochs=3)
>>> dipdeck.fit(data)
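How dip_merge_threshold, min_n_clusters and max_n_clusters interact can be summarised in a short sketch. It only restates the parameter descriptions above and is not taken from the library. The function names are hypothetical, and "smaller than" is read strictly here, which is an assumption.

```python
import math


def check_cluster_bounds(n_clusters_init, min_n_clusters=1, max_n_clusters=math.inf):
    """Validate the documented constraints on the cluster bounds."""
    if min_n_clusters <= 0:
        raise ValueError("min_n_clusters must be larger than 0")
    if min_n_clusters >= max_n_clusters:
        raise ValueError("min_n_clusters must be smaller than max_n_clusters")
    if min_n_clusters >= n_clusters_init:
        raise ValueError("min_n_clusters must be smaller than n_clusters_init")


def merge_decision(best_p_value, n_clusters, dip_merge_threshold=0.9,
                   min_n_clusters=1, max_n_clusters=math.inf):
    """Decide whether the pair of clusters with the highest Dip-p-value is merged."""
    if n_clusters <= min_n_clusters:
        return False  # lower bound reached: no further merges
    if n_clusters > max_n_clusters:
        return True   # too many clusters: a merge is forced
    return best_p_value > dip_merge_threshold


check_cluster_bounds(n_clusters_init=35)
print(merge_decision(0.95, n_clusters=10))
print(merge_decision(0.50, n_clusters=10, max_n_clusters=5))
```

Since clustering_epochs resets after each merge, every merge is followed by a fresh round of training before the next decision.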
References
Leiber, Collin, et al. “Dip-based deep embedded clustering with k-estimation.” Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021.
- fit(X: ndarray, y: ndarray = None) → DipDECK[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DipDECK algorithm
- Return type:
DipDECK
- class clustpy.deep.DipEncoder(n_clusters: int, batch_size: int = None, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, max_cluster_size_diff_factor: float = 3, clustering_loss_weight: float = 1.0, ssl_loss_weight: float = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases: _AbstractDeepClusteringAlgo
The DipEncoder. Can be used either as a clustering procedure if no ground truth labels are given or as a supervised dimensionality reduction technique. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, KMeans identifies the initial clusters. Last, the network will be optimized using the DipEncoder loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
batch_size (int) – size of the data batches for the actual training of the DipEncoder. Should be larger the more clusters we have. If it is None, it will be set to (25 x n_clusters) (default: None)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used (default: 3)
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss. If None, it will be equal to 1/(4L), where L is the reconstruction loss of the first batch of an untrained neural network (default: None)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels
- Type:
np.ndarray
- projection_axes_
The final projection axes between the clusters
- Type:
np.ndarray
- index_dict_
A dictionary to match the indices of two clusters to a projection axis
- Type:
dict
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import DipEncoder
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> dipencoder = DipEncoder(3, pretrain_epochs=3, clustering_epochs=3)
>>> dipencoder.fit(data)
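Two DipEncoder parameters fall back to computed values when left at None: batch_size becomes 25 x n_clusters and ssl_loss_weight becomes 1/(4L), where L is the reconstruction loss of the first batch of an untrained network. The following sketch only restates these two rules; the helper names are hypothetical and the loss value is a made-up number for illustration.

```python
def resolve_batch_size(n_clusters, batch_size=None):
    # Documented fallback: 25 samples per cluster.
    return 25 * n_clusters if batch_size is None else batch_size


def resolve_ssl_loss_weight(first_batch_loss, ssl_loss_weight=None):
    # Documented fallback: 1 / (4 L), with L the loss of the first batch.
    if ssl_loss_weight is None:
        return 1 / (4 * first_batch_loss)
    return ssl_loss_weight


print(resolve_batch_size(n_clusters=3))
print(resolve_ssl_loss_weight(first_batch_loss=0.5))
```

Explicitly passed values are returned unchanged, so the fallbacks only apply when the arguments are omitted.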
References
Leiber, Collin, Lena G. M. Bauer, Michael Neumayr, Claudia Plant, and Christian Böhm. “The DipEncoder: Enforcing Multimodality in Autoencoders.” Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2022.
- fit(X: ndarray, y: ndarray = None) → DipEncoder[source]
Initiate the actual clustering/dimensionality reduction process on the input data set. If no ground truth labels are given, the resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – The given (training) data set
y (np.ndarray) – The ground truth labels. If None, the DipEncoder will be used for clustering (default: None)
- Returns:
self – This instance of the DipEncoder
- Return type:
DipEncoder
- plot(X: ndarray, edge_width: float = 0.2, show_legend: bool = True) → None[source]
Plot the current state of the DipEncoder. First the data set will be encoded using the neural network, afterwards the plot will be created. Uses the plot_scatter_matrix as a basis and adds projection axes in red.
- Parameters:
X (np.ndarray) – The data set
edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots
show_legend (bool) – Specifies whether a legend should be added to the plot
- predict(X_train: ndarray, X_test: ndarray) → ndarray[source]
Predict the labels of the X_test dataset using the information gained by the fit function and the X_train dataset. Beware that the current labels influence the labels obtained by predict(). Therefore, it can occur that the outcome of dipencoder.fit(X) does not match dipencoder.predict(X).
- Parameters:
X_train (np.ndarray) – The data set used to train the DipEncoder (i.e. to retrieve the projection axes, modal intervals, …)
X_test (np.ndarray) – The data set for which we want to retrieve the labels
- Returns:
labels_pred – The predicted labels for X_test
- Return type:
np.ndarray
- set_predict_request(*, X_test: bool | None | str = '$UNCHANGED$', X_train: bool | None | str = '$UNCHANGED$') → DipEncoder
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in predict.
X_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_train parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.ENRC(n_clusters: list, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 20, init: str = 'nrkmeans', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = 10000, random_state: ~numpy.random.mtrand.RandomState | int = None, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, final_reclustering: bool = True, debug: bool = False)[source]
Bases: _AbstractDeepClusteringAlgo
The Embedded Non-Redundant Clustering (ENRC) algorithm.
- Parameters:
n_clusters (list) – list containing number of clusters for each clustering
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
P (list) – list containing projections for each clustering (optional) (default: None)
input_centers (list) – list containing the cluster centers for each clustering (optional) (default: None)
batch_size (int) – size of the data batches (default: 128)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)
tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1-tolerance_threshold) for all clusterings, the training stops before the maximum number of epochs (clustering_epochs) is reached. A higher value stops the training earlier; if set to 0 or None, the training continues until the labels no longer change (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
clustering_loss_weight (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network. Only used if neural_network is None (default: 20)
init (str) – choose which initialization strategy should be used. Has to be one of ‘nrkmeans’, ‘random’ or ‘sgd’ (default: ‘nrkmeans’)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization. If None, all data will be used. (default: 10,000)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
final_reclustering (bool) – If True, the final embedding will be reclustered with the provided init strategy (default: True)
debug (bool) – if True additional information during the training will be printed (default: False)
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
- Raises:
ValueError – if init is not one of ‘nrkmeans’, ‘random’, ‘auto’ or ‘sgd’
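The stopping rule described for tolerance_threshold can be sketched without any dependencies. The sketch below is an illustration, not ENRC's training loop. It assumes NMI with arithmetic-mean normalization and natural logarithms, and it adds a small eps (a hypothetical parameter) to absorb floating-point rounding when comparing against 1.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Normalized mutual information (assumed: arithmetic-mean normalization)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = sum((n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
             for (a, b), n_ab in count_ab.items())
    normalizer = (entropy(labels_a) + entropy(labels_b)) / 2
    # Two trivial labelings (one cluster each) are treated as identical.
    return 1.0 if normalizer == 0 else mi / normalizer


def should_stop(old_labelings, new_labelings, tolerance_threshold=None, eps=1e-12):
    """Stop if every clustering changed by at most the tolerated amount."""
    tolerance = 0.0 if tolerance_threshold is None else tolerance_threshold
    return all(nmi(old, new) >= 1 - tolerance - eps
               for old, new in zip(old_labelings, new_labelings))


# One list of labels per clustering; a pure relabelling counts as unchanged.
print(should_stop([[0, 0, 1, 1]], [[1, 1, 0, 0]]))
print(should_stop([[0, 0, 1, 1]], [[0, 1, 0, 1]], tolerance_threshold=0.1))
```

Because NMI is invariant to permutations of the label ids, swapping cluster ids between epochs does not prevent early stopping.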
References
Miklautz, Lukas & Dominik Mautz et al. “Deep embedded non-redundant clustering.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.
- fit(X: ndarray, y: ndarray = None) ENRC[source]
Cluster the input dataset with the ENRC algorithm. Saves the labels, centers, V, m, Betas, and P in the ENRC object. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – input data
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – returns the ENRC object
- Return type:
ENRC
- plot_subspace(X: ndarray, subspace_index: int = 0, labels: ndarray = None, plot_centers: bool = False, gt: ndarray = None, equal_axis: bool = False) None[source]
Plot the specified subspace as a scatter matrix plot.
- Parameters:
X (np.ndarray) – input data
subspace_index (int) – index of the subspace_nr (default: 0)
labels (np.ndarray) – the labels to use for the plot. If None, the labels found by Nr-Kmeans are used (default: None)
plot_centers (bool) – plot centers if True (default: False)
gt (np.ndarray) – ground truth labels (default: None)
equal_axis (bool) – equalize axis if True (default: False)
- Return type:
scatter matrix plot of the input data
- predict(X: ndarray = None, use_P: bool = True, dataloader: DataLoader = None) ndarray[source]
Predicts the labels for each clustering of X in a mini-batch manner.
- Parameters:
X (np.ndarray) – input data
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)
dataloader (torch.utils.data.DataLoader) – dataloader to be used. Can be None if X is given (default: None)
- Returns:
predicted_labels – n x c matrix, where n is the number of data points in X and c is the number of clusterings.
- Return type:
np.ndarray
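The n x c layout of the returned label matrix can be illustrated with a small sketch. The array below is a hypothetical stand-in for the output of predict (it is not produced by ENRC here); each column holds the labels of one clustering.

```python
import numpy as np

# Hypothetical stand-in for the output of ENRC.predict:
# 6 data points (rows) and 2 clusterings (columns).
predicted_labels = np.array([
    [0, 1],
    [0, 0],
    [1, 1],
    [1, 0],
    [2, 1],
    [2, 0],
])

n_points, n_clusterings = predicted_labels.shape

# Column j contains the cluster labels of clustering j for all points.
for j in range(n_clusterings):
    labels_j = predicted_labels[:, j]
    ids, counts = np.unique(labels_j, return_counts=True)
    sizes = dict(zip(ids.tolist(), counts.tolist()))
    print(f"clustering {j}: cluster sizes {sizes}")
```

Selecting a single column gives a label vector that can be passed to any standard clustering metric.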
- reconstruct_subspace_centroids(subspace_index: int = 0) ndarray[source]
Reconstructs the centroids in the specified subspace_nr.
- Parameters:
subspace_index (int) – index of the subspace_nr (default: 0)
- Returns:
centers_rec – reconstructed centers as np.ndarray
- Return type:
np.ndarray
- set_predict_request(*, dataloader: bool | None | str = '$UNCHANGED$', use_P: bool | None | str = '$UNCHANGED$') ENRC
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
dataloader (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for dataloader parameter in predict.
use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- transform_full_space(X: ndarray, embedded=False) ndarray[source]
Embeds the input dataset with the neural network and the matrix V from the ENRC object.
- Parameters:
X (np.ndarray) – input data
embedded (bool) – if True, then X is assumed to be already embedded (default: False)
- Returns:
rotated – The transformed data
- Return type:
np.ndarray
- transform_subspace(X: ndarray, subspace_index: int = 0, embedded: bool = False) ndarray[source]
Embeds the input dataset with the neural network and with the matrix V projected onto the specified subspace.
- Parameters:
X (np.ndarray) – input data
subspace_index (int) – index of the subspace_nr (default: 0)
embedded (bool) – if True, then X is assumed to be already embedded (default: False)
- Returns:
subspace – The transformed subspace
- Return type:
np.ndarray
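The relation between the two transformations can be sketched with NumPy. This is an illustration under assumptions, not the library's implementation: Z, V and P below are hypothetical stand-ins for the embedded data, the rotation matrix and the per-subspace dimension assignment, and the hard selection of dimensions via P is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedded data: 5 points in a 4-dimensional embedding.
Z = rng.normal(size=(5, 4))

# Hypothetical orthogonal matrix V, obtained here from a QR decomposition.
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))

# Hypothetical P: the rotated dimensions assigned to each subspace.
P = [np.array([0, 1]), np.array([2, 3])]

# Full-space transformation: rotate the embedding with V.
rotated = Z @ V

# Subspace transformation: keep only the dimensions of one subspace.
subspace_index = 0
subspace = rotated[:, P[subspace_index]]

print(rotated.shape, subspace.shape)
```

Because V is orthogonal in this sketch, the rotation preserves distances in the full space; only the subspace selection discards information.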
- class clustpy.deep.IDEC(n_clusters: int, alpha: float = 1.0, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, clustering_loss_weight: float = 0.1, ssl_loss_weight: float = 1.0, custom_dataloaders: tuple = None, augmentation_invariance: bool = False, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.cluster._kmeans.KMeans'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases:
DEC
The Improved Deep Embedded Clustering (IDEC) algorithm. It is identical to the DEC algorithm but also uses the self-supervised learning loss during the clustering optimization. Further, clustering_loss_weight defaults to 0.1 instead of 1.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
alpha (float) – alpha value for the prediction (default: 1.0)
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
clustering_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 0.1)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
augmentation_invariance (bool) – If True, augmented samples provided in custom_dataloaders[0] will be used to learn cluster assignments that are invariant to the augmentation transformations (default: False)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: KMeans)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dec_labels_
The final DEC labels
- Type:
np.ndarray
- dec_cluster_centers_
The final DEC cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import IDEC
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> idec = IDEC(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> idec.fit(data)
References
Guo, Xifeng, et al. “Improved deep embedded clustering with local structure preservation.” IJCAI. 2017.
- class clustpy.deep.N2D(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, pretrain_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, manifold_class: ~sklearn.base.TransformerMixin = <class 'sklearn.manifold._t_sne.TSNE'>, manifold_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases:
_AbstractDeepClusteringAlgo
The Not Too Deep (N2D) clustering algorithm. First, a neural network will be trained (will be skipped if input neural network is given). Afterward, t-SNE/UMAP/ISOMAP is executed on the embedded data, followed by the EM algorithm.
- Parameters:
n_clusters (int) – number of clusters
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new FeedforwardAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
manifold_class (TransformerMixin) – the manifold technique class (default: TSNE)
manifold_params (dict) – Parameters for the manifold execution. For example, perplexity can be changed for TSNE by setting manifold_params to {“n_components”: 2, “perplexity”: 25}. Check out e.g. sklearn.manifold.TSNE for more information (default: {“n_components”: 2})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- n_clusters
The final number of clusters
- Type:
int
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
- manifold_
The manifold object
- Type:
TransformerMixin
References
McConville, Ryan, et al. “N2D: (Not too) deep clustering via clustering the local manifold of an autoencoded embedding.” 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021.
- fit(X: ndarray, y: ndarray = None) N2D[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the N2D algorithm
- Return type:
N2D
- class clustpy.deep.VaDE(n_clusters: int, batch_size: int = 256, pretrain_optimizer_params: dict = None, clustering_optimizer_params: dict = None, pretrain_epochs: int = 10, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, ssl_loss_fn: ~torch.nn.modules.loss._Loss = BCELoss(), clustering_loss_weight: float = 1.0, ssl_loss_weight: float = 1.0, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_weights: str = None, embedding_size: int = 10, custom_dataloaders: tuple = None, initial_clustering_class: ~sklearn.base.ClusterMixin = <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>, initial_clustering_params: dict = None, device: ~torch.device = None, random_state: ~numpy.random.mtrand.RandomState | int = None)[source]
Bases:
_AbstractDeepClusteringAlgo
The Variational Deep Embedding (VaDE) algorithm. First, a variational autoencoder (VAE) will be trained (will be skipped if input neural network is given). Afterward, a GMM will be fit to identify the initial clustering structures. Last, the VAE will be optimized using the VaDE loss function.
- Parameters:
n_clusters (int) – number of clusters. Can be None if a corresponding initial_clustering_class is given, that can determine the number of clusters, e.g. DBSCAN
batch_size (int) – size of the data batches (default: 256)
pretrain_optimizer_params (dict) – parameters of the optimizer for the pretraining of the neural network, includes the learning rate (default: {“lr”: 1e-3})
clustering_optimizer_params (dict) – parameters of the optimizer for the actual clustering procedure, includes the learning rate (default: {“lr”: 1e-4})
pretrain_epochs (int) – number of epochs for the pretraining of the neural network (default: 10)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.BCELoss(reduction=’sum’))
clustering_loss_weight (float) – weight of the clustering loss (default: 1.0)
ssl_loss_weight (float) – weight of the self-supervised learning (ssl) loss (default: 1.0)
neural_network (torch.nn.Module | tuple) – the input neural network. If None, a new VariationalAutoencoder will be created. Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
embedding_size (int) – size of the embedding within the neural network (central layer with mean and variance) (default: 10)
custom_dataloaders (tuple) – tuple consisting of a trainloader (random order) at the first and a test loader (non-random order) at the second position. Can also be a tuple of strings, where the first entry is the path to a saved trainloader and the second entry the path to a saved testloader. In this case the dataloaders will be loaded by torch.load(PATH). If None, the default dataloaders will be used (default: None)
initial_clustering_class (ClusterMixin) – clustering class to obtain the initial cluster labels after the pretraining (default: GaussianMixture)
initial_clustering_params (dict) – parameters for the initial clustering class (default: {“n_init”: 10, “covariance_type”: “diag”})
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The labels as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- cluster_centers_
The cluster centers as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- covariances_
The covariance matrices as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- weights_
The weights as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- vade_labels_
The labels as identified by VaDE after the training terminated
- Type:
np.ndarray
- vade_cluster_centers_
The cluster centers as identified by VaDE after the training terminated
- Type:
np.ndarray
- vade_covariances_
The covariance matrices as identified by VaDE after the training terminated
- Type:
np.ndarray
- neural_network
The final neural network
- Type:
torch.nn.Module
Examples
>>> import numpy as np
>>> from clustpy.data import create_subspace_data
>>> from clustpy.deep import VaDE
>>> data, labels = create_subspace_data(1500, subspace_features=(3, 50), random_state=1)
>>> data = (data - np.mean(data)) / np.std(data)
>>> vade = VaDE(n_clusters=3, pretrain_epochs=3, clustering_epochs=3)
>>> vade.fit(data)
References
Jiang, Zhuxi, et al. “Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering.” IJCAI. 2017.
- fit(X: ndarray, y: ndarray = None) VaDE[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the VaDE algorithm
- Return type:
VaDE
- clustpy.deep.decode_batchwise(dataloader: DataLoader, neural_network: Module) ndarray[source]
Utility function for decoding the whole data set in a mini-batch fashion, e.g., with an autoencoder. Note: Assumes an implemented decode function
- Parameters:
dataloader (torch.utils.data.DataLoader) – dataloader to be used
neural_network (torch.nn.Module) – the neural network that is used for the decoding (e.g. an autoencoder)
device (torch.device) – device to be trained on
- Returns:
reconstructions_numpy – The reconstructed data set
- Return type:
np.ndarray
- clustpy.deep.detect_device(device: device | int | str = None) device[source]
Automatically detects if you have a cuda enabled GPU. Device can also be read from environment variable “CLUSTPY_DEVICE”. It can be set using, e.g., os.environ[“CLUSTPY_DEVICE”] = “cuda:1”
- Parameters:
device (torch.device | int | str) – the input device. Will be returned if it is not None (default: None)
- Returns:
device – device on which the prediction should take place
- Return type:
torch.device
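The lookup order described above can be sketched without torch. The helper resolve_device_name is hypothetical and only mirrors the documented behaviour with strings; the assumption that the environment variable is consulted before automatic detection, and the "cpu" fallback, are not confirmed by this page.

```python
import os


def resolve_device_name(device=None, cuda_available=False):
    """Hypothetical stand-in that mirrors the documented lookup order."""
    # An explicitly passed device is returned unchanged.
    if device is not None:
        return str(device)
    # Otherwise the environment variable is consulted (assumed order).
    env_device = os.environ.get("CLUSTPY_DEVICE")
    if env_device is not None:
        return env_device
    # Assumed fallback: use a GPU if one is available, else the CPU.
    return "cuda" if cuda_available else "cpu"


os.environ["CLUSTPY_DEVICE"] = "cuda:1"
from_env = resolve_device_name()        # taken from the environment
explicit = resolve_device_name("cpu")   # explicit argument wins

del os.environ["CLUSTPY_DEVICE"]
fallback = resolve_device_name()        # no argument, no variable
```

Setting the environment variable once at the start of a script avoids passing the device argument to every algorithm.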
- clustpy.deep.encode_batchwise(dataloader: DataLoader, neural_network: Module) ndarray[source]
Utility function for embedding the whole data set in a mini-batch fashion
- Parameters:
dataloader (torch.utils.data.DataLoader) – dataloader to be used
neural_network (torch.nn.Module) – the neural network that is used for the encoding (e.g. an autoencoder)
- Returns:
embeddings_numpy – The embedded data set
- Return type:
np.ndarray
- clustpy.deep.encode_decode_batchwise(dataloader: ~torch.utils.data.dataloader.DataLoader, neural_network: ~torch.nn.modules.module.Module) -> (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Utility function for encoding and decoding the whole data set in a mini-batch fashion, e.g., with an autoencoder. Note: Assumes an implemented decode function
- Parameters:
dataloader (torch.utils.data.DataLoader) – dataloader to be used
neural_network (torch.nn.Module) – the neural network that is used for the encoding and decoding (e.g. an autoencoder)
- Returns:
tuple – The embedded data set, The reconstructed data set
- Return type:
(np.ndarray, np.ndarray)
- clustpy.deep.get_dataloader(X: ~numpy.ndarray | ~torch.Tensor, batch_size: int, shuffle: bool = True, drop_last: bool = False, additional_inputs: list | ~numpy.ndarray | ~torch.Tensor = None, dataset_class: ~torch.utils.data.dataset.Dataset = <class 'clustpy.deep._data_utils._ClustpyDataset'>, ds_kwargs: dict = None, dl_kwargs: dict = None) DataLoader[source]
Create a dataloader for Deep Clustering algorithms. First entry always contains the indices of the data samples. Second entry always contains the actual data samples. If for example labels are desired, they can be passed through the additional_inputs parameter (should be a list). Other customizations (e.g. augmentation) can be implemented using a custom dataset_class. This custom class should stick to the conventions, [index, data, …].
- Parameters:
X (np.ndarray | torch.Tensor) – the actual data set (can be np.ndarray or torch.Tensor)
batch_size (int) – the batch size
shuffle (bool) – boolean that defines if the data set should be shuffled (default: True)
drop_last (bool) – boolean that defines if the last batch should be ignored (default: False)
additional_inputs (list | np.ndarray | torch.Tensor) – additional inputs for the dataloader, e.g. labels. Can be None, np.ndarray, torch.Tensor or a list containing np.ndarrays/torch.Tensors (default: None)
dataset_class (torch.utils.data.Dataset) – defines the class of the tensor dataset that is contained in the dataloader (default: _ClustpyDataset)
ds_kwargs (dict) – other arguments for dataset_class. An example usage would be to include augmentation or preprocessing transforms in the _ClustpyDataset by passing ds_kwargs={“aug_transforms_list”:[aug_transforms], “orig_transforms_list”:[orig_transforms]}, where aug_transforms and orig_transforms transform the input X, e.g., using torchvision.transforms.Compose to combine multiple transformations.
Important: If aug_transforms_list is passed via ds_kwargs, the values returned by the dataloader change. The first entry will still be the indices of the data samples, but the second entry will be the transformed version of the actual data samples and the third entry will be the original data samples. If orig_transforms_list is passed as well, the third entry will be transformed accordingly; this might be needed for preprocessing the data. An example for MNIST is shown below.
dl_kwargs (dict) – other arguments for torch.utils.data.DataLoader
Examples
>>> # Example for usage of data transformations with get_dataloader
>>> from clustpy.data import load_mnist
>>> from clustpy.deep import get_dataloader
>>> import torch
>>> import torchvision
>>> # load and prepare data for torchvision.transforms
>>> data, labels = load_mnist()
>>> data = data.reshape(-1, 1, 28, 28)
>>> data /= 255.0
>>> data = torch.from_numpy(data).float()
>>> #
>>> # preprocessing functions
>>> mean = data.mean()
>>> std = data.std()
>>> normalize_fn = torchvision.transforms.Normalize([mean], [std])
>>> # flatten is only needed if a FeedForward network is used, otherwise this can be skipped.
>>> flatten_fn = torchvision.transforms.Lambda(torch.flatten)
>>> #
>>> # augmentation transforms
>>> transform_list = [
...     # transform input tensor to PIL image for augmentation
...     torchvision.transforms.ToPILImage(),
...     # apply transformations
...     torchvision.transforms.RandomAffine(degrees=(-16,+16),
...                                         translate=(0.1, 0.1),
...                                         shear=(-8, 8),
...                                         fill=0),
...     # transform back to torch.tensor
...     torchvision.transforms.ToTensor(),
...     # preprocess and flatten
...     normalize_fn,
...     flatten_fn,
... ]
>>> #
>>> # augmentation transforms
>>> aug_transforms = torchvision.transforms.Compose(transform_list)
>>> # preprocessing transforms without augmentation
>>> orig_transforms = torchvision.transforms.Compose([normalize_fn, flatten_fn])
>>> #
>>> # pass transforms to dataloader
>>> aug_dl = get_dataloader(data, batch_size=32, shuffle=True,
...                         ds_kwargs={"aug_transforms_list":[aug_transforms], "orig_transforms_list":[orig_transforms]},
...                         )
- Returns:
dataloader – The final dataloader
- Return type:
torch.utils.data.DataLoader
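The [index, data, …] batch convention can be illustrated without torch. The generator iterate_batches below is a hypothetical, simplified stand-in for a dataloader (no shuffling, no transforms); it only shows which entry of a batch holds what.

```python
import numpy as np


def iterate_batches(X, batch_size, additional_inputs=None):
    """Hypothetical, torch-free illustration of the [index, data, ...] layout."""
    extras = [] if additional_inputs is None else list(additional_inputs)
    for start in range(0, len(X), batch_size):
        idx = np.arange(start, min(start + batch_size, len(X)))
        # First entry: sample indices. Second entry: the samples themselves.
        # Any additional inputs (e.g. labels) follow in the order given.
        yield [idx, X[idx]] + [extra[idx] for extra in extras]


X = np.arange(20, dtype=float).reshape(10, 2)
labels = np.arange(10) % 3

batches = list(iterate_batches(X, batch_size=4, additional_inputs=[labels]))
first = batches[0]
print(first[0])        # indices of the samples in the first batch
print(first[1].shape)  # the data samples of the first batch
print(first[2])        # the additional input (labels) of the first batch
```

Code that consumes such batches should therefore read the data from position 1, not position 0.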
- clustpy.deep.get_default_augmented_dataloaders(X: ~numpy.ndarray | ~torch.Tensor, batch_size: int = 256, conv_used: bool = False, flatten: bool = True) -> (<class 'torch.utils.data.dataloader.DataLoader'>, <class 'torch.utils.data.dataloader.DataLoader'>)[source]
Get a train and a test dataloader using default augmentations. These transformations correspond to a min-max normalization followed by torchvision.transforms.RandomAffine(degrees=(-16, +16), translate=(0.1, 0.1), shear=(-8, 8), fill=0) and a channel-wise z-transformation. Optionally, the images can be flattened afterward.
- Parameters:
X (np.ndarray | torch.Tensor) – the actual data set (can be np.ndarray or torch.Tensor)
batch_size (int) – the batch size (default: 256)
conv_used (bool) – defines whether a convolutional network will be used afterward. In this case, grayscale images will be transformed to receive three color channels by copying the grayscale channel three times (default: False)
flatten (bool) – defines whether the augmented images should be flattened afterward. Must be False if conv_used is True (default: True)
- Returns:
tuple – The trainloader (with augmentations), The testloader (without augmentations)
- Return type:
(torch.utils.data.DataLoader, torch.utils.data.DataLoader)
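The deterministic parts of this pipeline (min-max normalization, channel-wise z-transformation, flattening, channel copying) can be sketched with NumPy. This is an illustration under assumptions, not the library's code: the random image batch is hypothetical, the random affine augmentation is omitted, and the exact order of operations inside the library may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grayscale batch in (N, C, H, W) layout with values in [0, 255].
images = rng.integers(0, 256, size=(8, 1, 28, 28)).astype(np.float32)

# Min-max normalization to the range [0, 1].
images = (images - images.min()) / (images.max() - images.min())

# (For the trainloader, the random affine augmentation would be applied here.)

# Channel-wise z-transformation: zero mean and unit variance per channel.
mean = images.mean(axis=(0, 2, 3), keepdims=True)
std = images.std(axis=(0, 2, 3), keepdims=True)
images_z = (images - mean) / std

# flatten=True: one vector per image, e.g. for a feedforward network.
flat = images_z.reshape(len(images_z), -1)

# conv_used=True: copy the grayscale channel three times (no flattening).
images_rgb = np.repeat(images_z, 3, axis=1)

print(flat.shape, images_rgb.shape)
```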
- clustpy.deep.get_device_from_module(neural_network: Module) device[source]
Get the device from a given module.
- Parameters:
neural_network (torch.nn.Module) – the neural network that is used for the encoding (e.g. an autoencoder)
- Returns:
device – device of the module
- Return type:
torch.device
- clustpy.deep.get_trained_network(trainloader: ~torch.utils.data.dataloader.DataLoader = None, data: ~numpy.ndarray = None, n_epochs: int = 100, batch_size: int = 128, optimizer_params: dict = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, device=None, ssl_loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), embedding_size: int = 10, neural_network: ~torch.nn.modules.module.Module | tuple = None, neural_network_class: ~torch.nn.modules.module.Module = <class 'clustpy.deep.neural_networks.feedforward_autoencoder.FeedforwardAutoencoder'>, neural_network_params: dict = None, neural_network_weights: str = None, random_state: ~numpy.random.mtrand.RandomState | int = None) Module[source]
This function returns a trained neural network. The following cases are considered:
- If the neural network is initialized and trained (neural_network.fitted==True), the input neural network is returned without training it again.
- If the neural network is initialized and not trained (neural_network.fitted==False), it will be fitted (neural_network.fitted will be set to True) using default parameters.
- If the neural network is None, a new neural network is created using neural_network_class, and it will be fitted as described above.
Beware: the input neural_network_class or neural_network object needs both a fit() function and the fitted attribute. See clustpy.deep.neural_networks.feedforward_autoencoder.FeedforwardAutoencoder for an example.
- Parameters:
trainloader (torch.utils.data.DataLoader) – dataloader used to train neural_network (default: None)
data (np.ndarray) – train data set. If data is passed then trainloader can remain empty (default: None)
n_epochs (int) – number of training epochs (default: 100)
batch_size (int) – size of the data batches (default: 128)
optimizer_params (dict) – parameters of the optimizer for the neural network training, includes the learning rate (default: {“lr”: 1e-3})
optimizer_class (torch.optim.Optimizer) – optimizer for training (default: torch.optim.Adam)
device (torch.device) – The device on which to perform the computations. If device is None then it will be automatically chosen: if a gpu is available the gpu with the highest amount of free memory will be chosen (default: None)
ssl_loss_fn (torch.nn.modules.loss._Loss) – self-supervised learning (ssl) loss function for training the network, e.g. reconstruction loss for autoencoders (default: torch.nn.MSELoss())
embedding_size (int) – dimension of the innermost layer of the neural network (default: 10)
neural_network (torch.nn.Module | tuple) – neural network object to be trained (optional) Can also be a tuple consisting of the neural network class (torch.nn.Module) and the initialization parameters (dict) (default: None)
neural_network_class (torch.nn.Module) – The neural network class that should be used (default: FeedforwardAutoencoder)
neural_network_params (dict) – Parameters to be used when creating a new neural network using the neural_network_class (default: None)
neural_network_weights (str) – Path to a file containing the state_dict of the neural_network (default: None)
random_state (np.random.RandomState | int) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- Returns:
neural_network – The fitted neural network
- Return type:
torch.nn.Module
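The three cases handled by get_trained_network can be sketched in plain Python. DummyNetwork and get_trained are hypothetical stand-ins that only model the fit() function and the fitted attribute; they are not the library's implementation.

```python
class DummyNetwork:
    """Hypothetical stand-in exposing the required fit() and fitted members."""

    def __init__(self):
        self.fitted = False
        self.n_fit_calls = 0

    def fit(self, data):
        # A real network would train here; the sketch only records the call.
        self.n_fit_calls += 1
        self.fitted = True
        return self


def get_trained(data, neural_network=None, neural_network_class=DummyNetwork):
    # No network given: create a new one from the class.
    if neural_network is None:
        neural_network = neural_network_class()
    # Train only if the network has not been fitted yet.
    if not neural_network.fitted:
        neural_network.fit(data)
    # An already fitted network is returned without further training.
    return neural_network


data = [[0.0, 1.0], [1.0, 0.0]]

pretrained = DummyNetwork().fit(data)
same = get_trained(data, neural_network=pretrained)  # returned as-is
fresh = get_trained(data)                            # created and fitted
```

Checking the fitted flag before training is what allows a pretrained network to be shared between several clustering runs.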
- clustpy.deep.predict_batchwise(dataloader: DataLoader, neural_network: Module, cluster_module: Module) ndarray[source]
Utility function for predicting the cluster labels over the whole data set in a mini-batch fashion. Method calls the predict_hard method of the cluster_module for each batch of data.
- Parameters:
dataloader (torch.utils.data.DataLoader) – dataloader to be used
neural_network (torch.nn.Module) – the neural network that is used for the encoding (e.g. an autoencoder)
cluster_module (torch.nn.Module) – the cluster module that is used for the prediction (e.g. DEC). Must provide the predict_hard method.
- Returns:
predictions_numpy – The predictions of the cluster_module for the data set
- Return type:
np.ndarray
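The mini-batch prediction pattern can be sketched with NumPy. ToyEncoder, ToyClusterModule and predict_in_batches are hypothetical stand-ins (the real function works with torch modules and a DataLoader); the sketch only shows that each batch is encoded, passed to predict_hard, and the per-batch results are concatenated.

```python
import numpy as np


class ToyEncoder:
    """Hypothetical encoder: keeps the first two features as the embedding."""

    def encode(self, batch):
        return batch[:, :2]


class ToyClusterModule:
    """Hypothetical cluster module with a predict_hard method."""

    def __init__(self, centers):
        self.centers = np.asarray(centers, dtype=float)

    def predict_hard(self, embedded):
        # Squared distance of every embedded point to every center.
        diffs = embedded[:, None, :] - self.centers[None, :, :]
        return (diffs ** 2).sum(axis=2).argmin(axis=1)


def predict_in_batches(batches, encoder, cluster_module):
    predictions = []
    for batch in batches:
        data = batch[1]  # entry 0 holds the indices, entry 1 the samples
        predictions.append(cluster_module.predict_hard(encoder.encode(data)))
    return np.concatenate(predictions)


X = np.array([[0.0, 0.1, 9.0], [5.0, 5.2, 9.0], [0.2, 0.0, 9.0],
              [4.9, 5.0, 9.0], [5.1, 4.8, 9.0]])
# Non-shuffled batches of size 2 in the [index, data] layout.
batches = [[np.arange(i, min(i + 2, len(X))), X[i:i + 2]]
           for i in range(0, len(X), 2)]

labels = predict_in_batches(batches, ToyEncoder(),
                            ToyClusterModule([[0, 0], [5, 5]]))
```

The batches must be iterated in a fixed, non-random order so that the concatenated labels line up with the rows of the data set.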