clustpy.deep package
Submodules
clustpy.deep.dcn module
@authors: Lukas Miklautz, Dominik Mautz
- class clustpy.deep.dcn.DCN(n_clusters: int, batch_size: int = 256, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), degree_of_space_distortion: float = 0.05, degree_of_space_preservation: float = 1.0, autoencoder: ~torch.nn.modules.module.Module | None = None, embedding_size: int = 10, random_state: ~numpy.random.mtrand.RandomState | None = None)[source]
Bases:
BaseEstimator, ClusterMixin
The Deep Clustering Network (DCN) algorithm. First, an autoencoder (AE) is trained (this step is skipped if an input autoencoder is given). Afterwards, KMeans identifies the initial clusters. Finally, the AE is optimized using the DCN loss function.
- Parameters:
n_clusters (int) – number of clusters
batch_size (int) – size of the data batches (default: 256)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.MSELoss())
degree_of_space_distortion (float) – weight of the clustering loss (default: 0.05)
degree_of_space_preservation (float) – weight of the reconstruction loss (default: 1.0)
autoencoder (torch.nn.Module) – the input autoencoder. If None a new FlexibleAutoencoder will be created (default: None)
embedding_size (int) – size of the embedding within the autoencoder (default: 10)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dcn_labels_
The final DCN labels
- Type:
np.ndarray
- dcn_cluster_centers_
The final DCN cluster centers
- Type:
np.ndarray
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
Examples
from clustpy.data import load_mnist
from clustpy.deep import DCN
data, labels = load_mnist()
dcn = DCN(n_clusters=10)
dcn.fit(data)
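The effect of degree_of_space_distortion and degree_of_space_preservation can be illustrated with a minimal sketch in plain Python. It assumes that the total loss is a weighted sum of a reconstruction loss and a k-means-style clustering loss (mean squared distance of each embedded sample to its closest center). The helper names are hypothetical and are not part of clustpy.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_style_loss(embedded, centers):
    """Mean squared distance of each embedded sample to its closest center."""
    closest = [min(squared_distance(z, c) for c in centers) for z in embedded]
    return sum(closest) / len(embedded)


def weighted_total_loss(reconstruction_loss, clustering_loss,
                        degree_of_space_preservation=1.0,
                        degree_of_space_distortion=0.05):
    """Assumed combination: weighted sum of both loss terms."""
    return (degree_of_space_preservation * reconstruction_loss
            + degree_of_space_distortion * clustering_loss)


# Toy embedded samples and cluster centers (illustrative values only).
embedded = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
centers = [[0.0, 0.0], [4.0, 4.0]]

clustering_loss = kmeans_style_loss(embedded, centers)
total_loss = weighted_total_loss(0.5, clustering_loss)
print(clustering_loss, total_loss)
```

Under this assumption, the default weights scale the clustering term down to 5% of its value, so the reconstruction term dominates the total loss.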
References
Yang, Bo, et al. “Towards k-means-friendly spaces: Simultaneous deep learning and clustering.” international conference on machine learning. PMLR, 2017.
- fit(X: ndarray, y: ndarray | None = None) DCN[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DCN algorithm
- Return type:
DCN
clustpy.deep.dec module
@authors: Lukas Miklautz, Dominik Mautz, Collin Leiber
- class clustpy.deep.dec.DEC(n_clusters: int, alpha: float = 1.0, batch_size: int = 256, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), autoencoder: ~torch.nn.modules.module.Module | None = None, embedding_size: int = 10, use_reconstruction_loss: bool = False, cluster_loss_weight: float = 1, random_state: ~numpy.random.mtrand.RandomState | None = None)[source]
Bases:
BaseEstimator, ClusterMixin
The Deep Embedded Clustering (DEC) algorithm. First, an autoencoder (AE) is trained (this step is skipped if an input autoencoder is given). Afterwards, KMeans identifies the initial clusters. Finally, the AE is optimized using the DEC loss function.
- Parameters:
n_clusters (int) – number of clusters
alpha (float) – alpha value for the prediction (default: 1.0)
batch_size (int) – size of the data batches (default: 256)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.MSELoss())
autoencoder (torch.nn.Module) – the input autoencoder. If None a new FlexibleAutoencoder will be created (default: None)
embedding_size (int) – size of the embedding within the autoencoder (default: 10)
use_reconstruction_loss (bool) – defines whether the reconstruction loss will be used during clustering training (default: False)
cluster_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 1)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dec_labels_
The final DEC labels
- Type:
np.ndarray
- dec_cluster_centers_
The final DEC cluster centers
- Type:
np.ndarray
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
Examples
from clustpy.data import load_mnist
from clustpy.deep import DEC
data, labels = load_mnist()
dec = DEC(n_clusters=10)
dec.fit(data)
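The role of alpha can be illustrated with the soft assignment described in the DEC paper (Xie et al., 2016), where an embedded sample z is assigned to a cluster center mu with a weight proportional to (1 + ||z - mu||^2 / alpha)^(-(alpha + 1) / 2). The following plain-Python sketch is illustrative only: the function names are hypothetical and the code is not taken from the clustpy implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_assignments(z, centers, alpha=1.0):
    """Student's t-kernel weights of one embedded sample, normalized to sum to 1."""
    exponent = -(alpha + 1.0) / 2.0
    weights = [(1.0 + squared_distance(z, c) / alpha) ** exponent for c in centers]
    total = sum(weights)
    return [w / total for w in weights]


# One embedded sample and two cluster centers (illustrative values only).
q = soft_assignments([0.0, 0.0], [[0.0, 0.0], [1.0, 0.0]], alpha=1.0)
print(q)
```

The sample coincides with the first center, so that center receives the larger share of the assignment.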
References
Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis.” International conference on machine learning. 2016.
- fit(X: ndarray, y: ndarray | None = None) DEC[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DEC algorithm
- Return type:
DEC
- class clustpy.deep.dec.IDEC(n_clusters: int, alpha: float = 1.0, batch_size: int = 256, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), autoencoder: ~torch.nn.modules.module.Module | None = None, embedding_size: int = 10, random_state: ~numpy.random.mtrand.RandomState | None = None)[source]
Bases:
DEC
The Improved Deep Embedded Clustering (IDEC) algorithm. Implemented as a child of the DEC class. It therefore matches the __init__ of DEC, but with fixed use_reconstruction_loss=True and cluster_loss_weight=0.1.
- Parameters:
n_clusters (int) – number of clusters
alpha (float) – alpha value for the prediction (default: 1.0)
batch_size (int) – size of the data batches (default: 256)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.MSELoss())
autoencoder (torch.nn.Module) – the input autoencoder. If None a new FlexibleAutoencoder will be created (default: None)
embedding_size (int) – size of the embedding within the autoencoder (default: 10)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dec_labels_
The final DEC labels
- Type:
np.ndarray
- dec_cluster_centers_
The final DEC cluster centers
- Type:
np.ndarray
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
Examples
from clustpy.data import load_mnist
from clustpy.deep import IDEC
data, labels = load_mnist()
idec = IDEC(n_clusters=10)
idec.fit(data)
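The difference between DEC and IDEC described above comes down to two fixed settings. The sketch below shows how they might enter the loss, assuming the total loss is the weighted clustering loss plus an optional reconstruction loss. The function dec_style_loss and the loss values are hypothetical and only serve as an illustration.

```python
def dec_style_loss(clustering_loss, reconstruction_loss,
                   use_reconstruction_loss=False, cluster_loss_weight=1.0):
    """Assumed combination of the two loss terms for DEC-like training."""
    total = cluster_loss_weight * clustering_loss
    if use_reconstruction_loss:
        total += reconstruction_loss
    return total


# Settings that the documentation states are fixed for IDEC.
IDEC_FIXED_SETTINGS = {"use_reconstruction_loss": True, "cluster_loss_weight": 0.1}

dec_loss = dec_style_loss(2.0, 0.5)                          # DEC defaults
idec_loss = dec_style_loss(2.0, 0.5, **IDEC_FIXED_SETTINGS)  # IDEC settings
print(dec_loss, idec_loss)
```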
References
Guo, Xifeng, et al. “Improved deep embedded clustering with local structure preservation.” IJCAI. 2017.
clustpy.deep.dipdeck module
@authors: Collin Leiber
- class clustpy.deep.dipdeck.DipDECK(n_clusters_init: int = 35, dip_merge_threshold: float = 0.9, cluster_loss_weight: float = 1, max_n_clusters: int = inf, min_n_clusters: int = 1, batch_size: int = 256, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), autoencoder: ~torch.nn.modules.module.Module | None = None, embedding_size: int = 5, max_cluster_size_diff_factor: float = 2, pval_strategy: str = 'table', n_boots: int = 1000, random_state: ~numpy.random.mtrand.RandomState | None = None, debug: bool = False)[source]
Bases:
BaseEstimator, ClusterMixin
The Deep Embedded Clustering with k-Estimation (DipDECK) algorithm. First, an autoencoder (AE) is trained (this step is skipped if an input autoencoder is given). Afterwards, KMeans identifies the initial clusters using an overestimated number of clusters. Finally, the AE is optimized using the DipDECK loss function. If any Dip-value exceeds the dip_merge_threshold, the corresponding clusters are merged.
- Parameters:
n_clusters_init (int) – initial number of clusters (default: 35)
dip_merge_threshold (float) – threshold regarding the Dip-p-value that defines if two clusters should be merged. Must be between 0 and 1 (default: 0.9)
cluster_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 1)
max_n_clusters (int) – maximum number of clusters. Must be larger than min_n_clusters. If the result has more clusters, a merge will be forced (default: np.inf)
min_n_clusters (int) – minimum number of clusters. Must be larger than 0, smaller than max_n_clusters and smaller than n_clusters_init. When this number of clusters is reached, all further merge processes will be hindered (default: 1)
batch_size (int) – size of the data batches (default: 256)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure. Will reset after each merge (default: 50)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.MSELoss())
autoencoder (torch.nn.Module) – the input autoencoder. If None a new FlexibleAutoencoder will be created (default: None)
embedding_size (int) – size of the embedding within the autoencoder (default: 5)
max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for the Dip calculation (default: 2)
pval_strategy (str) – Defines which strategy to use to obtain Dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)
n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – If true, additional information will be printed to the console (default: False)
- labels_
The final labels
- Type:
np.ndarray
- n_clusters_
The final number of clusters
- Type:
int
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
Examples
from clustpy.data import load_mnist
from clustpy.deep import DipDECK
data, labels = load_mnist()
dipdeck = DipDECK()
dipdeck.fit(data)
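How dip_merge_threshold, min_n_clusters and max_n_clusters interact can be sketched as a simple decision rule. The code below is a plain-Python illustration of the parameter descriptions above, not the clustpy implementation. The function names and the dictionary of pairwise Dip-p-values are hypothetical.

```python
def best_merge_candidate(pvalues):
    """Return the cluster pair with the highest Dip-p-value and that value."""
    pair = max(pvalues, key=pvalues.get)
    return pair, pvalues[pair]


def should_merge(best_dip_pvalue, n_clusters, dip_merge_threshold=0.9,
                 min_n_clusters=1, max_n_clusters=float("inf")):
    """Decide whether the best candidate pair should be merged."""
    if n_clusters <= min_n_clusters:
        return False  # lower bound reached: further merges are prevented
    if n_clusters > max_n_clusters:
        return True   # too many clusters: a merge is forced
    return best_dip_pvalue > dip_merge_threshold


# Hypothetical pairwise Dip-p-values for three clusters.
pvalues = {(0, 1): 0.35, (0, 2): 0.97, (1, 2): 0.10}
pair, pvalue = best_merge_candidate(pvalues)
merge = should_merge(pvalue, n_clusters=3)
print(pair, pvalue, merge)
```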
References
Leiber, Collin, et al. “Dip-based deep embedded clustering with k-estimation.” Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021.
- fit(X: ndarray, y: ndarray | None = None) DipDECK[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DipDECK algorithm
- Return type:
DipDECK
clustpy.deep.dipencoder module
@authors: Collin Leiber
- class clustpy.deep.dipencoder.DipEncoder(n_clusters: int, pretrain_batch_size: int = 256, batch_size: int | None = None, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), autoencoder: ~torch.nn.modules.module.Module | None = None, embedding_size: int = 10, max_cluster_size_diff_factor: float = 3, random_state: ~numpy.random.mtrand.RandomState | None = None, debug: bool = False)[source]
Bases:
BaseEstimator, ClusterMixin
The DipEncoder. Can be used either as a clustering procedure if no ground truth labels are given or as a supervised dimensionality reduction technique. First, an autoencoder (AE) is trained (this step is skipped if an input autoencoder is given). Afterwards, KMeans identifies the initial clusters. Finally, the AE is optimized using the DipEncoder loss function.
- Parameters:
n_clusters (int) – number of clusters
pretrain_batch_size (int) – size of the data batches for the pretraining (default: 256)
batch_size (int) – size of the data batches for the actual training of the DipEncoder. Should be larger the more clusters we have. If it is None, it will be set to (25 x n_clusters) (default: None)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.MSELoss())
autoencoder (torch.nn.Module) – the input autoencoder. If None a new FlexibleAutoencoder will be created (default: None)
embedding_size (int) – size of the embedding within the autoencoder (default: 10)
max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used (default: 3)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – If true, additional information will be printed to the console (default: False)
- labels_
The final labels
- Type:
np.ndarray
- projection_axes_
The final projection axes between the clusters
- Type:
np.ndarray
- index_dict_
A dictionary to match the indices of two clusters to a projection axis
- Type:
dict
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
Examples
from clustpy.data import load_mnist
from clustpy.deep import DipEncoder
data, labels = load_mnist()
dipencoder = DipEncoder(10)
dipencoder.fit(data)
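Two of the parameter rules above lend themselves to a short sketch: the default batch_size of 25 x n_clusters and the cap imposed by max_cluster_size_diff_factor. The helper names are hypothetical, and truncating the cap to an integer is an assumption. The code only restates the documented rules and is not the clustpy implementation.

```python
def resolve_batch_size(n_clusters, batch_size=None):
    """Use 25 x n_clusters if no batch size is given."""
    return 25 * n_clusters if batch_size is None else batch_size


def n_samples_used_from_larger_cluster(size_a, size_b, max_cluster_size_diff_factor=3):
    """Number of samples of the larger cluster that are kept for a comparison."""
    smaller, larger = sorted((size_a, size_b))
    cap = int(max_cluster_size_diff_factor * smaller)  # assumed rounding
    return min(larger, cap)


print(resolve_batch_size(10))                         # default rule
print(resolve_batch_size(10, batch_size=64))          # explicit value wins
print(n_samples_used_from_larger_cluster(100, 1000))  # capped
print(n_samples_used_from_larger_cluster(100, 200))   # below the cap
```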
References
Leiber, Collin and Bauer, Lena G. M. and Neumayr, Michael and Plant, Claudia and Böhm, Christian “The DipEncoder: Enforcing Multimodality in Autoencoders.” Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2022.
- fit(X: ndarray, y: ndarray | None = None) DipEncoder[source]
Initiate the actual clustering/dimensionality reduction process on the input data set. If no ground truth labels are given, the resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – The given (training) data set
y (np.ndarray) – The ground truth labels. If None, the DipEncoder will be used for clustering (default: None)
- Returns:
self – This instance of the DipEncoder
- Return type:
DipEncoder
- plot(X: ndarray, edge_width: float = 0.2, show_legend: bool = True) None[source]
Plot the current state of the DipEncoder. First the data set will be encoded using the autoencoder, afterwards the plot will be created. Uses the plot_scatter_matrix as a basis and adds projection axes in red.
- Parameters:
X (np.ndarray) – The data set
edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots
show_legend (bool) – Specifies whether a legend should be added to the plot
- predict(X_train: ndarray, X_test: ndarray) ndarray[source]
Predict the labels of the X_test dataset using the information gained by the fit function and the X_train dataset.
- Parameters:
X_train (np.ndarray) – The data set used to train the DipEncoder (i.e. to retrieve the projection axes, modal intervals, …)
X_test (np.ndarray) – The data set for which we want to retrieve the labels
- Returns:
labels_pred – The predicted labels for X_test
- Return type:
np.ndarray
- set_predict_request(*, X_test: bool | None | str = '$UNCHANGED$', X_train: bool | None | str = '$UNCHANGED$') DipEncoder
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- Parameters:
X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in predict.
X_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_train parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- clustpy.deep.dipencoder.plot_dipencoder_embedding(X_embed: ndarray, n_clusters: int, labels: ndarray, projection_axes: ndarray, index_dict: dict, edge_width: float = 0.1, show_legend: bool = False, show_plot: bool = True) None[source]
Plot the current state of the DipEncoder. Uses the plot_scatter_matrix as a basis and adds projection axes in red.
- Parameters:
X_embed (np.ndarray) – The embedded data set
n_clusters (int) – Number of clusters
labels (np.ndarray) – The cluster labels
projection_axes (np.ndarray) – The projection axes between the clusters
index_dict (dict) – A dictionary to match the indices of two clusters to a projection axis
edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots
show_legend (bool) – Specifies whether a legend should be added to the plot
show_plot (bool) – Specifies whether the plot should be plotted, i.e. if plt.show() should be executed (default: True)
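How a dictionary such as index_dict can match a pair of cluster indices to a projection axis is sketched below. The key layout (ordered tuples (i, j) with i < j, numbered in lexicographic order) and the helper name are assumptions made purely for illustration. The actual structure of index_dict may differ.

```python
from itertools import combinations


def build_pair_index(n_clusters):
    """Hypothetical mapping from each pair of clusters to an axis position."""
    pairs = combinations(range(n_clusters), 2)
    return {pair: axis for axis, pair in enumerate(pairs)}


pair_index = build_pair_index(3)
print(pair_index)

# Looking up the axis that separates clusters 2 and 0 (order-independent).
i, j = sorted((2, 0))
axis_position = pair_index[(i, j)]
print(axis_position)
```

With one axis per pair of clusters, n_clusters clusters lead to n_clusters * (n_clusters - 1) / 2 entries in such a mapping.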
clustpy.deep.enrc module
@authors: Lukas Miklautz
- class clustpy.deep.enrc.ENRC(n_clusters: list, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), degree_of_space_distortion: float = 1.0, degree_of_space_preservation: float = 1.0, autoencoder: ~torch.nn.modules.module.Module = None, embedding_size: int = 20, init: str = 'nrkmeans', device: ~torch.device = None, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = None, random_state: ~numpy.random.mtrand.RandomState = None, debug: bool = False)[source]
Bases:
BaseEstimator, ClusterMixin
The Embedded Non-Redundant Clustering (ENRC) algorithm.
- Parameters:
n_clusters (list) – list containing number of clusters for each clustering
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
P (list) – list containing projections for each clustering (optional) (default: None)
input_centers (list) – list containing the cluster centers for each clustering (optional) (default: None)
batch_size (int) – size of the data batches (default: 128)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)
tolerance_threshold (float) – tolerance threshold that determines when the training should stop. If NMI(old_labels, new_labels) >= (1 - tolerance_threshold) for all clusterings, the training stops before max_epochs is reached. A high value therefore stops the training earlier than max_epochs. If set to 0 or None, the training continues until the labels no longer change (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.MSELoss())
degree_of_space_distortion (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)
degree_of_space_preservation (float) – weight of regularization loss term, e.g., reconstruction loss (default: 1.0)
autoencoder (torch.nn.Module) – the input autoencoder. If None a new autoencoder will be created and trained (default: None)
embedding_size (int) – size of the embedding within the autoencoder. Only used if autoencoder is None (default: 20)
init (str) – the initialization strategy to use. Has to be one of ‘nrkmeans’, ‘random’, ‘sgd’ or ‘auto’ (default: ‘nrkmeans’)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
device (torch.device) – if device is None then it will be checked whether a gpu is available or not (default: None)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization (optional) (default: None)
debug (bool) – if True additional information during the training will be printed (default: False)
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
- Raises:
ValueError – if init is not one of ‘nrkmeans’, ‘random’, ‘auto’ or ‘sgd’
References
Miklautz, Lukas & Dominik Mautz et al. “Deep embedded non-redundant clustering.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.
- fit(X: ndarray, y: ndarray | None = None) ENRC[source]
Cluster the input dataset with the ENRC algorithm. Saves the labels, centers, V, m, Betas, and P in the ENRC object. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – input data
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – returns the ENRC object
- Return type:
ENRC
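The stopping rule described for tolerance_threshold, NMI(old_labels, new_labels) >= (1 - tolerance_threshold) for all clusterings, can be sketched in plain Python as follows. The NMI here is normalized by the arithmetic mean of the two entropies, which is an assumption. The function names and the small numerical slack eps are hypothetical and not part of clustpy.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Normalized mutual information with arithmetic-mean normalization."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mutual_info = sum((c / n) * math.log(c * n / (count_a[a] * count_b[b]))
                      for (a, b), c in count_ab.items())
    normalizer = (entropy(labels_a) + entropy(labels_b)) / 2
    return 1.0 if normalizer == 0 else mutual_info / normalizer


def should_stop(old_labelings, new_labelings, tolerance_threshold=None, eps=1e-12):
    """True if every clustering changed by less than the tolerance allows."""
    tolerance = tolerance_threshold or 0.0  # None is treated like 0
    return all(nmi(old, new) >= 1.0 - tolerance - eps
               for old, new in zip(old_labelings, new_labelings))


# Two clusterings of four samples; one list of labels per clustering.
old = [[0, 0, 1, 1], [0, 1, 0, 1]]
relabeled = [[1, 1, 0, 0], [0, 1, 0, 1]]  # same partitions, labels permuted
changed = [[0, 0, 1, 1], [0, 0, 1, 1]]    # second clustering changed completely

stop_relabeled = should_stop(old, relabeled, tolerance_threshold=0.05)
stop_changed = should_stop(old, changed, tolerance_threshold=0.05)
print(stop_relabeled, stop_changed)
```

Because NMI is invariant to label permutations, a pure renaming of the clusters does not prevent the training from stopping.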
- plot_subspace(X: ndarray, subspace_index: int, labels: ndarray | None = None, plot_centers: bool = False, gt: ndarray | None = None, equal_axis: bool = False) None[source]
Plot the specified subspace_nr as a scatter matrix plot.
- Parameters:
X (np.ndarray) – input data
subspace_index (int) – index of the subspace_nr
labels (np.ndarray) – the labels to use for the plot. If None, the labels found by Nr-Kmeans are used (default: None)
plot_centers (bool) – plot centers if True (default: False)
gt (np.ndarray) – the ground truth labels (default: None)
equal_axis (bool) – equalize axis if True (default: False)
- Return type:
scatter matrix plot of the input data
- predict(X: ndarray, y: ndarray | None = None, use_P: bool = True) ndarray[source]
Predicts the labels for each clustering of X in a mini-batch manner.
- Parameters:
X (np.ndarray) – input data
y (np.ndarray) – the labels (can be ignored)
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)
- Returns:
predicted_labels – n x c matrix, where n is the number of data points in X and c is the number of clusterings.
- Return type:
np.ndarray
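The difference between hard selection with P and soft weighting with the betas can be illustrated with a small plain-Python sketch. It assumes that the sample is already embedded and rotated, that hard selection restricts the distance to the dimensions listed in P for a clustering, and that soft weighting multiplies each squared difference by the beta of that dimension. All names and values are hypothetical.

```python
def hard_squared_distance(z, center, dims):
    """Squared distance restricted to the dimensions selected by P."""
    return sum((z[d] - center[d]) ** 2 for d in dims)


def soft_squared_distance(z, center, betas):
    """Squared distance with every dimension weighted by its beta."""
    return sum(b * (x - c) ** 2 for b, x, c in zip(betas, z, center))


def closest_center(z, centers, distance, selector):
    """Index of the center with the smallest distance to z."""
    distances = [distance(z, c, selector) for c in centers]
    return distances.index(min(distances))


# One rotated sample and two centers of a single clustering.
z = [0.0, 5.0, 1.0]
centers = [[0.0, 0.0, 0.0], [3.0, 5.0, 1.0]]

hard_label = closest_center(z, centers, hard_squared_distance, [0])
soft_label = closest_center(z, centers, soft_squared_distance, [0.6, 0.4, 0.0])
print(hard_label, soft_label)
```

In this toy setting the two variants disagree, because the soft weights still give some influence to a dimension that the hard selection ignores.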
- reconstruct_subspace_centroids(subspace_index: int) ndarray[source]
Reconstructs the centroids in the specified subspace_nr.
- Parameters:
subspace_index (int) – index of the subspace_nr
- Returns:
centers_rec – reconstructed centers as np.ndarray
- Return type:
np.ndarray
- set_predict_request(*, use_P: bool | None | str = '$UNCHANGED$') ENRC
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- Parameters:
use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- transform_full_space(X: ndarray, embedded=False) ndarray[source]
Embeds the input dataset with the autoencoder and the matrix V from the ENRC object.
- Parameters:
X (np.ndarray) – input data
embedded (bool) – if True, then X is assumed to be already embedded (default: False)
- Returns:
rotated – The transformed data
- Return type:
np.ndarray
- transform_subspace(X: ndarray, subspace_index: int, embedded: bool = False) ndarray[source]
Embeds the input dataset with the autoencoder and with the matrix V projected onto the specified clusterspace_nr.
- Parameters:
X (np.ndarray) – input data
subspace_index (int) – index of the subspace_nr
embedded (bool) – if True, then X is assumed to be already embedded (default: False)
- Returns:
subspace – The transformed subspace
- Return type:
np.ndarray
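The two transformations above can be sketched on data that is already embedded (the case embedded=True). The sketch assumes that the rotation is a right-multiplication of the embedded samples by V and that the subspace is obtained by keeping the rotated dimensions listed in P for the requested subspace. The function names and toy values are hypothetical.

```python
def rotate(embedded, V):
    """Multiply each embedded sample (row vector) by the matrix V."""
    n_rows, n_cols = len(V), len(V[0])
    return [[sum(z[k] * V[k][j] for k in range(n_rows)) for j in range(n_cols)]
            for z in embedded]


def project(rotated, dims):
    """Keep only the dimensions that belong to one subspace."""
    return [[row[d] for d in dims] for row in rotated]


# Toy example: V swaps the two embedded dimensions (an orthogonal matrix).
embedded = [[1.0, 2.0], [3.0, 4.0]]
V = [[0.0, 1.0], [1.0, 0.0]]
P = [[0], [1]]  # hypothetical assignment of rotated dimensions to subspaces

rotated = rotate(embedded, V)
subspace = project(rotated, P[0])
print(rotated, subspace)
```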
- clustpy.deep.enrc.available_init_strategies() list[source]
Returns a list of strings of the available initialization strategies for ENRC. Currently, the following strategies are supported: nrkmeans, random, sgd, auto
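A check of the init argument against such a list could look like the following sketch. The list content is taken from the description above. The function validate_init and the handling of callables are assumptions for illustration and not the clustpy code.

```python
AVAILABLE_INIT_STRATEGIES = ["nrkmeans", "random", "sgd", "auto"]


def validate_init(init):
    """Return init unchanged if it is supported, otherwise raise a ValueError."""
    if callable(init):
        return init  # assumed: user-defined initialization functions are allowed
    if init not in AVAILABLE_INIT_STRATEGIES:
        raise ValueError(
            f"init must be one of {AVAILABLE_INIT_STRATEGIES} or a callable, got {init!r}"
        )
    return init


print(validate_init("sgd"))
```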
- clustpy.deep.enrc.beta_weights_init(P: list, n_dims: int, high_value: float = 0.9) Tensor[source]
Initializes parameters of the softmax such that betas will be set to high_value in dimensions which form a cluster subspace according to P and set to (1 - high_value)/(len(P) - 1) for the other clusterings.
- Parameters:
P (list) – list containing projections for each subspace
n_dims (int) – dimensionality of the embedded data
high_value (float) – value that should be initially used to indicate strength of assignment of a specific dimension to the clustering (default: 0.9)
- Returns:
beta_weights – initialized weights that are input in the softmax to get the betas.
- Return type:
torch.Tensor
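The target values described above can be sketched without torch. The code builds the betas (high_value for the dimensions of a clustering according to P and (1 - high_value) / (len(P) - 1) otherwise) and then takes their logarithm as one possible softmax input that reproduces them. It assumes that every dimension belongs to exactly one entry of P. The function names are hypothetical, and the actual parametrization in clustpy may differ.

```python
import math


def target_betas(P, n_dims, high_value=0.9):
    """Beta matrix with one row per clustering and one column per dimension."""
    low_value = (1.0 - high_value) / (len(P) - 1)
    betas = [[low_value] * n_dims for _ in P]
    for clustering, dims in enumerate(P):
        for d in dims:
            betas[clustering][d] = high_value
    return betas


def softmax_over_clusterings(weights):
    """Softmax applied per dimension across the clusterings."""
    n_dims = len(weights[0])
    result = [[0.0] * n_dims for _ in weights]
    for d in range(n_dims):
        exps = [math.exp(row[d]) for row in weights]
        total = sum(exps)
        for k, e in enumerate(exps):
            result[k][d] = e / total
    return result


P = [[0, 1], [2]]  # dimensions 0 and 1 form the first subspace, dimension 2 the second
betas = target_betas(P, n_dims=3)
beta_weights = [[math.log(b) for b in row] for row in betas]  # assumed softmax inputs
recovered = softmax_over_clusterings(beta_weights)
print(betas)
print(recovered)
```

Taking the logarithm works here because the betas of each dimension sum to 1 across the clusterings, so the softmax returns them unchanged.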
- clustpy.deep.enrc.calculate_beta_weight(data: Tensor, centers: list, V: Tensor, P: list, high_beta_value: float = 0.9) Tensor[source]
The beta weights have a closed-form solution if there are two subspaces, so the optimal values given the data, centers and V can be computed. See the supplement of Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Boehm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832 (https://gitlab.cs.univie.ac.at/lukas/acedec_public/-/blob/master/supplement.pdf). For more than two subspaces, the beta weights are calculated under the assumption that an assigned subspace should have a weight of 0.9.
- Parameters:
data (torch.Tensor) – input data
centers (list) – list of torch.Tensor, cluster centers for each clustering
V (torch.Tensor) – orthogonal rotation matrix
P (list) – list containing projections for each subspace
high_beta_value (float) – value that should be initially used to indicate strength of assignment of a specific dimension to the clustering (default: 0.9)
- Returns:
beta_weights – a c x d vector containing the weights for the softmax to indicate which dimensions d are important for each clustering c.
- Return type:
torch.Tensor
- Raises:
ValueError – if the number of clusterings is smaller than 2
- clustpy.deep.enrc.calculate_optimal_beta_weights_special_case(data: Tensor, centers: list, V: Tensor, batch_size: int = 32) Tensor[source]
The beta weights have a closed-form solution if there are two subspaces, so the optimal values given the data, centers and V can be computed. See the supplement of Lukas Miklautz, Lena G. M. Bauer, Dominik Mautz, Sebastian Tschiatschek, Christian Boehm, Claudia Plant: Details (Don’t) Matter: Isolating Cluster Information in Deep Embedded Spaces. IJCAI 2021: 2826-2832 (https://gitlab.cs.univie.ac.at/lukas/acedec_public/-/blob/master/supplement.pdf).
- Parameters:
data (torch.Tensor) – input data
centers (list) – list of torch.Tensor, cluster centers for each clustering
V (torch.Tensor) – orthogonal rotation matrix
batch_size (int) – size of the data batches (default: 32)
- Returns:
optimal_beta_weights – a c x d vector containing the optimal weights for the softmax to indicate which dimensions d are important for each clustering c.
- Return type:
torch.Tensor
- clustpy.deep.enrc.enrc_init(data: ~numpy.ndarray, n_clusters: list, init: str = 'auto', rounds: int = 10, input_centers: list | None = None, P: list | None = None, V: ~numpy.ndarray | None = None, random_state: ~numpy.random.mtrand.RandomState | None = None, max_iter: int = 100, learning_rate: float | None = None, optimizer_class: ~torch.optim.optimizer.Optimizer | None = None, batch_size: int = 128, epochs: int = 10, device: ~torch.device = device(type='cpu'), debug: bool = True, init_kwargs: dict | None = None) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy for the ENRC algorithm.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
init (str) –
{‘nrkmeans’, ‘random’, ‘sgd’, ‘auto’} or callable. Initialization strategies for parameters cluster_centers, V and beta of ENRC. (default=’auto’)
’nrkmeans’ : Performs the NrKmeans algorithm to get initial parameters. This strategy is preferred for small data sets, but the orthogonality constraint on V and subsequently for the clustered subspaces can sometimes be too limiting in practice, e.g., if clusterings in the data are not perfectly non-redundant.
’random’ : Same as ‘nrkmeans’, but max_iter is set to 10, so it runs faster, but the result is less optimized and thus more random.
’sgd’ : Initialization strategy based on optimizing ENRC’s parameters V and beta in isolation from the autoencoder using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than the ‘nrkmeans’ option and only constrains V using the reconstruction error (torch.nn.MSELoss), which can be more flexible than the orthogonality constraint of NrKmeans. A problem of the ‘sgd’ strategy is that it can be less stable for small data sets.
’auto’ : Selects ‘sgd’ init if data.shape[0] > 100,000 or data.shape[1] > 1,000. For smaller data sets ‘nrkmeans’ init is used.
If a callable is passed, it should take arguments data and n_clusters (additional parameters can be provided via the dictionary init_kwargs) and return an initialization (centers, P, V and beta_weights).
rounds (int) – number of repetitions of the initialization procedure (default: 10)
input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
random_state (np.random.RandomState) – random state for reproducible results (default: None)
max_iter (int) – maximum number of iterations of NrKmeans. Only used for init=’nrkmeans’ (default: 100)
learning_rate (float) – learning rate for optimizer_class that is used to optimize V and beta. Only used for init=’sgd’.
optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used. Only used for init=’sgd’ (default: None)
batch_size (int) – size of the data batches. Only used for init=’sgd’ (default: 128)
epochs (int) – number of epochs for the actual clustering procedure. Only used for init=’sgd’ (default: 10)
device (torch.device) – device to train on. Only used for init=’sgd’ (default: torch.device(‘cpu’))
debug (bool) – if True then the cost of each round will be printed (default: True)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
- Raises:
ValueError – if an init value is passed that is not implemented
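The ‘auto’ rule described for the init parameter can be written as a short sketch. The function name `select_init_strategy` is hypothetical and only mirrors the documented thresholds; it is not part of clustpy.

```python
def select_init_strategy(n_samples, n_features):
    """Hypothetical helper mirroring the documented 'auto' rule."""
    if n_samples > 100_000 or n_features > 1_000:
        return "sgd"
    return "nrkmeans"


choice_small = select_init_strategy(5_000, 64)     # small data set
choice_large = select_init_strategy(250_000, 64)   # many samples
choice_wide = select_init_strategy(5_000, 2_048)   # many features
```

Here `n_samples` and `n_features` stand for data.shape[0] and data.shape[1].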
- clustpy.deep.enrc.enrc_predict(z: Tensor, V: Tensor, centers: list, subspace_betas: Tensor, use_P: bool = False) ndarray[source]
Predicts the labels for each clustering of an input z.
- Parameters:
z (torch.Tensor) – embedded input data point, can also be a mini-batch of embedded points
V (torch.Tensor) – orthogonal rotation matrix
centers (list) – list of torch.Tensor, cluster centers for each clustering
subspace_betas (torch.Tensor) – weights for each dimension per clustering. Calculated via softmax(beta_weights).
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft subspace_beta weights are used (default: False)
- Returns:
predicted_labels – n x c matrix, where n is the number of data points in z and c is the number of clusterings.
- Return type:
np.ndarray
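The following NumPy sketch illustrates how such an n x c label matrix could be obtained with soft weights (use_P=False): the input is rotated with V, and for each clustering the nearest center is chosen under a squared distance weighted by that clustering's betas. It assumes that the centers are given in the rotated space and uses NumPy arrays instead of torch tensors; `predict_sketch` is a hypothetical name, not the clustpy implementation.

```python
import numpy as np


def predict_sketch(z, V, centers, subspace_betas):
    """Assign each row of z to its nearest center, once per clustering."""
    z_rot = z @ V  # rotate the embedded points
    labels = []
    for c, centers_c in enumerate(centers):
        diff = z_rot[:, None, :] - centers_c[None, :, :]  # shape (n, k_c, d)
        # Per-dimension betas down-weight dimensions that are irrelevant for clustering c
        weighted_sq_dist = (subspace_betas[c] * diff ** 2).sum(axis=2)  # shape (n, k_c)
        labels.append(weighted_sq_dist.argmin(axis=1))
    return np.stack(labels, axis=1)  # shape (n, c)


z = np.array([[0.0, 0.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]])
V = np.eye(2)
centers = [
    np.array([[0.0, 0.0], [5.0, 0.0]]),  # clustering 0 separates along dimension 0
    np.array([[0.0, 0.0], [0.0, 5.0]]),  # clustering 1 separates along dimension 1
]
subspace_betas = np.array([[1.0, 0.0], [0.0, 1.0]])
predicted_labels = predict_sketch(z, V, centers, subspace_betas)
```

Each column of `predicted_labels` holds the labels of one clustering, so a point can belong to different clusters in different subspaces.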
- clustpy.deep.enrc.enrc_predict_batchwise(V: Tensor, centers: list, subspace_betas: Tensor, model: Module, dataloader: DataLoader, device: device = device(type='cpu'), use_P: bool = False) ndarray[source]
Predicts the labels for each clustering of a dataloader in a mini-batch manner.
- Parameters:
V (torch.Tensor) – orthogonal rotation matrix
centers (list) – list of torch.Tensor, cluster centers for each clustering
subspace_betas (torch.Tensor) – weights for each dimension per clustering. Calculated via softmax(beta_weights).
model (torch.nn.Module) – the input model for encoding the data
dataloader (torch.utils.data.DataLoader) – dataloader to be used for prediction
device (torch.device) – device to be predicted on (default: torch.device(‘cpu’))
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: False)
- Returns:
predicted_labels – n x c matrix, where n is the number of data points in the dataloader and c is the number of clusterings.
- Return type:
np.ndarray
- clustpy.deep.enrc.nrkmeans_init(data: ~numpy.ndarray, n_clusters: list, rounds: int = 10, max_iter: int = 100, input_centers: list | None = None, P: list | None = None, V: ~numpy.ndarray | None = None, random_state: ~numpy.random.mtrand.RandomState | None = None, debug=True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy based on the NrKmeans Algorithm. This strategy is preferred for small data sets, but the orthogonality constraint on V and subsequently for the clustered subspaces can sometimes be too limiting in practice, e.g., if clusterings are not perfectly non-redundant.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
rounds (int) – number of repetitions of the NrKmeans algorithm (default: 10)
max_iter (int) – maximum number of iterations of NrKmeans (default: 100)
input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – if True then the cost of each round will be printed (default: True)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
- clustpy.deep.enrc.optimal_beta(kmeans_loss: Tensor, other_losses_mean_sum: Tensor) Tensor[source]
Calculate optimal values for the beta weight for each dimension.
- Parameters:
kmeans_loss (torch.Tensor) – a 1 x d vector of the kmeans losses per dimension.
other_losses_mean_sum (torch.Tensor) – a 1 x d vector of the kmeans losses of all other clusterings except the one in ‘kmeans_loss’.
- Returns:
optimal_beta_weights – a 1 x d vector containing the optimal weights for the softmax to indicate which dimensions are important for each clustering. Calculated via -torch.log(kmeans_loss/other_losses_mean_sum)
- Return type:
torch.Tensor
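A short numeric illustration of the stated formula, written with NumPy instead of torch. The loss values are made up for the example.

```python
import numpy as np

kmeans_loss = np.array([[1.0, 4.0]])            # per-dimension loss of one clustering
other_losses_mean_sum = np.array([[4.0, 1.0]])  # per-dimension loss of the other clusterings
optimal_beta_weights = -np.log(kmeans_loss / other_losses_mean_sum)
```

A dimension in which this clustering has a lower kmeans loss than the others receives a positive weight, and a dimension in which it has a higher loss receives a negative weight.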
- clustpy.deep.enrc.random_nrkmeans_init(data: ~numpy.ndarray, n_clusters: list, rounds: int = 10, input_centers: list | None = None, P: list | None = None, V: ~numpy.ndarray | None = None, random_state: ~numpy.random.mtrand.RandomState | None = None, debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy based on the NrKmeans Algorithm. For documentation see the nrkmeans_init function. Same as nrkmeans_init, but max_iter is set to 5, so the initialization is faster but the results are more random.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
rounds (int) – number of repetitions of the NrKmeans algorithm (default: 10)
input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – if True then the cost of each round will be printed (default: True)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
- clustpy.deep.enrc.reinit_centers(enrc: _ENRC_Module, subspace_id: int, dataloader: DataLoader, model: Module, n_samples: int = 512, kmeans_steps: int = 10, split: str = 'random') None[source]
Reinitializes centers that have been lost, i.e. centers that did not get any data point assigned. Before a center is reinitialized, this method checks whether the center has not had any points assigned over several mini-batch iterations; if this count is higher than enrc.reinit_threshold, the center is reinitialized.
- Parameters:
enrc (_ENRC_Module) – torch.nn.Module instance for the ENRC algorithm
subspace_id (int) – integer indicating which subspace the clusters to be checked are in.
dataloader (torch.utils.data.DataLoader) – dataloader from which data is randomly sampled. Important: shuffle=True needs to be set, because n_samples random samples are drawn.
model (torch.nn.Module) – autoencoder model used for the embedding
n_samples (int) – number of samples that should be used for the reclustering (default: 512)
kmeans_steps (int) – number of mini-batch kmeans steps that should be conducted with the new centroid (default: 10)
split (str) – {‘random’, ‘cost’}, selects how clusters should be split for reinitialization. ‘random’ : split a random point from the random sample of size n_samples. ‘cost’ : split the cluster with max kmeans cost (default: ‘random’)
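The check for lost centers described above can be sketched as a counter per center. The sketch assumes that a counter is increased whenever a center receives no points in a mini-batch and reset otherwise; the reset behaviour, the helper name `update_lost_counts` and the threshold value are assumptions for illustration only.

```python
import numpy as np


def update_lost_counts(lost_counts, assignments, n_centers):
    """Increase the counter of every center without points in this mini-batch, reset the others."""
    sizes = np.bincount(assignments, minlength=n_centers)
    return np.where(sizes == 0, lost_counts + 1, 0)


reinit_threshold = 2  # stands in for enrc.reinit_threshold
lost_counts = np.zeros(3, dtype=int)
# Cluster assignments of three mini-batches; center 2 never gets a point
batches = [np.array([0, 0, 1]), np.array([1, 0, 0]), np.array([0, 1, 1])]
for assignments in batches:
    lost_counts = update_lost_counts(lost_counts, assignments, n_centers=3)
centers_to_reinit = np.flatnonzero(lost_counts > reinit_threshold)
```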
- clustpy.deep.enrc.sgd_init(data: ~numpy.ndarray, n_clusters: list, learning_rate: float, batch_size: int = 128, optimizer_class: ~torch.optim.optimizer.Optimizer | None = None, rounds: int = 2, epochs: int = 10, random_state: ~numpy.random.mtrand.RandomState | None = None, input_centers: list | None = None, P: list | None = None, V: ~numpy.ndarray | None = None, device: ~torch.device = device(type='cpu'), debug: bool = True) -> (<class 'list'>, <class 'list'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]
Initialization strategy based on optimizing ENRC’s parameters V and beta in isolation from the autoencoder using a mini-batch gradient descent optimizer. This initialization strategy scales better to large data sets than nrkmeans_init and only constrains V using the reconstruction error (torch.nn.MSELoss), which can be more flexible than the orthogonality constraint of NrKmeans. A problem of the sgd_init strategy is that it can be less stable for small data sets.
- Parameters:
data (np.ndarray) – input data
n_clusters (list) – list of ints, number of clusters for each clustering
learning_rate (float) – learning rate for optimizer_class that is used to optimize V and beta
batch_size (int) – size of the data batches (default: 128)
optimizer_class (torch.optim.Optimizer) – optimizer for training. If None then torch.optim.Adam will be used (default: None)
rounds (int) – number of repetitions of the initialization procedure (default: 2)
epochs (int) – number of epochs for the actual clustering procedure (default: 10)
random_state (np.random.RandomState) – random state for reproducible results (default: None)
input_centers (list) – list of np.ndarray, initial cluster centers to start from (optional) (default: None)
P (list) – list containing projections for each subspace (optional) (default: None)
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
device (torch.device) – device to train on (default: torch.device(‘cpu’))
debug (bool) – if True then the cost of each round will be printed (default: True)
- Returns:
tuple – list of cluster centers for each subspace, list containing projections for each subspace, orthogonal rotation matrix, weights for softmax function to get beta values.
- Return type:
(list, list, np.ndarray, np.ndarray)
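To illustrate why a reconstruction error can act as a soft replacement for a hard orthogonality constraint on V, the sketch below measures the error of rotating data with V and rotating it back with the transpose of V. The exact form of the reconstruction term (z @ V @ V.T compared to z) is an assumption; the documentation only states that V is constrained through torch.nn.MSELoss. NumPy is used instead of torch.

```python
import numpy as np


def rotation_reconstruction_error(z, V):
    """MSE between z and its rotate / rotate-back round trip z @ V @ V.T."""
    z_rec = (z @ V) @ V.T
    return float(np.mean((z_rec - z) ** 2))


rng = np.random.default_rng(0)
z = rng.normal(size=(100, 4))
V_orthogonal, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # orthogonal matrix
V_scaled = 1.5 * V_orthogonal                            # no longer orthogonal

error_orthogonal = rotation_reconstruction_error(z, V_orthogonal)
error_scaled = rotation_reconstruction_error(z, V_scaled)
```

Under this assumption an orthogonal V gives an error close to zero, while deviations from orthogonality are penalized but not forbidden.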
clustpy.deep.flexible_autoencoder module
@authors: Lukas Miklautz
- class clustpy.deep.flexible_autoencoder.FlexibleAutoencoder(layers: list, batch_norm: bool = False, dropout: float | None = None, activation_fn: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.LeakyReLU'>, bias: bool = True, decoder_layers: list | None = None, decoder_output_fn: ~torch.nn.modules.module.Module | None = None)[source]
Bases:
Module
A flexible feedforward autoencoder.
- Parameters:
layers (list) – list of the different layer sizes from input to embedding, e.g. an example architecture for MNIST [784, 512, 256, 10], where 784 is the input dimension and 10 the embedding dimension. If decoder_layers are not specified then the decoder is symmetric and goes in the same order from embedding to input.
batch_norm (bool) – Set True if you want to use torch.nn.BatchNorm1d (default: False)
dropout (float) – Set the amount of dropout you want to use (default: None)
activation_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the hidden layers, if None then it will be linear (default: torch.nn.LeakyReLU)
bias (bool) – set False if you do not want to use a bias term in the linear layers (default: True)
decoder_layers (list) – list of different layer sizes from embedding to output of the decoder. If set to None, will be symmetric to layers (default: None)
decoder_output_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the decoder output layer, if None then it will be linear. e.g. set to torch.nn.Sigmoid if you want to scale the decoder output between 0 and 1 (default: None)
- encoder
encoder part of the autoencoder, responsible for embedding data points (class is FullyConnectedBlock)
- Type:
FullyConnectedBlock
- decoder
decoder part of the autoencoder, responsible for reconstructing data points from the embedding (class is FullyConnectedBlock)
- Type:
FullyConnectedBlock
- fitted
boolean value indicating whether the autoencoder is already fitted.
- Type:
bool
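The relation between layers and decoder_layers can be made concrete with a small sketch in plain Python. It only derives the layer sizes; `layer_pairs` is a hypothetical helper and no torch modules are built.

```python
def layer_pairs(sizes):
    """(in_features, out_features) pairs for consecutive layer sizes."""
    return list(zip(sizes[:-1], sizes[1:]))


layers = [784, 512, 256, 10]  # input -> embedding, as in the MNIST example
decoder_layers = None         # not specified, so the decoder mirrors the encoder
decoder_sizes = layers[::-1] if decoder_layers is None else decoder_layers

encoder_shapes = layer_pairs(layers)
decoder_shapes = layer_pairs(decoder_sizes)
```

With these values the encoder maps 784 → 512 → 256 → 10 and the symmetric decoder maps 10 → 256 → 512 → 784.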
References
E.g. Ballard, Dana H. “Modular learning in neural networks.” AAAI. Vol. 647. 1987.
- decode(embedded: Tensor) Tensor[source]
Apply the decoder function to embedded.
- Parameters:
embedded (torch.Tensor) – embedded data point, can also be a mini-batch of embedded points
- Returns:
decoded – returns the reconstruction of embedded
- Return type:
torch.Tensor
- encode(x: Tensor) Tensor[source]
Apply the encoder function to x.
- Parameters:
x (torch.Tensor) – input data point, can also be a mini-batch of points
- Returns:
embedded – the embedded data point with dimensionality embedding_size
- Return type:
torch.Tensor
- evaluate(dataloader: DataLoader, loss_fn: _Loss, device: device = device(type='cpu')) Tensor[source]
Evaluates the autoencoder.
- Parameters:
dataloader (torch.utils.data.DataLoader) – dataloader to be used for evaluation
loss_fn (torch.nn.modules.loss._Loss) – loss function to be used for reconstruction
device (torch.device) – device to be trained on (default: torch.device(‘cpu’))
- Returns:
loss – returns the reconstruction loss of all samples in dataloader
- Return type:
torch.Tensor
- fit(n_epochs: int, lr: float, batch_size: int = 128, data: ~numpy.ndarray = None, data_eval: ~numpy.ndarray = None, dataloader: ~torch.utils.data.dataloader.DataLoader = None, evalloader: ~torch.utils.data.dataloader.DataLoader = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), patience: int = 5, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, device: ~torch.device = device(type='cpu'), model_path: str = None, print_step: int = 0) FlexibleAutoencoder[source]
Trains the autoencoder in place.
- Parameters:
n_epochs (int) – number of epochs for training
lr (float) – learning rate to be used for the optimizer_class
batch_size (int) – size of the data batches (default: 128)
data (np.ndarray) – train data set. If data is passed then dataloader can remain empty (default: None)
data_eval (np.ndarray) – evaluation data set. If data_eval is passed then evalloader can remain empty (default: None)
dataloader (torch.utils.data.DataLoader) – dataloader to be used for training (default: None)
evalloader (torch.utils.data.DataLoader) – dataloader to be used for evaluation, early stopping and learning rate scheduling if scheduler=torch.optim.lr_scheduler.ReduceLROnPlateau (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer to be used (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function to be used for reconstruction (default: torch.nn.MSELoss())
patience (int) – patience parameter for EarlyStopping (default: 5)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used. If torch.optim.lr_scheduler.ReduceLROnPlateau is used then the behaviour is matched by providing the validation_loss calculated based on samples from evalloader (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
device (torch.device) – device to be trained on (default: torch.device(‘cpu’))
model_path (str) – if specified will save the trained model to the location. If evalloader is used, then only the best model w.r.t. evaluation loss is saved (default: None)
print_step (int) – specifies how often the losses are printed. If 0, no prints will occur (default: 0)
- Returns:
self – this instance of the FlexibleAutoencoder
- Return type:
- Raises:
ValueError – data cannot be None if dataloader is None:
ValueError – evalloader cannot be None if scheduler=torch.optim.lr_scheduler.ReduceLROnPlateau:
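The role of the patience parameter can be illustrated with a generic early stopping loop in plain Python. This is a sketch of the common patience rule (stop after `patience` consecutive epochs without a new best evaluation loss); the criteria used by clustpy's EarlyStopping may differ, and `epochs_until_stop` is a hypothetical helper.

```python
def epochs_until_stop(eval_losses, patience=5):
    """Return the 1-based epoch at which training stops, given one evaluation loss per epoch."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            # New best evaluation loss: reset the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(eval_losses)


losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.75, 0.9]
stop_epoch = epochs_until_stop(losses, patience=3)
```

In this example the best loss is reached in epoch 3, and the following three epochs bring no improvement, so training stops in epoch 6.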
- forward(x: Tensor) Tensor[source]
Applies both the encode and decode function. The forward function is automatically called if we call self(x).
- Parameters:
x (torch.Tensor) – input data point, can also be a mini-batch of points
- Returns:
reconstruction – returns the reconstruction of a data point
- Return type:
torch.Tensor
- loss(batch: list, loss_fn: _Loss, device: device) Tensor[source]
Calculate the loss of a single batch of data.
- Parameters:
batch (list) – the different parts of a dataloader (id, samples, …)
loss_fn (torch.nn.modules.loss._Loss) – loss function to be used for reconstruction
device (torch.device) – device to be trained on
- Returns:
loss – returns the reconstruction loss of the input sample
- Return type:
torch.Tensor
- training: bool
- class clustpy.deep.flexible_autoencoder.FullyConnectedBlock(layers: list, batch_norm: bool = False, dropout: float | None = None, activation_fn: Module | None = None, bias: bool = True, output_fn: Module | None = None)[source]
Bases:
Module
Feed Forward Neural Network Block
- Parameters:
layers (list) – list of the different layer sizes
batch_norm (bool) – set True if you want to use torch.nn.BatchNorm1d (default: False)
dropout (float) – set the amount of dropout you want to use (default: None)
activation_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the hidden layers, if None then it will be linear (default: None)
bias (bool) – set False if you do not want to use a bias term in the linear layers (default: True)
output_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the last layer, if None then it will be linear (default: None)
- block
feed forward neural network
- Type:
torch.nn.Sequential
- forward(x: Tensor) Tensor[source]
Pass a sample through the FullyConnectedBlock.
- Parameters:
x (torch.Tensor) – the sample
- Returns:
forwarded – The passed sample.
- Return type:
torch.Tensor
- training: bool
clustpy.deep.neighbor_encoder module
@authors: Collin Leiber
- class clustpy.deep.neighbor_encoder.NeighborEncoder(layers: list, n_neighbors: int, decode_self: bool = False, batch_norm: bool = False, dropout: float | None = None, activation_fn: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.LeakyReLU'>, bias: bool = True, decoder_layers: list | None = None, decoder_output_fn: ~torch.nn.modules.module.Module | None = None)[source]
Bases:
FlexibleAutoencoder
A NeighborEncoder. Does not compare the reconstruction of an object to itself but to its nearest neighbors. For more information see the stated reference. If n_neighbors is 0 and decode_self is true, the NeighborEncoder will work as a regular FlexibleAutoencoder.
- Parameters:
layers (list) – list of the different layer sizes from input to embedding, e.g. an example architecture for MNIST [784, 512, 256, 10], where 784 is the input dimension and 10 the embedding dimension. If decoder_layers are not specified then the decoder is symmetric and goes in the same order from embedding to input.
n_neighbors (int) – the number of nearest neighbors to be considered
decode_self (bool) – specifies whether a point itself should also be decoded (default: False)
batch_norm (bool) – Set True if you want to use torch.nn.BatchNorm1d (default: False)
dropout (float) – Set the amount of dropout you want to use (default: None)
activation_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the hidden layers, if None then it will be linear (default: torch.nn.LeakyReLU)
bias (bool) – set False if you do not want to use a bias term in the linear layers (default: True)
decoder_layers (list) – list of different layer sizes from embedding to output of the decoder. If set to None, will be symmetric to layers (default: None)
decoder_output_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the decoder output layer, if None then it will be linear. e.g. set to torch.nn.Sigmoid if you want to scale the decoder output between 0 and 1 (default: None)
- encoder
encoder part of the autoencoder, responsible for embedding data points (class is FullyConnectedBlock)
- Type:
FullyConnectedBlock
- decoder
decoder part of the autoencoder, responsible for reconstructing the data point itself (class is FullyConnectedBlock). Only used if decode_self is true.
- Type:
FullyConnectedBlock
- fitted
boolean value indicating whether the autoencoder is already fitted.
- Type:
bool
- neighbor_decoders
list containing one decoder network (class is FullyConnectedBlock) for each nearest neighbor
- Type:
list
Examples
import numpy as np
from scipy.spatial.distance import pdist, squareform
from clustpy.data import load_optdigits
from clustpy.deep import get_dataloader
from clustpy.deep._utils import detect_device
from clustpy.deep.neighbor_encoder import NeighborEncoder

X, L = load_optdigits()
device = detect_device()
n_neighbors = 3
# Full pairwise distance matrix; column 0 of the argsort is the point itself
dist_matrix = squareform(pdist(X))
neighbor_ids = np.argsort(dist_matrix, axis=1)
neighbors = [X[neighbor_ids[:, 1 + i]] for i in range(n_neighbors)]
# Alternatively: neighbors = get_neighbors_batchwise(X, n_neighbors)
dataloader = get_dataloader(X, 256, True, additional_inputs=neighbors)
neighbor_encoder = NeighborEncoder(layers=[X.shape[1], 512, 256, 10], n_neighbors=n_neighbors, decode_self=False)
neighbor_encoder.fit(dataloader=dataloader, device=device, n_epochs=100, lr=1e-3)
References
Yeh, Chin-Chia Michael, et al. “Representation Learning by Reconstructing Neighborhoods.” arXiv preprint arXiv:1811.01557 (2018).
- decode(embedded: Tensor) Tensor[source]
Apply the decoder function of each neighbor network to embedded. Returns a (n_neighbors x batch_size x dimensionality) tensor if decode_self is false, else a ((n_neighbors + 1) x batch_size x dimensionality) tensor.
- Parameters:
embedded (torch.Tensor) – embedded data point, can also be a mini-batch of embedded points
- Returns:
decoded_neighbors – returns the reconstruction of embedded concerning each neighbor
- Return type:
torch.Tensor
- fit(n_epochs: int, lr: float, batch_size: int = 128, dataloader: ~torch.utils.data.dataloader.DataLoader = None, evalloader: ~torch.utils.data.dataloader.DataLoader = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), patience: int = 5, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, device: ~torch.device = device(type='cpu'), model_path: str = None, print_step: int = 0) NeighborEncoder[source]
Trains the NeighborEncoder in place. Equal to the fit function of the FlexibleAutoencoder, but works only with a dataloader (not with a regular data array). This is because the dataloader must contain the nearest neighbors of each point at the positions 2, 3, ….
- Parameters:
n_epochs (int) – number of epochs for training
lr (float) – learning rate to be used for the optimizer_class
batch_size (int) – size of the data batches (default: 128)
dataloader (torch.utils.data.DataLoader) – dataloader to be used for training (default: None)
evalloader (torch.utils.data.DataLoader) – dataloader to be used for evaluation, early stopping and learning rate scheduling if scheduler=torch.optim.lr_scheduler.ReduceLROnPlateau (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer to be used (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function to be used for reconstruction (default: torch.nn.MSELoss())
patience (int) – patience parameter for EarlyStopping (default: 5)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used. If torch.optim.lr_scheduler.ReduceLROnPlateau is used then the behaviour is matched by providing the validation_loss calculated based on samples from evalloader (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
device (torch.device) – device to be trained on (default: torch.device(‘cpu’))
model_path (str) – if specified will save the trained model to the location. If evalloader is used, then only the best model w.r.t. evaluation loss is saved (default: None)
print_step (int) – specifies how often the losses are printed. If 0, no prints will occur (default: 0)
- Returns:
self – this instance of the NeighborEncoder
- Return type:
NeighborEncoder
- loss(batch: list, loss_fn: _Loss, device: device) Tensor[source]
Calculate the loss of a single batch of data. Corresponds to the sum of losses concerning each neighbor. batch must contain the data object at the first position and the neighbors at the following positions.
- Parameters:
batch (list) – the different parts of a dataloader (id, samples, 1-nearest-neighbor, 2-nearest-neighbor, …)
loss_fn (torch.nn.modules.loss._Loss) – loss function to be used for reconstruction
device (torch.device) – device to be trained on
- Returns:
loss – returns the sum of the reconstruction losses of the input sample
- Return type:
torch.Tensor
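The batch layout and the summation over neighbors can be sketched as follows for the case decode_self=False. The sketch uses NumPy arrays and a plain mean squared error in place of torch tensors and loss_fn; `neighbor_loss_sketch` is a hypothetical helper, not the clustpy implementation.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((a - b) ** 2))


def neighbor_loss_sketch(decoded, batch):
    """decoded: (n_neighbors, batch_size, dim); batch: (ids, samples, nn_1, nn_2, ...)."""
    targets = batch[2:]  # the neighbors start at position 2 of the batch
    # One decoder output per neighbor, each compared to the corresponding neighbor
    return sum(mse(decoded[i], target) for i, target in enumerate(targets))


samples = np.zeros((4, 2))
nn_1 = np.ones((4, 2))
nn_2 = 2 * np.ones((4, 2))
batch = (np.arange(4), samples, nn_1, nn_2)
decoded = np.stack([nn_1, nn_2 + 1.0])  # first neighbor reconstructed exactly, second off by 1
loss = neighbor_loss_sketch(decoded, batch)
```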
- training: bool
- clustpy.deep.neighbor_encoder.get_neighbors_batchwise(X: ndarray, n_neighbors: int, metric: str = 'sqeuclidean', batch_size: int = 10000) list[source]
For large datasets it is often not possible to determine the nearest neighbors in a trivial manner. Therefore, here is an implementation that calculates the nearest neighbors in batches. Ignores the objects themselves (with distance of 0) as nearest neighbors. It reduces the memory consumption of a trivial nearest neighbor implementation from (data_size x data_size) to (batch_size x data_size). A list is returned, which can be given as additional input into a DataLoader and is therefore directly compatible with the NeighborEncoder. Due to runtime concerns it is still recommended to use a more complex nearest neighbor retrieval implementation (e.g. from sklearn.neighbors)!
- Parameters:
X (np.ndarray) – The given data set
n_neighbors (int) – The number of nearest neighbors to identify
metric (str) – The distance metric to be used. See scipy.spatial.distance.cdist for more information (default: sqeuclidean)
batch_size (int) – The size of the batches (default: 10000)
- Returns:
nearest_neigbors – A list containing the nearest neighbors as torch.Tensors, i.e. [1-nearest-neighbor tensor, 2-nearest-neighbor tensor, …]
- Return type:
list
Examples
from clustpy.data import load_optdigits
from clustpy.deep import get_dataloader
from clustpy.deep.neighbor_encoder import NeighborEncoder, get_neighbors_batchwise

X, L = load_optdigits()
n_neighbors = 3
neighbors = get_neighbors_batchwise(X, n_neighbors)
dataloader = get_dataloader(X, 256, True, additional_inputs=neighbors)
neighbor_encoder = NeighborEncoder(layers=[X.shape[1], 512, 256, 10], n_neighbors=n_neighbors)
neighbor_encoder.fit(dataloader=dataloader, n_epochs=100, lr=1e-3)
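The batching idea behind get_neighbors_batchwise can be sketched with NumPy alone. The sketch computes a (batch_size x data_size) block of squared Euclidean distances per step and skips each point itself. It returns NumPy arrays rather than torch.Tensors and supports only the squared Euclidean metric, so it is an illustration of the idea, not a replacement for the clustpy function; `neighbors_batchwise_sketch` is a hypothetical name.

```python
import numpy as np


def neighbors_batchwise_sketch(X, n_neighbors, batch_size=2):
    """Squared-Euclidean nearest neighbors, computed one batch of rows at a time."""
    neighbor_ids = np.empty((X.shape[0], n_neighbors), dtype=int)
    sq_norms = (X ** 2).sum(axis=1)
    for start in range(0, X.shape[0], batch_size):
        stop = min(start + batch_size, X.shape[0])
        # (batch_size x data_size) distance block instead of the full matrix
        dists = sq_norms[start:stop, None] - 2.0 * X[start:stop] @ X.T + sq_norms[None, :]
        # Force every point to rank first in its own row so that it can be skipped
        dists[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        order = np.argsort(dists, axis=1)
        neighbor_ids[start:stop] = order[:, 1:n_neighbors + 1]
    # One array per neighbor rank: [1-nearest neighbors, 2-nearest neighbors, ...]
    return [X[neighbor_ids[:, i]] for i in range(n_neighbors)]


X = np.array([[0.0], [1.0], [3.0], [7.0], [8.5]])
neighbors = neighbors_batchwise_sketch(X, n_neighbors=2)
```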
clustpy.deep.stacked_autoencoder module
@authors: Dominik Mautz
- class clustpy.deep.stacked_autoencoder.StackedAutoencoder(feature_dim, embedding_size, layer_dims=[500, 500, 2000], weight_initalizer=<function xavier_normal_>, activation_fn=<function StackedAutoencoder.<lambda>>, loss_fn=<function StackedAutoencoder.<lambda>>, optimizer_fn=<function StackedAutoencoder.<lambda>>, tied_weights=False, bias_init=0.0, linear_embedded=True, linear_decoder_last=True)[source]
Bases:
Module
- forward(input_data)[source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- forward_pretrain(input_data, stack, use_dropout=True, dropout_rate=0.2, dropout_is_training=True)[source]
- pretrain(dataset, device, rounds_per_layer=1000, dropout_rate=0.2, corruption_fn=None)[source]
Uses Adam to pretrain the model layer by layer.
- Parameters:
rounds_per_layer – number of pretraining rounds per layer (default: 1000)
corruption_fn – can be used to corrupt the input data for a denoising autoencoder (default: None)
- start_training(trainloader, device, steps_per_layer=10000, refine_training_steps=20000, dropout_rate=0.2)[source]
- training: bool
clustpy.deep.vade module
@authors: Donatella Novakovic, Lukas Miklautz, Collin Leiber
- class clustpy.deep.vade.VaDE(n_clusters: int, batch_size: int = 256, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = BCELoss(), autoencoder: ~torch.nn.modules.module.Module | None = None, embedding_size: int = 10, n_gmm_initializations: int = 100, random_state: ~numpy.random.mtrand.RandomState | None = None)[source]
Bases:
BaseEstimator, ClusterMixin
The Variational Deep Embedding (VaDE) algorithm. First, a variational autoencoder (VAE) will be trained (will be skipped if input autoencoder is given). Afterwards, a GMM will be fit to identify the initial clustering structures. Last, the VAE will be optimized using the VaDE loss function.
- Parameters:
n_clusters (int) – number of clusters
batch_size (int) – size of the data batches (default: 256)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.BCELoss())
autoencoder (torch.nn.Module) – the input autoencoder. If None a variation of a VariationalAutoencoder will be created (default: None)
embedding_size (int) – size of the embedding within the autoencoder (central layer with mean and variance) (default: 10)
n_gmm_initializations (int) – number of initializations for the initial GMM clustering within the embedding (default: 100)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The labels as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- cluster_centers_
The cluster centers as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- covariances_
The covariance matrices as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- vade_labels_
The labels as identified by VaDE after the training terminated
- Type:
np.ndarray
- vade_cluster_centers_
The cluster centers as identified by VaDE after the training terminated
- Type:
np.ndarray
- vade_covariances_
The covariance matrices as identified by VaDE after the training terminated
- Type:
np.ndarray
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
Examples
import numpy as np
from clustpy.data import load_mnist
from clustpy.deep.vade import VaDE
data, labels = load_mnist()
data = (data - np.mean(data)) / np.std(data)
vade = VaDE(n_clusters=10)
vade.fit(data)
References
Jiang, Zhuxi, et al. “Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering.” IJCAI. 2017.
- fit(X: ndarray, y: ndarray | None = None) VaDE[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the VaDE algorithm
- Return type:
VaDE
clustpy.deep.variational_autoencoder module
@authors: Lukas Miklautz, Donatella Novakovic, Collin Leiber
- class clustpy.deep.variational_autoencoder.VariationalAutoencoder(layers: list, batch_norm: bool = False, dropout: float | None = None, activation_fn: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.LeakyReLU'>, bias: bool = True, decoder_layers: list | None = None, decoder_output_fn: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.Sigmoid'>)[source]
Bases:
FlexibleAutoencoder
A variational autoencoder (VAE).
- Parameters:
layers (list) – list of the different layer sizes from input to embedding, e.g. an example architecture for MNIST [784, 512, 256, 10], where 784 is the input dimension and 10 the dimension of the mean and variance value in the central layer. If decoder_layers are not specified then the decoder is symmetric and uses the same layer sizes in reverse order, from embedding to input.
batch_norm (bool) – set True if you want to use torch.nn.BatchNorm1d (default: False)
dropout (float) – set the amount of dropout you want to use (default: None)
activation_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the hidden layers, if None then it will be linear (default: torch.nn.LeakyReLU)
bias (bool) – set False if you do not want to use a bias term in the linear layers (default: True)
decoder_layers (list) – list of different layer sizes from embedding to output of the decoder. If set to None, will be symmetric to layers (default: None)
decoder_output_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the decoder output layer, if None then it will be linear. e.g. set to torch.nn.Sigmoid if you want to scale the decoder output between 0 and 1 (default: torch.nn.Sigmoid)
- encoder
encoder part of the autoencoder, responsible for embedding data points (class is FullyConnectedBlock)
- Type:
FullyConnectedBlock
- decoder
decoder part of the autoencoder, responsible for reconstructing data points from the embedding (class is FullyConnectedBlock)
- Type:
FullyConnectedBlock
- mean
mean value of the central layer
- Type:
torch.nn.Linear
- log_variance
logarithmic variance of the central layer (the logarithm of the variance is used for numerical reasons)
- Type:
torch.nn.Linear
- fitted
boolean value indicating whether the autoencoder is already fitted.
- Type:
bool
References
Kingma, Diederik P., and Max Welling. “Auto-encoding variational Bayes.” Int. Conf. on Learning Representations.
- encode(x: ~torch.Tensor) -> (<class 'torch.Tensor'>, <class 'torch.Tensor'>)[source]
Apply the encoder function to x. Overwrites function from FlexibleAutoencoder.
- Parameters:
x (torch.Tensor) – input data point, can also be a mini-batch of points
- Returns:
tuple – the mean value of the central VAE layer and the logarithmic variance value of the central VAE layer (the logarithm of the variance is used for numerical reasons)
- Return type:
(torch.Tensor, torch.Tensor)
- forward(x: ~torch.Tensor) -> (<class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>)[source]
Applies both the encode and decode function. The forward function is automatically called if we call self(x). Overwrites function from FlexibleAutoencoder.
- Parameters:
x (torch.Tensor) – input data point, can also be a mini-batch of points
- Returns:
tuple – the sample drawn using q_mean and q_logvar, the mean value of the central VAE layer, the logarithmic variance value of the central VAE layer (the logarithm of the variance is used for numerical reasons), the reconstruction of the data point
- Return type:
(torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor)
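The first element of the returned tuple is a sample drawn from q_mean and q_logvar. A minimal sketch of how such a sample is commonly obtained with the reparameterization trick, written in plain Python so it runs without PyTorch. The function name sample_latent is hypothetical and not part of clustpy, and the library may compute the sample differently.

```python
import math
import random


def sample_latent(q_mean, q_logvar, rng=None):
    """Draw z = mean + std * eps with eps ~ N(0, 1) and std = exp(0.5 * logvar)."""
    if rng is None:
        rng = random.Random()
    z = []
    for mean, logvar in zip(q_mean, q_logvar):
        std = math.exp(0.5 * logvar)  # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)     # noise does not depend on the parameters
        z.append(mean + std * eps)
    return z


# Hypothetical 3-dimensional central layer
z = sample_latent([0.0, 1.0, -1.0], [0.0, -2.0, -4.0], rng=random.Random(0))
print(z)
```

Because the noise is separated from the mean and the log-variance, gradients can flow through both quantities during training.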
- loss(batch: list, loss_fn: _Loss, device: device, beta: float = 1) Tensor[source]
Calculate the loss of a single batch of data.
- Parameters:
batch (list) – the different parts of a dataloader (id, samples, …)
loss_fn (torch.nn.modules.loss._Loss) – loss function to be used for reconstruction
device (torch.device) – device to be trained on
beta (float) – weighting of the KL loss (default: 1)
- Returns:
total_loss – the total loss of the input sample (reconstruction loss combined with the beta-weighted KL loss)
- Return type:
torch.Tensor
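The beta parameter weights the KL term against the reconstruction term. The sketch below shows the usual VAE objective for a diagonal Gaussian posterior and a standard normal prior in plain Python. It assumes the total loss has the form reconstruction + beta * KL, which is the standard formulation and has not been checked against the clustpy source. The names kl_divergence and vae_loss are hypothetical.

```python
import math


def kl_divergence(q_mean, q_logvar):
    """KL( N(mean, exp(logvar)) || N(0, 1) ), summed over the embedding dimensions."""
    return -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv)
                      for m, lv in zip(q_mean, q_logvar))


def vae_loss(reconstruction_loss, q_mean, q_logvar, beta=1.0):
    """Assumed objective: reconstruction term plus beta-weighted KL term."""
    return reconstruction_loss + beta * kl_divergence(q_mean, q_logvar)


# A posterior equal to the prior contributes no KL penalty
print(kl_divergence([0.0, 0.0], [0.0, 0.0]))
# Moving the mean away from zero is penalized
print(vae_loss(0.25, [1.0, 0.0], [0.0, 0.0], beta=1.0))
```

Setting beta to 0 reduces the objective to the plain reconstruction loss; larger values pull the embedding more strongly towards the prior.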
- training: bool
Module contents
- class clustpy.deep.DCN(n_clusters: int, batch_size: int = 256, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), degree_of_space_distortion: float = 0.05, degree_of_space_preservation: float = 1.0, autoencoder: ~torch.nn.modules.module.Module | None = None, embedding_size: int = 10, random_state: ~numpy.random.mtrand.RandomState | None = None)[source]
Bases:
BaseEstimator, ClusterMixin
The Deep Clustering Network (DCN) algorithm. First, an autoencoder (AE) will be trained (will be skipped if input autoencoder is given). Afterwards, KMeans identifies the initial clusters. Last, the AE will be optimized using the DCN loss function.
- Parameters:
n_clusters (int) – number of clusters
batch_size (int) – size of the data batches (default: 256)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.MSELoss())
degree_of_space_distortion (float) – weight of the clustering loss (default: 0.05)
degree_of_space_preservation (float) – weight of the reconstruction loss (default: 1.0)
autoencoder (torch.nn.Module) – the input autoencoder. If None a new FlexibleAutoencoder will be created (default: None)
embedding_size (int) – size of the embedding within the autoencoder (default: 10)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dcn_labels_
The final DCN labels
- Type:
np.ndarray
- dcn_cluster_centers_
The final DCN cluster centers
- Type:
np.ndarray
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
Examples
from clustpy.data import load_mnist
from clustpy.deep import DCN
data, labels = load_mnist()
dcn = DCN(n_clusters=10)
dcn.fit(data)
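The weights degree_of_space_preservation and degree_of_space_distortion balance reconstruction against clustering. A minimal sketch of such a weighted objective for a single sample, in plain Python. It assumes the clustering term is the squared Euclidean distance between the embedded sample and its closest center, as in k-means; dcn_style_loss is a hypothetical helper and not clustpy's implementation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dcn_style_loss(embedded, centers, reconstruction_loss,
                   degree_of_space_distortion=0.05,
                   degree_of_space_preservation=1.0):
    """Weighted sum of the reconstruction loss and a k-means-style clustering loss."""
    # Assign the embedded sample to its closest center
    distances = [squared_distance(embedded, c) for c in centers]
    label = min(range(len(centers)), key=distances.__getitem__)
    cluster_loss = distances[label]
    total = (degree_of_space_preservation * reconstruction_loss
             + degree_of_space_distortion * cluster_loss)
    return label, total


centers = [[0.0, 0.0], [4.0, 4.0]]
label, loss = dcn_style_loss([1.0, 0.0], centers, reconstruction_loss=0.2)
print(label, loss)
```

With the default weights the reconstruction term dominates, so the embedding is only gently pulled towards the centers.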
References
Yang, Bo, et al. “Towards k-means-friendly spaces: Simultaneous deep learning and clustering.” International Conference on Machine Learning. PMLR, 2017.
- fit(X: ndarray, y: ndarray | None = None) DCN[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DCN algorithm
- Return type:
DCN
- class clustpy.deep.DEC(n_clusters: int, alpha: float = 1.0, batch_size: int = 256, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), autoencoder: ~torch.nn.modules.module.Module | None = None, embedding_size: int = 10, use_reconstruction_loss: bool = False, cluster_loss_weight: float = 1, random_state: ~numpy.random.mtrand.RandomState | None = None)[source]
Bases:
BaseEstimator, ClusterMixin
The Deep Embedded Clustering (DEC) algorithm. First, an autoencoder (AE) will be trained (will be skipped if input autoencoder is given). Afterwards, KMeans identifies the initial clusters. Last, the AE will be optimized using the DEC loss function.
- Parameters:
n_clusters (int) – number of clusters
alpha (float) – alpha value for the prediction (default: 1.0)
batch_size (int) – size of the data batches (default: 256)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.MSELoss())
autoencoder (torch.nn.Module) – the input autoencoder. If None a new FlexibleAutoencoder will be created (default: None)
embedding_size (int) – size of the embedding within the autoencoder (default: 10)
use_reconstruction_loss (bool) – defines whether the reconstruction loss will be used during clustering training (default: False)
cluster_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 1)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dec_labels_
The final DEC labels
- Type:
np.ndarray
- dec_cluster_centers_
The final DEC cluster centers
- Type:
np.ndarray
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
Examples
from clustpy.data import load_mnist
from clustpy.deep import DEC
data, labels = load_mnist()
dec = DEC(n_clusters=10)
dec.fit(data)
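The role of alpha can be illustrated with the soft assignment and the target distribution described in the DEC paper by Xie et al. The following plain-Python sketch follows those formulas; the function names are hypothetical, and how clustpy organizes this computation internally is not shown here.

```python
def soft_assignments(embedded, centers, alpha=1.0):
    """Student's t kernel between each embedded sample and each center, row-normalized."""
    q = []
    for z in embedded:
        row = []
        for c in centers:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, c))
            row.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q


def target_distribution(q):
    """Sharpened targets: square q, divide by the soft cluster size, renormalize rows."""
    cluster_sizes = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        weights = [v ** 2 / cluster_sizes[j] for j, v in enumerate(row)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


embedded = [[0.0, 0.0], [0.5, 0.0], [4.0, 4.0]]
centers = [[0.0, 0.0], [4.0, 4.0]]
q = soft_assignments(embedded, centers, alpha=1.0)
p = target_distribution(q)
print(q)
print(p)
```

The clustering loss in the paper is the KL divergence between p and q, which pushes confident assignments to become even more confident.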
References
Xie, Junyuan, Ross Girshick, and Ali Farhadi. “Unsupervised deep embedding for clustering analysis.” International conference on machine learning. 2016.
- fit(X: ndarray, y: ndarray | None = None) DEC[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DEC algorithm
- Return type:
DEC
- class clustpy.deep.DipDECK(n_clusters_init: int = 35, dip_merge_threshold: float = 0.9, cluster_loss_weight: float = 1, max_n_clusters: int = inf, min_n_clusters: int = 1, batch_size: int = 256, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 50, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), autoencoder: ~torch.nn.modules.module.Module | None = None, embedding_size: int = 5, max_cluster_size_diff_factor: float = 2, pval_strategy: str = 'table', n_boots: int = 1000, random_state: ~numpy.random.mtrand.RandomState | None = None, debug: bool = False)[source]
Bases:
BaseEstimator, ClusterMixin
The Deep Embedded Clustering with k-Estimation (DipDECK) algorithm. First, an autoencoder (AE) will be trained (will be skipped if input autoencoder is given). Afterwards, KMeans identifies the initial clusters using an overestimated number of clusters. Last, the AE will be optimized using the DipDECK loss function. If any Dip-value exceeds the dip_merge_threshold, the corresponding clusters will be merged.
- Parameters:
n_clusters_init (int) – initial number of clusters (default: 35)
dip_merge_threshold (float) – threshold regarding the Dip-p-value that defines if two clusters should be merged. Must be between 0 and 1 (default: 0.9)
cluster_loss_weight (float) – weight of the clustering loss compared to the reconstruction loss (default: 1)
max_n_clusters (int) – maximum number of clusters. Must be larger than min_n_clusters. If the result has more clusters, a merge will be forced (default: np.inf)
min_n_clusters (int) – minimum number of clusters. Must be larger than 0, smaller than max_n_clusters and smaller than n_clusters_init. When this number of clusters is reached, all further merge processes will be hindered (default: 1)
batch_size (int) – size of the data batches (default: 256)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure. Will reset after each merge (default: 50)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.MSELoss())
autoencoder (torch.nn.Module) – the input autoencoder. If None a new FlexibleAutoencoder will be created (default: None)
embedding_size (int) – size of the embedding within the autoencoder (default: 5)
max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used for the Dip calculation (default: 2)
pval_strategy (str) – Defines which strategy to use to receive dip-p-values. Possibilities are ‘table’, ‘function’ and ‘bootstrap’ (default: ‘table’)
n_boots (int) – Number of bootstraps used to calculate dip-p-values. Only necessary if pval_strategy is ‘bootstrap’ (default: 1000)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – If true, additional information will be printed to the console (default: False)
- labels_
The final labels
- Type:
np.ndarray
- n_clusters_
The final number of clusters
- Type:
int
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
Examples
from clustpy.data import load_mnist
from clustpy.deep import DipDECK
data, labels = load_mnist()
dipdeck = DipDECK()
dipdeck.fit(data)
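The parameters dip_merge_threshold, min_n_clusters and max_n_clusters together decide whether two clusters are merged. The sketch below is one possible reading of the parameter descriptions, in plain Python. The function merge_decision is hypothetical, and the order in which clustpy evaluates these conditions is an assumption.

```python
def merge_decision(dip_p_value, n_clusters, dip_merge_threshold=0.9,
                   min_n_clusters=1, max_n_clusters=float("inf")):
    """Return True if the pair of clusters with the given Dip-p-value should be merged."""
    if n_clusters <= min_n_clusters:
        return False  # lower bound reached: no further merges
    if n_clusters > max_n_clusters:
        return True   # too many clusters: a merge is forced
    return dip_p_value > dip_merge_threshold


# 35 initial clusters and a pair that looks unimodal (high p-value)
print(merge_decision(0.95, n_clusters=35))
# A clearly multimodal pair is kept separate
print(merge_decision(0.10, n_clusters=35))
```

A high Dip-p-value indicates that the two clusters together look unimodal, which is why exceeding the threshold triggers a merge.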
References
Leiber, Collin, et al. “Dip-based deep embedded clustering with k-estimation.” Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021.
- fit(X: ndarray, y: ndarray | None = None) DipDECK[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the DipDECK algorithm
- Return type:
DipDECK
- class clustpy.deep.DipEncoder(n_clusters: int, pretrain_batch_size: int = 256, batch_size: int | None = None, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 100, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), autoencoder: ~torch.nn.modules.module.Module | None = None, embedding_size: int = 10, max_cluster_size_diff_factor: float = 3, random_state: ~numpy.random.mtrand.RandomState | None = None, debug: bool = False)[source]
Bases:
BaseEstimator, ClusterMixin
The DipEncoder. Can be used either as a clustering procedure if no ground truth labels are given or as a supervised dimensionality reduction technique. First, an autoencoder (AE) will be trained (will be skipped if input autoencoder is given). Afterwards, KMeans identifies the initial clusters. Last, the AE will be optimized using the DipEncoder loss function.
- Parameters:
n_clusters (int) – number of clusters
pretrain_batch_size (int) – size of the data batches for the pretraining (default: 256)
batch_size (int) – size of the data batches for the actual training of the DipEncoder. Should be larger the more clusters we have. If it is None, it will be set to (25 x n_clusters) (default: None)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 100)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.MSELoss())
autoencoder (torch.nn.Module) – the input autoencoder. If None a new FlexibleAutoencoder will be created (default: None)
embedding_size (int) – size of the embedding within the autoencoder (default: 10)
max_cluster_size_diff_factor (float) – The maximum difference in size when comparing two clusters regarding the number of samples. If one cluster surpasses this difference factor, only the max_cluster_size_diff_factor*(size of smaller cluster) closest samples will be used (default: 3)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – If true, additional information will be printed to the console (default: False)
- labels_
The final labels
- Type:
np.ndarray
- projection_axes_
The final projection axes between the clusters
- Type:
np.ndarray
- index_dict_
A dictionary to match the indices of two clusters to a projection axis
- Type:
dict
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
Examples
from clustpy.data import load_mnist
from clustpy.deep import DipEncoder
data, labels = load_mnist()
dipencoder = DipEncoder(10)
dipencoder.fit(data)
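Two of the parameters above are defined by simple arithmetic rules: the fallback for batch_size and the sample limit implied by max_cluster_size_diff_factor. The plain-Python sketch below spells both out. The helper names are hypothetical, rounding the limit down to an integer is an assumption, and the selection of the closest samples itself is not shown.

```python
def resolve_batch_size(n_clusters, batch_size=None):
    """None falls back to 25 x n_clusters, as described for the batch_size parameter."""
    return 25 * n_clusters if batch_size is None else batch_size


def samples_used(size_a, size_b, max_cluster_size_diff_factor=3.0):
    """Number of samples of each cluster that enter the comparison of the two clusters."""
    smaller, larger = min(size_a, size_b), max(size_a, size_b)
    limit = int(max_cluster_size_diff_factor * smaller)  # assumed rounding
    # The larger cluster is cut down only if it surpasses the allowed factor
    kept_larger = min(larger, limit)
    return (kept_larger, smaller) if size_a >= size_b else (smaller, kept_larger)


print(resolve_batch_size(n_clusters=10))
print(samples_used(1000, 100))
```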
References
Leiber, Collin, et al. “The DipEncoder: Enforcing Multimodality in Autoencoders.” Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2022.
- fit(X: ndarray, y: ndarray | None = None) DipEncoder[source]
Initiate the actual clustering/dimensionality reduction process on the input data set. If no ground truth labels are given, the resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – The given (training) data set
y (np.ndarray) – The ground truth labels. If None, the DipEncoder will be used for clustering (default: None)
- Returns:
self – This instance of the DipEncoder
- Return type:
DipEncoder
- plot(X: ndarray, edge_width: float = 0.2, show_legend: bool = True) None[source]
Plot the current state of the DipEncoder. First the data set will be encoded using the autoencoder, afterwards the plot will be created. Uses the plot_scatter_matrix as a basis and adds projection axes in red.
- Parameters:
X (np.ndarray) – The data set
edge_width (float) – Specifies the width of the empty space (containing no points) at the edges of the plots
show_legend (bool) – Specifies whether a legend should be added to the plot
- predict(X_train: ndarray, X_test: ndarray) ndarray[source]
Predict the labels of the X_test dataset using the information gained by the fit function and the X_train dataset.
- Parameters:
X_train (np.ndarray) – The data set used to train the DipEncoder (i.e. to retrieve the projection axes, modal intervals, …)
X_test (np.ndarray) – The data set for which we want to retrieve the labels
- Returns:
labels_pred – The predicted labels for X_test
- Return type:
np.ndarray
- set_predict_request(*, X_test: bool | None | str = '$UNCHANGED$', X_train: bool | None | str = '$UNCHANGED$') DipEncoder
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the scikit-learn User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- Parameters:
X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in predict.
X_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_train parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- class clustpy.deep.ENRC(n_clusters: list, V: ~numpy.ndarray = None, P: list = None, input_centers: list = None, batch_size: int = 128, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 150, tolerance_threshold: float = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), degree_of_space_distortion: float = 1.0, degree_of_space_preservation: float = 1.0, autoencoder: ~torch.nn.modules.module.Module = None, embedding_size: int = 20, init: str = 'nrkmeans', device: ~torch.device = None, scheduler: torch.optim.lr_scheduler = None, scheduler_params: dict = None, init_kwargs: dict = None, init_subsample_size: int = None, random_state: ~numpy.random.mtrand.RandomState = None, debug: bool = False)[source]
Bases:
BaseEstimator, ClusterMixin
The Embedded Non-Redundant Clustering (ENRC) algorithm.
- Parameters:
n_clusters (list) – list containing number of clusters for each clustering
V (np.ndarray) – orthogonal rotation matrix (optional) (default: None)
P (list) – list containing projections for each clustering (optional) (default: None)
input_centers (list) – list containing the cluster centers for each clustering (optional) (default: None)
batch_size (int) – size of the data batches (default: 128)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – maximum number of epochs for the actual clustering procedure (default: 150)
tolerance_threshold (float) – tolerance threshold to determine when the training should stop. If NMI(old_labels, new_labels) >= (1 - tolerance_threshold) for all clusterings, the training will stop before the maximum number of epochs (clustering_epochs) is reached. A high value stops the training earlier; if set to 0 or None, the training continues until the labels no longer change or the maximum number of epochs is reached (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer for pretraining and training (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.MSELoss())
degree_of_space_distortion (float) – weight of the cluster loss term. The higher it is set the more the embedded space will be shaped to the assumed cluster structure (default: 1.0)
degree_of_space_preservation (float) – weight of regularization loss term, e.g., reconstruction loss (default: 1.0)
autoencoder (torch.nn.Module) – the input autoencoder. If None a new autoencoder will be created and trained (default: None)
embedding_size (int) – size of the embedding within the autoencoder. Only used if autoencoder is None (default: 20)
init (str) – choose which initialization strategy should be used. Has to be one of ‘nrkmeans’, ‘random’ or ‘sgd’ (default: ‘nrkmeans’)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
device (torch.device) – if device is None then it will be checked whether a gpu is available or not (default: None)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
init_kwargs (dict) – additional parameters that are used if init is a callable (optional) (default: None)
init_subsample_size (int) – specify if only a subsample of size ‘init_subsample_size’ of the data should be used for the initialization (optional) (default: None)
debug (bool) – if True additional information during the training will be printed (default: False)
- labels_
The final labels
- Type:
np.ndarray
- cluster_centers_
The final cluster centers
- Type:
np.ndarray
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
- Raises:
ValueError – if init is not one of ‘nrkmeans’, ‘random’, ‘auto’ or ‘sgd’.
References
Miklautz, Lukas & Dominik Mautz et al. “Deep embedded non-redundant clustering.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.
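The stopping rule described for tolerance_threshold compares the labels of consecutive epochs with the normalized mutual information (NMI). The plain-Python sketch below implements that check. It assumes NMI with arithmetic-mean normalization; the function names are hypothetical and clustpy may rely on a library implementation of NMI instead.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information with arithmetic-mean normalization."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
                      for (a, b), c in joint.items())
    entropy_a = sum(c / n * math.log(n / c) for c in count_a.values())
    entropy_b = sum(c / n * math.log(n / c) for c in count_b.values())
    if entropy_a == 0.0 and entropy_b == 0.0:
        return 1.0  # both labelings put every sample into a single cluster
    return mutual_info / (0.5 * (entropy_a + entropy_b))


def has_converged(old_labels, new_labels, tolerance_threshold=None):
    """old_labels / new_labels hold one label list per clustering."""
    tolerance = tolerance_threshold or 0.0  # None is treated like 0
    eps = 1e-12  # guards against floating-point rounding in the NMI
    return all(nmi(old, new) >= 1.0 - tolerance - eps
               for old, new in zip(old_labels, new_labels))


# Two clusterings; the label names changed, the partitions did not
old = [[0, 0, 1, 1], [0, 1, 2, 2]]
new = [[1, 1, 0, 0], [2, 0, 1, 1]]
print(has_converged(old, new, tolerance_threshold=None))
```

NMI is invariant to renaming the labels, so only actual changes in the partitions delay the stop.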
- fit(X: ndarray, y: ndarray | None = None) ENRC[source]
Cluster the input dataset with the ENRC algorithm. Saves the labels, centers, V, m, Betas, and P in the ENRC object. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – input data
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – returns the ENRC object
- Return type:
ENRC
- plot_subspace(X: ndarray, subspace_index: int, labels: ndarray | None = None, plot_centers: bool = False, gt: ndarray | None = None, equal_axis: bool = False) None[source]
Plot the specified subspace_nr as scatter matrix plot.
- Parameters:
X (np.ndarray) – input data
subspace_index (int) – index of the subspace_nr
labels (np.ndarray) – the labels to use for the plot. If None, the labels found by Nr-Kmeans are used (default: None)
plot_centers (bool) – plot centers if True (default: False)
gt (np.ndarray) – ground truth labels (default: None)
equal_axis (bool) – equalize axis if True (default: False)
- Return type:
scatter matrix plot of the input data
- predict(X: ndarray, y: ndarray | None = None, use_P: bool = True) ndarray[source]
Predicts the labels for each clustering of X in a mini-batch manner.
- Parameters:
X (np.ndarray) – input data
y (np.ndarray) – the labels (can be ignored)
use_P (bool) – if True then P will be used to hard select the dimensions for each clustering, else the soft beta weights are used (default: True)
- Returns:
predicted_labels – n x c matrix, where n is the number of data points in X and c is the number of clusterings.
- Return type:
np.ndarray
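The returned matrix holds one column of labels per clustering. The plain-Python sketch below illustrates this n x c layout and the difference between a hard selection of dimensions and soft weights. It assumes a weighted squared Euclidean distance to the centers of each clustering, with 0/1 weights standing in for P and fractional weights standing in for the betas. This is an illustration only; predict_labels is hypothetical and clustpy's computation may differ.

```python
def predict_labels(rotated, centers, weights):
    """
    rotated: n samples in the rotated embedded space (list of lists)
    centers: centers[c][k] is center k of clustering c, in the same space
    weights: weights[c][d] is the weight of dimension d for clustering c
             (0/1 entries mimic a hard selection, fractional entries soft weights)
    Returns an n x c list of labels.
    """
    labels = []
    for z in rotated:
        row = []
        for cluster_centers, w in zip(centers, weights):
            dists = [sum(wd * (zd - cd) ** 2 for wd, zd, cd in zip(w, z, center))
                     for center in cluster_centers]
            row.append(dists.index(min(dists)))
        labels.append(row)
    return labels


rotated = [[0.0, 0.0], [0.0, 5.0], [5.0, 0.0]]
centers = [
    [[0.0, 0.0], [5.0, 0.0]],  # clustering 0 separates along dimension 0
    [[0.0, 0.0], [0.0, 5.0]],  # clustering 1 separates along dimension 1
]
hard_weights = [[1.0, 0.0], [0.0, 1.0]]
print(predict_labels(rotated, centers, hard_weights))
```

Each row of the result describes one sample, and each column corresponds to one of the non-redundant clusterings.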
- reconstruct_subspace_centroids(subspace_index: int) ndarray[source]
Reconstructs the centroids in the specified subspace_nr.
- Parameters:
subspace_index (int) – index of the subspace_nr
- Returns:
centers_rec – reconstructed centers as np.ndarray
- Return type:
np.ndarray
- set_predict_request(*, use_P: bool | None | str = '$UNCHANGED$') ENRC
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the scikit-learn User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- Parameters:
use_P (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for use_P parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- transform_full_space(X: ndarray, embedded=False) ndarray[source]
Embeds the input dataset with the autoencoder and the matrix V from the ENRC object.
- Parameters:
X (np.ndarray) – input data
embedded (bool) – if True, then X is assumed to be already embedded (default: False)
- Returns:
rotated – The transformed data
- Return type:
np.ndarray
- transform_subspace(X: ndarray, subspace_index: int, embedded: bool = False) ndarray[source]
Embeds the input dataset with the autoencoder and with the matrix V projected onto a specific subspace (selected by subspace_index).
- Parameters:
X (np.ndarray) – input data
subspace_index (int) – index of the subspace_nr
embedded (bool) – if True, then X is assumed to be already embedded (default: False)
- Returns:
subspace – The transformed subspace
- Return type:
np.ndarray
- class clustpy.deep.FlexibleAutoencoder(layers: list, batch_norm: bool = False, dropout: float | None = None, activation_fn: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.LeakyReLU'>, bias: bool = True, decoder_layers: list | None = None, decoder_output_fn: ~torch.nn.modules.module.Module | None = None)[source]
Bases:
Module
A flexible feedforward autoencoder.
- Parameters:
layers (list) – list of the different layer sizes from input to embedding, e.g. an example architecture for MNIST [784, 512, 256, 10], where 784 is the input dimension and 10 the embedding dimension. If decoder_layers are not specified then the decoder is symmetric and goes in the same order from embedding to input.
batch_norm (bool) – Set True if you want to use torch.nn.BatchNorm1d (default: False)
dropout (float) – Set the amount of dropout you want to use (default: None)
activation_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the hidden layers, if None then it will be linear (default: torch.nn.LeakyReLU)
bias (bool) – set False if you do not want to use a bias term in the linear layers (default: True)
decoder_layers (list) – list of different layer sizes from embedding to output of the decoder. If set to None, will be symmetric to layers (default: None)
decoder_output_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the decoder output layer, if None then it will be linear. e.g. set to torch.nn.Sigmoid if you want to scale the decoder output between 0 and 1 (default: None)
- encoder
encoder part of the autoencoder, responsible for embedding data points (class is FullyConnectedBlock)
- Type:
FullyConnectedBlock
- decoder
decoder part of the autoencoder, responsible for reconstructing data points from the embedding (class is FullyConnectedBlock)
- Type:
FullyConnectedBlock
- fitted
boolean value indicating whether the autoencoder is already fitted.
- Type:
bool
References
E.g. Ballard, Dana H. “Modular learning in neural networks.” Aaai. Vol. 647. 1987.
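To make the layers / decoder_layers convention concrete, the following minimal sketch derives the decoder layer sizes implied by the parameter descriptions above. The helper name resolve_decoder_layers is hypothetical and not part of clustpy; it only mirrors the documented default, namely a symmetric decoder when decoder_layers is None.

```python
def resolve_decoder_layers(layers, decoder_layers=None):
    """Hypothetical helper: return decoder layer sizes from embedding to output.

    Mirrors the documented default only: if decoder_layers is None, the
    decoder is symmetric to the encoder layers.
    """
    if decoder_layers is None:
        return list(reversed(layers))
    return list(decoder_layers)


# MNIST-style architecture from the parameter description
encoder_layers = [784, 512, 256, 10]
print(resolve_decoder_layers(encoder_layers))             # [10, 256, 512, 784]
print(resolve_decoder_layers(encoder_layers, [10, 784]))  # [10, 784]
```

An asymmetric decoder is requested simply by passing its own list of sizes, starting at the embedding dimension and ending at the output dimension.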
- decode(embedded: Tensor) Tensor[source]
Apply the decoder function to embedded.
- Parameters:
embedded (torch.Tensor) – embedded data point, can also be a mini-batch of embedded points
- Returns:
decoded – returns the reconstruction of embedded
- Return type:
torch.Tensor
- encode(x: Tensor) Tensor[source]
Apply the encoder function to x.
- Parameters:
x (torch.Tensor) – input data point, can also be a mini-batch of points
- Returns:
embedded – the embedded data point with dimensionality embedding_size
- Return type:
torch.Tensor
- evaluate(dataloader: DataLoader, loss_fn: _Loss, device: device = device(type='cpu')) Tensor[source]
Evaluates the autoencoder.
- Parameters:
dataloader (torch.utils.data.DataLoader) – dataloader to be used for evaluation
loss_fn (torch.nn.modules.loss._Loss) – loss function to be used for reconstruction
device (torch.device) – device to be trained on (default: torch.device('cpu'))
- Returns:
loss – returns the reconstruction loss of all samples in dataloader
- Return type:
torch.Tensor
- fit(n_epochs: int, lr: float, batch_size: int = 128, data: ~numpy.ndarray = None, data_eval: ~numpy.ndarray = None, dataloader: ~torch.utils.data.dataloader.DataLoader = None, evalloader: ~torch.utils.data.dataloader.DataLoader = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), patience: int = 5, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, device: ~torch.device = device(type='cpu'), model_path: str = None, print_step: int = 0) FlexibleAutoencoder[source]
Trains the autoencoder in place.
- Parameters:
n_epochs (int) – number of epochs for training
lr (float) – learning rate to be used for the optimizer_class
batch_size (int) – size of the data batches (default: 128)
data (np.ndarray) – train data set. If data is passed then dataloader can remain empty (default: None)
data_eval (np.ndarray) – evaluation data set. If data_eval is passed then evalloader can remain empty (default: None)
dataloader (torch.utils.data.DataLoader) – dataloader to be used for training (default: None)
evalloader (torch.utils.data.DataLoader) – dataloader to be used for evaluation, early stopping and learning rate scheduling if scheduler=torch.optim.lr_scheduler.ReduceLROnPlateau (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer to be used (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function to be used for reconstruction (default: torch.nn.MSELoss())
patience (int) – patience parameter for EarlyStopping (default: 5)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used. If torch.optim.lr_scheduler.ReduceLROnPlateau is used then the behaviour is matched by providing the validation_loss calculated based on samples from evalloader (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
device (torch.device) – device to be trained on (default: torch.device('cpu'))
model_path (str) – if specified will save the trained model to the location. If evalloader is used, then only the best model w.r.t. evaluation loss is saved (default: None)
print_step (int) – specifies how often the losses are printed. If 0, no prints will occur (default: 0)
- Returns:
self – this instance of the FlexibleAutoencoder
- Return type:
- Raises:
ValueError – data cannot be None if dataloader is None
ValueError – evalloader cannot be None if scheduler=torch.optim.lr_scheduler.ReduceLROnPlateau
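The two documented ValueError conditions can be sketched as a small input check. This is a stand-alone illustration, not clustpy's code: the helper name check_fit_inputs and the boolean flag uses_reduce_on_plateau are hypothetical, and the sketch assumes that passing data_eval counts as providing an evalloader, as the parameter description of data_eval suggests.

```python
def check_fit_inputs(data=None, data_eval=None, dataloader=None,
                     evalloader=None, uses_reduce_on_plateau=False):
    """Hypothetical helper mirroring the documented ValueError conditions."""
    # Training needs either a raw data array or a ready-made dataloader
    if data is None and dataloader is None:
        raise ValueError("data cannot be None if dataloader is None")
    # Assumption: data_eval can stand in for an explicit evalloader
    has_eval_source = evalloader is not None or data_eval is not None
    if uses_reduce_on_plateau and not has_eval_source:
        raise ValueError(
            "evalloader cannot be None if scheduler is ReduceLROnPlateau")
    return True


check_fit_inputs(data=[[0.0, 1.0]])                     # passes
check_fit_inputs(data=[[0.0, 1.0]], data_eval=[[1.0, 0.0]],
                 uses_reduce_on_plateau=True)           # passes
```

The second condition exists because ReduceLROnPlateau needs a validation loss to react to, and that loss is computed from the evaluation samples.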
- forward(x: Tensor) Tensor[source]
Applies both the encode and decode function. The forward function is automatically called if we call self(x).
- Parameters:
x (torch.Tensor) – input data point, can also be a mini-batch of points
- Returns:
reconstruction – returns the reconstruction of a data point
- Return type:
torch.Tensor
- loss(batch: list, loss_fn: _Loss, device: device) Tensor[source]
Calculate the loss of a single batch of data.
- Parameters:
batch (list) – the different parts of a dataloader (id, samples, …)
loss_fn (torch.nn.modules.loss._Loss) – loss function to be used for reconstruction
device (torch.device) – device to be trained on
- Returns:
loss – returns the reconstruction loss of the input sample
- Return type:
torch.Tensor
- training: bool
- class clustpy.deep.IDEC(n_clusters: int, alpha: float = 1.0, batch_size: int = 256, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), autoencoder: ~torch.nn.modules.module.Module | None = None, embedding_size: int = 10, random_state: ~numpy.random.mtrand.RandomState | None = None)[source]
Bases:
DEC
The Improved Deep Embedded Clustering (IDEC) algorithm. Implemented as a child of the DEC class. It therefore matches the __init__ of DEC, but with use_reconstruction_loss=True and cluster_loss_weight=0.1 fixed.
- Parameters:
n_clusters (int) – number of clusters
alpha (float) – alpha value for the prediction (default: 1.0)
batch_size (int) – size of the data batches (default: 256)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.MSELoss())
autoencoder (torch.nn.Module) – the input autoencoder. If None a new FlexibleAutoencoder will be created (default: None)
embedding_size (int) – size of the embedding within the autoencoder (default: 10)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The final labels (obtained by a final KMeans execution)
- Type:
np.ndarray
- cluster_centers_
The final cluster centers (obtained by a final KMeans execution)
- Type:
np.ndarray
- dec_labels_
The final DEC labels
- Type:
np.ndarray
- dec_cluster_centers_
The final DEC cluster centers
- Type:
np.ndarray
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
Examples
from clustpy.data import load_mnist
from clustpy.deep import IDEC
data, labels = load_mnist()
idec = IDEC(n_clusters=10)
idec.fit(data)
References
Guo, Xifeng, et al. “Improved deep embedded clustering with local structure preservation.” IJCAI. 2017.
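As background for the alpha parameter, the sketch below computes DEC-style soft cluster assignments with a Student's t kernel and the sharpened target distribution, following the formulas in the DEC and IDEC papers. It is a pure-Python illustration under that assumption and is not taken from clustpy's implementation; the function names soft_assignments and target_distribution are hypothetical.

```python
def soft_assignments(embedded, centers, alpha=1.0):
    """Student's t kernel between embedded points and cluster centers."""
    q = []
    for z in embedded:
        row = []
        for mu in centers:
            squared_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + squared_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([value / total for value in row])  # each row sums to 1
    return q


def target_distribution(q):
    """Sharpen q: square the assignments and normalise by cluster frequency."""
    n_clusters = len(q[0])
    frequency = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / frequency[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


embedded = [[0.0, 0.0], [4.0, 4.0]]
centers = [[0.0, 0.0], [4.0, 4.0]]
q = soft_assignments(embedded, centers, alpha=1.0)
p = target_distribution(q)
```

In the papers, the clustering loss is the KL divergence between p and q; IDEC adds the reconstruction loss of the autoencoder to it, which matches the fixed use_reconstruction_loss=True mentioned above.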
- class clustpy.deep.NeighborEncoder(layers: list, n_neighbors: int, decode_self: bool = False, batch_norm: bool = False, dropout: float | None = None, activation_fn: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.LeakyReLU'>, bias: bool = True, decoder_layers: list | None = None, decoder_output_fn: ~torch.nn.modules.module.Module | None = None)[source]
Bases:
FlexibleAutoencoder
A NeighborEncoder. Does not compare the reconstruction of an object to itself but to its nearest neighbors. For more information see the stated reference. If n_neighbors is 0 and decode_self is true, the NeighborEncoder will work as a regular FlexibleAutoencoder.
- Parameters:
layers (list) – list of the different layer sizes from input to embedding, e.g. an example architecture for MNIST [784, 512, 256, 10], where 784 is the input dimension and 10 the embedding dimension. If decoder_layers are not specified then the decoder is symmetric and goes in the same order from embedding to input.
n_neighbors (int) – the number of nearest neighbors to be considered
decode_self (bool) – specifies whether a point itself should also be decoded (default: False)
batch_norm (bool) – Set True if you want to use torch.nn.BatchNorm1d (default: False)
dropout (float) – Set the amount of dropout you want to use (default: None)
activation_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the hidden layers, if None then it will be linear (default: torch.nn.LeakyReLU)
bias (bool) – set False if you do not want to use a bias term in the linear layers (default: True)
decoder_layers (list) – list of different layer sizes from embedding to output of the decoder. If set to None, will be symmetric to layers (default: None)
decoder_output_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the decoder output layer, if None then it will be linear. e.g. set to torch.nn.Sigmoid if you want to scale the decoder output between 0 and 1 (default: None)
- encoder
encoder part of the autoencoder, responsible for embedding data points (class is FullyConnectedBlock)
- Type:
- decoder
decoder part of the autoencoder, responsible for reconstructing the data point itself (class is FullyConnectedBlock). Only used if decode_self is true.
- Type:
- fitted
boolean value indicating whether the autoencoder is already fitted.
- Type:
bool
- neighbor_decoders
list containing one decoder network (class is FullyConnectedBlock) for each nearest neighbor
- Type:
list
Examples
import numpy as np
from scipy.spatial.distance import pdist, squareform
from clustpy.data import load_optdigits
from clustpy.deep import NeighborEncoder, get_dataloader
from clustpy.deep._utils import detect_device

X, L = load_optdigits()
device = detect_device()
n_neighbors = 3
# Pairwise distances; column 0 of the argsort is usually the point itself, so neighbors start at column 1
dist_matrix = squareform(pdist(X))
neighbor_ids = np.argsort(dist_matrix, axis=1)
neighbors = [X[neighbor_ids[:, 1 + i]] for i in range(n_neighbors)]
# Alternatively: neighbors = get_neighbors_batchwise(X, n_neighbors)
# The neighbors are appended to each batch after the sample ids and the samples
dataloader = get_dataloader(X, 256, True, additional_inputs=neighbors)
neighbor_encoder = NeighborEncoder(layers=[X.shape[1], 512, 256, 10], n_neighbors=n_neighbors, decode_self=False)
neighbor_encoder.fit(dataloader=dataloader, device=device, n_epochs=100, lr=1e-3)
References
Yeh, Chin-Chia Michael, et al. “Representation Learning by Reconstructing Neighborhoods.” arXiv preprint arXiv:1811.01557 (2018).
- decode(embedded: Tensor) Tensor[source]
Apply the decoder function of each neighbor network to embedded. Returns a (n_neighbors x batch_size x dimensionality) tensor if decode_self is false, else a ((n_neighbors + 1) x batch_size x dimensionality) tensor.
- Parameters:
embedded (torch.Tensor) – embedded data point, can also be a mini-batch of embedded points
- Returns:
decoded_neighbors – returns the reconstruction of embedded concerning each neighbor
- Return type:
torch.Tensor
- fit(n_epochs: int, lr: float, batch_size: int = 128, dataloader: ~torch.utils.data.dataloader.DataLoader = None, evalloader: ~torch.utils.data.dataloader.DataLoader = None, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = MSELoss(), patience: int = 5, scheduler: ~torch.optim.lr_scheduler = None, scheduler_params: dict = None, device: ~torch.device = device(type='cpu'), model_path: str = None, print_step: int = 0) NeighborEncoder[source]
Trains the NeighborEncoder in place. Equivalent to the fit function of the FlexibleAutoencoder, but only works with a dataloader (not with a regular data array), because the dataloader must contain the nearest neighbors of each point at the positions 2, 3, ….
- Parameters:
n_epochs (int) – number of epochs for training
lr (float) – learning rate to be used for the optimizer_class
batch_size (int) – size of the data batches (default: 128)
dataloader (torch.utils.data.DataLoader) – dataloader to be used for training (default: None)
evalloader (torch.utils.data.DataLoader) – dataloader to be used for evaluation, early stopping and learning rate scheduling if scheduler=torch.optim.lr_scheduler.ReduceLROnPlateau (default: None)
optimizer_class (torch.optim.Optimizer) – optimizer to be used (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function to be used for reconstruction (default: torch.nn.MSELoss())
patience (int) – patience parameter for EarlyStopping (default: 5)
scheduler (torch.optim.lr_scheduler) – learning rate scheduler that should be used. If torch.optim.lr_scheduler.ReduceLROnPlateau is used then the behaviour is matched by providing the validation_loss calculated based on samples from evalloader (default: None)
scheduler_params (dict) – dictionary of the parameters of the scheduler object (default: None)
device (torch.device) – device to be trained on (default: torch.device('cpu'))
model_path (str) – if specified will save the trained model to the location. If evalloader is used, then only the best model w.r.t. evaluation loss is saved (default: None)
print_step (int) – specifies how often the losses are printed. If 0, no prints will occur (default: 0)
- Returns:
self – this instance of the NeighborEncoder
- Return type:
NeighborEncoder
- loss(batch: list, loss_fn: _Loss, device: device) Tensor[source]
Calculate the loss of a single batch of data. Corresponds to the sum of losses concerning each neighbor. batch must contain the data object at the first position and the neighbors at the following positions.
- Parameters:
batch (list) – the different parts of a dataloader (id, samples, 1-nearest-neighbor, 2-nearest-neighbor, …)
loss_fn (torch.nn.modules.loss._Loss) – loss function to be used for reconstruction
device (torch.device) – device to be trained on
- Returns:
loss – returns the sum of the reconstruction losses of the input sample
- Return type:
torch.Tensor
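The "sum of losses concerning each neighbor" can be sketched in pure Python as follows. This is an illustration of the documented batch layout (id, samples, 1-nearest-neighbor, 2-nearest-neighbor, …) and not clustpy's implementation: the names mse and neighbor_loss are hypothetical, nested lists stand in for tensors, and the decode_self case is left out because the position of the self-reconstruction is not stated here.

```python
def mse(prediction, target):
    """Mean squared error over two equally shaped lists of vectors."""
    n_values = sum(len(row) for row in target)
    squared_error = sum((p - t) ** 2
                        for p_row, t_row in zip(prediction, target)
                        for p, t in zip(p_row, t_row))
    return squared_error / n_values


def neighbor_loss(batch, decoded_neighbors, loss_fn=mse):
    """Sum the reconstruction losses over all neighbors of a batch.

    batch[0] holds the sample ids, batch[1] the samples and batch[2:] the
    neighbors; decoded_neighbors[i] is the output of the i-th neighbor decoder.
    """
    neighbors = batch[2:]
    if len(neighbors) != len(decoded_neighbors):
        raise ValueError("one decoded output per neighbor is required")
    return sum(loss_fn(decoded, neighbor)
               for decoded, neighbor in zip(decoded_neighbors, neighbors))


# Two samples with two neighbors each (one-dimensional toy data)
batch = [[0, 1], [[0.0], [1.0]], [[1.0], [0.0]], [[2.0], [2.0]]]
decoded = [[[1.0], [0.0]], [[1.0], [1.0]]]
total = neighbor_loss(batch, decoded)
```

Here the first neighbor is reconstructed perfectly and the second one is off by 1 in every entry, so only the second decoder contributes to the sum.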
- training: bool
- class clustpy.deep.StackedAutoencoder(feature_dim, embedding_size, layer_dims=[500, 500, 2000], weight_initalizer=<function xavier_normal_>, activation_fn=<function StackedAutoencoder.<lambda>>, loss_fn=<function StackedAutoencoder.<lambda>>, optimizer_fn=<function StackedAutoencoder.<lambda>>, tied_weights=False, bias_init=0.0, linear_embedded=True, linear_decoder_last=True)[source]
Bases:
Module
- forward(input_data)[source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- forward_pretrain(input_data, stack, use_dropout=True, dropout_rate=0.2, dropout_is_training=True)[source]
- pretrain(dataset, device, rounds_per_layer=1000, dropout_rate=0.2, corruption_fn=None)[source]
Uses Adam to pretrain the model layer by layer.
- Parameters:
rounds_per_layer
corruption_fn – can be used to corrupt the input data for a denoising autoencoder
- start_training(trainloader, device, steps_per_layer=10000, refine_training_steps=20000, dropout_rate=0.2)[source]
- training: bool
- class clustpy.deep.VaDE(n_clusters: int, batch_size: int = 256, pretrain_learning_rate: float = 0.001, clustering_learning_rate: float = 0.0001, pretrain_epochs: int = 100, clustering_epochs: int = 150, optimizer_class: ~torch.optim.optimizer.Optimizer = <class 'torch.optim.adam.Adam'>, loss_fn: ~torch.nn.modules.loss._Loss = BCELoss(), autoencoder: ~torch.nn.modules.module.Module | None = None, embedding_size: int = 10, n_gmm_initializations: int = 100, random_state: ~numpy.random.mtrand.RandomState | None = None)[source]
Bases:
BaseEstimator, ClusterMixin
The Variational Deep Embedding (VaDE) algorithm. First, a variational autoencoder (VAE) will be trained (will be skipped if input autoencoder is given). Afterwards, a GMM will be fit to identify the initial clustering structures. Last, the VAE will be optimized using the VaDE loss function.
- Parameters:
n_clusters (int) – number of clusters
batch_size (int) – size of the data batches (default: 256)
pretrain_learning_rate (float) – learning rate for the pretraining of the autoencoder (default: 1e-3)
clustering_learning_rate (float) – learning rate of the actual clustering procedure (default: 1e-4)
pretrain_epochs (int) – number of epochs for the pretraining of the autoencoder (default: 100)
clustering_epochs (int) – number of epochs for the actual clustering procedure (default: 150)
optimizer_class (torch.optim.Optimizer) – the optimizer class (default: torch.optim.Adam)
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction (default: torch.nn.BCELoss())
autoencoder (torch.nn.Module) – the input autoencoder. If None a variation of a VariationalAutoencoder will be created (default: None)
embedding_size (int) – size of the embedding within the autoencoder (central layer with mean and variance) (default: 10)
n_gmm_initializations (int) – number of initializations for the initial GMM clustering within the embedding (default: 100)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
- labels_
The labels as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- cluster_centers_
The cluster centers as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- covariances_
The covariance matrices as identified by a final Gaussian Mixture Model
- Type:
np.ndarray
- vade_labels_
The labels as identified by VaDE after the training terminated
- Type:
np.ndarray
- vade_cluster_centers_
The cluster centers as identified by VaDE after the training terminated
- Type:
np.ndarray
- vade_covariances_
The covariance matrices as identified by VaDE after the training terminated
- Type:
np.ndarray
- autoencoder
The final autoencoder
- Type:
torch.nn.Module
Examples
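import numpy as np
from clustpy.data import load_mnist
from clustpy.deep import VaDE
data, labels = load_mnist()
# Standardize the data set with its global mean and standard deviation
data = (data - np.mean(data)) / np.std(data)
vade = VaDE(n_clusters=10)
vade.fit(data)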
References
Jiang, Zhuxi, et al. “Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering.” IJCAI. 2017.
- fit(X: ndarray, y: ndarray | None = None) VaDE[source]
Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.
- Parameters:
X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)
- Returns:
self – this instance of the VaDE algorithm
- Return type:
VaDE
- class clustpy.deep.VariationalAutoencoder(layers: list, batch_norm: bool = False, dropout: float | None = None, activation_fn: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.LeakyReLU'>, bias: bool = True, decoder_layers: list | None = None, decoder_output_fn: ~torch.nn.modules.module.Module = <class 'torch.nn.modules.activation.Sigmoid'>)[source]
Bases:
FlexibleAutoencoder
A variational autoencoder (VAE).
- Parameters:
layers (list) – list of the different layer sizes from input to embedding, e.g. an example architecture for MNIST [784, 512, 256, 10], where 784 is the input dimension and 10 the dimension of the mean and variance value in the central layer. If decoder_layers are not specified then the decoder is symmetric and goes in the same order from embedding to input.
batch_norm (bool) – set True if you want to use torch.nn.BatchNorm1d (default: False)
dropout (float) – set the amount of dropout you want to use (default: None)
activation_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the hidden layers, if None then it will be linear (default: torch.nn.LeakyReLU)
bias (bool) – set False if you do not want to use a bias term in the linear layers (default: True)
decoder_layers (list) – list of different layer sizes from embedding to output of the decoder. If set to None, will be symmetric to layers (default: None)
decoder_output_fn (torch.nn.Module) – activation function from torch.nn, set the activation function for the decoder output layer, if None then it will be linear. e.g. set to torch.nn.Sigmoid if you want to scale the decoder output between 0 and 1 (default: torch.nn.Sigmoid)
- encoder
encoder part of the autoencoder, responsible for embedding data points (class is FullyConnectedBlock)
- Type:
FullyConnectedBlock
- decoder
decoder part of the autoencoder, responsible for reconstructing data points from the embedding (class is FullyConnectedBlock)
- Type:
FullyConnectedBlock
- mean
mean value of the central layer
- Type:
torch.nn.Linear
- log_variance
logarithmic variance of the central layer (the logarithm of the variance is used for numerical reasons)
- Type:
torch.nn.Linear
- fitted
boolean value indicating whether the autoencoder is already fitted.
- Type:
bool
References
Kingma, Diederik P., and Max Welling. “Auto-encoding variational Bayes.” Int. Conf. on Learning Representations.
- encode(x: Tensor) (Tensor, Tensor)[source]
Apply the encoder function to x. Overrides the function from FlexibleAutoencoder.
- Parameters:
x (torch.Tensor) – input data point, can also be a mini-batch of points
- Returns:
tuple – mean value of the central VAE layer, logarithmic variance value of the central VAE layer (the logarithm of the variance is used for numerical reasons)
- Return type:
(torch.Tensor, torch.Tensor)
- forward(x: Tensor) (Tensor, Tensor, Tensor, Tensor)[source]
Applies both the encode and decode function. The forward function is automatically called if we call self(x). Overrides the function from FlexibleAutoencoder.
- Parameters:
x (torch.Tensor) – input data point, can also be a mini-batch of points
- Returns:
tuple – sampling using q_mean and q_logvar, mean value of the central VAE layer, logarithmic variance value of the central VAE layer (the logarithm of the variance is used for numerical reasons), the reconstruction of the data point
- Return type:
(torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor)
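The "sampling using q_mean and q_logvar" returned by forward is, in a standard VAE, obtained with the reparameterization trick from the Kingma and Welling reference. The sketch below shows that trick in pure Python; it assumes the standard formulation and is not copied from clustpy, and the function name sample_latent is hypothetical.

```python
import math
import random


def sample_latent(q_mean, q_logvar, rng=random):
    """Reparameterization trick: z = mean + std * eps with eps ~ N(0, 1)."""
    return [mean + math.exp(0.5 * logvar) * rng.gauss(0.0, 1.0)
            for mean, logvar in zip(q_mean, q_logvar)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
z = sample_latent([1.0, -2.0], [0.0, 0.0], rng)

# With a very small variance the sample collapses onto the mean
z_sharp = sample_latent([1.0, -2.0], [-100.0, -100.0], rng)
```

Working with the log-variance keeps the standard deviation positive without any constraint on the network output, since exp(0.5 * logvar) is always greater than zero.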
- loss(batch: list, loss_fn: _Loss, device: device, beta: float = 1) Tensor[source]
Calculate the loss of a single batch of data.
- Parameters:
batch (list) – the different parts of a dataloader (id, samples, …)
loss_fn (torch.nn.modules.loss._Loss) – loss function to be used for reconstruction
device (torch.device) – device to be trained on
beta (float) – weighting of the KL loss (default: 1)
- Returns:
total_loss – the total loss of the input sample (reconstruction loss plus the KL loss weighted by beta)
- Return type:
torch.Tensor
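To illustrate how beta weights the KL loss, the sketch below uses the closed-form KL divergence between a diagonal Gaussian and the standard normal prior, as in the standard VAE objective. It is an assumption that the loss here follows this textbook form; the reduction over the batch (sum or mean) is not specified above and is left out. The function names are hypothetical.

```python
import math


def kl_to_standard_normal(q_mean, q_logvar):
    """Closed-form KL( N(mean, exp(logvar)) || N(0, 1) ) for one sample."""
    return -0.5 * sum(1.0 + logvar - mean * mean - math.exp(logvar)
                      for mean, logvar in zip(q_mean, q_logvar))


def vae_total_loss(reconstruction_loss, q_mean, q_logvar, beta=1.0):
    """Reconstruction loss plus the beta-weighted KL term."""
    return reconstruction_loss + beta * kl_to_standard_normal(q_mean, q_logvar)


kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # posterior equals prior
kl_shifted = kl_to_standard_normal([1.0], [0.0])         # mean shifted by 1
loss = vae_total_loss(2.0, [1.0], [0.0], beta=2.0)
```

Setting beta above 1 pulls the embedding more strongly towards the prior, while beta=0 reduces the loss to the plain reconstruction loss.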
- training: bool
- clustpy.deep.encode_batchwise(dataloader: DataLoader, module: Module, device: device) ndarray[source]
Utility function for embedding the whole data set in a mini-batch fashion
- Parameters:
dataloader (torch.utils.data.DataLoader) – dataloader to be used
module (torch.nn.Module) – the module that is used for the encoding (e.g. an autoencoder)
device (torch.device) – device to be trained on
- Returns:
embeddings_numpy – The embedded data set
- Return type:
np.ndarray
- clustpy.deep.get_dataloader(X: ~numpy.ndarray, batch_size: int, shuffle: bool = True, drop_last: bool = False, additional_inputs: list | None = None, dataset_class: ~torch.utils.data.dataset.Dataset = <class 'clustpy.deep._data_utils._ClustpyDataset'>, **dl_kwargs: any) DataLoader[source]
Create a dataloader for Deep Clustering algorithms. First entry always contains the indices of the data samples. Second entry always contains the actual data samples. If for example labels are desired, they can be passed through the additional_inputs parameter (should be a list). Other customizations (e.g. augmentation) can be implemented using a custom dataset_class. This custom class should stick to the conventions, [index, data, …].
- Parameters:
X (np.ndarray / torch.Tensor) – the actual data set (can be np.ndarray or torch.Tensor)
batch_size (int) – the batch size
shuffle (bool) – boolean that defines if the data set should be shuffled (default: True)
drop_last (bool) – boolean that defines if the last batch should be ignored (default: False)
additional_inputs (list / np.ndarray / torch.Tensor) – additional inputs for the dataloader, e.g. labels. Can be None, np.ndarray, torch.Tensor or a list containing np.ndarrays/torch.Tensors (default: None)
dataset_class (torch.utils.data.Dataset) – defines the class of the tensor dataset that is contained in the dataloader (default: _ClustpyDataset)
dl_kwargs (any) – other arguments for torch.utils.data.DataLoader
- Returns:
dataloader – The final dataloader
- Return type:
torch.utils.data.DataLoader
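The batch convention [index, data, …] can be pictured with a small pure-Python generator. This is only a sketch of the layout described above, using lists instead of tensors; the name iterate_batches and the seed argument are hypothetical and the code does not reproduce the actual torch DataLoader behaviour in detail.

```python
import random


def iterate_batches(X, batch_size, shuffle=True, drop_last=False,
                    additional_inputs=None, seed=None):
    """Yield batches laid out as [indices, samples, *additional_inputs]."""
    extras = additional_inputs or []
    order = list(range(len(X)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        ids = order[start:start + batch_size]
        if drop_last and len(ids) < batch_size:
            break  # ignore the incomplete last batch
        samples = [X[i] for i in ids]
        yield [ids, samples] + [[extra[i] for i in ids] for extra in extras]


X = [[0.0], [1.0], [2.0]]
labels = [7, 8, 9]
batches = list(iterate_batches(X, 2, shuffle=False, additional_inputs=[labels]))
# batches[0] is [[0, 1], [[0.0], [1.0]], [7, 8]]
```

Because the sample indices always travel with the data, results computed on shuffled batches can be mapped back to the original order of X.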
- clustpy.deep.get_trained_autoencoder(trainloader: ~torch.utils.data.dataloader.DataLoader, learning_rate: float, n_epochs: int, device, optimizer_class: ~torch.optim.optimizer.Optimizer, loss_fn: ~torch.nn.modules.loss._Loss, input_dim: int, embedding_size: int, autoencoder: ~torch.nn.modules.module.Module | None = None, autoencoder_class: ~torch.nn.modules.module.Module = <class 'clustpy.deep.flexible_autoencoder.FlexibleAutoencoder'>) Module[source]
This function returns a trained autoencoder. The following cases are considered:
If the autoencoder is initialized and trained (autoencoder.fitted==True), then return input autoencoder without training it again.
If the autoencoder is initialized and not trained (autoencoder.fitted==False), it will be fitted (autoencoder.fitted will be set to True) using default parameters.
If the autoencoder is None, a new autoencoder is created using autoencoder_class, and it will be fitted as described above.
Beware: the input autoencoder_class or autoencoder object needs both a fit() function and the fitted attribute. See clustpy.deep.flexible_autoencoder.FlexibleAutoencoder for an example.
- Parameters:
trainloader (torch.utils.data.DataLoader) – dataloader used to train autoencoder
learning_rate (float) – learning rate for the autoencoder training
n_epochs (int) – number of training epochs
device (torch.device) – device to be trained on
optimizer_class (torch.optim.Optimizer) – optimizer for training.
loss_fn (torch.nn.modules.loss._Loss) – loss function for the reconstruction.
input_dim (int) – input dimension of the first layer of the autoencoder
embedding_size (int) – dimension of the innermost layer of the autoencoder
autoencoder (torch.nn.Module) – autoencoder object to be trained (default: None)
autoencoder_class (torch.nn.Module) – autoencoder class used if autoencoder=None (default: FlexibleAutoencoder)
- Returns:
autoencoder – The fitted autoencoder
- Return type:
torch.nn.Module
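The three cases described above can be sketched with a stand-in autoencoder class. Everything in this example is hypothetical: DummyAutoencoder only provides the fit() function and the fitted attribute that the description requires, and the layer sizes used when a new autoencoder is created are a placeholder, not the architecture clustpy actually builds.

```python
class DummyAutoencoder:
    """Stand-in exposing only fit() and the fitted attribute."""

    def __init__(self, layers):
        self.layers = layers
        self.fitted = False
        self.fit_calls = 0

    def fit(self, n_epochs, lr, dataloader=None, **kwargs):
        self.fit_calls += 1  # a real autoencoder would train here
        self.fitted = True
        return self


def get_trained_autoencoder_sketch(trainloader, learning_rate, n_epochs,
                                   input_dim, embedding_size,
                                   autoencoder=None,
                                   autoencoder_class=DummyAutoencoder):
    if autoencoder is None:
        # Case 3: create a new autoencoder (placeholder architecture)
        autoencoder = autoencoder_class(layers=[input_dim, embedding_size])
    if not autoencoder.fitted:
        # Case 2 (and 3): fit the autoencoder that is not trained yet
        autoencoder.fit(n_epochs=n_epochs, lr=learning_rate,
                        dataloader=trainloader)
    # Case 1: an already fitted autoencoder is returned unchanged
    return autoencoder


pretrained = DummyAutoencoder([4, 2])
pretrained.fitted = True
same = get_trained_autoencoder_sketch([], 1e-3, 10, 4, 2, autoencoder=pretrained)
fresh = get_trained_autoencoder_sketch([], 1e-3, 10, 4, 2)
```

This is why a custom autoencoder passed to the clustering algorithms should set fitted to True after its own training; otherwise it will be trained again.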
- clustpy.deep.predict_batchwise(dataloader: DataLoader, module: Module, cluster_module: Module, device: device) ndarray[source]
Utility function for predicting the cluster labels over the whole data set in a mini-batch fashion. Method calls the predict_hard method of the cluster_module for each batch of data.
- Parameters:
dataloader (torch.utils.data.DataLoader) – dataloader to be used
module (torch.nn.Module) – the module that is used for the encoding (e.g. an autoencoder)
cluster_module (torch.nn.Module) – the cluster module that is used for the prediction (e.g. DEC). Must provide the predict_hard method.
device (torch.device) – device to be trained on
- Returns:
predictions_numpy – The predictions of the cluster_module for the data set
- Return type:
np.ndarray
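The mini-batch prediction loop can be sketched with two small stand-in modules. The classes IdentityEncoder and NearestCenter are hypothetical and only provide the encode and predict_hard methods that the description relies on; lists replace tensors and the device handling is omitted.

```python
class IdentityEncoder:
    """Stand-in for an autoencoder: the embedding is the input itself."""

    def encode(self, samples):
        return samples


class NearestCenter:
    """Stand-in cluster module: hard assignment to the closest center."""

    def __init__(self, centers):
        self.centers = centers

    def predict_hard(self, embedded):
        labels = []
        for z in embedded:
            distances = [sum((a - b) ** 2 for a, b in zip(z, center))
                         for center in self.centers]
            labels.append(distances.index(min(distances)))
        return labels


def predict_batchwise_sketch(batches, module, cluster_module):
    """Encode each batch, predict its labels and concatenate the results."""
    predictions = []
    for batch in batches:
        samples = batch[1]  # batch[0] holds the sample indices
        embedded = module.encode(samples)
        predictions.extend(cluster_module.predict_hard(embedded))
    return predictions


batches = [[[0, 1], [[0.0], [5.0]]], [[2], [[4.0]]]]
predicted = predict_batchwise_sketch(
    batches, IdentityEncoder(), NearestCenter([[0.0], [5.0]]))
```

In this sketch the predictions are returned in iteration order, so the batches must not be shuffled if the result is to line up with the original data set.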