clustpy.alternative package

Submodules

clustpy.alternative.autonr module

@authors: Collin Leiber

class clustpy.alternative.autonr.AutoNR(nrkmeans_repetitions: int = 15, outliers: bool = True, max_subspaces: int | None = None, max_n_clusters: int | None = None, mdl_for_noisespace: bool = True, max_distance: float | None = None, precision: float | None = None, similarity_threshold: float = 1e-05, random_state: RandomState | None = None, debug: bool = False)[source]

Bases: BaseEstimator, ClusterMixin

The AutoNR algorithm. The algorithm will search for the best number of subspaces and clusters per subspace in a non-redundant clustering setting. Therefore, it repeatedly performs noise space splits, cluster space splits and cluster space merges and evaluates the results by their MDL costs. In case of AutoNR the described framework is combined with the non-redundant clustering algorithm NrKmeans. In the end the NrKmeans result with the lowest identified MDL costs will be returned and stored in the nrkmeans_ parameter.

Parameters:

nrkmeans_repetitions (int) – number of NrKmeans repetitions for each execution step to find the best local minimum (default: 15)
outliers (bool) – defines if outliers should be identified through MDL (default: True)
max_subspaces (int) – maximum number of subspaces. If None max_subspace will be equal to the total number of dimensions (default: None)
max_n_clusters (int) – maximum number of clusters for each subspace. If None this will be equal to the total number of samples (default: None)
mdl_for_noisespace (bool) – defines if MDL should be used to identify noise space dimensions instead of only considering negative eigenvalues when running NrKmeans (default: True)
max_distance (float) – distance used to encode cluster centers and outliers (default: None)
precision (float) – precision used to convert probability densities to actual probabilities (default: None)
similarity_threshold (float) – threshold that defines if the noise space has not changed for two subsequent iterations by checking the subspace costs (default: 1e-5)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – If true, additional information will be printed to the console (default: False)

n_clusters_

The final number of clusters in each subspace

Type:: list

labels_

The final labels. Shape equals (n_samples x n_subspaces)

Type:: np.ndarray

nrkmeans_

The final NrKmeans result

Type:: NrKmeans

mdl_costs_

The final (lowest) MDL costs found

Type:: float

all_mdl_costs_

A list containing objects of type type _Nrkmeans_Mdl_Costs representing all intermediate results of AutoNR

Type:: list

References

Leiber, Collin, et al. “Automatic Parameter Selection for Non-Redundant Clustering.” Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2022.

dissolve_noise_space(X: ndarray | None = None, random_feature_assignment: bool = True) → NrKmeans[source]

Using this method an optional noise space (n_clusters=1) can be removed from the resulting NrKmeans result which showed the lowest MDL costs. For more information see ‘NrKmeans.dissolve_noise_space()’

Parameters:

X (np.ndarray) – the given data set. Only used to calculate MDL costs. Therefore, can be None if random_feature_assignment is True (default: None)
random_feature_assignment (bool) – If true, the random strategy to distribute the noise space features is used (default: True)

Returns:

self.nrkmeans_ – The final updated NrKmeans object

Return type:

NrKmeans

fit(X: ndarray, y: ndarray | None = None) → AutoNR[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:

X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the AutoNR algorithm

Return type:

AutoNR

plot_mdl_progress() → None[source]: Plot the progress of the MDL costs during AutoNR. Dots represent full-space NrKmeans executions. Best found result is displayed as a green dot. BEWARE: If the green dot is not the lowest point of the curve, the estimated cost of a split/merge was lower than the final costs of a full space NrKmeans execution!

clustpy.alternative.nrkmeans module

@authors: Collin Leiber

class clustpy.alternative.nrkmeans.NrKmeans(n_clusters: list, V: ndarray | None = None, m: list | None = None, P: list | None = None, cluster_centers: list | None = None, mdl_for_noisespace: bool = False, outliers: bool = False, max_iter: int = 300, n_init: int = 1, cost_type: str = 'default', threshold_negative_eigenvalue: float = -1e-07, max_distance: float | None = None, precision: float | None = None, random_state: RandomState | None = None, debug: bool = False)[source]

Bases: BaseEstimator, ClusterMixin

The Non-Redundant Kmeans (NrKmeans) algorithm. The algorithm will search for the optimal cluster subspaces and assignments depending on the input number of clusters and subspaces. The number of subspaces will automatically be traced by the length of the input n_clusters array.

This implementation includes some extensions from ‘Automatic Parameter Selection for Non-Redundant Clustering’.

Parameters:

n_clusters (list) – list containing number of clusters for each subspace
V (np.ndarray) – the initial orthonormal rotation matrix (default: None)
m (list) – list containing the initial dimensionalities for each subspace (default: None)
P (list) – list containing the initial projections (ids of corresponding dimensions) for each subspace (default: None)
cluster_centers (list) – list containing the initial cluster centers for each subspace (default: None)
mdl_for_noisespace (bool) – defines if MDL should be used to identify noise space dimensions instead of only considering negative eigenvalues (default: False)
outliers (bool) – defines if outliers should be identified through MDL (default: False)
max_iter (int) – maximum number of iterations for the algorithm (default: 300)
n_init (int) – number of times NrKmeans is executed using different seeds. The final result will be the one with lowest costs. Costs can be the standard NrKmeans costs or MDL costs (defined by the cost_type parameter) (default: 1)
cost_type (str) – Can be “default” or “mdl” and defines whether the standard NrKmeans cost function or MDL costs should be considered to identify the best result. Only relevant if n_init is larger than 1 (default: “default”)
threshold_negative_eigenvalue (float) – threshold to consider an eigenvalue as negative. Used for the update of the subspace dimensions (default: -1e-7)
max_distance (float) – distance used to encode cluster centers and outliers. Only relevant if a MDL strategy is used (default: None)
precision (float) – precision used to convert probability densities to actual probabilities. Only relevant if a MDL strategy is used (default: None)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – If true, additional information will be printed to the console (default: False)

labels_

The final labels. Shape equals (n_samples x n_subspaces)

Type:: np.ndarray

scatter_matrices_

The final scatter matrix of each subspace

Type:: list

References

Mautz, Dominik, et al. “Discovering non-redundant k-means clusterings in optimal subspaces.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

and

Leiber, Collin, et al. “Automatic Parameter Selection for Non-Redundant Clustering.” Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2022.

calculate_cost_function() → float[source]

Calculate the result of the NrKmeans cost function. Depends on the rotation and the scatter matrices. Calculates for each subspace j: P_j^T*V^T*S_j*V*P_j

Returns:: costs – The total loss of this NrKmeans object
Return type:: float

calculate_mdl_costs(X: ~numpy.ndarray) -> (<class 'float'>, <class 'float'>, <class 'list'>)[source]

Calculate the Mdl Costs of this NrKmeans result.

Parameters:: X (np.ndarray) – the given data set
Returns:: tuple – The total costs (global costs + sum of subspace costs), The global costs, The subspace specific costs (one entry for each subspace)
Return type:: (float, float, list)

dissolve_noise_space(X: ndarray | None = None, random_feature_assignment: bool = True) → NrKmeans[source]

Using this method an optional noise space (n_clusters=1) can be removed. This is useful if the noise space is no longer relevant for subsequent processing steps. Only the parameters ‘m’ and ‘P’ will be changed. All other parameters remain the same. There are two strategies to resolve the noise space. In the first case, the features are randomly distributed to the different cluster spaces. In the second, the features of the noise space are successively checked as to which cluster space they are added in order to increase the MDL costs as little as possible. Accordingly, the runtime is significantly longer using the second method.

Parameters:

X (np.ndarray) – the given data set. Only used to calculate MDL costs. Therefore, can be None if random_feature_assignment is True (default: None)
random_feature_assignment (bool) – If true, the random strategy to distribute the noise space features is used (default: True)

Returns:

self – The final updated NrKmeans object

Return type:

NrKmeans

fit(X: ndarray, y: ndarray | None = None) → NrKmeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:

X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the NrKmeans algorithm

Return type:

NrKmeans

have_clusters_been_lost() → bool[source]

Check whether clusters within any subspace have been lost during Nr-Kmeans execution. Will also return true if whole subspaces have been lost (check have_subspaces_been_lost())

Returns:: lost – True if at least one cluster has been lost
Return type:: bool

have_subspaces_been_lost() → bool[source]

Check whether subspaces have been lost during Nr-Kmeans execution.

Returns:: lost – True if at least one subspace has been lost
Return type:: bool

plot_subspace(X: ndarray, subspace_index: int, labels: ndarray | None = None, plot_centers: bool = False, gt: ndarray | None = None, equal_axis=False) → None[source]

Plot the specified subspace identified by NrKmeans as scatter matrix plot.

Parameters:

X (np.ndarray) – the given data set
subspace_index (int) – the index of the specific subspace
labels (np.ndarray) – the cluster labels used for coloring the plot. If none, the labels identified by the fit() function will be used (default: None)
plot_centers (bool) – defines whether the cluster centers should be plotted (default: False)
gt (np.ndarray) – the ground truth labels. In contrast to the labels parameter this will be displayed using different markers instead of colors (default: None)
equal_axis (bool) – defines whether the axes should be scaled equally

predict(X: ndarray) → ndarray[source]

Predict the labels of an input dataset. For this method the results from the NrKmeans fit() method will be used. If NrKmeans was fitted using a outlier strategy this will also be applied here.

Parameters:: X (np.ndarray) – the given data set
Returns:: predicted_labels – the predicted labels of the input data set for each subspace. Shape equals (n_samples x n_subspaces)
Return type:: np.ndarray

transform_full_space(X: ndarray) → ndarray[source]

Transform the input dataset with the orthonormal rotation matrix identified by the fit function.

Parameters:: X (np.ndarray) – the given data set
Returns:: rotated_data – The rotated dataset
Return type:: np.ndarray

transform_subspace(X: ndarray, subspace_index: int) → ndarray[source]

Transform the input dataset with the orthonormal rotation matrix identified by the fit function and project it into the specified subspace.

Parameters:

X (np.ndarray) – the given data set
subspace_index (int) – the index of the specific subspace

Returns:

rotated_data – The rotated and projected dataset

Return type:

np.ndarray

Module contents

class clustpy.alternative.AutoNR(nrkmeans_repetitions: int = 15, outliers: bool = True, max_subspaces: int | None = None, max_n_clusters: int | None = None, mdl_for_noisespace: bool = True, max_distance: float | None = None, precision: float | None = None, similarity_threshold: float = 1e-05, random_state: RandomState | None = None, debug: bool = False)[source]

Bases: BaseEstimator, ClusterMixin

The AutoNR algorithm. The algorithm will search for the best number of subspaces and clusters per subspace in a non-redundant clustering setting. Therefore, it repeatedly performs noise space splits, cluster space splits and cluster space merges and evaluates the results by their MDL costs. In case of AutoNR the described framework is combined with the non-redundant clustering algorithm NrKmeans. In the end the NrKmeans result with the lowest identified MDL costs will be returned and stored in the nrkmeans_ parameter.

Parameters:

nrkmeans_repetitions (int) – number of NrKmeans repetitions for each execution step to find the best local minimum (default: 15)
outliers (bool) – defines if outliers should be identified through MDL (default: True)
max_subspaces (int) – maximum number of subspaces. If None max_subspace will be equal to the total number of dimensions (default: None)
max_n_clusters (int) – maximum number of clusters for each subspace. If None this will be equal to the total number of samples (default: None)
mdl_for_noisespace (bool) – defines if MDL should be used to identify noise space dimensions instead of only considering negative eigenvalues when running NrKmeans (default: True)
max_distance (float) – distance used to encode cluster centers and outliers (default: None)
precision (float) – precision used to convert probability densities to actual probabilities (default: None)
similarity_threshold (float) – threshold that defines if the noise space has not changed for two subsequent iterations by checking the subspace costs (default: 1e-5)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – If true, additional information will be printed to the console (default: False)

n_clusters_

The final number of clusters in each subspace

Type:: list

labels_

The final labels. Shape equals (n_samples x n_subspaces)

Type:: np.ndarray

nrkmeans_

The final NrKmeans result

Type:: NrKmeans

mdl_costs_

The final (lowest) MDL costs found

Type:: float

all_mdl_costs_

A list containing objects of type type _Nrkmeans_Mdl_Costs representing all intermediate results of AutoNR

Type:: list

References

Leiber, Collin, et al. “Automatic Parameter Selection for Non-Redundant Clustering.” Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2022.

dissolve_noise_space(X: ndarray | None = None, random_feature_assignment: bool = True) → NrKmeans[source]

Using this method an optional noise space (n_clusters=1) can be removed from the resulting NrKmeans result which showed the lowest MDL costs. For more information see ‘NrKmeans.dissolve_noise_space()’

Parameters:

X (np.ndarray) – the given data set. Only used to calculate MDL costs. Therefore, can be None if random_feature_assignment is True (default: None)
random_feature_assignment (bool) – If true, the random strategy to distribute the noise space features is used (default: True)

Returns:

self.nrkmeans_ – The final updated NrKmeans object

Return type:

NrKmeans

fit(X: ndarray, y: ndarray | None = None) → AutoNR[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:

X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the AutoNR algorithm

Return type:

AutoNR

plot_mdl_progress() → None[source]: Plot the progress of the MDL costs during AutoNR. Dots represent full-space NrKmeans executions. Best found result is displayed as a green dot. BEWARE: If the green dot is not the lowest point of the curve, the estimated cost of a split/merge was lower than the final costs of a full space NrKmeans execution!

class clustpy.alternative.NrKmeans(n_clusters: list, V: ndarray | None = None, m: list | None = None, P: list | None = None, cluster_centers: list | None = None, mdl_for_noisespace: bool = False, outliers: bool = False, max_iter: int = 300, n_init: int = 1, cost_type: str = 'default', threshold_negative_eigenvalue: float = -1e-07, max_distance: float | None = None, precision: float | None = None, random_state: RandomState | None = None, debug: bool = False)[source]

Bases: BaseEstimator, ClusterMixin

The Non-Redundant Kmeans (NrKmeans) algorithm. The algorithm will search for the optimal cluster subspaces and assignments depending on the input number of clusters and subspaces. The number of subspaces will automatically be traced by the length of the input n_clusters array.

This implementation includes some extensions from ‘Automatic Parameter Selection for Non-Redundant Clustering’.

Parameters:

n_clusters (list) – list containing number of clusters for each subspace
V (np.ndarray) – the initial orthonormal rotation matrix (default: None)
m (list) – list containing the initial dimensionalities for each subspace (default: None)
P (list) – list containing the initial projections (ids of corresponding dimensions) for each subspace (default: None)
cluster_centers (list) – list containing the initial cluster centers for each subspace (default: None)
mdl_for_noisespace (bool) – defines if MDL should be used to identify noise space dimensions instead of only considering negative eigenvalues (default: False)
outliers (bool) – defines if outliers should be identified through MDL (default: False)
max_iter (int) – maximum number of iterations for the algorithm (default: 300)
n_init (int) – number of times NrKmeans is executed using different seeds. The final result will be the one with lowest costs. Costs can be the standard NrKmeans costs or MDL costs (defined by the cost_type parameter) (default: 1)
cost_type (str) – Can be “default” or “mdl” and defines whether the standard NrKmeans cost function or MDL costs should be considered to identify the best result. Only relevant if n_init is larger than 1 (default: “default”)
threshold_negative_eigenvalue (float) – threshold to consider an eigenvalue as negative. Used for the update of the subspace dimensions (default: -1e-7)
max_distance (float) – distance used to encode cluster centers and outliers. Only relevant if a MDL strategy is used (default: None)
precision (float) – precision used to convert probability densities to actual probabilities. Only relevant if a MDL strategy is used (default: None)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)
debug (bool) – If true, additional information will be printed to the console (default: False)

labels_

The final labels. Shape equals (n_samples x n_subspaces)

Type:: np.ndarray

scatter_matrices_

The final scatter matrix of each subspace

Type:: list

References

Mautz, Dominik, et al. “Discovering non-redundant k-means clusterings in optimal subspaces.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

and

Leiber, Collin, et al. “Automatic Parameter Selection for Non-Redundant Clustering.” Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2022.

calculate_cost_function() → float[source]

Calculate the result of the NrKmeans cost function. Depends on the rotation and the scatter matrices. Calculates for each subspace j: P_j^T*V^T*S_j*V*P_j

Returns:: costs – The total loss of this NrKmeans object
Return type:: float

calculate_mdl_costs(X: ~numpy.ndarray) -> (<class 'float'>, <class 'float'>, <class 'list'>)[source]

Calculate the Mdl Costs of this NrKmeans result.

Parameters:: X (np.ndarray) – the given data set
Returns:: tuple – The total costs (global costs + sum of subspace costs), The global costs, The subspace specific costs (one entry for each subspace)
Return type:: (float, float, list)

dissolve_noise_space(X: ndarray | None = None, random_feature_assignment: bool = True) → NrKmeans[source]

Using this method an optional noise space (n_clusters=1) can be removed. This is useful if the noise space is no longer relevant for subsequent processing steps. Only the parameters ‘m’ and ‘P’ will be changed. All other parameters remain the same. There are two strategies to resolve the noise space. In the first case, the features are randomly distributed to the different cluster spaces. In the second, the features of the noise space are successively checked as to which cluster space they are added in order to increase the MDL costs as little as possible. Accordingly, the runtime is significantly longer using the second method.

Parameters:

X (np.ndarray) – the given data set. Only used to calculate MDL costs. Therefore, can be None if random_feature_assignment is True (default: None)
random_feature_assignment (bool) – If true, the random strategy to distribute the noise space features is used (default: True)

Returns:

self – The final updated NrKmeans object

Return type:

NrKmeans

fit(X: ndarray, y: ndarray | None = None) → NrKmeans[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:

X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the NrKmeans algorithm

Return type:

NrKmeans

have_clusters_been_lost() → bool[source]

Check whether clusters within any subspace have been lost during Nr-Kmeans execution. Will also return true if whole subspaces have been lost (check have_subspaces_been_lost())

Returns:: lost – True if at least one cluster has been lost
Return type:: bool

have_subspaces_been_lost() → bool[source]

Check whether subspaces have been lost during Nr-Kmeans execution.

Returns:: lost – True if at least one subspace has been lost
Return type:: bool

plot_subspace(X: ndarray, subspace_index: int, labels: ndarray | None = None, plot_centers: bool = False, gt: ndarray | None = None, equal_axis=False) → None[source]

Plot the specified subspace identified by NrKmeans as scatter matrix plot.

Parameters:

X (np.ndarray) – the given data set
subspace_index (int) – the index of the specific subspace
labels (np.ndarray) – the cluster labels used for coloring the plot. If none, the labels identified by the fit() function will be used (default: None)
plot_centers (bool) – defines whether the cluster centers should be plotted (default: False)
gt (np.ndarray) – the ground truth labels. In contrast to the labels parameter this will be displayed using different markers instead of colors (default: None)
equal_axis (bool) – defines whether the axes should be scaled equally

predict(X: ndarray) → ndarray[source]

Predict the labels of an input dataset. For this method the results from the NrKmeans fit() method will be used. If NrKmeans was fitted using a outlier strategy this will also be applied here.

Parameters:: X (np.ndarray) – the given data set
Returns:: predicted_labels – the predicted labels of the input data set for each subspace. Shape equals (n_samples x n_subspaces)
Return type:: np.ndarray

transform_full_space(X: ndarray) → ndarray[source]

Transform the input dataset with the orthonormal rotation matrix identified by the fit function.

Parameters:: X (np.ndarray) – the given data set
Returns:: rotated_data – The rotated dataset
Return type:: np.ndarray

transform_subspace(X: ndarray, subspace_index: int) → ndarray[source]

Transform the input dataset with the orthonormal rotation matrix identified by the fit function and project it into the specified subspace.

Parameters:

X (np.ndarray) – the given data set
subspace_index (int) – the index of the specific subspace

Returns:

rotated_data – The rotated and projected dataset

Return type:

np.ndarray