clustpy.utils package

Submodules

clustpy.utils.diptest module

clustpy.utils.diptest.dip_boot_samples(n_points: int, n_boots: int = 1000, random_state: RandomState | None = None) ndarray[source]

Sample random data sets and calculate the corresponding Dip-values. Used, e.g., to determine p-values.

Parameters:
  • n_points (int) – The number of samples

  • n_boots (int) – Number of random data sets that should be created to calculate Dip-values (default: 1000)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

boot_dips – Array of Dip-values

Return type:

np.ndarray
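
Examples

A minimal sketch of a bootstrap p-value estimate; the observed Dip-value of 0.05 is an arbitrary placeholder:

>>> from clustpy.utils.diptest import dip_boot_samples
>>> boot_dips = dip_boot_samples(n_points=100, n_boots=500, random_state=1)
>>> pval = (boot_dips >= 0.05).mean()  # share of bootstrap Dip-values at least as large as the observed one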

clustpy.utils.diptest.dip_gradient(X: ndarray, X_proj: ndarray, sorted_indices: ndarray, modal_triangle: tuple) ndarray[source]

Calculate the gradient of the Dip-value regarding the projection axis.

Parameters:
  • X (np.ndarray) – the given data set

  • X_proj (np.ndarray) – The univariate projected data set

  • sorted_indices (np.ndarray) – The indices of the sorted univariate data set

  • modal_triangle (tuple) – Indices of the modal triangle

Returns:

gradient – The gradient of the Dip-value regarding the projection axis

Return type:

np.ndarray

References

Krause, Andreas, and Volkmar Liebscher. “Multimodal projection pursuit using the dip statistic.” (2005).
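
Examples

A sketch of one projection-pursuit gradient step, assuming dip_test returns (dip, modal_interval, modal_triangle) when just_dip is False:

>>> import numpy as np
>>> from clustpy.utils.diptest import dip_test, dip_gradient
>>> X = np.random.randn(100, 3)
>>> proj_axis = np.array([1., 0., 0.])
>>> X_proj = X @ proj_axis
>>> sorted_indices = np.argsort(X_proj)
>>> dip, modal_interval, modal_triangle = dip_test(X_proj[sorted_indices], just_dip=False, is_data_sorted=True)
>>> gradient = dip_gradient(X, X_proj, sorted_indices, modal_triangle)
>>> proj_axis = proj_axis + 0.1 * gradient  # ascent step to make the projection more multimodal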

clustpy.utils.diptest.dip_pval(dip_value: float, n_points: int, pval_strategy: str = 'table', n_boots: int = 1000, random_state: RandomState | None = None) float[source]

Get the p-value for a given Dip-value. P-values depend on the input Dip-value and the sample size. There are several strategies to calculate the p-value: ‘table’ (most common), ‘function’ (available for all sample sizes) and ‘bootstrap’ (slow for large sample sizes).

Parameters:
  • dip_value (float) – The Dip-value

  • n_points (int) – The number of samples

  • pval_strategy (str) – Specifies the strategy that should be used to calculate the p-value (default: ‘table’)

  • n_boots (int) – Number of random data sets that should be created to calculate Dip-values. Only relevant if pval_strategy is ‘bootstrap’ (default: 1000)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int. Only relevant if pval_strategy is ‘bootstrap’ (default: None)

Returns:

pval – The resulting p-value

Return type:

float

References

Hartigan, John A., and Pamela M. Hartigan. “The dip test of unimodality.” The Annals of Statistics (1985): 70-84.

and

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.
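
Examples

A minimal sketch on bimodal toy data, assuming dip_test returns only the Dip-value with its default just_dip=True:

>>> import numpy as np
>>> from clustpy.utils.diptest import dip_test, dip_pval
>>> X = np.hstack([np.random.randn(200), np.random.randn(200) + 5])
>>> dip = dip_test(X)
>>> pval = dip_pval(dip, n_points=X.shape[0], pval_strategy='table')  # small p-value indicates multimodality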

clustpy.utils.diptest.dip_pval_gradient(X: ndarray, X_proj: ndarray, sorted_indices: ndarray, modal_triangle: tuple, dip_value: float) ndarray[source]

Calculate the gradient of the Dip p-value function regarding the projection axis.

Parameters:
  • X (np.ndarray) – the given data set

  • X_proj (np.ndarray) – The univariate projected data set

  • sorted_indices (np.ndarray) – The indices of the sorted univariate data set

  • modal_triangle (tuple) – Indices of the modal triangle

  • dip_value (float) – The Dip-value

Returns:

pval_grad – The gradient of the Dip p-value function regarding the projection axis

Return type:

np.ndarray

References

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.

clustpy.utils.diptest.dip_test(X: np.ndarray, just_dip: bool = True, is_data_sorted: bool = False, return_gcm_lcm_mn_mj: bool = False, use_c: bool = True, debug: bool = False) -> (float, tuple, tuple, np.ndarray, np.ndarray, np.ndarray, np.ndarray)[source]

Calculate the Dip-value. This can either be done using the C implementation or the Python version. If just_dip is False, additional values are returned: the modal interval (indices of the beginning and end of the steepest slope of the ECDF) and the modal triangle (used to calculate the gradient of the Dip-value). Further, the indices of the Greatest Convex Minorant (gcm), Least Concave Majorant (lcm), minorant and majorant values can be returned by setting return_gcm_lcm_mn_mj to True. Note that modal_triangle can be (-1,-1,-1) if the triangle could not be determined correctly.

Parameters:
  • X (np.ndarray) – the given univariate data set

  • just_dip (bool) – Defines whether only the Dip-value should be returned or also the modal interval and modal triangle (default: True)

  • is_data_sorted (bool) – Should be True if the data set is already sorted (default: False)

  • return_gcm_lcm_mn_mj (bool) – Defines whether the gcm, lcm, mn and mj arrays should be returned. In this case just_dip must be False (default: False)

  • use_c (bool) – Defines whether the C implementation should be used (default: True)

  • debug (bool) – If true, additional information will be printed to the console (default: False)

Returns:

tuple –

  • The resulting Dip-value

  • The indices of the modal interval - corresponds to the steepest slope in the ECDF (if just_dip is False)

  • The indices of the modal triangle (if just_dip is False)

  • The indices of points that are part of the Greatest Convex Minorant (gcm) (if just_dip is False and return_gcm_lcm_mn_mj is True)

  • The indices of points that are part of the Least Concave Majorant (lcm) (if just_dip is False and return_gcm_lcm_mn_mj is True)

  • The minorant values (if just_dip is False and return_gcm_lcm_mn_mj is True)

  • The majorant values (if just_dip is False and return_gcm_lcm_mn_mj is True)

Return type:

(float, tuple, tuple, np.ndarray, np.ndarray, np.ndarray, np.ndarray)

References

Hartigan, John A., and Pamela M. Hartigan. “The dip test of unimodality.” The Annals of Statistics (1985): 70-84.

and

Hartigan, P. M. “Computation of the dip statistic to test for unimodality: Algorithm as 217.” Applied Statistics 34.3 (1985): 320-5.
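
Examples

A sketch of the different return configurations; the unpacked names follow the return description above:

>>> import numpy as np
>>> from clustpy.utils.diptest import dip_test
>>> X = np.random.randn(500)
>>> dip = dip_test(X)  # just the Dip-value
>>> dip, modal_interval, modal_triangle = dip_test(X, just_dip=False)
>>> dip, modal_interval, modal_triangle, gcm, lcm, mn, mj = dip_test(X, just_dip=False, return_gcm_lcm_mn_mj=True)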

clustpy.utils.diptest.plot_dip(X: ndarray, is_data_sorted: bool = False, dip_value: float | None = None, modal_interval: tuple | None = None, modal_triangle: tuple | None = None, gcm: ndarray | None = None, lcm: ndarray | None = None, linewidth_ecdf: float = 1, linewidth_extra: float = 2, show_legend: bool = True, add_histogram: bool = True, histogram_labels: ndarray | None = None, histogram_show_legend: bool = True, histogram_density: bool = True, histogram_n_bins: int = 100, height_ratio: tuple = (1, 2), show_plot: bool = True) None[source]

Plot a visual representation of the computational process of the Dip. The upper part shows an optional histogram of the data and the lower part shows the corresponding ECDF.

Parameters:
  • X (np.ndarray) – the given data set

  • is_data_sorted (bool) – Should be True if the data set is already sorted (default: False)

  • dip_value (float) – The Dip-value (default: None)

  • modal_interval (tuple) – Indices of the modal interval - corresponds to the steepest slope in the ECDF (default: None)

  • modal_triangle (tuple) – Indices of the modal triangle (default: None)

  • gcm (np.ndarray) – The indices of points that are part of the Greatest Convex Minorant (gcm) (default: None)

  • lcm (np.ndarray) – The indices of points that are part of the Least Concave Majorant (lcm) (default None)

  • linewidth_ecdf (float) – The linewidth for the ECDF (default: 1)

  • linewidth_extra (float) – The linewidth for the visualization of the dip, modal interval, modal triangle, gcm and lcm (default: 2)

  • show_legend (bool) – Defines whether the legend of the ECDF plot should be added (default: True)

  • add_histogram (bool) – Defines whether the histogram should be shown above the ECDF plot (default: True)

  • histogram_labels (np.ndarray) – Labels used to color parts of the histogram (default: None)

  • histogram_show_legend (bool) – Defines whether the legend of the histogram should be added (default: True)

  • histogram_density (bool) – Defines whether a kernel density should be added to the histogram plot (default: True)

  • histogram_n_bins (int) – Number of bins used for the histogram (default: 100)

  • height_ratio (tuple) – Defines the height ratio between histogram and ECDF plot. Only relevant if add_histogram is True. First value in the tuple defines the height of the histogram and the second value the height of the ECDF plot (default: (1, 2))

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)
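
Examples

A sketch that computes the Dip on sorted bimodal data and visualizes it, assuming dip_test returns (dip, modal_interval, modal_triangle) when just_dip is False:

>>> import numpy as np
>>> from clustpy.utils.diptest import dip_test, plot_dip
>>> X = np.sort(np.hstack([np.random.randn(200), np.random.randn(200) + 5]))
>>> dip, modal_interval, modal_triangle = dip_test(X, just_dip=False, is_data_sorted=True)
>>> plot_dip(X, is_data_sorted=True, dip_value=dip, modal_interval=modal_interval, modal_triangle=modal_triangle)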

clustpy.utils.evaluation module

class clustpy.utils.evaluation.EvaluationAlgorithm(name: str, algorithm: ClusterMixin, params: dict | None = None, deterministic: bool = False, preprocess_methods: list | None = None, preprocess_params: dict | None = None)[source]

Bases: object

The EvaluationAlgorithm object is a wrapper for clustering algorithms. It contains all the information necessary to evaluate a data set using the evaluate_dataset or evaluate_multiple_datasets method. If the algorithm requires the number of clusters as an input parameter, params should contain {“n_clusters”: None}.

Parameters:
  • name (str) – Name of the algorithm. Can be chosen freely

  • algorithm (ClusterMixin) – The actual object of the clustering algorithm

  • params (dict) – Parameters given to the clustering algorithm. If the algorithm uses an n_clusters parameter, it can be set to None, e.g., params={“n_clusters”: None}. In this case the evaluation methods will automatically use the correct number of clusters for the specific data set (default: {})

  • deterministic (bool) – Defines if the algorithm produces a deterministic clustering result (e.g. like DBSCAN). In this case the algorithm will only be executed once even though a higher number of repetitions is specified when evaluating a data set (default: False)

  • preprocess_methods (list) – Specify preprocessing steps performed on each data set before executing the clustering algorithm. Can be either a list of callable functions or a single callable function. Will also be applied to an optional test data set (default: None)

  • preprocess_params (dict) – List of dictionaries containing the parameters for the preprocessing methods. Needs one entry for each method in preprocess_methods. If only a single preprocessing method is given (instead of a list) a single dictionary is expected (default: {})

Examples

See evaluate_multiple_datasets()

>>> from sklearn.cluster import DBSCAN
>>> from clustpy.partition import SubKmeans
>>> ea1 = EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)
>>> ea2 = EvaluationAlgorithm(name="SubKMeans", algorithm=SubKmeans, params={"n_clusters": None})
class clustpy.utils.evaluation.EvaluationAutoencoder(path: str, autoencoder_class: Module, params: dict | None = None, path_custom_dataloaders: tuple | None = None)[source]

Bases: object

The EvaluationAutoencoder object is a wrapper for autoencoders that can be used by deep clustering algorithms. It contains all the information necessary to load a pretrained autoencoder for the evaluate_dataset or evaluate_multiple_datasets method. Can also contain paths to saved dataloaders (e.g. when using augmentation).

Parameters:
  • path (str) – Path to the state dict that should be loaded

  • autoencoder_class (torch.nn.Module) – The actual autoencoder class

  • params (dict) – Parameters given to the autoencoder class (default: {})

  • path_custom_dataloaders (tuple) – Tuple containing the path of saved dataloaders. First entry is for the saved trainloader and second for the saved testloader (default: None)

Examples

>>> from clustpy.deep.autoencoders import FeedforwardAutoencoder
>>> ea = EvaluationAutoencoder(path="PATH", autoencoder_class=FeedforwardAutoencoder, params={"layers": [256, 128, 64, 10], "bias": False})
class clustpy.utils.evaluation.EvaluationDataset(name: str, data: ndarray, labels_true: ndarray | None = None, data_loader_params: dict | None = None, train_test_split: bool | None = None, preprocess_methods: list | None = None, preprocess_params: list | None = None, iteration_specific_autoencoders: list | None = None, ignore_algorithms: tuple = ())[source]

Bases: object

The EvaluationDataset object is a wrapper for actual data sets. It contains all the information necessary to evaluate a data set using the evaluate_multiple_datasets method.

Parameters:
  • name (str) – Name of the data set. Can be chosen freely

  • data (np.ndarray) – The actual data set. Can be a np.ndarray, a path to a data file (of type str) or a callable (e.g. a method from clustpy.data)

  • labels_true (np.ndarray) – The ground truth labels. Can be a np.ndarray, an int or list specifying which columns of the data contain the labels or None if no ground truth labels are present. If data is a callable, the ground truth labels can also be obtained by that function and labels_true can be None (default: None)

  • data_loader_params (dict) – Dictionary containing the information necessary to load data from a function or file. Only relevant if data is of type callable or str (default: {})

  • train_test_split (bool) – Specifies if the loaded dataset should be split into a train and test set. Can be of type bool, list or np.ndarray. If train_test_split is a boolean and true, the data loader will use the parameter “subset” to load a train and test set. In that case data must be a callable. If train_test_split is a list/np.ndarray, the entries specify the indices of the data array that should be used for the train set (default: None)

  • preprocess_methods (list) – Specify preprocessing steps before evaluating the data set. Can be either a list of callable functions or a single callable function. Will also be applied to an optional test data set (default: None)

  • preprocess_params (list) – List of dictionaries containing the parameters for the preprocessing methods. Needs one entry for each method in preprocess_methods. If only a single preprocessing method is given (instead of a list) a single dictionary is expected (default: {})

  • iteration_specific_autoencoders (list) – List containing EvaluationAutoencoder objects for each iteration of a deep clustering algorithm. Length of the list must be equal to ‘n_repetitions’ in ‘evaluate_multiple_datasets()’ and ‘evaluate_dataset()’. Each entry in the list must be of type EvaluationAutoencoder. If a clustering algorithm does not have an ‘autoencoder’ parameter, this parameter will be ignored. Can be None if no iteration-specific autoencoders are used (default: None)

  • ignore_algorithms (tuple) – Tuple of algorithm names (as specified in the EvaluationAlgorithm object) that should be ignored for this specific data set (default: ())

Examples

See evaluate_multiple_datasets()

>>> from clustpy.data import load_iris, load_wine
>>> ed1 = EvaluationDataset(name="iris", data=load_iris)
>>> X, L = load_wine()
>>> ed2 = EvaluationDataset(name="wine", data=X, labels_true=L)
class clustpy.utils.evaluation.EvaluationMetric(name: str, metric: Callable, params: dict | None = None, use_gt: bool = True)[source]

Bases: object

The EvaluationMetric object is a wrapper for evaluation metrics. It contains all the information necessary to evaluate a data set using the evaluate_dataset or evaluate_multiple_datasets method.

Parameters:
  • name (str) – Name of the metric. Can be chosen freely

  • metric (Callable) – The actual metric function

  • params (dict) – Parameters given to the metric function (default: {})

  • use_gt (bool) – If true, the input to the metric will be the ground truth labels and the predicted labels (e.g. normalized mutual information). If false, the input will be the data and the predicted labels (e.g. silhouette score) (default: True)

Examples

See evaluate_multiple_datasets()

>>> from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
>>> em1 = EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True)
>>> em2 = EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)
clustpy.utils.evaluation.evaluate_dataset(X: np.ndarray, evaluation_algorithms: list, evaluation_metrics: list | None = None, labels_true: np.ndarray | None = None, n_repetitions: int = 10, X_test: np.ndarray | None = None, labels_true_test: np.ndarray | None = None, iteration_specific_autoencoders: list | None = None, aggregation_functions: tuple = (np.mean, np.std), add_runtime: bool = True, add_n_clusters: bool = False, save_path: str | None = None, save_labels_path: str | None = None, ignore_algorithms: tuple = (), random_state: np.random.RandomState | None = None) DataFrame[source]

Evaluate the clustering result of different clustering algorithms (as specified by evaluation_algorithms) on a given data set using different metrics (as specified by evaluation_metrics). Each algorithm will be executed n_repetitions times and all specified metrics will be used to evaluate the clustering result. The final result is a pandas DataFrame containing all the information.

Parameters:
  • X (np.ndarray) – the given data set

  • evaluation_algorithms (list) – Contains objects of type EvaluationAlgorithm which are wrappers for the clustering algorithms

  • evaluation_metrics (list) – Contains objects of type EvaluationMetric which are wrappers for the metrics (default: None)

  • labels_true (np.ndarray) – The ground truth labels of the data set (default: None)

  • n_repetitions (int) – Number of times that the clustering procedure should be executed on the same data set (default: 10)

  • X_test (np.ndarray) – An optional test data set that will be evaluated using the predict method of the clustering algorithms (default: None)

  • labels_true_test (np.ndarray) – The ground truth labels of the test data set (default: None)

  • iteration_specific_autoencoders (list) – List containing EvaluationAutoencoder objects for each iteration of a deep clustering algorithm. Length of the list must be equal to ‘n_repetitions’. Each entry in the list must be of type EvaluationAutoencoder. If a clustering algorithm does not have an ‘autoencoder’ parameter, this parameter will be ignored. Can be None if no iteration-specific autoencoders are used (default: None)

  • aggregation_functions (tuple) – List of aggregation functions that should be applied to the n_repetitions different results of a single clustering algorithm (default: [np.mean, np.std])

  • add_runtime (bool) – Add runtime of each execution to the final table (default: True)

  • add_n_clusters (bool) – Add the resulting number of clusters to the final table (default: False)

  • save_path (str) – The path where the final DataFrame should be saved as csv. If None, the DataFrame will not be saved (default: None)

  • save_labels_path (str) – The path where the clustering labels should be saved as csv. If None, the labels will not be saved (default: None)

  • ignore_algorithms (tuple) – Tuple of algorithm names (as specified in the EvaluationAlgorithm object) that should be ignored for this specific data set (default: ())

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

df – The final DataFrame

Return type:

pd.DataFrame

Examples

>>> from sklearn.cluster import KMeans, DBSCAN
>>> from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
>>>
>>> def _add_value(x, value):
>>>     return x + value
>>>
>>> X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
>>> L = np.array([0] * 3 + [1] * 3)
>>> n_repetitions = 2
>>> aggregations = [np.mean, np.std, np.max]
>>> algorithms = [
>>>     EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
>>>     EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
>>>                         preprocess_methods=[_add_value],
>>>                         preprocess_params=[{"value": 1}]),
>>>     EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]
>>> metrics = [EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
>>>            EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]
>>> df = evaluate_dataset(X=X, evaluation_algorithms=algorithms, evaluation_metrics=metrics, labels_true=L,
>>>                       n_repetitions=n_repetitions, aggregation_functions=aggregations, add_runtime=True,
>>>                       add_n_clusters=True, save_path=None, ignore_algorithms=["KMeans_with_preprocess"],
>>>                       random_state=1)
clustpy.utils.evaluation.evaluate_multiple_datasets(evaluation_datasets: list, evaluation_algorithms: list, evaluation_metrics: list | None = None, n_repetitions: int = 10, aggregation_functions: tuple = (np.mean, np.std), add_runtime: bool = True, add_n_clusters: bool = False, save_path: str | None = None, save_intermediate_results: bool = False, save_labels_path: str | None = None, random_state: np.random.RandomState | None = None) DataFrame[source]

Evaluate the clustering result of different clustering algorithms (as specified by evaluation_algorithms) on a set of data sets (as specified by evaluation_datasets) using different metrics (as specified by evaluation_metrics). Each algorithm will be executed n_repetitions times and all specified metrics will be used to evaluate the clustering result. The final result is a pandas DataFrame containing all the information.

Parameters:
  • evaluation_datasets (list) – Contains objects of type EvaluationDataset which are wrappers for the data sets

  • evaluation_algorithms (list) – Contains objects of type EvaluationAlgorithm which are wrappers for the clustering algorithms

  • evaluation_metrics (list) – Contains objects of type EvaluationMetric which are wrappers for the metrics (default: None)

  • n_repetitions (int) – Number of times that the clustering procedure should be executed on the same data set (default: 10)

  • aggregation_functions (tuple) – List of aggregation functions that should be applied to the n_repetitions different results of a single clustering algorithm (default: [np.mean, np.std])

  • add_runtime (bool) – Add runtime of each execution to the final table (default: True)

  • add_n_clusters (bool) – Add the resulting number of clusters to the final table (default: False)

  • save_path (str) – The path where the final DataFrame should be saved as csv. If None, the DataFrame will not be saved (default: None)

  • save_intermediate_results (bool) – Defines whether the result of each data set should be separately saved. Useful if the evaluation takes a lot of time (default: False)

  • save_labels_path (str) – The path where the clustering labels should be saved as csv. If None, the labels will not be saved (default: None)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

df – The final DataFrame

Return type:

pd.DataFrame

Examples

See the readme.md

>>> from sklearn.cluster import KMeans, DBSCAN
>>> from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
>>> from clustpy.data import load_iris
>>>
>>> def _add_value(x, value):
>>>     return x + value
>>>
>>> X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
>>> L = np.array([0] * 3 + [1] * 3)
>>> X2 = np.c_[X, L]
>>> n_repetitions = 2
>>> aggregations = [np.mean, np.std, np.max]
>>> algorithms = [
>>>     EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
>>>     EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
>>>                         preprocess_methods=[_add_value],
>>>                         preprocess_params=[{"value": 1}]),
>>>     EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]
>>> metrics = [EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
>>>            EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]
>>> datasets = [EvaluationDataset(name="iris", data=load_iris, preprocess_methods=[_add_value],
>>>                               preprocess_params=[{"value": 2}]),
>>>             EvaluationDataset(name="X", data=X, labels_true=L),
>>>             EvaluationDataset(name="X2", data=X2, labels_true=-1, ignore_algorithms=["KMeans_with_preprocess"])
>>>             ]
>>> df = evaluate_multiple_datasets(evaluation_datasets=datasets, evaluation_algorithms=algorithms,
>>>                                 evaluation_metrics=metrics, n_repetitions=n_repetitions,
>>>                                 aggregation_functions=aggregations, add_runtime=True, add_n_clusters=True,
>>>                                 save_path=None, save_intermediate_results=False, random_state=1)
clustpy.utils.evaluation.evaluation_df_to_latex_table(df: DataFrame, output_path: str, use_std: bool = True, best_in_bold: bool = True, second_best_underlined: bool = True, color_by_value: str | None = None, higher_is_better: list | None = None, in_percent: int = True, decimal_places: int = 1) None[source]

Convert the resulting dataframe of an evaluation into a LaTeX table. Note that the LaTeX package booktabs is required, so usepackage{booktabs} must be included in the LaTeX file. This method will only consider the mean values. Therefore, note that “mean” must be included in the aggregations! If “std” is also contained in the dataframe (and use_std is True), this value will be added as well, separated by a plus-minus sign.

Parameters:
  • df (pd.DataFrame) – The pandas dataframe. Can also be a string that contains the path to the saved dataframe

  • output_path (str) – The path where the resulting LaTeX table text file will be stored

  • use_std (bool) – Defines if the standard deviation (std) should also be added to the latex table (default: True)

  • best_in_bold (bool) – Print the best value for each combination of dataset and metric in bold. Note that the LaTeX package bm is used, so usepackage{bm} must be included in the LaTeX file (default: True)

  • second_best_underlined (bool) – Print second-best value for each combination of dataset and metric underlined (default: True)

  • color_by_value (str) – Define the color that should be used to indicate the difference between the values of the metrics. Uses cellcolor, so usepackage{colortbl} or usepackage[table]{xcolor} must be included in the LaTeX file. Can be ‘blue’ for example (default: None)

  • higher_is_better (list) – List with booleans. Each value indicates if a high value for a certain metric is better than a low value. The length of the list must be equal to the number of different metrics. If None, it is always assumed that a higher value is better, except for the runtime (default: None)

  • in_percent (bool) – If true, all values, except n_clusters and runtime, will be converted to percentages, i.e., multiplied by 100 (default: True)

  • decimal_places (int) – Number of decimal places that should be used in the latex table (default: 1)
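
Examples

A sketch assuming a previously saved evaluation result; “result.csv” and “result_table.tex” are placeholder paths:

>>> from clustpy.utils.evaluation import evaluation_df_to_latex_table
>>> evaluation_df_to_latex_table(df="result.csv", output_path="result_table.tex", use_std=True, best_in_bold=True)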

clustpy.utils.evaluation.load_saved_autoencoder(path: str, autoencoder_class: Module, params: dict | None = None) Module[source]

Load the states of an already trained autoencoder. It will be assumed that the autoencoder was already fitted, so the ‘fitted’ parameter will be set to True.

Parameters:
  • path (str) – Path to the state dict that should be loaded

  • autoencoder_class (torch.nn.Module) – The actual autoencoder class

  • params (dict) – Parameters given to the autoencoder class (default: {})

Return type:

The autoencoder with the loaded states
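
Examples

A minimal sketch; “PATH” is a placeholder for an actual state-dict file and the layer sizes are arbitrary:

>>> from clustpy.deep.autoencoders import FeedforwardAutoencoder
>>> from clustpy.utils.evaluation import load_saved_autoencoder
>>> ae = load_saved_autoencoder(path="PATH", autoencoder_class=FeedforwardAutoencoder, params={"layers": [256, 128, 64, 10], "bias": False})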

clustpy.utils.plots module

clustpy.utils.plots.plot_1d_data(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, show_legend: bool = True, show_plot: bool = True) None[source]

Plot a one-dimensional data set.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)

  • centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)

  • true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)

  • show_legend (bool) – Defines whether a legend should be shown (default: True)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

clustpy.utils.plots.plot_2d_data(X: np.ndarray, labels: np.ndarray | None = None, centers: np.ndarray | None = None, true_labels: np.ndarray | None = None, cluster_ids_font_size: float | None = None, centers_ids_font_size: float = 10, show_legend: bool = True, title: str | None = None, scattersize: float = 10, centers_scattersize: float = 15, equal_axis: bool = False, container: plt.Axes = matplotlib.pyplot, show_plot: bool = True) None[source]

Plot a two-dimensional data set.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)

  • centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)

  • true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)

  • cluster_ids_font_size (float) – The font size of the id of a predicted cluster, which is shown as text in the center of that cluster. Can be None if no id should be shown (default: None)

  • centers_ids_font_size (float) – The font size of the id that is shown next to the red marker of a cluster center. Only relevant if centers is not None. Can be None if no id should be shown (default: 10)

  • show_legend (bool) – Defines whether a legend should be shown (default: True)

  • title (str) – Title of the plot (default: None)

  • scattersize (float) – The size of the scatters (default: 10)

  • centers_scattersize (float) – The size of the red scatters of the cluster centers (default: 15)

  • equal_axis (bool) – Defines whether the axes are to be scaled to the same value range (default: False)

  • container (plt.Axes) – The container to which the scatter plot is added. If another container is defined, show_plot should usually be False (default: matplotlib.pyplot)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)
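
Examples

A sketch of plotting into an existing matplotlib axis (show_plot=False, so the figure is shown manually):

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from clustpy.utils.plots import plot_2d_data
>>> X = np.r_[np.random.randn(50, 2), np.random.randn(50, 2) + 5]
>>> labels = np.array([0] * 50 + [1] * 50)
>>> centers = np.array([[0., 0.], [5., 5.]])
>>> fig, ax = plt.subplots()
>>> plot_2d_data(X, labels=labels, centers=centers, container=ax, show_plot=False)
>>> plt.show()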

clustpy.utils.plots.plot_3d_data(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, show_legend: bool = True, scattersize: float = 10, show_plot: bool = True) None[source]

Plot a three-dimensional data set.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)

  • centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)

  • true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)

  • show_legend (bool) – Defines whether a legend should be shown (default: True)

  • scattersize (float) – The size of the scatters (default: 10)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

clustpy.utils.plots.plot_histogram(X: np.ndarray, labels: np.ndarray | None = None, density: bool = True, n_bins: int = 100, show_legend: bool = True, container: plt.Axes = matplotlib.pyplot, show_plot: bool = True) None[source]

Plot a histogram.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)

  • density (bool) – Defines whether a kernel density should be added to the histogram (default: True)

  • n_bins (int) – Number of bins (default: 100)

  • show_legend (bool) – Defines whether the legend of the histogram should be shown (default: True)

  • container (plt.Axes) – The container to which the histogram is added. If another container is defined, show_plot should usually be False (default: matplotlib.pyplot)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)
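
Examples

A minimal sketch with two groups colored by label:

>>> import numpy as np
>>> from clustpy.utils.plots import plot_histogram
>>> X = np.hstack([np.random.randn(200), np.random.randn(200) + 5])
>>> labels = np.array([0] * 200 + [1] * 200)
>>> plot_histogram(X, labels=labels, density=True, n_bins=50)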

clustpy.utils.plots.plot_image(img_data: ndarray, black_and_white: bool = False, image_shape: tuple | None = None, is_color_channel_last: bool = False, max_value: float | None = None, min_value: float | None = None, show_plot: bool = True) None[source]

Plot an image. Color images should be in the HWC representation (height, width, color channels) if is_color_channel_last is True and in the CHW representation (color channels, height, width) if is_color_channel_last is False.

Parameters:
  • img_data (np.ndarray) – The image data

  • black_and_white (bool) – Specifies whether the image should be plotted in grayscale colors. Only relevant for images without color channels (default: False)

  • image_shape (tuple) – (height, width) for grayscale images or HWC (height, width, color channels) / CHW for color images (default: None)

  • is_color_channel_last (bool) – if true, the color channels should be in the last dimension, known as HWC representation. Alternatively the color channel can be at the first position, known as CHW representation. Only relevant for color images (default: False)

  • max_value (float) – maximum pixel value, used for min-max normalization. Is often 255; if None, the maximum value in the data set will be used (default: None)

  • min_value (float) – minimum pixel value, used for min-max normalization. Is often 0; if None, the minimum value in the data set will be used (default: None)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

Examples

>>> from clustpy.data import load_nrletters, load_optdigits
>>> X = load_nrletters().data
>>> plot_image(X[0], False, (9, 7, 3), True, 255, 0, show_plot=True)
>>> X = load_optdigits().data
>>> plot_image(X[0], True, (8, 8), None, 255, 0, show_plot=True)
clustpy.utils.plots.plot_scatter_matrix(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, density: bool = True, n_bins: int = 100, show_legend: bool = True, scattersize: float = 10, equal_axis: bool = False, max_dimensions: int = 10, show_plot: bool = True) Axes[source]

Create a scatter matrix plot. Visualizes a 2d scatter plot for each combination of features. The center axis shows a histogram of each single feature.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)

  • centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)

  • true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)

  • density (bool) – Defines whether a kernel density should be added to the histogram (default: True)

  • n_bins (int) – Number of bins used for the histogram (default: 100)

  • show_legend (bool) – Defines whether a legend should be shown (default: True)

  • scattersize (float) – The size of the scatters (default: 10)

  • equal_axis (bool) – Defines whether the axes are to be scaled to the same value range (default: False)

  • max_dimensions (int) – Maximum Number of dimensions that should be plotted. This value is intended to prevent the creation of overly complex plots that are very confusing and take a long time to create (default: 10)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

Returns:

axes – None if show_plot is True, otherwise the used matplotlib axes

Return type:

plt.Axes
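
Examples

A sketch that keeps the returned axes for further customization (show_plot=False, so the axes are returned instead of None):

>>> import numpy as np
>>> from clustpy.utils.plots import plot_scatter_matrix
>>> X = np.random.randn(100, 4)
>>> labels = np.random.randint(0, 2, 100)
>>> axes = plot_scatter_matrix(X, labels=labels, show_plot=False)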

clustpy.utils.plots.plot_with_transformation(X: np.ndarray, labels: np.ndarray | None = None, centers: np.ndarray | None = None, true_labels: np.ndarray | None = None, plot_dimensionality: int = 2, transformation_class: TransformerMixin = sklearn.decomposition.PCA, show_legend: bool = True, scattersize: float = 10, equal_axis: bool = False, show_plot: bool = True) None[source]

In Data Science, it is common to work with high-dimensional data, which cannot be visualized directly. Therefore, a dimensionality reduction technique is often applied before a plot is created. Examples of such techniques are PCA, ICA, t-SNE, UMAP, … Note that the chosen technique must provide a ‘fit_transform’ method.

This method automatically executes the aforementioned pipeline: first it reduces the dimensionality, then it creates a plot adjusted to the number of features. Up to three dimensions are visualized with the help of scatter plots. For higher dimensionalities, a scatter matrix plot is used.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)

  • centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)

  • true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)

  • plot_dimensionality (int) – The dimensionality of the feature space after the dimensionality reduction technique has been applied (default: 2)

  • transformation_class (TransformerMixin) – The transformation class / dimensionality reduction technique (default: sklearn.decomposition.PCA)

  • show_legend (bool) – Defines whether a legend should be shown (default: True)

  • scattersize (float) – The size of the scatters (default: 10)

  • equal_axis (bool) – Defines whether the axes are to be scaled to the same value range (default: False)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)
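
Examples

A sketch with a non-default transformation; any class providing fit_transform should work, t-SNE is just an example here and is assumed to be instantiated internally with the given plot_dimensionality:

>>> import numpy as np
>>> from sklearn.manifold import TSNE
>>> from clustpy.utils.plots import plot_with_transformation
>>> X = np.random.randn(100, 20)
>>> labels = np.random.randint(0, 3, 100)
>>> plot_with_transformation(X, labels=labels, plot_dimensionality=2, transformation_class=TSNE)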

Module contents

class clustpy.utils.EvaluationAlgorithm(name: str, algorithm: ClusterMixin, params: dict | None = None, deterministic: bool = False, preprocess_methods: list | None = None, preprocess_params: dict | None = None)[source]

Bases: object

The EvaluationAlgorithm object is a wrapper for clustering algorithms. It contains all the information necessary to evaluate a data set using the evaluate_dataset or evaluate_multiple_datasets method. If the algorithm requires the number of clusters as an input parameter, params should contain {“n_clusters”: None}.

Parameters:
  • name (str) – Name of the algorithm. Can be chosen freely

  • algorithm (ClusterMixin) – The actual object of the clustering algorithm

  • params (dict) – Parameters given to the clustering algorithm. If the algorithm uses an n_clusters parameter, it can be set to None, e.g., params={“n_clusters”: None}. In this case the evaluation methods will automatically use the correct number of clusters for the specific data set (default: {})

  • deterministic (bool) – Defines if the algorithm produces a deterministic clustering result (e.g. like DBSCAN). In this case the algorithm will only be executed once even though a higher number of repetitions is specified when evaluating a data set (default: False)

  • preprocess_methods (list) – Specify preprocessing steps performed on each data set before executing the clustering algorithm. Can be either a list of callable functions or a single callable function. Will also be applied to an optional test data set (default: None)

  • preprocess_params (dict) – List of dictionaries containing the parameters for the preprocessing methods. Needs one entry for each method in preprocess_methods. If only a single preprocessing method is given (instead of a list) a single dictionary is expected (default: {})

Examples

See evaluate_multiple_datasets()

>>> from sklearn.cluster import DBSCAN
>>> from clustpy.partition import SubKmeans
>>> ea1 = EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)
>>> ea2 = EvaluationAlgorithm(name="SubKMeans", algorithm=SubKmeans, params={"n_clusters": None})
class clustpy.utils.EvaluationAutoencoder(path: str, autoencoder_class: Module, params: dict | None = None, path_custom_dataloaders: tuple | None = None)[source]

Bases: object

The EvaluationAutoencoder object is a wrapper for autoencoders that can be used by deep clustering algorithms. It contains all the information necessary to load a pretrained autoencoder for the evaluate_dataset or evaluate_multiple_datasets method. Can also contain paths to saved dataloaders (e.g. when using augmentation).

Parameters:
  • path (str) – Path to the state dict that should be loaded

  • autoencoder_class (torch.nn.Module) – The actual autoencoder class

  • params (dict) – Parameters given to the autoencoder class (default: {})

  • path_custom_dataloaders (tuple) – Tuple containing the path of saved dataloaders. First entry is for the saved trainloader and second for the saved testloader (default: None)

Examples

>>> from clustpy.deep.autoencoders import FeedforwardAutoencoder
>>> ea = EvaluationAutoencoder(path="PATH", autoencoder_class=FeedforwardAutoencoder, params={"layers": [256, 128, 64, 10], "bias": False})
class clustpy.utils.EvaluationDataset(name: str, data: ndarray, labels_true: ndarray | None = None, data_loader_params: dict | None = None, train_test_split: bool | None = None, preprocess_methods: list | None = None, preprocess_params: list | None = None, iteration_specific_autoencoders: list | None = None, ignore_algorithms: tuple = ())[source]

Bases: object

The EvaluationDataset object is a wrapper for actual data sets. It contains all the information necessary to evaluate a data set using the evaluate_multiple_datasets method.

Parameters:
  • name (str) – Name of the data set. Can be chosen freely

  • data (np.ndarray) – The actual data set. Can be a np.ndarray, a path to a data file (of type str) or a callable (e.g. a method from clustpy.data)

  • labels_true (np.ndarray) – The ground truth labels. Can be a np.ndarray, an int or list specifying which columns of the data contain the labels or None if no ground truth labels are present. If data is a callable, the ground truth labels can also be obtained by that function and labels_true can be None (default: None)

  • data_loader_params (dict) – Dictionary containing the information necessary to load data from a function or file. Only relevant if data is of type callable or str (default: {})

  • train_test_split (bool) – Specifies if the loaded dataset should be split into a train and test set. Can be of type bool, list or np.ndarray. If train_test_split is a boolean and true, the data loader will use the parameter “subset” to load a train and test set. In that case data must be a callable. If train_test_split is a list/np.ndarray, the entries specify the indices of the data array that should be used for the train set (default: None)

  • preprocess_methods (list) – Specify preprocessing steps before evaluating the data set. Can be either a list of callable functions or a single callable function. Will also be applied to an optional test data set (default: None)

  • preprocess_params (list) – List of dictionaries containing the parameters for the preprocessing methods. Needs one entry for each method in preprocess_methods. If only a single preprocessing method is given (instead of a list) a single dictionary is expected (default: {})

  • iteration_specific_autoencoders (list) – List containing EvaluationAutoencoder objects for each iteration of a deep clustering algorithm. Length of the list must be equal to ‘n_repetitions’ in ‘evaluate_multiple_datasets()’ and ‘evaluate_dataset()’. Each entry in the list must be of type EvaluationAutoencoder. If a clustering algorithm does not have an ‘autoencoder’ parameter, this parameter will be ignored. Can be None if no iteration-specific autoencoders are used (default: None)

  • ignore_algorithms (tuple) – Tuple of algorithm names (as specified in the EvaluationAlgorithm object) that should be ignored for this specific data set (default: ())

Examples

See evaluate_multiple_datasets()

>>> from clustpy.data import load_iris, load_wine
>>> ed1 = EvaluationDataset(name="iris", data=load_iris)
>>> X, L = load_wine()
>>> ed2 = EvaluationDataset(name="wine", data=X, labels_true=L)
class clustpy.utils.EvaluationMetric(name: str, metric: Callable, params: dict | None = None, use_gt: bool = True)[source]

Bases: object

The EvaluationMetric object is a wrapper for evaluation metrics. It contains all the information necessary to evaluate a data set using the evaluate_dataset or evaluate_multiple_datasets method.

Parameters:
  • name (str) – Name of the metric. Can be chosen freely

  • metric (Callable) – The actual metric function

  • params (dict) – Parameters given to the metric function (default: {})

  • use_gt (bool) – If true, the input to the metric will be the ground truth labels and the predicted labels (e.g. normalized mutual information). If false, the input will be the data and the predicted labels (e.g. silhouette score) (default: True)

Examples

See evaluate_multiple_datasets()

>>> from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
>>> em1 = EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True)
>>> em2 = EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)
clustpy.utils.dip_boot_samples(n_points: int, n_boots: int = 1000, random_state: RandomState | None = None) ndarray[source]

Sample random data sets and calculate the corresponding Dip-values. Used, e.g., to determine p-values.

Parameters:
  • n_points (int) – The number of samples

  • n_boots (int) – Number of random data sets that should be created to calculate Dip-values (default: 1000)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

boot_dips – Array of Dip-values

Return type:

np.ndarray

clustpy.utils.dip_gradient(X: ndarray, X_proj: ndarray, sorted_indices: ndarray, modal_triangle: tuple) ndarray[source]

Calculate the gradient of the Dip-value regarding the projection axis.

Parameters:
  • X (np.ndarray) – the given data set

  • X_proj (np.ndarray) – The univariate projected data set

  • sorted_indices (np.ndarray) – The indices of the sorted univariate data set

  • modal_triangle (tuple) – Indices of the modal triangle

Returns:

gradient – The gradient of the Dip-value regarding the projection axis

Return type:

np.ndarray

References

Krause, Andreas, and Volkmar Liebscher. “Multimodal projection pursuit using the dip statistic.” (2005).

clustpy.utils.dip_pval(dip_value: float, n_points: int, pval_strategy: str = 'table', n_boots: int = 1000, random_state: RandomState | None = None) float[source]

Get the p-value for a given Dip-value. P-values depend on the input Dip-value and the sample size. There are several strategies to calculate the p-value: ‘table’ (most common), ‘function’ (available for all sample sizes) and ‘bootstrap’ (slow for large sample sizes).

Parameters:
  • dip_value (float) – The Dip-value

  • n_points (int) – The number of samples

  • pval_strategy (str) – Specifies the strategy that should be used to calculate the p-value (default: ‘table’)

  • n_boots (int) – Number of random data sets that should be created to calculate Dip-values. Only relevant if pval_strategy is ‘bootstrap’ (default: 1000)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int. Only relevant if pval_strategy is ‘bootstrap’ (default: None)

Returns:

pval – The resulting p-value

Return type:

float

References

Hartigan, John A., and Pamela M. Hartigan. “The dip test of unimodality.” The Annals of Statistics (1985): 70-84.

and

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.

clustpy.utils.dip_pval_gradient(X: ndarray, X_proj: ndarray, sorted_indices: ndarray, modal_triangle: tuple, dip_value: float) ndarray[source]

Calculate the gradient of the Dip p-value function regarding the projection axis.

Parameters:
  • X (np.ndarray) – the given data set

  • X_proj (np.ndarray) – The univariate projected data set

  • sorted_indices (np.ndarray) – The indices of the sorted univariate data set

  • modal_triangle (tuple) – Indices of the modal triangle

  • dip_value (float) – The Dip-value

Returns:

pval_grad – The gradient of the Dip p-value function regarding the projection axis

Return type:

np.ndarray

References

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.

clustpy.utils.dip_test(X: np.ndarray, just_dip: bool = True, is_data_sorted: bool = False, return_gcm_lcm_mn_mj: bool = False, use_c: bool = True, debug: bool = False) -> (float, tuple, tuple, np.ndarray, np.ndarray, np.ndarray, np.ndarray)[source]

Calculate the Dip-value. This can either be done using the C implementation or the Python version. If just_dip is False, additional values are returned: the modal interval (indices of the beginning and end of the steepest slope of the ECDF) and the modal triangle (used to calculate the gradient of the Dip-value). Further, the indices of the Greatest Convex Minorant (gcm), Least Concave Majorant (lcm), minorant and majorant values can be returned by setting return_gcm_lcm_mn_mj to True. Note that modal_triangle can be (-1,-1,-1) if the triangle could not be determined correctly.

Parameters:
  • X (np.ndarray) – the given univariate data set

  • just_dip (bool) – Defines whether only the Dip-value should be returned or also the modal interval and modal triangle (default: True)

  • is_data_sorted (bool) – Should be True if the data set is already sorted (default: False)

  • return_gcm_lcm_mn_mj (bool) – Defines whether the gcm, lcm, mn and mj arrays should be returned. In this case just_dip must be False (default: False)

  • use_c (bool) – Defines whether the C implementation should be used (default: True)

  • debug (bool) – If true, additional information will be printed to the console (default: False)

Returns:

tuple –

  • The resulting Dip-value

  • The indices of the modal interval - corresponds to the steepest slope in the ECDF (if just_dip is False)

  • The indices of the modal triangle (if just_dip is False)

  • The indices of points that are part of the Greatest Convex Minorant (gcm) (if just_dip is False and return_gcm_lcm_mn_mj is True)

  • The indices of points that are part of the Least Concave Majorant (lcm) (if just_dip is False and return_gcm_lcm_mn_mj is True)

  • The minorant values (if just_dip is False and return_gcm_lcm_mn_mj is True)

  • The majorant values (if just_dip is False and return_gcm_lcm_mn_mj is True)

Return type:

(float, tuple, tuple, np.ndarray, np.ndarray, np.ndarray, np.ndarray)

References

Hartigan, John A., and Pamela M. Hartigan. “The dip test of unimodality.” The Annals of Statistics (1985): 70-84.

and

Hartigan, P. M. “Computation of the dip statistic to test for unimodality: Algorithm as 217.” Applied Statistics 34.3 (1985): 320-5.

clustpy.utils.evaluate_dataset(X: np.ndarray, evaluation_algorithms: list, evaluation_metrics: list | None = None, labels_true: np.ndarray | None = None, n_repetitions: int = 10, X_test: np.ndarray | None = None, labels_true_test: np.ndarray | None = None, iteration_specific_autoencoders: list | None = None, aggregation_functions: tuple = (np.mean, np.std), add_runtime: bool = True, add_n_clusters: bool = False, save_path: str | None = None, save_labels_path: str | None = None, ignore_algorithms: tuple = (), random_state: np.random.RandomState | None = None) DataFrame[source]

Evaluate the clustering result of different clustering algorithms (as specified by evaluation_algorithms) on a given data set using different metrics (as specified by evaluation_metrics). Each algorithm will be executed n_repetitions times and all specified metrics will be used to evaluate the clustering result. The final result is a pandas DataFrame containing all the information.

Parameters:
  • X (np.ndarray) – the given data set

  • evaluation_algorithms (list) – Contains objects of type EvaluationAlgorithm which are wrappers for the clustering algorithms

  • evaluation_metrics (list) – Contains objects of type EvaluationMetric which are wrappers for the metrics (default: None)

  • labels_true (np.ndarray) – The ground truth labels of the data set (default: None)

  • n_repetitions (int) – Number of times that the clustering procedure should be executed on the same data set (default: 10)

  • X_test (np.ndarray) – An optional test data set that will be evaluated using the predict method of the clustering algorithms (default: None)

  • labels_true_test (np.ndarray) – The ground truth labels of the test data set (default: None)

  • iteration_specific_autoencoders (list) – List containing EvaluationAutoencoder objects for each iteration of a deep clustering algorithm. Length of the list must be equal to ‘n_repetitions’. Each entry in the list must be of type EvaluationAutoencoder. If a clustering algorithm does not have an ‘autoencoder’ parameter, this parameter will be ignored. Can be None if no iteration-specific autoencoders are used (default: None)

  • aggregation_functions (tuple) – List of aggregation functions that should be applied to the n_repetitions different results of a single clustering algorithm (default: [np.mean, np.std])

  • add_runtime (bool) – Add runtime of each execution to the final table (default: True)

  • add_n_clusters (bool) – Add the resulting number of clusters to the final table (default: False)

  • save_path (str) – The path where the final DataFrame should be saved as csv. If None, the DataFrame will not be saved (default: None)

  • save_labels_path (str) – The path where the clustering labels should be saved as csv. If None, the labels will not be saved (default: None)

  • ignore_algorithms (tuple) – Tuple of algorithm names (as specified in the EvaluationAlgorithm object) that should be ignored for this specific data set (default: ())

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

df – The final DataFrame

Return type:

pd.DataFrame

Examples

>>> from sklearn.cluster import KMeans, DBSCAN
>>> from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
>>>
>>> def _add_value(x, value):
>>>     return x + value
>>>
>>> X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
>>> L = np.array([0] * 3 + [1] * 3)
>>> n_repetitions = 2
>>> aggregations = [np.mean, np.std, np.max]
>>> algorithms = [
>>>     EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
>>>     EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
>>>                         preprocess_methods=[_add_value],
>>>                         preprocess_params=[{"value": 1}]),
>>>     EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]
>>> metrics = [EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
>>>            EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]
>>> df = evaluate_dataset(X=X, evaluation_algorithms=algorithms, evaluation_metrics=metrics, labels_true=L,
>>>                       n_repetitions=n_repetitions, aggregation_functions=aggregations, add_runtime=True,
>>>                       add_n_clusters=True, save_path=None, ignore_algorithms=["KMeans_with_preprocess"],
>>>                       random_state=1)
clustpy.utils.evaluate_multiple_datasets(evaluation_datasets: list, evaluation_algorithms: list, evaluation_metrics: list | None = None, n_repetitions: int = 10, aggregation_functions: tuple = (np.mean, np.std), add_runtime: bool = True, add_n_clusters: bool = False, save_path: str | None = None, save_intermediate_results: bool = False, save_labels_path: str | None = None, random_state: RandomState | None = None) DataFrame[source]

Evaluate the clustering result of different clustering algorithms (as specified by evaluation_algorithms) on a set of data sets (as specified by evaluation_datasets) using different metrics (as specified by evaluation_metrics). Each algorithm will be executed n_repetitions times and all specified metrics will be used to evaluate the clustering result. The final result is a pandas DataFrame containing all the information.

Parameters:
  • evaluation_datasets (list) – Contains objects of type EvaluationDataset which are wrappers for the data sets

  • evaluation_algorithms (list) – Contains objects of type EvaluationAlgorithm which are wrappers for the clustering algorithms

  • evaluation_metrics (list) – Contains objects of type EvaluationMetric which are wrappers for the metrics (default: None)

  • n_repetitions (int) – Number of times that the clustering procedure should be executed on the same data set (default: 10)

  • aggregation_functions (tuple) – Tuple of aggregation functions that should be applied to the n_repetitions different results of a single clustering algorithm (default: (np.mean, np.std))

  • add_runtime (bool) – Add runtime of each execution to the final table (default: True)

  • add_n_clusters (bool) – Add the resulting number of clusters to the final table (default: False)

  • save_path (str) – The path where the final DataFrame should be saved as csv. If None, the DataFrame will not be saved (default: None)

  • save_intermediate_results (bool) – Defines whether the result of each data set should be separately saved. Useful if the evaluation takes a lot of time (default: False)

  • save_labels_path (str) – The path where the clustering labels should be saved as csv. If None, the labels will not be saved (default: None)

  • random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

df – The final DataFrame

Return type:

pd.DataFrame

Examples

See the readme.md

>>> import numpy as np
>>> from sklearn.cluster import KMeans, DBSCAN
>>> from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
>>> from clustpy.data import load_iris
>>> from clustpy.utils import evaluate_multiple_datasets, EvaluationDataset, EvaluationAlgorithm, EvaluationMetric
>>>
>>> def _add_value(x, value):
>>>     return x + value
>>>
>>> X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
>>> L = np.array([0] * 3 + [1] * 3)
>>> X2 = np.c_[X, L]
>>> n_repetitions = 2
>>> aggregations = [np.mean, np.std, np.max]
>>> algorithms = [
>>>     EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
>>>     EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
>>>                         preprocess_methods=[_add_value],
>>>                         preprocess_params=[{"value": 1}]),
>>>     EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]
>>> metrics = [EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
>>>            EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]
>>> datasets = [EvaluationDataset(name="iris", data=load_iris, preprocess_methods=[_add_value],
>>>                               preprocess_params=[{"value": 2}]),
>>>             EvaluationDataset(name="X", data=X, labels_true=L),
>>>             EvaluationDataset(name="X2", data=X2, labels_true=-1, ignore_algorithms=["KMeans_with_preprocess"])
>>>             ]
>>> df = evaluate_multiple_datasets(evaluation_datasets=datasets, evaluation_algorithms=algorithms,
>>>                                 evaluation_metrics=metrics, n_repetitions=n_repetitions,
>>>                                 aggregation_functions=aggregations, add_runtime=True, add_n_clusters=True,
>>>                                 save_path=None, save_intermediate_results=False, random_state=1)
clustpy.utils.evaluation_df_to_latex_table(df: DataFrame, output_path: str, use_std: bool = True, best_in_bold: bool = True, second_best_underlined: bool = True, color_by_value: str | None = None, higher_is_better: list | None = None, in_percent: bool = True, decimal_places: int = 1) None[source]

Convert the resulting DataFrame of an evaluation into a LaTeX table. Note that the LaTeX package booktabs is required, so \usepackage{booktabs} must be included in the LaTeX file. This method only considers the mean values, so "mean" must be included in the aggregation functions! If "std" is also contained in the DataFrame (and use_std is True), this value will be added after a ± sign.

Parameters:
  • df (pd.DataFrame) – The pandas dataframe. Can also be a string that contains the path to the saved dataframe

  • output_path (str) – The path where the resulting LaTeX table text file will be stored

  • use_std (bool) – Defines if the standard deviation (std) should also be added to the latex table (default: True)

  • best_in_bold (bool) – Print the best value for each combination of data set and metric in bold. Note that the LaTeX package bm is used, so \usepackage{bm} must be included in the LaTeX file (default: True)

  • second_best_underlined (bool) – Print second-best value for each combination of dataset and metric underlined (default: True)

  • color_by_value (str) – Define the color that should be used to indicate the difference between the values of the metrics. Uses \cellcolor, so \usepackage{colortbl} or \usepackage[table]{xcolor} must be included in the LaTeX file. Can be 'blue', for example (default: None)

  • higher_is_better (list) – List with booleans. Each value indicates if a high value for a certain metric is better than a low value. The length of the list must be equal to the number of different metrics. If None, it is always assumed that a higher value is better, except for the runtime (default: None)

  • in_percent (bool) – If True, all values except n_clusters and runtime will be converted to percentages, i.e. multiplied by 100 (default: True)

  • decimal_places (int) – Number of decimal places that should be used in the latex table (default: 1)
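
Examples

A minimal usage sketch; the csv and tex file names below are purely illustrative and assume that a DataFrame was previously saved via the save_path parameter of evaluate_dataset or evaluate_multiple_datasets:

>>> from clustpy.utils import evaluation_df_to_latex_table
>>> # "evaluation_results.csv" is a hypothetical file; a DataFrame object can be passed directly instead
>>> evaluation_df_to_latex_table("evaluation_results.csv", "evaluation_table.tex",
>>>                              best_in_bold=True, color_by_value="blue")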

clustpy.utils.load_saved_autoencoder(path: str, autoencoder_class: Module, params: dict | None = None) Module[source]

Load the states of an already trained autoencoder. It will be assumed that the autoencoder was already fitted, so the ‘fitted’ parameter will be set to True.

Parameters:
  • path (str) – Path to the state dict that should be loaded

  • autoencoder_class (torch.nn.Module) – The actual autoencoder class

  • params (dict) – Parameters given to the autoencoder class (default: None)

Return type:

The autoencoder with the loaded states
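
Examples

A minimal sketch, assuming a state dict was saved beforehand (e.g., via torch.save(autoencoder.state_dict(), "ae.pth")); the file name is illustrative and FeedforwardAutoencoder from clustpy.deep is assumed to accept a 'layers' parameter:

>>> from clustpy.deep import FeedforwardAutoencoder
>>> from clustpy.utils import load_saved_autoencoder
>>> # layer sizes are illustrative and must match the saved autoencoder
>>> autoencoder = load_saved_autoencoder("ae.pth", FeedforwardAutoencoder,
>>>                                      {"layers": [64, 32, 10]})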

clustpy.utils.plot_1d_data(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, show_legend: bool = True, show_plot: bool = True) None[source]

Plot a one-dimensional data set.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)

  • centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)

  • true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)

  • show_legend (bool) – Defines whether a legend should be shown (default: True)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)
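
Examples

A minimal usage sketch based on the parameters documented above:

>>> import numpy as np
>>> from clustpy.utils import plot_1d_data
>>> X = np.array([1., 2., 3., 10., 11., 12.])
>>> labels = np.array([0] * 3 + [1] * 3)
>>> plot_1d_data(X, labels=labels, show_plot=True)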

clustpy.utils.plot_2d_data(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, cluster_ids_font_size: float | None = None, centers_ids_font_size: float = 10, show_legend: bool = True, title: str | None = None, scattersize: float = 10, centers_scattersize: float = 15, equal_axis: bool = False, container: Axes = matplotlib.pyplot, show_plot: bool = True) None[source]

Plot a two-dimensional data set.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)

  • centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)

  • true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)

  • cluster_ids_font_size (float) – The font size of the id of a predicted cluster, which is shown as text in the center of that cluster. Can be None if no id should be shown (default: None)

  • centers_ids_font_size (float) – The font size of the id that is shown next to the red marker of a cluster center. Only relevant if centers is not None. Can be None if no id should be shown (default: 10)

  • show_legend (bool) – Defines whether a legend should be shown (default: True)

  • title (str) – Title of the plot (default: None)

  • scattersize (float) – The size of the scatters (default: 10)

  • centers_scattersize (float) – The size of the red scatters of the cluster centers (default: 15)

  • equal_axis (bool) – Defines whether the axes are to be scaled to the same value range (default: False)

  • container (plt.Axes) – The container to which the scatter plot is added. If another container is defined, show_plot should usually be False (default: matplotlib.pyplot)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)
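
Examples

A minimal usage sketch based on the parameters documented above:

>>> import numpy as np
>>> from clustpy.utils import plot_2d_data
>>> X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
>>> labels = np.array([0] * 3 + [1] * 3)
>>> centers = np.array([[1, 1], [6, 6]])
>>> plot_2d_data(X, labels=labels, centers=centers, title="Example plot", show_plot=True)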

clustpy.utils.plot_3d_data(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, show_legend: bool = True, scattersize: float = 10, show_plot: bool = True) None[source]

Plot a three-dimensional data set.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)

  • centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)

  • true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)

  • show_legend (bool) – Defines whether a legend should be shown (default: True)

  • scattersize (float) – The size of the scatters (default: 10)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)
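
Examples

A minimal usage sketch based on the parameters documented above:

>>> import numpy as np
>>> from clustpy.utils import plot_3d_data
>>> X = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [5, 5, 5], [6, 6, 6], [7, 7, 7]])
>>> labels = np.array([0] * 3 + [1] * 3)
>>> plot_3d_data(X, labels=labels, show_plot=True)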

clustpy.utils.plot_dip(X: ndarray, is_data_sorted: bool = False, dip_value: float | None = None, modal_interval: tuple | None = None, modal_triangle: tuple | None = None, gcm: ndarray | None = None, lcm: ndarray | None = None, linewidth_ecdf: float = 1, linewidth_extra: float = 2, show_legend: bool = True, add_histogram: bool = True, histogram_labels: ndarray | None = None, histogram_show_legend: bool = True, histogram_density: bool = True, histogram_n_bins: int = 100, height_ratio: tuple = (1, 2), show_plot: bool = True) None[source]

Plot a visual representation of the computational process of the Dip. The upper part shows an optional histogram of the data and the lower part shows the corresponding ECDF (empirical cumulative distribution function).

Parameters:
  • X (np.ndarray) – the given data set

  • is_data_sorted (bool) – Should be True if the data set is already sorted (default: False)

  • dip_value (float) – The Dip-value (default: None)

  • modal_interval (tuple) – Indices of the modal interval - corresponds to the steepest slope in the ECDF (default: None)

  • modal_triangle (tuple) – Indices of the modal triangle (default: None)

  • gcm (np.ndarray) – The indices of points that are part of the Greatest Convex Minorant (gcm) (default: None)

  • lcm (np.ndarray) – The indices of points that are part of the Least Concave Majorant (lcm) (default: None)

  • linewidth_ecdf (float) – The linewidth for the ECDF (default: 1)

  • linewidth_extra (float) – The linewidth for the visualization of the dip, modal interval, modal triangle, gcm and lcm (default: 2)

  • show_legend (bool) – Defines whether the legend of the ECDF plot should be added (default: True)

  • add_histogram (bool) – Defines whether the histogram should be shown above the ECDF plot (default: True)

  • histogram_labels (np.ndarray) – Labels used to color parts of the histogram (default: None)

  • histogram_show_legend (bool) – Defines whether the legend of the histogram should be added (default: True)

  • histogram_density (bool) – Defines whether a kernel density should be added to the histogram plot (default: True)

  • histogram_n_bins (int) – Number of bins used for the histogram (default: 100)

  • height_ratio (tuple) – Defines the height ratio between histogram and ECDF plot. Only relevant if add_histogram is True. First value in the tuple defines the height of the histogram and the second value the height of the ECDF plot (default: (1, 2))

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)
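
Examples

A minimal sketch using a synthetic bimodal sample with all Dip-specific parameters left at their defaults; values for dip_value, modal_interval and modal_triangle could additionally be obtained from clustpy.utils.dip_test:

>>> import numpy as np
>>> from clustpy.utils import plot_dip
>>> rs = np.random.RandomState(1)
>>> X = np.r_[rs.normal(0, 1, 500), rs.normal(6, 1, 500)]  # bimodal 1d data
>>> plot_dip(X, add_histogram=True, histogram_n_bins=50, show_plot=True)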

clustpy.utils.plot_histogram(X: ndarray, labels: ndarray | None = None, density: bool = True, n_bins: int = 100, show_legend: bool = True, container: Axes = matplotlib.pyplot, show_plot: bool = True) None[source]

Plot a histogram.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)

  • density (bool) – Defines whether a kernel density should be added to the histogram (default: True)

  • n_bins (int) – Number of bins (default: 100)

  • show_legend (bool) – Defines whether the legend of the histogram should be shown (default: True)

  • container (plt.Axes) – The container to which the histogram is added. If another container is defined, show_plot should usually be False (default: matplotlib.pyplot)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)
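
Examples

A minimal usage sketch based on the parameters documented above:

>>> import numpy as np
>>> from clustpy.utils import plot_histogram
>>> rs = np.random.RandomState(1)
>>> X = np.r_[rs.normal(0, 1, 500), rs.normal(5, 1, 500)]
>>> labels = np.array([0] * 500 + [1] * 500)
>>> plot_histogram(X, labels=labels, density=True, n_bins=50, show_plot=True)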

clustpy.utils.plot_image(img_data: ndarray, black_and_white: bool = False, image_shape: tuple | None = None, is_color_channel_last: bool = False, max_value: float | None = None, min_value: float | None = None, show_plot: bool = True) None[source]

Plot an image. Color images should be given in HWC representation (height, width, color channels) if is_color_channel_last is True, and in CHW representation (color channels, height, width) if is_color_channel_last is False.

Parameters:
  • img_data (np.ndarray) – The image data

  • black_and_white (bool) – Specifies whether the image should be plotted in grayscale colors. Only relevant for images without color channels (default: False)

  • image_shape (tuple) – (height, width) for grayscale images or HWC (height, width, color channels) / CHW for color images (default: None)

  • is_color_channel_last (bool) – If True, the color channels are expected in the last dimension, known as HWC representation. Otherwise, the color channel is expected at the first position, known as CHW representation. Only relevant for color images (default: False)

  • max_value (float) – Maximum pixel value, used for min-max normalization. Is often 255; if None, the maximum value in the data set will be used (default: None)

  • min_value (float) – Minimum pixel value, used for min-max normalization. Is often 0; if None, the minimum value in the data set will be used (default: None)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

Examples

>>> from clustpy.data import load_nrletters, load_optdigits
>>> X = load_nrletters().data
>>> plot_image(X[0], False, (9, 7, 3), True, 255, 0, show_plot=True)
>>> X = load_optdigits().data
>>> plot_image(X[0], True, (8, 8), None, 255, 0, show_plot=True)
clustpy.utils.plot_scatter_matrix(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, density: bool = True, n_bins: int = 100, show_legend: bool = True, scattersize: float = 10, equal_axis: bool = False, max_dimensions: int = 10, show_plot: bool = True) Axes[source]

Create a scatter matrix plot. Visualizes a 2d scatter plot for each combination of features. The diagonal axes show a histogram of each single feature.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)

  • centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)

  • true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)

  • density (bool) – Defines whether a kernel density should be added to the histogram (default: True)

  • n_bins (int) – Number of bins used for the histogram (default: 100)

  • show_legend (bool) – Defines whether a legend should be shown (default: True)

  • scattersize (float) – The size of the scatters (default: 10)

  • equal_axis (bool) – Defines whether the axes are to be scaled to the same value range (default: False)

  • max_dimensions (int) – Maximum number of dimensions that should be plotted. This value is intended to prevent the creation of overly complex plots that are very confusing and take a long time to create (default: 10)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

Returns:

axes – None if show_plot is True, otherwise the used matplotlib axes

Return type:

plt.Axes
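
Examples

A minimal sketch; it is assumed that the Bunch object returned by load_iris provides the features as 'data' and the ground truth labels as 'target':

>>> from clustpy.data import load_iris
>>> from clustpy.utils import plot_scatter_matrix
>>> iris = load_iris()
>>> plot_scatter_matrix(iris.data, labels=iris.target, show_plot=True)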

clustpy.utils.plot_with_transformation(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, plot_dimensionality: int = 2, transformation_class: TransformerMixin = sklearn.decomposition.PCA, show_legend: bool = True, scattersize: float = 10, equal_axis: bool = False, show_plot: bool = True) None[source]

In data science, it is common to work with high-dimensional data, which cannot be visualized directly. Therefore, a dimensionality reduction technique is often applied before a plot is created. Examples of such techniques are PCA, ICA, t-SNE and UMAP. Note that the chosen technique must provide a 'fit_transform' method.

This method automatically executes the aforementioned pipeline: first it reduces the dimensionality, then it creates a plot adjusted to the resulting number of features. Up to three dimensions are visualized with the help of scatter plots; for higher dimensionalities a scatter matrix plot is used.

Parameters:
  • X (np.ndarray) – the given data set

  • labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)

  • centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)

  • true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)

  • plot_dimensionality (int) – The dimensionality of the feature space after the dimensionality reduction technique has been applied (default: 2)

  • transformation_class (TransformerMixin) – The transformation class / dimensionality reduction technique (default: sklearn.decomposition.PCA)

  • show_legend (bool) – Defines whether a legend should be shown (default: True)

  • scattersize (float) – The size of the scatters (default: 10)

  • equal_axis (bool) – Defines whether the axes are to be scaled to the same value range (default: False)

  • show_plot (bool) – Defines whether the plot should directly be plotted (default: True)
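
Examples

A minimal sketch; FastICA is used as an alternative transformation class since it provides the required 'fit_transform' method, and the Bunch returned by load_iris is assumed to expose 'data' and 'target':

>>> from sklearn.decomposition import FastICA
>>> from clustpy.data import load_iris
>>> from clustpy.utils import plot_with_transformation
>>> iris = load_iris()
>>> plot_with_transformation(iris.data, labels=iris.target, plot_dimensionality=2,
>>>                          transformation_class=FastICA, show_plot=True)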