clustpy.utils package

Submodules

clustpy.utils.diptest module

clustpy.utils.diptest.dip_boot_samples(n_points: int, n_boots: int = 1000, random_state: RandomState | None = None) → ndarray[source]

Sample random data sets and calculate the corresponding Dip-values. These are used, e.g., to determine p-values.

Parameters:

n_points (int) – The number of samples
n_boots (int) – Number of random data sets that should be created to calculate Dip-values (default: 1000)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

boot_dips – Array of Dip-values

Return type:

np.ndarray
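The way an array of bootstrapped values can be turned into a p-value may be sketched as follows. This is a minimal pure-Python illustration, not the library's implementation: `ecdf_gap` is a hypothetical stand-in statistic (it is not the Dip), and drawing the random data sets from a uniform distribution is an assumption. The p-value is taken as the fraction of bootstrapped values that are at least as large as the observed one.

```python
import random


def ecdf_gap(sample):
    """Stand-in statistic (NOT the Dip): largest gap between the ECDF and the uniform CDF on [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def boot_statistics(n_points, n_boots=1000, seed=None, statistic=ecdf_gap):
    """Draw n_boots uniform data sets of size n_points and evaluate the statistic on each."""
    rng = random.Random(seed)
    return [statistic([rng.random() for _ in range(n_points)]) for _ in range(n_boots)]


def bootstrap_pvalue(observed, boot_values):
    """Fraction of bootstrapped values that are at least as large as the observed value."""
    return sum(b >= observed for b in boot_values) / len(boot_values)


boot_values = boot_statistics(n_points=50, n_boots=200, seed=0)
p_small = bootstrap_pvalue(0.0, boot_values)  # every bootstrapped value is >= 0
p_large = bootstrap_pvalue(1.0, boot_values)  # the stand-in statistic stays below 1
```

A larger n_boots gives a finer resolution of the p-value at the cost of runtime, which matches the note that the ‘bootstrap’ strategy is slow for large sample sizes.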

clustpy.utils.diptest.dip_gradient(X: ndarray, X_proj: ndarray, sorted_indices: ndarray, modal_triangle: tuple) → ndarray[source]

Calculate the gradient of the Dip-value regarding the projection axis.

Parameters:

X (np.ndarray) – the given data set
X_proj (np.ndarray) – The univariate projected data set
sorted_indices (np.ndarray) – The indices of the sorted univariate data set
modal_triangle (tuple) – Indices of the modal triangle

Returns:

gradient – The gradient of the Dip-value regarding the projection axis

Return type:

np.ndarray

References

Krause, Andreas, and Volkmar Liebscher. “Multimodal projection pursuit using the dip statistic.” (2005).

clustpy.utils.diptest.dip_pval(dip_value: float, n_points: int, pval_strategy: str = 'table', n_boots: int = 1000, random_state: RandomState | None = None) → float[source]

Get the p-value corresponding to a Dip-value. P-values depend on the input Dip-value and the sample size. There are three strategies to calculate the p-value: ‘table’ (most common), ‘function’ (available for all sample sizes) and ‘bootstrap’ (slow for large sample sizes).

Parameters:

dip_value (float) – The Dip-value
n_points (int) – The number of samples
pval_strategy (str) – Specifies the strategy that should be used to calculate the p-value (default: ‘table’)
n_boots (int) – Number of random data sets that should be created to calculate Dip-values. Only relevant if pval_strategy is ‘bootstrap’ (default: 1000)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int. Only relevant if pval_strategy is ‘bootstrap’ (default: None)

Returns:

pval – The resulting p-value

Return type:

float

References

Hartigan, John A., and Pamela M. Hartigan. “The dip test of unimodality.” The Annals of Statistics (1985): 70-84.

and

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.

clustpy.utils.diptest.dip_pval_gradient(X: ndarray, X_proj: ndarray, sorted_indices: ndarray, modal_triangle: tuple, dip_value: float) → ndarray[source]

Calculate the gradient of the Dip p-value function regarding the projection axis.

Parameters:

X (np.ndarray) – the given data set
X_proj (np.ndarray) – The univariate projected data set
sorted_indices (np.ndarray) – The indices of the sorted univariate data set
modal_triangle (tuple) – Indices of the modal triangle
dip_value (float) – The Dip-value

Returns:

pval_grad – The gradient of the Dip p-value function regarding the projection axis

Return type:

np.ndarray

References

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.

clustpy.utils.diptest.dip_test(X: ndarray, just_dip: bool = True, is_data_sorted: bool = False, return_gcm_lcm_mn_mj: bool = False, use_c: bool = True, debug: bool = False) → (float, tuple, tuple, ndarray, ndarray, ndarray, ndarray)[source]

Calculate the Dip-value. This can be done using either the C implementation or the Python version. In addition to the Dip-value, further values can be returned if just_dip is False: the modal interval (indices of the beginning and end of the steepest slope of the ECDF) and the modal triangle (used to calculate the gradient of the Dip-value). Further, the indices of the Greatest Convex Minorant (gcm) and the Least Concave Majorant (lcm) as well as the minorant and majorant values can be returned by setting return_gcm_lcm_mn_mj to True. Note that modal_triangle can be (-1,-1,-1) if the triangle could not be determined correctly.

Parameters:

X (np.ndarray) – the given univariate data set
just_dip (bool) – Defines whether only the Dip-value should be returned or also the modal interval and modal triangle (default: True)
is_data_sorted (bool) – Should be True if the data set is already sorted (default: False)
return_gcm_lcm_mn_mj (bool) – Defines whether the gcm, lcm, mn and mj arrays should be returned. In this case just_dip must be False (default: False)
use_c (bool) – Defines whether the C implementation should be used (default: True)
debug (bool) – If true, additional information will be printed to the console (default: False)

Returns:

tuple – The following values, in this order:
The resulting Dip-value
The indices of the modal interval, which corresponds to the steepest slope in the ECDF (if just_dip is False)
The indices of the modal triangle (if just_dip is False)
The indices of points that are part of the Greatest Convex Minorant (gcm) (if just_dip is False and return_gcm_lcm_mn_mj is True)
The indices of points that are part of the Least Concave Majorant (lcm) (if just_dip is False and return_gcm_lcm_mn_mj is True)
The minorant values (if just_dip is False and return_gcm_lcm_mn_mj is True)
The majorant values (if just_dip is False and return_gcm_lcm_mn_mj is True)

Return type:

(float, tuple, tuple, np.ndarray, np.ndarray, np.ndarray, np.ndarray)

References

Hartigan, John A., and Pamela M. Hartigan. “The dip test of unimodality.” The Annals of Statistics (1985): 70-84.

and

Hartigan, P. M. “Computation of the dip statistic to test for unimodality: Algorithm as 217.” Applied Statistics 34.3 (1985): 320-5.
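To make the terms Greatest Convex Minorant and Least Concave Majorant more concrete, the following sketch computes both for the ECDF of a small sorted sample. It is a generic convex-hull illustration under stated assumptions, not the routine used by dip_test: the function names are hypothetical, the data values are assumed to be distinct, and cumulative counts (the ECDF multiplied by n) are used so that all arithmetic stays in integers. The index arrays returned by the library may be organized differently.

```python
def cumulative_points(sorted_x):
    """Points (x_i, i + 1): the ECDF of the sorted data scaled by n."""
    return [(x, i + 1) for i, x in enumerate(sorted_x)]


def _cross(o, a, b):
    """Cross product of the vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_minorant_indices(points):
    """Indices of the points on the lower convex hull (greatest convex minorant)."""
    hull = []
    for i, p in enumerate(points):
        # Drop the last hull point while it lies on or above the new chord
        while len(hull) >= 2 and _cross(points[hull[-2]], points[hull[-1]], p) <= 0:
            hull.pop()
        hull.append(i)
    return hull


def concave_majorant_indices(points):
    """Indices of the points on the upper convex hull (least concave majorant)."""
    hull = []
    for i, p in enumerate(points):
        # Drop the last hull point while it lies on or below the new chord
        while len(hull) >= 2 and _cross(points[hull[-2]], points[hull[-1]], p) >= 0:
            hull.pop()
        hull.append(i)
    return hull


# Two well-separated groups of values: the ECDF is steep, flat, then steep again
points = cumulative_points([0, 1, 2, 10, 11, 12])
gcm = convex_minorant_indices(points)
lcm = concave_majorant_indices(points)
```

For this bimodal toy sample the two hulls separate at the gap between the groups, which is the kind of deviation from a unimodal shape that the Dip measures.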

clustpy.utils.diptest.plot_dip(X: ndarray, is_data_sorted: bool = False, dip_value: float | None = None, modal_interval: tuple | None = None, modal_triangle: tuple | None = None, gcm: ndarray | None = None, lcm: ndarray | None = None, show_legend: bool = True, add_histogram: bool = True, histogram_labels: ndarray | None = None, histogram_show_legend: bool = True, histogram_density: bool = True, histogram_n_bins: int = 100, height_ratio: tuple = (1, 2), show_plot: bool = True) → None[source]

Plot a visual representation of the computational process of the Dip. Upper part shows an optional histogram of the data and the lower part shows the corresponding ECDF.

Parameters:

X (np.ndarray) – the given data set
is_data_sorted (bool) – Should be True if the data set is already sorted (default: False)
dip_value (float) – The Dip-value (default: None)
modal_interval (tuple) – Indices of the modal interval - corresponds to the steepest slope in the ECDF (default: None)
modal_triangle (tuple) – Indices of the modal triangle (default: None)
gcm (np.ndarray) – The indices of points that are part of the Greatest Convex Minorant (gcm) (default: None)
lcm (np.ndarray) – The indices of points that are part of the Least Concave Majorant (lcm) (default: None)
show_legend (bool) – Defines whether the legend of the ECDF plot should be added (default: True)
add_histogram (bool) – Defines whether the histogram should be shown above the ECDF plot (default: True)
histogram_labels (np.ndarray) – Labels used to color parts of the histogram (default: None)
histogram_show_legend (bool) – Defines whether the legend of the histogram should be added (default: True)
histogram_density (bool) – Defines whether a kernel density should be added to the histogram plot (default: True)
histogram_n_bins (int) – Number of bins used for the histogram (default: 100)
height_ratio (tuple) – Defines the height ratio between histogram and ECDF plot. Only relevant if add_histogram is True. First value in the tuple defines the height of the histogram and the second value the height of the ECDF plot (default: (1, 2))
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

clustpy.utils.evaluation module

class clustpy.utils.evaluation.EvaluationAlgorithm(name: str, algorithm: ClusterMixin, params: dict = {}, deterministic: bool = False, preprocess_methods: list | None = None, preprocess_params: list = {})[source]

Bases: object

The EvaluationAlgorithm object is a wrapper for clustering algorithms. It contains all the information necessary to evaluate a data set using the evaluate_dataset or evaluate_multiple_datasets method.

Parameters:

name (str) – Name of the algorithm. Can be chosen freely
algorithm (ClusterMixin) – The actual object of the clustering algorithm
params (dict) – Parameters given to the clustering algorithm (default: {})
deterministic (bool) – Defines if the algorithm produces a deterministic clustering result (e.g. like DBSCAN). In this case the algorithm will only be executed once even though a higher number of repetitions is specified when evaluating a data set (default: False)
preprocess_methods (list) – Specify preprocessing steps performed on each data set before executing the clustering algorithm. Can be either a list of callable functions or a single callable function (default: None)
preprocess_params (list) – List of dictionaries containing the parameters for the preprocessing methods. Needs one entry for each method in preprocess_methods. If only a single preprocessing method is given (instead of a list) a single dictionary is expected (default: {})
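The pairing of preprocess_methods and preprocess_params described above can be pictured with a small sketch. The function `apply_preprocessing` and the two toy preprocessing steps are hypothetical and only illustrate the documented convention (one dictionary per method, or a single dictionary for a single callable); how the library applies the steps internally is an assumption.

```python
def apply_preprocessing(data, preprocess_methods=None, preprocess_params=None):
    """Apply the preprocessing steps one after another, each with its own keyword arguments."""
    if preprocess_methods is None:
        return data
    if callable(preprocess_methods):
        # Single callable: a single dictionary is expected
        methods = [preprocess_methods]
        params = [preprocess_params or {}]
    else:
        # List of callables: one dictionary per method
        methods = list(preprocess_methods)
        params = list(preprocess_params or [{} for _ in methods])
    if len(methods) != len(params):
        raise ValueError("Need one parameter dictionary per preprocessing method")
    for method, kwargs in zip(methods, params):
        data = method(data, **kwargs)
    return data


def add_value(values, value):
    """Toy preprocessing step: shift every entry by a constant."""
    return [v + value for v in values]


def scale(values, factor=1.0):
    """Toy preprocessing step: multiply every entry by a constant."""
    return [v * factor for v in values]


shifted = apply_preprocessing([1, 2, 3], add_value, {"value": 1})
chained = apply_preprocessing([1, 2, 3], [add_value, scale], [{"value": 1}, {"factor": 2}])
```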

Examples

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
from clustpy.data import load_iris
from clustpy.utils.evaluation import (EvaluationAlgorithm, EvaluationDataset, EvaluationMetric,
                                      evaluate_multiple_datasets)

# Assumption: simple preprocessing helper that adds a constant to the data
def _add_value(x, value):
    return x + value

X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
L = np.array([0] * 3 + [1] * 3)
X2 = np.c_[X, L]  # labels appended as the last column

n_repetitions = 2
aggregations = [np.mean, np.std, np.max]

algorithms = [
    EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
    # Assumption: the preprocessing arguments and their values are illustrative
    EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
                        preprocess_methods=[_add_value], preprocess_params=[{"value": 1}]),
    EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]

metrics = [EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
           EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]

datasets = [
    # Assumption: the value passed via preprocess_params is illustrative
    EvaluationDataset(name="iris", data=load_iris, preprocess_methods=[_add_value],
                      preprocess_params=[{"value": 2}]),
    EvaluationDataset(name="X", data=X, labels_true=L),
    EvaluationDataset(name="X2", data=X2, labels_true=-1, ignore_algorithms=["KMeans_with_preprocess"])]

df = evaluate_multiple_datasets(evaluation_datasets=datasets, evaluation_algorithms=algorithms,
                                evaluation_metrics=metrics, n_repetitions=n_repetitions,
                                aggregation_functions=aggregations, add_runtime=True, add_n_clusters=True,
                                save_path=None, save_intermediate_results=False, random_state=1)

class clustpy.utils.evaluation.EvaluationDataset(name: str, data: ndarray, labels_true: ndarray | None = None, file_reader_params: dict = {}, preprocess_methods: list | None = None, preprocess_params: list = {}, ignore_algorithms: list = [])[source]

Bases: object

The EvaluationDataset object is a wrapper for actual data sets. It contains all the information necessary to evaluate a data set using the evaluate_multiple_datasets method.

Parameters:

name (str) – Name of the data set. Can be chosen freely
data (np.ndarray) – The actual data set. Can be a np.ndarray, a path to a data file (of type str) or a callable (e.g. a method from clustpy.data)
labels_true (np.ndarray) – The ground truth labels. Can be a np.ndarray, an int or list specifying which columns of the data contain the labels or None if no ground truth labels are present (default: None)
file_reader_params (dict) – Dictionary containing the information necessary to load a data file. Only relevant if data is of type str (default: {})
preprocess_methods (list) – Specify preprocessing steps before evaluating the data set. Can be either a list of callable functions or a single callable function (default: None)
preprocess_params (list) – List of dictionaries containing the parameters for the preprocessing methods. Needs one entry for each method in preprocess_methods. If only a single preprocessing method is given (instead of a list) a single dictionary is expected (default: {})
ignore_algorithms (list) – List of algorithm names (as specified in the EvaluationAlgorithm object) that should be ignored for this specific data set (default: [])

Examples

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
from clustpy.data import load_iris
from clustpy.utils.evaluation import (EvaluationAlgorithm, EvaluationDataset, EvaluationMetric,
                                      evaluate_multiple_datasets)

# Assumption: simple preprocessing helper that adds a constant to the data
def _add_value(x, value):
    return x + value

X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
L = np.array([0] * 3 + [1] * 3)
X2 = np.c_[X, L]  # labels appended as the last column

n_repetitions = 2
aggregations = [np.mean, np.std, np.max]

algorithms = [
    EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
    # Assumption: the preprocessing arguments and their values are illustrative
    EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
                        preprocess_methods=[_add_value], preprocess_params=[{"value": 1}]),
    EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]

metrics = [EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
           EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]

datasets = [
    # Assumption: the value passed via preprocess_params is illustrative
    EvaluationDataset(name="iris", data=load_iris, preprocess_methods=[_add_value],
                      preprocess_params=[{"value": 2}]),
    EvaluationDataset(name="X", data=X, labels_true=L),
    EvaluationDataset(name="X2", data=X2, labels_true=-1, ignore_algorithms=["KMeans_with_preprocess"])]

df = evaluate_multiple_datasets(evaluation_datasets=datasets, evaluation_algorithms=algorithms,
                                evaluation_metrics=metrics, n_repetitions=n_repetitions,
                                aggregation_functions=aggregations, add_runtime=True, add_n_clusters=True,
                                save_path=None, save_intermediate_results=False, random_state=1)

class clustpy.utils.evaluation.EvaluationMetric(name: str, metric: Callable, params: dict = {}, use_gt: bool = True)[source]

Bases: object

The EvaluationMetric object is a wrapper for evaluation metrics. It contains all the information necessary to evaluate a data set using the evaluate_dataset or evaluate_multiple_datasets method.

Parameters:

name (str) – Name of the metric. Can be chosen freely
metric (Callable) – The actual metric function
params (dict) – Parameters given to the metric function (default: {})
use_gt (bool) – If true, the input to the metric will be the ground truth labels and the predicted labels (e.g. normalized mutual information). If false, the input will be the data and the predicted labels (e.g. silhouette score) (default: True)
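The effect of use_gt can be sketched as a simple dispatch on the first argument passed to the metric. The helper `apply_metric` and both toy metrics are hypothetical and written in pure Python; they only mirror the behavior described above and are not the library's code.

```python
def apply_metric(metric, data, labels_true, labels_pred, params=None, use_gt=True):
    """Call the metric with the ground truth labels (use_gt=True) or with the data (use_gt=False)."""
    params = params or {}
    if use_gt:
        return metric(labels_true, labels_pred, **params)
    return metric(data, labels_pred, **params)


def accuracy(labels_true, labels_pred):
    """Toy external metric: share of identical labels."""
    return sum(t == p for t, p in zip(labels_true, labels_pred)) / len(labels_true)


def within_cluster_spread(data, labels_pred):
    """Toy internal metric for 1d data: mean absolute distance to the cluster mean."""
    total = 0.0
    for label in set(labels_pred):
        members = [x for x, l in zip(data, labels_pred) if l == label]
        center = sum(members) / len(members)
        total += sum(abs(x - center) for x in members)
    return total / len(data)


data = [0.0, 1.0, 10.0, 11.0]
labels_true = [0, 0, 1, 1]
labels_pred = [0, 0, 1, 1]

external = apply_metric(accuracy, data, labels_true, labels_pred, use_gt=True)
internal = apply_metric(within_cluster_spread, data, labels_true, labels_pred, use_gt=False)
```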

Examples

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
from clustpy.data import load_iris
from clustpy.utils.evaluation import (EvaluationAlgorithm, EvaluationDataset, EvaluationMetric,
                                      evaluate_multiple_datasets)

# Assumption: simple preprocessing helper that adds a constant to the data
def _add_value(x, value):
    return x + value

X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
L = np.array([0] * 3 + [1] * 3)
X2 = np.c_[X, L]  # labels appended as the last column

n_repetitions = 2
aggregations = [np.mean, np.std, np.max]

algorithms = [
    EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
    # Assumption: the preprocessing arguments and their values are illustrative
    EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
                        preprocess_methods=[_add_value], preprocess_params=[{"value": 1}]),
    EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]

metrics = [EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
           EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]

datasets = [
    # Assumption: the value passed via preprocess_params is illustrative
    EvaluationDataset(name="iris", data=load_iris, preprocess_methods=[_add_value],
                      preprocess_params=[{"value": 2}]),
    EvaluationDataset(name="X", data=X, labels_true=L),
    EvaluationDataset(name="X2", data=X2, labels_true=-1, ignore_algorithms=["KMeans_with_preprocess"])]

df = evaluate_multiple_datasets(evaluation_datasets=datasets, evaluation_algorithms=algorithms,
                                evaluation_metrics=metrics, n_repetitions=n_repetitions,
                                aggregation_functions=aggregations, add_runtime=True, add_n_clusters=True,
                                save_path=None, save_intermediate_results=False, random_state=1)

clustpy.utils.evaluation.evaluate_dataset(X: ndarray, evaluation_algorithms: list, evaluation_metrics: list | None = None, labels_true: ndarray | None = None, n_repetitions: int = 10, aggregation_functions: list = [np.mean, np.std], add_runtime: bool = True, add_n_clusters: bool = False, save_path: str | None = None, ignore_algorithms: list = [], random_state: RandomState | None = None) → DataFrame[source]

Evaluate the clustering result of different clustering algorithms (as specified by evaluation_algorithms) on a given data set using different metrics (as specified by evaluation_metrics). Each algorithm will be executed n_repetitions times and all specified metrics will be used to evaluate the clustering result. The final result is a pandas DataFrame containing all the information.

Parameters:

X (np.ndarray) – the given data set
evaluation_algorithms (list) – Contains objects of type EvaluationAlgorithm which are wrappers for the clustering algorithms
evaluation_metrics (list) – Contains objects of type EvaluationMetric which are wrappers for the metrics (default: None)
labels_true (np.ndarray) – The ground truth labels of the data set (default: None)
n_repetitions (int) – Number of times that the clustering procedure should be executed on the same data set (default: 10)
aggregation_functions (list) – List of aggregation functions that should be applied to the n_repetitions different results of a single clustering algorithm (default: [np.mean, np.std])
add_runtime (bool) – Add runtime of each execution to the final table (default: True)
add_n_clusters (bool) – Add the resulting number of clusters to the final table (default: False)
save_path (str) – The path where the final DataFrame should be saved as csv. If None, the DataFrame will not be saved (default: None)
ignore_algorithms (list) – List of algorithm names (as specified in the EvaluationAlgorithm object) that should be ignored for this specific data set (default: [])
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

df – The final DataFrame

Return type:

pd.DataFrame
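How n_repetitions, the deterministic flag and aggregation_functions interact can be sketched without any clustering at all. The functions below are hypothetical stand-ins written with the standard library: `noisy_score` and `fixed_score` replace an actual clustering run plus metric evaluation, and `statistics.pstdev` plays the role of np.std (both compute the population standard deviation by default).

```python
import random
import statistics


def run_repetitions(run_once, n_repetitions=10, deterministic=False, seed=None):
    """Execute run_once n_repetitions times, or only once for a deterministic algorithm."""
    rng = random.Random(seed)
    n_runs = 1 if deterministic else n_repetitions
    return [run_once(rng) for _ in range(n_runs)]


def aggregate(scores, aggregation_functions):
    """Apply every aggregation function to the scores of the individual repetitions."""
    return {fn.__name__: fn(scores) for fn in aggregation_functions}


def noisy_score(rng):
    """Stand-in for a randomized algorithm: the score varies between runs."""
    return 0.8 + 0.1 * rng.random()


def fixed_score(rng):
    """Stand-in for a deterministic algorithm: the score never changes."""
    return 0.75


scores = run_repetitions(noisy_score, n_repetitions=5, seed=1)
summary = aggregate(scores, [statistics.mean, statistics.pstdev, max])
single = run_repetitions(fixed_score, n_repetitions=5, deterministic=True)
```

In the resulting table each aggregation function contributes one summary value per algorithm and metric, which is why the aggregation functions must accept a sequence of per-repetition results.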

Examples

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
from clustpy.utils.evaluation import EvaluationAlgorithm, EvaluationMetric, evaluate_dataset

# Assumption: simple preprocessing helper that adds a constant to the data
def _add_value(x, value):
    return x + value

X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
L = np.array([0] * 3 + [1] * 3)

n_repetitions = 2
aggregations = [np.mean, np.std, np.max]

algorithms = [
    EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
    # Assumption: the preprocessing arguments and their values are illustrative
    EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
                        preprocess_methods=[_add_value], preprocess_params=[{"value": 1}]),
    EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]

metrics = [EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
           EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]

df = evaluate_dataset(X=X, evaluation_algorithms=algorithms, evaluation_metrics=metrics, labels_true=L,
                      n_repetitions=n_repetitions, aggregation_functions=aggregations, add_runtime=True,
                      add_n_clusters=True, save_path=None, ignore_algorithms=["KMeans_with_preprocess"],
                      random_state=1)

clustpy.utils.evaluation.evaluate_multiple_datasets(evaluation_datasets: list, evaluation_algorithms: list, evaluation_metrics: list | None = None, n_repetitions: int = 10, aggregation_functions: list = [np.mean, np.std], add_runtime: bool = True, add_n_clusters: bool = False, save_path: str | None = None, save_intermediate_results: bool = False, random_state: RandomState | None = None) → DataFrame[source]

Evaluate the clustering result of different clustering algorithms (as specified by evaluation_algorithms) on a set of data sets (as specified by evaluation_datasets) using different metrics (as specified by evaluation_metrics). Each algorithm will be executed n_repetitions times and all specified metrics will be used to evaluate the clustering result. The final result is a pandas DataFrame containing all the information.

Parameters:

evaluation_datasets (list) – Contains objects of type EvaluationDataset which are wrappers for the data sets
evaluation_algorithms (list) – Contains objects of type EvaluationAlgorithm which are wrappers for the clustering algorithms
evaluation_metrics (list) – Contains objects of type EvaluationMetric which are wrappers for the metrics (default: None)
n_repetitions (int) – Number of times that the clustering procedure should be executed on the same data set (default: 10)
aggregation_functions (list) – List of aggregation functions that should be applied to the n_repetitions different results of a single clustering algorithm (default: [np.mean, np.std])
add_runtime (bool) – Add runtime of each execution to the final table (default: True)
add_n_clusters (bool) – Add the resulting number of clusters to the final table (default: False)
save_path (str) – The path where the final DataFrame should be saved as csv. If None, the DataFrame will not be saved (default: None)
save_intermediate_results (bool) – Defines whether the result of each data set should be separately saved. Useful if the evaluation takes a lot of time (default: False)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

df – The final DataFrame

Return type:

pd.DataFrame

Examples

See the readme.md

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
from clustpy.data import load_iris
from clustpy.utils.evaluation import (EvaluationAlgorithm, EvaluationDataset, EvaluationMetric,
                                      evaluate_multiple_datasets)

# Assumption: simple preprocessing helper that adds a constant to the data
def _add_value(x, value):
    return x + value

X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
L = np.array([0] * 3 + [1] * 3)
X2 = np.c_[X, L]  # labels appended as the last column

n_repetitions = 2
aggregations = [np.mean, np.std, np.max]

algorithms = [
    EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
    # Assumption: the preprocessing arguments and their values are illustrative
    EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
                        preprocess_methods=[_add_value], preprocess_params=[{"value": 1}]),
    EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]

metrics = [EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
           EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]

datasets = [
    # Assumption: the value passed via preprocess_params is illustrative
    EvaluationDataset(name="iris", data=load_iris, preprocess_methods=[_add_value],
                      preprocess_params=[{"value": 2}]),
    EvaluationDataset(name="X", data=X, labels_true=L),
    EvaluationDataset(name="X2", data=X2, labels_true=-1, ignore_algorithms=["KMeans_with_preprocess"])]

df = evaluate_multiple_datasets(evaluation_datasets=datasets, evaluation_algorithms=algorithms,
                                evaluation_metrics=metrics, n_repetitions=n_repetitions,
                                aggregation_functions=aggregations, add_runtime=True, add_n_clusters=True,
                                save_path=None, save_intermediate_results=False, random_state=1)

clustpy.utils.plots module

clustpy.utils.plots.plot_1d_data(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, show_legend: bool = True, show_plot: bool = True) → None[source]

Plot a one-dimensional data set.

Parameters:

X (np.ndarray) – the given data set
labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)
centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)
true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)
show_legend (bool) – Defines whether a legend should be shown (default: True)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

clustpy.utils.plots.plot_2d_data(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, show_legend: bool = True, scattersize: int = 10, equal_axis: bool = False, container: Axes = matplotlib.pyplot, show_plot: bool = True) → None[source]

Plot a two-dimensional data set.

Parameters:

X (np.ndarray) – the given data set
labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)
centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)
true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)
show_legend (bool) – Defines whether a legend should be shown (default: True)
scattersize (float) – The size of the scatters (default: 10)
equal_axis (bool) – Defines whether the axes are to be scaled to the same value range (default: False)
container (plt.Axes) – The container to which the scatter plot is added. If another container is defined, show_plot should usually be False (default: matplotlib.pyplot)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

clustpy.utils.plots.plot_3d_data(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, show_legend: bool = True, scattersize: int = 10, show_plot: bool = True) → None[source]

Plot a three-dimensional data set.

Parameters:

X (np.ndarray) – the given data set
labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)
centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)
true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)
show_legend (bool) – Defines whether a legend should be shown (default: True)
scattersize (float) – The size of the scatters (default: 10)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

clustpy.utils.plots.plot_histogram(X: ndarray, labels: ndarray | None = None, density: bool = True, n_bins: int = 100, show_legend: bool = True, container: Axes = matplotlib.pyplot, show_plot: bool = True) → None[source]

Plot a histogram.

Parameters:

X (np.ndarray) – the given data set
labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)
density (bool) – Defines whether a kernel density should be added to the histogram (default: True)
n_bins (int) – Number of bins (default: 100)
show_legend (bool) – Defines whether the legend of the histogram should be shown (default: True)
container (plt.Axes) – The container to which the histogram is added. If another container is defined, show_plot should usually be False (default: matplotlib.pyplot)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

clustpy.utils.plots.plot_image(img_data: ndarray, black_and_white: bool = False, image_shape: tuple | None = None, max_value: float | None = None, min_value: float | None = None, show_plot: bool = True) → None[source]

Plot an image. Expects a color image to occur in the HWC representation (height, width, color channels).

Parameters:

img_data (np.ndarray) – The image data
black_and_white (bool) – Specifies whether the image should be plotted in grayscale colors. Only relevant for images without color channels (default: False)
image_shape (tuple) – (height, width) for grayscale images or (height, width, number of channels) for color images (default: None)
max_value (float) – Maximum pixel value, used for min-max normalization. Is often 255. If None, the maximum value in the data set will be used (default: None)
min_value (float) – Minimum pixel value, used for min-max normalization. Is often 0. If None, the minimum value in the data set will be used (default: None)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)
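The min-max normalization mentioned for max_value and min_value maps every pixel value to the range [0, 1]. A minimal sketch on a flat list of pixel values follows; the function name is hypothetical and the handling of a constant image (all zeros) is an assumption made only to avoid a division by zero.

```python
def min_max_normalize(pixels, min_value=None, max_value=None):
    """Scale pixel values to [0, 1]; missing bounds are taken from the data itself."""
    lo = min(pixels) if min_value is None else min_value
    hi = max(pixels) if max_value is None else max_value
    if hi == lo:
        # Constant image: return zeros instead of dividing by zero
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]


fixed = min_max_normalize([0, 51, 255], min_value=0, max_value=255)
inferred = min_max_normalize([10, 20, 30])
```

Passing fixed bounds such as 0 and 255 keeps the brightness of different images comparable, whereas bounds inferred from the data stretch each image to the full range.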

Examples

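from clustpy.data import load_nrletters, load_optdigits
from clustpy.utils.plots import plot_image

# Color image in HWC representation: 9 x 7 pixels with 3 channels
X, _ = load_nrletters()
plot_image(X[0], False, (9, 7, 3), 255, 0, show_plot=True)

# Image without color channels: 8 x 8 pixels, plotted in grayscale colors
X, _ = load_optdigits()
plot_image(X[0], True, (8, 8), 255, 0, show_plot=True)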

clustpy.utils.plots.plot_scatter_matrix(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, density: bool = True, n_bins: int = 100, show_legend: bool = True, scattersize: int = 10, equal_axis: bool = False, max_dimensions: int = 10, show_plot: bool = True) → Axes[source]

Create a scatter matrix plot. Visualizes a 2d scatter plot for each combination of features. The diagonal shows a histogram of each single feature.

Parameters:

X (np.ndarray) – the given data set
labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)
centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)
true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)
density (bool) – Defines whether a kernel density should be added to the histogram (default: True)
n_bins (int) – Number of bins used for the histogram (default: 100)
show_legend (bool) – Defines whether a legend should be shown (default: True)
scattersize (float) – The size of the scatters (default: 10)
equal_axis (bool) – Defines whether the axes are to be scaled to the same value range (default: False)
max_dimensions (int) – Maximum number of dimensions that should be plotted. This value is intended to prevent the creation of overly complex plots that are confusing and take a long time to create (default: 10)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

Returns:

axes – The used matplotlib axes

Return type:

plt.Axes

clustpy.utils.plots.plot_with_transformation(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, plot_dimensionality: int = 2, transformation_class: TransformerMixin = PCA, show_legend: bool = True, scattersize: int = 10, equal_axis: bool = False, show_plot: bool = True) → None[source]

In data science, it is common to work with high-dimensional data, which cannot be visualized directly. Therefore, a dimensionality reduction technique is often applied before a plot is created. Examples of such techniques are PCA, ICA, t-SNE and UMAP. Note that the chosen technique must provide a ‘fit_transform’ method.

This method automatically executes the aforementioned pipeline: first it reduces the dimensionality, then it creates a plot adjusted to the number of features. Up to three dimensions are visualized with scatter plots. For more than three dimensions, a scatter matrix plot is used.

Parameters:

X (np.ndarray) – the given data set
labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)
centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)
true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)
plot_dimensionality (int) – The dimensionality of the feature space after the dimensionality reduction technique has been applied (default: 2)
transformation_class (TransformerMixin) – The transformation class / dimensionality reduction technique (default: sklearn.decomposition.PCA)
show_legend (bool) – Defines whether a legend should be shown (default: True)
scattersize (float) – The size of the scatters (default: 10)
equal_axis (bool) – Defines whether the axes are to be scaled to the same value range (default: False)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)
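The choice of plot type described above can be sketched as follows. This is a minimal illustration, not the library's code: the helper name `choose_plot_function` is hypothetical, and mapping one, two and three dimensions to `plot_1d_data`, `plot_2d_data` and `plot_3d_data` is an assumption based on the functions documented in this module.

```python
def choose_plot_function(plot_dimensionality: int) -> str:
    """Return the name of the plotting routine matching the reduced dimensionality."""
    if plot_dimensionality < 1:
        raise ValueError("plot_dimensionality must be at least 1")
    # Assumption: up to three dimensions get a dedicated scatter plot
    dedicated = {1: "plot_1d_data", 2: "plot_2d_data", 3: "plot_3d_data"}
    # Anything above three dimensions falls back to the scatter matrix
    return dedicated.get(plot_dimensionality, "plot_scatter_matrix")


for dims in (1, 2, 3, 5):
    print(dims, choose_plot_function(dims))
```

Returning a name instead of calling a plotting routine keeps the sketch free of matplotlib, so it only shows the branching on plot_dimensionality.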

Module contents

class clustpy.utils.EvaluationAlgorithm(name: str, algorithm: ClusterMixin, params: dict = {}, deterministic: bool = False, preprocess_methods: list | None = None, preprocess_params: list = {})[source]

Bases: object

The EvaluationAlgorithm object is a wrapper for clustering algorithms. It contains all the information necessary to evaluate a data set using the evaluate_dataset or evaluate_multiple_datasets method.

Parameters:

name (str) – Name of the algorithm. Can be chosen freely
algorithm (ClusterMixin) – The actual object of the clustering algorithm
params (dict) – Parameters given to the clustering algorithm (default: {})
deterministic (bool) – Defines if the algorithm produces a deterministic clustering result (e.g. like DBSCAN). In this case the algorithm will only be executed once even though a higher number of repetitions is specified when evaluating a data set (default: False)
preprocess_methods (list) – Specify preprocessing steps performed on each data set before executing the clustering algorithm. Can be either a list of callable functions or a single callable function (default: None)
preprocess_params (list) – List of dictionaries containing the parameters for the preprocessing methods. Needs one entry for each method in preprocess_methods. If only a single preprocessing method is given (instead of a list) a single dictionary is expected (default: {})

Examples

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
from clustpy.data import load_iris
from clustpy.utils import EvaluationAlgorithm, EvaluationDataset, EvaluationMetric, evaluate_multiple_datasets

X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
L = np.array([0] * 3 + [1] * 3)
X2 = np.c_[X, L]

n_repetitions = 2
aggregations = [np.mean, np.std, np.max]

algorithms = [
    EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
    EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
    EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]

metrics = [
    EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
    EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]

datasets = [
    EvaluationDataset(name="iris", data=load_iris, preprocess_methods=[_add_value],
    EvaluationDataset(name="X", data=X, labels_true=L),
    EvaluationDataset(name="X2", data=X2, labels_true=-1, ignore_algorithms=["KMeans_with_preprocess"])]

df = evaluate_multiple_datasets(evaluation_datasets=datasets, evaluation_algorithms=algorithms,
    evaluation_metrics=metrics, n_repetitions=n_repetitions, aggregation_functions=aggregations,
    add_runtime=True, add_n_clusters=True, save_path=None, save_intermediate_results=False, random_state=1)
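The preprocess_methods and preprocess_params parameters accept either a single callable with a single dictionary or a list of callables with one dictionary each. The sketch below shows one way such input could be normalized and applied in order. It is an illustration under assumptions, not the library's implementation: `apply_preprocessing` is a hypothetical helper, `add_value` is a hypothetical stand-in for the `_add_value` function referenced in the example (its definition is not shown in this documentation), and plain nested lists replace NumPy arrays to keep the sketch dependency-free.

```python
def apply_preprocessing(X, preprocess_methods=None, preprocess_params=None):
    """Apply one or several preprocessing callables to X, in order."""
    if preprocess_methods is None:
        return X
    # A single callable is treated like a one-element list
    if callable(preprocess_methods):
        preprocess_methods = [preprocess_methods]
    if not preprocess_params:
        # No parameters given: call every method with its defaults
        preprocess_params = [{} for _ in preprocess_methods]
    elif isinstance(preprocess_params, dict):
        # A single dictionary belongs to a single method
        preprocess_params = [preprocess_params]
    if len(preprocess_methods) != len(preprocess_params):
        raise ValueError("preprocess_params needs one entry per preprocessing method")
    for method, params in zip(preprocess_methods, preprocess_params):
        X = method(X, **params)
    return X


def add_value(X, value=1):
    """Hypothetical preprocessing step: shift every entry by a constant."""
    return [[entry + value for entry in row] for row in X]


print(apply_preprocessing([[0, 0], [1, 1]], add_value, {"value": 2}))
print(apply_preprocessing([[0]], [add_value, add_value], [{"value": 1}, {"value": 3}]))
```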

class clustpy.utils.EvaluationDataset(name: str, data: ndarray, labels_true: ndarray | None = None, file_reader_params: dict = {}, preprocess_methods: list | None = None, preprocess_params: list = {}, ignore_algorithms: list = [])[source]

Bases: object

The EvaluationDataset object is a wrapper for actual data sets. It contains all the information necessary to evaluate a data set using the evaluate_multiple_datasets method.

Parameters:

name (str) – Name of the data set. Can be chosen freely
data (np.ndarray) – The actual data set. Can be a np.ndarray, a path to a data file (of type str) or a callable (e.g. a method from clustpy.data)
labels_true (np.ndarray) – The ground truth labels. Can be a np.ndarray, an int or list specifying which columns of the data contain the labels or None if no ground truth labels are present (default: None)
file_reader_params (dict) – Dictionary containing the information necessary to load a data file. Only relevant if data is of type str (default: {})
preprocess_methods (list) – Specify preprocessing steps before evaluating the data set. Can be either a list of callable functions or a single callable function (default: None)
preprocess_params (list) – List of dictionaries containing the parameters for the preprocessing methods. Needs one entry for each method in preprocess_methods. If only a single preprocessing method is given (instead of a list) a single dictionary is expected (default: {})
ignore_algorithms (list) – List of algorithm names (as specified in the EvaluationAlgorithm object) that should be ignored for this specific data set (default: [])

Examples

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
from clustpy.data import load_iris
from clustpy.utils import EvaluationAlgorithm, EvaluationDataset, EvaluationMetric, evaluate_multiple_datasets

X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
L = np.array([0] * 3 + [1] * 3)
X2 = np.c_[X, L]

n_repetitions = 2
aggregations = [np.mean, np.std, np.max]

algorithms = [
    EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
    EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
    EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]

metrics = [
    EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
    EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]

datasets = [
    EvaluationDataset(name="iris", data=load_iris, preprocess_methods=[_add_value],
    EvaluationDataset(name="X", data=X, labels_true=L),
    EvaluationDataset(name="X2", data=X2, labels_true=-1, ignore_algorithms=["KMeans_with_preprocess"])]

df = evaluate_multiple_datasets(evaluation_datasets=datasets, evaluation_algorithms=algorithms,
    evaluation_metrics=metrics, n_repetitions=n_repetitions, aggregation_functions=aggregations,
    add_runtime=True, add_n_clusters=True, save_path=None, save_intermediate_results=False, random_state=1)
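When labels_true is an int or a list, it names the columns of the data that contain the labels, as with labels_true=-1 for X2 in the example. The sketch below shows what such a split could look like. It is only an illustration: `split_labels` is a hypothetical helper, and removing the label columns from the feature data is an assumption that this documentation does not confirm.

```python
def split_labels(data, label_columns):
    """Separate the label column(s) from a row-wise data set."""
    if isinstance(label_columns, int):
        label_columns = [label_columns]
    n_columns = len(data[0])
    # Normalize negative indices such as -1 (the last column)
    label_idx = {column % n_columns for column in label_columns}
    features = [[v for i, v in enumerate(row) if i not in label_idx] for row in data]
    labels = [[row[i] for i in sorted(label_idx)] for row in data]
    # A single label column is flattened to a plain list
    if len(label_idx) == 1:
        labels = [row_labels[0] for row_labels in labels]
    return features, labels


# Two feature columns followed by one label column, similar to X2 above
X2 = [[0, 0, 0], [1, 1, 0], [5, 5, 1], [6, 6, 1]]
features, labels = split_labels(X2, -1)
print(features)
print(labels)
```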

class clustpy.utils.EvaluationMetric(name: str, metric: Callable, params: dict = {}, use_gt: bool = True)[source]

Bases: object

The EvaluationMetric object is a wrapper for evaluation metrics. It contains all the information necessary to evaluate a data set using the evaluate_dataset or evaluate_multiple_datasets method.

Parameters:

name (str) – Name of the metric. Can be chosen freely
metric (Callable) – The actual metric function
params (dict) – Parameters given to the metric function (default: {})
use_gt (bool) – If true, the input to the metric will be the ground truth labels and the predicted labels (e.g. normalized mutual information). If false, the input will be the data and the predicted labels (e.g. silhouette score) (default: True)

Examples

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
from clustpy.data import load_iris
from clustpy.utils import EvaluationAlgorithm, EvaluationDataset, EvaluationMetric, evaluate_multiple_datasets

X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
L = np.array([0] * 3 + [1] * 3)
X2 = np.c_[X, L]

n_repetitions = 2
aggregations = [np.mean, np.std, np.max]

algorithms = [
    EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
    EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
    EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]

metrics = [
    EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
    EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]

datasets = [
    EvaluationDataset(name="iris", data=load_iris, preprocess_methods=[_add_value],
    EvaluationDataset(name="X", data=X, labels_true=L),
    EvaluationDataset(name="X2", data=X2, labels_true=-1, ignore_algorithms=["KMeans_with_preprocess"])]

df = evaluate_multiple_datasets(evaluation_datasets=datasets, evaluation_algorithms=algorithms,
    evaluation_metrics=metrics, n_repetitions=n_repetitions, aggregation_functions=aggregations,
    add_runtime=True, add_n_clusters=True, save_path=None, save_intermediate_results=False, random_state=1)
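The use_gt flag decides which first argument a metric receives. The following sketch makes that dispatch explicit. It is a simplified illustration, not the library's code: `apply_metric` is a hypothetical helper, and `accuracy` and `cluster_ratio` are toy stand-ins for an external metric (such as normalized mutual information) and an internal metric (such as the silhouette score).

```python
def apply_metric(metric, params, use_gt, X, labels_true, labels_pred):
    """Call a metric with ground truth labels or with the data, depending on use_gt."""
    if use_gt:
        # External metric: compares ground truth labels with predicted labels
        return metric(labels_true, labels_pred, **params)
    # Internal metric: judges the predicted labels using the data alone
    return metric(X, labels_pred, **params)


def accuracy(labels_true, labels_pred):
    """Toy external metric: fraction of identical label entries."""
    matches = sum(t == p for t, p in zip(labels_true, labels_pred))
    return matches / len(labels_true)


def cluster_ratio(X, labels_pred):
    """Toy internal metric: number of clusters divided by number of points."""
    return len(set(labels_pred)) / len(X)


X = [[0, 0], [1, 1], [5, 5], [6, 6]]
labels_true = [0, 0, 1, 1]
labels_pred = [0, 0, 1, 0]
print(apply_metric(accuracy, {}, True, X, labels_true, labels_pred))
print(apply_metric(cluster_ratio, {}, False, X, labels_true, labels_pred))
```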

clustpy.utils.dip_boot_samples(n_points: int, n_boots: int = 1000, random_state: RandomState | None = None) → ndarray[source]

Sample random data sets and calculate corresponding Dip-values. E.g. used to determine p-values.

Parameters:

n_points (int) – The number of samples
n_boots (int) – Number of random data sets that should be created to calculate Dip-values (default: 1000)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

boot_dips – Array of Dip-values

Return type:

np.ndarray

clustpy.utils.dip_gradient(X: ndarray, X_proj: ndarray, sorted_indices: ndarray, modal_triangle: tuple) → ndarray[source]

Calculate the gradient of the Dip-value regarding the projection axis.

Parameters:

X (np.ndarray) – the given data set
X_proj (np.ndarray) – The univariate projected data set
sorted_indices (np.ndarray) – The indices of the sorted univariate data set
modal_triangle (tuple) – Indices of the modal triangle

Returns:

gradient – The gradient of the Dip-value regarding the projection axis

Return type:

np.ndarray

References

Krause, Andreas, and Volkmar Liebscher. “Multimodal projection pursuit using the dip statistic.” (2005).

clustpy.utils.dip_pval(dip_value: float, n_points: int, pval_strategy: str = 'table', n_boots: int = 1000, random_state: RandomState | None = None) → float[source]

Get the p-value of a corresponding Dip-value. P-values depend on the input Dip-value and the sample size. There are several strategies to calculate the p-value. These are: ‘table’ (most common), ‘function’ (available for all sample sizes) and ‘bootstrap’ (slow for large sample sizes)

Parameters:

dip_value (float) – The Dip-value
n_points (int) – The number of samples
pval_strategy (str) – Specifies the strategy that should be used to calculate the p-value (default: ‘table’)
n_boots (int) – Number of random data sets that should be created to calculate Dip-values. Only relevant if pval_strategy is ‘bootstrap’ (default: 1000)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int. Only relevant if pval_strategy is ‘bootstrap’ (default: None)

Returns:

pval – The resulting p-value

Return type:

float

References

Hartigan, John A., and Pamela M. Hartigan. “The dip test of unimodality.” The Annals of Statistics (1985): 70-84.

and

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.
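For the ‘bootstrap’ strategy, a p-value can be estimated by comparing the observed Dip-value with Dip-values of random reference data sets, such as those returned by dip_boot_samples. The sketch below shows the usual empirical estimate. It is an assumption-laden illustration: `bootstrap_pvalue` is a hypothetical helper, the reference values are made-up toy numbers, and whether the library counts ties with >= or > is not stated in this documentation.

```python
def bootstrap_pvalue(dip_value, boot_dips):
    """Fraction of reference Dip-values that are at least as large as the observed one."""
    exceeding = sum(1 for boot_dip in boot_dips if boot_dip >= dip_value)
    return exceeding / len(boot_dips)


# Toy reference Dip-values (hypothetical, not produced by dip_boot_samples)
boot_dips = [0.02, 0.03, 0.025, 0.04, 0.05, 0.035, 0.045, 0.03, 0.02, 0.06]
print(bootstrap_pvalue(0.045, boot_dips))
```

A larger n_boots gives a finer resolution of the estimate, at the price of the runtime noted above for large sample sizes.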

clustpy.utils.dip_pval_gradient(X: ndarray, X_proj: ndarray, sorted_indices: ndarray, modal_triangle: tuple, dip_value: float) → ndarray[source]

Calculate the gradient of the Dip p-value function regarding the projection axis.

Parameters:

X (np.ndarray) – the given data set
X_proj (np.ndarray) – The univariate projected data set
sorted_indices (np.ndarray) – The indices of the sorted univariate data set
modal_triangle (tuple) – Indices of the modal triangle
dip_value (float) – The Dip-value

Returns:

pval_grad – The gradient of the Dip p-value function regarding the projection axis

Return type:

np.ndarray

References

Bauer, Lena, et al. “Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2023.

clustpy.utils.dip_test(X: ndarray, just_dip: bool = True, is_data_sorted: bool = False, return_gcm_lcm_mn_mj: bool = False, use_c: bool = True, debug: bool = False) → (float, tuple, tuple, ndarray, ndarray, ndarray, ndarray)[source]

Calculate the Dip-value. This can either be done using the C implementation or the Python version. If just_dip is False, additional values are returned alongside the Dip-value: the modal interval (indices of the beginning and end of the steepest slope of the ECDF) and the modal triangle (used to calculate the gradient of the Dip-value). Further, the indices of the Greatest Convex Minorant (gcm) and the Least Concave Majorant (lcm) as well as the minorant and majorant values can be returned by setting return_gcm_lcm_mn_mj to True. Note that modal_triangle can be (-1,-1,-1) if the triangle could not be determined correctly.

Parameters:

X (np.ndarray) – the given univariate data set
just_dip (bool) – Defines whether only the Dip-value should be returned or also the modal interval and modal triangle (default: True)
is_data_sorted (bool) – Should be True if the data set is already sorted (default: False)
return_gcm_lcm_mn_mj (bool) – Defines whether the gcm, lcm, mn and mj arrays should be returned. In this case just_dip must be False (default: False)
use_c (bool) – Defines whether the C implementation should be used (default: True)
debug (bool) – If true, additional information will be printed to the console (default: False)

Returns:

tuple – The following values, in this order:
The resulting Dip-value
The indices of the modal_interval - corresponds to the steepest slope in the ECDF (if just_dip is False)
The indices of the modal triangle (if just_dip is False)
The indices of points that are part of the Greatest Convex Minorant (gcm) (if just_dip is False and return_gcm_lcm_mn_mj is True)
The indices of points that are part of the Least Concave Majorant (lcm) (if just_dip is False and return_gcm_lcm_mn_mj is True)
The minorant values (if just_dip is False and return_gcm_lcm_mn_mj is True)
The majorant values (if just_dip is False and return_gcm_lcm_mn_mj is True)

Return type:

(float, tuple, tuple, np.ndarray, np.ndarray, np.ndarray, np.ndarray)

References

Hartigan, John A., and Pamela M. Hartigan. “The dip test of unimodality.” The Annals of Statistics (1985): 70-84.

and

Hartigan, P. M. “Computation of the dip statistic to test for unimodality: Algorithm AS 217.” Applied Statistics 34.3 (1985): 320-325.
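The modal interval, gcm and lcm above are all defined relative to the empirical cumulative distribution function (ECDF) of the sorted data. As a reminder of what that function looks like, here is a minimal sketch. It only builds the ECDF and does not compute the Dip-value; `ecdf` is a hypothetical helper and not part of the documented API.

```python
def ecdf(values):
    """Return the ECDF as a list of (x, F(x)) pairs for the sorted sample."""
    sorted_values = sorted(values)
    n = len(sorted_values)
    # After the i-th smallest point (1-based), a fraction i / n of the sample is covered
    return [(x, (i + 1) / n) for i, x in enumerate(sorted_values)]


for x, cumulative in ecdf([3.0, 1.0, 2.0, 2.5]):
    print(x, cumulative)
```

Passing already sorted data together with is_data_sorted=True lets dip_test skip the sorting step that this sketch performs explicitly.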

clustpy.utils.evaluate_dataset(X: ndarray, evaluation_algorithms: list, evaluation_metrics: list | None = None, labels_true: ndarray | None = None, n_repetitions: int = 10, aggregation_functions: list = [np.mean, np.std], add_runtime: bool = True, add_n_clusters: bool = False, save_path: str | None = None, ignore_algorithms: list = [], random_state: RandomState | None = None) → DataFrame[source]

Evaluate the clustering result of different clustering algorithms (as specified by evaluation_algorithms) on a given data set using different metrics (as specified by evaluation_metrics). Each algorithm will be executed n_repetitions times and all specified metrics will be used to evaluate the clustering result. The final result is a pandas DataFrame containing all the information.

Parameters:

X (np.ndarray) – the given data set
evaluation_algorithms (list) – Contains objects of type EvaluationAlgorithm which are wrappers for the clustering algorithms
evaluation_metrics (list) – Contains objects of type EvaluationMetric which are wrappers for the metrics (default: None)
labels_true (np.ndarray) – The ground truth labels of the data set (default: None)
n_repetitions (int) – Number of times that the clustering procedure should be executed on the same data set (default: 10)
aggregation_functions (list) – List of aggregation functions that should be applied to the n_repetitions different results of a single clustering algorithm (default: [np.mean, np.std])
add_runtime (bool) – Add runtime of each execution to the final table (default: True)
add_n_clusters (bool) – Add the resulting number of clusters to the final table (default: False)
save_path (str) – The path where the final DataFrame should be saved as csv. If None, the DataFrame will not be saved (default: None)
ignore_algorithms (list) – List of algorithm names (as specified in the EvaluationAlgorithm object) that should be ignored for this specific data set (default: [])
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

df – The final DataFrame

Return type:

pd.DataFrame

Examples

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
from clustpy.utils import EvaluationAlgorithm, EvaluationMetric, evaluate_dataset

X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
L = np.array([0] * 3 + [1] * 3)

n_repetitions = 2
aggregations = [np.mean, np.std, np.max]

algorithms = [
    EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
    EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
    EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]

metrics = [
    EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
    EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]

df = evaluate_dataset(X=X, evaluation_algorithms=algorithms, evaluation_metrics=metrics, labels_true=L,
    n_repetitions=n_repetitions, aggregation_functions=aggregations, add_runtime=True, add_n_clusters=True,
    save_path=None, ignore_algorithms=["KMeans_with_preprocess"], random_state=1)
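The aggregation_functions are applied to the n_repetitions results that one algorithm produces for one metric. The sketch below shows that step in isolation. It is a hedged illustration rather than the library's code: `aggregate_runs` is a hypothetical helper, the scores are toy values, and functions from the standard library stand in for np.mean, np.std and np.max.

```python
import statistics


def aggregate_runs(values, aggregation_functions):
    """Apply every aggregation function to the per-run results of one metric."""
    # Key each aggregate by the function's name, e.g. "mean" or "max"
    return {func.__name__: func(values) for func in aggregation_functions}


# Hypothetical scores from n_repetitions = 2 runs of one algorithm
scores = [0.8, 1.0]
# statistics.pstdev is the population standard deviation, like np.std with its default ddof=0
summary = aggregate_runs(scores, [statistics.mean, statistics.pstdev, max])
print(summary)
```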

clustpy.utils.evaluate_multiple_datasets(evaluation_datasets: list, evaluation_algorithms: list, evaluation_metrics: list | None = None, n_repetitions: int = 10, aggregation_functions: list = [np.mean, np.std], add_runtime: bool = True, add_n_clusters: bool = False, save_path: str | None = None, save_intermediate_results: bool = False, random_state: RandomState | None = None) → DataFrame[source]

Evaluate the clustering result of different clustering algorithms (as specified by evaluation_algorithms) on a set of data sets (as specified by evaluation_datasets) using different metrics (as specified by evaluation_metrics). Each algorithm will be executed n_repetitions times and all specified metrics will be used to evaluate the clustering result. The final result is a pandas DataFrame containing all the information.

Parameters:

evaluation_datasets (list) – Contains objects of type EvaluationDataset which are wrappers for the data sets
evaluation_algorithms (list) – Contains objects of type EvaluationAlgorithm which are wrappers for the clustering algorithms
evaluation_metrics (list) – Contains objects of type EvaluationMetric which are wrappers for the metrics (default: None)
n_repetitions (int) – Number of times that the clustering procedure should be executed on the same data set (default: 10)
aggregation_functions (list) – List of aggregation functions that should be applied to the n_repetitions different results of a single clustering algorithm (default: [np.mean, np.std])
add_runtime (bool) – Add runtime of each execution to the final table (default: True)
add_n_clusters (bool) – Add the resulting number of clusters to the final table (default: False)
save_path (str) – The path where the final DataFrame should be saved as csv. If None, the DataFrame will not be saved (default: None)
save_intermediate_results (bool) – Defines whether the result of each data set should be separately saved. Useful if the evaluation takes a lot of time (default: False)
random_state (np.random.RandomState) – use a fixed random state to get a repeatable solution. Can also be of type int (default: None)

Returns:

df – The final DataFrame

Return type:

pd.DataFrame

Examples

See the readme.md

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score as silhouette
from clustpy.data import load_iris
from clustpy.utils import EvaluationAlgorithm, EvaluationDataset, EvaluationMetric, evaluate_multiple_datasets

X = np.array([[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]])
L = np.array([0] * 3 + [1] * 3)
X2 = np.c_[X, L]

n_repetitions = 2
aggregations = [np.mean, np.std, np.max]

algorithms = [
    EvaluationAlgorithm(name="KMeans", algorithm=KMeans, params={"n_clusters": 2}),
    EvaluationAlgorithm(name="KMeans_with_preprocess", algorithm=KMeans, params={"n_clusters": 2},
    EvaluationAlgorithm(name="DBSCAN", algorithm=DBSCAN, params={"eps": 0.5, "min_samples": 2}, deterministic=True)]

metrics = [
    EvaluationMetric(name="nmi", metric=nmi, params={"average_method": "geometric"}, use_gt=True),
    EvaluationMetric(name="silhouette", metric=silhouette, use_gt=False)]

datasets = [
    EvaluationDataset(name="iris", data=load_iris, preprocess_methods=[_add_value],
    EvaluationDataset(name="X", data=X, labels_true=L),
    EvaluationDataset(name="X2", data=X2, labels_true=-1, ignore_algorithms=["KMeans_with_preprocess"])]

df = evaluate_multiple_datasets(evaluation_datasets=datasets, evaluation_algorithms=algorithms,
    evaluation_metrics=metrics, n_repetitions=n_repetitions, aggregation_functions=aggregations,
    add_runtime=True, add_n_clusters=True, save_path=None, save_intermediate_results=False, random_state=1)
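Each data set can exclude individual algorithms through its ignore_algorithms list. The sketch below enumerates which (data set, algorithm) pairs would be evaluated for the names used in the example above. It is a simplified illustration: `plan_runs` is a hypothetical helper, and plain tuples stand in for the EvaluationDataset objects.

```python
def plan_runs(datasets, algorithm_names):
    """List the (data set, algorithm) pairs that remain after applying ignore lists."""
    plan = []
    for dataset_name, ignore_algorithms in datasets:
        for algorithm_name in algorithm_names:
            # Skip algorithms that this data set explicitly excludes
            if algorithm_name in ignore_algorithms:
                continue
            plan.append((dataset_name, algorithm_name))
    return plan


# (name, ignore_algorithms) pairs mirroring the data sets in the example above
datasets = [("iris", []), ("X", []), ("X2", ["KMeans_with_preprocess"])]
algorithm_names = ["KMeans", "KMeans_with_preprocess", "DBSCAN"]
for pair in plan_runs(datasets, algorithm_names):
    print(pair)
```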

clustpy.utils.plot_1d_data(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, show_legend: bool = True, show_plot: bool = True) → None[source]

Plot a one-dimensional data set.

Parameters:

X (np.ndarray) – the given data set
labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)
centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)
true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)
show_legend (bool) – Defines whether a legend should be shown (default: True)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

clustpy.utils.plot_2d_data(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, show_legend: bool = True, scattersize: int = 10, equal_axis: bool = False, container: Axes = matplotlib.pyplot, show_plot: bool = True) → None[source]

Plot a two-dimensional data set.

Parameters:

X (np.ndarray) – the given data set
labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)
centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)
true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)
show_legend (bool) – Defines whether a legend should be shown (default: True)
scattersize (float) – The size of the scatters (default: 10)
equal_axis (bool) – Defines whether the axes are to be scaled to the same value range (default: False)
container (plt.Axes) – The container to which the scatter plot is added. If another container is defined, show_plot should usually be False (default: matplotlib.pyplot)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

clustpy.utils.plot_3d_data(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, show_legend: bool = True, scattersize: int = 10, show_plot: bool = True) → None[source]

Plot a three-dimensional data set.

Parameters:

X (np.ndarray) – the given data set
labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)
centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)
true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)
show_legend (bool) – Defines whether a legend should be shown (default: True)
scattersize (float) – The size of the scatters (default: 10)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

clustpy.utils.plot_dip(X: ndarray, is_data_sorted: bool = False, dip_value: float | None = None, modal_interval: tuple | None = None, modal_triangle: tuple | None = None, gcm: ndarray | None = None, lcm: ndarray | None = None, show_legend: bool = True, add_histogram: bool = True, histogram_labels: ndarray | None = None, histogram_show_legend: bool = True, histogram_density: bool = True, histogram_n_bins: int = 100, height_ratio: tuple = (1, 2), show_plot: bool = True) → None[source]

Plot a visual representation of the computational process of the Dip. Upper part shows an optional histogram of the data and the lower part shows the corresponding ECDF.

Parameters:

X (np.ndarray) – the given data set
is_data_sorted (bool) – Should be True if the data set is already sorted (default: False)
dip_value (float) – The Dip-value (default: None)
modal_interval (tuple) – Indices of the modal interval - corresponds to the steepest slope in the ECDF (default: None)
modal_triangle (tuple) – Indices of the modal triangle (default: None)
gcm (np.ndarray) – The indices of points that are part of the Greatest Convex Minorant (gcm) (default: None)
lcm (np.ndarray) – The indices of points that are part of the Least Concave Majorant (lcm) (default: None)
show_legend (bool) – Defines whether the legend of the ECDF plot should be added (default: True)
add_histogram (bool) – Defines whether the histogram should be shown above the ECDF plot (default: True)
histogram_labels (np.ndarray) – Labels used to color parts of the histogram (default: None)
histogram_show_legend (bool) – Defines whether the legend of the histogram should be added (default: True)
histogram_density (bool) – Defines whether a kernel density should be added to the histogram plot (default: True)
histogram_n_bins (int) – Number of bins used for the histogram (default: 100)
height_ratio (tuple) – Defines the height ratio between histogram and ECDF plot. Only relevant if add_histogram is True. First value in the tuple defines the height of the histogram and the second value the height of the ECDF plot (default: (1, 2))
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

clustpy.utils.plot_histogram(X: ndarray, labels: ndarray | None = None, density: bool = True, n_bins: int = 100, show_legend: bool = True, container: Axes = matplotlib.pyplot, show_plot: bool = True) → None[source]

Plot a histogram.

Parameters:

X (np.ndarray) – the given data set
labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)
density (bool) – Defines whether a kernel density should be added to the histogram (default: True)
n_bins (int) – Number of bins (default: 100)
show_legend (bool) – Defines whether the legend of the histogram should be shown (default: True)
container (plt.Axes) – The container to which the histogram is added. If another container is defined, show_plot should usually be False (default: matplotlib.pyplot)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

clustpy.utils.plot_image(img_data: ndarray, black_and_white: bool = False, image_shape: tuple | None = None, max_value: float | None = None, min_value: float | None = None, show_plot: bool = True) → None[source]

Plot an image. Expects a color image to occur in the HWC representation (height, width, color channels).

Parameters:

img_data (np.ndarray) – The image data
black_and_white (bool) – Specifies whether the image should be plotted in grayscale colors. Only relevant for images without color channels (default: False)
image_shape (tuple) – (height, width) for grayscale images or (height, width, number of channels) for color images (default: None)
max_value (float) – Maximum pixel value, used for min-max normalization. Is often 255. If None, the maximum value in the data set will be used (default: None)
min_value (float) – Minimum pixel value, used for min-max normalization. Is often 0. If None, the minimum value in the data set will be used (default: None)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

Examples

from clustpy.data import load_nrletters, load_optdigits
from clustpy.utils import plot_image

X, _ = load_nrletters()
plot_image(X[0], False, (9, 7, 3), 255, 0, show_plot=True)

X, _ = load_optdigits()
plot_image(X[0], True, (8, 8), 255, 0, show_plot=True)
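The max_value and min_value parameters control a min-max normalization of the pixel values. The sketch below shows the standard form of that normalization on a flat list of pixels. It is an illustration under assumptions: `min_max_normalize` is a hypothetical helper, and the handling of a constant image (all pixels equal) is a choice made for this sketch, not documented behavior.

```python
def min_max_normalize(pixels, min_value=None, max_value=None):
    """Scale pixel values to the range [0, 1] using min-max normalization."""
    # Fall back to the extremes of the data if no bounds are given
    if min_value is None:
        min_value = min(pixels)
    if max_value is None:
        max_value = max(pixels)
    value_range = max_value - min_value
    if value_range == 0:
        # Assumption for this sketch: a constant image maps to all zeros
        return [0.0 for _ in pixels]
    return [(pixel - min_value) / value_range for pixel in pixels]


print(min_max_normalize([0, 51, 255], min_value=0, max_value=255))
print(min_max_normalize([2, 4, 6]))
```

Passing fixed bounds such as 0 and 255 keeps the scaling comparable across images, whereas data-driven bounds stretch each image to the full range.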

clustpy.utils.plot_scatter_matrix(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, density: bool = True, n_bins: int = 100, show_legend: bool = True, scattersize: int = 10, equal_axis: bool = False, max_dimensions: int = 10, show_plot: bool = True) → Axes[source]

Create a scatter matrix plot. Visualizes a 2d scatter plot for each combination of features. The center axis shows a histogram of each single feature.

Parameters:

X (np.ndarray) – the given data set
labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)
centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)
true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)
density (bool) – Defines whether a kernel density should be added to the histogram (default: True)
n_bins (int) – Number of bins used for the histogram (default: 100)
show_legend (bool) – Defines whether a legend should be shown (default: True)
scattersize (float) – The size of the scatters (default: 10)
equal_axis (bool) – Defines whether the axes are to be scaled to the same value range (default: False)
max_dimensions (int) – Maximum number of dimensions that should be plotted. This limit prevents the creation of overly complex plots that are confusing and take a long time to create (default: 10)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)

Returns:

axes – The used matplotlib axes

Return type:

plt.Axes

clustpy.utils.plot_with_transformation(X: ndarray, labels: ndarray | None = None, centers: ndarray | None = None, true_labels: ndarray | None = None, plot_dimensionality: int = 2, transformation_class: TransformerMixin = sklearn.decomposition.PCA, show_legend: bool = True, scattersize: int = 10, equal_axis: bool = False, show_plot: bool = True) → None[source]

High-dimensional data, which is common in data science, cannot be visualized directly. Therefore, a dimensionality reduction technique is often applied before a plot is created. Examples of such techniques are PCA, ICA, t-SNE and UMAP. Note that the chosen technique must provide a ‘fit_transform’ method.

This method automatically executes the aforementioned pipeline: first it reduces the dimensionality, then it creates a plot adjusted to the number of features. Up to three dimensions are visualized with scatter plots. For more than three dimensions, a scatter matrix plot is used.

Parameters:

X (np.ndarray) – the given data set
labels (np.ndarray) – The cluster labels. Specifies the color of the plotted objects. Can be None (default: None)
centers (np.ndarray) – The cluster centers. Will be plotted as red dots labeled by the corresponding cluster id. Can be None (default: None)
true_labels (np.ndarray) – The ground truth labels. Specifies the symbol of the plotted objects. Can be None (default: None)
plot_dimensionality (int) – The dimensionality of the feature space after the dimensionality reduction technique has been applied (default: 2)
transformation_class (TransformerMixin) – The transformation class / dimensionality reduction technique (default: sklearn.decomposition.PCA)
show_legend (bool) – Defines whether a legend should be shown (default: True)
scattersize (float) – The size of the scatters (default: 10)
equal_axis (bool) – Defines whether the axes are to be scaled to the same value range (default: False)
show_plot (bool) – Defines whether the plot should directly be plotted (default: True)