clustpy.hierarchical package

Submodules

clustpy.hierarchical.dctree_clusterer module

@authors: Pascal Weber

class clustpy.hierarchical.dctree_clusterer.DCTree_Clusterer(min_points: int = 5, use_less_memory: bool = False)[source]

Bases: ClusterMixin, BaseEstimator

The DCTree clustering algorithm. Identifies stable nodes within the DCTree and labels the data accordingly.

Parameters:
  • min_points (int) – the minimum number of points (default: 5)

  • use_less_memory (bool) – Use less memory when constructing the DCTree. This will, however, increase the runtime (default: False)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

dc_tree_

The resulting cluster tree

Type:

BinaryClusterTree

n_features_in_

the number of features used for the fitting

Type:

int

References

SHADE: Deep Density-based Clustering Anna Beer; Pascal Weber; Lukas Miklautz; Collin Leiber; Walid Durani; Christian Böhm IEEE International Conference on Data Mining (ICDM), Abu Dhabi, United Arab Emirates, 2024, pp. 675-680, doi: 10.1109/ICDM59182.2024.

fit(X: ndarray, y: ndarray = None) DCTree_Clusterer[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DCTree_Clusterer algorithm

Return type:

DCTree_Clusterer

clustpy.hierarchical.diana module

@authors: Collin Leiber

class clustpy.hierarchical.diana.Diana(n_clusters: int = None, distance_threshold: float = 0, construct_full_tree: bool = False, metric: str = 'euclidean')[source]

Bases: ClusterMixin, BaseEstimator

The DIvisive ANAlysis (DIANA) clustering algorithm. DIANA builds a top-down clustering hierarchy by considering the pairwise dissimilarity of objects. It recursively splits the cluster with the maximum dissimilarity, where the dissimilarity is based on a specified distance metric (e.g., Euclidean distance).
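The top-down splitting described above can be illustrated with a minimal sketch of the textbook DIANA procedure from Kaufman and Rousseeuw. It uses only the Python standard library and Euclidean distance; the function names (average_dissimilarity, diameter, split_cluster, diana_sketch) are hypothetical and are not part of clustpy, whose implementation may differ in detail.

```python
from math import dist


def average_dissimilarity(i, others, points):
    """Mean distance from points[i] to the points indexed by `others` (excluding i)."""
    others = [j for j in others if j != i]
    if not others:
        return 0.0
    return sum(dist(points[i], points[j]) for j in others) / len(others)


def diameter(cluster, points):
    """Largest pairwise distance within a cluster (0 for a singleton)."""
    return max((dist(points[i], points[j]) for i in cluster for j in cluster), default=0.0)


def split_cluster(cluster, points):
    """Split one cluster into (remainder, splinter group)."""
    remainder = list(cluster)
    # Seed the splinter group with the object that is, on average, farthest from the rest.
    seed = max(remainder, key=lambda i: average_dissimilarity(i, remainder, points))
    remainder.remove(seed)
    splinter = [seed]
    while len(remainder) > 1:
        # A positive gain means the object is closer to the splinter group than to the remainder.
        gains = {
            i: average_dissimilarity(i, remainder, points)
            - average_dissimilarity(i, splinter, points)
            for i in remainder
        }
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        remainder.remove(best)
        splinter.append(best)
    return remainder, splinter


def diana_sketch(points, n_clusters):
    """Repeatedly split the cluster with the largest diameter until n_clusters exist."""
    clusters = [list(range(len(points)))]
    while len(clusters) < n_clusters:
        widest = max(clusters, key=lambda c: diameter(c, points))
        if len(widest) < 2:
            break  # only singletons left, nothing more to split
        clusters.remove(widest)
        clusters.extend(split_cluster(widest, points))
    labels = [0] * len(points)
    for label, cluster in enumerate(clusters):
        for i in cluster:
            labels[i] = label
    return labels


# Two well-separated groups of three points each (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = diana_sketch(points, n_clusters=2)
```

In this sketch the cluster with the largest diameter is always split next, which corresponds to the "maximum dissimilarity" criterion mentioned in the description.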

Parameters:
  • n_clusters (int) – The number of clusters. If n_clusters is None, the tree will be constructed until the maximum diameter is below distance_threshold (default: None)

  • distance_threshold (float) – The distance threshold defines the minimum diameter that is considered. Must be 0 if n_clusters is specified (default: 0)

  • construct_full_tree (bool) – Defines whether the full tree should be constructed after n_clusters has been reached (default: False)

  • metric (str) – Metric used to compute the dissimilarity. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed” (see scipy.spatial.distance.pdist) (default: euclidean)

labels_

The final labels

Type:

np.ndarray

tree_

The resulting cluster tree

Type:

BinaryClusterTree

n_features_in_

the number of features used for the fitting

Type:

int

References

Kaufman, Rousseeuw “Divisive Analysis (Program DIANA)” Chapter six from Finding Groups in Data: An Introduction to Cluster Analysis. 1990.

fit(X: ndarray, y: ndarray = None) Diana[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the Diana algorithm

Return type:

Diana

flat_clustering(n_leaf_nodes_to_keep: int) ndarray[source]

Transform the predicted labels into a flat clustering result by keeping only n_leaf_nodes_to_keep leaf nodes in the tree. Returns labels as if the clustering procedure had stopped at the specified number of leaf nodes. Note that each leaf node corresponds to a cluster.

Parameters:

n_leaf_nodes_to_keep (int) – The number of leaf nodes to keep in the cluster tree

Returns:

labels_pruned – The new cluster labels

Return type:

np.ndarray
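The idea behind flat_clustering can be sketched as undoing the most recent splits until only the requested number of leaf nodes remains. The sketch below is an assumption-laden illustration, not the library's implementation: it replaces the BinaryClusterTree with a hypothetical split_history list of (parent_label, child_label) pairs recorded in the order the splits happened, and the function name flatten_labels is invented for this example. The labels returned by clustpy may be numbered differently.

```python
def flatten_labels(labels, split_history, n_leaf_nodes_to_keep):
    """Return labels as if splitting had stopped at n_leaf_nodes_to_keep leaves.

    split_history[s] = (parent, child) means that split number s moved some
    points out of cluster `parent` into the new cluster `child`.
    """
    if n_leaf_nodes_to_keep < 1:
        raise ValueError("n_leaf_nodes_to_keep must be at least 1")
    merged = list(labels)
    # After s splits the tree has s + 1 leaves, so keep the first k - 1 splits
    # and undo all later ones, most recent first.
    for parent, child in reversed(split_history[n_leaf_nodes_to_keep - 1:]):
        merged = [parent if label == child else label for label in merged]
    return merged


# Illustrative data: cluster 0 was split three times, ending with four leaves.
final_labels = [0, 0, 2, 2, 1, 1, 3]
history = [(0, 1), (0, 2), (1, 3)]
two_clusters = flatten_labels(final_labels, history, 2)
```

Undoing splits in reverse order guarantees that any cluster created by a later split has already been merged back before its parent is merged.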

Module contents

class clustpy.hierarchical.DCTree_Clusterer(min_points: int = 5, use_less_memory: bool = False)[source]

Bases: ClusterMixin, BaseEstimator

The DCTree clustering algorithm. Identifies stable nodes within the DCTree and labels the data accordingly.

Parameters:
  • min_points (int) – the minimum number of points (default: 5)

  • use_less_memory (bool) – Use less memory when constructing the DCTree. This will, however, increase the runtime (default: False)

n_clusters_

The final number of clusters

Type:

int

labels_

The final labels

Type:

np.ndarray

dc_tree_

The resulting cluster tree

Type:

BinaryClusterTree

n_features_in_

the number of features used for the fitting

Type:

int

References

SHADE: Deep Density-based Clustering Anna Beer; Pascal Weber; Lukas Miklautz; Collin Leiber; Walid Durani; Christian Böhm IEEE International Conference on Data Mining (ICDM), Abu Dhabi, United Arab Emirates, 2024, pp. 675-680, doi: 10.1109/ICDM59182.2024.

fit(X: ndarray, y: ndarray = None) DCTree_Clusterer[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the DCTree_Clusterer algorithm

Return type:

DCTree_Clusterer

class clustpy.hierarchical.Diana(n_clusters: int = None, distance_threshold: float = 0, construct_full_tree: bool = False, metric: str = 'euclidean')[source]

Bases: ClusterMixin, BaseEstimator

The DIvisive ANAlysis (DIANA) clustering algorithm. DIANA builds a top-down clustering hierarchy by considering the pairwise dissimilarity of objects. It recursively splits the cluster with the maximum dissimilarity, where the dissimilarity is based on a specified distance metric (e.g., Euclidean distance).

Parameters:
  • n_clusters (int) – The number of clusters. If n_clusters is None, the tree will be constructed until the maximum diameter is below distance_threshold (default: None)

  • distance_threshold (float) – The distance threshold defines the minimum diameter that is considered. Must be 0 if n_clusters is specified (default: 0)

  • construct_full_tree (bool) – Defines whether the full tree should be constructed after n_clusters has been reached (default: False)

  • metric (str) – Metric used to compute the dissimilarity. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed” (see scipy.spatial.distance.pdist) (default: euclidean)

labels_

The final labels

Type:

np.ndarray

tree_

The resulting cluster tree

Type:

BinaryClusterTree

n_features_in_

the number of features used for the fitting

Type:

int

References

Kaufman, Rousseeuw “Divisive Analysis (Program DIANA)” Chapter six from Finding Groups in Data: An Introduction to Cluster Analysis. 1990.

fit(X: ndarray, y: ndarray = None) Diana[source]

Initiate the actual clustering process on the input data set. The resulting cluster labels will be stored in the labels_ attribute.

Parameters:
  • X (np.ndarray) – the given data set

  • y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the Diana algorithm

Return type:

Diana

flat_clustering(n_leaf_nodes_to_keep: int) ndarray[source]

Transform the predicted labels into a flat clustering result by keeping only n_leaf_nodes_to_keep leaf nodes in the tree. Returns labels as if the clustering procedure had stopped at the specified number of leaf nodes. Note that each leaf node corresponds to a cluster.

Parameters:

n_leaf_nodes_to_keep (int) – The number of leaf nodes to keep in the cluster tree

Returns:

labels_pruned – The new cluster labels

Return type:

np.ndarray