clustpy.data package

Submodules

clustpy.data.preprocessing module

clustpy.data.real_clustpy_data module

clustpy.data.real_clustpy_data.load_aloi_small(return_X_y: bool = False) -> Bunch

Load a subset of the Amsterdam Library of Object Images (ALOI) consisting of 288 images of four objects: red ball, red cylinder, green ball and green cylinder. The two label sets are cylinder/ball and red/green. N=288, d=611, k=[2,2].

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (288 x 611), the labels numpy array (288 x 2)

Return type:

Bunch

References

https://aloi.science.uva.nl/

and

Ye, Wei, et al. “Generalized independent subspace clustering.” 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016.
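
Example (sketch): the labels array of this data set has one column per label set (288 x 2). The helper below shows one way to separate such columns. `split_label_sets` and `load_aloi_label_sets` are hypothetical names that are not part of clustpy, and which column holds shape versus color is not stated above, so the sketch does not assume an order.

```python
def split_label_sets(target):
    """Split a 2-D label collection (n_samples x n_label_sets) into one list per label set."""
    return [list(column) for column in zip(*target)]


def load_aloi_label_sets():
    # Imported lazily so the helper above can be tried without clustpy installed.
    from clustpy.data.real_clustpy_data import load_aloi_small

    data, target = load_aloi_small(return_X_y=True)  # documented shapes: (288 x 611), (288 x 2)
    return data, split_label_sets(target)


# Stand-in labels for three samples with two label sets each (not real ALOI labels)
toy_target = [[0, 1], [1, 0], [1, 1]]
first_set, second_set = split_label_sets(toy_target)
print(first_set, second_set)  # [0, 1, 1] [1, 0, 1]
```

Each returned list can then be compared separately against a clustering result.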

clustpy.data.real_clustpy_data.load_fruit(return_X_y: bool = False) -> Bunch

Load the fruits data set. It consists of 105 preprocessed images of apples, bananas and grapes in red, green and yellow. N=105, d=6, k=[3,3].

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (105 x 6), the labels numpy array (105 x 2)

Return type:

Bunch

References

Hu, Juhua, et al. “Finding multiple stable clusterings.” Knowledge and Information Systems 51.3 (2017): 991-1021.

clustpy.data.real_clustpy_data.load_nrletters(return_X_y: bool = False) -> Bunch

Load the NRLetters data set. It consists of 10000 9x7 images of the letters A, B, C, X, Y and Z in pink, cyan and yellow. Additionally, each image highlights one corner in color. N=10000, d=189, k=[6,3,4].

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (10000 x 189), the labels numpy array (10000 x 3)

Return type:

Bunch

References

Leiber, Collin, et al. “Automatic Parameter Selection for Non-Redundant Clustering.” Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2022.
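
Example (sketch): the ‘data’ attribute stores each image flattened in HWC order, while ‘images’ uses CHW. The index arithmetic behind that conversion can be sketched in plain Python. `hwc_flat_to_chw` is a hypothetical helper, and reading "9x7" as height 9 and width 7 is an assumption.

```python
def hwc_flat_to_chw(flat, height, width, channels):
    """Rearrange a flattened HWC image into nested lists indexed [channel][row][column]."""
    if len(flat) != height * width * channels:
        raise ValueError("flat length does not match height * width * channels")
    return [
        [
            # In HWC order the channel index varies fastest, then the column, then the row.
            [flat[(h * width + w) * channels + c] for w in range(width)]
            for h in range(height)
        ]
        for c in range(channels)
    ]


# Toy 2x2 image with 3 channels; values are just their flat positions
toy_flat = list(range(12))
toy_chw = hwc_flat_to_chw(toy_flat, height=2, width=2, channels=3)
print(toy_chw[0])  # [[0, 3], [6, 9]]

# An NRLetters-sized vector: 9 * 7 * 3 = 189 entries (height/width order assumed)
nr_chw = hwc_flat_to_chw([0] * 189, height=9, width=7, channels=3)
```

With NumPy arrays the same rearrangement is usually written as `flat.reshape(H, W, C).transpose(2, 0, 1)`.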

clustpy.data.real_clustpy_data.load_stickfigures(return_X_y: bool = False) -> Bunch

Load the Dancing Stick Figures data set. It consists of 900 20x20 grayscale images of stick figures in different poses. The poses can be divided into three upper-body and three lower-body motions. N=900, d=400, k=[3,3].

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (900 x 400), the labels numpy array (900 x 2)

Return type:

Bunch

References

Günnemann, Stephan, et al. “Smvc: semi-supervised multi-view clustering in subspace projections.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014.

clustpy.data.real_medical_mnist_data module

clustpy.data.real_timeseries_data module

clustpy.data.real_timeseries_data.load_diatom_size_reduction(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the diatom size reduction data set. It consists of 322 samples belonging to one of 4 classes. The data set is composed of 16 training and 306 test samples. N=322, d=345, k=4.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (322 x 345), the labels numpy array (322)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=DiatomSizeReduction
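
Example (sketch): the subset parameter selects the training part, the test part, or both combined. The stand-in below mirrors that behavior on plain lists. `select_subset` is a hypothetical helper, not clustpy code, and placing the training samples first when combining is an assumption.

```python
def select_subset(train, test, subset="all"):
    """Return the train part, the test part, or both concatenated (train first)."""
    if subset == "train":
        return list(train)
    if subset == "test":
        return list(test)
    if subset == "all":
        return list(train) + list(test)
    raise ValueError("subset must be 'all', 'train' or 'test'")


# Placeholder samples with the sizes documented for the diatom size reduction data set
train = ["train"] * 16
test = ["test"] * 306
sizes = {name: len(select_subset(train, test, name)) for name in ("all", "train", "test")}
print(sizes)  # {'all': 322, 'train': 16, 'test': 306}
```

The same pattern applies to every loader in this module that accepts subset.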

clustpy.data.real_timeseries_data.load_lsst(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the LSST data set. It consists of 4925 samples belonging to one of 14 classes. The data set is composed of 2459 training and 2466 test samples. N=4925, d=216, k=14.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (4925 x 216), the labels numpy array (4925)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=LSST

clustpy.data.real_timeseries_data.load_motestrain(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the motestrain data set. It consists of 1272 samples belonging to one of 2 classes. The data set is composed of 20 training and 1252 test samples. N=1272, d=84, k=2.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1272 x 84), the labels numpy array (1272)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=MoteStrain

clustpy.data.real_timeseries_data.load_olive_oil(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the OliveOil data set. It consists of 60 samples belonging to one of 4 classes. The data set is composed of 30 training and 30 test samples. N=60, d=570, k=4.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (60 x 570), the labels numpy array (60)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=OliveOil

clustpy.data.real_timeseries_data.load_plane(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the plane data set. It consists of 210 samples belonging to one of 7 classes. The data set is composed of 105 training and 105 test samples. N=210, d=144, k=7.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (210 x 144), the labels numpy array (210)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=Plane

clustpy.data.real_timeseries_data.load_proximal_phalanx_outline(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the proximal phalanx outline data set. It consists of 876 samples belonging to one of 2 classes. The data set is composed of 600 training and 276 test samples. N=876, d=80, k=2.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (876 x 80), the labels numpy array (876)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=ProximalPhalanxOutlineCorrect

clustpy.data.real_timeseries_data.load_sony_aibo_robot_surface(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the Sony AIBO Robot Surface 1 data set. It consists of 621 samples belonging to one of 2 classes. The data set is composed of 20 training and 601 test samples. N=621, d=70, k=2.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (621 x 70), the labels numpy array (621)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=SonyAIBORobotSurface1

clustpy.data.real_timeseries_data.load_symbols(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the symbols data set. It consists of 1020 samples belonging to one of 6 classes. The data set is composed of 25 training and 995 test samples. N=1020, d=398, k=6.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1020 x 398), the labels numpy array (1020)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=Symbols

clustpy.data.real_timeseries_data.load_two_patterns(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the two patterns data set. It consists of 5000 samples belonging to one of 4 classes. The data set is composed of 1000 training and 4000 test samples. N=5000, d=128, k=4.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (5000 x 128), the labels numpy array (5000)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=TwoPatterns

clustpy.data.real_torchvision_data module

clustpy.data.real_uci_data module

clustpy.data.real_uci_data.load_banknotes(return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the banknote authentication data set. It consists of 1372 genuine and forged banknote samples. N=1372, d=4, k=2.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1372 x 4), the labels numpy array (1372)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/banknote+authentication
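
Example (sketch): when downloads_path is None, the data is stored under [USER]/Downloads/clustpy_datafiles. A plausible resolution of that default looks as follows. `resolve_downloads_path` is a hypothetical helper, and treating [USER] as the current user's home directory is an assumption.

```python
from pathlib import Path


def resolve_downloads_path(downloads_path=None):
    """Return the directory used for data files, falling back to the documented default."""
    if downloads_path is None:
        # Assumption: [USER] corresponds to the home directory of the current user.
        return Path.home() / "Downloads" / "clustpy_datafiles"
    return Path(downloads_path)


default_dir = resolve_downloads_path()
custom_dir = resolve_downloads_path("my_datasets")
print(default_dir)
print(custom_dir)
```

Passing an explicit directory is useful on machines where the home directory is read-only or shared.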

clustpy.data.real_uci_data.load_breast_cancer_wisconsin_original(return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the original breast cancer Wisconsin data set. It consists of 699 samples belonging to one of 2 classes. 16 samples contain ‘?’ values and will be removed. N=683, d=9, k=2.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (683 x 9), the labels numpy array (683)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29
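
Example (sketch): samples containing ‘?’ values are removed, which reduces 699 samples to 683. The filtering step can be illustrated on a few made-up rows. `drop_rows_with_missing` is a hypothetical helper and the rows are not taken from the real file.

```python
def drop_rows_with_missing(rows, missing_marker="?"):
    """Keep only the rows in which no entry equals the missing-value marker."""
    return [row for row in rows if missing_marker not in row]


# Made-up raw rows as they might be read from a text file
raw_rows = [
    ["5", "1", "1"],
    ["8", "?", "3"],  # contains a missing value and is dropped
    ["2", "4", "1"],
]
clean_rows = drop_rows_with_missing(raw_rows)
features = [[int(value) for value in row] for row in clean_rows]
print(features)  # [[5, 1, 1], [2, 4, 1]]
```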

clustpy.data.real_uci_data.load_breast_tissue(return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the breast tissue data set. It consists of 106 samples belonging to one of 6 classes. N=106, d=9, k=6.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (106 x 9), the labels numpy array (106)

Return type:

Bunch

References

http://archive.ics.uci.edu/ml/datasets/breast+tissue

clustpy.data.real_uci_data.load_cmu_faces(return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the CMU Face Images data set. It consists of 640 30x32 grayscale images showing 20 persons in different poses (up, straight, left, right) and with different expressions (neutral, happy, sad, angry). Additionally, the persons can wear sunglasses or not. 16 images show glitches, which is why the final data set only contains 624 images. N=624, d=960, k=[20,4,4,2].

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (624 x 960), the labels numpy array (624 x 4)

Return type:

Bunch

References

http://archive.ics.uci.edu/ml/datasets/cmu+face+images

clustpy.data.real_uci_data.load_dermatology(return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the dermatology data set. It consists of 366 samples belonging to one of 6 classes. 8 samples contain ‘?’ values and are therefore removed. N=358, d=34, k=6.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (358 x 34), the labels numpy array (358)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/dermatology

clustpy.data.real_uci_data.load_ecoli(ignore_small_clusters: bool = False, return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the ecoli data set. It consists of 336 samples belonging to one of 8 classes. N=336, d=7, k=8.

Parameters:
  • ignore_small_clusters (bool) – specify if the three small clusters with size 2, 2 and 5 should be ignored (default: False)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (336 x 7), the labels numpy array (336)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/ecoli
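
Example (sketch): ignore_small_clusters drops the three clusters of size 2, 2 and 5, which would leave 336 - 9 = 327 samples in 5 classes. One way to express such a filter is by cluster size. `drop_small_clusters` and the size threshold are assumptions made for illustration; clustpy may select the clusters differently, for example by class name.

```python
from collections import Counter


def drop_small_clusters(samples, labels, min_size):
    """Remove all samples whose cluster contains fewer than min_size members."""
    sizes = Counter(labels)
    kept = [(s, l) for s, l in zip(samples, labels) if sizes[l] >= min_size]
    return [s for s, _ in kept], [l for _, l in kept]


# Made-up labels: two large clusters and three small ones of size 5, 2 and 2
toy_labels = ["a"] * 10 + ["b"] * 7 + ["c"] * 5 + ["d"] * 2 + ["e"] * 2
toy_samples = list(range(len(toy_labels)))
kept_samples, kept_labels = drop_small_clusters(toy_samples, toy_labels, min_size=6)
print(len(kept_samples), sorted(set(kept_labels)))  # 17 ['a', 'b']
```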

clustpy.data.real_uci_data.load_forest_types(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the forest type mapping data set. It consists of 523 samples belonging to one of 4 classes. The data set is composed of 198 training and 325 test samples. N=523, d=27, k=4.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (523 x 27), the labels numpy array (523)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/Forest+type+mapping

clustpy.data.real_uci_data.load_har(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the Human Activity Recognition data set. It consists of 10299 samples, each representing sensor data of a person performing an activity. The six activities are walking, walking_upstairs, walking_downstairs, sitting, standing and laying. The data set is composed of 7352 training and 2947 test samples. N=10299, d=561, k=6.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (10299 x 561), the labels numpy array (10299)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones

clustpy.data.real_uci_data.load_htru2(return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the HTRU2 data set. It consists of 17898 samples belonging to the pulsar or non-pulsar class. A special property is that more than 90% of the data belongs to class 0. N=17898, d=8, k=2.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (17898 x 8), the labels numpy array (17898)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/HTRU2
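
Example (sketch): more than 90% of the HTRU2 samples belong to class 0. A quick way to inspect such an imbalance on any label vector is to compute class shares. `class_shares` is a hypothetical helper, and the toy labels below only imitate a strong imbalance; they are not the real class counts.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that falls into each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Made-up labels: 19 samples of class 0 and one sample of class 1
toy_labels = [0] * 19 + [1]
shares = class_shares(toy_labels)
print(shares)
```

With such a skewed distribution, a result that assigns every sample to a single cluster already matches most labels, so evaluation scores should be read with that baseline in mind.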

clustpy.data.real_uci_data.load_letterrecognition(return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the Letter Recognition data set. It consists of 20000 samples where each sample represents one of the 26 capital letters in the English alphabet. All samples are composed of 16 numerical stimuli describing the respective letter. N=20000, d=16, k=26.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (20000 x 16), the labels numpy array (20000)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/letter+recognition

clustpy.data.real_uci_data.load_mice_protein(return_additional_labels: bool = False, return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the Mice Protein Expression data set. It consists of 1077 samples belonging to one of 8 classes. Each feature represents the expression level of one of 77 proteins. Samples containing more than 43 NaN values (3 cases) will be removed. Afterwards, all columns containing NaN values will be removed. This reduces the number of features from 77 to 68. The classes can be further subdivided by using the return_additional_labels parameter. This gives the additional information mouseID, behavior, treatment type and genotype. N=1077, d=68, k=8.

Parameters:
  • return_additional_labels (bool) – return additional labels (default: False)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1077 x 68), the labels numpy array (1077)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression
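
Example (sketch): the cleaning described above has two steps: first remove samples with more than 43 NaN values, then remove every column that still contains a NaN. The order matters, because dropping sparse rows first keeps more columns. `drop_sparse_rows_then_nan_columns` is a hypothetical helper working on nested lists, with a toy threshold instead of 43.

```python
import math


def drop_sparse_rows_then_nan_columns(rows, max_nan_per_row):
    """Drop rows with more than max_nan_per_row NaNs, then drop columns containing any NaN."""
    kept_rows = [
        row for row in rows
        if sum(math.isnan(value) for value in row) <= max_nan_per_row
    ]
    if not kept_rows:
        return []
    n_columns = len(kept_rows[0])
    kept_columns = [
        j for j in range(n_columns)
        if not any(math.isnan(row[j]) for row in kept_rows)
    ]
    return [[row[j] for j in kept_columns] for row in kept_rows]


nan = float("nan")
toy_rows = [
    [1.0, 2.0, 3.0],
    [nan, nan, 6.0],  # two NaNs: removed as a row when max_nan_per_row=1
    [7.0, nan, 9.0],  # one NaN: the row stays, but its NaN column is removed
]
cleaned = drop_sparse_rows_then_nan_columns(toy_rows, max_nan_per_row=1)
print(cleaned)  # [[1.0, 3.0], [7.0, 9.0]]
```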

clustpy.data.real_uci_data.load_multiple_features(return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the multiple features data set. It consists of 2000 samples belonging to one of 10 classes. Each class corresponds to handwritten numerals (0-9) extracted from a collection of Dutch utility maps. N=2000, d=649, k=10.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (2000 x 649), the labels numpy array (2000)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/Multiple+Features

clustpy.data.real_uci_data.load_optdigits(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the optdigits data set. It consists of 5620 8x8 grayscale images, each representing a digit (0 to 9). Each pixel depicts the number of marked pixels within a 4x4 block of the original 32x32 bitmaps. The data set is composed of 3823 training and 1797 test samples. N=5620, d=64, k=10.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (5620 x 64), the labels numpy array (5620)

Return type:

Bunch

References

http://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits
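
Example (sketch): each of the 64 features counts the marked pixels in one 4x4 block of a 32x32 bitmap, so values range from 0 to 16. The reduction can be reproduced on a toy bitmap. `count_blocks` is a hypothetical helper that assumes a square 0/1 bitmap whose side length is a multiple of the block size.

```python
def count_blocks(bitmap, block=4):
    """Count the marked (non-zero) pixels in each non-overlapping block x block tile."""
    size = len(bitmap)
    return [
        [
            sum(
                bitmap[r][c]
                for r in range(block_row, block_row + block)
                for c in range(block_col, block_col + block)
            )
            for block_col in range(0, size, block)
        ]
        for block_row in range(0, size, block)
    ]


# Toy 32x32 bitmap: the top-left 4x4 block is fully marked, plus one pixel in the last block
bitmap = [[0] * 32 for _ in range(32)]
for r in range(4):
    for c in range(4):
        bitmap[r][c] = 1
bitmap[31][31] = 1

reduced = count_blocks(bitmap)
print(len(reduced), len(reduced[0]))  # 8 8
print(reduced[0][0], reduced[7][7])   # 16 1
```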

clustpy.data.real_uci_data.load_pendigits(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the pendigits data set. It consists of 10992 vectors of length 16, representing 8 coordinates. The coordinates were taken from the task of writing digits (0 to 9) on a tablet. The data set is composed of 7494 training and 3498 test samples. N=10992, d=16, k=10.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (10992 x 16), the labels numpy array (10992)

Return type:

Bunch

References

http://archive.ics.uci.edu/ml/datasets/pen-based+recognition+of+handwritten+digits
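
Example (sketch): each sample is a vector of length 16 that represents 8 coordinates. To plot or inspect a pen trajectory, the vector can be regrouped into points. `to_points` is a hypothetical helper, and the interleaved layout (x1, y1, x2, y2, ...) is an assumption that is not stated above.

```python
def to_points(vector):
    """Group a flat vector into (x, y) pairs, assuming interleaved coordinates."""
    if len(vector) % 2 != 0:
        raise ValueError("vector must contain an even number of values")
    return list(zip(vector[0::2], vector[1::2]))


# Placeholder sample of length 16 (not a real pendigits vector)
toy_vector = list(range(16))
points = to_points(toy_vector)
print(len(points), points[0], points[-1])  # 8 (0, 1) (14, 15)
```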

clustpy.data.real_uci_data.load_seeds(return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the seeds data set. It consists of 210 samples belonging to one of three varieties of wheat. N=210, d=7, k=3.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (210 x 7), the labels numpy array (210)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/seeds

clustpy.data.real_uci_data.load_semeion(return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the semeion data set. It consists of 1593 samples belonging to one of 10 classes. Each sample corresponds to a grayscale 16x16 scan of handwritten digits originating from about 80 different persons. Further, each pixel was converted to a boolean value using a fixed threshold. N=1593, d=256, k=10.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1593 x 256), the labels numpy array (1593)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/semeion+handwritten+digit
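
Example (sketch): each 16x16 grayscale scan was converted to boolean pixels using a fixed threshold and is stored as a vector with d=256. The two steps can be illustrated as follows. `binarize` and `flatten` are hypothetical helpers, and the threshold value 127 is an assumption, since the actual threshold is not given above.

```python
def binarize(image, threshold):
    """Map every pixel to True if it is brighter than the threshold, else False."""
    return [[pixel > threshold for pixel in row] for row in image]


def flatten(image):
    """Concatenate the image rows into a single feature vector."""
    return [pixel for row in image for pixel in row]


# Toy 16x16 grayscale image: a bright diagonal on a dark background
toy_image = [[255 if r == c else 10 for c in range(16)] for r in range(16)]
vector = flatten(binarize(toy_image, threshold=127))
print(len(vector), sum(vector))  # 256 16
```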

clustpy.data.real_uci_data.load_skin(return_X_y: bool = False, downloads_path: str | None = None) -> Bunch

Load the Skin Segmentation data set. It consists of 245057 skin- and non-skin samples with their B, G, R color information. N=245057, d=3, k=2.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (245057 x 3), the labels numpy array (245057)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/skin+segmentation
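
The return_X_y convention shared by all loaders in this module can be sketched with a stand-in. The function load_toy and its three samples are hypothetical, and types.SimpleNamespace merely imitates the attribute access of a Bunch object; this is not the library's implementation.

```python
from types import SimpleNamespace


def load_toy(return_X_y=False):
    """Hypothetical loader that mimics the documented return convention."""
    data = [[74, 85, 123], [73, 84, 122], [200, 201, 199]]  # made-up B, G, R rows
    target = [0, 0, 1]                                      # made-up labels
    if return_X_y:
        # Tuple form: (data, target)
        return data, target
    # Container form: fields are reachable as attributes, as with a Bunch.
    return SimpleNamespace(data=data, target=target)


X, y = load_toy(return_X_y=True)
bunch = load_toy()
```

Both call styles expose the same arrays; the tuple form is convenient when the result is passed straight to a clustering algorithm.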

clustpy.data.real_uci_data.load_soybean_large(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) Bunch[source]

Load the large version of the soybean data set. It consists of 562 samples belonging to one of 15 classes. The original data set contains more samples and 19 classes, but samples with ‘?’ (missing) attribute values are ignored. The data set is composed of 266 training and 296 test samples. N=562, d=35, k=15.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (562 x 35), the labels numpy array (562)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/soybean+(Large)
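
The removal of samples with ‘?’ values can be sketched as follows. The helper drop_incomplete and the toy rows are hypothetical; the actual parsing of the UCI files may differ.

```python
def drop_incomplete(samples, labels, missing="?"):
    """Keep only samples without a missing-value marker, together with their labels."""
    kept_samples, kept_labels = [], []
    for sample, label in zip(samples, labels):
        if missing not in sample:
            kept_samples.append(sample)
            kept_labels.append(label)
    return kept_samples, kept_labels


# Toy rows in the raw string form of a CSV file (hypothetical values).
raw_samples = [["1", "0", "2"], ["0", "?", "1"], ["2", "2", "0"]]
raw_labels = ["a", "b", "c"]
clean_samples, clean_labels = drop_incomplete(raw_samples, raw_labels)
```

In this toy case class "b" disappears because its only sample is incomplete, which is one way the number of classes can shrink after filtering.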

clustpy.data.real_uci_data.load_soybean_small(return_X_y: bool = False, downloads_path: str | None = None) Bunch[source]

Load the small version of the soybean data set. It is a small subset of the original soybean data set. It consists of 47 samples belonging to one of 4 classes. N=47, d=35, k=4.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (47 x 35), the labels numpy array (47)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/soybean+(small)

clustpy.data.real_uci_data.load_spambase(return_X_y: bool = False, downloads_path: str | None = None) Bunch[source]

Load the spambase data set. It consists of 4601 spam and non-spam mails. N=4601, d=57, k=2.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (4601 x 57), the labels numpy array (4601)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/spambase

clustpy.data.real_uci_data.load_statlog_australian_credit_approval(return_X_y: bool = False, downloads_path: str | None = None) Bunch[source]

Load the statlog Australian Credit Approval data set. It consists of 690 samples belonging to one of 2 classes. N=690, d=14, k=2.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (690 x 14), the labels numpy array (690)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/statlog+(australian+credit+approval)

clustpy.data.real_uci_data.load_statlog_shuttle(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) Bunch[source]

Load the statlog shuttle data set. It consists of 58000 samples belonging to one of 7 classes. A special property is that about 80% of the data belongs to class 0. The data set is composed of 43500 training and 14500 test samples. N=58000, d=9, k=7.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (58000 x 9), the labels numpy array (58000)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)
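
The meaning of the subset parameter can be sketched with plain lists. The helper select_subset is hypothetical, and placing the training samples before the test samples for ‘all’ is an assumption; the source only says that both parts are combined.

```python
def select_subset(train, test, subset="all"):
    """Return the requested part of a data set that ships as a train/test split."""
    if subset == "train":
        return list(train)
    if subset == "test":
        return list(test)
    if subset == "all":
        # Assumed order: training samples first, then test samples.
        return list(train) + list(test)
    raise ValueError("subset must be 'all', 'train' or 'test'")


# Placeholder samples with the split sizes stated for the shuttle data set.
train_part = list(range(43500))
test_part = list(range(14500))
all_part = select_subset(train_part, test_part, subset="all")
```

With the documented split sizes, the combined subset has 43500 + 14500 = 58000 samples, which matches N.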

clustpy.data.real_uci_data.load_user_knowledge(subset: str = 'all', return_X_y: bool = False, downloads_path: str | None = None) Bunch[source]

Load the user knowledge data set. It consists of 403 samples belonging to one of 4 classes. The 4 classes are the knowledge levels ‘very low’, ‘low’, ‘middle’ and ‘high’. The data set is composed of 258 training and 145 test samples. N=403, d=5, k=4.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (403 x 5), the labels numpy array (403)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling

clustpy.data.real_video_data module

clustpy.data.real_world_data module

clustpy.data.real_world_data.load_breast_cancer(return_X_y: bool = False) Bunch[source]

Load the breast cancer wisconsin data set. It consists of 30 features computed from digitized images of a fine needle aspirate of a breast mass. The classes are the result of a diagnosis (malignant or benign). N=569, d=30, k=2.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (569 x 30), the labels numpy array (569)

Return type:

Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

and

https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

clustpy.data.real_world_data.load_coil100(return_X_y: bool = False, downloads_path: str | None = None) Bunch[source]

Load the COIL-100 data set. It consists of 7200 128x128 color images of 100 objects photographed from 72 different angles. N=7200, d=49152, k=100.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (7200 x 49152), the labels numpy array (7200)

Return type:

Bunch

References

https://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php
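
The difference between the HWC layout of ‘data’ and the CHW layout of ‘images’ can be illustrated with NumPy. This is a sketch on a tiny made-up array, not the loader's code; the variable names are assumptions.

```python
import numpy as np

# Tiny stand-ins for the real sizes (7200 images, 3 channels, 128 x 128 pixels).
n, channels, height, width = 2, 3, 4, 5

# 'images'-style array in CHW layout: (n, channels, height, width).
images_chw = np.arange(n * channels * height * width).reshape(n, channels, height, width)

# Move the channel axis to the end (HWC) and flatten every image into one row.
data_hwc = images_chw.transpose(0, 2, 3, 1).reshape(n, -1)

# Going back: restore the HWC shape, then move the channels to the front again.
restored_chw = data_hwc.reshape(n, height, width, channels).transpose(0, 3, 1, 2)
```

In HWC order the first three entries of a flattened row are the three channel values of the top-left pixel. For COIL-100 the flattened length is 128 * 128 * 3 = 49152.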

clustpy.data.real_world_data.load_coil20(return_X_y: bool = False, downloads_path: str | None = None) Bunch[source]

Load the COIL-20 data set. It consists of 1440 128x128 gray-scale images of 20 objects photographed from 72 different angles. N=1440, d=16384, k=20.

Parameters:
  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1440 x 16384), the labels numpy array (1440)

Return type:

Bunch

References

https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php

clustpy.data.real_world_data.load_imagenet10(use_224_size: bool = True, return_X_y: bool = False, downloads_path: str | None = None) Bunch[source]

Load the ImageNet-10 data set. This is a subset of the well-known ImageNet data set with only 10 classes. It consists of 13000 224x224 (or 96x96) color images showing different objects. N=13000, d=150528, k=10.

Parameters:
  • use_224_size (bool) – defines whether the images should be loaded in the size (224 x 224) or (96 x 96) (default: True)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (13000 x 150528), the labels numpy array (13000)

Return type:

Bunch

References

https://www.image-net.org/

and

Russakovsky, Olga, et al. “Imagenet large scale visual recognition challenge.” International journal of computer vision 115 (2015): 211-252.
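
The stated dimensionality follows from flattening each color image. The small helper below is hypothetical; it only reproduces the arithmetic and assumes three color channels.

```python
def flattened_dim(height, width, channels=3):
    """Number of features after flattening one image of the given size."""
    return height * width * channels


dim_224 = flattened_dim(224, 224)  # matches the documented d=150528
dim_96 = flattened_dim(96, 96)     # value the same formula gives for 96 x 96 images
```

The documented d=150528 therefore refers to the default use_224_size=True; for the 96 x 96 variant the same formula gives 27648.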

clustpy.data.real_world_data.load_imagenet_dog(subset: str = 'all', image_size: tuple = (224, 224), breeds: list = ['n02085936-Maltese_dog', 'n02086646-Blenheim_spaniel', 'n02088238-basset', 'n02091467-Norwegian_elkhound', 'n02097209-standard_schnauzer', 'n02099601-golden_retriever', 'n02101388-Brittany_spaniel', 'n02101556-clumber', 'n02102177-Welsh_springer_spaniel', 'n02105056-groenendael', 'n02105412-kelpie', 'n02105855-Shetland_sheepdog', 'n02107142-Doberman', 'n02110958-pug', 'n02112137-chow'], return_X_y: bool = False, downloads_path: str | None = None) Bunch[source]

Load the ImageNet Dog data set. It consists of 20580 color images of different sizes showing 120 breeds of dogs. The data set is composed of 12000 training and 8580 test images. Usually, a subset of 15 dog breeds is used (Maltese_dog, Blenheim_spaniel, Basset, Norwegian_elkhound, Standard_schnauzer, Golden_retriever, Brittany_spaniel, Clumber, Welsh_springer_spaniel, Groenendael, Kelpie, Shetland_sheepdog, Doberman, Pug, Chow), resulting in 2574 images for the “all” subset. N=20580, d=image_size[0]*image_size[1]*3, k=120.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • image_size (tuple) – the images of various sizes must be converted into a coherent size. The tuple equals (width, height) of the images (default: (224, 224))

  • breeds (list) – list containing all the identifiers of the dog breeds that should be extracted. All entries must be of type str. If None, all breeds will be extracted. Usually, a subset consisting of 15 breeds is extracted (default: list with 15 dog breeds)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (20580 x image_size[0]*image_size[1]*3), the labels numpy array (20580)

Return type:

Bunch

References

http://vision.stanford.edu/aditya86/ImageNetDogs/main.html

and

Khosla, Aditya, et al. “Novel dataset for fine-grained image categorization: Stanford dogs.” Proc. CVPR workshop on fine-grained visual categorization (FGVC). Vol. 2. No. 1. Citeseer, 2011.

clustpy.data.real_world_data.load_iris(return_X_y: bool = False) Bunch[source]

Load the iris data set. It consists of the petal and sepal width and length of three different types of irises (Setosa, Versicolour, Virginica). N=150, d=4, k=3.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (150 x 4), the labels numpy array (150)

Return type:

Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

and

https://archive.ics.uci.edu/ml/datasets/iris

clustpy.data.real_world_data.load_newsgroups(subset: str = 'all', n_features: int = 2000, return_X_y: bool = False) Bunch[source]

Load the 20 newsgroups data set. It consists of a collection of 18846 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The documents are converted into feature vectors using TF-IDF. The data set is composed of 11314 training and 7532 test documents. N=18846, d=2000, k=20 using the default settings.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • n_features (int) – number of features used by TF-IDF (default: 2000)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (18846 x 2000 - using the default settings), the labels numpy array (18846)

Return type:

Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups

and

http://qwone.com/~jason/20Newsgroups/
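
How documents become TF-IDF feature vectors with a limited number of features can be sketched in plain Python. This is an illustrative variant, not the loader's code: whitespace tokenization, selecting the n_features most frequent terms, the smoothed inverse document frequency and the L2 normalization are all assumptions, and the function tfidf is hypothetical.

```python
import math
from collections import Counter


def tfidf(documents, n_features):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Keep only the n_features terms with the highest total count.
    totals = Counter(token for tokens in tokenized for token in tokens)
    vocabulary = sorted(term for term, _ in totals.most_common(n_features))
    token_sets = [set(tokens) for tokens in tokenized]
    n_docs = len(tokenized)
    idf = {}
    for term in vocabulary:
        doc_freq = sum(term in tokens for tokens in token_sets)
        # Smoothed inverse document frequency (assumed variant).
        idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(value * value for value in row))
        rows.append([value / norm if norm > 0 else 0.0 for value in row])
    return vocabulary, rows


documents = ["the cat sat", "the dog sat", "the bird flew away"]
vocabulary, rows = tfidf(documents, n_features=2)
```

The number of columns equals n_features, which is why d=2000 under the default settings.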

clustpy.data.real_world_data.load_olivetti_faces(return_X_y: bool = False) Bunch[source]

Load the Olivetti faces data set. It consists of 400 64x64 grayscale images showing faces of 40 different persons. N=400, d=4096, k=40.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (400 x 4096), the labels numpy array (400)

Return type:

Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_olivetti_faces.html

clustpy.data.real_world_data.load_reuters(subset: str = 'all', n_features: int = 2000, categories: tuple = ('CCAT', 'GCAT', 'MCAT', 'ECAT'), return_X_y: bool = False) Bunch[source]

Load the Reuters data set. It consists of over 800000 manually categorized newswire stories made available by Reuters, Ltd. Usually only a subset of the categories is used. Those categories are defined by the attribute ‘categories’. We use only those articles that belong to a single category. Further, we only use the n_features most frequent features. The data set is composed of 19806 training and 665265 test documents using the default settings. N=685071, d=2000, k=4 using the default settings.

Parameters:
  • subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)

  • n_features (int) – number of features used (default: 2000)

  • categories (tuple) – the categories that should be contained (default: (“CCAT”, “GCAT”, “MCAT”, “ECAT”))

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (685071 x 2000 - using the default settings), the labels numpy array (685071 - using the default settings)

Return type:

Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_rcv1.html#sklearn.datasets.fetch_rcv1

and

Lewis, David D., et al. “Rcv1: A new benchmark collection for text categorization research.” Journal of machine learning research 5.Apr (2004): 361-397.
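
The restriction to articles that belong to a single category can be sketched as a filter. The helper keep_single_category and the toy category sets are hypothetical, and reading "single category" as "exactly one of the selected categories" is an assumption.

```python
def keep_single_category(doc_categories, categories=("CCAT", "GCAT", "MCAT", "ECAT")):
    """Return (index, category) pairs for documents in exactly one selected category."""
    kept = []
    for index, assigned in enumerate(doc_categories):
        matches = [category for category in categories if category in assigned]
        if len(matches) == 1:
            kept.append((index, matches[0]))
    return kept


# Toy category assignments; codes outside the selected four are ignored.
toy_categories = [{"CCAT", "C15"}, {"CCAT", "GCAT"}, {"E11"}, {"MCAT"}]
kept_documents = keep_single_category(toy_categories)
```

Documents 1 and 2 are dropped: the first belongs to two selected categories, the second to none. With the default settings the documented split sizes add up to 19806 + 665265 = 685071 documents.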

clustpy.data.real_world_data.load_webkb(use_universities: tuple = ('cornell', 'texas', 'washington', 'wisconsin'), use_categories: tuple = ('course', 'faculty', 'project', 'student'), remove_headers: bool = True, min_doc_frequency: float = 0.01, min_variance: float = 0.25, return_X_y: bool = False, downloads_path: str | None = None) Bunch[source]

Load the WebKB data set. It consists of 1041 HTML documents from different universities (default: “cornell”, “texas”, “washington” and “wisconsin”). These web pages have a specified category (default: “course”, “faculty”, “project”, “student”). For more information, see the website listed in the references. The data is preprocessed by stemming and removing stop words. Furthermore, words with a document frequency smaller than min_doc_frequency or with a variance smaller than min_variance are removed. N=1041, d=323, k=[4,4] using the default settings.

Parameters:
  • use_universities (tuple) – specify the universities (default: (“cornell”, “texas”, “washington”, “wisconsin”))

  • use_categories (tuple) – specify the categories (default: (“course”, “faculty”, “project”, “student”))

  • remove_headers (bool) – should the headers of the HTML files be removed? (default: True)

  • min_doc_frequency (float) – minimum document frequency of the words (default: 0.01)

  • min_variance (float) – minimum variance of the words (default: 0.25)

  • return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

  • downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1041 x 323 - using the default settings), the labels numpy array (1041 x 2 - using the default settings)

Return type:

Bunch

References

http://www.cs.cmu.edu/~webkb/
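
The word filtering controlled by min_doc_frequency and min_variance can be sketched on a small count matrix. The helper filter_words is hypothetical; treating the document frequency as a fraction of documents and using the population variance of raw counts are assumptions about details the source does not spell out.

```python
from statistics import pvariance


def filter_words(counts, vocabulary, min_doc_frequency=0.01, min_variance=0.25):
    """Keep words that are frequent enough and vary enough across documents."""
    n_docs = len(counts)
    kept = []
    for j, word in enumerate(vocabulary):
        column = [row[j] for row in counts]
        # Fraction of documents that contain the word at least once.
        doc_frequency = sum(1 for value in column if value > 0) / n_docs
        if doc_frequency >= min_doc_frequency and pvariance(column) >= min_variance:
            kept.append(word)
    return kept


# Rows are documents, columns are word counts (toy values).
toy_vocabulary = ["page", "rare", "course"]
toy_counts = [[1, 0, 3], [1, 0, 0], [1, 0, 2], [1, 1, 0]]
kept_words = filter_words(toy_counts, toy_vocabulary)
```

Here "page" is removed because its count never varies, and "rare" because its variance is below 0.25; only "course" remains. The number of surviving words determines d.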

clustpy.data.real_world_data.load_wine(return_X_y: bool = False) Bunch[source]

Load the wine data set. It consists of 13 different properties of three different types of wine. N=178, d=13, k=3.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (178 x 13), the labels numpy array (178)

Return type:

Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html

and

https://archive.ics.uci.edu/ml/datasets/wine

clustpy.data.synthetic_data_creator module

clustpy.data.synthetic_data_creator.create_nr_data(n_samples: int = 1000, n_clusters: tuple = (3, 3, 1), subspace_features: tuple = (2, 2, 2), n_outliers: tuple = (0, 0, 0), std: float = 1.0, box: tuple = (-10, 10), rotate_space: bool = True, random_state: int | None = None) -> (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Create a synthetic non-redundant data set consisting of multiple subspaces containing Gaussian clusters (called clustered spaces). You can also create subspaces with a single Gaussian cluster (called noise space). The sklearn method make_blobs is used to create the clusters. The dimensionality of the subspaces is specified by the subspace_features parameter. It can be an integer, where the dimensionality is the same for all subspaces, or it can be a list. Additionally, one can specify the number of outliers for each subspace. Outliers will be created using a uniform distribution using the box parameter as limits. If outliers are used, the number of samples within the clusters is reduced accordingly. The standard deviation and the bounding box can be specified either for each subspace individually or a single value will be shared across all spaces.

Parameters:
  • n_samples (int) – Number of samples in the clusters. If n_samples is int, the samples will be equally divided across all clusters in each subspace. Otherwise, a tuple of tuples (e.g. ((100, 200, 700), (300,300,400), (300,300,400))) can specify the size of each cluster in each subspace individually. Beware that the overall number of samples (including outliers) must be equal for each subspace (default: 1000)

  • n_clusters (tuple) – Specifies the number of clusters for each subspace (default: (3, 3, 1))

  • subspace_features (tuple) – Number of features in each subspace (default: (2, 2, 2))

  • n_outliers (tuple) – Number of outliers for each subspace. Overall number of samples will be n_samples + n_outliers. Beware that n_samples + n_outliers must be equal for each subspace (default: (0, 0, 0))

  • std (float) – Standard deviation of the Gaussian clusters. Can be a list specifying an individual value for each subspace (default: 1.)

  • box (tuple) – The bounding box of the cluster centers. Can be a list specifying an individual value for each subspace (default: (-10, 10))

  • rotate_space (bool) – Specifies whether the feature space should be rotated by an orthonormal matrix (default: True)

  • random_state (int / np.random.RandomState) – The random state (default: None)

Returns:

data, labels – the data numpy array (n_samples x sum(subspace_features)), the labels numpy array (n_samples x len(subspace_features))

Return type:

(np.ndarray, np.ndarray)
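
The construction described above can be approximated with a simplified sketch. It is not the library's implementation: it draws the Gaussian clusters directly with NumPy instead of calling make_blobs, ignores n_outliers, and accepts only scalar std and box values. The function name sketch_nr_data and the seed parameter are hypothetical.

```python
import numpy as np


def sketch_nr_data(n_samples=300, n_clusters=(3, 3, 1), subspace_features=(2, 2, 2),
                   std=1.0, box=(-10, 10), rotate_space=True, seed=0):
    """Simplified non-redundant data: one block of Gaussian clusters per subspace."""
    rng = np.random.default_rng(seed)
    blocks, label_columns = [], []
    for k, dims in zip(n_clusters, subspace_features):
        # Cluster centers are drawn uniformly inside the bounding box.
        centers = rng.uniform(box[0], box[1], size=(k, dims))
        # Samples are divided (nearly) equally across the k clusters.
        labels = rng.permutation(np.arange(n_samples) % k)
        blocks.append(centers[labels] + rng.normal(0.0, std, size=(n_samples, dims)))
        label_columns.append(labels)
    data = np.hstack(blocks)
    if rotate_space:
        # The Q factor of a random square matrix is orthonormal.
        q, _ = np.linalg.qr(rng.normal(size=(data.shape[1], data.shape[1])))
        data = data @ q
    return data, np.column_stack(label_columns)


data, labels = sketch_nr_data()
```

Each subspace contributes one label column, so the labels have shape (n_samples, len(subspace_features)). A subspace with a single cluster, such as the third one here, acts as a noise space. The rotation mixes the subspaces across all features without changing distances between samples.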

clustpy.data.synthetic_data_creator.create_subspace_data(n_samples: int = 1000, n_clusters: int = 3, subspace_features: tuple = (2, 2), n_outliers: tuple = (0, 0), std: float = 1.0, box: tuple = (-10, 10), rotate_space: bool = True, random_state: int | None = None) -> (<class 'numpy.ndarray'>, <class 'numpy.ndarray'>)[source]

Create a synthetic subspace data set consisting of a subspace containing multiple Gaussian clusters (called clustered space) and a subspace containing a single Gaussian cluster (called noise space). This method is a special case of the create_nr_data method using only a single clustered space. See create_nr_data for more information.

Parameters:
  • n_samples (int) – Number of samples in the clusters. If n_samples is int, the samples will be equally divided across all clusters. Otherwise, a tuple (e.g. (100, 200, 700)) can specify the size of each cluster individually (default: 1000)

  • n_clusters (int) – Specifies the number of clusters in the clustered space (default: 3)

  • subspace_features (tuple) – Number of features in each of the two subspaces (default: (2, 2))

  • n_outliers (tuple) – Number of outliers for each subspace. Overall number of samples will be n_samples + n_outliers. Beware that n_samples + n_outliers must be equal for both subspaces (default: (0, 0))

  • std (float) – Standard deviation of the Gaussian clusters. Can be a list specifying an individual value for each subspace (default: 1.)

  • box (tuple) – The bounding box of the cluster centers. Can be a list specifying an individual value for each subspace (default: (-10, 10))

  • rotate_space (bool) – Specifies whether the feature space should be rotated by an orthonormal matrix (default: True)

  • random_state (int / np.random.RandomState) – The random state (default: None)

Returns:

data, labels – the data numpy array (n_samples x sum(subspace_features)), the labels numpy array (n_samples)

Return type:

(np.ndarray, np.ndarray)
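
The relation to create_nr_data can be sketched as an argument translation: one clustered space followed by a noise space with a single cluster. The helper as_nr_arguments is hypothetical and only illustrates the special case described above; how the library performs the delegation internally is an assumption.

```python
def as_nr_arguments(n_clusters=3, subspace_features=(2, 2), n_outliers=(0, 0)):
    """Translate subspace-data arguments into the per-subspace tuple form."""
    if len(subspace_features) != 2 or len(n_outliers) != 2:
        raise ValueError("expected two subspaces: clustered space and noise space")
    return {
        # The noise space is a subspace that contains exactly one cluster.
        "n_clusters": (n_clusters, 1),
        "subspace_features": tuple(subspace_features),
        "n_outliers": tuple(n_outliers),
    }


nr_arguments = as_nr_arguments(n_clusters=4, subspace_features=(3, 2))
```

Since the noise space carries no cluster structure, a single label per sample is sufficient, which is consistent with the one-dimensional labels array returned here.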

Module contents