clustpy.data package

Submodules

clustpy.data.preprocessing module

class clustpy.data.preprocessing.ZNormalizer(feature_or_channel_wise: bool = False)

Bases: TransformerMixin, BaseEstimator

Normalize a data set by calculating (data - mean) / std. Two normalization strategies are sensible: compute a single mean and std over all features at once, or normalize each feature separately. For image data, a feature-wise transformation usually corresponds to a channel-wise transformation. If this normalizer is applied to RGB image data, the color channels should be in the first dimension, known as CHW representation.

Parameters: feature_or_channel_wise (bool) – Specifies if all data should be used for the normalization or if a feature-/channel-wise normalization should be applied (default: False)

shape

Shape of the data set with which this normalizer has been fitted

Type: list

mean

Mean value(s) of the data set

Type: np.ndarray or int

std

Standard deviation value(s) of the data set

Type: np.ndarray or int
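The two strategies described for the ZNormalizer can be illustrated with a minimal plain-Python sketch. It is not clustpy's implementation: the helper names `z_normalize_all` and `z_normalize_feature_wise` are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def z_normalize_all(rows):
    """Normalize with one mean and one std computed over every value."""
    values = [v for row in rows for v in row]
    mean, std = fmean(values), pstdev(values)  # assumption: population std
    return [[(v - mean) / std for v in row] for row in rows]


def z_normalize_feature_wise(rows):
    """Normalize each column (feature) with its own mean and std."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Two features on very different scales.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
global_result = z_normalize_all(data)
feature_result = z_normalize_feature_wise(data)
```

With the feature-wise variant both columns end up on the same scale, whereas the global variant keeps the scale difference between the two features.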

fit(X: ndarray, y: ndarray = None) → ZNormalizer

Compute the mean and std values of the input data set.

Parameters:

X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the ZNormalizer

Return type:

ZNormalizer

inverse_transform(X: ndarray) → ndarray

Invert the transformation by applying (data * std) + mean.

Parameters: X (np.ndarray) – the given data set
Returns: X_out – The transformed data set
Return type: np.ndarray

transform(X: ndarray) → ndarray

Transform the given data set using the fitted mean and std values.

Parameters: X (np.ndarray) – the given data set
Returns: X_out – The transformed data set
Return type: np.ndarray
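How fit, transform and inverse_transform relate can be sketched with a small stand-in class. `TinyZNormalizer` is a hypothetical name used only for illustration; it handles a flat list of values and assumes the population standard deviation, so it is not the ZNormalizer itself.

```python
from statistics import fmean, pstdev


class TinyZNormalizer:
    """Hypothetical stand-in mirroring fit / transform / inverse_transform."""

    def fit(self, values):
        # Store the statistics of the data the normalizer is fitted on.
        self.mean = fmean(values)
        self.std = pstdev(values)
        return self  # returning self allows chaining

    def transform(self, values):
        # (data - mean) / std
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, values):
        # (data * std) + mean
        return [v * self.std + self.mean for v in values]


original = [2.0, 4.0, 6.0, 8.0]
normalizer = TinyZNormalizer().fit(original)
normalized = normalizer.transform(original)
restored = normalizer.inverse_transform(normalized)
```

Applying inverse_transform to the transformed values recovers the original data up to floating-point error.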

clustpy.data.preprocessing.z_normalization(X: ndarray, feature_or_channel_wise: bool = False) → ndarray

Wrapper for the ZNormalizer. It automatically executes: X_transform = ZNormalizer(feature_or_channel_wise).fit_transform(X)

Parameters:

X (np.ndarray) – the given data set
feature_or_channel_wise (bool) – Specifies if all data should be used for the normalization or if a feature-/channel-wise normalization should be applied (default: False)

Returns:

X_transform – The transformed data set

Return type:

np.ndarray

clustpy.data.real_clustpy_data module

clustpy.data.real_clustpy_data.load_aloi_small(return_X_y: bool = False) → Bunch

Load a subset of the Amsterdam Library of Object Images (ALOI) consisting of 288 images of the objects red ball, red cylinder, green ball and green cylinder. The two label sets are cylinder/ball and red/green. N=288, d=611, k=[2,2].

Parameters: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (288 x 611), the labels numpy array (288 x 2)
Return type: Bunch

References

https://aloi.science.uva.nl/

and

Ye, Wei, et al. “Generalized independent subspace clustering.” 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016.
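The notation k=[2,2] together with a labels array of shape (288 x 2) means that every sample carries one label per label set. The following sketch uses made-up label rows; the column order and the integer encoding are assumptions, and `clusters_per_label_set` is a hypothetical helper.

```python
# Hypothetical label rows: one column per label set.
# Assumption: column 0 is the shape labeling, column 1 the color labeling.
labels = [
    (0, 0),  # e.g. ball, red
    (0, 1),  # e.g. ball, green
    (1, 0),  # e.g. cylinder, red
    (1, 1),  # e.g. cylinder, green
]


def clusters_per_label_set(label_rows):
    """Count the distinct cluster ids in every label column."""
    return [len(set(column)) for column in zip(*label_rows)]


k = clusters_per_label_set(labels)
# Select a single labeling by picking one column.
shape_labels = [row[0] for row in labels]
```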

clustpy.data.real_clustpy_data.load_fruit(return_X_y: bool = False) → Bunch

Load the fruits data set. It consists of 105 preprocessed images of apples, bananas and grapes in red, green and yellow. N=105, d=6, k=[3,3].

Parameters: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (105 x 6), the labels numpy array (105 x 2)
Return type: Bunch

References

Hu, Juhua, et al. “Finding multiple stable clusterings.” Knowledge and Information Systems 51.3 (2017): 991-1021.

clustpy.data.real_clustpy_data.load_nrletters(return_X_y: bool = False) → Bunch

Load the NRLetters data set. It consists of 10000 9x7 images of the letters A, B, C, X, Y and Z in pink, cyan and yellow. Additionally, each image highlights one corner in color. N=10000, d=189, k=[6,3,4].

Parameters: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (10000 x 189), the labels numpy array (10000 x 3)
Return type: Bunch

References

Leiber, Collin, et al. “Automatic Parameter Selection for Non-Redundant Clustering.” Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2022.
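The difference between the CHW layout of ‘images’ and the HWC layout of the flattened ‘data’ can be sketched in plain Python. The toy image below only encodes index positions; which of the two sizes in "9x7" is the height is an assumption, and `flatten_chw_as_hwc` is a hypothetical helper, not part of clustpy.

```python
CHANNELS, HEIGHT, WIDTH = 3, 7, 9  # assumed orientation of a "9x7" color image

# Toy CHW image whose entries record their own (channel, row, column) position.
chw_image = [
    [[(c, h, w) for w in range(WIDTH)] for h in range(HEIGHT)]
    for c in range(CHANNELS)
]


def flatten_chw_as_hwc(image):
    """Flatten a CHW nested list so that the channel index varies fastest."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    return [
        image[c][h][w]
        for h in range(height)
        for w in range(width)
        for c in range(channels)
    ]


flat = flatten_chw_as_hwc(chw_image)
```

The flattened vector has 3 * 7 * 9 = 189 entries, matching d=189, and the three channel values of one pixel sit next to each other.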

clustpy.data.real_clustpy_data.load_stickfigures(return_X_y: bool = False) → Bunch

Load the Dancing Stick Figures data set. It consists of 900 20x20 grayscale images of stick figures in different poses. The poses can be divided into three upper-body and three lower-body motions. N=900, d=400, k=[3,3].

Parameters: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (900 x 400), the labels numpy array (900 x 2)
Return type: Bunch

References

Günnemann, Stephan, et al. “Smvc: semi-supervised multi-view clustering in subspace projections.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014.

clustpy.data.real_medical_mnist_data module

clustpy.data.real_medical_mnist_data.load_adrenal_mnist_3d(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the AdrenalMNIST3D data set. It consists of 1584 28x28x28 grayscale images belonging to one of 2 classes. The data set is composed of 1188 training, 98 validation and 298 test samples. N=1584, d=21952, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1584 x 21952), the labels numpy array (1584)

Return type:

Bunch

References

https://medmnist.com/
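The feature dimension d quoted for the MedMNIST loaders is the number of values in one flattened image. A short sketch of the arithmetic follows; `flattened_dimension` is a hypothetical helper, and three color channels for colored images is an assumption consistent with d=2352.

```python
from math import prod


def flattened_dimension(shape):
    """Number of features after flattening an image of the given shape."""
    return prod(shape)


d_gray_2d = flattened_dimension((28, 28))      # 28x28 grayscale image
d_color_2d = flattened_dimension((28, 28, 3))  # 28x28 image, three channels assumed
d_gray_3d = flattened_dimension((28, 28, 28))  # 28x28x28 grayscale volume
```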

clustpy.data.real_medical_mnist_data.load_blood_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the BloodMNIST data set. It consists of 17092 28x28 colored images belonging to one of 8 classes. The data set is composed of 11959 training, 1712 validation and 3421 test samples. N=17092, d=2352, k=8.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (17092 x 2352), the labels numpy array (17092)

Return type:

Bunch

References

https://medmnist.com/

Andrea Acevedo, Anna Merino, et al., “A dataset of microscopic peripheral blood cell images for development of automatic recognition systems,” Data in Brief, vol. 30, pp. 105474, 2020.
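The documented default of the downloads_path parameter can be pictured with the sketch below. It assumes that [USER] refers to the current user's home directory; `resolve_downloads_path` and the custom path are hypothetical and not part of clustpy.

```python
from pathlib import Path


def resolve_downloads_path(downloads_path=None):
    """Return the given directory, or the assumed default below the home directory."""
    if downloads_path is None:
        # Assumption: [USER] in the documented default is the home directory.
        return Path.home() / "Downloads" / "clustpy_datafiles"
    return Path(downloads_path)


default_dir = resolve_downloads_path()
custom_dir = resolve_downloads_path("/tmp/my_clustpy_data")
```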

clustpy.data.real_medical_mnist_data.load_breast_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the BreastMNIST data set. It consists of 780 28x28 grayscale images belonging to one of 2 classes. The data set is composed of 546 training, 78 validation and 156 test samples. N=780, d=784, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (780 x 784), the labels numpy array (780)

Return type:

Bunch

References

https://medmnist.com/

Walid Al-Dhabyani, Mohammed Gomaa, et al., “Dataset of breast ultrasound images,” Data in Brief, vol. 28, pp. 104863, 2020.
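The effect of the subset parameter on the number of returned samples can be sketched with the split sizes quoted for BreastMNIST. `subset_size` is a hypothetical helper, and raising a ValueError for an unknown subset is an assumption rather than documented behavior.

```python
# Split sizes quoted in the BreastMNIST description.
SPLIT_SIZES = {"train": 546, "val": 78, "test": 156}


def subset_size(subset="all"):
    """Return the number of samples selected by the subset argument."""
    if subset == "all":
        # 'all' combines training, validation and test data.
        return sum(SPLIT_SIZES.values())
    if subset not in SPLIT_SIZES:
        raise ValueError(
            f"subset must be 'all', 'test', 'train' or 'val', got {subset!r}"
        )
    return SPLIT_SIZES[subset]


n_all = subset_size("all")
n_val = subset_size("val")
```

The combined size equals N=780, so the three splits partition the data set.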

clustpy.data.real_medical_mnist_data.load_chest_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the ChestMNIST data set. It consists of 112120 28x28 grayscale images. The ground truth labels consist of 14 labelings with 2 clusters each. The data set is composed of 78468 training, 11219 validation and 22433 test samples. N=112120, d=784, k=[2,2,2,2,2,2,2,2,2,2,2,2,2,2].

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (112120 x 784), the labels numpy array (112120)

Return type:

Bunch

References

https://medmnist.com/

Xiaosong Wang, Yifan Peng, et al., “Chest x-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” in CVPR, 2017, pp. 3462–3471.

clustpy.data.real_medical_mnist_data.load_derma_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the DermaMNIST data set. It consists of 10015 28x28 colored images belonging to one of 7 classes. The data set is composed of 7007 training, 1003 validation and 2005 test samples. N=10015, d=2352, k=7.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (10015 x 2352), the labels numpy array (10015)

Return type:

Bunch

References

https://medmnist.com/

Philipp Tschandl, Cliff Rosendahl, et al., “The ham10000 dataset, a large collection of multisource dermatoscopic images of common pigmented skin lesions,” Scientific data, vol. 5, pp. 180161, 2018.

Noel Codella, Veronica Rotemberg, et al., “Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)”, 2018, arXiv:1902.03368.

clustpy.data.real_medical_mnist_data.load_fracture_mnist_3d(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the FractureMNIST3D data set. It consists of 1370 28x28x28 grayscale images belonging to one of 3 classes. The data set is composed of 1027 training, 103 validation and 240 test samples. N=1370, d=21952, k=3.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1370 x 21952), the labels numpy array (1370)

Return type:

Bunch

References

https://medmnist.com/

Liang Jin, Jiancheng Yang, et al., “Deep-learning-assisted detection and segmentation of rib fractures from ct scans: Development and validation of fracnet,” EBioMedicine, vol. 62, pp. 103106, 2020.

clustpy.data.real_medical_mnist_data.load_nodule_mnist_3d(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the NoduleMNIST3D data set. It consists of 1633 28x28x28 grayscale images belonging to one of 2 classes. The data set is composed of 1158 training, 165 validation and 310 test samples. N=1633, d=21952, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1633 x 21952), the labels numpy array (1633)

Return type:

Bunch

References

https://medmnist.com/

Samuel G. Armato III, Geoffrey McLennan, et al., “The lung image database consortium (lidc) and image database resource initiative (idri): A completed reference database of lung nodules on ct scans,” Medical Physics, vol. 38, no. 2, pp. 915–931, 2011.

clustpy.data.real_medical_mnist_data.load_oct_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the OCTMNIST data set. It consists of 109309 28x28 grayscale images belonging to one of 4 classes. The data set is composed of 97477 training, 10832 validation and 1000 test samples. N=109309, d=784, k=4.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (109309 x 784), the labels numpy array (109309)

Return type:

Bunch

References

https://medmnist.com/

Daniel S. Kermany, Michael Goldbaum, et al., “Identifying medical diagnoses and treatable diseases by image-based deep learning,” Cell, vol. 172, no. 5, pp. 1122 – 1131.e9, 2018.

clustpy.data.real_medical_mnist_data.load_organ_a_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the OrganAMNIST data set. It consists of 58850 28x28 grayscale images belonging to one of 11 classes. The data set is composed of 34581 training, 6491 validation and 17778 test samples. N=58850, d=784, k=11.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (58850 x 784), the labels numpy array (58850)

Return type:

Bunch

References

https://medmnist.com/

Patrick Bilic, Patrick Ferdinand Christ, et al., “The liver tumor segmentation benchmark (lits),” arXiv preprint arXiv:1901.04056, 2019.

Xuanang Xu, Fugen Zhou, et al., “Efficient multiple organ localization in ct image using 3d region proposal network,” IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1885–1898, 2019.

clustpy.data.real_medical_mnist_data.load_organ_c_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the OrganCMNIST data set. It consists of 23660 28x28 grayscale images belonging to one of 11 classes. The data set is composed of 13000 training, 2392 validation and 8268 test samples. N=23660, d=784, k=11.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (23660 x 784), the labels numpy array (23660)

Return type:

Bunch

References

https://medmnist.com/

Patrick Bilic, Patrick Ferdinand Christ, et al., “The liver tumor segmentation benchmark (lits),” arXiv preprint arXiv:1901.04056, 2019.

Xuanang Xu, Fugen Zhou, et al., “Efficient multiple organ localization in ct image using 3d region proposal network,” IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1885–1898, 2019.

clustpy.data.real_medical_mnist_data.load_organ_mnist_3d(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the OrganMNIST3D data set. It consists of 1743 28x28x28 grayscale images belonging to one of 11 classes. The data set is composed of 972 training, 161 validation and 610 test samples. N=1743, d=21952, k=11.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1743 x 21952), the labels numpy array (1743)

Return type:

Bunch

References

https://medmnist.com/

Patrick Bilic, Patrick Ferdinand Christ, et al., “The liver tumor segmentation benchmark (lits),” arXiv preprint arXiv:1901.04056, 2019.

Xuanang Xu, Fugen Zhou, et al., “Efficient multiple organ localization in ct image using 3d region proposal network,” IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1885–1898, 2019.

clustpy.data.real_medical_mnist_data.load_organ_s_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the OrganSMNIST data set. It consists of 25221 28x28 grayscale images belonging to one of 11 classes. The data set is composed of 13940 training, 2452 validation and 8829 test samples. N=25221, d=784, k=11.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (25221 x 784), the labels numpy array (25221)

Return type:

Bunch

References

https://medmnist.com/

Patrick Bilic, Patrick Ferdinand Christ, et al., “The liver tumor segmentation benchmark (lits),” arXiv preprint arXiv:1901.04056, 2019.

Xuanang Xu, Fugen Zhou, et al., “Efficient multiple organ localization in ct image using 3d region proposal network,” IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1885–1898, 2019.

clustpy.data.real_medical_mnist_data.load_path_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the PathMNIST data set. It consists of 107180 28x28 colored images belonging to one of 9 classes. The data set is composed of 89996 training, 10004 validation and 7180 test samples. N=107180, d=2352, k=9.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (107180 x 2352), the labels numpy array (107180)

Return type:

Bunch

References

https://medmnist.com/

Jakob Nikolas Kather, Johannes Krisam, et al., “Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study,” PLOS Medicine, vol. 16, no. 1, pp. 1–22, 01 2019.

clustpy.data.real_medical_mnist_data.load_pneumonia_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the PneumoniaMNIST data set. It consists of 5856 28x28 grayscale images belonging to one of 2 classes. The data set is composed of 4708 training, 524 validation and 624 test samples. N=5856, d=784, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (5856 x 784), the labels numpy array (5856)

Return type:

Bunch

References

https://medmnist.com/

Daniel S. Kermany, Michael Goldbaum, et al., “Identifying medical diagnoses and treatable diseases by image-based deep learning,” Cell, vol. 172, no. 5, pp. 1122 – 1131.e9, 2018.

clustpy.data.real_medical_mnist_data.load_retina_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the RetinaMNIST data set. It consists of 1600 28x28 colored images belonging to one of 5 classes. The data set is composed of 1080 training, 120 validation and 400 test samples. N=1600, d=2352, k=5.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1600 x 2352), the labels numpy array (1600)

Return type:

Bunch

References

https://medmnist.com/

DeepDR Diabetic Retinopathy Image Dataset (DeepDRiD), “The 2nd diabetic retinopathy grading and image quality estimation challenge,” https://isbi.deepdr.org/data.html, 2020.

clustpy.data.real_medical_mnist_data.load_synapse_mnist_3d(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the SynapseMNIST3D data set. It consists of 1759 28x28x28 grayscale images belonging to one of 2 classes. The data set is composed of 1230 training, 177 validation and 352 test samples. N=1759, d=21952, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1759 x 21952), the labels numpy array (1759)

Return type:

Bunch

References

https://medmnist.com/

clustpy.data.real_medical_mnist_data.load_tissue_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the TissueMNIST data set. It consists of 236386 28x28 grayscale images belonging to one of 8 classes. The data set is composed of 165466 training, 23640 validation and 47280 test samples. N=236386, d=784, k=8.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (236386 x 784), the labels numpy array (236386)

Return type:

Bunch

References

https://medmnist.com/

Vebjorn Ljosa, Katherine L Sokolnicki, et al., “Annotated high-throughput microscopy image sets for validation,” Nature methods, vol. 9, no. 7, pp. 637–637, 2012.

clustpy.data.real_medical_mnist_data.load_vessel_mnist_3d(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch

Load the VesselMNIST3D data set. It consists of 1909 28x28x28 grayscale images belonging to one of 2 classes. The data set is composed of 1335 training, 192 validation and 382 test samples. N=1909, d=21952, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1909 x 21952), the labels numpy array (1909)

Return type:

Bunch

References

https://medmnist.com/

Xi Yang, Ding Xia, et al., “Intra: 3d intracranial aneurysm dataset for deep learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

clustpy.data.real_timeseries_data module

clustpy.data.real_timeseries_data.load_diatom_size_reduction(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the diatom size reduction data set. It consists of 322 samples belonging to one of 4 classes. The data set is composed of 16 training and 306 test samples. N=322, d=345, k=4.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (322 x 345), the labels numpy array (322)

Return type:

Bunch
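
A possible usage sketch for the return_X_y option, based only on the signature documented above. The loader is imported lazily inside a function because calling it requires clustpy and may download files; shapes_match and load_diatom_train are hypothetical helper names, and the stand-in data below is not real measurement data:

```python
def shapes_match(X, y, n_expected, d_expected):
    """True if X has n_expected rows of d_expected values and one label per row."""
    return (
        len(X) == n_expected
        and all(len(row) == d_expected for row in X)
        and len(y) == n_expected
    )


def load_diatom_train():
    """Load the training split as (data, labels); needs clustpy installed."""
    # Imported lazily so the helper above works without clustpy
    from clustpy.data.real_timeseries_data import load_diatom_size_reduction
    return load_diatom_size_reduction(subset="train", return_X_y=True)


# Stand-in data with the documented training shape (16 x 345)
fake_X = [[0.0] * 345 for _ in range(16)]
fake_y = [0] * 16
ok = shapes_match(fake_X, fake_y, 16, 345)
print(ok)
```

With clustpy available, shapes_match(*load_diatom_train(), 16, 345) would be expected to hold if the documented split sizes are accurate.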

References

http://www.timeseriesclassification.com/description.php?Dataset=DiatomSizeReduction

clustpy.data.real_timeseries_data.load_lsst(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the LSST data set. It consists of 4925 samples belonging to one of 14 classes. The data set is composed of 2459 training and 2466 test samples. N=4925, d=216, k=14.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (4925 x 216), the labels numpy array (4925)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=LSST

clustpy.data.real_timeseries_data.load_motestrain(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the motestrain data set. It consists of 1272 samples belonging to one of 2 classes. The data set is composed of 20 training and 1252 test samples. N=1272, d=84, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1272 x 84), the labels numpy array (1272)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=MoteStrain

clustpy.data.real_timeseries_data.load_olive_oil(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the OliveOil data set. It consists of 60 samples belonging to one of 4 classes. The data set is composed of 30 training and 30 test samples. N=60, d=570, k=4.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (60 x 570), the labels numpy array (60)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=OliveOil

clustpy.data.real_timeseries_data.load_plane(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the plane data set. It consists of 210 samples belonging to one of 7 classes. The data set is composed of 105 training and 105 test samples. N=210, d=144, k=7.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (210 x 144), the labels numpy array (210)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=Plane

clustpy.data.real_timeseries_data.load_proximal_phalanx_outline(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the proximal phalanx outline data set. It consists of 876 samples belonging to one of 2 classes. The data set is composed of 600 training and 276 test samples. N=876, d=80, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (876 x 80), the labels numpy array (876)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=ProximalPhalanxOutlineCorrect

clustpy.data.real_timeseries_data.load_sony_aibo_robot_surface(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Sony AIBO Robot Surface 1 data set. It consists of 621 samples belonging to one of 2 classes. The data set is composed of 20 training and 601 test samples. N=621, d=70, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (621 x 70), the labels numpy array (621)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=SonyAIBORobotSurface1

clustpy.data.real_timeseries_data.load_symbols(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the symbols data set. It consists of 1020 samples belonging to one of 6 classes. The data set is composed of 25 training and 995 test samples. N=1020, d=398, k=6.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1020 x 398), the labels numpy array (1020)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=Symbols

clustpy.data.real_timeseries_data.load_two_patterns(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the two patterns data set. It consists of 5000 samples belonging to one of 4 classes. The data set is composed of 1000 training and 4000 test samples. N=5000, d=128, k=4.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (5000 x 128), the labels numpy array (5000)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=TwoPatterns

clustpy.data.real_torchvision_data module

clustpy.data.real_torchvision_data.load_cifar10(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the CIFAR10 data set. It consists of 60000 32x32 color images showing different objects. The classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The data set is composed of 50000 training and 10000 test images. N=60000, d=3072, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (60000 x 3072), the labels numpy array (60000)

Return type:

Bunch
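
The HWC versus CHW note can be made concrete with index arithmetic. The sketch below reorders one flattened HWC image into flattened CHW order in plain Python; it assumes row-major flattening, which the text above does not state explicitly, and hwc_to_chw is a hypothetical helper rather than a clustpy function:

```python
def hwc_to_chw(flat, height, width, channels):
    """Reorder one flattened HWC image into flattened CHW order."""
    if len(flat) != height * width * channels:
        raise ValueError("length does not match height * width * channels")
    out = [0] * len(flat)
    for h in range(height):
        for w in range(width):
            for c in range(channels):
                src = (h * width + w) * channels + c  # position in HWC layout
                dst = (c * height + h) * width + w    # position in CHW layout
                out[dst] = flat[src]
    return out


# Tiny 2x2 RGB image: pixel p holds the values (10*p, 10*p + 1, 10*p + 2)
tiny = [10 * p + c for p in range(4) for c in range(3)]
chw = hwc_to_chw(tiny, height=2, width=2, channels=3)
print(chw)
```

With NumPy, the equivalent for a whole CIFAR10 data matrix X would be X.reshape(-1, 32, 32, 3).transpose(0, 3, 1, 2), under the same row-major assumption.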

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.CIFAR10.html

and

Krizhevsky, Alex, and Geoffrey Hinton. “Learning multiple layers of features from tiny images.” (2009): 7.

clustpy.data.real_torchvision_data.load_cifar100(subset: str = 'all', use_superclasses: bool = False, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the CIFAR100 data set. It consists of 60000 32x32 color images showing different objects. A total of 100 classes are included, each depicting a specific type of object. Each class contains 600 images. If use_superclasses is True, the 20 superclasses are used instead of the 100 regular classes. The data set is composed of 50000 training and 10000 test images. N=60000, d=3072, k=100 (k=20 if use_superclasses is True).
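
Conceptually, switching to superclasses replaces each fine label by the label of the group it belongs to. A minimal sketch of such a relabeling; the mapping below is a toy example for illustration only and is NOT the real CIFAR100 class assignment, and to_superclass is a hypothetical helper:

```python
def to_superclass(fine_labels, fine_to_coarse):
    """Replace each fine label by its superclass label via a lookup table."""
    return [fine_to_coarse[label] for label in fine_labels]


# Toy mapping (hypothetical): four fine classes grouped into two superclasses
toy_map = {0: 0, 1: 0, 2: 1, 3: 1}
coarse = to_superclass([0, 2, 3, 1], toy_map)
print(coarse)

# 100 fine classes spread evenly over 20 superclasses gives 5 per superclass
classes_per_superclass = 100 // 20
```

In practice, setting use_superclasses=True should make this step unnecessary, since the loader then returns the superclass labels directly.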

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
use_superclasses (bool) – If set to True, the 20 superclasses are used instead of the 100 regular classes (default: False)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (60000 x 3072), the labels numpy array (60000)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.CIFAR100.html

and

Krizhevsky, Alex, and Geoffrey Hinton. “Learning multiple layers of features from tiny images.” (2009): 7.

clustpy.data.real_torchvision_data.load_fmnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Fashion-MNIST data set. It consists of 70000 28x28 grayscale images showing articles from the Zalando online store. Each sample belongs to one of 10 product groups. The data set is composed of 60000 training and 10000 test images. N=70000, d=784, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (70000 x 784), the labels numpy array (70000)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.FashionMNIST.html

and

Xiao, Han, Kashif Rasul, and Roland Vollgraf. “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms.” arXiv preprint arXiv:1708.07747 (2017).

clustpy.data.real_torchvision_data.load_gtsrb(subset: str = 'all', image_size: tuple = (32, 32), return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the GTSRB (German Traffic Sign Recognition Benchmark) data set. It consists of 39270 color images showing 43 different traffic signs. Example classes are: stop sign, speed limit 50 sign, speed limit 70 sign, construction site sign and many others. The data set is composed of 26640 training and 12630 test images. N=39270, d=image_size[0]*image_size[1]*3, k=43.
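
Because d depends on the image_size parameter, it can help to compute the expected number of columns before loading. A small sketch of the formula d=image_size[0]*image_size[1]*3; the function name gtsrb_feature_dim is hypothetical:

```python
def gtsrb_feature_dim(image_size=(32, 32), channels=3):
    """Number of features per flattened color image after resizing."""
    width, height = image_size  # the parameter is documented as (width, height)
    if width <= 0 or height <= 0:
        raise ValueError("image_size entries must be positive")
    return width * height * channels


default_d = gtsrb_feature_dim()
larger_d = gtsrb_feature_dim((48, 64))
print(default_d, larger_d)
```

With the default image_size the data matrix should therefore have the same number of columns as CIFAR10.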

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
image_size (tuple) – the images of various sizes must be converted into a coherent size. The tuple equals (width, height) of the images (default: (32, 32))
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (39270 x image_size[0]*image_size[1]*3), the labels numpy array (39270)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.GTSRB.html

and

https://benchmark.ini.rub.de/

and

Stallkamp, Johannes, et al. “Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition.” Neural networks 32 (2012): 323-332.

clustpy.data.real_torchvision_data.load_kmnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Kuzushiji-MNIST data set. It consists of 70000 28x28 grayscale images showing handwritten Japanese characters in Kuzushiji (cursive) style. It is composed of 10 different characters, each representing one column of hiragana. The data set is composed of 60000 training and 10000 test images. N=70000, d=784, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (70000 x 784), the labels numpy array (70000)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.KMNIST.html

and

Clanuwat, Tarin, et al. “Deep learning for classical japanese literature.” arXiv preprint arXiv:1812.01718 (2018).

clustpy.data.real_torchvision_data.load_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the MNIST data set. It consists of 70000 28x28 grayscale images showing handwritten digits (0 to 9). The data set is composed of 60000 training and 10000 test images. N=70000, d=784, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (70000 x 784), the labels numpy array (70000)

Return type:

Bunch
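
Each row of the data array is one flattened image with d=784 values. A minimal sketch of turning such a row back into a 28x28 grid, assuming row-major flattening (an assumption, since the ordering is not stated above); unflatten is a hypothetical helper:

```python
def unflatten(row, height, width):
    """Split a flat pixel list into `height` rows of `width` pixels each."""
    if len(row) != height * width:
        raise ValueError("row length does not match height * width")
    return [row[r * width:(r + 1) * width] for r in range(height)]


# Small 2x3 stand-in instead of a real 28x28 digit
grid = unflatten([1, 2, 3, 4, 5, 6], height=2, width=3)
print(grid)

# A row of MNIST-sized length yields 28 rows of 28 pixels
mnist_like = unflatten([0] * 784, height=28, width=28)
```

For grayscale images there is only one channel, so the HWC and CHW orderings of the pixel values coincide.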

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html

and

LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.

clustpy.data.real_torchvision_data.load_stl10(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the STL10 data set. It consists of 13000 96x96 color images showing different objects. The classes are airplane, bird, car, cat, deer, dog, horse, monkey, ship and truck. The data set is composed of 5000 training and 8000 test images. N=13000, d=27648, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (13000 x 27648), the labels numpy array (13000)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.STL10.html

and

Coates, Adam, Andrew Ng, and Honglak Lee. “An analysis of single-layer networks in unsupervised feature learning.” Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2011.

clustpy.data.real_torchvision_data.load_svhn(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the SVHN data set. It consists of 99289 32x32 color images showing house numbers (0 to 9). The data set is composed of 73257 training and 26032 test images. N=99289, d=3072, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (99289 x 3072), the labels numpy array (99289)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.SVHN.html

and

Netzer, Yuval, et al. “Reading digits in natural images with unsupervised feature learning.” (2011).

clustpy.data.real_torchvision_data.load_usps(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the USPS data set. It consists of 9298 16x16 grayscale images showing handwritten digits (0 to 9). The data set is composed of 7291 training and 2007 test images. N=9298, d=256, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (9298 x 256), the labels numpy array (9298)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.USPS.html

and

Hull, Jonathan J. “A database for handwritten text recognition research.” IEEE Transactions on pattern analysis and machine intelligence 16.5 (1994): 550-554.

clustpy.data.real_uci_data module

clustpy.data.real_uci_data.load_banknotes(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the banknote authentication data set. It consists of 1372 genuine and forged banknote samples. N=1372, d=4, k=2.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1372 x 4), the labels numpy array (1372)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/banknote+authentication

clustpy.data.real_uci_data.load_breast_cancer_wisconsin_original(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the original breast cancer Wisconsin data set. It consists of 699 samples belonging to one of 2 classes. 16 samples contain ‘?’ values and will be removed. N=683, d=9, k=2.
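
The removal of samples with ‘?’ values can be pictured as a simple row filter. The sketch below illustrates the idea on made-up rows; it is not the loader's actual implementation, and drop_incomplete is a hypothetical helper:

```python
def drop_incomplete(rows, missing="?"):
    """Keep only the rows that contain no missing-value marker."""
    return [row for row in rows if missing not in row]


# Made-up raw rows; two of them contain a '?' entry
raw = [["5", "1", "1"], ["8", "?", "3"], ["2", "1", "?"], ["4", "4", "4"]]
clean = drop_incomplete(raw)
print(clean)

# Same arithmetic as in the description: 699 samples minus 16 incomplete ones
remaining = 699 - 16
```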

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (683 x 9), the labels numpy array (683)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29

clustpy.data.real_uci_data.load_breast_tissue(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the breast tissue data set. It consists of 106 samples belonging to one of 6 classes. N=106, d=9, k=6.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (106 x 9), the labels numpy array (106)

Return type:

Bunch

References

http://archive.ics.uci.edu/ml/datasets/breast+tissue

clustpy.data.real_uci_data.load_cmu_faces(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the CMU Face Images data set. It consists of 640 30x32 grayscale images showing 20 persons in different poses (up, straight, left, right) and with different expressions (neutral, happy, sad, angry). Additionally, the persons can wear sunglasses or not. 16 images show glitches, which is why the final data set only contains 624 images. N=624, d=960, k=[20,4,4,2].

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (624 x 960), the labels numpy array (624 x 4)

Return type:

Bunch
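
Because the labels array has four columns, one column has to be selected before comparing against a single clustering result. A minimal sketch on a made-up miniature target matrix; label_column and n_clusters_per_column are hypothetical helpers, and the assumption that the columns follow the order of k=[20,4,4,2] (person, pose, expression, sunglasses) is not confirmed by the text above:

```python
def label_column(targets, index):
    """Extract one label set (one column) from a multi-label target matrix."""
    return [row[index] for row in targets]


def n_clusters_per_column(targets):
    """Count the distinct labels in every column."""
    n_cols = len(targets[0])
    return [len(set(label_column(targets, j))) for j in range(n_cols)]


# Made-up miniature target matrix with the same 4-column layout
toy_targets = [
    [0, 0, 0, 0],
    [0, 1, 2, 1],
    [1, 2, 1, 0],
    [1, 3, 3, 1],
]
first_column = label_column(toy_targets, 0)
counts = n_clusters_per_column(toy_targets)
print(first_column, counts)
```

On the real target array the per-column counts would be expected to match k=[20,4,4,2].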

References

http://archive.ics.uci.edu/ml/datasets/cmu+face+images

clustpy.data.real_uci_data.load_dermatology(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the dermatology data set. It consists of 366 samples belonging to one of 6 classes. 8 samples contain ‘?’ values and are therefore removed. N=358, d=34, k=6.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (358 x 34), the labels numpy array (358)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/dermatology

clustpy.data.real_uci_data.load_ecoli(ignore_small_clusters: bool = False, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the ecoli data set. It consists of 336 samples belonging to one of 8 classes. N=336, d=7, k=8.

Parameters:

ignore_small_clusters (bool) – specify if the three small clusters with size 2, 2 and 5 should be ignored (default: False)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)
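
One way to picture ignore_small_clusters is as a filter on class sizes. The sketch below drops all classes below a size threshold; the threshold-based rule is an assumption made for illustration (the text above only names the three clusters of size 2, 2 and 5), and drop_small_clusters is a hypothetical helper:

```python
from collections import Counter


def drop_small_clusters(data, labels, min_size):
    """Remove all samples whose class has fewer than min_size members."""
    counts = Counter(labels)
    keep = [i for i, lab in enumerate(labels) if counts[lab] >= min_size]
    return [data[i] for i in keep], [labels[i] for i in keep]


# Made-up labels: one larger class and three small ones of size 2, 2 and 5
labels = ["big"] * 6 + ["s1"] * 2 + ["s2"] * 2 + ["s3"] * 5
data = [[float(i)] for i in range(len(labels))]
kept_data, kept_labels = drop_small_clusters(data, labels, min_size=6)
print(len(kept_data), sorted(set(kept_labels)))
```

By the same arithmetic, removing clusters of size 2, 2 and 5 from the 336 ecoli samples would leave 327 samples in 5 classes.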

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (336 x 7), the labels numpy array (336)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/ecoli

clustpy.data.real_uci_data.load_forest_types(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the forest type mapping data set. It consists of 523 samples belonging to one of 4 classes. The data set is composed of 198 training and 325 test samples. N=523, d=27, k=4.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (523 x 27), the labels numpy array (523)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/Forest+type+mapping

clustpy.data.real_uci_data.load_har(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Human Activity Recognition data set. It consists of 10299 samples each representing sensor data of a person performing an activity. The six activities are walking, walking_upstairs, walking_downstairs, sitting, standing and laying. The data set is composed of 7352 training and 2947 test samples. N=10299, d=561, k=6.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (10299 x 561), the labels numpy array (10299)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones

clustpy.data.real_uci_data.load_htru2(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the HTRU2 data set. It consists of 17898 samples belonging to the pulsar or non-pulsar class. A special property is that more than 90% of the data belongs to class 0. N=17898, d=8, k=2.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (17898 x 8), the labels numpy array (17898)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/HTRU2

clustpy.data.real_uci_data.load_letterrecognition(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Letter Recognition data set. It consists of 20000 samples where each sample represents one of the 26 capital letters in the English alphabet. All samples are composed of 16 numerical stimuli describing the respective letter. N=20000, d=16, k=26.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (20000 x 16), the labels numpy array (20000)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/letter+recognition

clustpy.data.real_uci_data.load_mice_protein(return_additional_labels: bool = False, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Mice Protein Expression data set. It consists of 1077 samples belonging to one of 8 classes. Each feature represents the expression level of one of 77 proteins. Samples containing more than 43 NaN values (3 cases) will be removed. Afterwards, all columns containing NaN values will be removed. This reduces the number of features from 77 to 68. The classes can be further subdivided by using the return_additional_labels parameter. This gives the additional information mouseID, behavior, treatment type and genotype. N=1077, d=68, k=8.

Parameters:

return_additional_labels (bool) – return additional labels (default: False)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1077 x 68), the labels numpy array (1077)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression
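The two-step NaN removal described for load_mice_protein can be illustrated on a toy array. This is only a sketch of the documented procedure, not clustpy's actual code; the function name drop_nan_rows_then_columns and its max_nan_per_sample argument are hypothetical.

```python
import numpy as np


def drop_nan_rows_then_columns(X, max_nan_per_sample=43):
    """Hypothetical helper mirroring the documented two-step NaN removal."""
    # Step 1: drop samples (rows) with more than max_nan_per_sample NaN entries.
    nan_per_row = np.isnan(X).sum(axis=1)
    X = X[nan_per_row <= max_nan_per_sample]
    # Step 2: drop every feature (column) that still contains a NaN.
    keep_columns = ~np.isnan(X).any(axis=0)
    return X[:, keep_columns]


# Toy data: 4 samples, 3 features; the threshold is lowered to 1 for the demo.
toy = np.array([
    [1.0, 2.0, 3.0],
    [np.nan, np.nan, 6.0],   # 2 NaNs: row is removed
    [7.0, np.nan, 9.0],      # 1 NaN: row is kept, column 1 is removed afterwards
    [10.0, 11.0, 12.0],
])
cleaned = drop_nan_rows_then_columns(toy, max_nan_per_sample=1)
print(cleaned.shape)  # (3, 2)
```

The order of the two steps matters: removing the heavily incomplete samples first means fewer columns are lost in the second step.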

clustpy.data.real_uci_data.load_multiple_features(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the multiple features data set. It consists of 2000 samples belonging to one of 10 classes. Each class corresponds to handwritten numerals (0-9) extracted from a collection of Dutch utility maps. N=2000, d=649, k=10.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (2000 x 649), the labels numpy array (2000)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/Multiple+Features

clustpy.data.real_uci_data.load_optdigits(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the optdigits data set. It consists of 5620 8x8 grayscale images, each representing a digit (0 to 9). Each pixel depicts the number of marked pixels within a 4x4 block of the original 32x32 bitmaps. The data set is composed of 3823 training and 1797 test samples. N=5620, d=64, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (5620 x 64), the labels numpy array (5620)

Return type:

Bunch

References

http://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits
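The block-counting step that turns a 32x32 bitmap into an 8x8 optdigits image can be sketched as follows. The data set is distributed in already reduced form, so this is purely illustrative; the function name block_counts and the toy bitmap are assumptions.

```python
import numpy as np


def block_counts(bitmap, block=4):
    """Count marked pixels in each non-overlapping block x block tile."""
    h, w = bitmap.shape
    # Split each axis into (number of tiles, block) and sum inside each tile.
    tiles = bitmap.reshape(h // block, block, w // block, block)
    return tiles.sum(axis=(1, 3))


# Toy 32x32 binary bitmap: the top-left 4x4 block is fully marked,
# plus a single marked pixel in the bottom-right corner.
bitmap = np.zeros((32, 32), dtype=int)
bitmap[:4, :4] = 1
bitmap[31, 31] = 1

image = block_counts(bitmap)
print(image.shape)   # (8, 8)
print(image[0, 0])   # 16
print(image[7, 7])   # 1
```

Each resulting pixel therefore takes an integer value between 0 and 16.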

clustpy.data.real_uci_data.load_pendigits(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the pendigits data set. It consists of 10992 vectors of length 16, representing 8 coordinates. The coordinates were taken from the task of writing digits (0 to 9) on a tablet. The data set is composed of 7494 training and 3498 test samples. N=10992, d=16, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (10992 x 16), the labels numpy array (10992)

Return type:

Bunch

References

http://archive.ics.uci.edu/ml/datasets/pen-based+recognition+of+handwritten+digits

clustpy.data.real_uci_data.load_seeds(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the seeds data set. It consists of 210 samples belonging to one of three varieties of wheat. N=210, d=7, k=3.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (210 x 7), the labels numpy array (210)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/seeds

clustpy.data.real_uci_data.load_semeion(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the semeion data set. It consists of 1593 samples belonging to one of 10 classes. Each sample corresponds to a grayscale 16x16 scan of handwritten digits originating from about 80 different persons. Further, each pixel was converted to a boolean value using a fixed threshold. N=1593, d=256, k=10.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1593 x 256), the labels numpy array (1593)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/semeion+handwritten+digit

clustpy.data.real_uci_data.load_skin(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Skin Segmentation data set. It consists of 245057 skin- and non-skin samples with their B, G, R color information. N=245057, d=3, k=2.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (245057 x 3), the labels numpy array (245057)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/skin+segmentation

clustpy.data.real_uci_data.load_soybean_large(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the large version of the soybean data set. It consists of 562 samples belonging to one of 15 classes. The original data set contains more samples and 19 classes, but some samples have attributes with ‘?’ values; those samples are ignored. The data set is composed of 266 training and 296 test samples. N=562, d=35, k=15.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (562 x 35), the labels numpy array (562)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/soybean+(Large)

clustpy.data.real_uci_data.load_soybean_small(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the small version of the soybean data set. It is a small subset of the original soybean data set. It consists of 47 samples belonging to one of 4 classes. N=47, d=35, k=4.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (47 x 35), the labels numpy array (47)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/soybean+(small)

clustpy.data.real_uci_data.load_spambase(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the spambase data set. It consists of 4601 spam and non-spam mails. N=4601, d=57, k=2.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (4601 x 57), the labels numpy array (4601)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/spambase

clustpy.data.real_uci_data.load_statlog_australian_credit_approval(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the statlog Australian Credit Approval data set. It consists of 690 samples belonging to one of 2 classes. N=690, d=14, k=2.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (690 x 14), the labels numpy array (690)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/statlog+(australian+credit+approval)

clustpy.data.real_uci_data.load_statlog_shuttle(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the statlog shuttle data set. It consists of 58000 samples belonging to one of 7 classes. A special property is that about 80% of the data belongs to class 0. The data set is composed of 43500 training and 14500 test samples. N=58000, d=9, k=7.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (58000 x 9), the labels numpy array (58000)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)

clustpy.data.real_uci_data.load_user_knowledge(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the user knowledge data set. It consists of 403 samples belonging to one of 4 classes. The 4 classes are the knowledge levels ‘very low’, ‘low’, ‘middle’ and ‘high’. The data set is composed of 258 training and 145 test samples. N=403, d=5, k=4.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (403 x 5), the labels numpy array (403)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling

clustpy.data.real_video_data module

clustpy.data.real_video_data.load_video_keck_gesture(subset: str = 'all', image_size: tuple = (200, 200), frame_sampling_ratio: float = 1, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Keck Gesture video data set. It consists of 42 training and 56 testing videos showing 4 different persons performing 14 different gestures. We assign the label ‘0’ to the gesture ‘no gesture’, which describes the frames between the actual gestures. This results in 15 different gestures. Note that the person with label ‘3’ is only contained in the testing data. We transform the data set by extracting the 25457 480x640 colored frames. Note that the number of frames can differ depending on the machine and the version of opencv used. Further, we recommend downsizing the frames to avoid possible memory issues. The final data set is divided into 13546 training and 11911 test images. The two label sets are the gestures and the persons. N=25457, d=120000 (for image_size (200, 200)), k=[15, 4].

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
image_size (tuple) – The single frames can be downsized. This is necessary for large datasets. The tuple equals (width, height) of the images. Can also be None if the image size should not be changed (default: (200, 200))
frame_sampling_ratio (float) – Ratio to downsample the number of frames of each video. If it is set to 1 all frames will be returned. Can take values within (0, 1] (default: 1)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (25457 x 120000 (for image_size (200, 200))), the labels numpy array (25457 x 2)

Return type:

Bunch

References

http://www.zhuolin.umiacs.io/Keckgesturedataset.html
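The recommendation to downsize the Keck Gesture frames follows from simple arithmetic on the array size. The sketch below computes d for a given image_size and estimates the memory of the dense data array. The helper names are hypothetical, and the bytes per value depend on the array dtype, which is not stated here; 1 byte (uint8) and 4 bytes (float32) are used as assumptions.

```python
def flattened_dim(image_size, channels=3):
    """Number of features per frame for a (width, height) image size."""
    width, height = image_size
    return width * height * channels


def array_gib(n_samples, n_features, bytes_per_value):
    """Size of a dense n_samples x n_features array in GiB."""
    return n_samples * n_features * bytes_per_value / 2**30


n_frames = 25457
d_default = flattened_dim((200, 200))   # 120000, the documented d
d_full = flattened_dim((640, 480))      # 921600 for the original 480x640 frames

for label, d in [("200x200", d_default), ("original", d_full)]:
    for dtype_name, n_bytes in [("uint8", 1), ("float32", 4)]:
        size = array_gib(n_frames, d, n_bytes)
        print(f"{label:>8} {dtype_name:>8}: {size:.1f} GiB")
```

Under these assumptions the full-resolution frames need more than 20 GiB even at one byte per value, which is why a smaller image_size or a frame_sampling_ratio below 1 is usually advisable.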

clustpy.data.real_video_data.load_video_weizmann(image_size: tuple = None, frame_sampling_ratio: float = 1, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Weizmann video data set. It consists of 93 videos showing 9 different persons performing 10 different activities. We transform the data set by extracting the 5687 144x180 colored frames. Note that the number of frames can differ depending on the used machine and version of opencv. The two label sets are the activities and the persons. N=5687, d=77760, k=[10, 9].

Parameters:

image_size (tuple) – The single frames can be downsized. This is necessary for large datasets. The tuple equals (width, height) of the images. Can also be None if the image size should not be changed (default: None)
frame_sampling_ratio (float) – Ratio to downsample the number of frames of each video. If it is set to 1 all frames will be returned. Can take values within (0, 1] (default: 1)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (5687 x 77760), the labels numpy array (5687 x 2)

Return type:

Bunch

References

https://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html
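The image loaders store flattened HWC rows in ‘data’ and CHW arrays in ‘images’. A minimal sketch of converting one flattened HWC row to CHW is shown below, using a toy frame. The function name flat_hwc_to_chw is hypothetical. Note that image_size is documented as (width, height), so the two values must be passed in the right order.

```python
import numpy as np


def flat_hwc_to_chw(row, height, width, channels=3):
    """Reshape one flattened HWC row to an array of shape (C, H, W)."""
    hwc = row.reshape(height, width, channels)
    return np.transpose(hwc, (2, 0, 1))


# Toy frame with height 2, width 3 and 3 color channels.
height, width = 2, 3
hwc = np.arange(height * width * 3).reshape(height, width, 3)
row = hwc.reshape(-1)                      # flattened HWC layout, length 18

chw = flat_hwc_to_chw(row, height, width)
print(chw.shape)                           # (3, 2, 3)
print(chw[0])                              # channel 0 of every pixel
```

The CHW layout is also the one the ZNormalizer expects for channel-wise normalization of color images.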

clustpy.data.real_world_data module

clustpy.data.real_world_data.load_breast_cancer(return_X_y: bool = False) → Bunch[source]

Load the breast cancer Wisconsin data set. It consists of 30 features computed from digitized images of a fine needle aspirate of a breast mass. The classes are the result of a diagnosis (malignant or benign). N=569, d=30, k=2.

Parameters:: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns:: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (569 x 30), the labels numpy array (569)
Return type:: Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

clustpy.data.real_world_data.load_coil100(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the COIL-100 data set. It consists of 7200 128x128 color images of 100 objects photographed from 72 different angles. N=7200, d=49152, k=100.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (7200 x 49152), the labels numpy array (7200)

Return type:

Bunch

References

https://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php

clustpy.data.real_world_data.load_coil20(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the COIL-20 data set. It consists of 1440 128x128 gray-scale images of 20 objects photographed from 72 different angles. N=1440, d=16384, k=20.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1440 x 16384), the labels numpy array (1440)

Return type:

Bunch

References

https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php

clustpy.data.real_world_data.load_imagenet10(use_224_size: bool = True, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the ImageNet-10 data set. This is a subset of the well-known ImageNet data set with only 10 classes. It consists of 13000 224x224 (or 96x96) color images showing different objects. N=13000, d=150528, k=10.

Parameters:

use_224_size (bool) – defines whether the images should be loaded in the size (224 x 224) or (96 x 96) (default: True)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (13000 x 150528), the labels numpy array (13000)

Return type:

Bunch

References

https://www.image-net.org/

and

Russakovsky, Olga, et al. “Imagenet large scale visual recognition challenge.” International journal of computer vision 115 (2015): 211-252.

clustpy.data.real_world_data.load_imagenet_dog(subset: str = 'all', image_size: tuple = (224, 224), breeds: list = ['n02085936-Maltese_dog', 'n02086646-Blenheim_spaniel', 'n02088238-basset', 'n02091467-Norwegian_elkhound', 'n02097209-standard_schnauzer', 'n02099601-golden_retriever', 'n02101388-Brittany_spaniel', 'n02101556-clumber', 'n02102177-Welsh_springer_spaniel', 'n02105056-groenendael', 'n02105412-kelpie', 'n02105855-Shetland_sheepdog', 'n02107142-Doberman', 'n02110958-pug', 'n02112137-chow'], return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the ImageNet Dog data set. It consists of 20580 color images of different sizes showing 120 breeds of dogs. The data set is composed of 12000 training and 8580 test images. Usually, a subset of 15 dog breeds is used (Maltese_dog, Blenheim_spaniel, Basset, Norwegian_elkhound, Standard_schnauzer, Golden_retriever, Brittany_spaniel, Clumber, Welsh_springer_spaniel, Groenendael, Kelpie, Shetland_sheepdog, Doberman, Pug, Chow), resulting in 2574 images for the “all” subset. N=20580, d=image_size[0]*image_size[1]*3, k=120.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
image_size (tuple) – the images of various sizes must be converted into a coherent size. The tuple equals (width, height) of the images (default: (224, 224))
breeds (list) – list containing all the identifiers of the dog breeds that should be extracted. All entries must be of type str. If None, all breeds will be extracted. Usually, a subset consisting of 15 breeds is extracted (default: list with 15 dog breeds)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (20580 x image_size[0]*image_size[1]*3), the labels numpy array (20580)

Return type:

Bunch

References

http://vision.stanford.edu/aditya86/ImageNetDogs/main.html

and

Khosla, Aditya, et al. “Novel dataset for fine-grained image categorization: Stanford dogs.” Proc. CVPR workshop on fine-grained visual categorization (FGVC). Vol. 2. No. 1. Citeseer, 2011.

clustpy.data.real_world_data.load_iris(return_X_y: bool = False) → Bunch[source]

Load the iris data set. It consists of the petal and sepal width and length of three different types of irises (Setosa, Versicolour, Virginica). N=150, d=4, k=3.

Parameters:: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns:: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (150 x 4), the labels numpy array (150)
Return type:: Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

https://archive.ics.uci.edu/ml/datasets/iris

clustpy.data.real_world_data.load_newsgroups(subset: str = 'all', n_features: int = 2000, return_X_y: bool = False) → Bunch[source]

Load the 20 newsgroups data set. It consists of a collection of 18846 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The documents are converted into feature vectors using TF-IDF. The data set is composed of 11314 training and 7532 test documents. N=18846, d=2000, k=20 using the default settings.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
n_features (int) – number of features used by TF-IDF (default: 2000)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (18846 x 2000 - using the default settings), the labels numpy array (18846)

Return type:

Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups

http://qwone.com/~jason/20Newsgroups/

clustpy.data.real_world_data.load_olivetti_faces(return_X_y: bool = False) → Bunch[source]

Load the olivetti faces data set. It consists of 400 64x64 grayscale images showing faces of 40 different persons. N=400, d=4096, k=40.

Parameters:: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns:: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (400 x 4096), the labels numpy array (400)
Return type:: Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_olivetti_faces.html

clustpy.data.real_world_data.load_reuters(subset: str = 'all', n_features: int = 2000, categories: tuple = ('CCAT', 'GCAT', 'MCAT', 'ECAT'), return_X_y: bool = False) → Bunch[source]

Load the Reuters data set. It consists of over 800000 manually categorized newswire stories made available by Reuters, Ltd. Usually only a subset of the categories is used. Those categories are defined by the parameter ‘categories’. We use only those articles that belong to a single category. Further, we only use the n_features most frequent features. The data set is composed of 19806 training and 665265 test documents using the default settings. N=685071, d=2000, k=4 using the default settings.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
n_features (int) – number of features used (default: 2000)
categories (tuple) – the categories that should be contained (default: (“CCAT”, “GCAT”, “MCAT”, “ECAT”))
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (685071 x 2000 - using the default settings), the labels numpy array (685071 - using the default settings)

Return type:

Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_rcv1.html#sklearn.datasets.fetch_rcv1

and

Lewis, David D., et al. “Rcv1: A new benchmark collection for text categorization research.” Journal of machine learning research 5.Apr (2004): 361-397.
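
The two selection steps described for load_reuters (keep single-category documents, then keep the n_features most frequent features) can be sketched with NumPy on a toy matrix. This is only an illustration under assumptions, not clustpy's implementation: the arrays counts and category_matrix are hypothetical, and "most frequent" is interpreted here as the largest total count per feature.

```python
import numpy as np

# Hypothetical toy data: 5 documents, 6 term features, 3 categories.
counts = np.array([
    [2, 0, 1, 0, 0, 3],
    [0, 1, 0, 0, 4, 1],
    [1, 0, 0, 2, 0, 2],
    [0, 0, 3, 0, 1, 1],
    [5, 0, 0, 0, 0, 1],
])
# category_matrix[i, j] == 1 if document i carries category j.
category_matrix = np.array([
    [1, 0, 0],
    [1, 1, 0],  # two categories -> dropped
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],  # no category -> dropped
])
n_features = 3

# Keep only documents that belong to exactly one category.
single = category_matrix.sum(axis=1) == 1
data = counts[single]
labels = category_matrix[single].argmax(axis=1)

# Keep the n_features columns with the largest total count (assumed criterion).
totals = data.sum(axis=0)
top = np.sort(np.argsort(-totals, kind="stable")[:n_features])
data = data[:, top]
```

With the default settings, the same kind of filtering is what yields the 685071 x 2000 data array stated above.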

clustpy.data.real_world_data.load_webkb(use_universities: tuple = ('cornell', 'texas', 'washington', 'wisconsin'), use_categories: tuple = ('course', 'faculty', 'project', 'student'), remove_headers: bool = True, min_doc_frequency: float = 0.01, min_variance: float = 0.25, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the WebKB data set. It consists of 1041 HTML documents from different universities (default: “cornell”, “texas”, “washington” and “wisconsin”). These web pages have a specified category (default: “course”, “faculty”, “project”, “student”). For more information see the referenced website. The data is preprocessed by using stemming and removing stop words. Furthermore, words with a document frequency smaller than min_doc_frequency or with a variance smaller than min_variance will be removed. N=1041, d=323, k=[4,4] using the default settings.

Parameters:

use_universities (tuple) – specify the universities (default: (“cornell”, “texas”, “washington”, “wisconsin”))
use_categories (tuple) – specify the categories (default: (“course”, “faculty”, “project”, “student”))
remove_headers (bool) – should the headers of the HTML files be removed? (default: True)
min_doc_frequency (float) – minimum document frequency of the words (default: 0.01)
min_variance (float) – minimum variance of the words (default: 0.25)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1041 x 323 - using the default settings), the labels numpy array (1041 x 2 - using the default settings)

Return type:

Bunch

References

http://www.cs.cmu.edu/~webkb/
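
The word filtering controlled by min_doc_frequency and min_variance can be sketched as follows. This is a minimal illustration, not clustpy's code: the matrix counts is hypothetical, document frequency is assumed to be the fraction of documents containing a word, and the variance is assumed to be taken over the raw counts.

```python
import numpy as np

# Hypothetical term-count matrix: 4 documents x 4 words.
counts = np.array([
    [3.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [4.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 1.0, 0.0],
])
min_doc_frequency = 0.3
min_variance = 0.25

# Fraction of documents in which each word occurs at least once.
doc_frequency = (counts > 0).mean(axis=0)
# Variance of each word's count across all documents.
variance = counts.var(axis=0)

# A word is removed if either value falls below its threshold.
keep = (doc_frequency >= min_doc_frequency) & (variance >= min_variance)
filtered = counts[:, keep]
```

In this toy example only the first word survives: the second never occurs, the third occurs everywhere but has zero variance, and the fourth is too rare.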

clustpy.data.real_world_data.load_wine(return_X_y: bool = False) → Bunch[source]

Load the wine data set. It consists of 13 different properties of three different types of wine. N=178, d=13, k=3.

Parameters:: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns:: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (178 x 13), the labels numpy array (178)
Return type:: Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html

https://archive.ics.uci.edu/ml/datasets/wine

clustpy.data.synthetic_data_creator module

clustpy.data.synthetic_data_creator.create_nr_data(n_samples: int = 1000, n_clusters: tuple = (3, 3, 1), subspace_features: tuple = (2, 2, 2), n_outliers: tuple = (0, 0, 0), std: float = 1.0, box: tuple = (-10, 10), rotate_space: bool = True, random_state: np.random.RandomState | int = None) → (np.ndarray, np.ndarray)[source]

Create a synthetic non-redundant data set consisting of multiple subspaces containing Gaussian clusters (called clustered spaces). You can also create subspaces with a single Gaussian cluster (called noise spaces). The sklearn method make_blobs is used to create the clusters. The dimensionality of the subspaces is specified by the subspace_features parameter. It can be an integer, in which case all subspaces have the same dimensionality, or a list with one entry per subspace. Additionally, one can specify the number of outliers for each subspace. Outliers are drawn from a uniform distribution whose limits are given by the box parameter. If outliers are used, the number of samples within the clusters is reduced accordingly. The standard deviation and the bounding box can each be specified either individually for each subspace or as a single value shared across all subspaces.

Parameters:

n_samples (int) – Number of samples in the clusters. If n_samples is int, the samples will be equally divided across all clusters in each subspace. Otherwise, a tuple of tuples (e.g. ((100, 200, 700), (300,300,400), (300,300,400))) can specify the size of each cluster in each subspace individually. Beware that the overall number of samples (including outliers) must be equal for each subspace (default: 1000)
n_clusters (tuple) – Specifies the number of clusters for each subspace (default: (3, 3, 1))
subspace_features (tuple) – Number of features in each subspace (default: (2, 2, 2))
n_outliers (tuple) – Number of outliers for each subspace. Overall number of samples will be n_samples + n_outliers. Beware that n_samples + n_outliers must be equal for each subspace (default: (0, 0, 0))
std (float) – Standard deviation of the Gaussian clusters. Can be a list specifying an individual value for each subspace (default: 1.)
box (tuple) – The bounding box of the cluster centers. Can be a list specifying an individual value for each subspace (default: (-10, 10))
rotate_space (bool) – Specifies whether the feature space should be rotated by an orthonormal matrix (default: True)
random_state (np.random.RandomState | int) – The random state (default: None)

Returns:

data, labels – the data numpy array (n_samples x sum(subspace_features)), the labels numpy array (n_samples x len(subspace_features))

Return type:

(np.ndarray, np.ndarray)
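
The structure of such a non-redundant data set can be sketched with plain NumPy. Note that this sketch deliberately replaces sklearn's make_blobs with direct sampling and is not clustpy's implementation; the even assignment of samples to clusters and the QR-based rotation are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 300
n_clusters = (3, 2, 1)  # the last subspace is a noise space
subspace_features = (2, 2, 2)
std, box = 1.0, (-10, 10)

blocks, label_columns = [], []
for k, d in zip(n_clusters, subspace_features):
    # Cluster centers are drawn uniformly inside the bounding box.
    centers = rng.uniform(box[0], box[1], size=(k, d))
    # Assign samples to clusters as evenly as possible, then shuffle so that
    # the labelings of different subspaces are independent of each other.
    subspace_labels = rng.permutation(np.arange(n_samples) % k)
    noise = rng.normal(0.0, std, size=(n_samples, d))
    blocks.append(centers[subspace_labels] + noise)
    label_columns.append(subspace_labels)

# Concatenate the subspaces feature-wise; one label column per subspace.
data = np.hstack(blocks)
labels = np.column_stack(label_columns)

# Rotate the full feature space with a random orthonormal matrix (via QR).
q, _ = np.linalg.qr(rng.normal(size=(data.shape[1], data.shape[1])))
rotated = data @ q
```

The resulting shapes match the description above: the data array is n_samples x sum(subspace_features) and the label array is n_samples x len(subspace_features). Because q is orthonormal, the rotation mixes the subspaces across all features without changing distances between samples.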

clustpy.data.synthetic_data_creator.create_subspace_data(n_samples: int = 1000, n_clusters: int = 3, subspace_features: tuple = (2, 2), n_outliers: tuple = (0, 0), std: float = 1.0, box: tuple = (-10, 10), rotate_space: bool = True, random_state: np.random.RandomState | int = None) → (np.ndarray, np.ndarray)[source]

Create a synthetic subspace data set consisting of a subspace containing multiple Gaussian clusters (called clustered space) and a subspace containing a single Gaussian cluster (called noise space). This method is a special case of the create_nr_data method using only a single clustered space. See create_nr_data for more information.

Parameters:

n_samples (int) – Number of samples in the clusters. If n_samples is int, the samples will be equally divided across all clusters. Otherwise, a tuple (e.g. (100, 200, 700)) can specify the size of each cluster individually (default: 1000)
n_clusters (int) – Specifies the number of clusters in the clustered space (default: 3)
subspace_features (tuple) – Number of features in each of the two subspaces (default: (2, 2))
n_outliers (tuple) – Number of outliers for each subspace. Overall number of samples will be n_samples + n_outliers. Beware that n_samples + n_outliers must be equal for both subspaces (default: (0, 0))
std (float) – Standard deviation of the Gaussian clusters. Can be a list specifying an individual value for each subspace (default: 1.)
box (tuple) – The bounding box of the cluster centers. Can be a list specifying an individual value for each subspace (default: (-10, 10))
rotate_space (bool) – Specifies whether the feature space should be rotated by an orthonormal matrix (default: True)
random_state (np.random.RandomState | int) – The random state (default: None)

Returns:

data, labels – the data numpy array (n_samples x sum(subspace_features)), the labels numpy array (n_samples)

Return type:

(np.ndarray, np.ndarray)
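
The relation to create_nr_data can be made explicit by translating the arguments. The helper to_nr_arguments below is hypothetical and only illustrates the correspondence described above (one clustered space plus one noise space with a single cluster); it is not part of clustpy.

```python
def to_nr_arguments(n_clusters: int, subspace_features: tuple, n_outliers: tuple) -> dict:
    """Hypothetical helper: express create_subspace_data arguments in create_nr_data form."""
    if len(subspace_features) != 2 or len(n_outliers) != 2:
        raise ValueError("exactly two subspaces are expected")
    return {
        # One clustered space with n_clusters clusters, one noise space with one cluster.
        "n_clusters": (n_clusters, 1),
        "subspace_features": tuple(subspace_features),
        "n_outliers": tuple(n_outliers),
    }


nr_kwargs = to_nr_arguments(n_clusters=3, subspace_features=(2, 2), n_outliers=(0, 0))
```

Since the noise space contains only a single cluster, its labeling carries no information, which fits the one-dimensional label array (n_samples) returned here.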

Module contents

class clustpy.data.ZNormalizer(feature_or_channel_wise: bool = False)[source]

Bases: TransformerMixin, BaseEstimator

Normalize a data set by calculating (data - mean) / std. In general, two strategies are sensible to normalize a data set. Either use all features simultaneously for the normalization or normalize each feature separately. In the case of image data, a feature-wise transformation usually corresponds to a channel-wise transformation. If this normalizer should be applied to RGB image data, the color channels should be in the first dimension, known as CHW representation.

Parameters:: feature_or_channel_wise (bool) – Specifies if all data should be used for the normalization or if a feature-/channel-wise normalization should be applied (default: False)

shape

Shape of the data set with which this normalizer has been fitted

Type:: list

mean

Mean value(s) of the data set

Type:: np.ndarray or int

std

Standard deviation value(s) of the data set

Type:: np.ndarray or int

fit(X: ndarray, y: ndarray = None) → ZNormalizer[source]

Compute the mean and std values regarding the input data set.

Parameters:

X (np.ndarray) – the given data set
y (np.ndarray) – the labels (can be ignored)

Returns:

self – this instance of the ZNormalizer

Return type:

ZNormalizer

inverse_transform(X: ndarray) → ndarray[source]

Invert the transformation by applying (data * std) + mean.

Parameters:: X (np.ndarray) – the given data set
Returns:: X_out – The transformed data set
Return type:: np.ndarray

transform(X: ndarray) → ndarray[source]

Transform the given data set using the fitted mean and std values.

Parameters:: X (np.ndarray) – the given data set
Returns:: X_out – The transformed data set
Return type:: np.ndarray
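
The two normalization strategies and the inverse transformation can be illustrated with plain NumPy. This is a sketch of the formulas (data - mean) / std and (data * std) + mean, not the ZNormalizer implementation; the use of the population standard deviation and the axis layout (N, C, H, W) for image batches are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three features on very different scales.
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(200, 3))

# Strategy 1: one mean and one std computed over all entries.
mean_all, std_all = X.mean(), X.std()
X_all = (X - mean_all) / std_all

# Strategy 2: one mean and one std per feature (column).
mean_feat, std_feat = X.mean(axis=0), X.std(axis=0)
X_feat = (X - mean_feat) / std_feat

# Inverting the transformation: (data * std) + mean.
X_back = X_feat * std_feat + mean_feat

# Channel-wise statistics for a batch of CHW images, shape (N, C, H, W).
images = rng.uniform(0, 255, size=(10, 3, 8, 8))
mean_ch = images.mean(axis=(0, 2, 3), keepdims=True)
std_ch = images.std(axis=(0, 2, 3), keepdims=True)
images_norm = (images - mean_ch) / std_ch
```

With the global strategy, features on a large scale still dominate after normalization, whereas the feature-wise strategy gives every column zero mean and unit standard deviation.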

clustpy.data.create_nr_data(n_samples: int = 1000, n_clusters: tuple = (3, 3, 1), subspace_features: tuple = (2, 2, 2), n_outliers: tuple = (0, 0, 0), std: float = 1.0, box: tuple = (-10, 10), rotate_space: bool = True, random_state: np.random.RandomState | int = None) → (np.ndarray, np.ndarray)[source]

Create a synthetic non-redundant data set consisting of multiple subspaces containing Gaussian clusters (called clustered spaces). You can also create subspaces with a single Gaussian cluster (called noise spaces). The sklearn method make_blobs is used to create the clusters. The dimensionality of the subspaces is specified by the subspace_features parameter. It can be an integer, in which case all subspaces have the same dimensionality, or a list with one entry per subspace. Additionally, one can specify the number of outliers for each subspace. Outliers are drawn from a uniform distribution whose limits are given by the box parameter. If outliers are used, the number of samples within the clusters is reduced accordingly. The standard deviation and the bounding box can each be specified either individually for each subspace or as a single value shared across all subspaces.

Parameters:

n_samples (int) – Number of samples in the clusters. If n_samples is int, the samples will be equally divided across all clusters in each subspace. Otherwise, a tuple of tuples (e.g. ((100, 200, 700), (300,300,400), (300,300,400))) can specify the size of each cluster in each subspace individually. Beware that the overall number of samples (including outliers) must be equal for each subspace (default: 1000)
n_clusters (tuple) – Specifies the number of clusters for each subspace (default: (3, 3, 1))
subspace_features (tuple) – Number of features in each subspace (default: (2, 2, 2))
n_outliers (tuple) – Number of outliers for each subspace. Overall number of samples will be n_samples + n_outliers. Beware that n_samples + n_outliers must be equal for each subspace (default: (0, 0, 0))
std (float) – Standard deviation of the Gaussian clusters. Can be a list specifying an individual value for each subspace (default: 1.)
box (tuple) – The bounding box of the cluster centers. Can be a list specifying an individual value for each subspace (default: (-10, 10))
rotate_space (bool) – Specifies whether the feature space should be rotated by an orthonormal matrix (default: True)
random_state (np.random.RandomState | int) – The random state (default: None)

Returns:

data, labels – the data numpy array (n_samples x sum(subspace_features)), the labels numpy array (n_samples x len(subspace_features))

Return type:

(np.ndarray, np.ndarray)

clustpy.data.create_subspace_data(n_samples: int = 1000, n_clusters: int = 3, subspace_features: tuple = (2, 2), n_outliers: tuple = (0, 0), std: float = 1.0, box: tuple = (-10, 10), rotate_space: bool = True, random_state: np.random.RandomState | int = None) → (np.ndarray, np.ndarray)[source]

Create a synthetic subspace data set consisting of a subspace containing multiple Gaussian clusters (called clustered space) and a subspace containing a single Gaussian cluster (called noise space). This method is a special case of the create_nr_data method using only a single clustered space. See create_nr_data for more information.

Parameters:

n_samples (int) – Number of samples in the clusters. If n_samples is int, the samples will be equally divided across all clusters. Otherwise, a tuple (e.g. (100, 200, 700)) can specify the size of each cluster individually (default: 1000)
n_clusters (int) – Specifies the number of clusters in the clustered space (default: 3)
subspace_features (tuple) – Number of features in each of the two subspaces (default: (2, 2))
n_outliers (tuple) – Number of outliers for each subspace. Overall number of samples will be n_samples + n_outliers. Beware that n_samples + n_outliers must be equal for both subspaces (default: (0, 0))
std (float) – Standard deviation of the Gaussian clusters. Can be a list specifying an individual value for each subspace (default: 1.)
box (tuple) – The bounding box of the cluster centers. Can be a list specifying an individual value for each subspace (default: (-10, 10))
rotate_space (bool) – Specifies whether the feature space should be rotated by an orthonormal matrix (default: True)
random_state (np.random.RandomState | int) – The random state (default: None)

Returns:

data, labels – the data numpy array (n_samples x sum(subspace_features)), the labels numpy array (n_samples)

Return type:

(np.ndarray, np.ndarray)

clustpy.data.flatten_images(data: ndarray, format: str) → ndarray[source]

Convert data array from image to numerical vector. Before flattening, color images will be converted to the HWC/HWDC (height, width, color channels) format.

Parameters:

data (np.ndarray) – The given data set
format (str) – Format of the images within the data array. Can be: “HW”, “HWD”, “CHW”, “CHWD”, “HWC”, “HWDC”. Abbreviations stand for: H: Height, W: Width, D: Depth, C: Color-channels

Returns:

data – The flattened data array

Return type:

np.ndarray
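
For the “CHW” case, the conversion described above can be sketched with NumPy as a transpose followed by a reshape. This is an illustration of the idea rather than the code of flatten_images; the toy array images_chw is hypothetical.

```python
import numpy as np

# Hypothetical batch of 4 color images in CHW layout: (N, C, H, W).
n, c, h, w = 4, 3, 2, 5
images_chw = np.arange(n * c * h * w).reshape(n, c, h, w)

# Move the channel axis to the end (CHW -> HWC), then flatten each image.
images_hwc = np.transpose(images_chw, (0, 2, 3, 1))
flat = images_hwc.reshape(n, -1)
```

After the conversion, the first c entries of each row are the channel values of the top-left pixel, so the values belonging to one pixel stay adjacent in the flattened vector.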

clustpy.data.load_adrenal_mnist_3d(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the AdrenalMNIST3D data set. It consists of 1584 28x28x28 grayscale images belonging to one of 2 classes. The data set is composed of 1188 training, 98 validation and 298 test samples. N=1584, d=21952, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1584 x 21952), the labels numpy array (1584)

Return type:

Bunch

References

https://medmnist.com/

clustpy.data.load_aloi_small(return_X_y: bool = False) → Bunch[source]

Load a subset of the Amsterdam Library of Object Images (ALOI) consisting of 288 images of the objects red ball, red cylinder, green ball and green cylinder. The two label sets are cylinder/ball and red/green. N=288, d=611, k=[2,2].

Parameters:: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns:: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (288 x 611), the labels numpy array (288 x 2)
Return type:: Bunch

References

https://aloi.science.uva.nl/

and

Ye, Wei, et al. “Generalized independent subspace clustering.” 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016.
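
Data sets with several label sets return a two-dimensional label array with one column per labeling. The sketch below shows how such an array can be handled and how the k=[2,2] notation relates to it; the array target is a hypothetical stand-in, and no assumption is made about which column holds which label set.

```python
import numpy as np

# Hypothetical target array with two labelings per sample (shape: N x 2).
target = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
    [1, 1],
])

# Each column is one complete labeling of the data set.
first_labeling = target[:, 0]
second_labeling = target[:, 1]

# Number of clusters per labeling, i.e. the k=[2,2] notation.
k = [int(np.unique(target[:, j]).size) for j in range(target.shape[1])]
```

Each column can then be compared separately against a clustering result.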

clustpy.data.load_banknotes(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the banknote authentication data set. It consists of 1372 genuine and forged banknote samples. N=1372, d=4, k=2.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1372 x 4), the labels numpy array (1372)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/banknote+authentication

clustpy.data.load_blood_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the BloodMNIST data set. It consists of 17092 28x28 colored images belonging to one of 8 classes. The data set is composed of 11959 training, 1712 validation and 3421 test samples. N=17092, d=2352, k=8.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (17092 x 2352), the labels numpy array (17092)

Return type:

Bunch

References

https://medmnist.com/

Andrea Acevedo, Anna Merino, et al., “A dataset of microscopic peripheral blood cell images for development of automatic recognition systems,” Data in Brief, vol. 30, pp. 105474, 2020.

clustpy.data.load_breast_cancer(return_X_y: bool = False) → Bunch[source]

Load the breast cancer Wisconsin data set. It consists of 30 features computed from digitized images of a fine needle aspirate of a breast mass. The classes are the result of a diagnosis (malignant or benign). N=569, d=30, k=2.

Parameters:: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns:: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (569 x 30), the labels numpy array (569)
Return type:: Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

clustpy.data.load_breast_cancer_wisconsin_original(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the original breast cancer Wisconsin data set. It consists of 699 samples belonging to one of 2 classes. 16 samples contain ‘?’ values and will be removed. N=683, d=9, k=2.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (683 x 9), the labels numpy array (683)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29
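
Removing the samples that contain ‘?’ values can be sketched as follows. The array raw is a hypothetical stand-in for rows read from the raw file, and the code is an illustration, not clustpy's loading routine.

```python
import numpy as np

# Hypothetical raw rows as read from a text file; '?' marks a missing value.
raw = np.array([
    ["5", "1", "1", "2"],
    ["8", "?", "4", "4"],
    ["3", "2", "?", "2"],
    ["1", "1", "1", "2"],
])

# Keep only rows without any '?' entry, then convert them to integers.
complete = ~(raw == "?").any(axis=1)
clean = raw[complete].astype(int)
```

Applied to the real data, this kind of filtering reduces the 699 original samples to the N=683 samples stated above.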

clustpy.data.load_breast_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the BreastMNIST data set. It consists of 780 28x28 grayscale images belonging to one of 2 classes. The data set is composed of 546 training, 78 validation and 156 test samples. N=780, d=784, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (780 x 784), the labels numpy array (780)

Return type:

Bunch

References

https://medmnist.com/

Walid Al-Dhabyani, Mohammed Gomaa, et al., “Dataset of breast ultrasound images,” Data in Brief, vol. 28, pp. 104863, 2020.
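
How the subset parameter relates to the sample counts can be sketched with placeholder arrays of the BreastMNIST split sizes. The function select_subset is hypothetical and the stacking order for ‘all’ is an assumption; clustpy's loader may combine the splits differently.

```python
import numpy as np

# Hypothetical placeholder splits with the BreastMNIST sizes (d = 28 * 28).
splits = {
    "train": np.zeros((546, 784)),
    "val": np.zeros((78, 784)),
    "test": np.zeros((156, 784)),
}


def select_subset(splits: dict, subset: str = "all") -> np.ndarray:
    """Return one split, or all splits stacked row-wise for subset='all'."""
    if subset == "all":
        # The stacking order used here is an assumption.
        return np.concatenate([splits["train"], splits["val"], splits["test"]])
    return splits[subset]


data_all = select_subset(splits, "all")
data_val = select_subset(splits, "val")
```

With ‘all’, the three splits add up to the N=780 samples given in the description.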

clustpy.data.load_breast_tissue(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the breast tissue data set. It consists of 106 samples belonging to one of 6 classes. N=106, d=9, k=6.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (106 x 9), the labels numpy array (106)

Return type:

Bunch

References

http://archive.ics.uci.edu/ml/datasets/breast+tissue

clustpy.data.load_chest_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the ChestMNIST data set. It consists of 112120 28x28 grayscale images. The ground truth labels consist of 14 labelings with 2 clusters each. The data set is composed of 78468 training, 11219 validation and 22433 test samples. N=112120, d=784, k=[2,2,2,2,2,2,2,2,2,2,2,2,2,2].

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (112120 x 784), the labels numpy array (112120 x 14)

Return type:

Bunch

References

https://medmnist.com/

Xiaosong Wang, Yifan Peng, et al., “Chest x-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” in CVPR, 2017, pp. 3462–3471.

clustpy.data.load_cifar10(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the CIFAR10 data set. It consists of 60000 32x32 color images showing different objects. The classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The data set is composed of 50000 training and 10000 test images. N=60000, d=3072, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (60000 x 3072), the labels numpy array (60000)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.CIFAR10.html

and

Krizhevsky, Alex, and Geoffrey Hinton. “Learning multiple layers of features from tiny images.” (2009): 7.

clustpy.data.load_cifar100(subset: str = 'all', use_superclasses: bool = False, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the CIFAR100 data set. It consists of 60000 32x32 color images showing different objects. A total of 100 classes are included, each depicting a specific type of object. Each class contains 600 images. If use_superclasses is True, only the 20 superclasses are used. The data set is composed of 50000 training and 10000 test images. N=60000, d=3072, k=100.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
use_superclasses (bool) – If set to True, the 20 superclasses are used instead of the 100 regular classes (default: False)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (60000 x 3072), the labels numpy array (60000)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.CIFAR100.html

and

Krizhevsky, Alex, and Geoffrey Hinton. “Learning multiple layers of features from tiny images.” (2009): 7.

clustpy.data.load_cmu_faces(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the CMU Face Images data set. It consists of 640 30x32 grayscale images showing 20 persons in different poses (up, straight, left, right) and with different expressions (neutral, happy, sad, angry). Additionally, the persons can wear sunglasses or not. 16 images show glitches, which is why the final data set only contains 624 images. N=624, d=960, k=[20,4,4,2].

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (624 x 960), the labels numpy array (624 x 4)

Return type:

Bunch

References

http://archive.ics.uci.edu/ml/datasets/cmu+face+images

clustpy.data.load_coil100(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the COIL-100 data set. It consists of 7200 128x128 color images of 100 objects photographed from 72 different angles. N=7200, d=49152, k=100.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (7200 x 49152), the labels numpy array (7200)

Return type:

Bunch

References

https://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php

clustpy.data.load_coil20(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the COIL-20 data set. It consists of 1440 128x128 grayscale images of 20 objects photographed from 72 different angles. N=1440, d=16384, k=20.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1440 x 16384), the labels numpy array (1440)

Return type:

Bunch

References

https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php

clustpy.data.load_derma_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the DermaMNIST data set. It consists of 10015 28x28 colored images belonging to one of 7 classes. The data set is composed of 7007 training, 1003 validation and 2005 test samples. N=10015, d=2352, k=7.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (10015 x 2352), the labels numpy array (10015)

Return type:

Bunch

References

https://medmnist.com/

Philipp Tschandl, Cliff Rosendahl, et al., “The ham10000 dataset, a large collection of multisource dermatoscopic images of common pigmented skin lesions,” Scientific data, vol. 5, pp. 180161, 2018.

Noel Codella, Veronica Rotemberg, et al., “Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)”, 2018, arXiv:1902.03368.
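
The difference between the HWC layout of ‘data’ and the CHW layout of ‘images’ can be illustrated without any library. The following is a minimal sketch, assuming that one row of ‘data’ is the row-major flattening of a height x width x channel image. The helpers hwc_index, chw_index and hwc_to_chw are hypothetical names used for illustration only; they are not part of clustpy.

```python
def hwc_index(h, w, c, width, channels):
    """Position of pixel (h, w), channel c in a row-major flattened HWC image."""
    return (h * width + w) * channels + c


def chw_index(h, w, c, height, width):
    """Position of the same value in a row-major flattened CHW image."""
    return (c * height + h) * width + w


def hwc_to_chw(flat_hwc, height, width, channels):
    """Reorder a flattened HWC image into flattened CHW order."""
    flat_chw = [None] * len(flat_hwc)
    for h in range(height):
        for w in range(width):
            for c in range(channels):
                src = hwc_index(h, w, c, width, channels)
                dst = chw_index(h, w, c, height, width)
                flat_chw[dst] = flat_hwc[src]
    return flat_chw


# A tiny 2x2 RGB image, flattened in HWC order (all channels of a pixel are adjacent).
tiny_hwc = ["R00", "G00", "B00", "R01", "G01", "B01",
            "R10", "G10", "B10", "R11", "G11", "B11"]

# In CHW order, all values of one channel are adjacent instead.
tiny_chw = hwc_to_chw(tiny_hwc, height=2, width=2, channels=3)
print(tiny_chw)
```

With NumPy arrays the same reordering is usually written as a reshape to (H, W, C) followed by a transpose to (C, H, W).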

clustpy.data.load_dermatology(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the dermatology data set. It consists of 366 samples belonging to one of 6 classes. 8 samples contain ‘?’ values and are therefore removed. N=358, d=34, k=6.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (358 x 34), the labels numpy array (358)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/dermatology
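
The removal of samples with ‘?’ values can be sketched as a simple row filter. This is an illustration on toy rows, not the loader's actual implementation; drop_rows_with_missing is a hypothetical helper name.

```python
def drop_rows_with_missing(rows, missing_marker="?"):
    """Keep only the rows in which no entry equals the missing-value marker."""
    return [row for row in rows if missing_marker not in row]


# Toy rows standing in for raw, still unparsed records.
raw_rows = [
    ["2", "1", "55", "1"],
    ["3", "2", "?", "2"],   # contains a missing value, so it is dropped
    ["1", "0", "40", "3"],
]
clean_rows = drop_rows_with_missing(raw_rows)
print(len(raw_rows), "->", len(clean_rows))  # prints: 3 -> 2
```

Applied to the full data set, the same filter accounts for the reduction from 366 to 358 samples described above.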

clustpy.data.load_diatom_size_reduction(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the diatom size reduction data set. It consists of 322 samples belonging to one of 4 classes. The data set is composed of 16 training and 306 test samples. N=322, d=345, k=4.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (322 x 345), the labels numpy array (322)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=DiatomSizeReduction

clustpy.data.load_ecoli(ignore_small_clusters: bool = False, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the ecoli data set. It consists of 336 samples belonging to one of 8 classes. N=336, d=7, k=8.

Parameters:

ignore_small_clusters (bool) – specify if the three small clusters with size 2, 2 and 5 should be ignored (default: False)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (336 x 7), the labels numpy array (336)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/ecoli
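
The effect of ignore_small_clusters can be sketched as a filter on cluster sizes. The size threshold min_size and the helper drop_small_clusters are assumptions made for this illustration; the description above only states that the three clusters of size 2, 2 and 5 are ignored.

```python
from collections import Counter


def drop_small_clusters(samples, labels, min_size):
    """Remove all samples whose cluster has fewer than min_size members."""
    counts = Counter(labels)
    keep = [i for i, label in enumerate(labels) if counts[label] >= min_size]
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy labels: three small clusters (sizes 2, 2, 5) and two larger ones.
labels = ["a"] * 2 + ["b"] * 2 + ["c"] * 5 + ["d"] * 20 + ["e"] * 35
samples = list(range(len(labels)))

kept_samples, kept_labels = drop_small_clusters(samples, labels, min_size=6)
print(len(kept_labels), sorted(set(kept_labels)))
```

For the ecoli data, dropping the clusters of size 2, 2 and 5 would leave 336 - 9 = 327 samples in 5 classes.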

clustpy.data.load_fmnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Fashion-MNIST data set. It consists of 70000 28x28 grayscale images showing articles from the Zalando online store. Each sample belongs to one of 10 product groups. The data set is composed of 60000 training and 10000 test images. N=70000, d=784, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (70000 x 784), the labels numpy array (70000)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.FashionMNIST.html

and

Xiao, Han, Kashif Rasul, and Roland Vollgraf. “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms.” arXiv preprint arXiv:1708.07747 (2017).

clustpy.data.load_forest_types(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the forest type mapping data set. It consists of 523 samples belonging to one of 4 classes. The data set is composed of 198 training and 325 test samples. N=523, d=27, k=4.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (523 x 27), the labels numpy array (523)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/Forest+type+mapping

clustpy.data.load_fracture_mnist_3d(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the FractureMNIST3D data set. It consists of 1370 28x28x28 grayscale images belonging to one of 3 classes. The data set is composed of 1027 training, 103 validation and 240 test samples. N=1370, d=21952, k=3.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1370 x 21952), the labels numpy array (1370)

Return type:

Bunch

References

https://medmnist.com/

Liang Jin, Jiancheng Yang, et al., “Deep-learning-assisted detection and segmentation of rib fractures from ct scans: Development and validation of fracnet,” EBioMedicine, vol. 62, pp. 103106, 2020.

clustpy.data.load_fruit(return_X_y: bool = False) → Bunch[source]

Load the fruits data set. It consists of 105 preprocessed images of apples, bananas and grapes in red, green and yellow. N=105, d=6, k=[3,3].

Parameters:: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns:: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (105 x 6), the labels numpy array (105 x 2)
Return type:: Bunch

References

Hu, Juhua, et al. “Finding multiple stable clusterings.” Knowledge and Information Systems 51.3 (2017): 991-1021.

clustpy.data.load_gtsrb(subset: str = 'all', image_size: tuple = (32, 32), return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the GTSRB (German Traffic Sign Recognition Benchmark) data set. It consists of 39270 color images showing 43 different traffic signs. Example classes are: stop sign, speed limit 50 sign, speed limit 70 sign, construction site sign and many others. The data set is composed of 26640 training and 12630 test images. N=39270, d=image_size[0]*image_size[1]*3, k=43.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
image_size (tuple) – the images of various sizes must be converted into a coherent size. The tuple equals (width, height) of the images (default: (32, 32))
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (39270 x image_size[0]*image_size[1]*3), the labels numpy array (39270)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.GTSRB.html

and

https://benchmark.ini.rub.de/

and

Stallkamp, Johannes, et al. “Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition.” Neural networks 32 (2012): 323-332.
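
Because the images are resized, the dimensionality d depends on image_size. The relationship d=image_size[0]*image_size[1]*3 can be checked with a small helper; flattened_dim is a hypothetical name used only for this sketch.

```python
def flattened_dim(image_size, channels=3):
    """Number of features of one flattened image of the given (width, height)."""
    width, height = image_size
    return width * height * channels


print(flattened_dim((32, 32)))    # default image_size of load_gtsrb
print(flattened_dim((64, 48)))    # a non-square size
```

The same formula with channels=1 gives d=784 for the 28x28 grayscale data sets in this module.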

clustpy.data.load_har(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Human Activity Recognition data set. It consists of 10299 samples each representing sensor data of a person performing an activity. The six activities are walking, walking_upstairs, walking_downstairs, sitting, standing and laying. The data set is composed of 7352 training and 2947 test samples. N=10299, d=561, k=6.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (10299 x 561), the labels numpy array (10299)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones

clustpy.data.load_htru2(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the HTRU2 data set. It consists of 17898 samples belonging to the pulsar or non-pulsar class. A special property is that more than 90% of the data belongs to class 0. N=17898, d=8, k=2.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (17898 x 8), the labels numpy array (17898)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/HTRU2

clustpy.data.load_imagenet10(use_224_size: bool = True, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the ImageNet-10 data set. This is a subset of the well-known ImageNet data set with only 10 classes. It consists of 13000 224x224 (or 96x96) color images showing different objects. N=13000, d=150528 (or d=27648 for the 96x96 images), k=10.

Parameters:

use_224_size (bool) – defines whether the images should be loaded in the size (224 x 224) or (96 x 96) (default: True)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (13000 x 150528, or 13000 x 27648 for the 96x96 images), the labels numpy array (13000)

Return type:

Bunch

References

https://www.image-net.org/

and

Russakovsky, Olga, et al. “Imagenet large scale visual recognition challenge.” International journal of computer vision 115 (2015): 211-252.

clustpy.data.load_imagenet_dog(subset: str = 'all', image_size: tuple = (224, 224), breeds: list = ['n02085936-Maltese_dog', 'n02086646-Blenheim_spaniel', 'n02088238-basset', 'n02091467-Norwegian_elkhound', 'n02097209-standard_schnauzer', 'n02099601-golden_retriever', 'n02101388-Brittany_spaniel', 'n02101556-clumber', 'n02102177-Welsh_springer_spaniel', 'n02105056-groenendael', 'n02105412-kelpie', 'n02105855-Shetland_sheepdog', 'n02107142-Doberman', 'n02110958-pug', 'n02112137-chow'], return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the ImageNet Dog data set. It consists of 20580 color images of different sizes showing 120 breeds of dogs. The data set is composed of 12000 training and 8580 test images. Usually, a subset of 15 dog breeds is used (Maltese_dog, Blenheim_spaniel, Basset, Norwegian_elkhound, Standard_schnauzer, Golden_retriever, Brittany_spaniel, Clumber, Welsh_springer_spaniel, Groenendael, Kelpie, Shetland_sheepdog, Doberman, Pug, Chow), resulting in 2574 images for the “all” subset. If all breeds are loaded (breeds=None): N=20580, d=image_size[0]*image_size[1]*3, k=120.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
image_size (tuple) – the images of various sizes must be converted into a coherent size. The tuple equals (width, height) of the images (default: (224, 224))
breeds (list) – list containing all the identifiers of the dog breeds that should be extracted. All entries must be of type str. If None, all breeds will be extracted. Usually, a subset consisting of 15 breeds is extracted (default: list with 15 dog breeds)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (20580 x image_size[0]*image_size[1]*3), the labels numpy array (20580)

Return type:

Bunch

References

http://vision.stanford.edu/aditya86/ImageNetDogs/main.html

and

Khosla, Aditya, et al. “Novel dataset for fine-grained image categorization: Stanford dogs.” Proc. CVPR workshop on fine-grained visual categorization (FGVC). Vol. 2. No. 1. Citeseer, 2011.

clustpy.data.load_iris(return_X_y: bool = False) → Bunch[source]

Load the iris data set. It consists of the petal and sepal width and length of three different types of irises (Setosa, Versicolour, Virginica). N=150, d=4, k=3.

Parameters:: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns:: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (150 x 4), the labels numpy array (150)
Return type:: Bunch
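
Code that should work with both return styles can normalize the result first. In the sketch below, as_X_y is a hypothetical helper, and types.SimpleNamespace is only a stand-in for a Bunch object; the sketch assumes nothing beyond the ‘data’ and ‘target’ attributes described above.

```python
from types import SimpleNamespace


def as_X_y(result):
    """Return (data, target) for either return style of a loader."""
    if isinstance(result, tuple):
        data, target = result       # style used when return_X_y is True
        return data, target
    return result.data, result.target  # Bunch-like object with attributes


# Stand-ins for the two possible return values (one iris-like sample).
bunch_like = SimpleNamespace(data=[[5.1, 3.5, 1.4, 0.2]], target=[0])
tuple_like = ([[5.1, 3.5, 1.4, 0.2]], [0])

X_a, y_a = as_X_y(bunch_like)
X_b, y_b = as_X_y(tuple_like)
print(X_a == X_b and y_a == y_b)
```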

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

and

https://archive.ics.uci.edu/ml/datasets/iris

clustpy.data.load_kmnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Kuzushiji-MNIST data set. It consists of 70000 28x28 grayscale images showing cursive Japanese (Kuzushiji) characters. It is composed of 10 different characters, each representing one column of hiragana. The data set is composed of 60000 training and 10000 test images. N=70000, d=784, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (70000 x 784), the labels numpy array (70000)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.KMNIST.html

and

Clanuwat, Tarin, et al. “Deep learning for classical japanese literature.” arXiv preprint arXiv:1812.01718 (2018).

clustpy.data.load_letterrecognition(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Letter Recognition data set. It consists of 20000 samples where each sample represents one of the 26 capital letters in the English alphabet. All samples are composed of 16 numerical stimuli describing the respective letter. N=20000, d=16, k=26.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (20000 x 16), the labels numpy array (20000)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/letter+recognition

clustpy.data.load_lsst(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the LSST data set. It consists of 4925 samples belonging to one of 14 classes. The data set is composed of 2459 training and 2466 test samples. N=4925, d=216, k=14.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (4925 x 216), the labels numpy array (4925)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=LSST

clustpy.data.load_mice_protein(return_additional_labels: bool = False, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Mice Protein Expression data set. It consists of 1077 samples belonging to one of 8 classes. Each feature represents the expression level of one of 77 proteins. Samples containing more than 43 NaN values (3 cases) will be removed. Afterwards, all columns containing NaN values will be removed. This reduces the number of features from 77 to 68. The classes can be further subdivided by using the return_additional_labels parameter. This gives the additional information mouseID, behavior, treatment type and genotype. N=1077, d=68, k=8.

Parameters:

return_additional_labels (bool) – return additional labels (default: False)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1077 x 68), the labels numpy array (1077)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression
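
The two-step NaN handling described above (first drop samples with too many NaN values, then drop every column that still contains a NaN) can be sketched in plain Python. The helper clean_missing and the toy threshold are assumptions for illustration; the loader itself uses a threshold of 43.

```python
import math


def clean_missing(rows, max_nan_per_row):
    """Drop rows with more than max_nan_per_row NaNs, then drop NaN columns."""
    kept_rows = [row for row in rows
                 if sum(math.isnan(v) for v in row) <= max_nan_per_row]
    if not kept_rows:
        return []
    n_cols = len(kept_rows[0])
    # A column survives only if none of the remaining rows has a NaN in it.
    kept_cols = [j for j in range(n_cols)
                 if not any(math.isnan(row[j]) for row in kept_rows)]
    return [[row[j] for j in kept_cols] for row in kept_rows]


nan = float("nan")
toy = [
    [1.0, 2.0, 3.0, 4.0],
    [nan, nan, nan, 1.0],  # 3 NaNs exceed the toy threshold of 2: row removed
    [5.0, nan, 7.0, 8.0],  # 1 NaN: row kept, but column 1 is removed
]
cleaned = clean_missing(toy, max_nan_per_row=2)
print(cleaned)  # [[1.0, 3.0, 4.0], [5.0, 7.0, 8.0]]
```

The order of the two steps matters: removing the heavily incomplete samples first means fewer columns have to be discarded afterwards.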

clustpy.data.load_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the MNIST data set. It consists of 70000 28x28 grayscale images showing handwritten digits (0 to 9). The data set is composed of 60000 training and 10000 test images. N=70000, d=784, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (70000 x 784), the labels numpy array (70000)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html

and

LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.

clustpy.data.load_motestrain(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the motestrain data set. It consists of 1272 samples belonging to one of 2 classes. The data set is composed of 20 training and 1252 test samples. N=1272, d=84, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1272 x 84), the labels numpy array (1272)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=MoteStrain

clustpy.data.load_multiple_features(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the multiple features data set. It consists of 2000 samples belonging to one of 10 classes. Each class corresponds to handwritten numerals (0-9) extracted from a collection of Dutch utility maps. N=2000, d=649, k=10.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (2000 x 649), the labels numpy array (2000)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/Multiple+Features

clustpy.data.load_newsgroups(subset: str = 'all', n_features: int = 2000, return_X_y: bool = False) → Bunch[source]

Load the 20 newsgroups data set. It consists of a collection of 18846 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The documents are converted into feature vectors using TF-IDF. The data set is composed of 11314 training and 7532 test documents. N=18846, d=2000, k=20 using the default settings.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
n_features (int) – number of features used by TF-IDF (default: 2000)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (18846 x 2000 - using the default settings), the labels numpy array (18846)

Return type:

Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups

and

http://qwone.com/~jason/20Newsgroups/

clustpy.data.load_nodule_mnist_3d(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the NoduleMNIST3D data set. It consists of 1633 28x28x28 grayscale images belonging to one of 2 classes. The data set is composed of 1158 training, 165 validation and 310 test samples. N=1633, d=21952, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1633 x 21952), the labels numpy array (1633)

Return type:

Bunch

References

https://medmnist.com/

Samuel G. Armato III, Geoffrey McLennan, et al., “The lung image database consortium (lidc) and image database resource initiative (idri): A completed reference database of lung nodules on ct scans,” Medical Physics, vol. 38, no. 2, pp. 915–931, 2011.

clustpy.data.load_nrletters(return_X_y: bool = False) → Bunch[source]

Load the NRLetters data set. It consists of 10000 9x7 images of the letters A, B, C, X, Y and Z in pink, cyan and yellow. Additionally, each image highlights one corner in color. N=10000, d=189, k=[6,3,4].

Parameters:: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns:: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (10000 x 189), the labels numpy array (10000 x 3)
Return type:: Bunch

References

Leiber, Collin, et al. “Automatic Parameter Selection for Non-Redundant Clustering.” Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, 2022.
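
For data sets with several labelings, the notation k=[6,3,4] gives the number of clusters per column of the label array. A minimal sketch of how such a vector can be derived from a multi-column target is shown below; clusters_per_labeling is a hypothetical helper, and the toy target is synthetic, not the actual NRLetters labels.

```python
from itertools import product


def clusters_per_labeling(label_rows):
    """Count the distinct labels in each column of a multi-column target."""
    n_labelings = len(label_rows[0])
    return [len({row[j] for row in label_rows}) for j in range(n_labelings)]


# Synthetic target with three labelings: 6 letters, 3 colors, 4 corners.
toy_target = list(product(range(6), range(3), range(4)))

print(clusters_per_labeling(toy_target))  # [6, 3, 4]
```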

clustpy.data.load_oct_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the OCTMNIST data set. It consists of 109309 28x28 grayscale images belonging to one of 4 classes. The data set is composed of 97477 training, 10832 validation and 1000 test samples. N=109309, d=784, k=4.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (109309 x 784), the labels numpy array (109309)

Return type:

Bunch

References

https://medmnist.com/

Daniel S. Kermany, Michael Goldbaum, et al., “Identifying medical diagnoses and treatable diseases by image-based deep learning,” Cell, vol. 172, no. 5, pp. 1122 – 1131.e9, 2018.
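
The number of samples returned for each value of subset follows directly from the split sizes stated above. The sketch below only reproduces this arithmetic; expected_samples is a hypothetical helper and does not reflect how clustpy validates the subset parameter.

```python
# Split sizes as stated in the description of the OCTMNIST data set.
OCTMNIST_SPLIT_SIZES = {"train": 97477, "val": 10832, "test": 1000}


def expected_samples(subset, split_sizes):
    """Number of samples for 'all' or for one of the named splits."""
    if subset == "all":
        return sum(split_sizes.values())
    if subset not in split_sizes:
        raise ValueError(f"unknown subset: {subset!r}")
    return split_sizes[subset]


for name in ("all", "train", "val", "test"):
    print(name, expected_samples(name, OCTMNIST_SPLIT_SIZES))
```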

clustpy.data.load_olive_oil(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the OliveOil data set. It consists of 60 samples belonging to one of 4 classes. The data set is composed of 30 training and 30 test samples. N=60, d=570, k=4.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (60 x 570), the labels numpy array (60)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=OliveOil

clustpy.data.load_olivetti_faces(return_X_y: bool = False) → Bunch[source]

Load the olivetti faces data set. It consists of 400 64x64 grayscale images showing faces of 40 different persons. N=400, d=4096, k=40.

Parameters:: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns:: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (400 x 4096), the labels numpy array (400)
Return type:: Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_olivetti_faces.html

clustpy.data.load_optdigits(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the optdigits data set. It consists of 5620 8x8 grayscale images, each representing a digit (0 to 9). Each pixel value is the number of marked pixels within a 4x4 block of the original 32x32 bitmap. The data set is composed of 3823 training and 1797 test samples. N=5620, d=64, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (5620 x 64), the labels numpy array (5620)

Return type:

Bunch

References

http://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits
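
The block counting described for optdigits can be sketched with NumPy as follows. This is a minimal illustration on a synthetic random bitmap, not the code used to create the data set; the helper name `block_counts` is hypothetical.

```python
import numpy as np

def block_counts(bitmap: np.ndarray, block: int = 4) -> np.ndarray:
    """Count the marked pixels in each non-overlapping block x block tile."""
    height, width = bitmap.shape
    if height % block or width % block:
        raise ValueError("bitmap size must be a multiple of the block size")
    # Split both axes into (tile index, position within tile) and sum within tiles.
    tiles = bitmap.reshape(height // block, block, width // block, block)
    return tiles.sum(axis=(1, 3))

# Synthetic 32x32 binary bitmap standing in for one scanned digit.
rng = np.random.default_rng(0)
bitmap = rng.integers(0, 2, size=(32, 32))

image = block_counts(bitmap)   # 8x8 image with values between 0 and 16
features = image.reshape(-1)   # flattened row of length d=64
```

Each 8x8 image then flattens to one row of the 5620 x 64 data array.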

clustpy.data.load_organ_a_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the OrganAMNIST data set. It consists of 58850 28x28 grayscale images belonging to one of 11 classes. The data set is composed of 34581 training, 6491 validation and 17778 test samples. N=58850, d=784, k=11.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (58850 x 784), the labels numpy array (58850)

Return type:

Bunch

References

https://medmnist.com/

Patrick Bilic, Patrick Ferdinand Christ, et al., “The liver tumor segmentation benchmark (lits),” arXiv preprint arXiv:1901.04056, 2019.

Xuanang Xu, Fugen Zhou, et al., “Efficient multiple organ localization in ct image using 3d region proposal network,” IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1885–1898, 2019.
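
How the subsets add up to the ‘all’ setting can be sketched with synthetic arrays of the documented OrganAMNIST subset sizes. The helper `combine_subsets` is hypothetical, and the order in which the loader concatenates the subsets is an assumption; the documentation only states that ‘all’ combines them.

```python
import numpy as np

def combine_subsets(subsets):
    """Stack (data, labels) pairs from several subsets into one data set."""
    data = np.concatenate([X for X, _ in subsets], axis=0)
    labels = np.concatenate([y for _, y in subsets], axis=0)
    return data, labels

# Documented subset sizes; only 4 synthetic features are used here to keep
# the example light (the real data has d=784).
sizes = {"train": 34581, "val": 6491, "test": 17778}
rng = np.random.default_rng(0)
subsets = [
    (rng.random((n, 4)), rng.integers(0, 11, size=n)) for n in sizes.values()
]

X_all, y_all = combine_subsets(subsets)
print(X_all.shape, y_all.shape)
```

The number of rows in the combined arrays equals the documented N=58850.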

clustpy.data.load_organ_c_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the OrganCMNIST data set. It consists of 23660 28x28 grayscale images belonging to one of 11 classes. The data set is composed of 13000 training, 2392 validation and 8268 test samples. N=23660, d=784, k=11.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (23660 x 784), the labels numpy array (23660)

Return type:

Bunch

References

https://medmnist.com/

Patrick Bilic, Patrick Ferdinand Christ, et al., “The liver tumor segmentation benchmark (lits),” arXiv preprint arXiv:1901.04056, 2019.

Xuanang Xu, Fugen Zhou, et al., “Efficient multiple organ localization in ct image using 3d region proposal network,” IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1885–1898, 2019.

clustpy.data.load_organ_mnist_3d(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the OrganMNIST3D data set. It consists of 1743 28x28x28 grayscale images belonging to one of 11 classes. The data set is composed of 972 training, 161 validation and 610 test samples. N=1743, d=21952, k=11.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1743 x 21952), the labels numpy array (1743)

Return type:

Bunch

References

https://medmnist.com/

Patrick Bilic, Patrick Ferdinand Christ, et al., “The liver tumor segmentation benchmark (lits),” arXiv preprint arXiv:1901.04056, 2019.

Xuanang Xu, Fugen Zhou, et al., “Efficient multiple organ localization in ct image using 3d region proposal network,” IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1885–1898, 2019.

clustpy.data.load_organ_s_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the OrganSMNIST data set. It consists of 25221 28x28 grayscale images belonging to one of 11 classes. The data set is composed of 13940 training, 2452 validation and 8829 test samples. N=25221, d=784, k=11.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (25221 x 784), the labels numpy array (25221)

Return type:

Bunch

References

https://medmnist.com/

Patrick Bilic, Patrick Ferdinand Christ, et al., “The liver tumor segmentation benchmark (lits),” arXiv preprint arXiv:1901.04056, 2019.

Xuanang Xu, Fugen Zhou, et al., “Efficient multiple organ localization in ct image using 3d region proposal network,” IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1885–1898, 2019.

clustpy.data.load_path_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the PathMNIST data set. It consists of 107180 28x28 colored images belonging to one of 9 classes. The data set is composed of 89996 training, 10004 validation and 7180 test samples. N=107180, d=2352, k=9.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (107180 x 2352), the labels numpy array (107180)

Return type:

Bunch

References

https://medmnist.com/

Jakob Nikolas Kather, Johannes Krisam, et al., “Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study,” PLOS Medicine, vol. 16, no. 1, pp. 1–22, 01 2019.
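
The difference between the HWC layout of ‘data’ and the CHW layout of ‘images’ can be illustrated with synthetic arrays of PathMNIST's image size. The sketch assumes that each row of ‘data’ is a row-major flattening of a height x width x channels image; the function names are hypothetical.

```python
import numpy as np

def flat_hwc_to_chw(data: np.ndarray, height: int, width: int, channels: int) -> np.ndarray:
    """Turn flattened HWC rows into an (N, C, H, W) image array."""
    images_hwc = data.reshape(-1, height, width, channels)
    return images_hwc.transpose(0, 3, 1, 2)

def chw_to_flat_hwc(images: np.ndarray) -> np.ndarray:
    """Turn an (N, C, H, W) image array back into flattened HWC rows."""
    n_samples = images.shape[0]
    return images.transpose(0, 2, 3, 1).reshape(n_samples, -1)

# Five synthetic 28x28 RGB images, flattened to d = 28 * 28 * 3 = 2352.
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(5, 2352))

images = flat_hwc_to_chw(data, height=28, width=28, channels=3)
restored = chw_to_flat_hwc(images)
```

A CHW array of this kind is the representation the ZNormalizer expects for channel-wise normalization of RGB images.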

clustpy.data.load_pendigits(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the pendigits data set. It consists of 10992 vectors of length 16, each representing 8 (x, y) coordinates. The coordinates were recorded while digits (0 to 9) were written on a tablet. The data set is composed of 7494 training and 3498 test samples. N=10992, d=16, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (10992 x 16), the labels numpy array (10992)

Return type:

Bunch

References

http://archive.ics.uci.edu/ml/datasets/pen-based+recognition+of+handwritten+digits
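
A pendigits-style row of length 16 can be viewed as 8 two-dimensional points. The sketch below uses synthetic values and assumes the features are interleaved as x1, y1, x2, y2, and so on; the helper `to_points` is hypothetical.

```python
import numpy as np

def to_points(data: np.ndarray) -> np.ndarray:
    """Reshape (N, 16) rows into (N, 8, 2) arrays of (x, y) points."""
    return data.reshape(data.shape[0], -1, 2)

# Four synthetic samples standing in for rows of the data array.
rng = np.random.default_rng(0)
data = rng.integers(0, 101, size=(4, 16))

points = to_points(data)

# Length of the polyline through the 8 points of each sample.
segments = np.diff(points, axis=1)                       # shape (4, 7, 2)
path_length = np.linalg.norm(segments, axis=2).sum(axis=1)
```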

clustpy.data.load_plane(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the plane data set. It consists of 210 samples belonging to one of 7 classes. The data set is composed of 105 training and 105 test samples. N=210, d=144, k=7.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (210 x 144), the labels numpy array (210)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=Plane

clustpy.data.load_pneumonia_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the PneumoniaMNIST data set. It consists of 5856 28x28 grayscale images belonging to one of 2 classes. The data set is composed of 4708 training, 524 validation and 624 test samples. N=5856, d=784, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (5856 x 784), the labels numpy array (5856)

Return type:

Bunch

References

https://medmnist.com/

Daniel S. Kermany, Michael Goldbaum, et al., “Identifying medical diagnoses and treatable diseases by image-based deep learning,” Cell, vol. 172, no. 5, pp. 1122–1131.e9, 2018.

clustpy.data.load_proximal_phalanx_outline(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the proximal phalanx outline data set. It consists of 876 samples belonging to one of 2 classes. The data set is composed of 600 training and 276 test samples. N=876, d=80, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (876 x 80), the labels numpy array (876)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=ProximalPhalanxOutlineCorrect

clustpy.data.load_retina_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the RetinaMNIST data set. It consists of 1600 28x28 colored images belonging to one of 5 classes. The data set is composed of 1080 training, 120 validation and 400 test samples. N=1600, d=2352, k=5.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1600 x 2352), the labels numpy array (1600)

Return type:

Bunch

References

https://medmnist.com/

DeepDR Diabetic Retinopathy Image Dataset (DeepDRiD), “The 2nd diabetic retinopathy grading and image quality estimation challenge,” https://isbi.deepdr.org/data.html, 2020.

clustpy.data.load_reuters(subset: str = 'all', n_features: int = 2000, categories: tuple = ('CCAT', 'GCAT', 'MCAT', 'ECAT'), return_X_y: bool = False) → Bunch[source]

Load the Reuters data set. It consists of over 800000 manually categorized newswire stories made available by Reuters, Ltd. Usually, only a subset of the categories is used; these categories are specified by the ‘categories’ parameter. Only articles that belong to a single category are kept, and only the n_features most frequent features are used. With the default settings, the data set is composed of 19806 training and 665265 test documents. N=685071, d=2000, k=4 using the default settings.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
n_features (int) – number of features used (default: 2000)
categories (tuple) – the categories that should be contained (default: (“CCAT”, “GCAT”, “MCAT”, “ECAT”))
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (685071 x 2000 - using the default settings), the labels numpy array (685071 - using the default settings)

Return type:

Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_rcv1.html#sklearn.datasets.fetch_rcv1

and

Lewis, David D., et al. “Rcv1: A new benchmark collection for text categorization research.” Journal of machine learning research 5.Apr (2004): 361-397.
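
The two filtering steps described for the Reuters data set can be sketched on a tiny synthetic example. This is not the loader's implementation: the helper names are hypothetical, and measuring feature frequency by column sums is an assumption.

```python
import numpy as np

def filter_single_category(X: np.ndarray, Y: np.ndarray):
    """Keep documents that belong to exactly one of the selected categories.

    Y is an (n_documents, n_categories) binary indicator matrix.
    """
    mask = Y.sum(axis=1) == 1
    labels = Y[mask].argmax(axis=1)
    return X[mask], labels

def keep_most_frequent(X: np.ndarray, n_features: int) -> np.ndarray:
    """Keep the n_features columns with the largest column sums."""
    totals = X.sum(axis=0)
    top = np.argsort(-totals, kind="stable")[:n_features]
    return X[:, np.sort(top)]  # preserve the original column order

# Four synthetic documents, four features, two categories.
X = np.array([[1, 0, 3, 0],
              [0, 0, 2, 1],
              [4, 0, 1, 0],
              [0, 5, 0, 0]])
Y = np.array([[1, 0],    # category 0 only: kept
              [1, 1],    # both categories: dropped
              [0, 1],    # category 1 only: kept
              [0, 0]])   # none of the categories: dropped

X_single, y = filter_single_category(X, Y)
X_top = keep_most_frequent(X_single, n_features=2)
```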

clustpy.data.load_seeds(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the seeds data set. It consists of 210 samples belonging to one of three varieties of wheat. N=210, d=7, k=3.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (210 x 7), the labels numpy array (210)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/seeds

clustpy.data.load_semeion(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the semeion data set. It consists of 1593 samples belonging to one of 10 classes. Each sample corresponds to a 16x16 grayscale scan of a handwritten digit; the digits originate from about 80 different persons. Further, each pixel was converted to a boolean value using a fixed threshold. N=1593, d=256, k=10.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1593 x 256), the labels numpy array (1593)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/semeion+handwritten+digit
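
Converting a grayscale scan to boolean pixels with a fixed threshold can be sketched as follows. The scan is synthetic and the threshold of 127 is a hypothetical value; the documentation does not state the threshold that was actually used.

```python
import numpy as np

def binarize(scan: np.ndarray, threshold: int) -> np.ndarray:
    """Map every pixel to 1 if it exceeds the threshold, otherwise to 0."""
    return (scan > threshold).astype(np.uint8)

# Synthetic 16x16 grayscale scan with values in [0, 255].
rng = np.random.default_rng(0)
scan = rng.integers(0, 256, size=(16, 16))

# Hypothetical threshold of 127; the flattened result has d=256 entries.
sample = binarize(scan, threshold=127).reshape(-1)
```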

clustpy.data.load_skin(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Skin Segmentation data set. It consists of 245057 skin and non-skin samples described by their B, G, R color values. N=245057, d=3, k=2.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (245057 x 3), the labels numpy array (245057)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/skin+segmentation

clustpy.data.load_sony_aibo_robot_surface(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Sony AIBO Robot Surface 1 data set. It consists of 621 samples belonging to one of 2 classes. The data set is composed of 20 training and 601 test samples. N=621, d=70, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (621 x 70), the labels numpy array (621)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=SonyAIBORobotSurface1

clustpy.data.load_soybean_large(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the large version of the soybean data set. It consists of 562 samples belonging to one of 15 classes. The original data set contains more samples and 19 classes, but samples with attributes showing ‘?’ values are ignored. The data set is composed of 266 training and 296 test samples. N=562, d=35, k=15.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (562 x 35), the labels numpy array (562)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/soybean+(Large)
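
Ignoring samples that contain ‘?’ values can be sketched on a small synthetic table. The values and class names below are made up for illustration, and the snippet is not the loader's implementation. It also shows how a class can disappear entirely when all of its samples are incomplete.

```python
import numpy as np

# Synthetic raw table: four samples, three categorical attributes.
rows = np.array([["0", "1", "2"],
                 ["?", "1", "0"],
                 ["1", "1", "1"],
                 ["2", "?", "?"]])
labels = np.array(["a", "b", "a", "c"])

# Keep only samples in which no attribute is missing.
complete = ~(rows == "?").any(axis=1)
X = rows[complete].astype(int)
y = labels[complete]

print(X.shape, np.unique(y))
```

In this toy example, classes ‘b’ and ‘c’ vanish because each of their samples has a missing attribute.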

clustpy.data.load_soybean_small(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the small version of the soybean data set. It is a small subset of the original soybean data set. It consists of 47 samples belonging to one of 4 classes. N=47, d=35, k=4.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (47 x 35), the labels numpy array (47)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/soybean+(small)

clustpy.data.load_spambase(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the spambase data set. It consists of 4601 spam and non-spam mails. N=4601, d=57, k=2.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (4601 x 57), the labels numpy array (4601)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/spambase

clustpy.data.load_statlog_australian_credit_approval(return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the statlog Australian Credit Approval data set. It consists of 690 samples belonging to one of 2 classes. N=690, d=14, k=2.

Parameters:

return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (690 x 14), the labels numpy array (690)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/statlog+(australian+credit+approval)

clustpy.data.load_statlog_shuttle(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the statlog shuttle data set. It consists of 58000 samples belonging to one of 7 classes. A special property is that about 80% of the data belongs to class 0. The data set is composed of 43500 training and 14500 test samples. N=58000, d=9, k=7.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (58000 x 9), the labels numpy array (58000)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)
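
The class imbalance mentioned for the statlog shuttle data set can be inspected with a short helper. The labels below are synthetic: only the share of roughly 80% for class 0 is taken from the description, while the proportions of the remaining classes are hypothetical.

```python
import numpy as np

def class_proportions(labels: np.ndarray, n_classes: int) -> np.ndarray:
    """Return the fraction of samples that belongs to each class."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

# Synthetic labels with about 80% of the samples in class 0.
rng = np.random.default_rng(0)
hypothetical_shares = [0.8, 0.15, 0.01, 0.01, 0.01, 0.01, 0.01]
labels = rng.choice(7, size=58000, p=hypothetical_shares)

proportions = class_proportions(labels, n_classes=7)
majority_share = proportions.max()
```

With such a distribution, assigning every sample to a single cluster already matches the majority class for about 80% of the data, which is worth keeping in mind when interpreting evaluation scores.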

clustpy.data.load_stickfigures(return_X_y: bool = False) → Bunch[source]

Load the Dancing Stick Figures data set. It consists of 900 20x20 grayscale images of stick figures in different poses. The poses can be divided into three upper-body and three lower-body motions. N=900, d=400, k=[3,3].

Parameters:: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns:: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (900 x 400), the labels numpy array (900 x 2)
Return type:: Bunch

References

Günnemann, Stephan, et al. “Smvc: semi-supervised multi-view clustering in subspace projections.” Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014.
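
A label array of shape (900 x 2) with k=[3,3] holds one labeling per column. The sketch below builds a synthetic array of that shape; the assignment of column 0 to the upper body and column 1 to the lower body, as well as the balanced class sizes, are assumptions for illustration.

```python
import numpy as np

# Synthetic labels: 3 upper-body motions x 3 lower-body motions, 100 samples each.
upper = np.repeat(np.arange(3), 300)
lower = np.tile(np.repeat(np.arange(3), 100), 3)
target = np.column_stack([upper, lower])

# Number of clusters per labeling, matching k=[3,3].
k_per_labeling = [len(np.unique(target[:, j])) for j in range(target.shape[1])]

# Each column can be evaluated separately, e.g. against a clustering result.
n_pose_combinations = len(np.unique(target, axis=0))
print(target.shape, k_per_labeling, n_pose_combinations)
```

A clustering result is then typically compared against each column separately.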

clustpy.data.load_stl10(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the STL10 data set. It consists of 13000 96x96 color images showing different objects. The classes are airplane, bird, car, cat, deer, dog, horse, monkey, ship and truck. The data set is composed of 5000 training and 8000 test images. N=13000, d=27648, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (13000 x 27648), the labels numpy array (13000)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.STL10.html

and

Coates, Adam, Andrew Ng, and Honglak Lee. “An analysis of single-layer networks in unsupervised feature learning.” Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2011.

clustpy.data.load_svhn(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the SVHN data set. It consists of 99289 32x32 color images showing house numbers (0 to 9). The data set is composed of 73257 training and 26032 test images. N=99289, d=3072, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (99289 x 3072), the labels numpy array (99289)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.SVHN.html

and

Netzer, Yuval, et al. “Reading digits in natural images with unsupervised feature learning.” (2011).

clustpy.data.load_symbols(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the symbols data set. It consists of 1020 samples belonging to one of 6 classes. The data set is composed of 25 training and 995 test samples. N=1020, d=398, k=6.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1020 x 398), the labels numpy array (1020)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=Symbols

clustpy.data.load_synapse_mnist_3d(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the SynapseMNIST3D data set. It consists of 1759 28x28x28 grayscale images belonging to one of 2 classes. The data set is composed of 1230 training, 177 validation and 352 test samples. N=1759, d=21952, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1759 x 21952), the labels numpy array (1759)

Return type:

Bunch

References

https://medmnist.com/
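
The dimensionality d=21952 follows from flattening each 28x28x28 volume into a single row. The sketch below shows this with plain NumPy on a few synthetic volumes. The axis order used internally by clustpy is not stated here, so the reshape is only an illustration of the arithmetic.

```python
import numpy as np

n_samples, side = 4, 28  # 4 synthetic volumes instead of the real 1759
volumes = np.arange(n_samples * side**3, dtype=np.float32).reshape(
    n_samples, side, side, side
)

# Flatten each volume into one row: d = 28 * 28 * 28 = 21952
flat = volumes.reshape(n_samples, -1)

# Reverse the flattening to recover the 3D grids
restored = flat.reshape(n_samples, side, side, side)

print(flat.shape)
```

Because the images are grayscale, there is no channel axis to reorder in this sketch.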

clustpy.data.load_tissue_mnist(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the TissueMNIST data set. It consists of 236386 28x28 grayscale images belonging to one of 8 classes. The data set is composed of 165466 training, 23640 validation and 47280 test samples. N=236386, d=784, k=8.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (236386 x 784), the labels numpy array (236386)

Return type:

Bunch

References

https://medmnist.com/

Vebjorn Ljosa, Katherine L Sokolnicki, et al., “Annotated high-throughput microscopy image sets for validation,” Nature Methods, vol. 9, no. 7, pp. 637–637, 2012.

clustpy.data.load_two_patterns(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the two patterns data set. It consists of 5000 samples belonging to one of 4 classes. The data set is composed of 1000 training and 4000 test samples. N=5000, d=128, k=4.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (5000 x 128), the labels numpy array (5000)

Return type:

Bunch

References

http://www.timeseriesclassification.com/description.php?Dataset=TwoPatterns

clustpy.data.load_user_knowledge(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the user knowledge data set. It consists of 403 samples belonging to one of 4 classes. The 4 classes are the knowledge levels ‘very low’, ‘low’, ‘middle’ and ‘high’. The data set is composed of 258 training and 145 test samples. N=403, d=5, k=4.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (403 x 5), the labels numpy array (403)

Return type:

Bunch

References

https://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling

clustpy.data.load_usps(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the USPS data set. It consists of 9298 16x16 grayscale images showing handwritten digits (0 to 9). The data set is composed of 7291 training and 2007 test images. N=9298, d=256, k=10.

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (9298 x 256), the labels numpy array (9298)

Return type:

Bunch

References

https://pytorch.org/vision/stable/generated/torchvision.datasets.USPS.html

and

Hull, Jonathan J. “A database for handwritten text recognition research.” IEEE Transactions on pattern analysis and machine intelligence 16.5 (1994): 550-554.

clustpy.data.load_vessel_mnist_3d(subset: str = 'all', return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the VesselMNIST3D data set. It consists of 1909 28x28x28 grayscale images belonging to one of 2 classes. The data set is composed of 1335 training, 192 validation and 382 test samples. N=1909, d=21952, k=2.

Parameters:

subset (str) – can be ‘all’, ‘test’, ‘train’ or ‘val’. ‘all’ combines test, train and validation data (default: ‘all’)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1909 x 21952), the labels numpy array (1909)

Return type:

Bunch

References

https://medmnist.com/

Xi Yang, Ding Xia, et al., “Intra: 3d intracranial aneurysm dataset for deep learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

clustpy.data.load_video_keck_gesture(subset: str = 'all', image_size: tuple = (200, 200), frame_sampling_ratio: float = 1, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Keck Gesture video data set. It consists of 42 training and 56 testing videos showing 4 different persons performing 14 different gestures. We assign the label ‘0’ to the gesture ‘no gesture’, which describes the frames between the actual gestures. This results in 15 different gestures. Note that the person with label ‘3’ is only contained in the testing data. We transform the data set by extracting the 25457 480x640 colored frames. Note that the number of frames can differ depending on the machine and the OpenCV version used. Further, we recommend downsizing the frames to avoid possible memory issues. The final data set is divided into 13546 training and 11911 test images. The two label sets are the gestures and the persons. N=25457, d=120000 (for image_size (200, 200)), k=[15, 4].

Parameters:

subset (str) – can be ‘all’, ‘test’ or ‘train’. ‘all’ combines test and train data (default: ‘all’)
image_size (tuple) – The single frames can be downsized. This is necessary for large datasets. The tuple equals (width, height) of the images. Can also be None if the image size should not be changed (default: (200, 200))
frame_sampling_ratio (float) – Ratio to downsample the number of frames of each video. If it is set to 1 all frames will be returned. Can take values within (0, 1] (default: 1)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (25457 x 120000 (for image_size (200, 200))), the labels numpy array (25457 x 2)

Return type:

Bunch

References

http://www.zhuolin.umiacs.io/Keckgesturedataset.html
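
To make the memory recommendation for the Keck Gesture frames concrete, the following sketch computes the per-frame dimensionality for a given `image_size` and a rough size of the resulting array. The helper names `flattened_dim` and `approx_gigabytes` are hypothetical, and one byte per value (uint8 storage) is an assumption. The dtype actually returned by the loader is not stated here.

```python
def flattened_dim(image_size, channels=3):
    """Number of features per frame for a (width, height) color image."""
    width, height = image_size
    return width * height * channels


def approx_gigabytes(n_frames, image_size, bytes_per_value=1):
    """Rough array size in GB (10**9 bytes); 1 byte per value assumes uint8."""
    return n_frames * flattened_dim(image_size) * bytes_per_value / 1e9


n_frames = 25457
# Original 480x640 frames, written as (width, height), versus the default size
for size in [(640, 480), (200, 200)]:
    dim = flattened_dim(size)
    print(size, dim, round(approx_gigabytes(n_frames, size), 2), "GB")
```

With the default `image_size` of (200, 200), each frame has 200 * 200 * 3 = 120000 features, which matches the documented d.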

clustpy.data.load_video_weizmann(image_size: tuple = None, frame_sampling_ratio: float = 1, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the Weizmann video data set. It consists of 93 videos showing 9 different persons performing 10 different activities. We transform the data set by extracting the 5687 144x180 colored frames. Note that the number of frames can differ depending on the machine and the OpenCV version used. The two label sets are the activities and the persons. N=5687, d=77760, k=[10, 9].

Parameters:

image_size (tuple) – The single frames can be downsized. This is necessary for large datasets. The tuple equals (width, height) of the images. Can also be None if the image size should not be changed (default: None)
frame_sampling_ratio (float) – Ratio to downsample the number of frames of each video. If it is set to 1 all frames will be returned. Can take values within (0, 1] (default: 1)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Furthermore, the original images are contained in the ‘images’ attribute. Note that the data within ‘data’ is in HWC format and within ‘images’ in the CHW format. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (5687 x 77760), the labels numpy array (5687 x 2)

Return type:

Bunch

References

https://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html
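
Since the Weizmann labels form a two-column array, each label set can be pulled out by column. The sketch below uses synthetic labels with the documented layout. The column order (activities first, persons second) is an assumption inferred from k=[10, 9] and should be checked against the loaded data.

```python
import numpy as np

# Synthetic labels with the documented layout (n_frames x 2); not the real data.
rng = np.random.default_rng(0)
n_frames = 12
labels = np.column_stack([
    rng.integers(0, 10, size=n_frames),  # assumed column 0: activity (10 classes)
    rng.integers(0, 9, size=n_frames),   # assumed column 1: person (9 classes)
])

# Select one label set for a clustering evaluation
activity_labels = labels[:, 0]
person_labels = labels[:, 1]

print(labels.shape, activity_labels.shape, person_labels.shape)
```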

clustpy.data.load_webkb(use_universities: tuple = ('cornell', 'texas', 'washington', 'wisconsin'), use_categories: tuple = ('course', 'faculty', 'project', 'student'), remove_headers: bool = True, min_doc_frequency: float = 0.01, min_variance: float = 0.25, return_X_y: bool = False, downloads_path: str = None) → Bunch[source]

Load the WebKB data set. It consists of 1041 HTML documents from different universities (default: “cornell”, “texas”, “washington” and “wisconsin”). These web pages have a specified category (default: “course”, “faculty”, “project”, “student”). For more information, see the website listed under References. The data is preprocessed by using stemming and removing stop words. Furthermore, words with a document frequency smaller than min_doc_frequency or with a variance smaller than min_variance will be removed. N=1041, d=323, k=[4,4] using the default settings.

Parameters:

use_universities (tuple) – specify the universities (default: (“cornell”, “texas”, “washington”, “wisconsin”))
use_categories (tuple) – specify the categories (default: (“course”, “faculty”, “project”, “student”))
remove_headers (bool) – should the headers of the HTML files be removed? (default: True)
min_doc_frequency (float) – minimum document frequency of the words (default: 0.01)
min_variance (float) – minimum variance of the words (default: 0.25)
return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
downloads_path (str) – path to the directory where the data is stored (default: None -> [USER]/Downloads/clustpy_datafiles)

Returns:

bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (1041 x 323 - using the default settings), the labels numpy array (1041 x 2 - using the default settings)

Return type:

Bunch

References

http://www.cs.cmu.edu/~webkb/
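
The word filtering described for the WebKB data set can be sketched on a toy term matrix. This is not clustpy's implementation. In particular, which word representation (raw counts or weighted values) the variance is computed on is an assumption here, and the thresholds are enlarged so that the effect is visible on four documents.

```python
import numpy as np

# Toy term matrix: 4 documents x 4 words (hypothetical values).
X = np.array([
    [3.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
])
min_doc_frequency = 0.3  # larger than the default 0.01, for illustration only
min_variance = 0.25

# Share of documents in which each word occurs at least once
doc_frequency = (X > 0).mean(axis=0)
# Variance of each word's values across documents
variance = X.var(axis=0)

# Keep only words that pass both thresholds
keep = (doc_frequency >= min_doc_frequency) & (variance >= min_variance)
X_filtered = X[:, keep]

print(keep, X_filtered.shape)
```

Only the first word survives: the second is too rare, the third has zero variance, and the fourth never occurs.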

clustpy.data.load_wine(return_X_y: bool = False) → Bunch[source]

Load the wine data set. It consists of 13 different properties of three different types of wine. N=178, d=13, k=3.

Parameters:: return_X_y (bool) – If True, returns (data, target) instead of a Bunch object. See below for more information about the data and target object (default: False)
Returns:: bunch – A Bunch object containing the data in the ‘data’ attribute and the labels in the ‘target’ attribute. Alternatively, if return_X_y is True two arrays will be returned: the data numpy array (178 x 13), the labels numpy array (178)
Return type:: Bunch

References

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html

https://archive.ics.uci.edu/ml/datasets/wine

clustpy.data.unflatten_images(data_flatten: ndarray, image_size: tuple) → ndarray[source]

Convert a data array from flattened numerical vectors to images. After unflattening, color images will be converted to the CHW or CHWD (color channels, height, width[, depth]) format.

Parameters:

data_flatten (np.ndarray) – The given flattened data set
image_size (tuple) – The size of a single image, e.g., (28,28,3) for a colored image of size 28 x 28

Returns:

data_image – The unflattened data array corresponding to the images

Return type:

np.ndarray
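
The conversion from flattened HWC rows to CHW images can be sketched with plain NumPy. The helper `unflatten_hwc_to_chw` is hypothetical and is not clustpy's implementation. It covers only 2D grayscale and 2D color images and assumes that `image_size` lists the channel count last, as in the (28,28,3) example above.

```python
import numpy as np


def unflatten_hwc_to_chw(data_flat, image_size):
    """Reshape (N, H*W*C) rows to (N, H, W, C), then move channels first."""
    n_samples = data_flat.shape[0]
    images = data_flat.reshape(n_samples, *image_size)
    if len(image_size) == 3:  # color image given as (H, W, C)
        images = np.transpose(images, (0, 3, 1, 2))  # -> (N, C, H, W)
    return images


# Two synthetic 28 x 28 color images stored as flat HWC rows
flat = np.arange(2 * 28 * 28 * 3).reshape(2, -1)
images = unflatten_hwc_to_chw(flat, (28, 28, 3))

print(flat.shape, images.shape)
```

In the HWC layout, the channel values of one pixel are adjacent in the flat row, so the transpose is needed to group all values of one channel together.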

clustpy.data.z_normalization(X: ndarray, feature_or_channel_wise: bool = False) → ndarray[source]

Wrapper for the ZNormalizer. It automatically executes: X_transform = ZNormalizer(feature_or_channel_wise).fit_transform(X)

Parameters:

X (np.ndarray) – the given data set
feature_or_channel_wise (bool) – Specifies if all data should be used for the normalization or if a feature-/channel-wise normalization should be applied (default: False)

Returns:

X_transform – The transformed data set

Return type:

np.ndarray
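
The difference between the two normalization strategies can be illustrated with plain NumPy on tabular data. This is a minimal sketch of (data - mean) / std, not the ZNormalizer code itself. It uses NumPy's default population standard deviation, which is an assumption, and it does not handle features with zero variance or the channel-wise image case.

```python
import numpy as np

# Synthetic data whose three features have very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(200, 3))

# Global strategy: one mean and one std computed over all entries
X_global = (X - X.mean()) / X.std()

# Feature-wise strategy: one mean and one std per column
X_feature = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_global.std(axis=0))
print(X_feature.std(axis=0))
```

The global variant preserves the relative scale of the features, while the feature-wise variant gives every column unit standard deviation.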