Custom Datasets

This module contains custom dataset classes for handling image and tabular data from OpenML in PyTorch. To add support for new data types, new classes can be added to this module.
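As a rough illustration of what adding a new data type might involve, the sketch below defines a map-style dataset for a text column. The class name `ExampleTextDataset`, the `max_len` parameter, and the byte-level encoding are all hypothetical and are not part of openml_pytorch. The sketch only mirrors the `__init__` / `__getitem__` / `__len__` pattern that the classes in this module follow.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class ExampleTextDataset(Dataset):
    """Hypothetical dataset for a text column; not part of openml_pytorch."""

    def __init__(self, X, y=None, max_len=16):
        self.X = X              # DataFrame whose first column holds the text
        self.y = y              # optional pandas Series of targets
        self.max_len = max_len  # hypothetical fixed sequence length

    def __getitem__(self, idx):
        text = str(self.X.iloc[idx, 0])
        # Toy encoding: UTF-8 byte values, truncated or zero-padded to max_len.
        codes = list(text.encode("utf-8"))[: self.max_len]
        codes += [0] * (self.max_len - len(codes))
        x = torch.tensor(codes, dtype=torch.long)
        if self.y is None:
            return x
        return x, torch.tensor(self.y.iloc[idx])

    def __len__(self):
        return len(self.X)


X = pd.DataFrame({"text": ["hello", "openml"]})
y = pd.Series([0, 1])
dataset = ExampleTextDataset(X, y, max_len=8)
x0, y0 = dataset[0]
print(len(dataset), x0.shape, y0.item())
```

Returning `x` alone when no target is given matches how the two classes below behave when `y` is `None`.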

OpenMLImageDataset

Bases: Dataset

Class representing an image dataset from OpenML for use in PyTorch.

Methods:

__init__(self, X, y, image_size, image_dir, transform_x=None, transform_y=None)
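    Initializes the dataset with the data X (a DataFrame whose first column holds image file names relative to image_dir), the labels y (or None), the target image size, the image directory, and optional transformations for images and labels.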

__getitem__(self, idx)
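    Reads the image at the specified index, converts it to float, resizes it to image_size by image_size, and applies transform_x if provided. Returns (image, label) when a label is available (with transform_y applied if provided), otherwise the image alone.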

__len__(self)
    Returns the total number of images in the dataset.
Source code in openml_pytorch/custom_datasets.py
# Imports the class relies on; T is assumed to be torchvision.transforms.
import os

import torchvision.transforms as T
from torch.utils.data import Dataset
from torchvision.io import read_image


class OpenMLImageDataset(Dataset):
    """
    Class representing an image dataset from OpenML for use in PyTorch.

    Methods:

        __init__(self, X, y, image_size, image_dir, transform_x=None, transform_y=None)
            Initializes the dataset with given data, image size, directory, and optional transformations.

        __getitem__(self, idx)
            Retrieves an image and its corresponding label (if available) from the dataset at the specified index. Applies transformations if provided.

        __len__(self)
            Returns the total number of images in the dataset.
    """
    def __init__(self, X, y, image_size, image_dir, transform_x=None, transform_y=None):
        self.X = X
        self.y = y
        self.image_size = image_size
        self.image_dir = image_dir
        self.transform_x = transform_x
        self.transform_y = transform_y

    def __getitem__(self, idx):
        # The first column of X holds the image file name, relative to image_dir.
        img_name = str(os.path.join(self.image_dir, self.X.iloc[idx, 0]))
        image = read_image(img_name)
        image = image.float()
        image = T.Resize((self.image_size, self.image_size))(image)
        if self.transform_x is not None:
            image = self.transform_x(image)
        # Without labels (for example at prediction time), return the image alone.
        if self.y is None:
            return image
        label = self.y.iloc[idx]
        if label is None:
            # A missing label is treated like an unlabeled sample instead of returning None.
            return image
        if self.transform_y is not None:
            label = self.transform_y(label)
        return image, label

    def __len__(self):
        return len(self.X)

OpenMLTabularDataset

Bases: Dataset

A custom dataset class to handle tabular data from OpenML (or any similar tabular dataset). It converts categorical feature columns (object or category dtype) to integer codes using pandas categorical codes. The target y is stored as given and is not encoded.

Methods:

__init__(self, X, y)
    Initializes the dataset with the features X and the target y. Converts the categorical feature columns to integer codes.

__getitem__(self, idx)
    Retrieves the input data and, when y is provided, the target value at the specified index. Converts them to tensors and returns them.

__len__(self)
    Returns the number of rows in the dataset.

Source code in openml_pytorch/custom_datasets.py
# Imports the class relies on.
import torch
from torch.utils.data import Dataset


class OpenMLTabularDataset(Dataset):
    """
    A custom dataset class to handle tabular data from OpenML (or any similar tabular dataset).
    Categorical feature columns are converted to integer codes with pandas; y is stored unchanged.

    Methods:
        __init__(X, y) : Initializes the dataset with the data and the target column.
                         Encodes the categorical features.

        __getitem__(idx): Retrieves the input data and target value at the specified index.
                          Converts the data to tensors and returns them.

        __len__(): Returns the length of the dataset.
    """
    def __init__(self, X, y):
        # Copy so that encoding does not modify the caller's DataFrame.
        self.data = X.copy()
        for col in self.data.select_dtypes(include=['object', 'category']):
            # Replace each category with its integer code (-1 marks missing values).
            self.data[col] = self.data[col].astype('category').cat.codes
        self.label_mapping = None
        self.y = y

    def __getitem__(self, idx):
        # x is the input row, y is the matching target value
        x = self.data.iloc[idx, :]
        x = torch.tensor(x.values.astype('float32'))
        if self.y is not None:
            # Index by position so a pandas Series with a non-default index still lines up with x.
            y = self.y.iloc[idx] if hasattr(self.y, 'iloc') else self.y[idx]
            y = torch.tensor(y)
            return x, y
        else:
            return x

    def __len__(self):
        return len(self.data)
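To see what the encoding and tensor conversion produce, the sketch below applies the same two steps to a small made-up DataFrame. The column names and values are invented for illustration, and the code repeats the steps inline rather than importing the class.

```python
import pandas as pd
import torch

# Made-up data: one categorical column and one numeric column.
X = pd.DataFrame({
    "color": pd.Categorical(["red", "green", "red"]),
    "size": [1.5, 2.0, 0.5],
})
y = pd.Series([1, 0, 1])

# Same encoding step as __init__: object/category columns become integer codes.
encoded = X.copy()
for col in encoded.select_dtypes(include=["object", "category"]):
    encoded[col] = encoded[col].astype("category").cat.codes

# Same conversion as __getitem__: one row becomes a float32 tensor.
row = torch.tensor(encoded.iloc[0, :].values.astype("float32"))
target = torch.tensor(y.iloc[0])

print(encoded["color"].tolist(), row, target)
```

The codes follow the sorted category order (here `green` is 0 and `red` is 1), so they carry no meaning beyond identity. Models that treat inputs as continuous may need an embedding or one-hot step on top of this.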