dataset

OpenMLDataset

Bases: OpenMLBase

Dataset object.

Allows fetching and uploading datasets to OpenML.

Parameters:

name : str (required)
    Name of the dataset.
description : str (required)
    Description of the dataset.
data_format : str (default: 'arff')
    Format of the dataset, either 'arff' or 'sparse_arff'.
cache_format : str (default: 'pickle')
    Format for caching the dataset, either 'feather' or 'pickle'.
dataset_id : int (default: None)
    Id autogenerated by the server.
version : int (default: None)
    Version of this dataset. '1' for the original version. Auto-incremented by the server.
creator : str (default: None)
    The person who created the dataset.
contributor : str (default: None)
    People who contributed to the current version of the dataset.
collection_date : str (default: None)
    The date the data was originally collected, given by the uploader.
upload_date : str (default: None)
    The date-time when the dataset was uploaded, generated by the server.
language : str (default: None)
    Language in which the data is represented. Starts with an upper case letter, rest lower case, e.g. 'English'.
licence : str (default: None)
    License of the data.
url : str (default: None)
    Valid URL that points to the actual data file. The file can be on the OpenML server or in another dataset repository.
default_target_attribute : str (default: None)
    The default target attribute, if it exists. Can have multiple values, comma separated.
row_id_attribute : str (default: None)
    The attribute that represents the row-id column, if present in the dataset.
ignore_attribute : str | list (default: None)
    Attributes that should be excluded from modelling, such as identifiers and indexes.
version_label : str (default: None)
    Version label provided by the user. Can be a date, hash, or some other type of id.
citation : str (default: None)
    Reference(s) that should be cited when building on this data.
tag : str (default: None)
    Tags describing the dataset.
visibility : str (default: None)
    Who can see the dataset. Typical values: 'Everyone', 'All my friends', 'Only me'. Can also be any of the user's circles.
original_data_url : str (default: None)
    For derived data, the URL of the original dataset.
paper_url : str (default: None)
    Link to a paper describing the dataset.
update_comment : str (default: None)
    An explanation provided when the dataset is uploaded.
md5_checksum : str (default: None)
    MD5 checksum to check that the dataset was downloaded without corruption.
data_file : str (default: None)
    Path to where the dataset is located.
features_file : dict (default: None)
    A dictionary of dataset features, mapping a feature index to an OpenMLDataFeature.
qualities_file : dict (default: None)
    A dictionary of dataset qualities, mapping a quality name to a quality value.
dataset : str | None (default: None)
    Serialized ARFF dataset string.
parquet_url : str | None (default: None)
    URL of the storage location where the dataset files are hosted (e.g. a MinIO bucket). If specified, the data is read from this URL.
parquet_file : str | None (default: None)
    Path to the local parquet file.
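
In typical use the object is not constructed by hand but obtained through the openml-python helpers, after which the data is loaded lazily. A minimal sketch, assuming the openml package is installed, the public OpenML server is reachable, and dataset id 61 (the iris dataset) exists there:

import openml

# Fetch the metadata; the data files themselves are only downloaded on demand.
dataset = openml.datasets.get_dataset(61)

# Load the data as a pandas DataFrame and split off the default target column.
X, y, categorical_indicator, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute,
)

print(dataset.name, dataset.version, X.shape)
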
Source code in openml/datasets/dataset.py
class OpenMLDataset(OpenMLBase):
    """Dataset object.

    Allows fetching and uploading datasets to OpenML.

    Parameters
    ----------
    name : str
        Name of the dataset.
    description : str
        Description of the dataset.
    data_format : str
        Format of the dataset which can be either 'arff' or 'sparse_arff'.
    cache_format : str
        Format for caching the dataset which can be either 'feather' or 'pickle'.
    dataset_id : int, optional
        Id autogenerated by the server.
    version : int, optional
        Version of this dataset. '1' for original version.
        Auto-incremented by server.
    creator : str, optional
        The person who created the dataset.
    contributor : str, optional
        People who contributed to the current version of the dataset.
    collection_date : str, optional
        The date the data was originally collected, given by the uploader.
    upload_date : str, optional
        The date-time when the dataset was uploaded, generated by server.
    language : str, optional
        Language in which the data is represented.
        Starts with 1 upper case letter, rest lower case, e.g. 'English'.
    licence : str, optional
        License of the data.
    url : str, optional
        Valid URL, points to actual data file.
        The file can be on the OpenML server or another dataset repository.
    default_target_attribute : str, optional
        The default target attribute, if it exists.
        Can have multiple values, comma separated.
    row_id_attribute : str, optional
        The attribute that represents the row-id column,
        if present in the dataset.
    ignore_attribute : str | list, optional
        Attributes that should be excluded in modelling,
        such as identifiers and indexes.
    version_label : str, optional
        Version label provided by user.
        Can be a date, hash, or some other type of id.
    citation : str, optional
        Reference(s) that should be cited when building on this data.
    tag : str, optional
        Tags describing the dataset.
    visibility : str, optional
        Who can see the dataset.
        Typical values: 'Everyone','All my friends','Only me'.
        Can also be any of the user's circles.
    original_data_url : str, optional
        For derived data, the url to the original dataset.
    paper_url : str, optional
        Link to a paper describing the dataset.
    update_comment : str, optional
        An explanation for when the dataset is uploaded.
    md5_checksum : str, optional
        MD5 checksum to check if the dataset is downloaded without corruption.
    data_file : str, optional
        Path to where the dataset is located.
    features_file : dict, optional
        A dictionary of dataset features,
        which maps a feature index to an OpenMLDataFeature.
    qualities_file : dict, optional
        A dictionary of dataset qualities,
        which maps a quality name to a quality value.
    dataset: string, optional
        Serialized arff dataset string.
    parquet_url: string, optional
        This is the URL to the storage location where the dataset files are hosted.
        This can be a MinIO bucket URL. If specified, the data will be accessed
        from this URL when reading the files.
    parquet_file: string, optional
        Path to the local file.
    """

    def __init__(  # noqa: C901, PLR0912, PLR0913, PLR0915
        self,
        name: str,
        description: str | None,
        data_format: Literal["arff", "sparse_arff"] = "arff",
        cache_format: Literal["feather", "pickle"] = "pickle",
        dataset_id: int | None = None,
        version: int | None = None,
        creator: str | None = None,
        contributor: str | None = None,
        collection_date: str | None = None,
        upload_date: str | None = None,
        language: str | None = None,
        licence: str | None = None,
        url: str | None = None,
        default_target_attribute: str | None = None,
        row_id_attribute: str | None = None,
        ignore_attribute: str | list[str] | None = None,
        version_label: str | None = None,
        citation: str | None = None,
        tag: str | None = None,
        visibility: str | None = None,
        original_data_url: str | None = None,
        paper_url: str | None = None,
        update_comment: str | None = None,
        md5_checksum: str | None = None,
        data_file: str | None = None,
        features_file: str | None = None,
        qualities_file: str | None = None,
        dataset: str | None = None,
        parquet_url: str | None = None,
        parquet_file: str | None = None,
    ):
        if cache_format not in ["feather", "pickle"]:
            raise ValueError(
                "cache_format must be one of 'feather' or 'pickle. "
                f"Invalid format specified: {cache_format}",
            )

        def find_invalid_characters(string: str, pattern: str) -> str:
            invalid_chars = set()
            regex = re.compile(pattern)
            for char in string:
                if not regex.match(char):
                    invalid_chars.add(char)
            return ",".join(
                [f"'{char}'" if char != "'" else f'"{char}"' for char in invalid_chars],
            )

        if dataset_id is None:
            pattern = "^[\x00-\x7f]*$"
            if description and not re.match(pattern, description):
                # not basiclatin (XSD complains)
                invalid_characters = find_invalid_characters(description, pattern)
                raise ValueError(
                    f"Invalid symbols {invalid_characters} in description: {description}",
                )
            pattern = "^[\x00-\x7f]*$"
            if citation and not re.match(pattern, citation):
                # not basiclatin (XSD complains)
                invalid_characters = find_invalid_characters(citation, pattern)
                raise ValueError(
                    f"Invalid symbols {invalid_characters} in citation: {citation}",
                )
            pattern = "^[a-zA-Z0-9_\\-\\.\\(\\),]+$"
            if not re.match(pattern, name):
                # regex given by server in error message
                invalid_characters = find_invalid_characters(name, pattern)
                raise ValueError(f"Invalid symbols {invalid_characters} in name: {name}")

        self.ignore_attribute: list[str] | None = None
        if isinstance(ignore_attribute, str):
            self.ignore_attribute = [ignore_attribute]
        elif isinstance(ignore_attribute, list) or ignore_attribute is None:
            self.ignore_attribute = ignore_attribute
        else:
            raise ValueError("Wrong data type for ignore_attribute. Should be list.")

        # TODO add function to check if the name is casual_string128
        # Attributes received by querying the RESTful API
        self.dataset_id = int(dataset_id) if dataset_id is not None else None
        self.name = name
        self.version = int(version) if version is not None else None
        self.description = description
        self.cache_format = cache_format
        # Has to be called format, otherwise there will be an XML upload error
        self.format = data_format
        self.creator = creator
        self.contributor = contributor
        self.collection_date = collection_date
        self.upload_date = upload_date
        self.language = language
        self.licence = licence
        self.url = url
        self.default_target_attribute = default_target_attribute
        self.row_id_attribute = row_id_attribute

        self.version_label = version_label
        self.citation = citation
        self.tag = tag
        self.visibility = visibility
        self.original_data_url = original_data_url
        self.paper_url = paper_url
        self.update_comment = update_comment
        self.md5_checksum = md5_checksum
        self.data_file = data_file
        self.parquet_file = parquet_file
        self._dataset = dataset
        self._parquet_url = parquet_url

        self._features: dict[int, OpenMLDataFeature] | None = None
        self._qualities: dict[str, float] | None = None
        self._no_qualities_found = False

        if features_file is not None:
            self._features = _read_features(Path(features_file))

        # "" was the old default value by `get_dataset` and maybe still used by some
        if qualities_file == "":
            # TODO(0.15): to switch to "qualities_file is not None" below and remove warning
            warnings.warn(
                "Starting from Version 0.15 `qualities_file` must be None and not an empty string "
                "to avoid reading the qualities from file. Set `qualities_file` to None to avoid "
                "this warning.",
                FutureWarning,
                stacklevel=2,
            )
            qualities_file = None

        if qualities_file is not None:
            self._qualities = _read_qualities(Path(qualities_file))

        if data_file is not None:
            data_pickle, data_feather, feather_attribute = self._compressed_cache_file_paths(
                Path(data_file)
            )
            self.data_pickle_file = data_pickle if Path(data_pickle).exists() else None
            self.data_feather_file = data_feather if Path(data_feather).exists() else None
            self.feather_attribute_file = (
                feather_attribute if Path(feather_attribute).exists() else None
            )
        else:
            self.data_pickle_file = None
            self.data_feather_file = None
            self.feather_attribute_file = None

    @property
    def features(self) -> dict[int, OpenMLDataFeature]:
        """Get the features of this dataset."""
        if self._features is None:
            # TODO(eddiebergman): These should return a value so we can set it to be not None
            self._load_features()

        assert self._features is not None
        return self._features

    @property
    def qualities(self) -> dict[str, float] | None:
        """Get the qualities of this dataset."""
        # TODO(eddiebergman): Better docstring, I don't know what qualities means

        # We have to check `_no_qualities_found` as there might not be qualities for a dataset
        if self._qualities is None and (not self._no_qualities_found):
            self._load_qualities()

        return self._qualities

    @property
    def id(self) -> int | None:
        """Get the dataset numeric id."""
        return self.dataset_id

    def _get_repr_body_fields(self) -> Sequence[tuple[str, str | int | None]]:
        """Collect all information to display in the __repr__ body."""
        # Obtain number of features in accordance with lazy loading.
        n_features: int | None = None
        if self._qualities is not None and self._qualities["NumberOfFeatures"] is not None:
            n_features = int(self._qualities["NumberOfFeatures"])
        elif self._features is not None:
            n_features = len(self._features)

        fields: dict[str, int | str | None] = {
            "Name": self.name,
            "Version": self.version,
            "Format": self.format,
            "Licence": self.licence,
            "Download URL": self.url,
            "Data file": str(self.data_file) if self.data_file is not None else None,
            "Pickle file": (
                str(self.data_pickle_file) if self.data_pickle_file is not None else None
            ),
            "# of features": n_features,
        }
        if self.upload_date is not None:
            fields["Upload Date"] = self.upload_date.replace("T", " ")
        if self.dataset_id is not None:
            fields["OpenML URL"] = self.openml_url
        if self._qualities is not None and self._qualities["NumberOfInstances"] is not None:
            fields["# of instances"] = int(self._qualities["NumberOfInstances"])

        # determines the order in which the information will be printed
        order = [
            "Name",
            "Version",
            "Format",
            "Upload Date",
            "Licence",
            "Download URL",
            "OpenML URL",
            "Data File",
            "Pickle File",
            "# of features",
            "# of instances",
        ]
        return [(key, fields[key]) for key in order if key in fields]

    def __eq__(self, other: Any) -> bool:
        if not isinstance(other, OpenMLDataset):
            return False

        server_fields = {
            "dataset_id",
            "version",
            "upload_date",
            "url",
            "_parquet_url",
            "dataset",
            "data_file",
            "format",
            "cache_format",
        }

        cache_fields = {
            "_dataset",
            "data_file",
            "data_pickle_file",
            "data_feather_file",
            "feather_attribute_file",
            "parquet_file",
        }

        # check that common keys and values are identical
        ignore_fields = server_fields | cache_fields
        self_keys = set(self.__dict__.keys()) - ignore_fields
        other_keys = set(other.__dict__.keys()) - ignore_fields
        return self_keys == other_keys and all(
            self.__dict__[key] == other.__dict__[key] for key in self_keys
        )

    def _download_data(self) -> None:
        """Download ARFF data file to standard cache directory. Set `self.data_file`."""
        # import required here to avoid circular import.
        from .functions import _get_dataset_arff, _get_dataset_parquet

        skip_parquet = os.environ.get(OPENML_SKIP_PARQUET_ENV_VAR, "false").casefold() == "true"
        if self._parquet_url is not None and not skip_parquet:
            parquet_file = _get_dataset_parquet(self)
            self.parquet_file = None if parquet_file is None else str(parquet_file)
        if self.parquet_file is None:
            self.data_file = str(_get_dataset_arff(self))

    def _get_arff(self, format: str) -> dict:  # noqa: A002
        """Read ARFF file and return decoded arff.

        Reads the file referenced in self.data_file.

        Parameters
        ----------
        format : str
            Format of the ARFF file.
            Must be one of 'arff' or 'sparse_arff' or a string that will be either of those
            when converted to lower case.



        Returns
        -------
        dict
            Decoded arff.

        """
        # TODO: add a partial read method which only returns the attribute
        # headers of the corresponding .arff file!
        import struct

        filename = self.data_file
        assert filename is not None
        filepath = Path(filename)

        bits = 8 * struct.calcsize("P")

        # Files can be considered too large on a 32-bit system,
        # if it exceeds 120mb (slightly more than covtype dataset size)
        # This number is somewhat arbitrary.
        if bits != 64:
            MB_120 = 120_000_000
            file_size = filepath.stat().st_size
            if file_size > MB_120:
                raise NotImplementedError(
                    f"File {filename} too big for {file_size}-bit system ({bits} bytes).",
                )

        if format.lower() == "arff":
            return_type = arff.DENSE
        elif format.lower() == "sparse_arff":
            return_type = arff.COO
        else:
            raise ValueError(f"Unknown data format {format}")

        def decode_arff(fh: Any) -> dict:
            decoder = arff.ArffDecoder()
            return decoder.decode(fh, encode_nominal=True, return_type=return_type)  # type: ignore

        if filepath.suffix.endswith(".gz"):
            with gzip.open(filename) as zipfile:
                return decode_arff(zipfile)
        else:
            with filepath.open(encoding="utf8") as fh:
                return decode_arff(fh)

    def _parse_data_from_arff(  # noqa: C901, PLR0912, PLR0915
        self,
        arff_file_path: Path,
    ) -> tuple[pd.DataFrame | scipy.sparse.csr_matrix, list[bool], list[str]]:
        """Parse all required data from arff file.

        Parameters
        ----------
        arff_file_path : str
            Path to the file on disk.

        Returns
        -------
        Tuple[Union[pd.DataFrame, scipy.sparse.csr_matrix], List[bool], List[str]]
            DataFrame or csr_matrix: dataset
            List[bool]: List indicating which columns contain categorical variables.
            List[str]: List of column names.
        """
        try:
            data = self._get_arff(self.format)
        except OSError as e:
            logger.critical(
                f"Please check that the data file {arff_file_path} is there and can be read.",
            )
            raise e

        ARFF_DTYPES_TO_PD_DTYPE = {
            "INTEGER": "integer",
            "REAL": "floating",
            "NUMERIC": "floating",
            "STRING": "string",
        }
        attribute_dtype = {}
        attribute_names = []
        categories_names = {}
        categorical = []
        for name, type_ in data["attributes"]:
            # if the feature is nominal and a sparse matrix is
            # requested, the categories need to be numeric
            if isinstance(type_, list) and self.format.lower() == "sparse_arff":
                try:
                    # checks if the strings which should be the class labels
                    # can be encoded into integers
                    pd.factorize(type_)[0]
                except ValueError as e:
                    raise ValueError(
                        "Categorical data needs to be numeric when using sparse ARFF."
                    ) from e

            # string can only be supported with pandas DataFrame
            elif type_ == "STRING" and self.format.lower() == "sparse_arff":
                raise ValueError("Dataset containing strings is not supported with sparse ARFF.")

            # infer the dtype from the ARFF header
            if isinstance(type_, list):
                categorical.append(True)
                categories_names[name] = type_
                if len(type_) == 2:
                    type_norm = [cat.lower().capitalize() for cat in type_]
                    if {"True", "False"} == set(type_norm):
                        categories_names[name] = [cat == "True" for cat in type_norm]
                        attribute_dtype[name] = "boolean"
                    else:
                        attribute_dtype[name] = "categorical"
                else:
                    attribute_dtype[name] = "categorical"
            else:
                categorical.append(False)
                attribute_dtype[name] = ARFF_DTYPES_TO_PD_DTYPE[type_]
            attribute_names.append(name)

        if self.format.lower() == "sparse_arff":
            X = data["data"]
            X_shape = (max(X[1]) + 1, max(X[2]) + 1)
            X = scipy.sparse.coo_matrix((X[0], (X[1], X[2])), shape=X_shape, dtype=np.float32)
            X = X.tocsr()
        elif self.format.lower() == "arff":
            X = pd.DataFrame(data["data"], columns=attribute_names)

            col = []
            for column_name in X.columns:
                if attribute_dtype[column_name] in ("categorical", "boolean"):
                    categories = self._unpack_categories(
                        X[column_name],  # type: ignore
                        categories_names[column_name],
                    )
                    col.append(categories)
                elif attribute_dtype[column_name] in ("floating", "integer"):
                    X_col = X[column_name]
                    if X_col.min() >= 0 and X_col.max() <= 255:
                        try:
                            X_col_uint = X_col.astype("uint8")
                            if (X_col == X_col_uint).all():
                                col.append(X_col_uint)
                                continue
                        except ValueError:
                            pass
                    col.append(X[column_name])
                else:
                    col.append(X[column_name])
            X = pd.concat(col, axis=1)
        else:
            raise ValueError(f"Dataset format '{self.format}' is not a valid format.")

        return X, categorical, attribute_names  # type: ignore

    def _compressed_cache_file_paths(self, data_file: Path) -> tuple[Path, Path, Path]:
        data_pickle_file = data_file.with_suffix(".pkl.py3")
        data_feather_file = data_file.with_suffix(".feather")
        feather_attribute_file = data_file.with_suffix(".feather.attributes.pkl.py3")
        return data_pickle_file, data_feather_file, feather_attribute_file

    def _cache_compressed_file_from_file(
        self,
        data_file: Path,
    ) -> tuple[pd.DataFrame | scipy.sparse.csr_matrix, list[bool], list[str]]:
        """Store data from the local file in compressed format.

        If a local parquet file is present it will be used instead of the arff file.
        Sets cache_format to 'pickle' if data is sparse.
        """
        (
            data_pickle_file,
            data_feather_file,
            feather_attribute_file,
        ) = self._compressed_cache_file_paths(data_file)

        attribute_names, categorical, data = self._parse_data_from_file(data_file)

        # Feather format does not work for sparse datasets, so we use pickle for sparse datasets
        if scipy.sparse.issparse(data):
            self.cache_format = "pickle"

        logger.info(f"{self.cache_format} write {self.name}")
        if self.cache_format == "feather":
            assert isinstance(data, pd.DataFrame)

            data.to_feather(data_feather_file)
            with open(feather_attribute_file, "wb") as fh:  # noqa: PTH123
                pickle.dump((categorical, attribute_names), fh, pickle.HIGHEST_PROTOCOL)
            self.data_feather_file = data_feather_file
            self.feather_attribute_file = feather_attribute_file

        else:
            with open(data_pickle_file, "wb") as fh:  # noqa: PTH123
                pickle.dump((data, categorical, attribute_names), fh, pickle.HIGHEST_PROTOCOL)
            self.data_pickle_file = data_pickle_file

        data_file = data_pickle_file if self.cache_format == "pickle" else data_feather_file
        logger.debug(f"Saved dataset {int(self.dataset_id or -1)}: {self.name} to file {data_file}")

        return data, categorical, attribute_names

    def _parse_data_from_file(
        self,
        data_file: Path,
    ) -> tuple[list[str], list[bool], pd.DataFrame | scipy.sparse.csr_matrix]:
        if data_file.suffix == ".arff":
            data, categorical, attribute_names = self._parse_data_from_arff(data_file)
        elif data_file.suffix == ".pq":
            attribute_names, categorical, data = self._parse_data_from_pq(data_file)
        else:
            raise ValueError(f"Unknown file type for file '{data_file}'.")

        return attribute_names, categorical, data

    def _parse_data_from_pq(self, data_file: Path) -> tuple[list[str], list[bool], pd.DataFrame]:
        try:
            data = pd.read_parquet(data_file)
        except Exception as e:
            raise Exception(f"File: {data_file}") from e
        categorical = [data[c].dtype.name == "category" for c in data.columns]
        attribute_names = list(data.columns)
        return attribute_names, categorical, data

    def _load_data(self) -> tuple[pd.DataFrame, list[bool], list[str]]:  # noqa: PLR0912, C901, PLR0915
        """Load data from compressed format or arff. Download data if not present on disk."""
        need_to_create_pickle = self.cache_format == "pickle" and self.data_pickle_file is None
        need_to_create_feather = self.cache_format == "feather" and self.data_feather_file is None

        if need_to_create_pickle or need_to_create_feather:
            if self.data_file is None:
                self._download_data()

            file_to_load = self.data_file if self.parquet_file is None else self.parquet_file
            assert file_to_load is not None
            data, cats, attrs = self._cache_compressed_file_from_file(Path(file_to_load))
            return _ensure_dataframe(data, attrs), cats, attrs

        # helper variable to help identify where errors occur
        fpath = self.data_feather_file if self.cache_format == "feather" else self.data_pickle_file
        logger.info(f"{self.cache_format} load data {self.name}")
        try:
            if self.cache_format == "feather":
                assert self.data_feather_file is not None
                assert self.feather_attribute_file is not None

                data = pd.read_feather(self.data_feather_file)
                fpath = self.feather_attribute_file
                with self.feather_attribute_file.open("rb") as fh:
                    categorical, attribute_names = pickle.load(fh)  # noqa: S301
            else:
                assert self.data_pickle_file is not None
                with self.data_pickle_file.open("rb") as fh:
                    data, categorical, attribute_names = pickle.load(fh)  # noqa: S301

        except FileNotFoundError as e:
            raise ValueError(
                f"Cannot find file for dataset {self.name} at location '{fpath}'."
            ) from e
        except (EOFError, ModuleNotFoundError, ValueError, AttributeError) as e:
            error_message = getattr(e, "message", e.args[0])
            hint = ""

            if isinstance(e, EOFError):
                readable_error = "Detected a corrupt cache file"
            elif isinstance(e, (ModuleNotFoundError, AttributeError)):
                readable_error = "Detected likely dependency issues"
                hint = (
                    "This can happen if the cache was constructed with a different pandas version "
                    "than the one that is used to load the data. See also "
                )
                if isinstance(e, ModuleNotFoundError):
                    hint += "https://github.com/openml/openml-python/issues/918. "
                elif isinstance(e, AttributeError):
                    hint += "https://github.com/openml/openml-python/pull/1121. "

            elif isinstance(e, ValueError) and "unsupported pickle protocol" in e.args[0]:
                readable_error = "Encountered unsupported pickle protocol"
            else:
                raise e

            logger.warning(
                f"{readable_error} when loading dataset {self.id} from '{fpath}'. "
                f"{hint}"
                f"Error message was: {error_message}. "
                "We will continue loading data from the arff-file, "
                "but this will be much slower for big datasets. "
                "Please manually delete the cache file if you want OpenML-Python "
                "to attempt to reconstruct it.",
            )
            file_to_load = self.data_file if self.parquet_file is None else self.parquet_file
            assert file_to_load is not None
            attr, cat, df = self._parse_data_from_file(Path(file_to_load))
            return _ensure_dataframe(df, attr), cat, attr

        data_up_to_date = isinstance(data, pd.DataFrame) or scipy.sparse.issparse(data)
        if self.cache_format == "pickle" and not data_up_to_date:
            logger.info("Updating outdated pickle file.")
            file_to_load = self.data_file if self.parquet_file is None else self.parquet_file
            assert file_to_load is not None

            data, cats, attrs = self._cache_compressed_file_from_file(Path(file_to_load))

        return _ensure_dataframe(data, attribute_names), categorical, attribute_names

    @staticmethod
    def _unpack_categories(series: pd.Series, categories: list) -> pd.Series:
        # nan-likes can not be explicitly specified as a category
        def valid_category(cat: Any) -> bool:
            return isinstance(cat, str) or (cat is not None and not np.isnan(cat))

        filtered_categories = [c for c in categories if valid_category(c)]
        col = []
        for x in series:
            try:
                col.append(categories[int(x)])
            except (TypeError, ValueError):
                col.append(np.nan)

        # We require two lines to create a series of categories as detailed here:
        # https://pandas.pydata.org/pandas-docs/version/0.24/user_guide/categorical.html#series-creation
        raw_cat = pd.Categorical(col, ordered=True, categories=filtered_categories)
        return pd.Series(raw_cat, index=series.index, name=series.name)

    def get_data(  # noqa: C901
        self,
        target: list[str] | str | None = None,
        include_row_id: bool = False,  # noqa: FBT001, FBT002
        include_ignore_attribute: bool = False,  # noqa: FBT001, FBT002
    ) -> tuple[pd.DataFrame, pd.Series | None, list[bool], list[str]]:
        """Returns dataset content as dataframes.

        Parameters
        ----------
        target : string, List[str] or None (default=None)
            Name of target column to separate from the data.
            Splitting multiple columns is currently not supported.
        include_row_id : boolean (default=False)
            Whether to include row ids in the returned dataset.
        include_ignore_attribute : boolean (default=False)
            Whether to include columns that are marked as "ignore"
            on the server in the dataset.


        Returns
        -------
        X : dataframe, shape (n_samples, n_columns)
            Dataset, may have sparse dtypes in the columns if required.
        y : pd.Series, shape (n_samples, ) or None
            Target column
        categorical_indicator : list[bool]
            Mask that indicates categorical features.
        attribute_names : list[str]
            List of attribute names.
        """
        data, categorical_mask, attribute_names = self._load_data()

        to_exclude = []
        if not include_row_id and self.row_id_attribute is not None:
            if isinstance(self.row_id_attribute, str):
                to_exclude.append(self.row_id_attribute)
            elif isinstance(self.row_id_attribute, Iterable):
                to_exclude.extend(self.row_id_attribute)

        if not include_ignore_attribute and self.ignore_attribute is not None:
            if isinstance(self.ignore_attribute, str):
                to_exclude.append(self.ignore_attribute)
            elif isinstance(self.ignore_attribute, Iterable):
                to_exclude.extend(self.ignore_attribute)

        if len(to_exclude) > 0:
            logger.info(f"Going to remove the following attributes: {to_exclude}")
            keep = np.array([column not in to_exclude for column in attribute_names])
            data = data.drop(columns=to_exclude)
            categorical_mask = [cat for cat, k in zip(categorical_mask, keep) if k]
            attribute_names = [att for att, k in zip(attribute_names, keep) if k]

        if target is None:
            return data, None, categorical_mask, attribute_names

        if isinstance(target, str):
            target_names = target.split(",") if "," in target else [target]
        else:
            target_names = target

        # All the assumptions below for the target are dependent on the number of targets being 1
        n_targets = len(target_names)
        if n_targets > 1:
            raise NotImplementedError(f"Number of targets {n_targets} not implemented.")

        target_name = target_names[0]
        x = data.drop(columns=[target_name])
        y = data[target_name].squeeze()

        # Finally, remove the target from the list of attributes and categorical mask
        target_index = attribute_names.index(target_name)
        categorical_mask.pop(target_index)
        attribute_names.remove(target_name)

        assert isinstance(y, pd.Series)
        return x, y, categorical_mask, attribute_names

    def _load_features(self) -> None:
        """Load the features metadata from the server and store it in the dataset object."""
        # Delayed Import to avoid circular imports or having to import all of dataset.functions to
        # import OpenMLDataset.
        from openml.datasets.functions import _get_dataset_features_file

        if self.dataset_id is None:
            raise ValueError(
                "No dataset id specified. Please set the dataset id. Otherwise we cannot load "
                "metadata.",
            )

        features_file = _get_dataset_features_file(None, self.dataset_id)
        self._features = _read_features(features_file)

    def _load_qualities(self) -> None:
        """Load qualities information from the server and store it in the dataset object."""
        # same reason as above for _load_features
        from openml.datasets.functions import _get_dataset_qualities_file

        if self.dataset_id is None:
            raise ValueError(
                "No dataset id specified. Please set the dataset id. Otherwise we cannot load "
                "metadata.",
            )

        qualities_file = _get_dataset_qualities_file(None, self.dataset_id)

        if qualities_file is None:
            self._no_qualities_found = True
        else:
            self._qualities = _read_qualities(qualities_file)

    def retrieve_class_labels(self, target_name: str = "class") -> None | list[str]:
        """Reads the datasets arff to determine the class-labels.

        If the task has no class labels (for example a regression problem)
        it returns None. Necessary because the data returned by get_data
        only contains the indices of the classes, while OpenML needs the real
        classname when uploading the results of a run.

        Parameters
        ----------
        target_name : str
            Name of the target attribute

        Returns
        -------
        list
        """
        for feature in self.features.values():
            if feature.name == target_name:
                if feature.data_type == "nominal":
                    return feature.nominal_values

                if feature.data_type == "string":
                    # Rel.: #1311
                    # The target is invalid for a classification task if the feature type is string
                    # and not nominal. For such misconfigured tasks, we silently fix it here, as
                    # we can safely interpret string as nominal.
                    df, *_ = self.get_data()
                    return list(df[feature.name].unique())

        return None

    def get_features_by_type(  # noqa: C901
        self,
        data_type: str,
        exclude: list[str] | None = None,
        exclude_ignore_attribute: bool = True,  # noqa: FBT002, FBT001
        exclude_row_id_attribute: bool = True,  # noqa: FBT002, FBT001
    ) -> list[int]:
        """
        Return indices of features of a given type, e.g. all nominal features.
        Optional parameters to exclude various features by index or ontology.

        Parameters
        ----------
        data_type : str
            The data type to return (e.g., nominal, numeric, date, string)
        exclude : list[str]
            List of column names to exclude from the return value
        exclude_ignore_attribute : bool
            Whether to exclude the defined ignore attributes (and adapt the
            return values as if these indices are not present)
        exclude_row_id_attribute : bool
            Whether to exclude the defined row id attributes (and adapt the
            return values as if these indices are not present)

        Returns
        -------
        result : list
            a list of indices that have the specified data type
        """
        if data_type not in OpenMLDataFeature.LEGAL_DATA_TYPES:
            raise TypeError("Illegal feature type requested")
        if self.ignore_attribute is not None and not isinstance(self.ignore_attribute, list):
            raise TypeError("ignore_attribute should be a list")
        if self.row_id_attribute is not None and not isinstance(self.row_id_attribute, str):
            raise TypeError("row id attribute should be a str")
        if exclude is not None and not isinstance(exclude, list):
            raise TypeError("Exclude should be a list")
            # assert all(isinstance(elem, str) for elem in exclude),
            #            "Exclude should be a list of strings"
        to_exclude = []
        if exclude is not None:
            to_exclude.extend(exclude)
        if exclude_ignore_attribute and self.ignore_attribute is not None:
            to_exclude.extend(self.ignore_attribute)
        if exclude_row_id_attribute and self.row_id_attribute is not None:
            to_exclude.append(self.row_id_attribute)

        result = []
        offset = 0
        # this function assumes that everything in to_exclude will
        # be 'excluded' from the dataset (hence the offset)
        for idx in self.features:
            name = self.features[idx].name
            if name in to_exclude:
                offset += 1
            elif self.features[idx].data_type == data_type:
                result.append(idx - offset)
        return result

    def _get_file_elements(self) -> dict:
        """Adds the 'dataset' to file elements."""
        file_elements: dict = {}
        path = None if self.data_file is None else Path(self.data_file).absolute()

        if self._dataset is not None:
            file_elements["dataset"] = self._dataset
        elif path is not None and path.exists():
            with path.open("rb") as fp:
                file_elements["dataset"] = fp.read()

            try:
                dataset_utf8 = str(file_elements["dataset"], encoding="utf8")
                arff.ArffDecoder().decode(dataset_utf8, encode_nominal=True)
            except arff.ArffException as e:
                raise ValueError("The file you have provided is not a valid arff file.") from e

        elif self.url is None:
            raise ValueError("No valid url/path to the data file was given.")
        return file_elements

    def _parse_publish_response(self, xml_response: dict) -> None:
        """Parse the id from the xml_response and assign it to self."""
        self.dataset_id = int(xml_response["oml:upload_data_set"]["oml:id"])

    def _to_dict(self) -> dict[str, dict]:
        """Creates a dictionary representation of self."""
        props = [
            "id",
            "name",
            "version",
            "description",
            "format",
            "creator",
            "contributor",
            "collection_date",
            "upload_date",
            "language",
            "licence",
            "url",
            "default_target_attribute",
            "row_id_attribute",
            "ignore_attribute",
            "version_label",
            "citation",
            "tag",
            "visibility",
            "original_data_url",
            "paper_url",
            "update_comment",
            "md5_checksum",
        ]

        prop_values = {}
        for prop in props:
            content = getattr(self, prop, None)
            if content is not None:
                prop_values["oml:" + prop] = content

        return {
            "oml:data_set_description": {
                "@xmlns:oml": "http://openml.org/openml",
                **prop_values,
            }
        }

features: dict[int, OpenMLDataFeature] property

Get the features of this dataset.

id: int | None property

Get the dataset numeric id.

qualities: dict[str, float] | None property

Get the qualities of this dataset.
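
Both features and qualities are loaded lazily from the server on first access (see _load_features and _load_qualities in the source above), so the sketch below assumes a dataset object with a server-assigned id and network access:

# Assumes `dataset` was obtained earlier, e.g. via openml.datasets.get_dataset(...).
print(dataset.id)                          # numeric dataset id, or None if unpublished

first_feature = dataset.features[0]        # first access triggers a server call
print(first_feature.name, first_feature.data_type)

qualities = dataset.qualities              # may be None if no qualities are available
if qualities is not None:
    print(qualities.get("NumberOfInstances"))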

get_data(target=None, include_row_id=False, include_ignore_attribute=False)

Returns dataset content as dataframes.

Parameters:

target : str, List[str] or None (default: None)
    Name of the target column to separate from the data. Splitting multiple columns is currently not supported.
include_row_id : bool (default: False)
    Whether to include row ids in the returned dataset.
include_ignore_attribute : bool (default: False)
    Whether to include columns that are marked as "ignore" on the server in the dataset.

Returns:

X : dataframe, shape (n_samples, n_columns)
    Dataset; may have sparse dtypes in the columns if required.
y : pd.Series, shape (n_samples,) or None
    Target column.
categorical_indicator : list[bool]
    Mask that indicates categorical features.
attribute_names : list[str]
    List of attribute names.
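
A usage sketch, assuming dataset is an already fetched OpenMLDataset:

# Full frame: no target split, row-id and "ignore" columns dropped (the defaults).
df, _, categorical_indicator, attribute_names = dataset.get_data()

# Keep the row-id and server-side "ignore" columns instead of dropping them.
df_all, _, _, _ = dataset.get_data(include_row_id=True, include_ignore_attribute=True)

# Split off a single target column; multiple targets raise NotImplementedError.
X, y, categorical_indicator, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute,
)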


get_features_by_type(data_type, exclude=None, exclude_ignore_attribute=True, exclude_row_id_attribute=True)

Return indices of features of a given type, e.g. all nominal features. Optional parameters to exclude various features by index or ontology.

Parameters:

data_type : str (required)
    The data type to return (e.g., nominal, numeric, date, string).
exclude : list[str] (default: None)
    List of column names to exclude from the return value.
exclude_ignore_attribute : bool (default: True)
    Whether to exclude the defined ignore attributes (and adapt the return values as if these indices are not present).
exclude_row_id_attribute : bool (default: True)
    Whether to exclude the defined row id attributes (and adapt the return values as if these indices are not present).

Returns:

result : list
    A list of indices that have the specified data type.
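
A short sketch, assuming dataset has already been fetched so that its feature metadata is available:

# Indices of nominal columns, with ignore/row-id attributes excluded (the defaults).
nominal_idx = dataset.get_features_by_type("nominal")

# Numeric columns, but keep the ignore attributes in the index numbering.
numeric_idx = dataset.get_features_by_type("numeric", exclude_ignore_attribute=False)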


retrieve_class_labels(target_name='class')

Reads the dataset's arff to determine the class-labels.

If the task has no class labels (for example a regression problem) it returns None. Necessary because the data returned by get_data only contains the indices of the classes, while OpenML needs the real classname when uploading the results of a run.

Parameters:

target_name : str (default: 'class')
    Name of the target attribute.

Returns:

list | None
    The class labels if the target is nominal or string, otherwise None.
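
For example, on a classification dataset whose target column is named 'class' (the default here, not guaranteed for every dataset):

labels = dataset.retrieve_class_labels(target_name="class")
if labels is None:
    print("No nominal/string target named 'class'; possibly a regression dataset.")
else:
    print(f"{len(labels)} class labels: {labels}")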