Skip to content

openml

openml #

The OpenML module implements a python interface to OpenML <https://www.openml.org>_, a collaborative platform for machine learning. OpenML can be used to

  • store, download and analyze datasets
  • make experiments and their results (e.g. models, predictions) accessible and reproducible for everybody
  • analyze experiments (uploaded by you and other collaborators) and conduct meta studies

In particular, this module implements a python interface for the OpenML REST API <https://www.openml.org/guide#!rest_services> (REST on wikipedia <https://en.wikipedia.org/wiki/Representational_state_transfer>).

OpenMLBenchmarkSuite #

OpenMLBenchmarkSuite(suite_id: int | None, alias: str | None, name: str, description: str, status: str | None, creation_date: str | None, creator: int | None, tags: list[dict] | None, data: list[int] | None, tasks: list[int] | None)

Bases: BaseStudy

An OpenMLBenchmarkSuite represents the OpenML concept of a suite (a collection of tasks).

It contains the following information: name, id, description, creation date, creator id and the task ids.

Based on this list of task ids, the suite object also receives the ids of the associated OpenML datasets.

Parameters#

suite_id : int the study id alias : str (optional) a string ID, unique on server (url-friendly) main_entity_type : str the entity type (e.g., task, run) that is core in this study. only entities of this type can be added explicitly name : str the name of the study (meta-info) description : str brief description (meta-info) status : str Whether the study is in preparation, active or deactivated creation_date : str date of creation (meta-info) creator : int openml user id of the owner / creator tags : list(dict) The list of tags shows which tags are associated with the study. Each tag is a dict of (tag) name, window_start and write_access. data : list a list of data ids associated with this study tasks : list a list of task ids associated with this study

Source code in openml/study/study.py
def __init__(  # noqa: PLR0913
    self,
    suite_id: int | None,
    alias: str | None,
    name: str,
    description: str,
    status: str | None,
    creation_date: str | None,
    creator: int | None,
    tags: list[dict] | None,
    data: list[int] | None,
    tasks: list[int] | None,
):
    """Create a benchmark suite: a study whose core entity type is "task".

    All arguments are forwarded to ``BaseStudy``; ``main_entity_type`` is
    fixed to ``"task"``, and the run-related collections (flows, runs,
    setups) are left as ``None`` since they do not apply to suites.
    """
    super().__init__(
        study_id=suite_id,
        alias=alias,
        main_entity_type="task",
        benchmark_suite=None,
        name=name,
        description=description,
        status=status,
        creation_date=creation_date,
        creator=creator,
        tags=tags,
        data=data,
        tasks=tasks,
        flows=None,
        runs=None,
        setups=None,
    )

id property #

id: int | None

Return the id of the study.

openml_url property #

openml_url: str | None

The URL of the object on the server, if it was uploaded, else None.

open_in_browser #

open_in_browser() -> None

Opens the OpenML web page corresponding to this object in your default browser.

Source code in openml/base.py
def open_in_browser(self) -> None:
    """Open this object's page on the OpenML website in the default browser.

    Raises
    ------
    ValueError
        If the object has no server URL, i.e. it was never uploaded.
    """
    url = self.openml_url
    if url is None:
        raise ValueError(
            "Cannot open element on OpenML.org when attribute `openml_url` is `None`",
        )

    webbrowser.open(url)

publish #

publish() -> OpenMLBase

Publish the object on the OpenML server.

Source code in openml/base.py
def publish(self) -> OpenMLBase:
    """Upload this object to the OpenML server and record the response."""
    # Gather the files to upload; make sure an XML description is among them.
    file_elements = self._get_file_elements()
    if "description" not in file_elements:
        file_elements["description"] = self._to_xml()

    # POST to the REST endpoint for this entity type (e.g. "data/").
    endpoint = "{}/".format(_get_rest_api_type_alias(self))
    response_text = openml._api_calls._perform_api_call(
        endpoint,
        "post",
        file_elements=file_elements,
    )

    # Parse the server reply (e.g. the assigned id) into this object.
    self._parse_publish_response(xmltodict.parse(response_text))
    return self

push_tag #

push_tag(tag: str) -> None

Add a tag to the study.

Source code in openml/study/study.py
def push_tag(self, tag: str) -> None:
    """Add a tag to the study.

    The OpenML server has no tagging support for studies, so this
    unconditionally raises.
    """
    raise NotImplementedError("Tags for studies is not (yet) supported.")

remove_tag #

remove_tag(tag: str) -> None

Remove a tag from the study.

Source code in openml/study/study.py
def remove_tag(self, tag: str) -> None:
    """Remove a tag from the study.

    The OpenML server has no tagging support for studies, so this
    unconditionally raises.
    """
    raise NotImplementedError("Tags for studies is not (yet) supported.")

url_for_id classmethod #

url_for_id(id_: int) -> str

Return the OpenML URL for the object of the class entity with the given id.

Source code in openml/base.py
@classmethod
def url_for_id(cls, id_: int) -> str:
    """Return the OpenML URL for the object of the class entity with the given id."""
    # Example: a flow with id 123 lives at openml.org/f/123.
    base_url = openml.config.get_server_base_url()
    return "{}/{}/{}".format(base_url, cls._entity_letter(), id_)

OpenMLClassificationTask #

OpenMLClassificationTask(task_type_id: TaskType, task_type: str, data_set_id: int, target_name: str, estimation_procedure_id: int = 1, estimation_procedure_type: str | None = None, estimation_parameters: dict[str, str] | None = None, evaluation_measure: str | None = None, data_splits_url: str | None = None, task_id: int | None = None, class_labels: list[str] | None = None, cost_matrix: ndarray | None = None)

Bases: OpenMLSupervisedTask

OpenML Classification object.

Parameters#

task_type_id : TaskType ID of the Classification task type. task_type : str Name of the Classification task type. data_set_id : int ID of the OpenML dataset associated with the Classification task. target_name : str Name of the target variable. estimation_procedure_id : int, default=None ID of the estimation procedure for the Classification task. estimation_procedure_type : str, default=None Type of the estimation procedure. estimation_parameters : dict, default=None Estimation parameters for the Classification task. evaluation_measure : str, default=None Name of the evaluation measure. data_splits_url : str, default=None URL of the data splits for the Classification task. task_id : Union[int, None] ID of the Classification task (if it already exists on OpenML). class_labels : List of str, default=None A list of class labels (for classification tasks). cost_matrix : array, default=None A cost matrix (for classification tasks).

Source code in openml/tasks/task.py
def __init__(  # noqa: PLR0913
    self,
    task_type_id: TaskType,
    task_type: str,
    data_set_id: int,
    target_name: str,
    estimation_procedure_id: int = 1,
    estimation_procedure_type: str | None = None,
    estimation_parameters: dict[str, str] | None = None,
    evaluation_measure: str | None = None,
    data_splits_url: str | None = None,
    task_id: int | None = None,
    class_labels: list[str] | None = None,
    cost_matrix: np.ndarray | None = None,
):
    """Create a classification task.

    Supervised-task fields are forwarded to ``OpenMLSupervisedTask``;
    only ``class_labels`` and ``cost_matrix`` are stored locally.
    """
    super().__init__(
        task_id=task_id,
        task_type_id=task_type_id,
        task_type=task_type,
        data_set_id=data_set_id,
        estimation_procedure_id=estimation_procedure_id,
        estimation_procedure_type=estimation_procedure_type,
        estimation_parameters=estimation_parameters,
        evaluation_measure=evaluation_measure,
        target_name=target_name,
        data_splits_url=data_splits_url,
    )
    self.class_labels = class_labels
    self.cost_matrix = cost_matrix

    # Cost matrices are not supported: any non-None value aborts
    # construction (note the attributes above were already assigned,
    # but the half-built object never escapes __init__).
    if cost_matrix is not None:
        raise NotImplementedError("Costmatrix")

estimation_parameters property writable #

estimation_parameters: dict[str, str] | None

Return the estimation parameters for the task.

id property #

id: int | None

Return the OpenML ID of this task.

openml_url property #

openml_url: str | None

The URL of the object on the server, if it was uploaded, else None.

download_split #

download_split() -> OpenMLSplit

Download the OpenML split for a given task.

Source code in openml/tasks/task.py
def download_split(self) -> OpenMLSplit:
    """Return the split for this task, downloading and caching it if needed."""
    # TODO(eddiebergman): Can this ever be `None`?
    assert self.task_id is not None
    cached_split_file = (
        _create_cache_directory_for_id("tasks", self.task_id) / "datasplits.arff"
    )

    try:
        return OpenMLSplit._from_arff_file(cached_split_file)
    except OSError:
        # Cache miss: fetch the split file from the server, then parse it.
        self._download_split(cached_split_file)
        return OpenMLSplit._from_arff_file(cached_split_file)

get_X_and_y #

get_X_and_y() -> tuple[DataFrame, Series | DataFrame | None]

Get data associated with the current task.

Returns#

tuple - X and y

Source code in openml/tasks/task.py
def get_X_and_y(self) -> tuple[pd.DataFrame, pd.Series | pd.DataFrame | None]:
    """Return the feature matrix X and target y for this task.

    Returns
    -------
    tuple - X and y

    """
    dataset = self.get_dataset()
    # Only supervised task types (and learning curves) define a target.
    if self.task_type_id not in {
        TaskType.SUPERVISED_CLASSIFICATION,
        TaskType.SUPERVISED_REGRESSION,
        TaskType.LEARNING_CURVE,
    }:
        raise NotImplementedError(self.task_type)

    features, target, _, _ = dataset.get_data(target=self.target_name)
    return features, target

get_dataset #

get_dataset(**kwargs: Any) -> OpenMLDataset

Download dataset associated with task.

Accepts the same keyword arguments as the openml.datasets.get_dataset.

Source code in openml/tasks/task.py
def get_dataset(self, **kwargs: Any) -> datasets.OpenMLDataset:
    """Download dataset associated with task.

    Accepts the same keyword arguments as the `openml.datasets.get_dataset`.
    """
    # Thin delegation: `self.dataset_id` identifies the dataset on the
    # server; all remaining keyword arguments are forwarded unchanged.
    return datasets.get_dataset(self.dataset_id, **kwargs)

get_split_dimensions #

get_split_dimensions() -> tuple[int, int, int]

Get the (repeats, folds, samples) of the split for a given task.

Source code in openml/tasks/task.py
def get_split_dimensions(self) -> tuple[int, int, int]:
    """Return (repeats, folds, samples) of this task's split."""
    # Lazily fetch the split the first time it is requested.
    if self.split is None:
        self.split = self.download_split()

    split = self.split
    return (split.repeats, split.folds, split.samples)

get_train_test_split_indices #

get_train_test_split_indices(fold: int = 0, repeat: int = 0, sample: int = 0) -> tuple[ndarray, ndarray]

Get the indices of the train and test splits for a given task.

Source code in openml/tasks/task.py
def get_train_test_split_indices(
    self,
    fold: int = 0,
    repeat: int = 0,
    sample: int = 0,
) -> tuple[np.ndarray, np.ndarray]:
    """Return the train/test indices for one (repeat, fold, sample) cell."""
    # Download (and cache) the split lazily on first use.
    if self.split is None:
        self.split = self.download_split()

    return self.split.get(
        repeat=repeat,
        fold=fold,
        sample=sample,
    )

open_in_browser #

open_in_browser() -> None

Opens the OpenML web page corresponding to this object in your default browser.

Source code in openml/base.py
def open_in_browser(self) -> None:
    """Open this object's page on the OpenML website in the default browser.

    Raises
    ------
    ValueError
        If the object has no server URL, i.e. it was never uploaded.
    """
    url = self.openml_url
    if url is None:
        raise ValueError(
            "Cannot open element on OpenML.org when attribute `openml_url` is `None`",
        )

    webbrowser.open(url)

publish #

publish() -> OpenMLBase

Publish the object on the OpenML server.

Source code in openml/base.py
def publish(self) -> OpenMLBase:
    """Upload this object to the OpenML server and record the response."""
    # Gather the files to upload; make sure an XML description is among them.
    file_elements = self._get_file_elements()
    if "description" not in file_elements:
        file_elements["description"] = self._to_xml()

    # POST to the REST endpoint for this entity type (e.g. "data/").
    endpoint = "{}/".format(_get_rest_api_type_alias(self))
    response_text = openml._api_calls._perform_api_call(
        endpoint,
        "post",
        file_elements=file_elements,
    )

    # Parse the server reply (e.g. the assigned id) into this object.
    self._parse_publish_response(xmltodict.parse(response_text))
    return self

push_tag #

push_tag(tag: str) -> None

Annotates this entity with a tag on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def push_tag(self, tag: str) -> None:
    """Annotates this entity with a tag on the server.

    Parameters
    ----------
    tag : str
        Tag to attach to this entity on the OpenML server.
    """
    # Delegates to the shared tagging helper; this performs a server call.
    _tag_openml_base(self, tag)

remove_tag #

remove_tag(tag: str) -> None

Removes a tag from this entity on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def remove_tag(self, tag: str) -> None:
    """Removes a tag from this entity on the server.

    Parameters
    ----------
    tag : str
        Tag to remove from this entity on the OpenML server.
    """
    # Same helper as push_tag; `untag=True` requests removal instead.
    _tag_openml_base(self, tag, untag=True)

url_for_id classmethod #

url_for_id(id_: int) -> str

Return the OpenML URL for the object of the class entity with the given id.

Source code in openml/base.py
@classmethod
def url_for_id(cls, id_: int) -> str:
    """Return the OpenML URL for the object of the class entity with the given id."""
    # Example: a flow with id 123 lives at openml.org/f/123.
    base_url = openml.config.get_server_base_url()
    return "{}/{}/{}".format(base_url, cls._entity_letter(), id_)

OpenMLClusteringTask #

OpenMLClusteringTask(task_type_id: TaskType, task_type: str, data_set_id: int, estimation_procedure_id: int = 17, task_id: int | None = None, estimation_procedure_type: str | None = None, estimation_parameters: dict[str, str] | None = None, data_splits_url: str | None = None, evaluation_measure: str | None = None, target_name: str | None = None)

Bases: OpenMLTask

OpenML Clustering object.

Parameters#

task_type_id : TaskType Task type ID of the OpenML clustering task. task_type : str Task type of the OpenML clustering task. data_set_id : int ID of the OpenML dataset used in clustering the task. estimation_procedure_id : int, default=None ID of the OpenML estimation procedure. task_id : Union[int, None] ID of the OpenML clustering task. estimation_procedure_type : str, default=None Type of the OpenML estimation procedure used in the clustering task. estimation_parameters : dict, default=None Parameters used by the OpenML estimation procedure. data_splits_url : str, default=None URL of the OpenML data splits for the clustering task. evaluation_measure : str, default=None Evaluation measure used in the clustering task. target_name : str, default=None Name of the target feature (class) that is not part of the feature set for the clustering task.

Source code in openml/tasks/task.py
def __init__(  # noqa: PLR0913
    self,
    task_type_id: TaskType,
    task_type: str,
    data_set_id: int,
    estimation_procedure_id: int = 17,
    task_id: int | None = None,
    estimation_procedure_type: str | None = None,
    estimation_parameters: dict[str, str] | None = None,
    data_splits_url: str | None = None,
    evaluation_measure: str | None = None,
    target_name: str | None = None,
):
    """Create a clustering task.

    Shared task fields are forwarded to ``OpenMLTask``; only
    ``target_name`` (a feature excluded from the clustering input,
    if any) is stored locally.
    """
    super().__init__(
        task_id=task_id,
        task_type_id=task_type_id,
        task_type=task_type,
        data_set_id=data_set_id,
        evaluation_measure=evaluation_measure,
        estimation_procedure_id=estimation_procedure_id,
        estimation_procedure_type=estimation_procedure_type,
        estimation_parameters=estimation_parameters,
        data_splits_url=data_splits_url,
    )

    self.target_name = target_name

id property #

id: int | None

Return the OpenML ID of this task.

openml_url property #

openml_url: str | None

The URL of the object on the server, if it was uploaded, else None.

download_split #

download_split() -> OpenMLSplit

Download the OpenML split for a given task.

Source code in openml/tasks/task.py
def download_split(self) -> OpenMLSplit:
    """Return the split for this task, downloading and caching it if needed."""
    # TODO(eddiebergman): Can this ever be `None`?
    assert self.task_id is not None
    cached_split_file = (
        _create_cache_directory_for_id("tasks", self.task_id) / "datasplits.arff"
    )

    try:
        return OpenMLSplit._from_arff_file(cached_split_file)
    except OSError:
        # Cache miss: fetch the split file from the server, then parse it.
        self._download_split(cached_split_file)
        return OpenMLSplit._from_arff_file(cached_split_file)

get_X #

get_X() -> DataFrame

Get data associated with the current task.

Returns#

The X data as a dataframe

Source code in openml/tasks/task.py
def get_X(self) -> pd.DataFrame:
    """Return the feature table for this clustering task.

    Returns
    -------
    The X data as a dataframe
    """
    # Clustering has no target column, so fetch the full feature table only.
    X, *_ = self.get_dataset().get_data(target=None)
    return X

get_dataset #

get_dataset(**kwargs: Any) -> OpenMLDataset

Download dataset associated with task.

Accepts the same keyword arguments as the openml.datasets.get_dataset.

Source code in openml/tasks/task.py
def get_dataset(self, **kwargs: Any) -> datasets.OpenMLDataset:
    """Download dataset associated with task.

    Accepts the same keyword arguments as the `openml.datasets.get_dataset`.
    """
    # Thin delegation: `self.dataset_id` identifies the dataset on the
    # server; all remaining keyword arguments are forwarded unchanged.
    return datasets.get_dataset(self.dataset_id, **kwargs)

get_split_dimensions #

get_split_dimensions() -> tuple[int, int, int]

Get the (repeats, folds, samples) of the split for a given task.

Source code in openml/tasks/task.py
def get_split_dimensions(self) -> tuple[int, int, int]:
    """Return (repeats, folds, samples) of this task's split."""
    # Lazily fetch the split the first time it is requested.
    if self.split is None:
        self.split = self.download_split()

    split = self.split
    return (split.repeats, split.folds, split.samples)

get_train_test_split_indices #

get_train_test_split_indices(fold: int = 0, repeat: int = 0, sample: int = 0) -> tuple[ndarray, ndarray]

Get the indices of the train and test splits for a given task.

Source code in openml/tasks/task.py
def get_train_test_split_indices(
    self,
    fold: int = 0,
    repeat: int = 0,
    sample: int = 0,
) -> tuple[np.ndarray, np.ndarray]:
    """Return the train/test indices for one (repeat, fold, sample) cell."""
    # Download (and cache) the split lazily on first use.
    if self.split is None:
        self.split = self.download_split()

    return self.split.get(
        repeat=repeat,
        fold=fold,
        sample=sample,
    )

open_in_browser #

open_in_browser() -> None

Opens the OpenML web page corresponding to this object in your default browser.

Source code in openml/base.py
def open_in_browser(self) -> None:
    """Open this object's page on the OpenML website in the default browser.

    Raises
    ------
    ValueError
        If the object has no server URL, i.e. it was never uploaded.
    """
    url = self.openml_url
    if url is None:
        raise ValueError(
            "Cannot open element on OpenML.org when attribute `openml_url` is `None`",
        )

    webbrowser.open(url)

publish #

publish() -> OpenMLBase

Publish the object on the OpenML server.

Source code in openml/base.py
def publish(self) -> OpenMLBase:
    """Upload this object to the OpenML server and record the response."""
    # Gather the files to upload; make sure an XML description is among them.
    file_elements = self._get_file_elements()
    if "description" not in file_elements:
        file_elements["description"] = self._to_xml()

    # POST to the REST endpoint for this entity type (e.g. "data/").
    endpoint = "{}/".format(_get_rest_api_type_alias(self))
    response_text = openml._api_calls._perform_api_call(
        endpoint,
        "post",
        file_elements=file_elements,
    )

    # Parse the server reply (e.g. the assigned id) into this object.
    self._parse_publish_response(xmltodict.parse(response_text))
    return self

push_tag #

push_tag(tag: str) -> None

Annotates this entity with a tag on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def push_tag(self, tag: str) -> None:
    """Annotates this entity with a tag on the server.

    Parameters
    ----------
    tag : str
        Tag to attach to this entity on the OpenML server.
    """
    # Delegates to the shared tagging helper; this performs a server call.
    _tag_openml_base(self, tag)

remove_tag #

remove_tag(tag: str) -> None

Removes a tag from this entity on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def remove_tag(self, tag: str) -> None:
    """Removes a tag from this entity on the server.

    Parameters
    ----------
    tag : str
        Tag to remove from this entity on the OpenML server.
    """
    # Same helper as push_tag; `untag=True` requests removal instead.
    _tag_openml_base(self, tag, untag=True)

url_for_id classmethod #

url_for_id(id_: int) -> str

Return the OpenML URL for the object of the class entity with the given id.

Source code in openml/base.py
@classmethod
def url_for_id(cls, id_: int) -> str:
    """Return the OpenML URL for the object of the class entity with the given id."""
    # Example: a flow with id 123 lives at openml.org/f/123.
    base_url = openml.config.get_server_base_url()
    return "{}/{}/{}".format(base_url, cls._entity_letter(), id_)

OpenMLDataFeature #

OpenMLDataFeature(index: int, name: str, data_type: str, nominal_values: list[str], number_missing_values: int, ontologies: list[str] | None = None)

Data Feature (a.k.a. Attribute) object.

Parameters#

index : int The index of this feature name : str Name of the feature data_type : str can be nominal, numeric, string, date (corresponds to arff) nominal_values : list(str) list of the possible values, in case of nominal attribute number_missing_values : int Number of rows that have a missing value for this feature. ontologies : list(str) list of ontologies attached to this feature. An ontology describes the concept that are described in a feature. An ontology is defined by an URL where the information is provided.

Source code in openml/datasets/data_feature.py
def __init__(  # noqa: PLR0913
    self,
    index: int,
    name: str,
    data_type: str,
    nominal_values: list[str],
    number_missing_values: int,
    ontologies: list[str] | None = None,
):
    """Validate the feature description and store it as attributes."""
    # Guard clauses, checked in the same order as before: index type,
    # data type legality, nominal-value consistency, missing-value count.
    if not isinstance(index, int):
        raise TypeError(f"Index must be `int` but is {type(index)}")

    if data_type not in self.LEGAL_DATA_TYPES:
        raise ValueError(
            f"data type should be in {self.LEGAL_DATA_TYPES!s}, found: {data_type}",
        )

    is_nominal = data_type == "nominal"
    if is_nominal and nominal_values is None:
        # Nominal features must enumerate their categories.
        raise TypeError(
            "Dataset features require attribute `nominal_values` for nominal "
            "feature type.",
        )
    if is_nominal and not isinstance(nominal_values, list):
        raise TypeError(
            "Argument `nominal_values` is of wrong datatype, should be list, "
            f"but is {type(nominal_values)}",
        )
    if not is_nominal and nominal_values is not None:
        raise TypeError("Argument `nominal_values` must be None for non-nominal feature.")

    if not isinstance(number_missing_values, int):
        msg = f"number_missing_values must be int but is {type(number_missing_values)}"
        raise TypeError(msg)

    self.index = index
    self.name = str(name)
    self.data_type = str(data_type)
    self.nominal_values = nominal_values
    self.number_missing_values = number_missing_values
    self.ontologies = ontologies

OpenMLDataset #

OpenMLDataset(name: str, description: str | None, data_format: Literal['arff', 'sparse_arff'] = 'arff', cache_format: Literal['feather', 'pickle'] = 'pickle', dataset_id: int | None = None, version: int | None = None, creator: str | None = None, contributor: str | None = None, collection_date: str | None = None, upload_date: str | None = None, language: str | None = None, licence: str | None = None, url: str | None = None, default_target_attribute: str | None = None, row_id_attribute: str | None = None, ignore_attribute: str | list[str] | None = None, version_label: str | None = None, citation: str | None = None, tag: str | None = None, visibility: str | None = None, original_data_url: str | None = None, paper_url: str | None = None, update_comment: str | None = None, md5_checksum: str | None = None, data_file: str | None = None, features_file: str | None = None, qualities_file: str | None = None, dataset: str | None = None, parquet_url: str | None = None, parquet_file: str | None = None)

Bases: OpenMLBase

Dataset object.

Allows fetching and uploading datasets to OpenML.

Parameters#

name : str Name of the dataset. description : str Description of the dataset. data_format : str Format of the dataset which can be either 'arff' or 'sparse_arff'. cache_format : str Format for caching the dataset which can be either 'feather' or 'pickle'. dataset_id : int, optional Id autogenerated by the server. version : int, optional Version of this dataset. '1' for original version. Auto-incremented by server. creator : str, optional The person who created the dataset. contributor : str, optional People who contributed to the current version of the dataset. collection_date : str, optional The date the data was originally collected, given by the uploader. upload_date : str, optional The date-time when the dataset was uploaded, generated by server. language : str, optional Language in which the data is represented. Starts with 1 upper case letter, rest lower case, e.g. 'English'. licence : str, optional License of the data. url : str, optional Valid URL, points to actual data file. The file can be on the OpenML server or another dataset repository. default_target_attribute : str, optional The default target attribute, if it exists. Can have multiple values, comma separated. row_id_attribute : str, optional The attribute that represents the row-id column, if present in the dataset. ignore_attribute : str | list, optional Attributes that should be excluded in modelling, such as identifiers and indexes. version_label : str, optional Version label provided by user. Can be a date, hash, or some other type of id. citation : str, optional Reference(s) that should be cited when building on this data. tag : str, optional Tags, describing the algorithms. visibility : str, optional Who can see the dataset. Typical values: 'Everyone','All my friends','Only me'. Can also be any of the user's circles. original_data_url : str, optional For derived data, the url to the original dataset. paper_url : str, optional Link to a paper describing the dataset. 
update_comment : str, optional An explanation for when the dataset is uploaded. md5_checksum : str, optional MD5 checksum to check if the dataset is downloaded without corruption. data_file : str, optional Path to where the dataset is located. features_file : dict, optional A dictionary of dataset features, which maps a feature index to a OpenMLDataFeature. qualities_file : dict, optional A dictionary of dataset qualities, which maps a quality name to a quality value. dataset: string, optional Serialized arff dataset string. parquet_url: string, optional This is the URL to the storage location where the dataset files are hosted. This can be a MinIO bucket URL. If specified, the data will be accessed from this URL when reading the files. parquet_file: string, optional Path to the local file.

Source code in openml/datasets/dataset.py
def __init__(  # noqa: C901, PLR0912, PLR0913, PLR0915
    self,
    name: str,
    description: str | None,
    data_format: Literal["arff", "sparse_arff"] = "arff",
    cache_format: Literal["feather", "pickle"] = "pickle",
    dataset_id: int | None = None,
    version: int | None = None,
    creator: str | None = None,
    contributor: str | None = None,
    collection_date: str | None = None,
    upload_date: str | None = None,
    language: str | None = None,
    licence: str | None = None,
    url: str | None = None,
    default_target_attribute: str | None = None,
    row_id_attribute: str | None = None,
    ignore_attribute: str | list[str] | None = None,
    version_label: str | None = None,
    citation: str | None = None,
    tag: str | None = None,
    visibility: str | None = None,
    original_data_url: str | None = None,
    paper_url: str | None = None,
    update_comment: str | None = None,
    md5_checksum: str | None = None,
    data_file: str | None = None,
    features_file: str | None = None,
    qualities_file: str | None = None,
    dataset: str | None = None,
    parquet_url: str | None = None,
    parquet_file: str | None = None,
):
    """Create a dataset object.

    For datasets that do not exist on the server yet (``dataset_id is None``)
    the name, description and citation are validated client-side so the
    server's XSD schema does not reject them on upload. Cache-file paths
    (pickle/feather) are resolved from ``data_file`` when given.

    Raises
    ------
    ValueError
        If ``cache_format`` is unknown, or name/description/citation contain
        characters the server rejects, or ``ignore_attribute`` has a wrong type.
    """
    if cache_format not in ["feather", "pickle"]:
        # NOTE: fixed a typo in this message (missing closing quote on 'pickle').
        raise ValueError(
            "cache_format must be one of 'feather' or 'pickle'. "
            f"Invalid format specified: {cache_format}",
        )

    def find_invalid_characters(string: str, pattern: str) -> str:
        """List (as a quoted, comma-joined string) the chars of `string` not matching `pattern`."""
        invalid_chars = set()
        regex = re.compile(pattern)
        for char in string:
            if not regex.match(char):
                invalid_chars.add(char)
        return ",".join(
            [f"'{char}'" if char != "'" else f'"{char}"' for char in invalid_chars],
        )

    if dataset_id is None:
        # Client-side validation for new datasets only; server-provided
        # metadata is trusted as-is.
        basic_latin = "^[\x00-\x7f]*$"
        if description and not re.match(basic_latin, description):
            # not basiclatin (XSD complains)
            invalid_characters = find_invalid_characters(description, basic_latin)
            raise ValueError(
                f"Invalid symbols {invalid_characters} in description: {description}",
            )
        if citation and not re.match(basic_latin, citation):
            # not basiclatin (XSD complains)
            invalid_characters = find_invalid_characters(citation, basic_latin)
            raise ValueError(
                f"Invalid symbols {invalid_characters} in citation: {citation}",
            )
        name_pattern = "^[a-zA-Z0-9_\\-\\.\\(\\),]+$"
        if not re.match(name_pattern, name):
            # regex given by server in error message
            invalid_characters = find_invalid_characters(name, name_pattern)
            raise ValueError(f"Invalid symbols {invalid_characters} in name: {name}")

    # Normalize ignore_attribute to a list (or None).
    self.ignore_attribute: list[str] | None = None
    if isinstance(ignore_attribute, str):
        self.ignore_attribute = [ignore_attribute]
    elif isinstance(ignore_attribute, list) or ignore_attribute is None:
        self.ignore_attribute = ignore_attribute
    else:
        raise ValueError("Wrong data type for ignore_attribute. Should be list.")

    # TODO add function to check if the name is casual_string128
    # Attributes received by querying the RESTful API
    self.dataset_id = int(dataset_id) if dataset_id is not None else None
    self.name = name
    self.version = int(version) if version is not None else None
    self.description = description
    self.cache_format = cache_format
    # Has to be called format, otherwise there will be an XML upload error
    self.format = data_format
    self.creator = creator
    self.contributor = contributor
    self.collection_date = collection_date
    self.upload_date = upload_date
    self.language = language
    self.licence = licence
    self.url = url
    self.default_target_attribute = default_target_attribute
    self.row_id_attribute = row_id_attribute

    self.version_label = version_label
    self.citation = citation
    self.tag = tag
    self.visibility = visibility
    self.original_data_url = original_data_url
    self.paper_url = paper_url
    self.update_comment = update_comment
    self.md5_checksum = md5_checksum
    self.data_file = data_file
    self.parquet_file = parquet_file
    self._dataset = dataset
    self._parquet_url = parquet_url

    # Lazily populated metadata caches.
    self._features: dict[int, OpenMLDataFeature] | None = None
    self._qualities: dict[str, float] | None = None
    self._no_qualities_found = False

    if features_file is not None:
        self._features = _read_features(Path(features_file))

    # "" was the old default value by `get_dataset` and maybe still used by some
    if qualities_file == "":
        # TODO(0.15): to switch to "qualities_file is not None" below and remove warning
        warnings.warn(
            "Starting from Version 0.15 `qualities_file` must be None and not an empty string "
            "to avoid reading the qualities from file. Set `qualities_file` to None to avoid "
            "this warning.",
            FutureWarning,
            stacklevel=2,
        )
        qualities_file = None

    if qualities_file is not None:
        self._qualities = _read_qualities(Path(qualities_file))

    if data_file is not None:
        data_pickle, data_feather, feather_attribute = self._compressed_cache_file_paths(
            Path(data_file)
        )
        self.data_pickle_file = data_pickle if Path(data_pickle).exists() else None
        self.data_feather_file = data_feather if Path(data_feather).exists() else None
        # BUG FIX: previously tested `Path(feather_attribute)` (always truthy)
        # instead of `.exists()`, so a missing attribute file went undetected.
        self.feather_attribute_file = (
            feather_attribute if Path(feather_attribute).exists() else None
        )
    else:
        self.data_pickle_file = None
        self.data_feather_file = None
        self.feather_attribute_file = None

features property #

features: dict[int, OpenMLDataFeature]

Get the features of this dataset.

id property #

id: int | None

Get the dataset numeric id.

openml_url property #

openml_url: str | None

The URL of the object on the server, if it was uploaded, else None.

qualities property #

qualities: dict[str, float] | None

Get the qualities of this dataset.

get_data #

get_data(target: list[str] | str | None = None, include_row_id: bool = False, include_ignore_attribute: bool = False) -> tuple[DataFrame, Series | None, list[bool], list[str]]

Returns dataset content as dataframes.

Parameters#

target : string, List[str] or None (default=None) Name of target column to separate from the data. Splitting multiple columns is currently not supported. include_row_id : boolean (default=False) Whether to include row ids in the returned dataset. include_ignore_attribute : boolean (default=False) Whether to include columns that are marked as "ignore" on the server in the dataset.

Returns#

X : dataframe, shape (n_samples, n_columns) Dataset, may have sparse dtypes in the columns if required. y : pd.Series, shape (n_samples, ) or None Target column categorical_indicator : list[bool] Mask that indicate categorical features. attribute_names : list[str] List of attribute names.

Source code in openml/datasets/dataset.py
def get_data(  # noqa: C901
    self,
    target: list[str] | str | None = None,
    include_row_id: bool = False,  # noqa: FBT001, FBT002
    include_ignore_attribute: bool = False,  # noqa: FBT001, FBT002
) -> tuple[pd.DataFrame, pd.Series | None, list[bool], list[str]]:
    """Return the dataset content as pandas objects.

    Parameters
    ----------
    target : string, List[str] or None (default=None)
        Name of the target column to split off from the data.
        Splitting multiple columns is currently not supported.
    include_row_id : boolean (default=False)
        Whether to keep the row-id column(s) in the returned data.
    include_ignore_attribute : boolean (default=False)
        Whether to keep columns marked as "ignore" on the server.

    Returns
    -------
    X : dataframe, shape (n_samples, n_columns)
        The data; columns may use sparse dtypes where required.
    y : pd.Series, shape (n_samples, ) or None
        The target column, or None if no target was requested.
    categorical_indicator : list[bool]
        Per-column mask marking categorical features.
    attribute_names : list[str]
        Names of the returned columns.
    """
    frame, categorical_mask, attribute_names = self._load_data()

    # Collect the columns that must be stripped from the result.
    dropped: list[str] = []
    if not include_row_id and self.row_id_attribute is not None:
        if isinstance(self.row_id_attribute, str):
            dropped.append(self.row_id_attribute)
        elif isinstance(self.row_id_attribute, Iterable):
            dropped.extend(self.row_id_attribute)

    if not include_ignore_attribute and self.ignore_attribute is not None:
        if isinstance(self.ignore_attribute, str):
            dropped.append(self.ignore_attribute)
        elif isinstance(self.ignore_attribute, Iterable):
            dropped.extend(self.ignore_attribute)

    if dropped:
        logger.info(f"Going to remove the following attributes: {dropped}")
        frame = frame.drop(columns=dropped)
        retained = [name not in dropped for name in attribute_names]
        categorical_mask = [cat for cat, keep in zip(categorical_mask, retained) if keep]
        attribute_names = [name for name, keep in zip(attribute_names, retained) if keep]

    if target is None:
        return frame, None, categorical_mask, attribute_names

    target_names = target.split(",") if isinstance(target, str) else target

    # Everything below assumes exactly one target column.
    if len(target_names) > 1:
        raise NotImplementedError(f"Number of targets {len(target_names)} not implemented.")

    target_name = target_names[0]
    features = frame.drop(columns=[target_name])
    labels = frame[target_name].squeeze()

    # Remove the target from the attribute metadata as well.
    target_index = attribute_names.index(target_name)
    categorical_mask.pop(target_index)
    attribute_names.remove(target_name)

    assert isinstance(labels, pd.Series)
    return features, labels, categorical_mask, attribute_names

get_features_by_type #

get_features_by_type(data_type: str, exclude: list[str] | None = None, exclude_ignore_attribute: bool = True, exclude_row_id_attribute: bool = True) -> list[int]

Return indices of features of a given type, e.g. all nominal features. Optional parameters to exclude various features by index or ontology.

Parameters#

data_type : str The data type to return (e.g., nominal, numeric, date, string) exclude : list(str) List of column names to exclude from the return value exclude_ignore_attribute : bool Whether to exclude the defined ignore attributes (and adapt the return values as if these indices are not present) exclude_row_id_attribute : bool Whether to exclude the defined row id attributes (and adapt the return values as if these indices are not present)

Returns#

result : list a list of indices that have the specified data type

Source code in openml/datasets/dataset.py
def get_features_by_type(  # noqa: C901
    self,
    data_type: str,
    exclude: list[str] | None = None,
    exclude_ignore_attribute: bool = True,  # noqa: FBT002, FBT001
    exclude_row_id_attribute: bool = True,  # noqa: FBT002, FBT001
) -> list[int]:
    """Return the indices of all features of the requested data type.

    Optional parameters allow various features to be left out by name.

    Parameters
    ----------
    data_type : str
        The data type to select (e.g., nominal, numeric, date, string).
    exclude : list(str)
        Names of columns to leave out of the result.
    exclude_ignore_attribute : bool
        Whether to drop the dataset's ignore attributes (indices are
        renumbered as if those columns were absent).
    exclude_row_id_attribute : bool
        Whether to drop the row-id attribute (indices are renumbered as
        if that column were absent).

    Returns
    -------
    result : list
        Indices of the features that have the requested data type.
    """
    if data_type not in OpenMLDataFeature.LEGAL_DATA_TYPES:
        raise TypeError("Illegal feature type requested")
    if self.ignore_attribute is not None and not isinstance(self.ignore_attribute, list):
        raise TypeError("ignore_attribute should be a list")
    if self.row_id_attribute is not None and not isinstance(self.row_id_attribute, str):
        raise TypeError("row id attribute should be a str")
    if exclude is not None and not isinstance(exclude, list):
        raise TypeError("Exclude should be a list")

    excluded_names = list(exclude) if exclude is not None else []
    if exclude_ignore_attribute and self.ignore_attribute is not None:
        excluded_names.extend(self.ignore_attribute)
    if exclude_row_id_attribute and self.row_id_attribute is not None:
        excluded_names.append(self.row_id_attribute)

    matches = []
    n_skipped = 0
    # Indices are shifted down by the number of excluded columns seen so
    # far, as if those columns were removed from the dataset entirely.
    for index, feature in self.features.items():
        if feature.name in excluded_names:
            n_skipped += 1
        elif feature.data_type == data_type:
            matches.append(index - n_skipped)
    return matches

open_in_browser #

open_in_browser() -> None

Opens the OpenML web page corresponding to this object in your default browser.

Source code in openml/base.py
def open_in_browser(self) -> None:
    """Open this object's page on OpenML.org in the default web browser."""
    url = self.openml_url
    # Objects that were never uploaded have no server page to show.
    if url is None:
        raise ValueError(
            "Cannot open element on OpenML.org when attribute `openml_url` is `None`",
        )

    webbrowser.open(url)

publish #

publish() -> OpenMLBase

Publish the object on the OpenML server.

Source code in openml/base.py
def publish(self) -> OpenMLBase:
    """Upload this object to the OpenML server.

    Returns the same object, with server-assigned fields parsed out of
    the response.
    """
    file_elements = self._get_file_elements()

    # Every upload needs an XML description; generate one if the
    # subclass did not provide it already.
    if "description" not in file_elements:
        file_elements["description"] = self._to_xml()

    endpoint = f"{_get_rest_api_type_alias(self)}/"
    response_text = openml._api_calls._perform_api_call(
        endpoint,
        "post",
        file_elements=file_elements,
    )

    self._parse_publish_response(xmltodict.parse(response_text))
    return self

push_tag #

push_tag(tag: str) -> None

Annotates this entity with a tag on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def push_tag(self, tag: str) -> None:
    """Attach a tag to this entity on the server.

    Parameters
    ----------
    tag : str
        The tag to add to this entity.
    """
    _tag_openml_base(self, tag)

remove_tag #

remove_tag(tag: str) -> None

Removes a tag from this entity on the server.

Parameters#

tag : str Tag to remove from this entity.

Source code in openml/base.py
def remove_tag(self, tag: str) -> None:
    """Remove a tag from this entity on the server.

    Parameters
    ----------
    tag : str
        The tag to remove from this entity.
    """
    _tag_openml_base(self, tag, untag=True)

retrieve_class_labels #

retrieve_class_labels(target_name: str = 'class') -> None | list[str]

Reads the datasets arff to determine the class-labels.

If the task has no class labels (for example a regression problem) it returns None. Necessary because the data returned by get_data only contains the indices of the classes, while OpenML needs the real classname when uploading the results of a run.

Parameters#

target_name : str Name of the target attribute

Returns#

list

Source code in openml/datasets/dataset.py
def retrieve_class_labels(self, target_name: str = "class") -> None | list[str]:
    """Reads the datasets arff to determine the class-labels.

    If the task has no class labels (for example a regression problem)
    it returns None. Necessary because the data returned by get_data
    only contains the indices of the classes, while OpenML needs the real
    classname when uploading the results of a run.

    Parameters
    ----------
    target_name : str
        Name of the target attribute

    Returns
    -------
    list
    """
    for feature in self.features.values():
        if feature.name == target_name:
            if feature.data_type == "nominal":
                return feature.nominal_values

            if feature.data_type == "string":
                # Rel.: #1311
                # The target is invalid for a classification task if the feature type is string
                # and not nominal. For such miss-configured tasks, we silently fix it here as
                # we can safely interpreter string as nominal.
                df, *_ = self.get_data()
                return list(df[feature.name].unique())

    return None

url_for_id classmethod #

url_for_id(id_: int) -> str

Return the OpenML URL for the object of the class entity with the given id.

Source code in openml/base.py
@classmethod
def url_for_id(cls, id_: int) -> str:
    """Return the OpenML URL for the object of the class entity with the given id."""
    # Example for a flow: openml.org/f/123
    base = openml.config.get_server_base_url()
    return f"{base}/{cls._entity_letter()}/{id_}"

OpenMLEvaluation #

OpenMLEvaluation(run_id: int, task_id: int, setup_id: int, flow_id: int, flow_name: str, data_id: int, data_name: str, function: str, upload_time: str, uploader: int, uploader_name: str, value: float | None, values: list[float] | None, array_data: str | None = None)

Contains all meta-information about a run / evaluation combination, according to the evaluation/list function

Parameters#

run_id : int Refers to the run. task_id : int Refers to the task. setup_id : int Refers to the setup. flow_id : int Refers to the flow. flow_name : str Name of the referred flow. data_id : int Refers to the dataset. data_name : str The name of the dataset. function : str The evaluation metric of this item (e.g., accuracy). upload_time : str The time of evaluation. uploader: int Uploader ID (user ID) uploader_name : str Name of the uploader of this evaluation value : float The value (score) of this evaluation. values : List[float] The values (scores) per repeat and fold (if requested) array_data : str list of information per class. (e.g., in case of precision, auroc, recall)

Source code in openml/evaluations/evaluation.py
def __init__(  # noqa: PLR0913
    self,
    run_id: int,
    task_id: int,
    setup_id: int,
    flow_id: int,
    flow_name: str,
    data_id: int,
    data_name: str,
    function: str,
    upload_time: str,
    uploader: int,
    uploader_name: str,
    value: float | None,
    values: list[float] | None,
    array_data: str | None = None,
):
    """Store the meta-information of one run/evaluation combination.

    Every argument is stored verbatim on an attribute of the same name.
    """
    # Identifiers of the entities this evaluation ties together.
    self.run_id = run_id
    self.task_id = task_id
    self.setup_id = setup_id
    self.flow_id = flow_id
    self.data_id = data_id
    # Human-readable names of the referenced flow and dataset.
    self.flow_name = flow_name
    self.data_name = data_name
    # Provenance: which metric, when, and by whom it was uploaded.
    self.function = function
    self.upload_time = upload_time
    self.uploader = uploader
    self.uploader_name = uploader_name
    # The measured result(s): a single score and/or per-fold scores.
    self.value = value
    self.values = values
    self.array_data = array_data

OpenMLFlow #

OpenMLFlow(name: str, description: str, model: object, components: dict, parameters: dict, parameters_meta_info: dict, external_version: str, tags: list, language: str, dependencies: str, class_name: str | None = None, custom_name: str | None = None, binary_url: str | None = None, binary_format: str | None = None, binary_md5: str | None = None, uploader: str | None = None, upload_date: str | None = None, flow_id: int | None = None, extension: Extension | None = None, version: str | None = None)

Bases: OpenMLBase

OpenML Flow. Stores machine learning models.

Flows should not be generated manually, but by the function :meth:openml.flows.create_flow_from_model. Using this helper function ensures that all relevant fields are filled in.

Implements openml.implementation.upload.xsd <https://github.com/openml/openml/blob/master/openml_OS/views/pages/api_new/v1/xsd/ openml.implementation.upload.xsd>_.

Parameters#

name : str Name of the flow. Is used together with the attribute external_version as a unique identifier of the flow. description : str Human-readable description of the flow (free text). model : object ML model which is described by this flow. components : OrderedDict Mapping from component identifier to an OpenMLFlow object. Components are usually subfunctions of an algorithm (e.g. kernels), base learners in ensemble algorithms (decision tree in adaboost) or building blocks of a machine learning pipeline. Components are modeled as independent flows and can be shared between flows (different pipelines can use the same components). parameters : OrderedDict Mapping from parameter name to the parameter default value. The parameter default value must be of type str, so that the respective toolbox plugin can take care of casting the parameter default value to the correct type. parameters_meta_info : OrderedDict Mapping from parameter name to dict. Stores additional information for each parameter. Required keys are data_type and description. external_version : str Version number of the software the flow is implemented in. Is used together with the attribute name as a uniquer identifier of the flow. tags : list List of tags. Created on the server by other API calls. language : str Natural language the flow is described in (not the programming language). dependencies : str A list of dependencies necessary to run the flow. This field should contain all libraries the flow depends on. To allow reproducibility it should also specify the exact version numbers. class_name : str, optional The development language name of the class which is described by this flow. custom_name : str, optional Custom name of the flow given by the owner. binary_url : str, optional Url from which the binary can be downloaded. Added by the server. Ignored when uploaded manually. Will not be used by the python API because binaries aren't compatible across machines. 
binary_format : str, optional Format in which the binary code was uploaded. Will not be used by the python API because binaries aren't compatible across machines. binary_md5 : str, optional MD5 checksum to check if the binary code was correctly downloaded. Will not be used by the python API because binaries aren't compatible across machines. uploader : str, optional OpenML user ID of the uploader. Filled in by the server. upload_date : str, optional Date the flow was uploaded. Filled in by the server. flow_id : int, optional Flow ID. Assigned by the server. extension : Extension, optional The extension for a flow (e.g., sklearn). version : str, optional OpenML version of the flow. Assigned by the server.

Source code in openml/flows/flow.py
def __init__(  # noqa: PLR0913
    self,
    name: str,
    description: str,
    model: object,
    components: dict,
    parameters: dict,
    parameters_meta_info: dict,
    external_version: str,
    tags: list,
    language: str,
    dependencies: str,
    class_name: str | None = None,
    custom_name: str | None = None,
    binary_url: str | None = None,
    binary_format: str | None = None,
    binary_md5: str | None = None,
    uploader: str | None = None,
    upload_date: str | None = None,
    flow_id: int | None = None,
    extension: Extension | None = None,
    version: str | None = None,
):
    """Initialize a flow, validating its mapping-typed arguments.

    ``components``, ``parameters`` and ``parameters_meta_info`` must be
    dict-like, and the latter two must describe exactly the same set of
    parameter names.
    """
    self.name = name
    self.description = description
    self.model = model

    # Validate the mapping-typed arguments up front.
    mappings = {
        "components": components,
        "parameters": parameters,
        "parameters_meta_info": parameters_meta_info,
    }
    for attr_name, mapping in mappings.items():
        # OrderedDict is a dict subclass, so one isinstance check suffices.
        if not isinstance(mapping, dict):
            raise TypeError(
                f"{attr_name} must be of type OrderedDict or dict, but is {type(mapping)}.",
            )

    self.components = components
    self.parameters = parameters
    self.parameters_meta_info = parameters_meta_info
    self.class_name = class_name

    # Both mappings must cover exactly the same parameter names.
    param_keys = set(parameters.keys())
    meta_keys = set(parameters_meta_info.keys())
    only_in_params = param_keys.difference(meta_keys)
    if only_in_params:
        raise ValueError(
            f"Parameter {only_in_params!s} only in parameters, but not in parameters_meta_info.",
        )
    only_in_meta = meta_keys.difference(param_keys)
    if only_in_meta:
        raise ValueError(
            f"Parameter {only_in_meta!s} only in  parameters_meta_info, but not in parameters.",
        )

    self.external_version = external_version
    self.uploader = uploader

    self.custom_name = custom_name
    self.tags = [] if tags is None else tags
    self.binary_url = binary_url
    self.binary_format = binary_format
    self.binary_md5 = binary_md5
    self.version = version
    self.upload_date = upload_date
    self.language = language
    self.dependencies = dependencies
    self.flow_id = flow_id
    self._extension = extension

extension property #

extension: Extension

The extension of the flow (e.g., sklearn).

id property #

id: int | None

The ID of the flow.

openml_url property #

openml_url: str | None

The URL of the object on the server, if it was uploaded, else None.

from_filesystem classmethod #

from_filesystem(input_directory: str | Path) -> OpenMLFlow

Read a flow from an XML in input_directory on the filesystem.

Source code in openml/flows/flow.py
@classmethod
def from_filesystem(cls, input_directory: str | Path) -> OpenMLFlow:
    """Read a flow from an XML in input_directory on the filesystem."""
    flow_xml = (Path(input_directory) / "flow.xml").read_text()
    return OpenMLFlow._from_dict(xmltodict.parse(flow_xml))

get_structure #

get_structure(key_item: str) -> dict[str, list[str]]

Returns for each sub-component of the flow the path of identifiers that should be traversed to reach this component. The resulting dict maps a key (identifying a flow by either its id, name or fullname) to the parameter prefix.

Parameters#

key_item: str The flow attribute that will be used to identify flows in the structure. Allowed values {flow_id, name}

Returns#

dict[str, List[str]] The flow structure

Source code in openml/flows/flow.py
def get_structure(self, key_item: str) -> dict[str, list[str]]:
    """Map every (sub-)component of this flow to its access path.

    For each sub-component, the value is the list of component
    identifiers that must be traversed from this flow to reach it; the
    flow itself is mapped to the empty path.

    Parameters
    ----------
    key_item: str
        The flow attribute used as the dictionary key.
        Allowed values {flow_id, name}

    Returns
    -------
    dict[str, List[str]]
        The flow structure
    """
    if key_item not in ("flow_id", "name"):
        raise ValueError("key_item should be in {flow_id, name}")

    structure: dict[str, list[str]] = {}
    for identifier, sub_flow in self.components.items():
        # Prefix each sub-flow's paths with the identifier under which
        # it is attached to this flow.
        for sub_key, sub_path in sub_flow.get_structure(key_item).items():
            structure[sub_key] = [identifier, *sub_path]
    # The flow itself is registered last, with an empty path.
    structure[getattr(self, key_item)] = []
    return structure

get_subflow #

get_subflow(structure: list[str]) -> OpenMLFlow

Returns a subflow from the tree of dependencies.

Parameters#

structure: list[str] A list of strings, indicating the location of the subflow

Returns#

OpenMLFlow The OpenMLFlow that corresponds to the structure

Source code in openml/flows/flow.py
def get_subflow(self, structure: list[str]) -> OpenMLFlow:
    """Resolve a path of component identifiers to a sub-flow.

    Parameters
    ----------
    structure: list[str]
        A list of strings, indicating the location of the subflow

    Returns
    -------
    OpenMLFlow
        The OpenMLFlow that corresponds to the structure
    """
    # Work on a copy so the caller's list is never mutated.
    remaining = list(structure)
    if not remaining:
        raise ValueError("Please provide a structure list of size >= 1")

    head = remaining[0]
    if head not in self.components:
        raise ValueError(
            f"Flow {self.name} does not contain component with identifier {head}",
        )

    child = self.components[head]
    if len(remaining) == 1:
        return child  # type: ignore

    # Recurse into the child with the rest of the path.
    return child.get_subflow(remaining[1:])  # type: ignore

open_in_browser #

open_in_browser() -> None

Opens the OpenML web page corresponding to this object in your default browser.

Source code in openml/base.py
def open_in_browser(self) -> None:
    """Launch the default browser on this object's OpenML.org page."""
    page_url = self.openml_url
    if page_url is None:
        # Never uploaded: there is no server page for this object.
        raise ValueError(
            "Cannot open element on OpenML.org when attribute `openml_url` is `None`",
        )
    webbrowser.open(page_url)

publish #

publish(raise_error_if_exists: bool = False) -> OpenMLFlow

Publish this flow to OpenML server.

Raises a PyOpenMLError if the flow exists on the server, but self.flow_id does not match the server known flow id.

Parameters#

raise_error_if_exists : bool, optional (default=False) If True, raise PyOpenMLError if the flow exists on the server. If False, update the local flow to match the server flow.

Returns#

self : OpenMLFlow

Source code in openml/flows/flow.py
def publish(self, raise_error_if_exists: bool = False) -> OpenMLFlow:  # noqa: FBT001, FBT002
    """Publish this flow to OpenML server.

    Raises a PyOpenMLError if the flow exists on the server but
    `self.flow_id` does not match the server-known flow id.

    Parameters
    ----------
    raise_error_if_exists : bool, optional (default=False)
        If True, raise PyOpenMLError if the flow exists on the server.
        If False, update the local flow to match the server flow.

    Returns
    -------
    self : OpenMLFlow

    """
    # Imported here to break a cyclic dependency: flow.py needs
    # functions.get_flow(), while functions.py instantiates OpenMLFlow.
    import openml.flows.functions

    server_id = openml.flows.functions.flow_exists(self.name, self.external_version)
    if not server_id:
        # Unknown to the server: a local flow_id would be inconsistent.
        if self.flow_id:
            raise openml.exceptions.PyOpenMLError(
                "Flow does not exist on the server, but 'flow.flow_id' is not None.",
            )
        super().publish()
        assert self.flow_id is not None  # for mypy
        server_id = self.flow_id
    elif raise_error_if_exists:
        raise openml.exceptions.PyOpenMLError(
            f"This OpenMLFlow already exists with id: {server_id}."
        )
    elif self.flow_id is not None and self.flow_id != server_id:
        raise openml.exceptions.PyOpenMLError(
            f"Local flow_id does not match server flow_id: '{self.flow_id}' vs '{server_id}'",
        )

    # Fetch the server copy and make sure it agrees with the local one.
    server_flow = openml.flows.functions.get_flow(server_id)
    _copy_server_fields(server_flow, self)
    try:
        openml.flows.functions.assert_flows_equal(
            self,
            server_flow,
            server_flow.upload_date,
            ignore_parameter_values=True,
            ignore_custom_name_if_none=True,
        )
    except ValueError as err:
        detail = err.args[0]
        raise ValueError(
            "The flow on the server is inconsistent with the local flow. "
            f"The server flow ID is {server_id}. Please check manually and remove "
            f"the flow if necessary! Error is:\n'{detail}'",
        ) from err
    return self

push_tag #

push_tag(tag: str) -> None

Annotates this entity with a tag on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def push_tag(self, tag: str) -> None:
    """Add a tag to this entity on the OpenML server.

    Parameters
    ----------
    tag : str
        The tag to attach.
    """
    _tag_openml_base(self, tag)

remove_tag #

remove_tag(tag: str) -> None

Removes a tag from this entity on the server.

Parameters#

tag : str Tag to remove from this entity.

Source code in openml/base.py
def remove_tag(self, tag: str) -> None:
    """Delete a tag from this entity on the OpenML server.

    Parameters
    ----------
    tag : str
        The tag to remove.
    """
    _tag_openml_base(self, tag, untag=True)

to_filesystem #

to_filesystem(output_directory: str | Path) -> None

Write a flow to the filesystem as XML to output_directory.

Source code in openml/flows/flow.py
def to_filesystem(self, output_directory: str | Path) -> None:
    """Serialize this flow as XML to ``output_directory/flow.xml``."""
    target_dir = Path(output_directory)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Refuse to clobber a previously exported flow.
    target_file = target_dir / "flow.xml"
    if target_file.exists():
        raise ValueError("Output directory already contains a flow.xml file.")

    target_file.write_text(self._to_xml())

url_for_id classmethod #

url_for_id(id_: int) -> str

Return the OpenML URL for the object of the class entity with the given id.

Source code in openml/base.py
@classmethod
def url_for_id(cls, id_: int) -> str:
    """Return the OpenML URL for the object of the class entity with the given id."""
    # e.g. a flow resolves to openml.org/f/123
    return "/".join(
        [openml.config.get_server_base_url(), cls._entity_letter(), str(id_)]
    )

OpenMLLearningCurveTask #

OpenMLLearningCurveTask(task_type_id: TaskType, task_type: str, data_set_id: int, target_name: str, estimation_procedure_id: int = 13, estimation_procedure_type: str | None = None, estimation_parameters: dict[str, str] | None = None, data_splits_url: str | None = None, task_id: int | None = None, evaluation_measure: str | None = None, class_labels: list[str] | None = None, cost_matrix: ndarray | None = None)

Bases: OpenMLClassificationTask

OpenML Learning Curve object.

Parameters#

task_type_id : TaskType ID of the Learning Curve task. task_type : str Name of the Learning Curve task. data_set_id : int ID of the dataset that this task is associated with. target_name : str Name of the target feature in the dataset. estimation_procedure_id : int, default=13 ID of the estimation procedure to use for evaluating models. estimation_procedure_type : str, default=None Type of the estimation procedure. estimation_parameters : dict, default=None Additional parameters for the estimation procedure. data_splits_url : str, default=None URL of the file containing the data splits for Learning Curve task. task_id : Union[int, None] ID of the Learning Curve task. evaluation_measure : str, default=None Name of the evaluation measure to use for evaluating models. class_labels : list of str, default=None Class labels for Learning Curve tasks. cost_matrix : numpy array, default=None Cost matrix for Learning Curve tasks.

Source code in openml/tasks/task.py
def __init__(  # noqa: PLR0913
    self,
    task_type_id: TaskType,
    task_type: str,
    data_set_id: int,
    target_name: str,
    estimation_procedure_id: int = 13,
    estimation_procedure_type: str | None = None,
    estimation_parameters: dict[str, str] | None = None,
    data_splits_url: str | None = None,
    task_id: int | None = None,
    evaluation_measure: str | None = None,
    class_labels: list[str] | None = None,
    cost_matrix: np.ndarray | None = None,
):
    """Create a learning-curve task description.

    All arguments are forwarded unchanged to the parent
    (classification task) constructor; this subclass only changes the
    default estimation procedure (id 13).
    """
    super().__init__(
        task_type_id=task_type_id,
        task_type=task_type,
        data_set_id=data_set_id,
        target_name=target_name,
        estimation_procedure_id=estimation_procedure_id,
        estimation_procedure_type=estimation_procedure_type,
        estimation_parameters=estimation_parameters,
        data_splits_url=data_splits_url,
        task_id=task_id,
        evaluation_measure=evaluation_measure,
        class_labels=class_labels,
        cost_matrix=cost_matrix,
    )

estimation_parameters property writable #

estimation_parameters: dict[str, str] | None

Return the estimation parameters for the task.

id property #

id: int | None

Return the OpenML ID of this task.

openml_url property #

openml_url: str | None

The URL of the object on the server, if it was uploaded, else None.

download_split #

download_split() -> OpenMLSplit

Download the OpenML split for a given task.

Source code in openml/tasks/task.py
def download_split(self) -> OpenMLSplit:
    """Return the OpenML split for this task, downloading it if not cached."""
    # TODO(eddiebergman): Can this every be `None`?
    assert self.task_id is not None
    cached_split_file = _create_cache_directory_for_id("tasks", self.task_id) / "datasplits.arff"

    try:
        return OpenMLSplit._from_arff_file(cached_split_file)
    except OSError:
        # Cache miss (or unreadable cache): fetch from the server first.
        self._download_split(cached_split_file)
        return OpenMLSplit._from_arff_file(cached_split_file)

get_X_and_y #

get_X_and_y() -> tuple[DataFrame, Series | DataFrame | None]

Get data associated with the current task.

Returns#

tuple - X and y

Source code in openml/tasks/task.py
def get_X_and_y(self) -> tuple[pd.DataFrame, pd.Series | pd.DataFrame | None]:
    """Get data associated with the current task.

    Returns
    -------
    tuple - X and y

    """
    dataset = self.get_dataset()
    if self.task_type_id not in (
        TaskType.SUPERVISED_CLASSIFICATION,
        TaskType.SUPERVISED_REGRESSION,
        TaskType.LEARNING_CURVE,
    ):
        raise NotImplementedError(self.task_type)

    X, y, _, _ = dataset.get_data(target=self.target_name)
    return X, y

get_dataset #

get_dataset(**kwargs: Any) -> OpenMLDataset

Download dataset associated with task.

Accepts the same keyword arguments as the openml.datasets.get_dataset.

Source code in openml/tasks/task.py
def get_dataset(self, **kwargs: Any) -> datasets.OpenMLDataset:
    """Fetch the dataset this task is defined on.

    All keyword arguments are forwarded to `openml.datasets.get_dataset`.
    """
    return datasets.get_dataset(self.dataset_id, **kwargs)

get_split_dimensions #

get_split_dimensions() -> tuple[int, int, int]

Get the (repeats, folds, samples) of the split for a given task.

Source code in openml/tasks/task.py
def get_split_dimensions(self) -> tuple[int, int, int]:
    """Return the (repeats, folds, samples) dimensions of this task's split."""
    # The split description is fetched lazily and memoised on the task.
    if self.split is None:
        self.split = self.download_split()
    split = self.split
    return (split.repeats, split.folds, split.samples)

get_train_test_split_indices #

get_train_test_split_indices(fold: int = 0, repeat: int = 0, sample: int = 0) -> tuple[ndarray, ndarray]

Get the indices of the train and test splits for a given task.

Source code in openml/tasks/task.py
def get_train_test_split_indices(
    self,
    fold: int = 0,
    repeat: int = 0,
    sample: int = 0,
) -> tuple[np.ndarray, np.ndarray]:
    """Look up the (train, test) index arrays for one repeat/fold/sample."""
    # Fetch and memoise the split description on first use.
    if self.split is None:
        self.split = self.download_split()
    split = self.split
    return split.get(repeat=repeat, fold=fold, sample=sample)

open_in_browser #

open_in_browser() -> None

Opens the OpenML web page corresponding to this object in your default browser.

Source code in openml/base.py
def open_in_browser(self) -> None:
    """Launch the default web browser on this object's OpenML.org page."""
    url = self.openml_url
    # Objects not yet uploaded to the server have no URL to open.
    if url is None:
        raise ValueError(
            "Cannot open element on OpenML.org when attribute `openml_url` is `None`",
        )
    webbrowser.open(url)

publish #

publish() -> OpenMLBase

Publish the object on the OpenML server.

Source code in openml/base.py
def publish(self) -> OpenMLBase:
    """Upload this object to the OpenML server and record the assigned id."""
    file_elements = self._get_file_elements()

    # The XML description is mandatory; generate it only when missing.
    if "description" not in file_elements:
        file_elements["description"] = self._to_xml()

    endpoint = f"{_get_rest_api_type_alias(self)}/"
    response_text = openml._api_calls._perform_api_call(
        endpoint,
        "post",
        file_elements=file_elements,
    )

    self._parse_publish_response(xmltodict.parse(response_text))
    return self

push_tag #

push_tag(tag: str) -> None

Annotates this entity with a tag on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def push_tag(self, tag: str) -> None:
    """Attach a tag to this entity on the OpenML server.

    Parameters
    ----------
    tag : str
        Tag to attach to the flow.
    """
    _tag_openml_base(self, tag)

remove_tag #

remove_tag(tag: str) -> None

Removes a tag from this entity on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def remove_tag(self, tag: str) -> None:
    """Detach a tag from this entity on the OpenML server.

    Parameters
    ----------
    tag : str
        Tag to attach to the flow.
    """
    _tag_openml_base(self, tag, untag=True)

url_for_id classmethod #

url_for_id(id_: int) -> str

Return the OpenML URL for the object of the class entity with the given id.

Source code in openml/base.py
@classmethod
def url_for_id(cls, id_: int) -> str:
    """Build the web URL for this entity type with the given id (e.g. openml.org/f/123)."""
    base = openml.config.get_server_base_url()
    return f"{base}/{cls._entity_letter()}/{id_}"

OpenMLParameter #

OpenMLParameter(input_id: int, flow_id: int, flow_name: str, full_name: str, parameter_name: str, data_type: str, default_value: str, value: str)

Parameter object (used in setup).

Parameters#

input_id : int The input id from the openml database flow id : int The flow to which this parameter is associated flow name : str The name of the flow (no version number) to which this parameter is associated full_name : str The name of the flow and parameter combined parameter_name : str The name of the parameter data_type : str The datatype of the parameter. generally unused for sklearn flows default_value : str The default value. For sklearn parameters, this is unknown and a default value is selected arbitrarily value : str If the parameter was set, the value that it was set to.

Source code in openml/setups/setup.py
def __init__(  # noqa: PLR0913
    self,
    input_id: int,
    flow_id: int,
    flow_name: str,
    full_name: str,
    parameter_name: str,
    data_type: str,
    default_value: str,
    value: str,
):
    """Store the parameter description; `input_id` is exposed as `self.id`."""
    self.id = input_id
    # Flow this parameter belongs to.
    self.flow_id = flow_id
    self.flow_name = flow_name
    # Naming / typing of the parameter itself.
    self.full_name = full_name
    self.parameter_name = parameter_name
    self.data_type = data_type
    # Declared default and the value actually set in this setup.
    self.default_value = default_value
    self.value = value

OpenMLRegressionTask #

OpenMLRegressionTask(task_type_id: TaskType, task_type: str, data_set_id: int, target_name: str, estimation_procedure_id: int = 7, estimation_procedure_type: str | None = None, estimation_parameters: dict[str, str] | None = None, data_splits_url: str | None = None, task_id: int | None = None, evaluation_measure: str | None = None)

Bases: OpenMLSupervisedTask

OpenML Regression object.

Parameters#

task_type_id : TaskType Task type ID of the OpenML Regression task. task_type : str Task type of the OpenML Regression task. data_set_id : int ID of the OpenML dataset. target_name : str Name of the target feature used in the Regression task. estimation_procedure_id : int, default=None ID of the OpenML estimation procedure. estimation_procedure_type : str, default=None Type of the OpenML estimation procedure. estimation_parameters : dict, default=None Parameters used by the OpenML estimation procedure. data_splits_url : str, default=None URL of the OpenML data splits for the Regression task. task_id : Union[int, None] ID of the OpenML Regression task. evaluation_measure : str, default=None Evaluation measure used in the Regression task.

Source code in openml/tasks/task.py
def __init__(  # noqa: PLR0913
    self,
    task_type_id: TaskType,
    task_type: str,
    data_set_id: int,
    target_name: str,
    estimation_procedure_id: int = 7,
    estimation_procedure_type: str | None = None,
    estimation_parameters: dict[str, str] | None = None,
    data_splits_url: str | None = None,
    task_id: int | None = None,
    evaluation_measure: str | None = None,
):
    """Initialise a regression task; all arguments are forwarded to the base class."""
    base_kwargs = {
        "task_id": task_id,
        "task_type_id": task_type_id,
        "task_type": task_type,
        "data_set_id": data_set_id,
        "estimation_procedure_id": estimation_procedure_id,
        "estimation_procedure_type": estimation_procedure_type,
        "estimation_parameters": estimation_parameters,
        "evaluation_measure": evaluation_measure,
        "target_name": target_name,
        "data_splits_url": data_splits_url,
    }
    super().__init__(**base_kwargs)

estimation_parameters property writable #

estimation_parameters: dict[str, str] | None

Return the estimation parameters for the task.

id property #

id: int | None

Return the OpenML ID of this task.

openml_url property #

openml_url: str | None

The URL of the object on the server, if it was uploaded, else None.

download_split #

download_split() -> OpenMLSplit

Download the OpenML split for a given task.

Source code in openml/tasks/task.py
def download_split(self) -> OpenMLSplit:
    """Fetch this task's split, reading the local cache when possible."""
    # TODO(eddiebergman): Can this ever be `None`?
    assert self.task_id is not None
    split_path = _create_cache_directory_for_id("tasks", self.task_id) / "datasplits.arff"

    try:
        return OpenMLSplit._from_arff_file(split_path)
    except OSError:
        # Cache miss (or unreadable file): download the split, then parse it.
        self._download_split(split_path)
        return OpenMLSplit._from_arff_file(split_path)

get_X_and_y #

get_X_and_y() -> tuple[DataFrame, Series | DataFrame | None]

Get data associated with the current task.

Returns#

tuple - X and y

Source code in openml/tasks/task.py
def get_X_and_y(self) -> tuple[pd.DataFrame, pd.Series | pd.DataFrame | None]:
    """Get data associated with the current task.

    Returns
    -------
    tuple - X and y

    Raises
    ------
    NotImplementedError
        If the task type is not a supervised classification, supervised
        regression or learning-curve task.
    """
    # Validate the task type *before* triggering the (potentially expensive)
    # dataset download; previously the dataset was fetched even when this
    # method was about to raise NotImplementedError.
    if self.task_type_id not in (
        TaskType.SUPERVISED_CLASSIFICATION,
        TaskType.SUPERVISED_REGRESSION,
        TaskType.LEARNING_CURVE,
    ):
        raise NotImplementedError(self.task_type)

    dataset = self.get_dataset()
    X, y, _, _ = dataset.get_data(target=self.target_name)
    return X, y

get_dataset #

get_dataset(**kwargs: Any) -> OpenMLDataset

Download dataset associated with task.

Accepts the same keyword arguments as the openml.datasets.get_dataset.

Source code in openml/tasks/task.py
def get_dataset(self, **kwargs: Any) -> datasets.OpenMLDataset:
    """Retrieve the OpenML dataset this task is defined on.

    All keyword arguments are forwarded to `openml.datasets.get_dataset`.
    """
    dataset_id = self.dataset_id
    return datasets.get_dataset(dataset_id, **kwargs)

get_split_dimensions #

get_split_dimensions() -> tuple[int, int, int]

Get the (repeats, folds, samples) of the split for a given task.

Source code in openml/tasks/task.py
def get_split_dimensions(self) -> tuple[int, int, int]:
    """Return the (repeats, folds, samples) dimensions of this task's split."""
    # The split description is fetched lazily and memoised on the task.
    if self.split is None:
        self.split = self.download_split()
    split = self.split
    return (split.repeats, split.folds, split.samples)

get_train_test_split_indices #

get_train_test_split_indices(fold: int = 0, repeat: int = 0, sample: int = 0) -> tuple[ndarray, ndarray]

Get the indices of the train and test splits for a given task.

Source code in openml/tasks/task.py
def get_train_test_split_indices(
    self,
    fold: int = 0,
    repeat: int = 0,
    sample: int = 0,
) -> tuple[np.ndarray, np.ndarray]:
    """Look up the (train, test) index arrays for one repeat/fold/sample."""
    # Fetch and memoise the split description on first use.
    if self.split is None:
        self.split = self.download_split()
    split = self.split
    return split.get(repeat=repeat, fold=fold, sample=sample)

open_in_browser #

open_in_browser() -> None

Opens the OpenML web page corresponding to this object in your default browser.

Source code in openml/base.py
def open_in_browser(self) -> None:
    """Launch the default web browser on this object's OpenML.org page."""
    url = self.openml_url
    # Objects not yet uploaded to the server have no URL to open.
    if url is None:
        raise ValueError(
            "Cannot open element on OpenML.org when attribute `openml_url` is `None`",
        )
    webbrowser.open(url)

publish #

publish() -> OpenMLBase

Publish the object on the OpenML server.

Source code in openml/base.py
def publish(self) -> OpenMLBase:
    """Upload this object to the OpenML server and record the assigned id."""
    file_elements = self._get_file_elements()

    # The XML description is mandatory; generate it only when missing.
    if "description" not in file_elements:
        file_elements["description"] = self._to_xml()

    endpoint = f"{_get_rest_api_type_alias(self)}/"
    response_text = openml._api_calls._perform_api_call(
        endpoint,
        "post",
        file_elements=file_elements,
    )

    self._parse_publish_response(xmltodict.parse(response_text))
    return self

push_tag #

push_tag(tag: str) -> None

Annotates this entity with a tag on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def push_tag(self, tag: str) -> None:
    """Attach a tag to this entity on the OpenML server.

    Parameters
    ----------
    tag : str
        Tag to attach to the flow.
    """
    _tag_openml_base(self, tag)

remove_tag #

remove_tag(tag: str) -> None

Removes a tag from this entity on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def remove_tag(self, tag: str) -> None:
    """Detach a tag from this entity on the OpenML server.

    Parameters
    ----------
    tag : str
        Tag to attach to the flow.
    """
    _tag_openml_base(self, tag, untag=True)

url_for_id classmethod #

url_for_id(id_: int) -> str

Return the OpenML URL for the object of the class entity with the given id.

Source code in openml/base.py
@classmethod
def url_for_id(cls, id_: int) -> str:
    """Build the web URL for this entity type with the given id (e.g. openml.org/f/123)."""
    base = openml.config.get_server_base_url()
    return f"{base}/{cls._entity_letter()}/{id_}"

OpenMLRun #

OpenMLRun(task_id: int, flow_id: int | None, dataset_id: int | None, setup_string: str | None = None, output_files: dict[str, int] | None = None, setup_id: int | None = None, tags: list[str] | None = None, uploader: int | None = None, uploader_name: str | None = None, evaluations: dict | None = None, fold_evaluations: dict | None = None, sample_evaluations: dict | None = None, data_content: list[list] | None = None, trace: OpenMLRunTrace | None = None, model: object | None = None, task_type: str | None = None, task_evaluation_measure: str | None = None, flow_name: str | None = None, parameter_settings: list[dict[str, Any]] | None = None, predictions_url: str | None = None, task: OpenMLTask | None = None, flow: OpenMLFlow | None = None, run_id: int | None = None, description_text: str | None = None, run_details: str | None = None)

Bases: OpenMLBase

OpenML Run: result of running a model on an OpenML dataset.

Parameters#

task_id: int The ID of the OpenML task associated with the run. flow_id: int The ID of the OpenML flow associated with the run. dataset_id: int The ID of the OpenML dataset used for the run. setup_string: str The setup string of the run. output_files: Dict[str, int] Specifies where each related file can be found. setup_id: int An integer representing the ID of the setup used for the run. tags: List[str] Representing the tags associated with the run. uploader: int User ID of the uploader. uploader_name: str The name of the person who uploaded the run. evaluations: Dict Representing the evaluations of the run. fold_evaluations: Dict The evaluations of the run for each fold. sample_evaluations: Dict The evaluations of the run for each sample. data_content: List[List] The predictions generated from executing this run. trace: OpenMLRunTrace The trace containing information on internal model evaluations of this run. model: object The untrained model that was evaluated in the run. task_type: str The type of the OpenML task associated with the run. task_evaluation_measure: str The evaluation measure used for the task. flow_name: str The name of the OpenML flow associated with the run. parameter_settings: list[OrderedDict] Representing the parameter settings used for the run. predictions_url: str The URL of the predictions file. task: OpenMLTask An instance of the OpenMLTask class, representing the OpenML task associated with the run. flow: OpenMLFlow An instance of the OpenMLFlow class, representing the OpenML flow associated with the run. run_id: int The ID of the run. description_text: str, optional Description text to add to the predictions file. If left None, is set to the time the arff file is generated. run_details: str, optional (default=None) Description of the run stored in the run meta-data.

Source code in openml/runs/run.py
def __init__(  # noqa: PLR0913
    self,
    task_id: int,
    flow_id: int | None,
    dataset_id: int | None,
    setup_string: str | None = None,
    output_files: dict[str, int] | None = None,
    setup_id: int | None = None,
    tags: list[str] | None = None,
    uploader: int | None = None,
    uploader_name: str | None = None,
    evaluations: dict | None = None,
    fold_evaluations: dict | None = None,
    sample_evaluations: dict | None = None,
    data_content: list[list] | None = None,
    trace: OpenMLRunTrace | None = None,
    model: object | None = None,
    task_type: str | None = None,
    task_evaluation_measure: str | None = None,
    flow_name: str | None = None,
    parameter_settings: list[dict[str, Any]] | None = None,
    predictions_url: str | None = None,
    task: OpenMLTask | None = None,
    flow: OpenMLFlow | None = None,
    run_id: int | None = None,
    description_text: str | None = None,
    run_details: str | None = None,
):
    """Store run metadata; see the class docstring for parameter details."""
    # Task / dataset linkage.
    self.task_id = task_id
    self.task = task
    self.task_type = task_type
    self.task_evaluation_measure = task_evaluation_measure
    self.dataset_id = dataset_id
    # Flow / setup / model used to produce the run.
    self.flow_id = flow_id
    self.flow = flow
    self.flow_name = flow_name
    self.setup_id = setup_id
    self.setup_string = setup_string
    self.parameter_settings = parameter_settings
    self.model = model
    # Provenance and descriptive metadata.
    self.run_id = run_id
    self.uploader = uploader
    self.uploader_name = uploader_name
    self.tags = tags
    self.description_text = description_text
    self.run_details = run_details
    # Results and server-side artefacts.
    self.evaluations = evaluations
    self.fold_evaluations = fold_evaluations
    self.sample_evaluations = sample_evaluations
    self.data_content = data_content
    self.output_files = output_files
    self.predictions_url = predictions_url
    self.trace = trace
    # Populated lazily / on failure.
    self.error_message = None
    self._predictions = None

id property #

id: int | None

The ID of the run, None if not uploaded to the server yet.

openml_url property #

openml_url: str | None

The URL of the object on the server, if it was uploaded, else None.

predictions property #

predictions: DataFrame

Return a DataFrame with predictions for this run

from_filesystem classmethod #

from_filesystem(directory: str | Path, expect_model: bool = True) -> OpenMLRun

The inverse of the to_filesystem method. Instantiates an OpenMLRun object based on files stored on the file system.

Parameters#

directory : str a path leading to the folder where the results are stored

expect_model : bool

if True, it requires the model pickle to be present, and an error will be thrown if not. Otherwise, the model might or might not be present.

Returns#

run : OpenMLRun the re-instantiated run object

Source code in openml/runs/run.py
@classmethod
def from_filesystem(cls, directory: str | Path, expect_model: bool = True) -> OpenMLRun:  # noqa: FBT001, FBT002
    """
    Instantiate an OpenMLRun from files previously written by
    ``to_filesystem``.

    Parameters
    ----------
    directory : str
        a path leading to the folder where the results
        are stored

    expect_model : bool
        if True, it requires the model pickle to be present, and an error
        will be thrown if not. Otherwise, the model might or might not
        be present.

    Returns
    -------
    run : OpenMLRun
        the re-instantiated run object
    """
    # Avoiding cyclic imports
    import openml.runs.functions

    folder = Path(directory)
    if not folder.is_dir():
        raise ValueError("Could not find folder")

    description_file = folder / "description.xml"
    predictions_file = folder / "predictions.arff"
    trace_file = folder / "trace.arff"
    model_file = folder / "model.pkl"

    if not description_file.is_file():
        raise ValueError("Could not find description.xml")
    if not predictions_file.is_file():
        raise ValueError("Could not find predictions.arff")
    if expect_model and not model_file.is_file():
        raise ValueError("Could not find model.pkl")

    run = openml.runs.functions._create_run_from_xml(
        description_file.read_text(),
        from_server=False,
    )

    # Runs serialized before upload may carry the flow alongside them.
    if run.flow_id is None:
        flow = openml.flows.OpenMLFlow.from_filesystem(folder)
        run.flow = flow
        run.flow_name = flow.name

    with predictions_file.open() as predictions_fh:
        run.data_content = arff.load(predictions_fh)["data"]

    # The model is loaded whenever the pickle exists, even when
    # expect_model is False.
    if model_file.is_file():
        with model_file.open("rb") as model_fh:
            run.model = pickle.load(model_fh)  # noqa: S301

    if trace_file.is_file():
        run.trace = openml.runs.OpenMLRunTrace._from_filesystem(trace_file)

    return run

get_metric_fn #

get_metric_fn(sklearn_fn: Callable, kwargs: dict | None = None) -> ndarray

Calculates metric scores based on predicted values. Assumes the run has been executed locally (and contains run_data). Furthermore, it assumes that the 'correct' or 'truth' attribute is specified in the arff (which is an optional field, but always the case for openml-python runs)

Parameters#

sklearn_fn : function a function pointer to a sklearn function that accepts y_true, y_pred and **kwargs kwargs : dict kwargs for the function

Returns#

scores : ndarray of scores of length num_folds * num_repeats metric results

Source code in openml/runs/run.py
def get_metric_fn(self, sklearn_fn: Callable, kwargs: dict | None = None) -> np.ndarray:  # noqa: PLR0915, PLR0912, C901
    """Calculates metric scores based on predicted values. Assumes the
    run has been executed locally (and contains run_data). Furthermore,
    it assumes that the 'correct' or 'truth' attribute is specified in
    the arff (which is an optional field, but always the case for
    openml-python runs)

    Parameters
    ----------
    sklearn_fn : function
        a function pointer to a sklearn function that
        accepts ``y_true``, ``y_pred`` and ``**kwargs``
    kwargs : dict
        kwargs for the function

    Returns
    -------
    scores : ndarray of scores of length num_folds * num_repeats
        metric results
    """
    kwargs = kwargs if kwargs else {}
    # Prefer locally generated predictions; otherwise fall back to the
    # predictions file stored on the server.
    if self.data_content is not None and self.task_id is not None:
        predictions_arff = self._generate_arff_dict()
    elif (self.output_files is not None) and ("predictions" in self.output_files):
        predictions_file_url = openml._api_calls._file_id_to_url(
            self.output_files["predictions"],
            "predictions.arff",
        )
        response = openml._api_calls._download_text_file(predictions_file_url)
        predictions_arff = arff.loads(response)
        # TODO: make this a stream reader
    else:
        raise ValueError(
            "Run should have been locally executed or contain outputfile reference.",
        )

    # Need to know more about the task to compute scores correctly
    task = get_task(self.task_id)

    attribute_names = [att[0] for att in predictions_arff["attributes"]]
    if (
        task.task_type_id in [TaskType.SUPERVISED_CLASSIFICATION, TaskType.LEARNING_CURVE]
        and "correct" not in attribute_names
    ):
        raise ValueError('Attribute "correct" should be set for classification task runs')
    if task.task_type_id == TaskType.SUPERVISED_REGRESSION and "truth" not in attribute_names:
        raise ValueError('Attribute "truth" should be set for regression task runs')
    if task.task_type_id != TaskType.CLUSTERING and "prediction" not in attribute_names:
        # BUG FIX: the message previously said "predict", but the attribute
        # actually required (and looked up below) is "prediction".
        raise ValueError('Attribute "prediction" should be set for supervised task runs')

    def _attribute_list_to_dict(attribute_list):  # type: ignore
        # convenience function: Creates a mapping to map from the name of
        # attributes present in the arff prediction file to their index.
        # This is necessary because the number of classes can be different
        # for different tasks.
        res = OrderedDict()
        for idx in range(len(attribute_list)):
            res[attribute_list[idx][0]] = idx
        return res

    attribute_dict = _attribute_list_to_dict(predictions_arff["attributes"])

    repeat_idx = attribute_dict["repeat"]
    fold_idx = attribute_dict["fold"]
    predicted_idx = attribute_dict["prediction"]  # Assume supervised task

    if task.task_type_id in (TaskType.SUPERVISED_CLASSIFICATION, TaskType.LEARNING_CURVE):
        correct_idx = attribute_dict["correct"]
    elif task.task_type_id == TaskType.SUPERVISED_REGRESSION:
        correct_idx = attribute_dict["truth"]
    has_samples = False
    if "sample" in attribute_dict:
        sample_idx = attribute_dict["sample"]
        has_samples = True

    # For classification, both columns are nominal: their declared label
    # lists must match, otherwise label indices below would be inconsistent.
    if (
        predictions_arff["attributes"][predicted_idx][1]
        != predictions_arff["attributes"][correct_idx][1]
    ):
        pred = predictions_arff["attributes"][predicted_idx][1]
        corr = predictions_arff["attributes"][correct_idx][1]
        raise ValueError(
            "Predicted and Correct do not have equal values:" f" {pred!s} Vs. {corr!s}",
        )

    # TODO: these could be cached
    values_predict: dict[int, dict[int, dict[int, list[float]]]] = {}
    values_correct: dict[int, dict[int, dict[int, list[float]]]] = {}
    # Bucket every prediction row by (repeat, fold, sample).
    for _line_idx, line in enumerate(predictions_arff["data"]):
        rep = line[repeat_idx]
        fold = line[fold_idx]
        samp = line[sample_idx] if has_samples else 0

        if task.task_type_id in [
            TaskType.SUPERVISED_CLASSIFICATION,
            TaskType.LEARNING_CURVE,
        ]:
            # Map nominal labels to their index in the declared label list.
            prediction = predictions_arff["attributes"][predicted_idx][1].index(
                line[predicted_idx],
            )
            correct = predictions_arff["attributes"][predicted_idx][1].index(line[correct_idx])
        elif task.task_type_id == TaskType.SUPERVISED_REGRESSION:
            prediction = line[predicted_idx]
            correct = line[correct_idx]
        if rep not in values_predict:
            values_predict[rep] = OrderedDict()
            values_correct[rep] = OrderedDict()
        if fold not in values_predict[rep]:
            values_predict[rep][fold] = OrderedDict()
            values_correct[rep][fold] = OrderedDict()
        if samp not in values_predict[rep][fold]:
            values_predict[rep][fold][samp] = []
            values_correct[rep][fold][samp] = []

        values_predict[rep][fold][samp].append(prediction)
        values_correct[rep][fold][samp].append(correct)

    scores = []
    # Score one value per (repeat, fold), using the largest sample
    # (i.e. the full training-set-size entry for learning curves).
    for rep in values_predict:
        for fold in values_predict[rep]:
            last_sample = len(values_predict[rep][fold]) - 1
            y_pred = values_predict[rep][fold][last_sample]
            y_true = values_correct[rep][fold][last_sample]
            scores.append(sklearn_fn(y_true, y_pred, **kwargs))
    return np.array(scores)

open_in_browser #

open_in_browser() -> None

Opens the OpenML web page corresponding to this object in your default browser.

Source code in openml/base.py
def open_in_browser(self) -> None:
    """Launch the default web browser on this object's OpenML.org page."""
    url = self.openml_url
    # Objects not yet uploaded to the server have no URL to open.
    if url is None:
        raise ValueError(
            "Cannot open element on OpenML.org when attribute `openml_url` is `None`",
        )
    webbrowser.open(url)

publish #

publish() -> OpenMLBase

Publish the object on the OpenML server.

Source code in openml/base.py
def publish(self) -> OpenMLBase:
    """Upload this object to the OpenML server and record the assigned id."""
    file_elements = self._get_file_elements()

    # The XML description is mandatory; generate it only when missing.
    if "description" not in file_elements:
        file_elements["description"] = self._to_xml()

    endpoint = f"{_get_rest_api_type_alias(self)}/"
    response_text = openml._api_calls._perform_api_call(
        endpoint,
        "post",
        file_elements=file_elements,
    )

    self._parse_publish_response(xmltodict.parse(response_text))
    return self

push_tag #

push_tag(tag: str) -> None

Annotates this entity with a tag on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def push_tag(self, tag: str) -> None:
    """Attach a tag to this entity on the OpenML server.

    Parameters
    ----------
    tag : str
        Tag to attach to the flow.
    """
    _tag_openml_base(self, tag)

remove_tag #

remove_tag(tag: str) -> None

Removes a tag from this entity on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def remove_tag(self, tag: str) -> None:
    """Detach a tag from this entity on the OpenML server.

    Parameters
    ----------
    tag : str
        Tag to attach to the flow.
    """
    _tag_openml_base(self, tag, untag=True)

to_filesystem #

to_filesystem(directory: str | Path, store_model: bool = True) -> None

The inverse of the from_filesystem method. Serializes a run on the filesystem, to be uploaded later.

Parameters#

directory : str a path leading to the folder where the results will be stored. Should be empty

store_model : bool, optional (default=True)

if True, a model will be pickled as well. As this is the most storage expensive part, it is often desirable to not store the model.

Source code in openml/runs/run.py
def to_filesystem(
    self,
    directory: str | Path,
    store_model: bool = True,  # noqa: FBT001, FBT002
) -> None:
    """
    Serialize this run into a folder on disk, the inverse of
    ``from_filesystem``.

    Parameters
    ----------
    directory : str
        a path leading to the folder where the results
        will be stored. Should be empty

    store_model : bool, optional (default=True)
        if True, a model will be pickled as well. As this is the most
        storage expensive part, it is often desirable to not store the
        model.
    """
    if self.data_content is None or self.model is None:
        raise ValueError("Run should have been executed (and contain model / predictions)")

    out_dir = Path(directory)
    out_dir.mkdir(exist_ok=True, parents=True)
    if any(out_dir.iterdir()):
        raise ValueError(f"Output directory {out_dir.expanduser().resolve()} should be empty")

    # Serialize both payloads up front so a failure leaves no partial files.
    description_xml = self._to_xml()
    predictions_arff = arff.dumps(self._generate_arff_dict())

    (out_dir / "description.xml").write_text(description_xml)
    (out_dir / "predictions.arff").write_text(predictions_arff)
    if store_model:
        with (out_dir / "model.pkl").open("wb") as model_fh:
            pickle.dump(self.model, model_fh)

    # Flows not yet on the server are serialized alongside the run.
    if self.flow_id is None and self.flow is not None:
        self.flow.to_filesystem(out_dir)

    if self.trace is not None:
        self.trace._to_filesystem(out_dir)

url_for_id classmethod #

url_for_id(id_: int) -> str

Return the OpenML URL for the object of the class entity with the given id.

Source code in openml/base.py
@classmethod
def url_for_id(cls, id_: int) -> str:
    """Build the web URL for this entity type with the given id (e.g. openml.org/f/123)."""
    base = openml.config.get_server_base_url()
    return f"{base}/{cls._entity_letter()}/{id_}"

OpenMLSetup #

OpenMLSetup(setup_id: int, flow_id: int, parameters: dict[int, Any] | None)

Setup object (a.k.a. Configuration).

Parameters#

setup_id : int The OpenML setup id flow_id : int The flow that it is build upon parameters : dict The setting of the parameters

Source code in openml/setups/setup.py
def __init__(self, setup_id: int, flow_id: int, parameters: dict[int, Any] | None):
    """Validate the argument types, then store them on the instance."""
    # Fail fast on wrong types; these ids come straight from the server API.
    if not isinstance(setup_id, int):
        raise ValueError("setup id should be int")
    if not isinstance(flow_id, int):
        raise ValueError("flow id should be int")
    if parameters is not None and not isinstance(parameters, dict):
        raise ValueError("parameters should be dict")

    self.setup_id = setup_id
    self.flow_id = flow_id
    self.parameters = parameters

OpenMLSplit #

OpenMLSplit(name: int | str, description: str, split: dict[int, dict[int, dict[int, tuple[ndarray, ndarray]]]])

OpenML Split object.

This class manages train-test splits for a dataset across multiple repetitions, folds, and samples.

Parameters#

name : int or str The name or ID of the split. description : str A description of the split. split : dict A dictionary containing the splits organized by repetition, fold, and sample.

Source code in openml/tasks/split.py
def __init__(
    self,
    name: int | str,
    description: str,
    split: dict[int, dict[int, dict[int, tuple[np.ndarray, np.ndarray]]]],
):
    """Build an OpenML split from a nested repetition -> fold -> sample mapping.

    Parameters
    ----------
    name : int or str
        The name or ID of the split.
    description : str
        A description of the split.
    split : dict
        Train/test index tuples organized by repetition, fold, and sample.

    Raises
    ------
    ValueError
        If ``split`` is empty, or if the number of folds differs between
        repetitions.
    """
    self.description = description
    self.name = name
    self.split: dict[int, dict[int, dict[int, tuple[np.ndarray, np.ndarray]]]] = {}

    # Normalize repetition keys to int (they may arrive as strings from the
    # parsed ARFF file) while copying the nested structure. Note we index the
    # incoming dict with the *original* key: indexing with the converted int
    # would raise KeyError for string keys.
    for repetition, folds in split.items():
        _rep = int(repetition)
        self.split[_rep] = OrderedDict()
        for fold, samples in folds.items():
            # Shallow-copy the sample mapping; the index arrays are shared.
            self.split[_rep][fold] = OrderedDict(samples)

    self.repeats = len(self.split)

    if not self.split:
        raise ValueError("A split must contain at least one repetition.")

    # Every repetition must contain the same number of folds.
    fold_counts = {len(folds) for folds in self.split.values()}
    if len(fold_counts) > 1:
        raise ValueError(
            f"Number of folds is not consistent across repetitions: {sorted(fold_counts)}",
        )

    # Derive dimensions from the first repetition/fold instead of assuming
    # that key 0 exists.
    first_folds = next(iter(self.split.values()))
    self.folds = len(first_folds)
    self.samples = len(next(iter(first_folds.values())))

get #

get(repeat: int = 0, fold: int = 0, sample: int = 0) -> tuple[ndarray, ndarray]

Returns the specified data split from the CrossValidationSplit object.

Parameters#

repeat : int Index of the repeat to retrieve. fold : int Index of the fold to retrieve. sample : int Index of the sample to retrieve.

Returns#

tuple of numpy.ndarray The train and test splits for the specified repeat, fold, and sample.

Raises#

ValueError If the specified repeat, fold, or sample is not known.

Source code in openml/tasks/split.py
def get(self, repeat: int = 0, fold: int = 0, sample: int = 0) -> tuple[np.ndarray, np.ndarray]:
    """Return the data split for one (repeat, fold, sample) combination.

    Parameters
    ----------
    repeat : int
        Index of the repeat to retrieve.
    fold : int
        Index of the fold to retrieve.
    sample : int
        Index of the sample to retrieve.

    Returns
    -------
    tuple of numpy.ndarray
        The train and test splits for the specified repeat, fold, and sample.

    Raises
    ------
    ValueError
        If the specified repeat, fold, or sample is not known.
    """
    # Walk the nested mapping one level at a time so each missing level
    # produces its own specific error message.
    repeats = self.split
    if repeat not in repeats:
        raise ValueError(f"Repeat {repeat!s} not known")
    folds = repeats[repeat]
    if fold not in folds:
        raise ValueError(f"Fold {fold!s} not known")
    samples = folds[fold]
    if sample not in samples:
        raise ValueError(f"Sample {sample!s} not known")
    return samples[sample]

OpenMLStudy #

OpenMLStudy(study_id: int | None, alias: str | None, benchmark_suite: int | None, name: str, description: str, status: str | None, creation_date: str | None, creator: int | None, tags: list[dict] | None, data: list[int] | None, tasks: list[int] | None, flows: list[int] | None, runs: list[int] | None, setups: list[int] | None)

Bases: BaseStudy

An OpenMLStudy represents the OpenML concept of a study (a collection of runs).

It contains the following information: name, id, description, creation date, creator id and a list of run ids.

According to this list of run ids, the study object receives a list of OpenML object ids (datasets, flows, tasks and setups).

Parameters#

study_id : int the study id alias : str (optional) a string ID, unique on server (url-friendly) benchmark_suite : int (optional) the benchmark suite (another study) upon which this study is run. can only be active if main entity type is runs. name : str the name of the study (meta-info) description : str brief description (meta-info) status : str Whether the study is in preparation, active or deactivated creation_date : str date of creation (meta-info) creator : int openml user id of the owner / creator tags : list(dict) The list of tags shows which tags are associated with the study. Each tag is a dict of (tag) name, window_start and write_access. data : list a list of data ids associated with this study tasks : list a list of task ids associated with this study flows : list a list of flow ids associated with this study runs : list a list of run ids associated with this study setups : list a list of setup ids associated with this study

Source code in openml/study/study.py
def __init__(  # noqa: PLR0913
    self,
    study_id: int | None,
    alias: str | None,
    benchmark_suite: int | None,
    name: str,
    description: str,
    status: str | None,
    creation_date: str | None,
    creator: int | None,
    tags: list[dict] | None,
    data: list[int] | None,
    tasks: list[int] | None,
    flows: list[int] | None,
    runs: list[int] | None,
    setups: list[int] | None,
):
    """Initialise a study: a BaseStudy whose main entity type is fixed to ``run``."""
    shared = {
        "study_id": study_id,
        "alias": alias,
        "benchmark_suite": benchmark_suite,
        "name": name,
        "description": description,
        "status": status,
        "creation_date": creation_date,
        "creator": creator,
        "tags": tags,
        "data": data,
        "tasks": tasks,
        "flows": flows,
        "runs": runs,
        "setups": setups,
    }
    # A study is a collection of runs, hence the fixed main entity type.
    super().__init__(main_entity_type="run", **shared)

id property #

id: int | None

Return the id of the study.

openml_url property #

openml_url: str | None

The URL of the object on the server, if it was uploaded, else None.

open_in_browser #

open_in_browser() -> None

Opens the OpenML web page corresponding to this object in your default browser.

Source code in openml/base.py
def open_in_browser(self) -> None:
    """Opens the OpenML web page corresponding to this object in your default browser."""
    url = self.openml_url
    # Objects that were never uploaded have no server-side page to show.
    if url is None:
        raise ValueError(
            "Cannot open element on OpenML.org when attribute `openml_url` is `None`",
        )
    webbrowser.open(url)

publish #

publish() -> OpenMLBase

Publish the object on the OpenML server.

Source code in openml/base.py
def publish(self) -> OpenMLBase:
    """Publish the object on the OpenML server."""
    upload_parts = self._get_file_elements()

    # The XML description is mandatory; generate it only when the subclass
    # did not already provide one.
    if "description" not in upload_parts:
        upload_parts["description"] = self._to_xml()

    endpoint = _get_rest_api_type_alias(self) + "/"
    reply = openml._api_calls._perform_api_call(
        endpoint,
        "post",
        file_elements=upload_parts,
    )
    # Let the concrete class pick its server-assigned id out of the response.
    self._parse_publish_response(xmltodict.parse(reply))
    return self

push_tag #

push_tag(tag: str) -> None

Add a tag to the study.

Source code in openml/study/study.py
def push_tag(self, tag: str) -> None:
    """Add a tag to the study.

    Raises
    ------
    NotImplementedError
        Always; the server does not support tagging studies yet.
    """
    message = "Tags for studies is not (yet) supported."
    raise NotImplementedError(message)

remove_tag #

remove_tag(tag: str) -> None

Remove a tag from the study.

Source code in openml/study/study.py
def remove_tag(self, tag: str) -> None:
    """Remove a tag from the study.

    Raises
    ------
    NotImplementedError
        Always; the server does not support tagging studies yet.
    """
    message = "Tags for studies is not (yet) supported."
    raise NotImplementedError(message)

url_for_id classmethod #

url_for_id(id_: int) -> str

Return the OpenML URL for the object of the class entity with the given id.

Source code in openml/base.py
@classmethod
def url_for_id(cls, id_: int) -> str:
    """Return the OpenML URL for the object of the class entity with the given id."""
    # Layout: <server base>/<entity letter>/<id>, e.g. openml.org/f/123 for a flow.
    return "/".join(
        [openml.config.get_server_base_url(), cls._entity_letter(), str(id_)],
    )

OpenMLSupervisedTask #

OpenMLSupervisedTask(task_type_id: TaskType, task_type: str, data_set_id: int, target_name: str, estimation_procedure_id: int = 1, estimation_procedure_type: str | None = None, estimation_parameters: dict[str, str] | None = None, evaluation_measure: str | None = None, data_splits_url: str | None = None, task_id: int | None = None)

Bases: OpenMLTask, ABC

OpenML Supervised Classification object.

Parameters#

task_type_id : TaskType ID of the task type. task_type : str Name of the task type. data_set_id : int ID of the OpenML dataset associated with the task. target_name : str Name of the target feature (the class variable). estimation_procedure_id : int, default=1 ID of the estimation procedure for the task. estimation_procedure_type : str, default=None Type of the estimation procedure for the task. estimation_parameters : dict, default=None Estimation parameters for the task. evaluation_measure : str, default=None Name of the evaluation measure for the task. data_splits_url : str, default=None URL of the data splits for the task. task_id: Union[int, None] Refers to the unique identifier of task.

Source code in openml/tasks/task.py
def __init__(  # noqa: PLR0913
    self,
    task_type_id: TaskType,
    task_type: str,
    data_set_id: int,
    target_name: str,
    estimation_procedure_id: int = 1,
    estimation_procedure_type: str | None = None,
    estimation_parameters: dict[str, str] | None = None,
    evaluation_measure: str | None = None,
    data_splits_url: str | None = None,
    task_id: int | None = None,
):
    """Initialise a supervised task; generic fields are handled by the task base class."""
    base_kwargs = {
        "task_id": task_id,
        "task_type_id": task_type_id,
        "task_type": task_type,
        "data_set_id": data_set_id,
        "estimation_procedure_id": estimation_procedure_id,
        "estimation_procedure_type": estimation_procedure_type,
        "estimation_parameters": estimation_parameters,
        "evaluation_measure": evaluation_measure,
        "data_splits_url": data_splits_url,
    }
    super().__init__(**base_kwargs)
    # The target column is the only state specific to supervised tasks.
    self.target_name = target_name

estimation_parameters property writable #

estimation_parameters: dict[str, str] | None

Return the estimation parameters for the task.

id property #

id: int | None

Return the OpenML ID of this task.

openml_url property #

openml_url: str | None

The URL of the object on the server, if it was uploaded, else None.

download_split #

download_split() -> OpenMLSplit

Download the OpenML split for a given task.

Source code in openml/tasks/task.py
def download_split(self) -> OpenMLSplit:
    """Return the task's data split, downloading and caching it if necessary."""
    # NOTE(review): assumes server-side tasks always carry an id — confirm
    # before relying on this for locally constructed tasks.
    assert self.task_id is not None
    split_path = _create_cache_directory_for_id("tasks", self.task_id) / "datasplits.arff"

    try:
        return OpenMLSplit._from_arff_file(split_path)
    except OSError:
        # Cache miss: fetch the split file from the server, then parse it.
        self._download_split(split_path)
        return OpenMLSplit._from_arff_file(split_path)

get_X_and_y #

get_X_and_y() -> tuple[DataFrame, Series | DataFrame | None]

Get data associated with the current task.

Returns#

tuple - X and y

Source code in openml/tasks/task.py
def get_X_and_y(self) -> tuple[pd.DataFrame, pd.Series | pd.DataFrame | None]:
    """Get data associated with the current task.

    Returns
    -------
    tuple - X and y

    """
    # The dataset is fetched before the task-type check, matching the
    # established call order.
    dataset = self.get_dataset()
    supported_types = (
        TaskType.SUPERVISED_CLASSIFICATION,
        TaskType.SUPERVISED_REGRESSION,
        TaskType.LEARNING_CURVE,
    )
    if self.task_type_id not in supported_types:
        raise NotImplementedError(self.task_type)

    features, targets, _, _ = dataset.get_data(target=self.target_name)
    return features, targets

get_dataset #

get_dataset(**kwargs: Any) -> OpenMLDataset

Download dataset associated with task.

Accepts the same keyword arguments as the openml.datasets.get_dataset.

Source code in openml/tasks/task.py
def get_dataset(self, **kwargs: Any) -> datasets.OpenMLDataset:
    """Download dataset associated with task.

    Accepts the same keyword arguments as the `openml.datasets.get_dataset`.
    """
    # Forward all keyword arguments unchanged to the datasets API.
    dataset_id = self.dataset_id
    return datasets.get_dataset(dataset_id, **kwargs)

get_split_dimensions #

get_split_dimensions() -> tuple[int, int, int]

Get the (repeats, folds, samples) of the split for a given task.

Source code in openml/tasks/task.py
def get_split_dimensions(self) -> tuple[int, int, int]:
    """Get the (repeats, folds, samples) of the split for a given task."""
    if self.split is None:
        # Lazily fetch (or load from cache) the split on first use.
        self.split = self.download_split()

    cached = self.split
    return cached.repeats, cached.folds, cached.samples

get_train_test_split_indices #

get_train_test_split_indices(fold: int = 0, repeat: int = 0, sample: int = 0) -> tuple[ndarray, ndarray]

Get the indices of the train and test splits for a given task.

Source code in openml/tasks/task.py
def get_train_test_split_indices(
    self,
    fold: int = 0,
    repeat: int = 0,
    sample: int = 0,
) -> tuple[np.ndarray, np.ndarray]:
    """Get the indices of the train and test splits for a given task."""
    split = self.split
    if split is None:
        # Lazily fetch the split (from cache or server) on first access.
        split = self.download_split()
        self.split = split

    return split.get(repeat=repeat, fold=fold, sample=sample)

open_in_browser #

open_in_browser() -> None

Opens the OpenML web page corresponding to this object in your default browser.

Source code in openml/base.py
def open_in_browser(self) -> None:
    """Opens the OpenML web page corresponding to this object in your default browser."""
    target = self.openml_url
    # An object that was never uploaded has no page on the server.
    if target is None:
        raise ValueError(
            "Cannot open element on OpenML.org when attribute `openml_url` is `None`",
        )
    webbrowser.open(target)

publish #

publish() -> OpenMLBase

Publish the object on the OpenML server.

Source code in openml/base.py
def publish(self) -> OpenMLBase:
    """Publish the object on the OpenML server."""
    payload = self._get_file_elements()

    # Generate the mandatory XML description only when missing.
    if "description" not in payload:
        payload["description"] = self._to_xml()

    api_path = f"{_get_rest_api_type_alias(self)}/"
    server_reply = openml._api_calls._perform_api_call(
        api_path,
        "post",
        file_elements=payload,
    )
    # The subclass extracts its new server-assigned id from the response.
    self._parse_publish_response(xmltodict.parse(server_reply))
    return self

push_tag #

push_tag(tag: str) -> None

Annotates this entity with a tag on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def push_tag(self, tag: str) -> None:
    """Annotates this entity with a tag on the server.

    Parameters
    ----------
    tag : str
        Tag to attach to the entity.
    """
    # Delegate to the shared tagging helper, which performs the server call.
    _tag_openml_base(self, tag)

remove_tag #

remove_tag(tag: str) -> None

Removes a tag from this entity on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def remove_tag(self, tag: str) -> None:
    """Removes a tag from this entity on the server.

    Parameters
    ----------
    tag : str
        Tag to remove from the entity.
    """
    # Same helper as push_tag; untag=True flips it to removal.
    _tag_openml_base(self, tag, untag=True)

url_for_id classmethod #

url_for_id(id_: int) -> str

Return the OpenML URL for the object of the class entity with the given id.

Source code in openml/base.py
@classmethod
def url_for_id(cls, id_: int) -> str:
    """Return the OpenML URL for the object of the class entity with the given id."""
    # Sample url for a flow: openml.org/f/123
    server = openml.config.get_server_base_url()
    entity = cls._entity_letter()
    return f"{server}/{entity}/{id_}"

OpenMLTask #

OpenMLTask(task_id: int | None, task_type_id: TaskType, task_type: str, data_set_id: int, estimation_procedure_id: int = 1, estimation_procedure_type: str | None = None, estimation_parameters: dict[str, str] | None = None, evaluation_measure: str | None = None, data_splits_url: str | None = None)

Bases: OpenMLBase

OpenML Task object.

Parameters#

task_id: Union[int, None] Refers to the unique identifier of OpenML task. task_type_id: TaskType Refers to the type of OpenML task. task_type: str Refers to the OpenML task. data_set_id: int Refers to the data. estimation_procedure_id: int Refers to the type of estimates used. estimation_procedure_type: str, default=None Refers to the type of estimation procedure used for the OpenML task. estimation_parameters: [Dict[str, str]], default=None Estimation parameters used for the OpenML task. evaluation_measure: str, default=None Refers to the evaluation measure. data_splits_url: str, default=None Refers to the URL of the data splits used for the OpenML task.

Source code in openml/tasks/task.py
def __init__(  # noqa: PLR0913
    self,
    task_id: int | None,
    task_type_id: TaskType,
    task_type: str,
    data_set_id: int,
    estimation_procedure_id: int = 1,
    estimation_procedure_type: str | None = None,
    estimation_parameters: dict[str, str] | None = None,
    evaluation_measure: str | None = None,
    data_splits_url: str | None = None,
):
    """Initialise an OpenML task description from its server-side fields."""
    # Basic identifiers; ids are normalised to int when present.
    self.task_id = None if task_id is None else int(task_id)
    self.task_type_id = task_type_id
    self.task_type = task_type
    self.dataset_id = int(data_set_id)
    self.evaluation_measure = evaluation_measure
    self.estimation_procedure_id = estimation_procedure_id
    # Grouped description of how performance estimates are obtained.
    self.estimation_procedure: _EstimationProcedure = {
        "type": estimation_procedure_type,
        "parameters": estimation_parameters,
        "data_splits_url": data_splits_url,
    }
    # The split itself is downloaded lazily on first access.
    self.split: OpenMLSplit | None = None

id property #

id: int | None

Return the OpenML ID of this task.

openml_url property #

openml_url: str | None

The URL of the object on the server, if it was uploaded, else None.

download_split #

download_split() -> OpenMLSplit

Download the OpenML split for a given task.

Source code in openml/tasks/task.py
def download_split(self) -> OpenMLSplit:
    """Return the OpenML split for this task, downloading it if not cached."""
    # NOTE(review): assumes the task has a server-assigned id — confirm for
    # locally constructed tasks.
    assert self.task_id is not None
    arff_path = _create_cache_directory_for_id("tasks", self.task_id) / "datasplits.arff"

    try:
        return OpenMLSplit._from_arff_file(arff_path)
    except OSError:
        # Not cached yet: download the split file, then parse it.
        self._download_split(arff_path)
        return OpenMLSplit._from_arff_file(arff_path)

get_dataset #

get_dataset(**kwargs: Any) -> OpenMLDataset

Download dataset associated with task.

Accepts the same keyword arguments as the openml.datasets.get_dataset.

Source code in openml/tasks/task.py
def get_dataset(self, **kwargs: Any) -> datasets.OpenMLDataset:
    """Download dataset associated with task.

    Accepts the same keyword arguments as the `openml.datasets.get_dataset`.
    """
    # Pure delegation; every keyword argument is passed through untouched.
    target_id = self.dataset_id
    return datasets.get_dataset(target_id, **kwargs)

get_split_dimensions #

get_split_dimensions() -> tuple[int, int, int]

Get the (repeats, folds, samples) of the split for a given task.

Source code in openml/tasks/task.py
def get_split_dimensions(self) -> tuple[int, int, int]:
    """Get the (repeats, folds, samples) of the split for a given task."""
    if self.split is None:
        # First access: fetch the split from cache or server.
        self.split = self.download_split()

    split = self.split
    return split.repeats, split.folds, split.samples

get_train_test_split_indices #

get_train_test_split_indices(fold: int = 0, repeat: int = 0, sample: int = 0) -> tuple[ndarray, ndarray]

Get the indices of the train and test splits for a given task.

Source code in openml/tasks/task.py
def get_train_test_split_indices(
    self,
    fold: int = 0,
    repeat: int = 0,
    sample: int = 0,
) -> tuple[np.ndarray, np.ndarray]:
    """Get the indices of the train and test splits for a given task."""
    cached = self.split
    if cached is None:
        # Populate the cache attribute on first use.
        cached = self.download_split()
        self.split = cached

    return cached.get(repeat=repeat, fold=fold, sample=sample)

open_in_browser #

open_in_browser() -> None

Opens the OpenML web page corresponding to this object in your default browser.

Source code in openml/base.py
def open_in_browser(self) -> None:
    """Opens the OpenML web page corresponding to this object in your default browser."""
    page = self.openml_url
    # No URL means the object was never uploaded to the server.
    if page is None:
        raise ValueError(
            "Cannot open element on OpenML.org when attribute `openml_url` is `None`",
        )
    webbrowser.open(page)

publish #

publish() -> OpenMLBase

Publish the object on the OpenML server.

Source code in openml/base.py
def publish(self) -> OpenMLBase:
    """Publish the object on the OpenML server."""
    elements = self._get_file_elements()

    # Every upload needs an XML description; build one if absent.
    if "description" not in elements:
        elements["description"] = self._to_xml()

    rest_path = _get_rest_api_type_alias(self) + "/"
    raw_response = openml._api_calls._perform_api_call(
        rest_path,
        "post",
        file_elements=elements,
    )
    # Parsing the response lets the concrete class record its new id.
    self._parse_publish_response(xmltodict.parse(raw_response))
    return self

push_tag #

push_tag(tag: str) -> None

Annotates this entity with a tag on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def push_tag(self, tag: str) -> None:
    """Annotates this entity with a tag on the server.

    Parameters
    ----------
    tag : str
        Tag to attach to the entity.
    """
    # The shared helper performs the actual server-side tagging call.
    _tag_openml_base(self, tag)

remove_tag #

remove_tag(tag: str) -> None

Removes a tag from this entity on the server.

Parameters#

tag : str Tag to attach to the flow.

Source code in openml/base.py
def remove_tag(self, tag: str) -> None:
    """Removes a tag from this entity on the server.

    Parameters
    ----------
    tag : str
        Tag to remove from the entity.
    """
    # untag=True turns the shared tagging helper into a removal.
    _tag_openml_base(self, tag, untag=True)

url_for_id classmethod #

url_for_id(id_: int) -> str

Return the OpenML URL for the object of the class entity with the given id.

Source code in openml/base.py
@classmethod
def url_for_id(cls, id_: int) -> str:
    """Return the OpenML URL for the object of the class entity with the given id."""
    # Each entity type maps to a single-letter path segment, e.g. "f" for flows.
    return "/".join(
        (openml.config.get_server_base_url(), cls._entity_letter(), str(id_)),
    )

populate_cache #

populate_cache(task_ids: list[int] | None = None, dataset_ids: list[int | str] | None = None, flow_ids: list[int] | None = None, run_ids: list[int] | None = None) -> None

Populate a cache for offline and parallel usage of the OpenML connector.

Parameters#

task_ids : iterable

dataset_ids : iterable

flow_ids : iterable

run_ids : iterable

Returns#

None

Source code in openml/__init__.py
def populate_cache(
    task_ids: list[int] | None = None,
    dataset_ids: list[int | str] | None = None,
    flow_ids: list[int] | None = None,
    run_ids: list[int] | None = None,
) -> None:
    """
    Populate a cache for offline and parallel usage of the OpenML connector.

    Fetching each entity once stores it in the local cache as a side effect.

    Parameters
    ----------
    task_ids : iterable

    dataset_ids : iterable

    flow_ids : iterable

    run_ids : iterable

    Returns
    -------
    None
    """
    # (ids, fetcher) pairs processed in order: tasks, datasets, flows, runs.
    # The lambdas defer the module-attribute lookup until a fetch actually
    # happens, so nothing is touched for id lists that are None.
    download_plan = (
        (task_ids, lambda i: tasks.functions.get_task(i)),
        (dataset_ids, lambda i: datasets.functions.get_dataset(i)),
        (flow_ids, lambda i: flows.functions.get_flow(i)),
        (run_ids, lambda i: runs.functions.get_run(i)),
    )
    for entity_ids, fetch_one in download_plan:
        if entity_ids is None:
            continue
        for entity_id in entity_ids:
            fetch_one(entity_id)