functions
openml.datasets.functions#
attributes_arff_from_df#
Describe the attributes of a dataframe according to the ARFF specification.
Parameters#
df : DataFrame, shape (n_samples, n_features)
The dataframe containing the data set.
Returns#
attributes_arff : list[str]
The data set attributes as required by the ARFF format.
Source code in openml/datasets/functions.py
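For example, a minimal sketch; the column names and the exact type strings in the comment are illustrative:

```python
import pandas as pd
from openml.datasets.functions import attributes_arff_from_df

# A small frame mixing a numeric and a categorical column.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7],
        "species": pd.Categorical(["setosa", "setosa", "versicolor"]),
    }
)

attributes = attributes_arff_from_df(df)
# One (name, type) entry per column, roughly:
# [('sepal_length', 'REAL'), ('species', ['setosa', 'versicolor'])]
print(attributes)
```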
check_datasets_active#
check_datasets_active(dataset_ids: list[int], raise_error_if_not_exist: bool = True) -> dict[int, bool]
Check if the dataset ids provided are active.
Raises an error if a dataset_id in the given list of dataset_ids does not exist on the server and raise_error_if_not_exist is set to True (default).
Parameters#
dataset_ids : list[int]
A list of integers representing dataset ids.
raise_error_if_not_exist : bool (default=True)
If True, raise an error when one or more of the given dataset ids do not exist on the server.
Returns#
dict
A dictionary with items {did: bool}.
Source code in openml/datasets/functions.py
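For example (a minimal sketch; the ids below are illustrative):

```python
import openml

# With raise_error_if_not_exist=False, ids missing from the server
# map to False instead of raising an exception.
active = openml.datasets.check_datasets_active(
    [2, 61], raise_error_if_not_exist=False
)
print(active)  # e.g. {2: True, 61: True}
```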
create_dataset#
create_dataset(name: str, description: str | None, creator: str | None, contributor: str | None, collection_date: str | None, language: str | None, licence: str | None, attributes: list[tuple[str, str | list[str]]] | dict[str, str | list[str]] | Literal['auto'], data: DataFrame | ndarray | coo_matrix, default_target_attribute: str, ignore_attribute: str | list[str] | None, citation: str, row_id_attribute: str | None = None, original_data_url: str | None = None, paper_url: str | None = None, update_comment: str | None = None, version_label: str | None = None) -> OpenMLDataset
Create a dataset.
This function creates an OpenMLDataset object, which contains information related to the dataset and the actual data file.
Parameters#
name : str
Name of the dataset.
description : str
Description of the dataset.
creator : str
The person who created the dataset.
contributor : str
People who contributed to the current version of the dataset.
collection_date : str
The date the data was originally collected, given by the uploader.
language : str
Language in which the data is represented.
Starts with 1 upper case letter, rest lower case, e.g. 'English'.
licence : str
License of the data.
attributes : list, dict, or 'auto'
A list of tuples. Each tuple consists of the attribute name and type.
If passing a pandas DataFrame, the attributes can be automatically
inferred by passing 'auto'. Specific attributes can be manually
specified by passing a dictionary where the key is the name of the
attribute and the value is the data type of the attribute.
data : ndarray, list, dataframe, coo_matrix, shape (n_samples, n_features)
An array that contains both the attributes and the targets. When
providing a dataframe, the attribute names and types can be inferred by
passing attributes='auto'.
The target feature is indicated as meta-data of the dataset.
default_target_attribute : str
The default target attribute, if it exists.
Can have multiple values, comma separated.
ignore_attribute : str | list
Attributes that should be excluded in modelling,
such as identifiers and indexes.
Can have multiple values, comma separated.
citation : str
Reference(s) that should be cited when building on this data.
version_label : str, optional
Version label provided by user.
Can be a date, hash, or some other type of id.
row_id_attribute : str, optional
The attribute that represents the row-id column, if present in the
dataset. If data is a dataframe and row_id_attribute is not
specified, the index of the dataframe will be used as the
row_id_attribute. If the name of the index is None, it will
be discarded.
.. versionadded:: 0.8
Inference of ``row_id_attribute`` from a dataframe.
original_data_url : str, optional
For derived data, the url to the original dataset.
paper_url : str, optional
Link to a paper describing the dataset.
update_comment : str, optional
An explanation for when the dataset is uploaded.
Returns#
openml.OpenMLDataset
Dataset description.
Source code in openml/datasets/functions.py
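A minimal sketch of creating a dataset from a pandas DataFrame with attributes='auto'; the names, values, and metadata below are illustrative, and publishing requires an API key (the example configuration points at the test server):

```python
import pandas as pd
import openml

# Use the OpenML test server so this sketch does not touch production data.
openml.config.start_using_configuration_for_example()

df = pd.DataFrame(
    {"x1": [0.1, 0.2, 0.3], "x2": [1, 0, 1], "target": ["a", "b", "a"]}
)
df["target"] = df["target"].astype("category")  # nominal target for ARFF

dataset = openml.datasets.create_dataset(
    name="toy-example",
    description="A three-row toy dataset illustrating create_dataset.",
    creator="Jane Doe",
    contributor=None,
    collection_date="2024-01-01",
    language="English",
    licence="CC0",
    attributes="auto",  # infer attribute names and types from the dataframe
    data=df,
    default_target_attribute="target",
    ignore_attribute=None,
    citation="No citation; toy data.",
)
# The object exists only locally; dataset.publish() would upload it.
```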
data_feature_add_ontology#
Add an ontology (URL) to a given dataset feature (defined by a dataset id and index). An ontology describes the concept represented by a feature and is defined by a URL where the information is provided. The dataset has to exist on OpenML and needs to have been processed by the evaluation engine.
Parameters#
data_id : int
id of the dataset to which the feature belongs
index : int
index of the feature in dataset (0-based)
ontology : str
URL to ontology (max. 256 characters)
Returns#
bool
True on success; otherwise an OpenML server exception is raised.
Source code in openml/datasets/functions.py
data_feature_remove_ontology#
Remove an existing ontology (URL) from a given dataset feature (defined by a dataset id and index). The dataset has to exist on OpenML and needs to have been processed by the evaluation engine. The ontology needs to be attached to the specific feature.
Parameters#
data_id : int
id of the dataset to which the feature belongs
index : int
index of the feature in dataset (0-based)
ontology : str
URL to ontology (max. 256 characters)
Returns#
bool
True on success; otherwise an OpenML server exception is raised.
Source code in openml/datasets/functions.py
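A minimal sketch covering both data_feature_add_ontology and data_feature_remove_ontology; the dataset id, feature index, and ontology URL are hypothetical:

```python
from openml.datasets.functions import (
    data_feature_add_ontology,
    data_feature_remove_ontology,
)

# Hypothetical values: the dataset must exist on OpenML and have been
# processed by the evaluation engine.
ontology_url = "http://example.org/ontology/sepal-length"  # placeholder URL
data_feature_add_ontology(data_id=61, index=0, ontology=ontology_url)

# Removal requires the same (dataset, feature, ontology) triple.
data_feature_remove_ontology(data_id=61, index=0, ontology=ontology_url)
```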
delete_dataset#
Delete dataset with id dataset_id from the OpenML server.
This can only be done if you are the owner of the dataset and no tasks are attached to the dataset.
Parameters#
dataset_id : int
OpenML id of the dataset
Returns#
bool
True if the deletion was successful, False otherwise.
Source code in openml/datasets/functions.py
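For example (the id is hypothetical; you must own the dataset and it must have no attached tasks):

```python
import openml

success = openml.datasets.delete_dataset(12345)  # hypothetical id you own
print("deleted" if success else "not deleted")
```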
edit_dataset#
edit_dataset(data_id: int, description: str | None = None, creator: str | None = None, contributor: str | None = None, collection_date: str | None = None, language: str | None = None, default_target_attribute: str | None = None, ignore_attribute: str | list[str] | None = None, citation: str | None = None, row_id_attribute: str | None = None, original_data_url: str | None = None, paper_url: str | None = None) -> int
Edits an OpenMLDataset.
In addition to providing the id of the dataset to edit (through data_id), you must specify a value for at least one of the optional function arguments, i.e. at least one field to edit.
This function allows editing of both non-critical and critical fields. Critical fields are default_target_attribute, ignore_attribute, row_id_attribute.
- Editing non-critical data fields is allowed for all authenticated users.
- Editing critical fields is allowed only for the owner, provided there are no tasks associated with this dataset.
If the dataset has tasks or the user is not the owner, the only way to edit critical fields is to use fork_dataset followed by edit_dataset.
Parameters#
data_id : int
ID of the dataset.
description : str
Description of the dataset.
creator : str
The person who created the dataset.
contributor : str
People who contributed to the current version of the dataset.
collection_date : str
The date the data was originally collected, given by the uploader.
language : str
Language in which the data is represented.
Starts with 1 upper case letter, rest lower case, e.g. 'English'.
default_target_attribute : str
The default target attribute, if it exists.
Can have multiple values, comma separated.
ignore_attribute : str | list
Attributes that should be excluded in modelling,
such as identifiers and indexes.
citation : str
Reference(s) that should be cited when building on this data.
row_id_attribute : str, optional
The attribute that represents the row-id column, if present in the
dataset. If data is a dataframe and row_id_attribute is not
specified, the index of the dataframe will be used as the
row_id_attribute. If the name of the index is None, it will
be discarded.
.. versionadded:: 0.8
Inference of ``row_id_attribute`` from a dataframe.
original_data_url : str, optional
For derived data, the url to the original dataset.
paper_url : str, optional
Link to a paper describing the dataset.
Returns#
int
Dataset id of the edited dataset.
Source code in openml/datasets/functions.py
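A minimal sketch of editing a non-critical field; the dataset id and description are illustrative:

```python
import openml

# Any authenticated user may edit non-critical fields such as the description.
data_id = openml.datasets.edit_dataset(
    data_id=128,  # hypothetical id
    description="Corrected description with proper attribution.",
)
print(data_id)
```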
fork_dataset#
Creates a new dataset version, with the authenticated user as the new owner. The forked dataset can have distinct dataset meta-data, but the actual data itself is shared with the original version.
This API is intended for use when a user is unable to edit the critical fields of a dataset through the edit_dataset API. (Critical fields are default_target_attribute, ignore_attribute, row_id_attribute.)
Specifically, this happens when the user is:
1. Not the owner of the dataset.
2. The owner of the dataset, but the dataset has tasks.
In these two cases, the only way to edit critical fields is:
1. Fork the dataset using the fork_dataset API.
2. Call the edit_dataset API on the forked version.
Parameters#
data_id : int
id of the dataset to be forked
Returns#
Dataset id of the forked dataset
Source code in openml/datasets/functions.py
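A minimal sketch of the two-step fork-then-edit workflow described above; the ids and target name are illustrative:

```python
import openml

# Step 1: fork the dataset; the authenticated user owns the new version.
forked_id = openml.datasets.fork_dataset(1)  # hypothetical source id

# Step 2: edit a critical field on the fork, which has no tasks yet.
openml.datasets.edit_dataset(forked_id, default_target_attribute="class")
```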
get_dataset#
get_dataset(dataset_id: int | str, download_data: bool = False, version: int | None = None, error_if_multiple: bool = False, cache_format: Literal['pickle', 'feather'] = 'pickle', download_qualities: bool = False, download_features_meta_data: bool = False, download_all_files: bool = False, force_refresh_cache: bool = False) -> OpenMLDataset
Download the OpenML dataset representation, optionally also downloading the actual data file.
This function is by default NOT thread/multiprocessing safe, as it uses caching. A check will be performed to determine if the information has previously been downloaded to a cache; if so, it is loaded from disk instead of retrieved from the server.
To make this function thread safe, you can install the python package oslo.concurrency. If oslo.concurrency is installed, get_dataset becomes thread safe.
Alternatively, to make this function thread/multiprocessing safe, initialize the cache first by calling get_dataset(args) once before calling get_dataset(args) many times in parallel. This will initialize the cache, and later calls will use the cache in a thread/multiprocessing safe way.
If the dataset is retrieved by name, a version may be specified.
If no version is specified and multiple versions of the dataset exist, the earliest version of the dataset that is still active will be returned.
If no version is specified, multiple versions of the dataset exist, and error_if_multiple is set to True, this function will raise an exception.
Parameters#
dataset_id : int or str
Dataset ID (integer) or dataset name (string) of the dataset to download.
download_data : bool (default=False)
If True, also download the data file. Beware that some datasets are large and it might
make the operation noticeably slower. Metadata is also still retrieved.
If False, create the OpenMLDataset and only populate it with the metadata.
The data may later be retrieved through the OpenMLDataset.get_data method.
version : int, optional (default=None)
Specifies the version if dataset_id is specified by name.
If no version is specified, retrieve the earliest still active version.
error_if_multiple : bool (default=False)
If True, raise an error if multiple datasets are found with matching criteria.
cache_format : str (default='pickle') in {'pickle', 'feather'}
Format for caching the dataset; may be feather or pickle.
Note that the default 'pickle' option may load slower than feather when
the number of rows is very high.
download_qualities : bool (default=False)
Option to download 'qualities' meta-data in addition to the minimal dataset description.
If True, download and cache the qualities file.
If False, create the OpenMLDataset without qualities metadata. The data may later be added
to the OpenMLDataset through the OpenMLDataset.load_metadata(qualities=True) method.
download_features_meta_data : bool (default=False)
Option to download 'features' meta-data in addition to the minimal dataset description.
If True, download and cache the features file.
If False, create the OpenMLDataset without features metadata. The data may later be added
to the OpenMLDataset through the OpenMLDataset.load_metadata(features=True) method.
download_all_files : bool (default=False)
EXPERIMENTAL. Download all files related to the dataset that reside on the server.
Useful for datasets which refer to auxiliary files (e.g., meta-album).
force_refresh_cache : bool (default=False)
Force the cache to refresh by deleting the cache directory and re-downloading the data.
Note that if force_refresh_cache is True, get_dataset is NOT thread/multiprocessing safe,
because deleting and re-creating the cache creates a race condition (as with the cache in general).
Returns#
dataset : openml.OpenMLDataset
The downloaded dataset.
Source code in openml/datasets/functions.py
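For example (a minimal sketch; by default only metadata is fetched, and the data file is retrieved lazily via get_data):

```python
import openml

dataset = openml.datasets.get_dataset("iris", version=1)

# get_data downloads the data file on first use and returns the features,
# the target, a categorical indicator per column, and the attribute names.
X, y, categorical_indicator, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute
)
print(X.shape, attribute_names[:3])
```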
get_datasets#
get_datasets(dataset_ids: list[str | int], download_data: bool = False, download_qualities: bool = False) -> list[OpenMLDataset]
Download datasets.
This function iterates over openml.datasets.get_dataset.
Parameters#
dataset_ids : iterable
Integers or strings representing dataset ids or dataset names.
If dataset names are specified, the least recent still active dataset version is returned.
download_data : bool, optional
If True, also download the data file. Beware that some datasets are large and it might
make the operation noticeably slower. Metadata is also still retrieved.
If False, create the OpenMLDataset and only populate it with the metadata.
The data may later be retrieved through the OpenMLDataset.get_data method.
download_qualities : bool, optional (default=False)
If True, also download the qualities.xml file. If False, skip it.
Returns#
datasets : list of datasets
A list of dataset objects.
Source code in openml/datasets/functions.py
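For example (illustrative ids; only metadata is downloaded unless download_data=True):

```python
import openml

datasets = openml.datasets.get_datasets([2, 61])
for ds in datasets:
    print(ds.dataset_id, ds.name)
```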
list_datasets#
list_datasets(data_id: list[int] | None = None, offset: int | None = None, size: int | None = None, status: str | None = None, tag: str | None = None, data_name: str | None = None, data_version: int | None = None, number_instances: int | str | None = None, number_features: int | str | None = None, number_classes: int | str | None = None, number_missing_values: int | str | None = None) -> DataFrame
Return a dataframe of all datasets which are on OpenML.
Supports a large number of results.
Parameters#
data_id : list, optional
A list of data ids, to specify which datasets should be listed
offset : int, optional
The number of datasets to skip, starting from the first.
size : int, optional
The maximum number of datasets to show.
status : str, optional
Should be {active, in_preparation, deactivated}. By default active datasets are returned, but datasets with another status can also be requested.
tag : str, optional
data_name : str, optional
data_version : int, optional
number_instances : int | str, optional
number_features : int | str, optional
number_classes : int | str, optional
number_missing_values : int | str, optional
Returns#
datasets : dataframe
Each row maps to a dataset. Each column contains the following information:
- dataset id
- name
- format
- status
If qualities are calculated for the dataset, some of these are also included as columns.
Source code in openml/datasets/functions.py
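A minimal sketch; the "low..high" range-string form and the column names selected in the final line are assumptions based on the server's filter syntax and typical listing output:

```python
import openml

# Filters such as number_instances accept a single value or a
# "low..high" range string passed through to the server.
df = openml.datasets.list_datasets(
    status="active",
    size=100,
    number_instances="500..1000",
)
print(df[["did", "name", "NumberOfInstances"]].head())
```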
list_qualities#
Return list of data qualities available.
The function performs an API call to retrieve the entire list of data qualities that are computed on the datasets uploaded.
Returns#
list
Source code in openml/datasets/functions.py
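For example:

```python
import openml

qualities = openml.datasets.list_qualities()
print(len(qualities), qualities[:5])
```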
status_update#
Updates the status of a dataset to either 'active' or 'deactivated'. Please see the OpenML API documentation for a description of the status and all legal status transitions: docs.openml.org/concepts/data/#dataset-status
Parameters#
data_id : int
The data id of the dataset
status : str
'active' or 'deactivated'
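For example (the id is hypothetical; only legal status transitions are accepted by the server):

```python
import openml

openml.datasets.status_update(data_id=12345, status="deactivated")
```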