# dataset

`openml.datasets.dataset`

## OpenMLDataset
`OpenMLDataset(name: str, description: str | None, data_format: Literal['arff', 'sparse_arff'] = 'arff', cache_format: Literal['feather', 'pickle'] = 'pickle', dataset_id: int | None = None, version: int | None = None, creator: str | None = None, contributor: str | None = None, collection_date: str | None = None, upload_date: str | None = None, language: str | None = None, licence: str | None = None, url: str | None = None, default_target_attribute: str | None = None, row_id_attribute: str | None = None, ignore_attribute: str | list[str] | None = None, version_label: str | None = None, citation: str | None = None, tag: str | None = None, visibility: str | None = None, original_data_url: str | None = None, paper_url: str | None = None, update_comment: str | None = None, md5_checksum: str | None = None, data_file: str | None = None, features_file: str | None = None, qualities_file: str | None = None, dataset: str | None = None, parquet_url: str | None = None, parquet_file: str | None = None)`
Bases: OpenMLBase
Dataset object.
Allows fetching and uploading datasets to OpenML.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `name` | `str` | Name of the dataset. |
| `description` | `str \| None` | Description of the dataset. |
| `data_format` | `Literal['arff', 'sparse_arff']` | Format of the dataset, either 'arff' or 'sparse_arff'. Defaults to 'arff'. |
| `cache_format` | `Literal['feather', 'pickle']` | Format for caching the dataset, either 'feather' or 'pickle'. Defaults to 'pickle'. |
| `dataset_id` | `int \| None` | Id autogenerated by the server. |
| `version` | `int \| None` | Version of this dataset. '1' for the original version. Auto-incremented by the server. |
| `creator` | `str \| None` | The person who created the dataset. |
| `contributor` | `str \| None` | People who contributed to the current version of the dataset. |
| `collection_date` | `str \| None` | The date the data was originally collected, given by the uploader. |
| `upload_date` | `str \| None` | The date-time when the dataset was uploaded, generated by the server. |
| `language` | `str \| None` | Language in which the data is represented. Starts with one upper-case letter, rest lower case, e.g. 'English'. |
| `licence` | `str \| None` | License of the data. |
| `url` | `str \| None` | Valid URL that points to the actual data file. The file can be on the OpenML server or another dataset repository. |
| `default_target_attribute` | `str \| None` | The default target attribute, if it exists. Can have multiple values, comma separated. |
| `row_id_attribute` | `str \| None` | The attribute that represents the row-id column, if present in the dataset. |
| `ignore_attribute` | `str \| list[str] \| None` | Attributes that should be excluded from modelling, such as identifiers and indexes. |
| `version_label` | `str \| None` | Version label provided by the user. Can be a date, hash, or some other type of id. |
| `citation` | `str \| None` | Reference(s) that should be cited when building on this data. |
| `tag` | `str \| None` | Tags, describing the algorithms. |
| `visibility` | `str \| None` | Who can see the dataset. Typical values: 'Everyone', 'All my friends', 'Only me'. Can also be any of the user's circles. |
| `original_data_url` | `str \| None` | For derived data, the URL of the original dataset. |
| `paper_url` | `str \| None` | Link to a paper describing the dataset. |
| `update_comment` | `str \| None` | An explanation accompanying the dataset upload. |
| `md5_checksum` | `str \| None` | MD5 checksum to check that the dataset was downloaded without corruption. |
| `data_file` | `str \| None` | Path to where the dataset is located. |
| `features_file` | `str \| None` | A dictionary of dataset features, which maps a feature index to an OpenMLDataFeature. |
| `qualities_file` | `str \| None` | A dictionary of dataset qualities, which maps a quality name to a quality value. |
| `dataset` | `str \| None` | Serialized ARFF dataset string. |
| `parquet_url` | `str \| None` | The URL of the storage location where the dataset files are hosted. This can be a MinIO bucket URL. If specified, the data will be accessed from this URL when reading the files. |
| `parquet_file` | `str \| None` | Path to the local parquet file. |
Source code in openml/datasets/dataset.py
### openml_url `property`
The URL of the object on the server, if it was uploaded, else None.
### get_data

`get_data(target: list[str] | str | None = None, include_row_id: bool = False, include_ignore_attribute: bool = False) -> tuple[DataFrame, Series | None, list[bool], list[str]]`
Returns dataset content as dataframes.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `target` | `list[str] \| str \| None` | Name of the target column to separate from the data. Splitting multiple columns is currently not supported. Defaults to None. |
| `include_row_id` | `bool` | Whether to include row ids in the returned dataset. Defaults to False. |
| `include_ignore_attribute` | `bool` | Whether to include columns that are marked as "ignore" on the server. Defaults to False. |

| RETURNS | TYPE | DESCRIPTION |
|---|---|---|
| `X` | `DataFrame` | Dataset; may have sparse dtypes in the columns if required. |
| `y` | `Series \| None` | Target column. |
| `categorical_indicator` | `list[bool]` | Mask that indicates categorical features. |
| `attribute_names` | `list[str]` | List of attribute names. |
Source code in openml/datasets/dataset.py
### get_features_by_type

`get_features_by_type(data_type: str, exclude: list[str] | None = None, exclude_ignore_attribute: bool = True, exclude_row_id_attribute: bool = True) -> list[int]`
Return indices of features of a given type, e.g. all nominal features. Optional parameters to exclude various features by index or ontology.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `data_type` | `str` | The data type to return (e.g., nominal, numeric, date, string). |
| `exclude` | `list[str] \| None` | List of columns to exclude from the return value. |
| `exclude_ignore_attribute` | `bool` | Whether to exclude the defined ignore attributes (and adapt the return values as if these indices are not present). Defaults to True. |
| `exclude_row_id_attribute` | `bool` | Whether to exclude the defined row-id attributes (and adapt the return values as if these indices are not present). Defaults to True. |

| RETURNS | TYPE | DESCRIPTION |
|---|---|---|
| `result` | `list[int]` | A list of indices that have the specified data type. |
Source code in openml/datasets/dataset.py
### open_in_browser
Opens the OpenML web page corresponding to this object in your default browser.
Source code in openml/base.py
### publish

`publish() -> OpenMLBase`
Publish the object on the OpenML server.
Source code in openml/base.py
### push_tag

Annotates this entity with a tag on the server.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `tag` | `str` | Tag to attach to the entity. |
### remove_tag

Removes a tag from this entity on the server.

| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `tag` | `str` | Tag to remove from the entity. |
### retrieve_class_labels

Reads the dataset's ARFF to determine the class labels.

If the task has no class labels (for example a regression problem) it returns None. Necessary because the data returned by get_data only contains the indices of the classes, while OpenML needs the real class name when uploading the results of a run.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `target_name` | `str` | Name of the target attribute. |

| RETURNS | TYPE | DESCRIPTION |
|---|---|---|
| `list` | `list \| None` | The class labels, or None if the target has no class labels (e.g. a regression problem). |
Source code in openml/datasets/dataset.py
### url_for_id `classmethod`
Return the OpenML URL for the object of the class entity with the given id.