Flows and Runs

How to train/run a model and how to upload the results.

# License: BSD 3-Clause

import openml
from sklearn import compose, ensemble, impute, neighbors, preprocessing, pipeline, tree

Train machine learning models

Train a scikit-learn model on the data manually.

Warning

This example uploads data. For that reason, it connects to the test server at test.openml.org. This prevents the main server from being crowded with example datasets, tasks, runs, and so on.

openml.config.start_using_configuration_for_example()
# NOTE: We are using dataset 68 from the test server: https://test.openml.org/d/68
dataset = openml.datasets.get_dataset(68)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format='array',
    target=dataset.default_target_attribute
)
clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)

Out:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

You can also ask for meta-data to automatically preprocess the data.

  • e.g. categorical features -> do feature encoding

dataset = openml.datasets.get_dataset(17)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format='array',
    target=dataset.default_target_attribute
)
print(f"Categorical features: {categorical_indicator}")
transformer = compose.ColumnTransformer(
    [('one_hot_encoder', preprocessing.OneHotEncoder(categories='auto'), categorical_indicator)])
X = transformer.fit_transform(X)
clf.fit(X, y)

Out:

Categorical features: [True, False, True, True, False, True, True, False, True, True, False, True, False, True, True, False, True, False, True, True]

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
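Conceptually, one-hot encoding replaces each categorical column with one binary indicator column per observed category. A minimal pure-Python sketch of that mapping (illustrative only; the tutorial itself relies on scikit-learn's OneHotEncoder, which also handles sparsity and unseen categories):

```python
def one_hot_encode(column):
    """Map a categorical column to one binary indicator vector per value."""
    categories = sorted(set(column))  # fixed category order, akin to categories='auto'
    return [[1 if value == cat else 0 for cat in categories] for value in column]

# Categories sort to ['blue', 'green', 'red'], so 'red' becomes [0, 0, 1].
encoded = one_hot_encode(["red", "blue", "red", "green"])
print(encoded)  # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```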

Runs: Easily explore models

We can run (many) scikit-learn algorithms on (many) OpenML tasks.

# Get a task
task = openml.tasks.get_task(403)

# Build any classifier or pipeline
clf = tree.ExtraTreeClassifier()

# Run the flow
run = openml.runs.run_model_on_task(clf, task)

print(run)

Out:

OpenML Run
==========
Uploader Name: None
Metric.......: None
Run ID.......: None
Task ID......: 403
Task Type....: None
Task URL.....: https://www.openml.org/t/403
Flow ID......: 24258
Flow Name....: sklearn.tree.tree.ExtraTreeClassifier
Flow URL.....: https://www.openml.org/f/24258
Setup ID.....: None
Setup String.: Python_3.7.6. Sklearn_0.21.2. NumPy_1.18.1. SciPy_1.4.1. ExtraTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                    max_features='auto', max_leaf_nodes=None,
                    min_impurity_decrease=0.0, min_impurity_split=None,
                    min_samples_leaf=1, min_samples_split=2,
                    min_weight_fraction_leaf=0.0, random_state=9840,
                    splitter='random')
Dataset ID...: 68
Dataset URL..: https://www.openml.org/d/68
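What makes a run reproducible is that the task, not the client, defines the evaluation splits (for classification tasks this is typically 10-fold cross-validation), and run_model_on_task trains and evaluates on exactly those splits. A minimal sketch of k-fold index generation, purely for illustration (OpenML tasks ship their precomputed splits with the task, so clients do not re-derive them like this):

```python
def k_fold_indices(n_samples, n_folds):
    """Partition sample indices into n_folds contiguous (train, test) splits."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))       # held-out fold
        test_set = set(test)
        train = [i for i in range(n_samples) if i not in test_set]
        folds.append((train, test))
        start += size
    return folds

for train, test in k_fold_indices(10, 3):
    print(len(train), len(test))  # 6 4 / 7 3 / 7 3
```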

Share the run on the OpenML server

So far the run is only available locally. By calling the publish function, the run is sent to the OpenML server:

myrun = run.publish()
# For this tutorial, our configuration publishes to the test server
# so as not to pollute the main server.
print("Uploaded to http://test.openml.org/r/" + str(myrun.run_id))

Out:

Uploaded to http://test.openml.org/r/37836

We can now also inspect the flow object which was automatically created:

flow = openml.flows.get_flow(run.flow_id)
print(flow)

Out:

OpenML Flow
===========
Flow ID.........: 24258 (version 4)
Flow URL........: https://www.openml.org/f/24258
Flow Name.......: sklearn.tree.tree.ExtraTreeClassifier
Flow Description: An extremely randomized tree classifier.

Extra-trees differ from classic decision trees in the way they are built.
When looking for the best split to separate the samples of a node into two
groups, random splits are drawn for each of the `max_features` randomly
selected features and the best split among those is chosen. When
`max_features` is set 1, this amounts to building a totally random
decision tree.

Warning: Extra-trees should only be used within ensemble methods.
Upload Date.....: 2019-11-07 11:10:01
Dependencies....: sklearn==0.21.2
numpy>=1.6.1
scipy>=0.9

It also works with pipelines

When you need to handle 'dirty' data, build pipelines to model them automatically.

task = openml.tasks.get_task(1)
features = task.get_dataset().features
nominal_feature_indices = [
    i for i in range(len(features))
    if features[i].name != task.target_name and features[i].data_type == 'nominal'
]
pipe = pipeline.Pipeline(steps=[
    (
        'Preprocessing',
        compose.ColumnTransformer([
            ('Nominal', pipeline.Pipeline(
                [
                    ('Imputer', impute.SimpleImputer(strategy='most_frequent')),
                    (
                        'Encoder',
                        preprocessing.OneHotEncoder(
                            sparse=False, handle_unknown='ignore',
                        )
                    ),
                ]),
                nominal_feature_indices,
             ),
        ]),
    ),
    ('Classifier', ensemble.RandomForestClassifier(n_estimators=10))
])

run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=False)
myrun = run.publish()
print("Uploaded to http://test.openml.org/r/" + str(myrun.run_id))

Out:

Uploaded to http://test.openml.org/r/37837
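The index selection in the pipeline example above can be sketched standalone. This assumes a features mapping shaped like the one task.get_dataset().features returns (index -> feature object with name and data_type attributes); the SimpleNamespace objects here are stand-ins for OpenML's feature class:

```python
from types import SimpleNamespace

# Stand-in for task.get_dataset().features: index -> feature metadata.
features = {
    0: SimpleNamespace(name="color", data_type="nominal"),
    1: SimpleNamespace(name="size", data_type="numeric"),
    2: SimpleNamespace(name="shape", data_type="nominal"),
    3: SimpleNamespace(name="class", data_type="nominal"),  # the target
}
target_name = "class"

# Same comprehension as in the tutorial: nominal features, excluding the target.
nominal_feature_indices = [
    i for i in range(len(features))
    if features[i].name != target_name and features[i].data_type == "nominal"
]
print(nominal_feature_indices)  # [0, 2]
```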

Running flows on tasks offline for later upload

When there is no internet access, it is possible to run a model on a task without immediately uploading results or flows to the server.

# To perform the following step offline, this call must have been made earlier
# (while online), so that the task is cached in the local OpenML cache directory:
task = openml.tasks.get_task(6)

# The following lines can then be executed offline:
run = openml.runs.run_model_on_task(
    pipe,
    task,
    avoid_duplicate_runs=False,
    upload_flow=False)

# The run may be stored offline, and the flow will be stored along with it:
run.to_filesystem(directory='myrun')

# They may be loaded and uploaded at a later time
run = openml.runs.OpenMLRun.from_filesystem(directory='myrun')
run.publish()

# Publishing the run will automatically upload the related flow if
# it does not yet exist on the server.

Out:

OpenML Run
==========
Uploader Name: None
Metric.......: None
Run ID.......: 37839
Run URL......: https://www.openml.org/r/37839
Task ID......: 6
Task Type....: None
Task URL.....: https://www.openml.org/t/6
Flow ID......: 24261
Flow Name....: sklearn.pipeline.Pipeline(Preprocessing=sklearn.compose._column_transformer.ColumnTransformer(Nominal=sklearn.pipeline.Pipeline(Imputer=sklearn.impute._base.SimpleImputer,Encoder=sklearn.preprocessing._encoders.OneHotEncoder)),Classifier=sklearn.ensemble.forest.RandomForestClassifier)
Flow URL.....: https://www.openml.org/f/24261
Setup ID.....: None
Setup String.: None
Dataset ID...: None
Dataset URL..: https://www.openml.org/d/None
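The offline workflow is essentially serialize-now, publish-later. A schematic sketch of the same idea using plain JSON (illustrative only: the real run.to_filesystem writes OpenML's own description and prediction artifacts, and the run.json file name here is invented):

```python
import json
import os
import tempfile

run_record = {"task_id": 6, "predictions": [0, 1, 1, 0]}  # stand-in for a run

# Offline: persist the result to disk.
directory = tempfile.mkdtemp()
path = os.path.join(directory, "run.json")
with open(path, "w") as f:
    json.dump(run_record, f)

# Later, online: reload it and hand it to the upload step.
with open(path) as f:
    restored = json.load(f)
print(restored == run_record)  # True
```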

Alternatively, one can also directly run flows.

# Get a task
task = openml.tasks.get_task(403)

# Build any classifier or pipeline
clf = tree.ExtraTreeClassifier()

# Obtain the scikit-learn extension interface to convert the classifier
# into a flow object.
extension = openml.extensions.get_extension_by_model(clf)
flow = extension.model_to_flow(clf)

run = openml.runs.run_flow_on_task(flow, task)

Challenge

Try to build the best possible models on several OpenML tasks, compare your results with the rest of the class and learn from them. Some tasks you could try (or browse openml.org):

  • EEG eye state: data_id:1471, task_id:14951

  • Volcanoes on Venus: data_id:1527, task_id:10103

  • Walking activity: data_id:1509, task_id:9945, 150k instances.

  • Covertype (Satellite): data_id:150, task_id:218, 500k instances.

  • Higgs (Physics): data_id:23512, task_id:52950, 100k instances, missing values.

# Easy benchmarking:
for task_id in [115, ]:  # Add further tasks. Disclaimer: they might take some time
    task = openml.tasks.get_task(task_id)
    data = openml.datasets.get_dataset(task.dataset_id)
    clf = neighbors.KNeighborsClassifier(n_neighbors=5)

    run = openml.runs.run_model_on_task(clf, task, avoid_duplicate_runs=False)
    myrun = run.publish()
    print(f"kNN on {data.name}: http://test.openml.org/r/{myrun.run_id}")

Out:

kNN on diabetes: http://test.openml.org/r/37840

# Revert to the standard OpenML server configuration.
openml.config.stop_using_configuration_for_example()

Total running time of the script: ( 0 minutes 17.937 seconds)
