Run Setup
By: Jan N. van Rijn
One of the key features of the openml-python library is that it allows you to reinstantiate flows with hyperparameter settings that were uploaded before. This tutorial uses the concept of setups. Although setups are not extensively described in the OpenML documentation (because most users will not use them directly), they are an important concept within OpenML for distinguishing between hyperparameter configurations. A setup is the combination of a flow with all its hyperparameters set.
A key requirement for reinstantiating a flow is to have the same scikit-learn version installed as the one used when the flow was uploaded. However, this tutorial itself uploads the flow that is later reinstantiated, so it can be run with any scikit-learn version supported by this library. In this case, the version requirement is met automatically.
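The version requirement can be illustrated with a small check. This is only a sketch: it assumes the flow's dependency information is available as a newline-separated string of requirements such as `sklearn==0.21.2`, and the helper names `required_version` and `versions_match` are hypothetical, not part of openml-python.

```python
from typing import Optional


def required_version(dependencies: str, package: str = "sklearn") -> Optional[str]:
    """Return the version pinned with '==' for `package`, or None if not pinned."""
    for line in dependencies.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep and name.strip() == package:
            return version.strip()
    return None


def versions_match(dependencies: str, installed: str, package: str = "sklearn") -> bool:
    """True when the dependency string pins exactly the installed version."""
    pinned = required_version(dependencies, package)
    return pinned is not None and pinned == installed


# Made-up dependency string for illustration (assumed format, one requirement per line).
example_dependencies = "sklearn==0.21.2\nnumpy>=1.6.1\nscipy>=0.9"

print(required_version(example_dependencies))
print(versions_match(example_dependencies, "0.21.2"))
print(versions_match(example_dependencies, "0.22.0"))
```

In practice the `installed` argument would be `sklearn.__version__`. Because this tutorial uploads and downloads the flow in the same environment, such a check would pass trivially here.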
In this tutorial we will:
1) Create a flow and use it to solve a task;
2) Download the flow, reinstantiate the model with the same hyperparameters, and solve the same task again;
3) Verify that the obtained results are exactly the same.
Warning
This example uploads data. For that reason, it connects to the test server at test.openml.org. This keeps the main server from being crowded with example datasets, tasks, runs, and so on.
import logging
import numpy as np
import openml
import sklearn.ensemble
import sklearn.impute
import sklearn.pipeline  # needed for make_pipeline below
import sklearn.preprocessing
root = logging.getLogger()
root.setLevel(logging.INFO)
openml.config.start_using_configuration_for_example()
1) Create a flow and use it to solve a task
# first, let's download the task that we are interested in
task = openml.tasks.get_task(6)
# we will create a fairly complex model, with several preprocessing components
# and many potential hyperparameters. Of course, the model can be as complex or
# as simple as you want it to be
model_original = sklearn.pipeline.make_pipeline(
    sklearn.impute.SimpleImputer(),
    sklearn.ensemble.RandomForestClassifier()
)
# Let's change some hyperparameters. Of course, in any good application we
# would tune them using, e.g., Random Search or Bayesian Optimization, but for
# the purpose of this tutorial we set them to some specific values that might
# or might not be optimal
hyperparameters_original = {
    'simpleimputer__strategy': 'median',
    'randomforestclassifier__criterion': 'entropy',
    'randomforestclassifier__max_features': 0.2,
    'randomforestclassifier__min_samples_leaf': 1,
    'randomforestclassifier__n_estimators': 16,
    'randomforestclassifier__random_state': 42,
}
model_original.set_params(**hyperparameters_original)
# solve the task and upload the result (this implicitly creates the flow)
run = openml.runs.run_model_on_task(
    model_original,
    task,
    avoid_duplicate_runs=False)
run_original = run.publish()  # this implicitly uploads the flow
Out:
WARNING: Logging before flag parsing goes to stderr.
I0703 12:38:11.051397 139733218952960 functions.py:432] Going to execute flow 'sklearn.pipeline.Pipeline(simpleimputer=sklearn.impute.SimpleImputer,randomforestclassifier=sklearn.ensemble.forest.RandomForestClassifier)' on task 6 for repeat 0 fold 0 sample 0.
I0703 12:38:11.076056 139733218952960 functions.py:255] Executed Task 6 on local Flow with name sklearn.pipeline.Pipeline(simpleimputer=sklearn.impute.SimpleImputer,randomforestclassifier=sklearn.ensemble.forest.RandomForestClassifier).
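The hyperparameter keys set above follow scikit-learn's `<step>__<parameter>` naming, where `make_pipeline` names each step after its lowercased class name. The following sketch only illustrates how such keys break down per pipeline step; `group_by_step` is a hypothetical helper written for this illustration, not part of scikit-learn or openml-python.

```python
from collections import defaultdict
from typing import Any, Dict


def group_by_step(params: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Group '<step>__<parameter>' keys by pipeline step name."""
    grouped: Dict[str, Dict[str, Any]] = defaultdict(dict)
    for key, value in params.items():
        # Split on the first double underscore only.
        step, sep, name = key.partition("__")
        if not sep:
            raise ValueError(f"expected '<step>__<parameter>', got {key!r}")
        grouped[step][name] = value
    return dict(grouped)


hyperparameters = {
    "simpleimputer__strategy": "median",
    "randomforestclassifier__criterion": "entropy",
    "randomforestclassifier__n_estimators": 16,
}

for step, settings in group_by_step(hyperparameters).items():
    print(step, settings)
```

Grouping the keys this way makes it easier to see which component of the pipeline each setting belongs to.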
2) Download the flow and solve the same task again.
# obtain setup id (note that the setup id is assigned by the OpenML server -
# therefore it was not yet available in our local copy of the run)
run_downloaded = openml.runs.get_run(run_original.run_id)
setup_id = run_downloaded.setup_id
# after this, we can easily reinstantiate the model
model_duplicate = openml.setups.initialize_model(setup_id)
# it will automatically have all the hyperparameters set
# and run the task again
run_duplicate = openml.runs.run_model_on_task(
    model_duplicate, task, avoid_duplicate_runs=False)
Out:
I0703 12:38:14.420961 139733218952960 extension.py:148] - flow_to_sklearn START o=<openml.flows.flow.OpenMLFlow object at 0x7f15a9561b70>, components=None, init_defaults=False
I0703 12:38:14.421192 139733218952960 extension.py:661] - deserialize sklearn.pipeline.Pipeline(simpleimputer=sklearn.impute.SimpleImputer,randomforestclassifier=sklearn.ensemble.forest.RandomForestClassifier)
I0703 12:38:14.421445 139733218952960 extension.py:679] -- flow_parameter=memory, value=null
I0703 12:38:14.421531 139733218952960 extension.py:148] -- flow_to_sklearn START o=null, components=OrderedDict([('randomforestclassifier', <openml.flows.flow.OpenMLFlow object at 0x7f15a9561860>), ('simpleimputer', <openml.flows.flow.OpenMLFlow object at 0x7f15a95be438>)]), init_defaults=False
I0703 12:38:14.421608 139733218952960 extension.py:245] -- flow_to_sklearn END o=None, rval=None
I0703 12:38:14.421661 139733218952960 extension.py:679] -- flow_parameter=steps, value=[{"oml-python:serialized_object": "component_reference", "value": {"key": "simpleimputer", "step_name": "simpleimputer"}}, {"oml-python:serialized_object": "component_reference", "value": {"key": "randomforestclassifier", "step_name": "randomforestclassifier"}}]
I0703 12:38:14.421717 139733218952960 extension.py:148] -- flow_to_sklearn START o=[{"oml-python:serialized_object": "component_reference", "value": {"key": "simpleimputer", "step_name": "simpleimputer"}}, {"oml-python:serialized_object": "component_reference", "value": {"key": "randomforestclassifier", "step_name": "randomforestclassifier"}}], components=OrderedDict([('randomforestclassifier', <openml.flows.flow.OpenMLFlow object at 0x7f15a9561860>), ('simpleimputer', <openml.flows.flow.OpenMLFlow object at 0x7f15a95be438>)]), init_defaults=False
I0703 12:38:14.421792 139733218952960 extension.py:148] --- flow_to_sklearn START o={'oml-python:serialized_object': 'component_reference', 'value': {'key': 'simpleimputer', 'step_name': 'simpleimputer'}}, components=OrderedDict([('randomforestclassifier', <openml.flows.flow.OpenMLFlow object at 0x7f15a9561860>), ('simpleimputer', <openml.flows.flow.OpenMLFlow object at 0x7f15a95be438>)]), init_defaults=False
I0703 12:38:14.421852 139733218952960 extension.py:148] ---- flow_to_sklearn START o={'key': 'simpleimputer', 'step_name': 'simpleimputer'}, components=None, init_defaults=False
I0703 12:38:14.421909 139733218952960 extension.py:148] ----- flow_to_sklearn START o=key, components=None, init_defaults=False
I0703 12:38:14.421977 139733218952960 extension.py:245] ----- flow_to_sklearn END o=key, rval=key
I0703 12:38:14.422029 139733218952960 extension.py:148] ----- flow_to_sklearn START o=simpleimputer, components=None, init_defaults=False
I0703 12:38:14.422087 139733218952960 extension.py:245] ----- flow_to_sklearn END o=simpleimputer, rval=simpleimputer
I0703 12:38:14.422138 139733218952960 extension.py:148] ----- flow_to_sklearn START o=step_name, components=None, init_defaults=False
I0703 12:38:14.422194 139733218952960 extension.py:245] ----- flow_to_sklearn END o=step_name, rval=step_name
I0703 12:38:14.422243 139733218952960 extension.py:148] ----- flow_to_sklearn START o=simpleimputer, components=None, init_defaults=False
I0703 12:38:14.422298 139733218952960 extension.py:245] ----- flow_to_sklearn END o=simpleimputer, rval=simpleimputer
I0703 12:38:14.422360 139733218952960 extension.py:245] ---- flow_to_sklearn END o={'key': 'simpleimputer', 'step_name': 'simpleimputer'}, rval=OrderedDict([('key', 'simpleimputer'), ('step_name', 'simpleimputer')])
I0703 12:38:14.422411 139733218952960 extension.py:148] ---- flow_to_sklearn START o=<openml.flows.flow.OpenMLFlow object at 0x7f15a95be438>, components=None, init_defaults=False
I0703 12:38:14.422462 139733218952960 extension.py:661] ---- deserialize sklearn.impute.SimpleImputer
I0703 12:38:14.422618 139733218952960 extension.py:679] ----- flow_parameter=copy, value=true
I0703 12:38:14.422678 139733218952960 extension.py:148] ----- flow_to_sklearn START o=true, components=OrderedDict(), init_defaults=False
I0703 12:38:14.422735 139733218952960 extension.py:245] ----- flow_to_sklearn END o=True, rval=True
I0703 12:38:14.422784 139733218952960 extension.py:679] ----- flow_parameter=fill_value, value=null
I0703 12:38:14.422835 139733218952960 extension.py:148] ----- flow_to_sklearn START o=null, components=OrderedDict(), init_defaults=False
I0703 12:38:14.422888 139733218952960 extension.py:245] ----- flow_to_sklearn END o=None, rval=None
I0703 12:38:14.422936 139733218952960 extension.py:679] ----- flow_parameter=missing_values, value=NaN
I0703 12:38:14.422983 139733218952960 extension.py:148] ----- flow_to_sklearn START o=NaN, components=OrderedDict(), init_defaults=False
I0703 12:38:14.423038 139733218952960 extension.py:245] ----- flow_to_sklearn END o=nan, rval=nan
I0703 12:38:14.423085 139733218952960 extension.py:679] ----- flow_parameter=strategy, value="median"
I0703 12:38:14.423132 139733218952960 extension.py:148] ----- flow_to_sklearn START o="median", components=OrderedDict(), init_defaults=False
I0703 12:38:14.423183 139733218952960 extension.py:245] ----- flow_to_sklearn END o=median, rval=median
I0703 12:38:14.423230 139733218952960 extension.py:679] ----- flow_parameter=verbose, value=0
I0703 12:38:14.423275 139733218952960 extension.py:148] ----- flow_to_sklearn START o=0, components=OrderedDict(), init_defaults=False
I0703 12:38:14.423334 139733218952960 extension.py:245] ----- flow_to_sklearn END o=0, rval=0
I0703 12:38:14.423608 139733218952960 extension.py:245] ---- flow_to_sklearn END o=<openml.flows.flow.OpenMLFlow object at 0x7f15a95be438>, rval=SimpleImputer(copy=True, fill_value=None, missing_values=nan,
strategy='median', verbose=0)
I0703 12:38:14.423797 139733218952960 extension.py:245] --- flow_to_sklearn END o={'oml-python:serialized_object': 'component_reference', 'value': {'key': 'simpleimputer', 'step_name': 'simpleimputer'}}, rval=('simpleimputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
strategy='median', verbose=0))
I0703 12:38:14.423890 139733218952960 extension.py:148] --- flow_to_sklearn START o={'oml-python:serialized_object': 'component_reference', 'value': {'key': 'randomforestclassifier', 'step_name': 'randomforestclassifier'}}, components=OrderedDict([('randomforestclassifier', <openml.flows.flow.OpenMLFlow object at 0x7f15a9561860>)]), init_defaults=False
I0703 12:38:14.423947 139733218952960 extension.py:148] ---- flow_to_sklearn START o={'key': 'randomforestclassifier', 'step_name': 'randomforestclassifier'}, components=None, init_defaults=False
I0703 12:38:14.424001 139733218952960 extension.py:148] ----- flow_to_sklearn START o=key, components=None, init_defaults=False
I0703 12:38:14.424066 139733218952960 extension.py:245] ----- flow_to_sklearn END o=key, rval=key
I0703 12:38:14.424116 139733218952960 extension.py:148] ----- flow_to_sklearn START o=randomforestclassifier, components=None, init_defaults=False
I0703 12:38:14.424173 139733218952960 extension.py:245] ----- flow_to_sklearn END o=randomforestclassifier, rval=randomforestclassifier
I0703 12:38:14.424223 139733218952960 extension.py:148] ----- flow_to_sklearn START o=step_name, components=None, init_defaults=False
I0703 12:38:14.424278 139733218952960 extension.py:245] ----- flow_to_sklearn END o=step_name, rval=step_name
I0703 12:38:14.424336 139733218952960 extension.py:148] ----- flow_to_sklearn START o=randomforestclassifier, components=None, init_defaults=False
I0703 12:38:14.424392 139733218952960 extension.py:245] ----- flow_to_sklearn END o=randomforestclassifier, rval=randomforestclassifier
I0703 12:38:14.424447 139733218952960 extension.py:245] ---- flow_to_sklearn END o={'key': 'randomforestclassifier', 'step_name': 'randomforestclassifier'}, rval=OrderedDict([('key', 'randomforestclassifier'), ('step_name', 'randomforestclassifier')])
I0703 12:38:14.424499 139733218952960 extension.py:148] ---- flow_to_sklearn START o=<openml.flows.flow.OpenMLFlow object at 0x7f15a9561860>, components=None, init_defaults=False
I0703 12:38:14.424549 139733218952960 extension.py:661] ---- deserialize sklearn.ensemble.forest.RandomForestClassifier
I0703 12:38:14.424736 139733218952960 extension.py:679] ----- flow_parameter=bootstrap, value=true
I0703 12:38:14.424801 139733218952960 extension.py:148] ----- flow_to_sklearn START o=true, components=OrderedDict(), init_defaults=False
I0703 12:38:14.424862 139733218952960 extension.py:245] ----- flow_to_sklearn END o=True, rval=True
I0703 12:38:14.424912 139733218952960 extension.py:679] ----- flow_parameter=class_weight, value=null
I0703 12:38:14.424960 139733218952960 extension.py:148] ----- flow_to_sklearn START o=null, components=OrderedDict(), init_defaults=False
I0703 12:38:14.425013 139733218952960 extension.py:245] ----- flow_to_sklearn END o=None, rval=None
I0703 12:38:14.425060 139733218952960 extension.py:679] ----- flow_parameter=criterion, value="entropy"
I0703 12:38:14.425107 139733218952960 extension.py:148] ----- flow_to_sklearn START o="entropy", components=OrderedDict(), init_defaults=False
I0703 12:38:14.425159 139733218952960 extension.py:245] ----- flow_to_sklearn END o=entropy, rval=entropy
I0703 12:38:14.425206 139733218952960 extension.py:679] ----- flow_parameter=max_depth, value=null
I0703 12:38:14.425253 139733218952960 extension.py:148] ----- flow_to_sklearn START o=null, components=OrderedDict(), init_defaults=False
I0703 12:38:14.425304 139733218952960 extension.py:245] ----- flow_to_sklearn END o=None, rval=None
I0703 12:38:14.425383 139733218952960 extension.py:679] ----- flow_parameter=max_features, value=0.2
I0703 12:38:14.425459 139733218952960 extension.py:148] ----- flow_to_sklearn START o=0.2, components=OrderedDict(), init_defaults=False
I0703 12:38:14.425521 139733218952960 extension.py:245] ----- flow_to_sklearn END o=0.2, rval=0.2
I0703 12:38:14.425576 139733218952960 extension.py:679] ----- flow_parameter=max_leaf_nodes, value=null
I0703 12:38:14.425623 139733218952960 extension.py:148] ----- flow_to_sklearn START o=null, components=OrderedDict(), init_defaults=False
I0703 12:38:14.425675 139733218952960 extension.py:245] ----- flow_to_sklearn END o=None, rval=None
I0703 12:38:14.425723 139733218952960 extension.py:679] ----- flow_parameter=min_impurity_decrease, value=0.0
I0703 12:38:14.425770 139733218952960 extension.py:148] ----- flow_to_sklearn START o=0.0, components=OrderedDict(), init_defaults=False
I0703 12:38:14.425824 139733218952960 extension.py:245] ----- flow_to_sklearn END o=0.0, rval=0.0
I0703 12:38:14.425872 139733218952960 extension.py:679] ----- flow_parameter=min_impurity_split, value=null
I0703 12:38:14.425919 139733218952960 extension.py:148] ----- flow_to_sklearn START o=null, components=OrderedDict(), init_defaults=False
I0703 12:38:14.425971 139733218952960 extension.py:245] ----- flow_to_sklearn END o=None, rval=None
I0703 12:38:14.426017 139733218952960 extension.py:679] ----- flow_parameter=min_samples_leaf, value=1
I0703 12:38:14.426064 139733218952960 extension.py:148] ----- flow_to_sklearn START o=1, components=OrderedDict(), init_defaults=False
I0703 12:38:14.426115 139733218952960 extension.py:245] ----- flow_to_sklearn END o=1, rval=1
I0703 12:38:14.426163 139733218952960 extension.py:679] ----- flow_parameter=min_samples_split, value=2
I0703 12:38:14.426215 139733218952960 extension.py:148] ----- flow_to_sklearn START o=2, components=OrderedDict(), init_defaults=False
I0703 12:38:14.426266 139733218952960 extension.py:245] ----- flow_to_sklearn END o=2, rval=2
I0703 12:38:14.426323 139733218952960 extension.py:679] ----- flow_parameter=min_weight_fraction_leaf, value=0.0
I0703 12:38:14.426374 139733218952960 extension.py:148] ----- flow_to_sklearn START o=0.0, components=OrderedDict(), init_defaults=False
I0703 12:38:14.426426 139733218952960 extension.py:245] ----- flow_to_sklearn END o=0.0, rval=0.0
I0703 12:38:14.426479 139733218952960 extension.py:679] ----- flow_parameter=n_estimators, value=16
I0703 12:38:14.426526 139733218952960 extension.py:148] ----- flow_to_sklearn START o=16, components=OrderedDict(), init_defaults=False
I0703 12:38:14.426577 139733218952960 extension.py:245] ----- flow_to_sklearn END o=16, rval=16
I0703 12:38:14.426624 139733218952960 extension.py:679] ----- flow_parameter=n_jobs, value=null
I0703 12:38:14.426673 139733218952960 extension.py:148] ----- flow_to_sklearn START o=null, components=OrderedDict(), init_defaults=False
I0703 12:38:14.426724 139733218952960 extension.py:245] ----- flow_to_sklearn END o=None, rval=None
I0703 12:38:14.426772 139733218952960 extension.py:679] ----- flow_parameter=oob_score, value=false
I0703 12:38:14.426825 139733218952960 extension.py:148] ----- flow_to_sklearn START o=false, components=OrderedDict(), init_defaults=False
I0703 12:38:14.426877 139733218952960 extension.py:245] ----- flow_to_sklearn END o=False, rval=False
I0703 12:38:14.426924 139733218952960 extension.py:679] ----- flow_parameter=random_state, value=42
I0703 12:38:14.426971 139733218952960 extension.py:148] ----- flow_to_sklearn START o=42, components=OrderedDict(), init_defaults=False
I0703 12:38:14.427022 139733218952960 extension.py:245] ----- flow_to_sklearn END o=42, rval=42
I0703 12:38:14.427068 139733218952960 extension.py:679] ----- flow_parameter=verbose, value=0
I0703 12:38:14.427115 139733218952960 extension.py:148] ----- flow_to_sklearn START o=0, components=OrderedDict(), init_defaults=False
I0703 12:38:14.427166 139733218952960 extension.py:245] ----- flow_to_sklearn END o=0, rval=0
I0703 12:38:14.427214 139733218952960 extension.py:679] ----- flow_parameter=warm_start, value=false
I0703 12:38:14.427260 139733218952960 extension.py:148] ----- flow_to_sklearn START o=false, components=OrderedDict(), init_defaults=False
I0703 12:38:14.427311 139733218952960 extension.py:245] ----- flow_to_sklearn END o=False, rval=False
I0703 12:38:14.427641 139733218952960 extension.py:245] ---- flow_to_sklearn END o=<openml.flows.flow.OpenMLFlow object at 0x7f15a9561860>, rval=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
max_depth=None, max_features=0.2, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=16, n_jobs=None,
oob_score=False, random_state=42, verbose=0, warm_start=False)
I0703 12:38:14.427964 139733218952960 extension.py:245] --- flow_to_sklearn END o={'oml-python:serialized_object': 'component_reference', 'value': {'key': 'randomforestclassifier', 'step_name': 'randomforestclassifier'}}, rval=('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
max_depth=None, max_features=0.2, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=16, n_jobs=None,
oob_score=False, random_state=42, verbose=0, warm_start=False))
I0703 12:38:14.428354 139733218952960 extension.py:245] -- flow_to_sklearn END o=[{'oml-python:serialized_object': 'component_reference', 'value': {'key': 'simpleimputer', 'step_name': 'simpleimputer'}}, {'oml-python:serialized_object': 'component_reference', 'value': {'key': 'randomforestclassifier', 'step_name': 'randomforestclassifier'}}], rval=[('simpleimputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
strategy='median', verbose=0)), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
max_depth=None, max_features=0.2, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=16, n_jobs=None,
oob_score=False, random_state=42, verbose=0, warm_start=False))]
I0703 12:38:14.428890 139733218952960 extension.py:245] - flow_to_sklearn END o=<openml.flows.flow.OpenMLFlow object at 0x7f15a9561b70>, rval=Pipeline(memory=None,
steps=[('simpleimputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
strategy='median', verbose=0)), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
max_depth=None, max_features=0.2, max_leaf_nodes=None,
...mators=16, n_jobs=None,
oob_score=False, random_state=42, verbose=0, warm_start=False))])
I0703 12:38:14.531917 139733218952960 functions.py:432] Going to execute flow 'sklearn.pipeline.Pipeline(simpleimputer=sklearn.impute.SimpleImputer,randomforestclassifier=sklearn.ensemble.forest.RandomForestClassifier)' on task 6 for repeat 0 fold 0 sample 0.
I0703 12:38:14.555600 139733218952960 functions.py:255] Executed Task 6 on local Flow with name sklearn.pipeline.Pipeline(simpleimputer=sklearn.impute.SimpleImputer,randomforestclassifier=sklearn.ensemble.forest.RandomForestClassifier).
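Before comparing predictions, it can be useful to confirm that the reinstantiated model really carries the same hyperparameters. The sketch below compares two plain parameter dictionaries; the helpers `same_value` and `differing_keys` are hypothetical names introduced here. NaN needs special handling because a parameter such as `missing_values=nan` (visible in the log above) never compares equal to itself.

```python
import math
from typing import Any, Dict, Iterable, List


def same_value(a: Any, b: Any) -> bool:
    """Equality that treats two float NaNs as equal."""
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    return a == b


def differing_keys(
    original: Dict[str, Any],
    duplicate: Dict[str, Any],
    keys: Iterable[str],
) -> List[str]:
    """Return the keys that are missing from either dict or whose values differ."""
    result = []
    for key in keys:
        if key not in original or key not in duplicate:
            result.append(key)
        elif not same_value(original[key], duplicate[key]):
            result.append(key)
    return result


# Made-up parameter dictionaries standing in for the two models' settings.
original = {
    "simpleimputer__missing_values": float("nan"),
    "randomforestclassifier__n_estimators": 16,
}
duplicate = {
    "simpleimputer__missing_values": float("nan"),
    "randomforestclassifier__n_estimators": 16,
}

print(differing_keys(original, duplicate, original))
```

With the models from this tutorial, one would presumably pass `model_original.get_params()` and `model_duplicate.get_params()` and restrict `keys` to the entries of `hyperparameters_original`, since the full parameter dictionaries also contain estimator objects that do not compare equal.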
3) We will verify that the obtained results are exactly the same.
# the run stores all predictions in its data_content field
np.testing.assert_array_equal(run_original.data_content,
                              run_duplicate.data_content)
openml.config.stop_using_configuration_for_example()
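If the equality assertion above were to fail, a row-wise comparison can help locate the first mismatch. This is a minimal sketch that assumes the predictions are available as a list of rows; `first_difference` is a hypothetical helper, and the rows below are made up for illustration.

```python
from typing import Any, Optional, Sequence, Tuple


def first_difference(
    rows_a: Sequence[Sequence[Any]],
    rows_b: Sequence[Sequence[Any]],
) -> Optional[Tuple[int, Any, Any]]:
    """Return (index, row_a, row_b) for the first differing row, else None."""
    for index, (row_a, row_b) in enumerate(zip(rows_a, rows_b)):
        if list(row_a) != list(row_b):
            return index, row_a, row_b
    if len(rows_a) != len(rows_b):
        # One input has extra rows after the shared prefix.
        index = min(len(rows_a), len(rows_b))
        extra_a = rows_a[index] if index < len(rows_a) else None
        extra_b = rows_b[index] if index < len(rows_b) else None
        return index, extra_a, extra_b
    return None


# Made-up rows for illustration; real prediction rows have more columns.
original_rows = [[0, 0, "a"], [0, 1, "b"]]
duplicate_rows = [[0, 0, "a"], [0, 1, "c"]]

print(first_difference(original_rows, original_rows))
print(first_difference(original_rows, duplicate_rows))
```

A mismatch would typically point to a hyperparameter that was not restored, such as a missing `random_state`.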
Total running time of the script: (0 minutes 7.375 seconds)