Adding an AutoML Framework
Rewrite in progress
Most information on this page is accurate, and it should be complete enough to use.
However, it hasn't been updated to make use of mkdocs-materials
features, and
might have some outdated examples. Contributions welcome.
Add an AutoML framework
Adding an AutoML framework consist in several steps:
- create a Python module that will contain everything related to the integration of this framework.
- define the framework in a Framework definition file.
- write some integration code
- to download/setup the framework dynamically: by convention, this is done by a
setup.sh
script defined in the module. - to run the framework using the data and constraints/parameters provided by the benchmark application: by convention, this is done by an
exec.py
script in the module, but it may require more files depending on the framework, for example if it runs on Python or R, Java...
Framework definition
The framework definition consists in an entry in a yaml
file with the framework name and some properties
- to describe the framework and define which version will be used:
project
,version
. - to indicate the Python module with the integration code:
module
orextends
. - to pass optional parameters to the framework and/or the integration code:
params
.
Default framework definitions are defined in file resources/frameworks.yaml
in lexicographic order,
where version
should be set to stable
, which will point dynamically to the most recent official release available.
Frameworks that offer the possibility to test cutting edge version (e.g. nightly builds,
dev
/master
repo, ...) can add an entry to resources/frameworks_latest.yaml
, where version
should be set to latest
.
Maintainers of this repository try to regularly — ideally, every quarter — create a framework definition using frozen framework versions in order to favour the reproducibility of the published benchmarks.
Following the custom configuration,
it is possible to override and/or add a framework definitions by creating a frameworks.yaml
file in your user_dir
.
See for example the examples/custom/frameworks.yaml
:
---
GradientBoosting:
module: extensions.GradientBoosting
project: https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting
params:
n_estimators: 500
Stacking:
module: extensions.Stacking
project: https://scikit-learn.org/stable/modules/ensemble.html#stacking
params:
_rf_params: {n_estimators: 200}
_gbm_params: {n_estimators: 200}
_linear_params: {penalty: elasticnet, loss: log}
# _svc_params: {tol: 1e-3, max_iter: 1e5}
# _final_params: {penalty: elasticnet, loss: log} # sgd linear
_final_params: {max_iter: 1000} # logistic/linear
autosklearn_latest:
extends: autosklearn
version: latest
description: "this will use master branch from the autosklearn repository instead of the fixed version"
autosklearn_mybranch:
extends: autosklearn
version: mybranch
description: "this will use mybranch branch from the autosklearn repository instead of the fixed version"
autosklearn_oldgen:
extends: autosklearn
version: "0.7.1"
description: "this will use the latest autosklearn version from the old generation"
H2OAutoML_nightly:
module: frameworks.H2OAutoML
setup_cmd: 'LATEST_H2O=`curl http://h2o-release.s3.amazonaws.com/h2o/master/latest` && pip install --no-cache-dir -U "http://h2o-release.s3.amazonaws.com/h2o/master/${{LATEST_H2O}}/Python/h2o-3.29.0.${{LATEST_H2O}}-py2.py3-none-any.whl"'
version: 'nightly'
H2OAutoML_custom:
extends: H2OAutoML
params:
nfolds: 3
stopping_tolerance: 0.05
This example shows
- the definitions for 2 new frameworks:
GradientBoosting
andStacking
. - Those definitions (optionally) externalize some parameters (e.g.
n_estimators
): theparams
property always appears in json format in the results, so that we can clearly see what has been tuned when analyzing the results later. - Note that the module is case sensitive and should point to the module containing the integration code.
- The application will search for modules from the sys path, which includes the application
root_dir
and theuser_dir
:- that's why the default frameworks use
module: frameworks.autosklearn
for example, - and the example above can use
module: extensions.GradientBoosting
because those examples must be run by setting theuser_dir
toexamples/config
, e.g.python runbenchmark.py gradientboosting -u examples/custom
.
- that's why the default frameworks use
- a custom definition (
H2OAutoML_nightly
) for the existingframeworks.H2OAutoML
module, allowing to reuse the module for a dynamic version of the module:- the
setup_cmd
is executed after the default setup of the module, so it can be used to make additional setup. To customize the setup, it is possible to use: setup_args: my_version
(only if thesetup.sh
in the framework module supports new arguments).setup_cmd
(as shown in this example).setup_script: my_additional_setup.sh
.
- the
- 2 custom definitions (
H2OAutoML_blending
andH2OAutoML_custom
) simply extending the existingH2OAutoML
definition (therefore inheriting from all its properties, including themodule
one), but overriding theparams
property, thus allowing to provide multiple "flavours" of the same framework.
The frameworks defined in this example can then be used like any other framework as soon as both the framework module and the definition file are made available to the application: in our case, this is done by the creation of the integration modules under examples/custom/extensions
and by exposing the definitions in examples/custom/frameworks.yaml
thanks to the entry in examples/custom/config.yaml
:
frameworks:
definition_file: # this allows to add custom framework definitions (in {user}/frameworks.yaml) on top of the default ones.
- '{root}/resources/frameworks.yaml'
- '{user}/frameworks.yaml'
By pointing the user_dir
to examples/custom
, our config.yaml
is also loaded, and we can use the new frameworks:
python runbenchmark.py gradientboosting -u examples/custom
python runbenchmark.py stacking -u examples/custom
python runbenchmark.py h2oautoml_blending -u examples/custom
Note:
By default, when generating a docker image, the image name is created as automlbenchmark/{framework}:{version}-{branch}
with the framework name in lowercase, and branch
being the branch of the automlbenchmark
app (usually stable
).
However, it is possible to customize this image name as follow:
MyFramework:
version: 1.0
module: extensions.MyFramework
docker:
author: my_docker_repo
image: my_image
tag: my_tag
my_docker_repo/my_image:my_tag-{branch}
, with branch
still being the branch of the application.
Framework integration
If the framework definition allows to use the new framework from the application, the (not so) hard part is to integrate it.
There are already several frameworks already integrated under frameworks
directory (+ the examples under examples/custom
), so the best starting point when adding a new framework is to first look at the existing ones.
Among the existing frameworks, we can see different type of integrations:
- trivial integration: these are frameworks running on Python and using dependencies (
numpy
,sklearn
) already required by the application itself. These are not really AutoML toolkits, but rather integrations usingsklearn
to provide a reference when analyzing the results: cf.constantpredictor
,DecisionTree
. - Python API integration: these are frameworks that can be run directly from Python: cf.
autosklearn
,H2OAutoML
,TPOT
,RandomForest
,TunedRandomForest
. - contrary to the trivial integration, those require a
setup
phase. - Most of them currently run using the same dependencies as the application, which is not recommended due to potential version conflicts (especially with
sklearn
). This was not a major constraint with the first frameworks implemented, but now, those integrations can and will be slightly changed to [run in their dedicated virtual environment], using their own dependencies: cf.RandomForest
andexamples/custom/extensions/Stacking
for examples. - non-Python frameworks: those frameworks typically run in
R
orJava
and don't provide any Python API. The integration is then still done by spawning theJava
orR
process from theexec.py
: cf.AutoWEKA
orranger
, respectively.
Recommended structure
By convention, the integration is done using the following structure:
Please note however, that this structure is not a requirement, the only requirement is the contract exposed by the integration module itself, i.e. by the __init__.py
file.
A simple __init__.py
would look like this:
from amlb.utils import call_script_in_same_dir
def setup(*args, **kwargs):
call_script_in_same_dir(__file__, "setup.sh", *args, **kwargs)
def run(*args, **kwargs):
from .exec import run
return run(*args, **kwargs)
where we see that the module should expose (only run
is actually required) the following functions:
setup
(optional): called by the application to setup the given framework, usually by simply running asetup.sh
script that will be responsible for potentially creating a local virtual env, downloading and installing the dependencies. Thesetup
function can also receive the optionalsetup_args
param from the framework definition as an argument.run
: called by the benchmark application to execute a task against the framework, using the selected dataset and constraints. We will describe the parameters in detail below, for now, just note that by convention, we just load theexec.py
file from the module and delegate the execution to itsrun
function.docker_commands
(optional): called by the application to collect docker instructions that are specific to the framework. If the framework requires asetup
phase, then the string returned by this function should at least ensure that the setup is also executed during the docker image creation, that's one reason why it is preferable to do all the setup in asetup.sh
script, to allow the docker support above.
Frameworks with Python API
Frameworks requiring a dedicated virtual env
For frameworks with Python API, we may worry about version conflicts between the packages used by the application (e.g. sklearn
, numpy
, pandas
) and the ones required by the framework.
In this case, the integration is slightly different as you can see with the RandomForest
integration allowing to use any version of sklearn
.
This is the basic structure after the creation of the dedicated Python virtual environment during setup:
frameworks/RandomForest/
|-- __init__.py
|-- exec.py
|-- requirements.txt
|-- setup.sh
`-- venv/
`-- (this local virtual env is created by the frameworks/shared/setup.sh)
Noticeable differences with a basic integration:
- the
venv
is created insetup.sh
by passing the current dir when sourcing theshared/setup.sh
script:. $HERE/../shared/setup.sh $HERE
. - the
run
function in__init__.py
prepares the data (in the application environment) before executing theexec.py
in the dedicatedvenv
. The call torun_in_venv
is in charge of serializing the input, callingexec.py
and deserializing + saving the results fromexec
. exec.py
, when calls in the subprocess (function__main__
), callscall_run(run)
which deserializes the input (dataset + config) and passes it to therun
function that just need to return aresult
object.
Note A:
As the serialization/deserialization of numpy
arrays can be costly for very large datasets, it is recommended to use dataset serialization only if the framework itself doesn't support loading datasets from files.
This means that, in the __init__.py
instead of implementing run
as:
data = dict(
train=dict(
X=dataset.train.X,
y=dataset.train.y
),
test=dict(
X=dataset.test.X,
y=dataset.test.y
)
)
return run_in_venv(__file__, "exec.py",
input_data=data, dataset=dataset, config=config)
data = dict(
target=dict(name=dataset.target.name),
train=dict(path=dataset.train.path),
test=dict(path=dataset.test.path)
)
return run_in_venv(__file__, "exec.py",
input_data=data, dataset=dataset, config=config)
Note B:
The serialization/deserialization of data between the main process and the framework process can be customized using the options
parameter:
The allowed options for (de)serialization are defined by the object amlb.utils.serialization.ser_config
.
For example:
data = dict(
train=dict(
X=dataset.train.X,
y=dataset.train.y
),
test=dict(
X=dataset.test.X,
y=dataset.test.y
)
)
options = dict(
serialization=dict(sparse_dataframe_deserialized_format='dense')
)
return run_in_venv(__file__, "exec.py",
input_data=data, dataset=dataset, config=config, options=options)
Other Frameworks
Integration of frameworks without any Python API is done in similar way, for example:
frameworks/AutoWEKA/
|-- __init__.py
|-- exec.py
|-- requirements.txt
|-- setup.sh
`-- lib/
`-- (this is where the framework dependencies go, usually created by setup.sh)
Here are the main differences:
- the setup
phase is identical, but if at runtime, some executable file or library is required that need to be installed locally (as opposed to globally: for example, R
or java
executable are usually installed globally), we just recommend to put everything under the integration module (for example in lib
and/or bin
subfolders as for AutoWEKA
). This is also true for some Python frameworks (cf. hyperoptsklearn
integration for example, where the modules are loaded from frameworks/hyperoptsklearn/lib/hyperopt-sklearn
).
- the framework is then executed by building a command manually in exec.py
, running it in a subprocess, and finally collecting the results generated by the subprocess. For example, in ranger/exec.py
:
with Timer() as training:
run_cmd(("Rscript --vanilla -e \""
"source('{script}'); "
"run('{train}', '{test}', '{output}', cores={cores}, meta_results_file='{meta_results}', task_type='{task_type}')"
"\"").format(
script=os.path.join(here, 'exec.R'),
train=dataset.train.path,
test=dataset.test.path,
output=config.output_predictions_file,
meta_results=meta_results_file,
task_type=config.type,
cores=config.cores
), _live_output_=True)
exec.R
script is also responsible to save the predictions in the expected format.
Add a default framework
Is called "default framework" an AutoML framework whose integration is available on master
branch under the frameworks
folder, and with a simple definition in resources/frameworks.yaml
.
NOTE: There are a few requirements when integrating a new default framework:
- The code snippet triggering the training should use only defaults (no AutoML hyper parameters), plus possibly a generic
**kwargs
in order to supportparams
section in custom framework definitions. In other words, one of the requirements for being included in the benchmark is that the framework is submitted without any tweaks to default settings. This is to prevent submissions (systems) from overfitting or tuning to the benchmark. - There must be a way to limit the runtime of the algorithm (a maximum runtime parameter).
- Exceptions:
- the problem type ("classification", "regression", "binary", "multiclass"): this is available through
config.type
ordataset.type
. - information about data, for example the column types: available through the
dataset
object. - time, cpu and memory constraints: those must be provided by the benchmark application through the
config
object. - the objective function: provided by
config.metric
(usually requires a translation for a given framework). - seed: provided by
config.seed
- paths to folders (output, temporary...): if possible, use
config.output_dir
or a subfolder (see existing integrations). - The default framework definition in
resources/frameworks.yaml
shouldn't have anyparams
section: thisparams
section is intended for custom definitions, not default ones.
Using the instructions above:
- verify that there is an issue created under https://github.com/openml/automlbenchmark/issues for the framework you want to add, or create one.
- create a private branch for your integration changes.
- create the framework module (e.g.
MyFramework
) underframeworks
folder. - define the module (if possible without any
params
) inresources/frameworks.yaml
. - try to setup the framework: > python runbenchmark.py myframework -s only
- fixes the framework setup until it works: the setup being usually a simple
setup.sh
script, you should be able to test it directly without using the application. - try to run simple test against one fold using defaults (
test
benchmark andtest
constraints) with the-Xtest_mode
that will trigger additional validations: > python runbenchmark.py myframework -f 0 -Xtest_mode - fix the module integration code until the test produce all results with no error (if the integration generated an error, it is visible in the results).
- if this works, validate it against the
validation
dataset using one fold: > python runbenchmark.py myframework validation 1h4c -f 0 -Xtest_mode - if this works, try to run it in docker to validate the docker image setup: > python runbenchmark.py myframework -m docker
- if this works, try to run it in aws: > python runbenchmark.py myframework -m aws
- add a brief description of the framework to the documentation in docs/website/framework.html following the same formatting as the other entries.
- create a pull request, and ask a review from authors of
automlbenchmark
: they'll also be happy to help you during this integration.
Add a custom framework
You may want to integrate a framework without wanting to make this publicly available.
In this case, as we've seen above, there's always the possibility to integrate your framework in a custom user_dir
.
Using the instructions above:
- define what is (or will be) your custom
user_dir
for this framework. - ensure it contains a
config.yaml
, otherwise create one (for example copy this one orexamples/custom/config.yaml
). - create the framework module somewhere under this
user_dir
, e.g.{user_dir}/extensions/MyFramework
. - define the module in
{user_dir}/frameworks.yaml
(create the file if needed). - follow the same steps as for a "default" framework to implement the integration: setup, test, ... except that you always need to specify the
user_dir
, e.g. for testing: > python runbenchmark.py myframework -f 0 -u {user_dir} - there may be some issues when trying to build the docker image when the framework is in a custom folder, as all the files should be under the docker build context: solving this probably requires a multi-stage build, needs more investigation. For now, if you really need a docker image, you can either build it manually, or simply copy the
extensions
folder temporarily underautomlbenchmark
. - even without docker image, you can run the framework on AWS, as soon as the custom
config.yaml
,frameworks.yaml
andextensions
folder are made available as AWS resources: cf. again the custom configuration. The application will copy those files to the EC2 instances into a localuser_dir
and will be able to setup the framework there.
Using a Different Hyperparameter Configuration
When you want to use an existing framework integration with a different hyperparameter configuration, it is often enough to write only a custom framework definition without further changes.
Framework definitions accept a params
dictionary for pass-through parameters,
i.e., parameters that are directly accessible from the exec.py
file in the framework
integration executing the AutoML training. Most integration scripts use this to
overwrite any (default) hyperparameter value. Use the extends
field to indicate
which framework definition to copy default values from, and then add any fields to
overwrite. In the example below the n_estimators
and verbose
params are passed
directly to the RandomForestClassifier
, which will now train only 200 trees
(default is 2000):
This new definition can be used as normal:
Note
By convention, param names starting with _
are filtered out (they are not passed
to the framework) but are used for custom logic in the exec.py
. For example, the
_save_artifact
field is often used to allow additional artifacts, such as logs or
models, to be saved.