# Results
The AutoML benchmark produces many result files, such as logs, performance records, and meta-data of the experiments. Some of these files can also be automatically parsed and visualized by notebooks we provide.
## Output File Structure
Except for the logs, all files generated by the application are in easy-to-process `csv` or `json` format, and they are all written to a subfolder of the `output_dir` that is unique to each benchmark run. For example:
```
results/randomforest.test.test.local.20201204T192714
|-- predictions
|   |-- cholesterol
|   |   |-- 0
|   |   |   |-- metadata.json
|   |   |   `-- predictions.csv
|   |   `-- 1
|   |       |-- metadata.json
|   |       `-- predictions.csv
|   |-- iris
|   |   |-- 0
|   |   |   |-- metadata.json
|   |   |   `-- predictions.csv
|   |   `-- 1
|   |       |-- metadata.json
|   |       `-- predictions.csv
|   `-- kc2
|       |-- 0
|       |   |-- metadata.json
|       |   `-- predictions.csv
|       `-- 1
|           |-- metadata.json
|           `-- predictions.csv
`-- scores
    |-- RandomForest.benchmark_test.csv
    `-- results.csv
```
### results.csv
Here is a sample `results.csv` file from a test run against the `RandomForest` framework:
```
id,task,framework,constraint,fold,result,metric,mode,version,params,tag,utc,duration,models,seed,info,acc,auc,balacc,logloss,mae,r2,rmse
openml.org/t/3913,kc2,RandomForest,test,0,0.865801,auc,local,0.23.2,{'n_estimators': 2000},,2020-12-04T19:27:46,3.2,2000,2633845682,,0.792453,0.865801,0.634199,0.350891,,,
openml.org/t/3913,kc2,RandomForest,test,1,0.86039,auc,local,0.23.2,{'n_estimators': 2000},,2020-12-04T19:27:52,3.0,2000,2633845683,,0.90566,0.86039,0.772727,0.406952,,,
openml.org/t/59,iris,RandomForest,test,0,0.126485,logloss,local,0.23.2,{'n_estimators': 2000},,2020-12-04T19:27:56,2.9,2000,2633845682,,0.933333,,0.933333,0.126485,,,
openml.org/t/59,iris,RandomForest,test,1,0.0271781,logloss,local,0.23.2,{'n_estimators': 2000},,2020-12-04T19:28:01,3.0,2000,2633845683,,1.0,,1.0,0.0271781,,,
openml.org/t/2295,cholesterol,RandomForest,test,0,44.3352,rmse,local,0.23.2,{'n_estimators': 2000},,2020-12-04T19:28:05,3.0,2000,2633845682,,,,,,35.6783,-0.014619,44.3352
openml.org/t/2295,cholesterol,RandomForest,test,1,55.3163,rmse,local,0.23.2,{'n_estimators': 2000},,2020-12-04T19:28:10,3.1,2000,2633845683,,,,,,43.1808,-0.0610752,55.3163
```
Here is a short description of each column:
`id`
: an identifier for the dataset used in this result. For convenience, we use the link to the OpenML task by default.

`task`
: the task name as defined in the benchmark definition.

`framework`
: the framework name as defined in the framework definition.

`fold`
: the dataset fold used for this job. We usually use 10 folds, so the fold varies from 0 to 9.

`result`
: the result score, i.e. the score for the metric that the framework was trying to optimize. For example, for binary classification the default metrics defined in `resources/config.yaml` are `binary: ['auc', 'acc']`; this means that the frameworks should try to optimize `auc`, the final `auc` score becomes the `result` value, and the other metrics (here `acc`) are computed for information only.

`mode`
: one of `local`, `docker`, `aws`, `aws+docker`: tells where/how the job was executed.

`version`
: the version of the framework being benchmarked.

`params`
: if any, a JSON representation of the params defined in the framework definition. This makes it easy to see, for example, whether any tuning was applied.

`tag`
: the branch tag of the `automlbenchmark` app that was running the job.

`utc`
: the UTC timestamp at job completion.

`duration`
: the training duration. The framework integration is supposed to provide this information so that it only accounts for the time taken by the framework itself. When benchmarking large data, the application can spend a significant amount of time preparing the data; this additional time does not appear in the `duration` column.

`models`
: for some frameworks, the total number of models trained by the AutoML framework.

`seed`
: the seed or random state passed to the framework. For some frameworks, this is enough to obtain reproducible results. Note that the seed can be specified on the command line using the `-Xseed=` arg (for example `python runbenchmark.py randomforest -Xseed=1452956522`); when there are multiple folds, the seed is then incremented by the fold number.

`info`
: additional info in text format; this usually contains error messages if the job failed.

`acc`, `auc`, `logloss`, ... (metric columns)
: all the metrics that were computed based on the generated predictions. For each job/row, one of them matches the `result` column; the others are purely informative. These additional metric columns are simply added in alphabetical order.
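Since `results.csv` is a plain `csv` file, it is easy to load for further analysis. Here is a minimal sketch (not one of the provided notebooks) that loads the file from the example run above with `pandas` and aggregates the `result` scores over folds; the path is illustrative:

```python
import pandas as pd

# Path from the example run above; adjust to your own output directory.
results = pd.read_csv(
    "results/randomforest.test.test.local.20201204T192714/scores/results.csv"
)

# One row per (task, fold); `result` holds the score of the optimized metric.
summary = (
    results.groupby(["framework", "task", "metric"])["result"]
           .agg(["mean", "std", "count"])  # aggregated over folds
           .reset_index()
)
print(summary)
```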
### Predictions Directory
For each evaluation, the framework integration must generate a predictions file that the application then uses to compute the scores. This predictions file is saved under the `predictions` subfolder shown above, following the layout `predictions/{task}/{fold}/predictions.csv`.
The `csv` file has a header row and contains the following columns, in order:

- For classification tasks only, there is first one column per class, sorted alphabetically. Each column contains the probability of the sample belonging to that class, as predicted by the AutoML framework. If a framework does not provide probabilities, the value is 1 for the predicted class and 0 otherwise.
- `predictions`: the predictions made on the test data by the model trained by the framework.
- `truth`: the true values of the test target (`test.y`).
Here are examples of the first few samples for `kc2` (binary classification), `iris` (multiclass classification), and `cholesterol` (regression):
| no | yes | predictions | truth |
|---|---|---|---|
| 0.965857617846013 | 0.034142382153998944 | no | no |
| 0.965857617846013 | 0.034142382153998944 | no | no |
| 0.5845 | 0.4155 | no | no |
| 0.6795 | 0.3205 | no | no |
| 0.965857617846013 | 0.034142382153998944 | no | no |
| Iris-setosa | Iris-versicolor | Iris-virginica | predictions | truth |
|---|---|---|---|---|
| 1.0 | 0.0 | 0.0 | Iris-setosa | Iris-setosa |
| 0.9715 | 0.028 | 0.0005 | Iris-setosa | Iris-setosa |
| 1.0 | 0.0 | 0.0 | Iris-setosa | Iris-setosa |
| 1.0 | 0.0 | 0.0 | Iris-setosa | Iris-setosa |
| 1.0 | 0.0 | 0.0 | Iris-setosa | Iris-setosa |
| 0.0 | 1.0 | 0.0 | Iris-versicolor | Iris-versicolor |
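As an illustration of how such a file relates to the scores in `results.csv`, here is a minimal sketch (not the application's actual scoring code) that recomputes `acc` and `logloss` for a classification predictions file with `scikit-learn`, assuming the column layout described above; the path is illustrative:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss

# Illustrative path, following the directory structure shown above.
preds = pd.read_csv(
    "results/randomforest.test.test.local.20201204T192714/predictions/kc2/0/predictions.csv"
)

# The trailing columns are always `predictions` and `truth`; the leading
# columns are the per-class probabilities, sorted alphabetically.
class_columns = [c for c in preds.columns if c not in ("predictions", "truth")]

acc = accuracy_score(preds["truth"], preds["predictions"])
ll = log_loss(preds["truth"], preds[class_columns], labels=class_columns)
print(f"acc={acc:.6f}, logloss={ll:.6f}")
```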
## Extract more information
For some frameworks, it is also possible to extract more detailed information, in the form of artifacts that are saved after training. Examples of such artifacts are the logs generated by the framework, the models or descriptions of the models it trained, and the predictions of each individual model trained by the AutoML framework. By default, those artifacts are not saved, and not all frameworks provide the same artifacts. This is why the artifacts to store have to be specified in the framework definition (before running the experiments!). By convention, this is done through the `params._save_artifacts` parameter. For example:
Save model descriptions under the `models` subfolder:
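A minimal sketch of such a framework definition, e.g. in a custom `frameworks.yaml` (the definition name and `extends` target are illustrative, and the exact artifact keywords accepted depend on each integration):

```yaml
# Hypothetical custom definition; 'models' is an illustrative artifact name.
RandomForest_with_models:
  extends: RandomForest
  params:
    _save_artifacts: ['models']
```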
Save the leaderboard and models under the `models` subfolder, and the H2O logs under the `logs` subfolder:
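Again a sketch under the same assumptions (the `leaderboard`, `models`, and `logs` keywords are illustrative and depend on what the H2O integration supports):

```yaml
# Hypothetical custom definition for H2O; artifact names are illustrative.
H2OAutoML_with_artifacts:
  extends: H2OAutoML
  params:
    _save_artifacts: ['leaderboard', 'models', 'logs']
```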
The framework integrations themselves determine where the artifacts are saved; this is typically not configurable from the framework definition.