Results

The AutoML benchmark produces many result files, such as logs, performance records, and experiment metadata. Some of these files can also be parsed and visualized automatically by the notebooks we provide.

Output File Structure

Except for the logs, all files generated by the application are in easy-to-process csv or json format, and they are all written to a subfolder of the output_dir that is unique to each benchmark run.

For example:

results/randomforest.test.test.local.20201204T192714
|-- predictions
|   |-- cholesterol
|   |   |-- 0
|   |   |   |-- metadata.json
|   |   |   `-- predictions.csv
|   |   `-- 1
|   |       |-- metadata.json
|   |       `-- predictions.csv
|   |-- iris
|   |   |-- 0
|   |   |   |-- metadata.json
|   |   |   `-- predictions.csv
|   |   `-- 1
|   |       |-- metadata.json
|   |       `-- predictions.csv
|   `-- kc2
|       |-- 0
|       |   |-- metadata.json
|       |   `-- predictions.csv
|       `-- 1
|           |-- metadata.json
|           `-- predictions.csv
`-- scores
    |-- RandomForest.benchmark_test.csv
    `-- results.csv
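
As an illustration, here is a minimal Python sketch (not part of the benchmark itself) that walks such an output directory and lists the aggregated score file of each run; the results path is an assumption matching the example above:

from pathlib import Path

# Each benchmark run writes into its own timestamped subfolder of the output directory.
output_dir = Path("results")  # assumption: the default output_dir

for run_dir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
    scores_file = run_dir / "scores" / "results.csv"
    if scores_file.exists():
        print(f"{run_dir.name} -> {scores_file}")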

results.csv

Here is a sample results.csv file from a test run against the RandomForest framework:

id,task,framework,constraint,fold,result,metric,mode,version,params,tag,utc,duration,models,seed,info,acc,auc,balacc,logloss,mae,r2,rmse
openml.org/t/3913,kc2,RandomForest,test,0,0.865801,auc,local,0.23.2,{'n_estimators': 2000},,2020-12-04T19:27:46,3.2,2000,2633845682,,0.792453,0.865801,0.634199,0.350891,,,
openml.org/t/3913,kc2,RandomForest,test,1,0.86039,auc,local,0.23.2,{'n_estimators': 2000},,2020-12-04T19:27:52,3.0,2000,2633845683,,0.90566,0.86039,0.772727,0.406952,,,
openml.org/t/59,iris,RandomForest,test,0,0.126485,logloss,local,0.23.2,{'n_estimators': 2000},,2020-12-04T19:27:56,2.9,2000,2633845682,,0.933333,,0.933333,0.126485,,,
openml.org/t/59,iris,RandomForest,test,1,0.0271781,logloss,local,0.23.2,{'n_estimators': 2000},,2020-12-04T19:28:01,3.0,2000,2633845683,,1.0,,1.0,0.0271781,,,
openml.org/t/2295,cholesterol,RandomForest,test,0,44.3352,rmse,local,0.23.2,{'n_estimators': 2000},,2020-12-04T19:28:05,3.0,2000,2633845682,,,,,,35.6783,-0.014619,44.3352
openml.org/t/2295,cholesterol,RandomForest,test,1,55.3163,rmse,local,0.23.2,{'n_estimators': 2000},,2020-12-04T19:28:10,3.1,2000,2633845683,,,,,,43.1808,-0.0610752,55.3163

Here is a short description of each column:

  • id: an identifier for the dataset used in this result. For convenience, we use the link to the OpenML task by default.
  • task: the task name as defined in the benchmark definition.
  • framework: the framework name as defined in the framework definition.
  • fold: the dataset fold used for this job. Usually 10 folds are used, so the fold ranges from 0 to 9.
  • result: the result score, i.e. the score for the metric that the framework was asked to optimize. For example, for binary classification, the default metrics defined in resources/config.yaml are binary: ['auc', 'acc']; this means that frameworks should try to optimize auc, the final auc score becomes the result value, and the other metrics (here acc) are computed for information only.
  • mode: one of local, docker, aws, or aws+docker, indicating where/how the job was executed.
  • version: the version of the framework being benchmarked.
  • params: a JSON representation of the params defined in the framework definition, if any. This makes it easy to see, for example, whether any tuning was applied.
  • tag: the branch tag of the automlbenchmark app that was running the job.
  • utc: the UTC timestamp at the job completion.
  • duration: the training duration. The framework integration is expected to report this value itself so that it accounts only for the time taken by the framework: when benchmarking large data, the application can spend a significant amount of time preparing the data, and this additional time does not appear in the duration column.
  • models: the total number of models trained by the AutoML framework, for the frameworks that expose this information.
  • seed: the seed or random state passed to the framework. For some frameworks, this is enough to obtain reproducible results. Note that the seed can be specified on the command line using the -Xseed= arg (for example python runbenchmark.py randomforest -Xseed=1452956522); when there are multiple folds, the seed is incremented by the fold number.
  • info: additional information in text format; this usually contains error messages if the job failed.
  • acc, auc, balacc, logloss, mae, r2, rmse: all the metrics computed from the generated predictions. For each job/row, one of them matches the result column; the others are purely informative. These additional metric columns are appended in alphabetical order.
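
As an illustration of how these columns can be consumed, here is a small pandas sketch (not part of the benchmark; the file path is the run folder from the example above) that loads results.csv and pivots the optimized score by task and fold:

import pandas as pd

# Aggregated scores of a single run (adjust the path to your own output folder).
df = pd.read_csv("results/randomforest.test.test.local.20201204T192714/scores/results.csv")

# 'result' always holds the score for the metric named in 'metric';
# the other metric columns (acc, auc, logloss, ...) are informational.
print(df.pivot_table(index="task", columns="fold", values="result"))
print(df.groupby("task")[["result", "duration"]].mean())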

Predictions Directory

For each evaluation, the framework integration must generate a predictions file that the application uses to compute the scores. This predictions file is saved under the predictions subfolder, organized by task and fold as shown above: predictions/{task}/{fold}/predictions.csv.

The csv file contains a header row and the following columns, in order:

  • For classification tasks only, one column per class, sorted alphabetically. Each column contains the probability of the sample belonging to that class, as predicted by the AutoML framework. If a framework does not provide probabilities, the value is 1 for the predicted class and 0 otherwise.
  • predictions: the predictions made on the test predictor data by the model trained by the framework.
  • truth: the true values of the test target data (test.y).

Here are examples of the first few samples for KC2 (binary classification), iris (multiclass classification), and cholesterol (regression):

no,yes,predictions,truth
0.965857617846013,0.034142382153998944,no,no
0.965857617846013,0.034142382153998944,no,no
0.5845,0.4155,no,no
0.6795,0.3205,no,no
0.965857617846013,0.034142382153998944,no,no
Iris-setosa,Iris-versicolor,Iris-virginica,predictions,truth
1.0,0.0,0.0,Iris-setosa,Iris-setosa
0.9715,0.028,0.0005,Iris-setosa,Iris-setosa
1.0,0.0,0.0,Iris-setosa,Iris-setosa
1.0,0.0,0.0,Iris-setosa,Iris-setosa
1.0,0.0,0.0,Iris-setosa,Iris-setosa
0.0,1.0,0.0,Iris-versicolor,Iris-versicolor
predictions,truth
241.204,207.0
248.9575,249.0
302.278,268.0
225.9215,234.0
226.6995,201.0
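
These files can easily be post-processed outside the benchmark. Below is a small sketch (assuming pandas and scikit-learn are installed; the path follows the predictions/{task}/{fold} layout shown earlier) that reloads the kc2 fold 0 predictions and recomputes two of the reported metrics:

import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

run_dir = "results/randomforest.test.test.local.20201204T192714"
preds = pd.read_csv(f"{run_dir}/predictions/kc2/0/predictions.csv")

# 'predictions' and 'truth' are the last two columns; the other columns hold
# one probability per class ('no' and 'yes' here), sorted alphabetically.
acc = accuracy_score(preds["truth"], preds["predictions"])
auc = roc_auc_score(preds["truth"] == "yes", preds["yes"])
print(f"acc={acc:.4f}  auc={auc:.4f}")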

Extract more information

For some frameworks, it is also possible to extract more detailed information, in the form of artifacts that are saved after training. Examples of such artifacts are logs generated by the framework, the models (or descriptions of the models) it trained, and predictions from each of the models trained by the AutoML framework. By default, these artifacts are not saved, and not all frameworks provide the same artifacts; this is why the artifacts to be stored have to be specified in the framework definition (before running the experiments!). By convention, this is done through the params._save_artifacts parameter. For example:

Save model descriptions under the models subfolder:

autosklearn_debug:
  extends: autosklearn
  params:
    _save_artifacts: ['models'] 

Save the leaderboard and models under the models subfolder, and the H2O logs under the logs subfolder:

H2OAutoML_debug:
  extends: H2OAutoML
  params:
    _save_artifacts: ['leaderboard', 'logs', 'models'] 

Save the descriptions of the models on the Pareto front under the models subfolder:

TPOT_debug:
  extends: TPOT
  params:
    _save_artifacts: ['models']

The framework integrations themselves determine where the artifacts are saved; this is typically not configurable from the framework definition.