R User Group Meetup, September 2017

Who is OpenML for?

  • Domain scientists
  • Data analysts
  • Algorithm developers
  • Students
  • Teachers

OpenML with R

Run cforest on task 10 (https://www.openml.org/t/10)

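library(OpenML)   # getOMLTask, runTaskMlr, uploadOMLRun, listOMLRunEvaluations
library(mlr)      # makeLearner
library(ggplot2)  # ggplot, aes, geom_point

# Download task 10 from OpenML
tsk.id <- 10
tsk <- getOMLTask(tsk.id)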

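# Define the learner and run it on the task
lrn <- makeLearner("classif.cforest", ntree = 1000, mtry = 4)
run <- runTaskMlr(task = tsk, learner = lrn)
# Upload the run (needs an OpenML API key, see setOMLConfig)
rn.id <- uploadOMLRun(run, confirm.upload = FALSE)
# Fetch the evaluations of all runs on this task
evals <- listOMLRunEvaluations(task.id = tsk.id)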

# Keep only mlr flows and flag our own run
mlrevals <- evals[grep("mlr", evals$flow.name), ]
mlrevals$my.run <- mlrevals$run.id == rn.id
# Compare predictive accuracy across flows
ggplot(mlrevals, aes(x = predictive.accuracy,
                     y = flow.name,
                     color = my.run)) +
  geom_point()

Checking a new algorithm

Papers

G. Casalicchio, J. Bossek, M. Lang, D. Kirchhoff, P. Kerschke, B. Hofner, H. Seibold, J. Vanschoren, and B. Bischl.
OpenML: An R package to connect to the machine learning platform OpenML.
Computational Statistics, 2017.
doi: 10.1007/s00180-017-0742-2.

J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo.
OpenML: Networked science in machine learning.
SIGKDD Explorations, 15(2):49–60, 2014.
doi: 10.1145/2641190.2641198.

Backup slides

ctree control parameters

splittest
a logical that switches variable selection from linear statistics (the default, FALSE) to maximally selected statistics. Currently requires testtype = "MonteCarlo".

lookahead
a logical determining whether a split is implemented only after checking if tests in both daughter nodes can be performed.

intersplit
a logical indicating whether splits in numeric variables are simply x <= a (the default) or interpolated x <= (a + b) / 2. The latter feature is experimental; see Galili and Meilijson (2016).

nmax
an integer defining the number of bins each variable is divided into prior to tree building. The default Inf does not apply any binning. Highly experimental, use at your own risk.
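
A minimal sketch of how two of these control parameters could be passed to a tree, assuming the partykit package is installed. The iris data set and the object names (fit_default, ctrl, fit_inter) are illustrative choices and not part of the original slides; splittest and nmax are left at their defaults here.

```r
# Sketch only: assumes partykit is installed; iris is a stand-in data set.
library(partykit)

# Default control: splits in numeric variables have the form x <= a
fit_default <- ctree(Species ~ ., data = iris)

# intersplit = TRUE: interpolated split points x <= (a + b) / 2 (experimental)
# lookahead = TRUE: split only if tests can be performed in both daughter nodes
ctrl <- ctree_control(intersplit = TRUE, lookahead = TRUE)
fit_inter <- ctree(Species ~ ., data = iris, control = ctrl)

# Print both trees to compare the split points
print(fit_default)
print(fit_inter)
```

Comparing the two printed trees shows where the interpolated split points differ from the default ones.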

Resources