There is a method in org.apache.spark.mllib.util.MLUtils called "kFold" which will automatically partition your dataset into k train/test splits, at which point you can build k different models and aggregate the results.
For example (a very rough sketch - assuming I want to do 10-fold cross validation of a binary classification model on a file with 1000 features in it):

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val dat = MLUtils.loadLibSVMFile(sc, "path/to/data", false, 1000)
val cvdat = MLUtils.kFold(dat, 10, 42)
val modelErrors = cvdat.map { case (train, test) =>
  val model = LogisticRegressionWithSGD.train(train, 100, 0.1, 1.0)
  val error = computeError(model, test)
  (model, error)
}
// Average error:
val avgError = modelErrors.map(_._2).reduce(_ + _) / modelErrors.length

Here, I'm assuming you've got some "computeError" function defined. Note that many of these APIs are marked "experimental" and thus might change in a future Spark release.

On Tue, Jun 24, 2014 at 6:44 AM, Eustache DIEMERT <eusta...@diemert.fr> wrote:
> I'm interested in this topic too :)
>
> Are the MLlib core devs on this list?
>
> E/
>
> 2014-06-24 14:19 GMT+02:00 holdingonrobin <robinholdin...@gmail.com>:
>
>> Anyone knows anything about it? Or should I actually move this topic to a
>> MLlib specific mailing list? Any information is appreciated! Thanks!
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-K-fold-validation-in-spark-1-0-tp8142p8172.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
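[Editor's note] For completeness, here is one rough sketch of what a "computeError" helper could look like - this is a hypothetical implementation, not part of the original message, assuming it should return the misclassification rate of the trained model on the held-out fold:

```scala
// Hypothetical computeError (not from the original post): fraction of
// test points the binary classifier gets wrong.
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.LogisticRegressionModel

def computeError(model: LogisticRegressionModel,
                 test: RDD[LabeledPoint]): Double = {
  // Pair each prediction with its true label.
  val predictionsAndLabels = test.map { point =>
    (model.predict(point.features), point.label)
  }
  // Error rate = misclassified count / total count.
  val wrong = predictionsAndLabels.filter {
    case (pred, label) => pred != label
  }.count()
  wrong.toDouble / test.count()
}
```

This needs a live SparkContext and a cluster (or local mode) to actually run, so treat it as a starting point rather than tested code.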