There is a method in org.apache.spark.mllib.util.MLUtils called "kFold"
which will automatically partition your dataset into k train/test splits;
you can then build k different models on the training splits and aggregate
the results.

For example (a very rough sketch - assuming I want to do 10-fold cross
validation on a binary classification model on a file with 1000 features in
it):

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Load LIBSVM-format data (multiclass = false, 1000 features)
val dat = MLUtils.loadLibSVMFile(sc, "path/to/data", false, 1000)

// 10 (train, test) RDD pairs, using 42 as the random seed
val cvdat = MLUtils.kFold(dat, 10, 42)

// Train a model on each training split and evaluate it on the held-out split
val modelErrors = cvdat.map { case (train, test) =>
  val model = LogisticRegressionWithSGD.train(train, 100, 0.1, 1.0)
  val error = computeError(model, test)
  (model, error)
}

// Average error across the 10 folds:
val avgError = modelErrors.map(_._2).reduce(_ + _) / modelErrors.length

Here, I'm assuming you've got some "computeError" function defined. Note
that many of these APIs are marked "experimental" and thus might change in
a future Spark release.
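
In case it's useful, here's a very rough sketch of what such a
"computeError" function might look like - this is just my assumption of
plain 0/1 classification error (fraction of misclassified test points),
not anything that ships with MLlib:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.ClassificationModel

// Fraction of test points whose predicted label differs from the true label
def computeError(model: ClassificationModel, test: RDD[LabeledPoint]): Double = {
  val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
  predictionAndLabel.filter { case (pred, label) => pred != label }.count.toDouble / test.count
}

You could of course swap in whatever metric you actually care about (e.g.
area under ROC via BinaryClassificationMetrics).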


On Tue, Jun 24, 2014 at 6:44 AM, Eustache DIEMERT <eusta...@diemert.fr>
wrote:

> I'm interested in this topic too :)
>
> Are the MLLib core devs on this list ?
>
> E/
>
>
> 2014-06-24 14:19 GMT+02:00 holdingonrobin <robinholdin...@gmail.com>:
>
>> Does anyone know anything about it? Or should I actually move this topic to a
>> MLlib-specific mailing list? Any information is appreciated! Thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-K-fold-validation-in-spark-1-0-tp8142p8172.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
