The RDD API has a takeSample method where you can supply a flag for sampling
with or without replacement along with the number of elements you want back;
the related sample method takes a fraction instead of a count and returns a
new RDD.
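
For example, here is a minimal untested PySpark sketch, assuming an existing
SparkContext named sc (the RDD contents are placeholders):

```python
rdd = sc.parallelize(list(range(1000)))

# sample() returns a new RDD containing roughly `fraction` of the data;
# the first argument turns replacement on or off
small = rdd.sample(False, 0.01, seed=42)

# takeSample() returns an exact number of elements to the driver as a
# local Python list, so keep the count small
ten = rdd.takeSample(False, 10, seed=42)

# randomSplit() covers the train / cross-validation / test use case,
# producing disjoint RDDs with roughly the given weights
train, cv, test = rdd.randomSplit([0.6, 0.2, 0.2], seed=17)
```

randomSplit is also available on DataFrames, which may be more convenient
if you are working with spark.ml.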
On Nov 14, 2015 2:51 AM, "Andy Davidson" <a...@santacruzintegration.com>
wrote:

> In R, it's easy to split a data set into training, cross-validation, and
> test sets. Is there something like this in spark.ml? I am using Python for
> now.
>
> My real problem is that I want to randomly select a relatively small data
> set to do some initial data exploration. It's not clear to me how I could
> use Spark to create a random sample from a large data set. I would prefer
> to sample without replacement.
>
> I have not tried SparkR yet. I assume I would not be able to use the
> caret package with Spark ML.
>
> Kind regards
>
> Andy
>
> ```{R}
> library(caret)
>
> # hold out 70% of the rows for training, stratified on the outcome `classe`
> inTrain <- createDataPartition(y = csv$classe, p = 0.7, list = FALSE)
> trainSetDF <- csv[inTrain, ]
> testSetDF  <- csv[-inTrain, ]
> ```
>
>
