The RDD API has a sample method where you can supply the flag for sampling with or without replacement as well as the fraction to sample; the related takeSample method takes the same flag but returns an exact number of elements to the driver rather than a fraction. See the sketch below.
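A minimal PySpark sketch of both calls; the app name, RDD contents, and variable names are made up for illustration:

```python
from pyspark import SparkContext

sc = SparkContext(appName="sampling-demo")

# Stand-in for a large distributed data set.
big_rdd = sc.parallelize(range(1000000))

# sample: approximate fraction, without replacement; the result
# stays distributed as a new RDD.
explore_rdd = big_rdd.sample(withReplacement=False, fraction=0.01, seed=42)

# takeSample: exact count, without replacement; the result is a
# local Python list on the driver, so keep num small.
local_rows = big_rdd.takeSample(withReplacement=False, num=1000, seed=42)
```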
On Nov 14, 2015 2:51 AM, "Andy Davidson" <a...@santacruzintegration.com> wrote:

> In R, it's easy to split a data set into training, cross-validation, and
> test sets. Is there something like this in spark.ml? I am using Python for
> now.
>
> My real problem is that I want to randomly select a relatively small data
> set to do some initial data exploration. It's not clear to me how, using
> Spark, I could create a random sample from a large data set. I would
> prefer to sample without replacement.
>
> I have not tried to use SparkR yet. I assume I would not be able to use
> the caret package with Spark ML.
>
> Kind regards
>
> Andy
>
> ```{R}
> inTrain <- createDataPartition(y=csv$classe, p=0.7, list=FALSE)
> trainSetDF <- csv[inTrain,]
> testSetDF <- csv[-inTrain,]
> ```
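For the train/validation/test split in the quoted question, DataFrame.randomSplit is the closest Spark analogue to the caret snippet above. A sketch with a toy DataFrame (the column names and weights are assumptions); note that unlike createDataPartition, randomSplit does not stratify on the class column:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="split-demo")
sqlContext = SQLContext(sc)

# Toy DataFrame standing in for the parsed csv data.
csv_df = sqlContext.createDataFrame(
    [(i, "a" if i % 2 else "b") for i in range(1000)],
    ["id", "classe"])

# Each row is assigned to one split at random; weights are
# normalized, so [0.6, 0.2, 0.2] and [60, 20, 20] behave the same.
trainDF, cvDF, testDF = csv_df.randomSplit([0.6, 0.2, 0.2], seed=42)
```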