You need to convert the DataFrame to an RDD, call sampleByKey, and then apply the schema back to create a DataFrame.

  val df: DataFrame = ...
  val schema = df.schema
  val sampledRDD = df.rdd.keyBy(r => r.getAs[Int](0)).sampleByKey(...).values
  val sampled = sqlContext.createDataFrame(sampledRDD, schema)

Hopefully this would be much easier in 1.5.

Best,
Xiangrui

On Mon, May 11, 2015 at 12:32 PM, Karthikeyan Muthukumar <mkarthiksw...@gmail.com> wrote:
> Hi,
> I'm in Spark 1.3.0 and my data is in DataFrames.
> I need operations like sampleByKey(), sampleByKeyExact().
> I saw the JIRA "Add approximate stratified sampling to DataFrame"
> (https://issues.apache.org/jira/browse/SPARK-7157).
> That's targeted for Spark 1.5, till that comes through, whats the easiest
> way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on
> DataFrames.
> Thanks & Regards
> MK
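
For reference, the workaround from the reply (DataFrame to RDD, sample by key, then back to a DataFrame) could be filled in roughly as follows. This is only a sketch against the Spark 1.3 RDD API: the object and helper names (StratifiedSampling, sampleByKeyDF), the keyIndex parameter, the assumption that the key column is an Int, and the shape of the fractions map are all illustrative and not part of the original reply.

```scala
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

object StratifiedSampling {
  // Hypothetical helper (not a Spark API): stratified sampling of a DataFrame
  // by round-tripping through the pair-RDD sampling methods.
  def sampleByKeyDF(
      sqlContext: SQLContext,
      df: DataFrame,
      keyIndex: Int,                // assumed: position of an Int key column
      fractions: Map[Int, Double],  // assumed: one sampling fraction per key
      seed: Long,
      exact: Boolean = false): DataFrame = {
    // Remember the schema, since RDD[Row] does not carry it along.
    val schema = df.schema
    // Key every row by the chosen column so the pair-RDD methods apply.
    val keyed = df.rdd.keyBy((r: Row) => r.getInt(keyIndex))
    val sampled =
      if (exact) {
        // Aims for the exact per-key sample size, at the cost of extra passes.
        keyed.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = seed)
      } else {
        // Single pass; per-key sample sizes are only approximate.
        keyed.sampleByKey(withReplacement = false, fractions = fractions, seed = seed)
      }
    // Drop the keys and reattach the original schema.
    sqlContext.createDataFrame(sampled.values, schema)
  }
}
```

A call might look like StratifiedSampling.sampleByKeyDF(sqlContext, df, 0, Map(1 -> 0.5, 2 -> 0.1), seed = 42L). Note that the fractions map is expected to contain an entry for every key that occurs in the data; a missing key would fail at runtime when the sampler looks it up.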