I checked RDD#randomSplit; it is much more like multiple one-to-one transformations than a one-to-multiple transformation.
I wrote the following sample code; it generates 3 stages. Although we can use cache here to make it better, if Spark supported multiple outputs, only 2 stages would be needed. (This would be useful for Pig's multi-query execution and Hive's self-joins.)

val data = sc.textFile("/Users/jzhang/a.log").flatMap(line => line.split("\\s")).map(w => (w, 1))
val parts = data.randomSplit(Array(0.2, 0.8))
val joinResult = parts(0).join(parts(1))
println(joinResult.toDebugString)

(1) MapPartitionsRDD[8] at join at WordCount.scala:22 []
 |  MapPartitionsRDD[7] at join at WordCount.scala:22 []
 |  CoGroupedRDD[6] at join at WordCount.scala:22 []
 +-(1) PartitionwiseSampledRDD[4] at randomSplit at WordCount.scala:21 []
 |  |  MapPartitionsRDD[3] at map at WordCount.scala:20 []
 |  |  MapPartitionsRDD[2] at flatMap at WordCount.scala:20 []
 |  |  /Users/jzhang/a.log MapPartitionsRDD[1] at textFile at WordCount.scala:20 []
 |  |  /Users/jzhang/a.log HadoopRDD[0] at textFile at WordCount.scala:20 []
 +-(1) PartitionwiseSampledRDD[5] at randomSplit at WordCount.scala:21 []
    |  MapPartitionsRDD[3] at map at WordCount.scala:20 []
    |  MapPartitionsRDD[2] at flatMap at WordCount.scala:20 []
    |  /Users/jzhang/a.log MapPartitionsRDD[1] at textFile at WordCount.scala:20 []
    |  /Users/jzhang/a.log HadoopRDD[0] at textFile at WordCount.scala:20 []

On Wed, Jun 3, 2015 at 2:45 PM, Sean Owen <so...@cloudera.com> wrote:

> In the sense here, Spark actually does have operations that make multiple
> RDDs, like randomSplit. However, there is no equivalent of a partition
> operation that gives the elements that matched and did not match at once.
>
> On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang <zjf...@gmail.com> wrote:
>
>> As far as I know, Spark doesn't support multiple outputs.
>>
>> On Wed, Jun 3, 2015 at 2:15 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Why do you need to do that if the filter and the content of the
>>> resulting RDDs are exactly the same? You may as well declare them as 1 RDD.
>>> On 3 Jun 2015 15:28, "ÐΞ€ρ@Ҝ (๏̯͡๏)" <deepuj...@gmail.com> wrote:
>>>
>>>> I want to do this:
>>>>
>>>> val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)
>>>>
>>>> val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)
>>>>
>>>> This will run as two different stages. Can this be done in one stage?
>>>>
>>>> val (qtSessionsWithQt, guidUidMapSessions) = rawQtSession.*magicFilter*(_._2.qualifiedTreatmentId != NULL_VALUE)
>>>>
>>>> --
>>>> Deepak
>>
>> --
>> Best Regards
>>
>> Jeff Zhang

--
Best Regards

Jeff Zhang
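For what it's worth, the single-pass split that Deepak sketches as *magicFilter* is exactly what Scala's `partition` does on local collections, and a small iterator experiment also shows the double scan that the extra stage implies. This is a minimal local sketch, no Spark involved; the `touches` counter and the toy data are made up for illustration:

```scala
// Local sketch (no Spark): Scala's partition already has the semantics of
// the hypothetical magicFilter above -- it returns both the matching and
// the non-matching elements from a single pass. The `touches` counter is a
// made-up probe showing that two independent filters scan the source twice,
// the local analogue of the extra stage in the randomSplit example.
object MagicFilterSketch {
  def main(args: Array[String]): Unit = {
    var touches = 0
    def source: Iterator[Int] = Iterator.tabulate(10) { i => touches += 1; i }

    // Two separate filters: each traversal produces every element again,
    // just as two stages recompute an uncached parent RDD.
    val evens = source.filter(_ % 2 == 0).toList
    val odds  = source.filter(_ % 2 != 0).toList
    println(touches) // 20: every element was produced twice

    touches = 0
    // One traversal that yields both outputs at once.
    val (e2, o2) = source.toList.partition(_ % 2 == 0)
    println(touches) // 10: every element was produced once
    assert(evens == e2 && odds == o2)
  }
}
```

On an RDD, caching the parent before the two filters avoids the recomputation but still runs two jobs over the cached data, so it is a workaround rather than a true multi-output operation.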