I checked RDD#randomSplit; it is much more like multiple one-to-one transformations than a one-to-multiple transformation.
I wrote the following sample code; it generates 3 stages. Although we can use cache here to make it better, if Spark supported multiple outputs, only 2 stages would be needed. (This would be useful for Pig's multi-query execution and Hive's self-joins.)

val data = sc.textFile("/Users/jzhang/a.log").flatMap(line => line.split("\\s")).map(w => (w, 1))
val parts = data.randomSplit(Array(0.2, 0.8))
val joinResult = parts(0).join(parts(1))
println(joinResult.toDebugString)

(1) MapPartitionsRDD[8] at join at WordCount.scala:22 []
 |  MapPartitionsRDD[7] at join at WordCount.scala:22 []
 |  CoGroupedRDD[6] at join at WordCount.scala:22 []
 +-(1) PartitionwiseSampledRDD[4] at randomSplit at WordCount.scala:21 []
 |  |  MapPartitionsRDD[3] at map at WordCount.scala:20 []
 |  |  MapPartitionsRDD[2] at flatMap at WordCount.scala:20 []
 |  |  /Users/jzhang/a.log MapPartitionsRDD[1] at textFile at WordCount.scala:20 []
 |  |  /Users/jzhang/a.log HadoopRDD[0] at textFile at WordCount.scala:20 []
 +-(1) PartitionwiseSampledRDD[5] at randomSplit at WordCount.scala:21 []
    |  MapPartitionsRDD[3] at map at WordCount.scala:20 []
    |  MapPartitionsRDD[2] at flatMap at WordCount.scala:20 []
    |  /Users/jzhang/a.log MapPartitionsRDD[1] at textFile at WordCount.scala:20 []
    |  /Users/jzhang/a.log HadoopRDD[0] at textFile at WordCount.scala:20 []

On Wed, Jun 3, 2015 at 2:45 PM, Sean Owen <so...@cloudera.com> wrote:

> In the sense here, Spark actually does have operations that make multiple
> RDDs, like randomSplit. However, there is no equivalent of a partition
> operation that gives the elements that matched and did not match at once.
>
> On Wed, Jun 3, 2015, 8:32 AM Jeff Zhang <zjf...@gmail.com> wrote:
>
>> As far as I know, Spark doesn't support multiple outputs.
>>
>> On Wed, Jun 3, 2015 at 2:15 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Why do you need to do that if the filter and the content of the
>>> resulting RDDs are exactly the same? You may as well declare them as 1 RDD.
>>> On 3 Jun 2015 15:28, "ÐΞ€ρ@Ҝ (๏̯͡๏)" <deepuj...@gmail.com> wrote:
>>>
>>>> I want to do this:
>>>>
>>>> val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId != NULL_VALUE)
>>>>
>>>> val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId == NULL_VALUE)
>>>>
>>>> This will run as two different stages. Can this be done in one stage?
>>>>
>>>> val (qtSessionsWithQt, guidUidMapSessions) = rawQtSession.*magicFilter*(_._2.qualifiedTreatmentId != NULL_VALUE)
>>>>
>>>> --
>>>> Deepak
>>
>> --
>> Best Regards
>>
>> Jeff Zhang

--
Best Regards

Jeff Zhang
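For what it's worth, the single-pass split that Deepak sketches as *magicFilter* is exactly what Scala's `partition` does on local collections, and a small iterator experiment also shows the double scan that the extra stage implies. This is a minimal local sketch, no Spark involved; the `touches` counter and the toy data are made up for illustration:

```scala
// Local sketch (no Spark): Scala's partition already has the semantics of
// the hypothetical magicFilter above -- it returns both the matching and
// the non-matching elements from a single pass. The `touches` counter is a
// made-up probe showing that two independent filters scan the source twice,
// the local analogue of the extra stage in the randomSplit example.
object MagicFilterSketch {
  def main(args: Array[String]): Unit = {
    var touches = 0
    def source: Iterator[Int] = Iterator.tabulate(10) { i => touches += 1; i }

    // Two separate filters: each traversal produces every element again,
    // just as two stages recompute an uncached parent RDD.
    val evens = source.filter(_ % 2 == 0).toList
    val odds  = source.filter(_ % 2 != 0).toList
    println(touches) // 20: every element was produced twice

    touches = 0
    // One traversal that yields both outputs at once.
    val (e2, o2) = source.toList.partition(_ % 2 == 0)
    println(touches) // 10: every element was produced once
    assert(evens == e2 && odds == o2)
  }
}
```

On an RDD, caching the parent before the two filters avoids the recomputation but still runs two jobs over the cached data, so it is a workaround rather than a true multi-output operation.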