One of the best discussions on the mailing list :-) ... Please help me conclude --
The whole discussion concludes that:

1. The framework does not support increasing the parallelism of an
   arbitrary task through any built-in function.
2. The user has to manually write logic to filter the output of an
   upstream node in the DAG in order to manage the input to downstream
   nodes (like shuffle grouping etc. in Storm).
3. If we want to increase the level of parallelism of the Twitter
   streaming receiver to get a higher rate for the DStream of tweets
   (i.e. to increase the input rate), how is that possible?

    val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)

On Fri, May 8, 2015 at 2:16 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> 1. Will rdd2.filter run before rdd1.filter finishes?
>
> YES
>
> 2. We have to traverse the RDD twice. Any comments?
>
> You can invoke filter, or whatever other transformation / function, many
> times.
>
> PS: you have to study / learn the parallel programming model of an OO
> framework like Spark – in any OO framework a lot of behavior is hidden /
> encapsulated by the framework, and the client code gets invoked at
> specific points in the flow of control / data via callback functions.
>
> That's why stuff like RDD.filter(), RDD.filter() may look "sequential" to
> you, but it is not.
>
> *From:* Bill Q [mailto:bill.q....@gmail.com]
> *Sent:* Thursday, May 7, 2015 6:27 PM
> *To:* Evo Eftimov
> *Cc:* u...@spark.apache.org
> *Subject:* Re: Map one RDD into two RDD
>
> The multi-threading code in Scala is quite simple and you can google it
> pretty easily. We used the Future framework. You can use Akka as well.
>
> @Evo My concerns with the filtering solution are: 1. Will rdd2.filter run
> before rdd1.filter finishes? 2. We have to traverse the RDD twice. Any
> comments?
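[Editor's note on question 3 above: a common pattern for raising the input rate of a receiver-based stream is to create several receivers and union their DStreams. The sketch below assumes the spark-streaming-twitter artifact is on the classpath and that `Utils.getAuth` returns an `Option[twitter4j.auth.Authorization]` as in the snippet above; `numReceivers` is illustrative.]

```scala
// Sketch: increase ingestion parallelism with several receivers,
// then union their DStreams into one stream of tweets.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val conf = new SparkConf().setAppName("MultiReceiverTwitter")
val ssc  = new StreamingContext(conf, Seconds(2))

val numReceivers = 4 // illustrative; each receiver occupies one executor core
val streams = (1 to numReceivers).map { _ =>
  TwitterUtils.createStream(ssc, Utils.getAuth) // Utils.getAuth as in the thread
}
val tweetStream = ssc.union(streams) // single DStream fed by all receivers
```

[Caveat: multiple connections to Twitter's public sample stream may receive largely overlapping data, so this mainly helps when the receivers consume distinct filters or partitions of the source.]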
On Thursday, May 7, 2015, Evo Eftimov <evo.efti...@isecc.com> wrote:

> Scala is a language; Spark is an OO/functional, distributed framework
> facilitating parallel programming in a distributed environment.
>
> Any "Scala parallelism" occurs within the parallel model imposed by the
> Spark OO framework, i.e. it is limited in terms of what it can achieve in
> influencing the Spark framework's behavior – that is the nature of
> programming with/for frameworks.
>
> When RDD1 and RDD2 are partitioned and different actions are applied to
> them, this results in parallel pipelines / DAGs within the Spark framework:
>
>     RDD1 = RDD.filter()
>     RDD2 = RDD.filter()
>
> *From:* Bill Q [mailto:bill.q....@gmail.com]
> *Sent:* Thursday, May 7, 2015 4:55 PM
> *To:* Evo Eftimov
> *Cc:* u...@spark.apache.org
> *Subject:* Re: Map one RDD into two RDD
>
> Thanks for the replies. We decided to use concurrency in Scala to do the
> two mappings from the same source RDD in parallel. So far, it seems to be
> working. Any comments?
>
> On Wednesday, May 6, 2015, Evo Eftimov <evo.efti...@isecc.com> wrote:
>
>     RDD1 = RDD.filter()
>     RDD2 = RDD.filter()
>
> *From:* Bill Q [mailto:bill.q....@gmail.com]
> *Sent:* Tuesday, May 5, 2015 10:42 PM
> *To:* u...@spark.apache.org
> *Subject:* Map one RDD into two RDD
>
> Hi all,
>
> I have a large RDD that I map a function over. Based on the nature of
> each record in the input RDD, I will generate two types of data. I would
> like to save each type into its own RDD, but I can't seem to find an
> efficient way to do it. Any suggestions?
>
> Many thanks.
>
> Bill
>
> --
> Many thanks.
>
> Bill

--
Thanks & Regards,
Anshu Shukla
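[Editor's note: the filter-based split Evo proposes can be sketched as below, with two additions addressing Bill's concerns: `cache()` so the source RDD is materialized once rather than recomputed by each filter, and Scala Futures so the two save actions are submitted to Spark concurrently. The input path, output paths, and the `startsWith` predicate are illustrative, not from the thread.]

```scala
// Sketch: split one source RDD into two by filtering, caching the
// source so both filters read the same materialized blocks, and
// running the two save actions from separate driver-side threads.
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val sc = new SparkContext(new SparkConf().setAppName("SplitRDD"))

val source = sc.textFile("hdfs:///input") // illustrative path
source.cache() // materialize once; avoids traversing the lineage twice

val typeA = source.filter(_.startsWith("A"))  // illustrative predicate
val typeB = source.filter(!_.startsWith("A"))

// Each action runs in its own thread, so Spark can schedule the two
// jobs concurrently (see also spark.scheduler.mode=FAIR).
val jobs = Seq(
  Future { typeA.saveAsTextFile("hdfs:///out/typeA") },
  Future { typeB.saveAsTextFile("hdfs:///out/typeB") }
)
jobs.foreach(Await.ready(_, Duration.Inf))
```

[Whether the two jobs actually overlap depends on free cluster capacity and the scheduler mode; with the default FIFO scheduler they may still run largely one after the other.]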