One of the best discussions on the mailing list :-) ... Please help me conclude --
The whole discussion concludes that:

1. The framework does not support increasing the parallelism of an
   arbitrary task through any built-in function.
2. The user has to manually write logic to filter the output of an
   upstream node in the DAG in order to manage the input to downstream
   nodes (like shuffle grouping etc. in Storm).
3. If we want to increase the level of parallelism of the Twitter
   streaming receiver to get a higher rate for the DStream of tweets
   (i.e. to increase the input rate), how is that possible?

    val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)

On Fri, May 8, 2015 at 2:16 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> 1. Will rdd2.filter run before rdd1.filter finishes?
>
> YES
>
> 2. We have to traverse the RDD twice. Any comments?
>
> You can invoke filter, or whatever other transformation / function, many
> times.
>
> PS: you have to study / learn the parallel programming model of an OO
> framework like Spark – in any OO framework a lot of behavior is hidden /
> encapsulated by the framework, and the client code gets invoked at
> specific points in the flow of control / data via callback functions.
>
> That's why stuff like RDD.filter(), RDD.filter() may look "sequential" to
> you, but it is not.
>
> *From:* Bill Q [mailto:bill.q....@gmail.com]
> *Sent:* Thursday, May 7, 2015 6:27 PM
> *To:* Evo Eftimov
> *Cc:* u...@spark.apache.org
> *Subject:* Re: Map one RDD into two RDD
>
> The multi-threading code in Scala is quite simple and you can google it
> pretty easily. We used the Future framework. You can use Akka as well.
>
> @Evo My concerns with the filtering solution are: 1. Will rdd2.filter run
> before rdd1.filter finishes? 2. We have to traverse the RDD twice. Any
> comments?
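[Editor's note on question 3 above: a common pattern for raising the input rate of a receiver-based stream is to create several receivers and union their DStreams. The sketch below assumes the spark-streaming-twitter artifact is on the classpath and that `Utils.getAuth` returns an `Option[twitter4j.auth.Authorization]` as in the snippet above; `numReceivers` is illustrative.]

```scala
// Sketch: increase ingestion parallelism with several receivers,
// then union their DStreams into one stream of tweets.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val conf = new SparkConf().setAppName("MultiReceiverTwitter")
val ssc  = new StreamingContext(conf, Seconds(2))

val numReceivers = 4 // illustrative; each receiver occupies one executor core
val streams = (1 to numReceivers).map { _ =>
  TwitterUtils.createStream(ssc, Utils.getAuth) // Utils.getAuth as in the thread
}
val tweetStream = ssc.union(streams) // single DStream fed by all receivers
```

[Caveat: multiple connections to Twitter's public sample stream may receive largely overlapping data, so this mainly helps when the receivers consume distinct filters or partitions of the source.]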
On Thursday, May 7, 2015, Evo Eftimov <evo.efti...@isecc.com> wrote:

> Scala is a language; Spark is an OO/functional, distributed framework
> facilitating parallel programming in a distributed environment.
>
> Any "Scala parallelism" occurs within the parallel model imposed by the
> Spark OO framework, i.e. it is limited in terms of what it can achieve in
> influencing the Spark framework's behavior – that is the nature of
> programming with/for frameworks.
>
> When RDD1 and RDD2 are partitioned and different actions are applied to
> them, this results in parallel pipelines / DAGs within the Spark framework:
>
>     RDD1 = RDD.filter()
>     RDD2 = RDD.filter()
>
> *From:* Bill Q [mailto:bill.q....@gmail.com]
> *Sent:* Thursday, May 7, 2015 4:55 PM
> *To:* Evo Eftimov
> *Cc:* u...@spark.apache.org
> *Subject:* Re: Map one RDD into two RDD
>
> Thanks for the replies. We decided to use concurrency in Scala to do the
> two mappings from the same source RDD in parallel. So far, it seems to be
> working. Any comments?
>
> On Wednesday, May 6, 2015, Evo Eftimov <evo.efti...@isecc.com> wrote:
>
>     RDD1 = RDD.filter()
>     RDD2 = RDD.filter()
>
> *From:* Bill Q [mailto:bill.q....@gmail.com]
> *Sent:* Tuesday, May 5, 2015 10:42 PM
> *To:* u...@spark.apache.org
> *Subject:* Map one RDD into two RDD
>
> Hi all,
>
> I have a large RDD that I map a function over. Based on the nature of
> each record in the input RDD, I will generate two types of data. I would
> like to save each type into its own RDD, but I can't seem to find an
> efficient way to do it. Any suggestions?
>
> Many thanks.
>
> Bill
>
> --
> Many thanks.
>
> Bill

--
Thanks & Regards,
Anshu Shukla
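[Editor's note: the filter-based split Evo proposes can be sketched as below, with two additions addressing Bill's concerns: `cache()` so the source RDD is materialized once rather than recomputed by each filter, and Scala Futures so the two save actions are submitted to Spark concurrently. The input path, output paths, and the `startsWith` predicate are illustrative, not from the thread.]

```scala
// Sketch: split one source RDD into two by filtering, caching the
// source so both filters read the same materialized blocks, and
// running the two save actions from separate driver-side threads.
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val sc = new SparkContext(new SparkConf().setAppName("SplitRDD"))

val source = sc.textFile("hdfs:///input") // illustrative path
source.cache() // materialize once; avoids traversing the lineage twice

val typeA = source.filter(_.startsWith("A"))  // illustrative predicate
val typeB = source.filter(!_.startsWith("A"))

// Each action runs in its own thread, so Spark can schedule the two
// jobs concurrently (see also spark.scheduler.mode=FAIR).
val jobs = Seq(
  Future { typeA.saveAsTextFile("hdfs:///out/typeA") },
  Future { typeB.saveAsTextFile("hdfs:///out/typeB") }
)
jobs.foreach(Await.ready(_, Duration.Inf))
```

[Whether the two jobs actually overlap depends on free cluster capacity and the scheduler mode; with the default FIFO scheduler they may still run largely one after the other.]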