Step 4 would be count() or collect(). The map() (in step 2) would perform calculations and write information to a DB.

Is this the information that was missing?

Thanks,

Yadid





On 11/30/13 9:24 PM, Mark Hamstra wrote:
Your question doesn't really make any sense without specifying where any RDD actions take place (i.e. where Spark jobs are actually run.) Without any actions, all you've outlined so far are different ways to specify the chain of transformations that should be evaluated when an action is eventually called and a job runs. In a real sense your code hasn't actually done anything yet.
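To spell out the point above with a minimal sketch (assuming a SparkContext `sc` is already available, e.g. in the Spark shell): the map below runs nothing on the cluster until an action is called.

```scala
// Transformations only build up a lineage; no job runs here.
val data = sc.parallelize(1 to 1000)
val mapped = data.map(x => x * 2)   // lazy: nothing computed yet

// Only the action triggers a Spark job that evaluates the chain.
val n = mapped.count()              // job actually runs here
```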


On Sat, Nov 30, 2013 at 6:01 PM, Yadid Ayzenberg <[email protected]> wrote:




    Hi All,

    I'm trying to implement the following and would like to know in
    which places I should be calling RDD.cache():

    Suppose I have a group of RDDs : RDD1 to RDDn as input.

    1. create a single RDD_total = RDD1.union(RDD2)..union(RDDn)

    2. for i = 0 to x:    RDD_total = RDD_total.map (some map function());

    3. return RDD_total.

    I think that I should cache RDD_total in order to optimize the
    iterations. Should I just be calling RDD_total.cache() at the end
    of each iteration? Or should I be performing something more
    elaborate:


    RDD_temp = RDD_total.map (some map function());
    RDD_total.unpersist();
    RDD_total = RDD_temp.cache();
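
    Spelled out as a loop, the more elaborate pattern above might look
    like the sketch below (a sketch only: `someMapFunction`, the input
    RDDs, and the iteration count `x` are placeholders from the outline,
    and forcing materialization with an action before unpersisting the
    parent is one common way to avoid recomputing the old lineage):

    ```scala
    // Step 1: union the inputs into one RDD (placeholders: rdd1..rddN).
    var rddTotal = rdd1.union(rdd2).union(rddN)
    rddTotal.cache()

    // Step 2: x map iterations, swapping the cached RDD each time.
    for (i <- 0 until x) {
      val rddTemp = rddTotal.map(someMapFunction).cache()
      rddTemp.count()        // action: materialize the new RDD first
      rddTotal.unpersist()   // then drop the previous iteration's cache
      rddTotal = rddTemp
    }
    ```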



    Thanks,
    Yadid







