Step 4 would be count() or collect(). The map() (in step 2) would perform calculations and write information to a DB.

Is this the information that was missing?

Thanks,

Yadid





On 11/30/13 9:24 PM, Mark Hamstra wrote:
Your question doesn't really make any sense without specifying where any RDD actions take place (i.e. where Spark jobs are actually run.) Without any actions, all you've outlined so far are different ways to specify the chain of transformations that should be evaluated when an action is eventually called and a job runs. In a real sense your code hasn't actually done anything yet.
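To spell out the point above with a minimal sketch (assuming a SparkContext `sc` is already available, e.g. in the Spark shell): the map below runs nothing on the cluster until an action is called.

```scala
// Transformations only build up a lineage; no job runs here.
val data = sc.parallelize(1 to 1000)
val mapped = data.map(x => x * 2)   // lazy: nothing computed yet

// Only the action triggers a Spark job that evaluates the chain.
val n = mapped.count()              // job actually runs here
```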


On Sat, Nov 30, 2013 at 6:01 PM, Yadid Ayzenberg <[email protected]> wrote:




    Hi All,

    I'm trying to implement the following and would like to know in
    which places I should be calling RDD.cache():

    Suppose I have a group of RDDs : RDD1 to RDDn as input.

    1. create a single RDD_total = RDD1.union(RDD2)..union(RDDn)

    2. for i = 0 to x:    RDD_total = RDD_total.map (some map function());

    3. return RDD_total.

    I think that I should cache RDD_total in order to optimize the
    iterations. Should I just be calling RDD_total.cache() at the end
    of each iteration? Or should I be performing something more
    elaborate:


    RDD_temp = RDD_total.map (some map function());
    RDD_total.unpersist();
    RDD_total = RDD_temp.cache();
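
    Spelled out as a loop, the more elaborate pattern above might look
    like the sketch below (a sketch only: `someMapFunction`, the input
    RDDs, and the iteration count `x` are placeholders from the outline,
    and forcing materialization with an action before unpersisting the
    parent is one common way to avoid recomputing the old lineage):

    ```scala
    // Step 1: union the inputs into one RDD (placeholders: rdd1..rddN).
    var rddTotal = rdd1.union(rdd2).union(rddN)
    rddTotal.cache()

    // Step 2: x map iterations, swapping the cached RDD each time.
    for (i <- 0 until x) {
      val rddTemp = rddTotal.map(someMapFunction).cache()
      rddTemp.count()        // action: materialize the new RDD first
      rddTotal.unpersist()   // then drop the previous iteration's cache
      rddTotal = rddTemp
    }
    ```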



    Thanks,
    Yadid







