Thanks, Sean!

On Tue, Jan 20, 2015 at 3:32 PM, Sean Owen &lt;so...@cloudera.com&gt; wrote:
> You can persist the RDD from (2) right after it is created. That will not
> cause it to be persisted immediately, but rather the first time it is
> materialized. If you only persist after (3) is calculated, then it will be
> re-calculated (and then persisted) when (4) is calculated.
>
> On Tue, Jan 20, 2015 at 3:38 AM, Ashish &lt;paliwalash...@gmail.com&gt; wrote:
>> Sean,
>>
>> A related question: when should the RDD be persisted, after step 2 or
>> after step 3? (Nothing would actually happen before step 3, I assume.)
>>
>> On Mon, Jan 19, 2015 at 5:17 PM, Sean Owen &lt;so...@cloudera.com&gt; wrote:
>>> From the OP:
>>>
>>> (1) val lines = Import full dataset using sc.textFile
>>> (2) val ABonly = Filter out all rows from "lines" that are not of type A or B
>>> (3) val processA = Process only the A rows from ABonly
>>> (4) val processB = Process only the B rows from ABonly
>>>
>>> I assume that 3 and 4 are actions, or else nothing happens here at all.
>>>
>>> When 3 is invoked, it will compute 1, then 2, then 3. 4 will happen
>>> after 3, and may even cause 1 and 2 to happen again if nothing is
>>> persisted.
>>>
>>> You can invoke 3 and 4 in parallel on the driver if you like. That's
>>> fine. But actions are blocking in the driver.
>>>
>>> On Mon, Jan 19, 2015 at 8:21 AM, davidkl &lt;davidkl...@hotmail.com&gt; wrote:
>>>> Hi Jon, I am looking for an answer to a similar question in the docs
>>>> now; so far, no clue.
>>>>
>>>> I would need to know what Spark's behaviour is in a situation like the
>>>> example you provided, but taking into account also that there are
>>>> multiple partitions/workers.
>>>>
>>>> I could imagine that different Spark workers are not synchronized in
>>>> terms of waiting for each other to progress to the next step/stage for
>>>> the partitions of data they are assigned, while I believe in Streaming
>>>> they would wait for the current batch to complete before starting work
>>>> on a new one.
>>>>
>>>> In the code I am working on, I need to make sure a particular step is
>>>> completed (in all workers, for all partitions) before the next
>>>> transformation is applied.
>>>>
>>>> It would be great if someone could clarify this or point to where these
>>>> issues are covered in the docs! :-)
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-automatically-run-different-stages-concurrently-when-possible-tp21075p21227.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>> --
>> thanks
>> ashish
>>
>> Blog: http://www.ashishpaliwal.com/blog
>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
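To make Sean's point about persist placement concrete, here is a minimal pure-Scala sketch (no Spark dependency, so names like abOnly are illustrative stand-ins, not Spark API): a "transformation" is modelled as a thunk that counts how often it runs, and "persisting" is modelled with a lazy val, which is declared immediately but computed only on first use and cached for later uses. It shows that persisting right after step (2) computes nothing by itself, and that without it, actions (3) and (4) both recompute the filter.

```scala
object LazyCacheDemo {
  // Counts how often the "filter" transformation actually executes.
  var filterRuns = 0

  // Step (2) as a lazy "transformation": returns a thunk; nothing runs
  // until an "action" forces it.
  def abOnly(lines: Seq[String]): () => Seq[String] =
    () => {
      filterRuns += 1
      lines.filter(l => l.startsWith("A") || l.startsWith("B"))
    }

  def run(): (Int, Int, Int) = {
    val lines = Seq("A1", "B1", "C1")

    // Without persisting: two "actions" force the same pipeline twice,
    // so the filter is recomputed for each.
    val unpersisted = abOnly(lines)
    unpersisted()
    unpersisted()
    val runsWithoutPersist = filterRuns // 2

    filterRuns = 0
    // "Persisting" after step (2), modelled by lazy val: declared now,
    // computed on first use, cached for every later use.
    lazy val persisted = abOnly(lines)()
    val runsBeforePersistUsed = filterRuns // 0: persist alone computes nothing
    persisted.count(_.startsWith("A"))     // action (3) materializes it
    persisted.count(_.startsWith("B"))     // action (4) reuses the cached result
    val runsAfterBothActions = filterRuns  // 1: computed once, shared by (3) and (4)

    (runsWithoutPersist, runsBeforePersistUsed, runsAfterBothActions)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In real Spark code the analogue is calling `.persist()` on ABonly right after the filter in step (2); the storage happens on the first action, exactly as the lazy val here.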
--
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
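P.S. Sean's note that "actions are blocking in the driver" but can still be invoked in parallel can be sketched with plain Scala Futures. This is not Spark code: processA and processB are hypothetical stand-ins for blocking RDD actions such as count(); each Future submits one of them on a separate driver thread so the two jobs can run concurrently.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelActions {
  // Stand-ins for actions (3) and (4); each blocks the thread that calls it,
  // the way an RDD action blocks the driver thread.
  def processA(): Long = { Thread.sleep(100); 3L }
  def processB(): Long = { Thread.sleep(100); 4L }

  def run(): Long = {
    // Each Future wraps one blocking "action"; submitting both before
    // awaiting either lets them run concurrently, so wall time is roughly
    // one action rather than two.
    val fa = Future(processA())
    val fb = Future(processB())
    val a  = Await.result(fa, 5.seconds)
    val b  = Await.result(fb, 5.seconds)
    a + b
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Whether the two jobs actually execute simultaneously on the cluster then depends on available executor resources and the scheduler, not on the driver-side threading alone.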