You can persist the RDD in (2) right after it is created. Calling persist does not materialize it immediately; it is persisted the first time an action computes it, which here is when (3) runs. If you call persist only after (3) has been calculated, (3) gets no benefit from it: the RDD in (2) will be re-calculated (and persisted at that point) when (4) is calculated.
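A minimal sketch of the ordering described above, assuming a local master and a small in-memory dataset standing in for sc.textFile. The object name, the app name, the "A,..."/"B,..." row format, and the use of count() as the "processing" are all hypothetical, chosen only to make the example self-contained:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistBeforeFirstAction {
  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name, for a standalone run.
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // (1) Stand-in for sc.textFile(...); the row format is hypothetical.
      val lines = sc.parallelize(Seq("A,1", "B,2", "C,3", "A,4"))

      // (2) Keep only A and B rows and mark the result for persistence.
      // persist() is lazy: nothing is computed or cached at this point.
      val ABonly = lines
        .filter(l => l.startsWith("A,") || l.startsWith("B,"))
        .persist(StorageLevel.MEMORY_ONLY)

      // (3) First action: computes (1) and (2), caching ABonly as it goes.
      val processA = ABonly.filter(_.startsWith("A,")).count()

      // (4) Second action: can read ABonly from the cache instead of
      // recomputing (1) and (2), as long as the cached partitions fit in memory.
      val processB = ABonly.filter(_.startsWith("B,")).count()

      println(s"A rows: $processA, B rows: $processB")
    } finally {
      sc.stop()
    }
  }
}
```

Moving the persist call to after (3) would still be valid code, but (3) would run uncached and (4) would then recompute (1) and (2) once more before the cache is populated.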
On Tue, Jan 20, 2015 at 3:38 AM, Ashish <paliwalash...@gmail.com> wrote:
> Sean,
>
> A related question. When to persist the RDD after step 2 or after Step
> 3 (nothing would happen before step 3 I assume)?
>
> On Mon, Jan 19, 2015 at 5:17 PM, Sean Owen <so...@cloudera.com> wrote:
>> From the OP:
>>
>> (1) val lines = Import full dataset using sc.textFile
>> (2) val ABonly = Filter out all rows from "lines" that are not of type A or B
>> (3) val processA = Process only the A rows from ABonly
>> (4) val processB = Process only the B rows from ABonly
>>
>> I assume that 3 and 4 are actions, or else nothing happens here at all.
>>
>> When 3 is invoked, it will compute 1, then 2, then 3. 4 will happen
>> after 3, and may even cause 1 and 2 to happen again if nothing is
>> persisted.
>>
>> You can invoke 3 and 4 in parallel on the driver if you like. That's
>> fine. But actions are blocking in the driver.
>>
>> On Mon, Jan 19, 2015 at 8:21 AM, davidkl <davidkl...@hotmail.com> wrote:
>>> Hi Jon, I am looking for an answer for a similar question in the doc now, so
>>> far no clue.
>>>
>>> I would need to know what is spark behaviour in a situation like the example
>>> you provided, but taking into account also that there are multiple
>>> partitions/workers.
>>>
>>> I could imagine it's possible that different spark workers are not
>>> synchronized in terms of waiting for each other to progress to the next
>>> step/stage for the partitions of data they get assigned, while I believe in
>>> streaming they would wait for the current batch to complete before they
>>> start working on a new one.
>>>
>>> In the code I am working on, I need to make sure a particular step is
>>> completed (in all workers, for all partitions) before next transformation is
>>> applied.
>>>
>>> Would be great if someone could clarify or point to these issues in the doc!
>>> :-)
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-automatically-run-different-stages-concurrently-when-possible-tp21075p21227.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>
> --
> thanks
> ashish