Thanks, Sean!

On Tue, Jan 20, 2015 at 3:32 PM, Sean Owen <so...@cloudera.com> wrote:
> You can persist the RDD in (2) right after it is created. That will not
> persist it immediately, but rather the first time it is materialized.
> If you instead persist only after (3) has been computed, then it will
> be re-calculated (and persisted) when (4) is computed.
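
The reply above can be sketched concretely. This is a minimal, hypothetical driver program, not code from the original post: the input path and the `isTypeA`/`isTypeB`/`processARow`/`processBRow` helpers are placeholders.

```scala
// Step (1): lazy -- nothing is read from disk yet
val lines = sc.textFile("hdfs:///path/to/input")

// Step (2): still lazy; mark for caching right after creation.
// persist() only registers intent -- materialization happens on the first action.
val ABonly = lines.filter(l => isTypeA(l) || isTypeB(l)).persist()

// Step (3): first action -- computes (1) and (2), caching ABonly as a side effect
val countA = ABonly.filter(isTypeA).map(processARow).count()

// Step (4): second action -- reads ABonly from the cache instead of re-reading the file
val countB = ABonly.filter(isTypeB).map(processBRow).count()
```

If `persist()` were called only after step (3) ran, the `count()` in step (4) would recompute the whole lineage from the text file before the cache was populated.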
>
> On Tue, Jan 20, 2015 at 3:38 AM, Ashish <paliwalash...@gmail.com> wrote:
>> Sean,
>>
>> A related question: when should we persist the RDD, after step 2 or after
>> step 3? (Nothing would happen before step 3, I assume.)
>>
>> On Mon, Jan 19, 2015 at 5:17 PM, Sean Owen <so...@cloudera.com> wrote:
>>> From the OP:
>>>
>>> (1) val lines = Import full dataset using sc.textFile
>>> (2) val ABonly = Filter out all rows from "lines" that are not of type A or B
>>> (3) val processA = Process only the A rows from ABonly
>>> (4) val processB = Process only the B rows from ABonly
>>>
>>> I assume that 3 and 4 are actions, or else nothing happens here at all.
>>>
>>> When 3 is invoked, it will compute 1, then 2, then 3. 4 will happen
>>> after 3, and may even cause 1 and 2 to happen again if nothing is
>>> persisted.
>>>
>>> You can invoke 3 and 4 in parallel on the driver if you like. That's
>>> fine. But actions are blocking in the driver.
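
Since actions block the calling thread, invoking (3) and (4) concurrently means submitting them from separate driver threads. A hedged sketch using Scala `Future`s (again assuming the placeholder `ABonly`, `isTypeA`/`isTypeB`, and `process*Row` names from the example); the Spark scheduler can then run the two jobs concurrently if resources allow:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each count() is an action that blocks its own thread, not the whole driver
val fA = Future { ABonly.filter(isTypeA).map(processARow).count() }
val fB = Future { ABonly.filter(isTypeB).map(processBRow).count() }

// Block the main driver thread until both jobs have finished
val countA = Await.result(fA, Duration.Inf)
val countB = Await.result(fB, Duration.Inf)
```

Persisting `ABonly` first still matters here: without it, whichever job loses the race may recompute steps (1) and (2) on its own.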
>>>
>>>
>>>
>>> On Mon, Jan 19, 2015 at 8:21 AM, davidkl <davidkl...@hotmail.com> wrote:
>>>> Hi Jon, I am looking for an answer for a similar question in the doc
>>>> now, so far no clue.
>>>>
>>>> I would need to know what Spark's behaviour is in a situation like the
>>>> example you provided, but also taking into account that there are
>>>> multiple partitions/workers.
>>>> I could imagine that different Spark workers are not synchronized in
>>>> terms of waiting for each other to progress to the next step/stage for
>>>> the partitions of data they are assigned, while I believe in streaming
>>>> they would wait for the current batch to complete before starting work
>>>> on a new one.
>>>>
>>>> In the code I am working on, I need to make sure a particular step is
>>>> completed (in all workers, for all partitions) before the next
>>>> transformation is applied.
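
One common way to get the barrier described here, assuming the goal is simply "materialize a step everywhere before building on it", is to persist the RDD and force it with a cheap action. This is only a sketch with hypothetical names (`previous`, `expensiveTransform`, `nextTransform`), not code from the thread:

```scala
// Materialize "step" across all partitions before anything depends on it
val step = previous.map(expensiveTransform).persist()
step.count()  // action: forces every partition to be computed and cached

// Anything derived from `step` now reads the cached data rather than
// recomputing it. Note that within a single job, Spark already inserts
// stage boundaries (barriers) at shuffles, so an explicit action is only
// needed when you want a completion guarantee between separate jobs.
val next = step.map(nextTransform)
```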
>>>>
>>>> It would be great if someone could clarify or point to these issues in
>>>> the doc! :-)
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context: 
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-automatically-run-different-stages-concurrently-when-possible-tp21075p21227.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>
>>
>>
>>



-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
