You can persist the RDD in (2) right after it is created. That will not
persist it immediately, but rather the first time it is materialized. If
you only persist after (3) has been calculated, then the RDD will be
re-calculated (and persisted) when (4) is calculated.
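A minimal sketch of that placement in spark-shell (where `sc` is predefined), using hypothetical in-memory data in place of the OP's input file:

```scala
import org.apache.spark.storage.StorageLevel

// (1) full dataset; a small in-memory stand-in for sc.textFile (hypothetical data)
val lines = sc.parallelize(Seq("A,1", "B,2", "C,3", "A,4"))

// (2) filter and persist right away; persist is lazy, so nothing runs yet
val ABonly = lines
  .filter(l => l.startsWith("A") || l.startsWith("B"))
  .persist(StorageLevel.MEMORY_ONLY)

// (3) the first action materializes ABonly and caches it
val processA = ABonly.filter(_.startsWith("A")).count()

// (4) the second action reads ABonly from the cache; (1) and (2) are not re-run
val processB = ABonly.filter(_.startsWith("B")).count()
```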
Thanks, Sean!
On Tue, Jan 20, 2015 at 3:32 PM, Sean Owen so...@cloudera.com wrote:
> You can persist the RDD in (2) right after it is created. […]
I found the following to be a good discussion of the same topic:
http://apache-spark-user-list.1001560.n3.nabble.com/The-concurrent-model-of-spark-job-stage-task-td13083.html
From: so...@cloudera.com
Date: Tue, 20 Jan 2015 10:02:20 +
Subject: Re: Does Spark automatically run different stages concurrently
when possible?
To: paliwalash...@gmail.com
CC: davidkl...@hotmail.com; user@spark.apache.org
> You can persist the RDD in (2) right after it is created
Sean,
A related question: when should one persist the RDD, after step 2 or after
step 3? (Nothing would happen before step 3, I assume.)
On Mon, Jan 19, 2015 at 5:17 PM, Sean Owen so...@cloudera.com wrote:
Hi Jon, I am looking for an answer to a similar question in the docs now; so
far, no clue.
I would need to know what Spark's behaviour is in a situation like the example
you provided, but taking into account also that there are multiple
partitions/workers.
I could imagine it's possible that…
+1, I too need to know.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-automatically-run-different-stages-concurrently-when-possible-tp21075p21233.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
From the OP:
(1) val lines = Import full dataset using sc.textFile
(2) val ABonly = Filter out all rows from lines that are not of type A or B
(3) val processA = Process only the A rows from ABonly
(4) val processB = Process only the B rows from ABonly
I assume that (3) and (4) are actions, or else…
Hi John and David,
I tried this to run them concurrently:

    List(rdd1, rdd2, ...).par.foreach { rdd =>
      rdd.collect().foreach(println)
    }

This successfully submitted the jobs, but the parallelism of the stages was
limited: sometimes it was able to run four of them at once, and sometimes
only one.
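For what it's worth, another common way to submit independent jobs concurrently (a sketch, not from the thread; assumes a spark-shell session where `sc` is predefined, and hypothetical RDDs):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// two independent RDDs (hypothetical data)
val rdd1 = sc.parallelize(1 to 100)
val rdd2 = sc.parallelize(1 to 100)

// each Future submits an independent Spark job; the scheduler interleaves
// their tasks as executor slots become free
val f1 = Future { rdd1.map(_ * 2).sum() }
val f2 = Future { rdd2.filter(_ % 2 == 0).count() }

val doubledSum = Await.result(f1, 1.minute) // 2 * (1 + ... + 100) = 10100.0
val evenCount  = Await.result(f2, 1.minute) // 50
```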
Keep in mind that your executors will be able to run some fixed number
of tasks in parallel, given your configuration. You should not
necessarily expect that arbitrarily many RDDs and tasks would schedule
simultaneously.
On Mon, Jan 19, 2015 at 5:34 PM, critikaled isasmani@gmail.com wrote:
You may try changing the scheduling mode to FAIR (the default is FIFO). Take
a look at this page:
https://spark.apache.org/docs/1.1.0/job-scheduling.html#scheduling-within-an-application
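A sketch of setting it programmatically in an application (the property is spark.scheduler.mode; the app name and local master here are illustrative assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-scheduling-sketch")  // hypothetical app name
  .setMaster("local[8]")                 // at most 8 tasks run at once
  .set("spark.scheduler.mode", "FAIR")   // default is FIFO
val sc = new SparkContext(conf)
```

Note that FAIR scheduling changes how concurrently submitted jobs share the fixed pool of task slots; it does not increase the number of slots.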
On Sat, Jan 10, 2015 at 10:24 AM, YaoPau jonrgr...@gmail.com wrote:
I'm looking for ways to reduce the…
From your pseudo-code, it would be sequential, and (1)+(2) would be done
twice: (1)+(2)+(3), then (1)+(2)+(4).
If you do a .cache() in step (2), then you would have (1)+(2)+(3), then (4).
I ran several steps in parallel from the same program, but never using the
same source RDD, so I do not know the limitations there. I simply started…