Hi Dana,

It’s hard to tell exactly what is consuming time, but I’d suggest starting by 
profiling the single application first. Three things to look at there:

1) How many stages, and how many tasks per stage, is Spark launching (check the 
application web UI at http://<driver>:4040)? If you have hundreds of tasks for a 
file this small, the task launch time alone might be a problem. You can use 
RDD.coalesce() to reduce the number of data partitions.
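For example, a minimal sketch (the input path, the "lines" name, and the target 
partition count of 8 are all made up, not taken from your job):

    // Collapse a small file's many input partitions into a handful so Spark
    // launches far fewer tasks per stage.
    val lines = sc.textFile("hdfs://namenode/path/input")  // possibly hundreds of partitions
    val fewer = lines.coalesce(8)                          // 8 partitions, no shuffle by default
    fewer.count()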

2) If you run a Java profiler (e.g. YourKit or hprof) on the workers while the 
application is executing, where is time being spent? Maybe some of your code is 
more expensive than it seems. One other thing you might find is that some code 
you use requires synchronization and is therefore not scaling properly to 
multiple cores (e.g. Java’s Math.random() actually does that).
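As an illustration of that synchronization point (a sketch only; "data" and the 
sampling fraction are invented), one workaround is a per-thread generator inside 
mapPartitions instead of Math.random():

    import java.util.concurrent.ThreadLocalRandom

    // Math.random() funnels every core through one shared generator; a
    // thread-local generator per partition removes that contention.
    val sampled = data.mapPartitions { iter =>
      val rnd = ThreadLocalRandom.current()
      iter.filter(_ => rnd.nextDouble() < 0.1)
    }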

3) Are there any RDDs that are used over and over but not cached? In that case 
they’ll be recomputed on each use.
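For instance (the file name and parseRecord are hypothetical), an RDD consumed 
by two actions should be cached so the second action does not redo the whole 
lineage:

    // Without cache(), both count() and take() would recompute the map().
    val parsed = sc.textFile("input.txt").map(parseRecord).cache()
    val total  = parsed.count()
    val sample = parsed.take(5)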

Once you've looked into these, it might be easier to improve the multiple-job 
case. There, as others have pointed out, running the jobs in the same 
SparkContext and using the fair scheduler 
(http://spark.apache.org/docs/latest/job-scheduling.html) should work.
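A minimal sketch of that setup (the app name and pool name are placeholders; 
spark.scheduler.mode and setLocalProperty are the settings described on that 
page):

    import org.apache.spark.{SparkConf, SparkContext}

    // One shared context with the fair scheduler enabled.
    val conf = new SparkConf()
      .setAppName("shared-context")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Each concurrent job (thread) can optionally be assigned to its own pool.
    sc.setLocalProperty("spark.scheduler.pool", "requests")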

Matei

On Mar 9, 2014, at 5:56 AM, Livni, Dana <dana.li...@intel.com> wrote:

> YARN also has this scheduling option.
> All of our applications have the same flow, where the first stage is the 
> heaviest and the rest are very small.
> The problem is that when several requests (applications) start to run at the 
> same time, the first stages of all of them are scheduled in parallel, and for 
> some reason they delay each other. A stage that takes around 13s on its own 
> can take up to 2m when running in parallel with other identical stages 
> (around 15 of them).
> 
> 
> 
> -----Original Message-----
> From: elyast [mailto:lukasz.jastrzeb...@gmail.com] 
> Sent: Friday, March 07, 2014 20:01
> To: u...@spark.incubator.apache.org
> Subject: Re: major Spark performance problem
> 
> Hi,
> 
> There is also an option to run Spark applications on top of Mesos in 
> fine-grained mode; then fair scheduling is possible (applications run in 
> parallel and Mesos is responsible for scheduling all tasks), so in a sense 
> all applications will progress in parallel. Obviously the total may not be 
> any faster, but the benefit is the fair scheduling (small jobs will not be 
> stuck behind the big ones).
> 
> Best regards
> Lukasz Jastrzebski
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/major-Spark-performance-problem-tp2364p2403.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
