From http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ :

Look at the number of partitions in the parent RDD and then keep
multiplying that by 1.5 until performance stops improving.

FYI
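
For example, reduceByKey takes an explicit partition count for the
shuffle, so you can sweep it upward between runs. A sketch (the pair-RDD
name "pairs" is assumed, not from your post):

    // Parent RDD has 400 partitions; try 400 * 1.5^k until runtimes
    // stop improving: 400 -> 600 -> 900 -> 1350 -> 2025 -> ...
    val numPartitions = 900  // second step of the sweep

    // Pass the partition count directly into the shuffle:
    val counts = pairs.reduceByKey(_ + _, numPartitions)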

On Thu, Jun 11, 2015 at 7:57 AM, matthewrj <mle...@gmail.com> wrote:

> I have a flatMap that takes a line from an input file, splits it by tab
> into "words", then returns an array of tuples consisting of every
> combination of 2 words. I then count the frequency of each combination
> across the whole file using reduceByKey (where the key is the tuple).
>
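> For concreteness, a minimal sketch of that pipeline (the input path and
> variable names here are assumed, not the actual code):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     val sc = new SparkContext(new SparkConf().setAppName("PairCounts"))
>
>     // Read the input file; the path is hypothetical.
>     val lines = sc.textFile("hdfs:///data/input.tsv", 400)
>
>     // Emit every unordered pair of words on a line. A line with n words
>     // produces n*(n-1)/2 pairs, so output grows quadratically with line
>     // length -- which is why the shuffle write can dwarf the input.
>     val pairs = lines.flatMap { line =>
>       val words = line.split("\t")
>       for {
>         i <- words.indices
>         j <- (i + 1) until words.length
>       } yield ((words(i), words(j)), 1L)
>     }
>
>     // Count each pair's frequency across the whole file.
>     val counts = pairs.reduceByKey(_ + _)
>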
> I am running this now and I am getting terrible performance. My guess is
> that Spark is trying to run 12 tasks in parallel on each machine, since
> each machine has 12 CPUs, but the memory consumed by each task is far more
> than 1/12th of the memory on each machine.
>
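> For reference, Spark runs one task per core by default (spark.task.cpus
> defaults to 1), so 12 concurrent tasks would share the 60 GB heap, about
> 5 GB each. A sketch of how that setting trades concurrency for per-task
> memory (whether it helps here is an open question):
>
>     // Reserve 2 CPUs per task: at most 6 concurrent tasks on a 12-core
>     // machine, so each task sees ~10 GB of the 60 GB heap, not ~5 GB.
>     val conf = new SparkConf().set("spark.task.cpus", "2")
>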
> Metrics:
> 16-node cluster, 12 CPUs per node, 60 GB allocated to Spark per node.
> Input file: 119.5 MB / 84134 lines
> After 3 days (3295.7 h summed across all tasks) - Shuffle write:
> 197.1 GB / 22701021927 records
>
> Here are the metrics from the worst performing task (so far):
> Duration: 58.8 h
> GC time: 19.0 h
> Input Size / Records: 383.9 KB (hadoop) / 153
> Write Time: 4 s
> Shuffle Write Size / Records: 2.4 GB / 283297353
> Shuffle Spill (Memory): 30.4 GB
> Shuffle Spill (Disk): 1941.4 MB
>
> I used 400 partitions when reading the input file. I'm guessing that's the
> first thing to change. Does anyone have suggestions for what to increase it
> to?
