From http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ :
Look at the number of partitions in the parent RDD and then keep multiplying that by 1.5 until performance stops improving. FYI On Thu, Jun 11, 2015 at 7:57 AM, matthewrj <mle...@gmail.com> wrote: > I have a flatmap that takes a line from an input file, splits it by tab > into > "words" then returns an array of tuples consisting of every combination of > 2 > words. I then go on to count the frequency of each combination across the > whole file using a reduce by key (where the key is the tuple). > > I am running this now and I am getting terrible performance. My guess is > that Spark is trying to run 12 tasks in parallel since each machine has 12 > cpus but the memory consumed by each task is far more than 1/12th of the > memory on each machine. > > Metrics: > 16 node cluster, 12 cpus each, 60GB allocated to spark each. > Input file: 119.5 MB / 84134 lines > After 3 days (3295.7 h across all tasks) - Shuffle write: 197.1 GB / > 22701021927 > > Here are the metrics from the worst performing task (so far): > Duration: 58.8 h > GC time: 19.0 h > Input Size / Records: 383.9 KB (hadoop) / 153 > Write Time: 4 s > Shuffle Write Size / Records: 2.4 GB / 283297353 > Shuffle Spill (Memory): 30.4 GB > Shuffle Spill (Disk): 1941.4 MB > > I used 400 partitions when reading the input file. I'm guessing this would > be the first thing to change. Does anyone have any suggestions of what to > increase it to? > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/How-to-deal-with-an-explosive-flatmap-tp23277.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >