I have a flatMap that takes a line from an input file, splits it by tab into
"words", and returns an array of tuples consisting of every combination of
two words. I then count the frequency of each combination across the whole
file with a reduceByKey (where the key is the tuple).
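
For reference, this is roughly what the job looks like. This is a simplified
sketch, not my exact code; the path and the way I build the pairs are just
for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("word-pair-counts"))

    // Read the tab-separated input; 400 is the partition count I mention below.
    val lines = sc.textFile("input.tsv", 400)

    val pairCounts = lines
      .flatMap { line =>
        val words = line.split("\t")
        // Every unordered pair of words on the line, each paired with a count of 1.
        for {
          i <- words.indices
          j <- (i + 1) until words.length
        } yield ((words(i), words(j)), 1L)
      }
      .reduceByKey(_ + _)   // sum the 1s per (word, word) key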

I am running this now and getting terrible performance. My guess is that
Spark is running 12 tasks in parallel on each machine, since each machine has
12 CPUs, but the memory consumed by each task is far more than 1/12th of the
memory available on that machine.
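
If that theory holds, one thing I am wondering about is capping how many
tasks run concurrently per executor so each task gets a bigger slice of the
heap. A rough sketch of the kind of settings I mean; the exact values, and
whether spark.executor.cores is the right knob for my deployment mode, are
just guesses on my part:

    import org.apache.spark.SparkConf

    // Illustrative values only: fewer concurrent tasks per executor means each
    // task competes with fewer siblings for the 60 GB executor heap.
    val conf = new SparkConf()
      .set("spark.executor.memory", "60g")
      .set("spark.executor.cores", "4")   // instead of all 12 CPUs on the node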

Metrics:
16-node cluster, 12 CPUs per node, 60 GB allocated to Spark per node.
Input file: 119.5 MB / 84134 lines
After 3 days (3295.7 h across all tasks) - Shuffle write: 197.1 GB /
22701021927 records

Here are the metrics from the worst performing task (so far):
Duration: 58.8 h 
GC time: 19.0 h 
Input Size / Records: 383.9 KB (hadoop) / 153
Write Time: 4 s
Shuffle Write Size / Records: 2.4 GB / 283297353
Shuffle Spill (Memory): 30.4 GB
Shuffle Spill (Disk): 1941.4 MB

I used 400 partitions when reading the input file. I'm guessing this is the
first thing to change. Does anyone have a suggestion for what to increase it
to?
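
Concretely, I could either ask for more input splits up front or bump the
partition count going into the shuffle; something like the following, where
the 4000 is an arbitrary placeholder and `pairs` stands for the output of the
flatMap above:

    // Option 1: request more input partitions when reading (placeholder value).
    val lines = sc.textFile("input.tsv", 4000)

    // Option 2: give reduceByKey more output partitions so each shuffle task
    // handles a smaller slice of the exploded pairs.
    val pairCounts = pairs.reduceByKey(_ + _, 4000)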



