I have a flatMap that takes a line from an input file, splits it by tab into "words", then returns an array of tuples consisting of every combination of 2 words. I then count the frequency of each combination across the whole file using reduceByKey (where the key is the tuple).
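Assuming the pipeline looks roughly like the following (sketched in plain Python, with a local Counter standing in for reduceByKey; the names and sample lines are illustrative, not from the actual job):

```python
from itertools import combinations
from collections import Counter

def pairs(line):
    """The flatMap step: split on tab, emit every unordered 2-word combination."""
    words = line.split("\t")
    return list(combinations(words, 2))

# The reduceByKey step, simulated locally by counting tuples.
lines = ["a\tb\tc", "a\tb\td"]
counts = Counter(p for line in lines for p in pairs(line))
# ("a", "b") appears in both lines, so its count is 2.
```

Note that a line with n words emits n*(n-1)/2 pairs, so the output grows quadratically with line length.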
I am running this now and getting terrible performance. My guess is that Spark is trying to run 12 tasks in parallel on each machine, since each machine has 12 CPUs, but the memory consumed by each task is far more than 1/12th of the memory on the machine.

Cluster: 16 nodes, 12 CPUs each, 60 GB allocated to Spark on each.
Input file: 119.5 MB / 84134 lines.
After 3 days (3295.7 h across all tasks): shuffle write 197.1 GB / 22701021927 records.

Metrics from the worst-performing task (so far):
Duration: 58.8 h
GC time: 19.0 h
Input Size / Records: 383.9 KB (hadoop) / 153
Write Time: 4 s
Shuffle Write Size / Records: 2.4 GB / 283297353
Shuffle Spill (Memory): 30.4 GB
Shuffle Spill (Disk): 1941.4 MB

I used 400 partitions when reading the input file. I'm guessing this would be the first thing to change. Does anyone have suggestions for what to increase it to?
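A rough back-of-the-envelope from those worst-task numbers suggests why the shuffle explodes (this assumes unordered pairs, i.e. n*(n-1)/2 per line; if the flatMap emits ordered pairs, halve the inferred word count accordingly):

```python
import math

def pairs_for(n):
    """Number of unordered 2-word combinations from a line of n words."""
    return n * (n - 1) // 2

# Worst task: 283,297,353 shuffle records produced from only 153 input lines,
# i.e. roughly 1.85 million pairs per line on average.
per_line = 283_297_353 / 153

# Invert n*(n-1)/2 = per_line to estimate the average words per line:
# n = (1 + sqrt(1 + 8 * per_line)) / 2, which comes out to roughly 1,900 words.
n = (1 + math.sqrt(1 + 8 * per_line)) / 2
```

If some lines really do contain on the order of a couple thousand tab-separated words, each one alone generates millions of output tuples, which would explain both the 2.4 GB shuffle write and the heavy spill from a task whose input is only 383.9 KB.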