I wanted to check whether anybody has seen similar behaviour or knows what we might be doing wrong. We have a relatively complex Spark application which processes half a terabyte of data across various stages. We have profiled it in several ways, and everything points to one place where 90% of the time is spent: AppendOnlyMap.changeValue. The job scales and is faster than its MapReduce alternative, but it still feels slower than it should be. I suspect too much spilling, but I haven't seen any improvement from increasing the number of partitions to 10k. Any ideas would be appreciated.
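
One possible reading of the partition result, as a toy model only: if the time goes into the per-record map update itself rather than into spilling, then adding partitions shrinks each in-memory map but leaves the total number of updates unchanged. The sketch below is plain Python, not Spark code. The names `change_value` and `aggregate` are hypothetical, and `change_value` is only loosely modelled on an update-in-place map operation that receives a "had a value" flag and the old value.

```python
def change_value(table, key, update):
    """Toy update-in-place: store update(had_value, old_value) under key."""
    had_value = key in table
    table[key] = update(had_value, table.get(key))
    return table[key]


def aggregate(records, num_partitions):
    """Hash-partition (key, value) records and sum the values per key.

    Returns (total update calls, size of the largest per-partition map,
    merged per-key sums).
    """
    partitions = [dict() for _ in range(num_partitions)]
    calls = 0
    for key, value in records:
        table = partitions[hash(key) % num_partitions]
        # One update per input record, regardless of the partition count.
        change_value(table, key, lambda had, old, v=value: old + v if had else v)
        calls += 1
    largest = max(len(t) for t in partitions)
    merged = {}
    for t in partitions:
        merged.update(t)  # keys are disjoint across partitions
    return calls, largest, merged


# Hypothetical data: 100,000 records spread over 1,000 distinct keys.
records = [(i % 1000, 1) for i in range(100_000)]

if __name__ == "__main__":
    for n in (10, 100, 1000):
        calls, largest, _ = aggregate(records, n)
        print(f"partitions={n} update_calls={calls} largest_map={largest}")
```

In this model the largest per-partition map gets smaller as partitions are added, while the update count stays equal to the number of records. If the real job behaves similarly, reducing the number of records that reach the aggregation may matter more than the partition count, though that is an assumption the profile would need to confirm.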
--
Michal Haris
Technical Architect
direct line: +44 (0) 207 749 0229
www.visualdna.com | t: +44 (0) 207 734 7033