Hi, I was running the spark-itemsimilarity code and it's taking a long time on the final saveAsTextFile, specifically on the flatMap step in AtB.scala. On further inspection, the shuffle spill is very large; my guess is that this is what's causing the drastic slowdown.
Does anyone have any ideas about a good split between spark.shuffle.memoryFraction and spark.storage.memoryFraction for the spark-itemsimilarity job? In other words, how much caching does the algorithm as implemented in Mahout actually use? I'll run some more tests by tweaking the different parameters on larger datasets and share my findings. Thank you, Nikaash Puri
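For reference, here is the kind of tweak I mean, as a sketch only. I'm assuming a legacy (pre-1.6) memory manager where these two fractions are independent knobs, and the 0.6/0.2 values are just a guess that favors shuffle over RDD caching; the input/output paths are placeholders:

```shell
# Hypothetical spark-submit configuration shifting memory toward shuffle
# (values are illustrative, not tested recommendations):
spark-submit \
  --conf spark.shuffle.memoryFraction=0.6 \
  --conf spark.storage.memoryFraction=0.2 \
  --class org.apache.mahout.drivers.ItemSimilarityDriver \
  mahout-spark-jar-with-dependencies.jar \
  --input /path/to/input.txt \
  --output /path/to/output
```

If the algorithm does little caching, lowering spark.storage.memoryFraction should free heap for shuffle buffers and reduce spilling, which is what I want to verify.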