Hi,

I was running the spark-itemsimilarity job and it's taking a long time on the
final saveAsTextFile, specifically in the flatMap step in AtB.scala. On
further inspection, the shuffle spill is very large; my guess is that this is
causing the drastic slowdown.

Does anyone have suggestions for a good split between
spark.shuffle.memoryFraction and spark.storage.memoryFraction for the
spark-itemsimilarity job? In other words, how much caching does the algorithm
as implemented in Mahout use?
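In case it helps make the question concrete, the kind of rebalance I have in mind would look roughly like this with a plain spark-submit (the 0.4 values are just guesses to shift memory from RDD caching toward shuffle, not recommendations, and I'm not sure how Mahout's driver passes these through):

```shell
# Sketch only: rebalance executor memory between shuffle and cache
# (spark.shuffle.memoryFraction defaults to 0.2 and
#  spark.storage.memoryFraction defaults to 0.6 in Spark 1.x)
spark-submit \
  --conf spark.shuffle.memoryFraction=0.4 \
  --conf spark.storage.memoryFraction=0.4 \
  ...
```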

I’ll run some more tests by tweaking the different parameters on larger 
datasets and share my findings.

Thank you,
Nikaash Puri
