I think in Spark 1.6 this became more flexible: with the unified memory
manager you only specify max/min thresholds instead of fixed per-region
fractions.
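
Roughly, the difference looks like this (a sketch only; the "legacy" and
"unified" names are mine and the values are illustrative, so check the docs
for your Spark version for the actual defaults):

  import org.apache.spark.SparkConf

  // Pre-1.6: two fixed, separately tuned regions
  val legacy = new SparkConf()
    .set("spark.shuffle.memoryFraction", "0.4")  // execution/shuffle memory
    .set("spark.storage.memoryFraction", "0.4")  // RDD cache memory

  // 1.6+: unified manager; execution and storage borrow from each other
  val unified = new SparkConf()
    .set("spark.memory.fraction", "0.75")        // total managed heap fraction
    .set("spark.memory.storageFraction", "0.5")  // eviction-protected storage floor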

Yes, shuffle spills in Spark during multiplication are humongous. I tried a
few hacks, but that's Spark; it's one of the known bottlenecks,
unfortunately. You are welcome to try and hack A'B too. My personal
conviction is that Spark is not a good fit for all-to-all exchanges, which
are a superset of what happens during multiplications.

The reasons are (a sketch of the multiplication pattern follows this list):
(1) Spills are inevitable once I/O exceeds the input size; there is no
direct message passing.
(2) No multicast support, so messages get duplicated and re-serialized per
receiver.
(3) No asynchronous message exchange, so computation cannot overlap with
I/O.
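
To make the all-to-all point concrete, here is a minimal sketch of the
generic row-outer-product formulation of A'B on Spark RDDs. This is not
Mahout's actual AtB.scala, and the names (atb, rowsA, rowsB) are mine; it
just shows the shape of the computation. The flatMap emits a partial
product for every element pair, and the combine that follows is the
all-to-all shuffle that spills:

  import org.apache.spark.rdd.RDD

  // rowsA(i) is the i-th row of A, rowsB(i) the i-th row of B,
  // both keyed by row index.
  def atb(rowsA: RDD[(Long, Array[Double])],
          rowsB: RDD[(Long, Array[Double])]): RDD[((Int, Int), Double)] =
    rowsA.join(rowsB)                  // align rows of A and B by row index
      .flatMap { case (_, (a, b)) =>   // emit partial products a(i) * b(j)
        for (i <- a.indices; j <- b.indices) yield ((i, j), a(i) * b(j))
      }
      .reduceByKey(_ + _)              // all-to-all combine of the partials

Every joined row pair fans out into |a| * |b| shuffle records, which is why
the shuffle write can dwarf the input.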

There are a couple of other problems with the architecture, but they are
less significant in this particular case.

There are some ideas here, but it will probably take some time before
things get better.

Also, I thought spark-itemsimilarity was doing some subsampling to make
things a bit easier. Perhaps I had the wrong impression.



On Tue, Apr 19, 2016 at 6:17 AM, Nikaash Puri <nikaashp...@gmail.com> wrote:

> Hi,
>
> I was running the spark-itemsimilarity code and it's taking a long time on
> the final saveAsTextFile, specifically on the flatMap step in AtB.scala.
> On further inspection, it looks like the shuffle spill is very large; my
> guess is that this is causing a drastic slowdown.
>
> Does anyone have any ideas about a good split between
> spark.shuffle.memoryFraction and spark.storage.memoryFraction for the
> spark-itemsimilarity job? In other words, how much caching does the
> algorithm as implemented in Mahout use?
>
> I’ll run some more tests by tweaking the different parameters on larger
> datasets and share my findings.
>
> Thank you,
> Nikaash Puri
