Forgot to mention: a rank of 100 usually works OK, but 120 consistently cannot finish.
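In case it helps, this is roughly what my call looks like (Spark 1.3.1 MLlib Python API). It's just a sketch: the checkpoint directory, input path, lambda, and alpha below are placeholders, not my actual values.

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext(appName="implicit-als")

# ALS only checkpoints its intermediate RDDs (by default every 10 iterations)
# when a checkpoint directory is set; without one, the long lineages can make
# the later iterations fragile.
sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  # placeholder path

# ratings is an RDD of (user, item, confidence) triples built upstream.
ratings = sc.pickleFile("hdfs:///data/implicit-ratings")  # placeholder input

model = ALS.trainImplicit(
    ratings,
    rank=100,       # 100 finishes; 120 consistently does not
    iterations=12,
    lambda_=0.01,   # placeholder regularization
    alpha=40.0,     # placeholder confidence scaling
)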
On Fri, Jun 26, 2015 at 10:18 AM, Ravi Mody <rmody...@gmail.com> wrote:

> 1. These are my settings:
> rank = 100
> iterations = 12
> users = ~20M
> items = ~2M
> training examples = ~500M-1B (I'm running into the issue even with 500M training examples)
>
> 2. The memory storage never seems to go too high. The user blocks may go up to ~10 GB, and each executor has only a few GB in use out of 30 GB free. Everything seems small compared to the amount of memory I'm allocating.
>
> 3. I think I have plenty of disk space - is this on the executors or the driver? Is there a way to tell whether the error is coming from disk space?
>
> 4. I'm not changing the checkpointing settings, but I believe checkpointing defaults to every 10 iterations? Notably, the crashes often start on or after the 9th iteration, so it may be related to checkpointing - but this could just be a coincidence.
>
> Thanks!
>
> On Fri, Jun 26, 2015 at 1:08 AM, Ayman Farahat <ayman.fara...@yahoo.com> wrote:
>
>> Was there any resolution to that problem?
>> I am also hitting it with PySpark 1.4:
>> 380 million observations
>> 100 factors and 5 iterations
>> Thanks
>> Ayman
>>
>> On Jun 23, 2015, at 6:20 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>>> It shouldn't be hard to handle 1 billion ratings in 1.3. I just need more information to guess what happened:
>>>
>>> 1. Could you share the ALS settings, e.g., number of blocks, rank, and number of iterations, as well as the number of users/items in your dataset?
>>> 2. If you monitor the progress in the Web UI, how much data is stored in memory and how much data is shuffled per iteration?
>>> 3. Do you have enough disk space for the shuffle files?
>>> 4. Did you set checkpointDir in SparkContext and checkpointInterval in ALS?
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Fri, Jun 19, 2015 at 11:43 AM, Ravi Mody <rmody...@gmail.com> wrote:
>>>
>>>> Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on fairly large datasets (1+ billion input records). As I grow my dataset I often run into a lot of failed stages and dropped executors, ultimately leading to the whole application failing. The errors look like "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 19" and "org.apache.spark.shuffle.FetchFailedException: Failed to connect to...". They occur during flatMap, mapPartitions, and aggregate stages. I know that increasing memory fixes the issue, but most of the time my executors are only using a tiny portion of their allocated memory (<10%). Often the stages run fine until the last iteration or two of ALS, but this could just be a coincidence.
>>>>
>>>> I've tried tweaking a lot of settings, but it's time-consuming to do this through guess-and-check. Right now I have these set:
>>>> spark.shuffle.memoryFraction = 0.3
>>>> spark.storage.memoryFraction = 0.65
>>>> spark.executor.heartbeatInterval = 600000
>>>>
>>>> I'm sure these settings aren't optimal - any idea what could be causing my errors, and which direction I should push these settings to get more out of my memory? I'm currently using 240 GB of memory (on 7 executors) for a 1-billion-record dataset, which seems like too much. Thanks!
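P.S. For completeness, this is roughly how the spark.* settings quoted at the bottom of the thread get applied on my side - a sketch against the Spark 1.x static memory model, with the app name as a placeholder; the values themselves are exactly what I'm trying to tune.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("implicit-als-tuning")  # placeholder app name
        # Fraction of the heap used for shuffle aggregation buffers
        # (pre-1.6 static memory management).
        .set("spark.shuffle.memoryFraction", "0.3")
        # Fraction of the heap used for cached RDD blocks.
        .set("spark.storage.memoryFraction", "0.65")
        # How often executors report heartbeats to the driver, in ms;
        # raised well above the default here.
        .set("spark.executor.heartbeatInterval", "600000"))

sc = SparkContext(conf=conf)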