No, they use the same implementation.

On Fri, Jun 26, 2015 at 8:05 AM, Ayman Farahat <ayman.fara...@yahoo.com> wrote:

> I use the mllib, not the ML. Does that make a difference?
>
> Sent from my iPhone
>
> On Jun 26, 2015, at 7:19 AM, Ravi Mody <rmody...@gmail.com> wrote:
>
> Forgot to mention: a rank of 100 usually works OK; 120 consistently cannot finish.
>
> On Fri, Jun 26, 2015 at 10:18 AM, Ravi Mody <rmody...@gmail.com> wrote:
>>
>> 1. These are my settings:
>> rank = 100
>> iterations = 12
>> users = ~20M
>> items = ~2M
>> training examples = ~500M-1B (I'm running into the issue even with 500M training examples)
>>
>> 2. The memory storage never seems to go too high. The user blocks may go up to ~10 GB, and each executor has only a few GB used out of 30 free GB. Everything seems small compared to the amount of memory I'm using.
>>
>> 3. I think I have a lot of disk space - is this on the executors or the driver? Is there a way to tell whether the error is coming from a lack of disk space?
>>
>> 4. I'm not changing the checkpointing settings, but I think checkpointing defaults to every 10 iterations. One notable thing is that the crashes often start on or after the 9th iteration, so it may be related to checkpointing - but this could just be a coincidence.
>>
>> Thanks!
>>
>> On Fri, Jun 26, 2015 at 1:08 AM, Ayman Farahat <ayman.fara...@yahoo.com> wrote:
>>>
>>> Was there any resolution to that problem? I am also having it with PySpark 1.4:
>>> 380 million observations
>>> 100 factors and 5 iterations
>>> Thanks
>>> Ayman
>>>
>>> On Jun 23, 2015, at 6:20 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>> > It shouldn't be hard to handle 1 billion ratings in 1.3. I just need
>>> > more information to guess what happened:
>>> >
>>> > 1. Could you share the ALS settings, e.g., the number of blocks, the rank, and
>>> > the number of iterations, as well as the number of users/items in your dataset?
>>> > 2. If you monitor the progress in the web UI, how much data is stored
>>> > in memory and how much data is shuffled per iteration?
>>> > 3. Do you have enough disk space for the shuffle files?
>>> > 4. Did you set checkpointDir in SparkContext and checkpointInterval in ALS?
>>> >
>>> > Best,
>>> > Xiangrui
>>> >
>>> > On Fri, Jun 19, 2015 at 11:43 AM, Ravi Mody <rmody...@gmail.com> wrote:
>>> >> Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on fairly
>>> >> large datasets (1+ billion input records). As I grow my dataset I often run
>>> >> into issues with a lot of failed stages and dropped executors, ultimately
>>> >> leading to the whole application failing. The errors look like
>>> >> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>>> >> location for shuffle 19" and "org.apache.spark.shuffle.FetchFailedException:
>>> >> Failed to connect to...". These occur during flatMap, mapPartitions, and
>>> >> aggregate stages. I know that increasing memory fixes this issue, but most
>>> >> of the time my executors are only using a tiny portion of their allocated
>>> >> memory (<10%). Often the stages run fine until the last iteration or two
>>> >> of ALS, but this could just be a coincidence.
>>> >>
>>> >> I've tried tweaking a lot of settings, but it's time-consuming to do this
>>> >> through guess-and-check. Right now I have these set:
>>> >> spark.shuffle.memoryFraction = 0.3
>>> >> spark.storage.memoryFraction = 0.65
>>> >> spark.executor.heartbeatInterval = 600000
>>> >>
>>> >> I'm sure these settings aren't optimal - any idea what could be causing
>>> >> my errors, and in what direction I can push these settings to get more
>>> >> out of my memory? I'm currently using 240 GB of memory (across 7 executors)
>>> >> for a 1-billion-record dataset, which seems like too much. Thanks!
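[Editor's note: the checkpointing setup Xiangrui asks about in point 4 can be sketched in PySpark roughly as below. This is a sketch against the Spark 1.3/1.4-era MLlib API, not the posters' actual code; the checkpoint directory path and the toy ratings are hypothetical, and the configuration values simply mirror the settings quoted in the thread rather than being recommendations.]

```python
from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS, Rating

# Memory fractions as quoted in the thread (illustrative, not tuned values).
conf = (SparkConf()
        .set("spark.shuffle.memoryFraction", "0.3")
        .set("spark.storage.memoryFraction", "0.65"))
sc = SparkContext(conf=conf)

# Setting a checkpoint directory lets ALS truncate the RDD lineage that grows
# with each iteration; as the thread notes, checkpointing appears to kick in
# every 10 iterations by default once a directory is set.
sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  # hypothetical path

# Toy data standing in for the ~500M-1B implicit-feedback records discussed.
ratings = sc.parallelize([Rating(0, 0, 1.0), Rating(1, 1, 2.0)])

# Rank/iterations as reported by Ravi; blocks=-1 lets Spark pick the number
# of user/item blocks automatically.
model = ALS.trainImplicit(ratings, rank=100, iterations=12,
                          blocks=-1, alpha=0.01)
```

Without a checkpoint directory, a long ALS lineage makes late iterations depend on a deep chain of shuffle files, which is consistent with crashes appearing around the 9th iteration or later.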
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org