I use MLlib, not ML. Does that make a difference?

> On Jun 26, 2015, at 7:19 AM, Ravi Mody <rmody...@gmail.com> wrote:
> 
> Forgot to mention: a rank of 100 usually works OK, but 120 consistently
> fails to finish.
> 
>> On Fri, Jun 26, 2015 at 10:18 AM, Ravi Mody <rmody...@gmail.com> wrote:
>> 1. These are my settings:
>> rank = 100
>> iterations = 12
>> users = ~20M
>> items = ~2M
>> training examples = ~500M-1B (I'm running into the issue even with 500M 
>> training examples)
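>> 
>> For concreteness, a minimal PySpark MLlib call with these settings might look
>> like this (the ratings RDD and the lambda/alpha values are placeholders, not
>> values from this thread):
>> 
>>     from pyspark.mllib.recommendation import ALS
>> 
>>     # ratings: RDD of (user, item, count) tuples -- placeholder name
>>     model = ALS.trainImplicit(
>>         ratings,
>>         rank=100,        # rank from the settings above
>>         iterations=12,   # iteration count from above
>>         lambda_=0.01,    # regularization -- assumed, not stated in the thread
>>         alpha=0.01)      # implicit-feedback confidence weight -- assumed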
>> 
>> 2. The in-memory storage never seems to get very high. The user blocks may
>> grow to ~10 GB, and each executor has only a few GB used out of 30 GB free.
>> Everything seems small compared to the amount of memory I've allocated.
>> 
>> 3. I think I have a lot of disk space - is that needed on the executors or
>> the driver? Is there a way to tell whether the error is coming from a lack
>> of disk space?
>> 
>> 4. I'm not changing the checkpointing settings, but I think checkpointing 
>> defaults to every 10 iterations? Notably, the crashes often start on or 
>> after the 9th iteration, so this may be related to checkpointing. But it 
>> could just be a coincidence. 
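>> 
>> (For anyone following along: as I understand it, ALS only checkpoints once a
>> checkpoint directory is set on the SparkContext. A minimal sketch, with a
>> placeholder HDFS path:
>> 
>>     # Checkpointing only takes effect once a directory is set.
>>     sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  # placeholder path
>> 
>> The every-10-iterations default is the implementation's checkpointInterval.)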
>> 
>> Thanks!
>> 
>>> On Fri, Jun 26, 2015 at 1:08 AM, Ayman Farahat <ayman.fara...@yahoo.com> 
>>> wrote:
>>> Was there any resolution to that problem?
>>> I am also hitting it with PySpark 1.4:
>>> 380 million observations
>>> 100 factors and 5 iterations
>>> Thanks
>>> Ayman
>>> 
>>> On Jun 23, 2015, at 6:20 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>> 
>>> > It shouldn't be hard to handle 1 billion ratings in 1.3. Just need
>>> > more information to guess what happened:
>>> >
>>> > 1. Could you share the ALS settings, e.g., number of blocks, rank and
>>> > number of iterations, as well as number of users/items in your
>>> > dataset?
>>> > 2. If you monitor the progress in the WebUI, how much data is stored
>>> > in memory and how much data is shuffled per iteration?
>>> > 3. Do you have enough disk space for the shuffle files?
>>> > 4. Did you set checkpointDir in SparkContext and checkpointInterval in 
>>> > ALS?
>>> >
>>> > Best,
>>> > Xiangrui
>>> >
>>> > On Fri, Jun 19, 2015 at 11:43 AM, Ravi Mody <rmody...@gmail.com> wrote:
>>> >> Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on fairly
>>> >> large datasets (1+ billion input records). As I grow my dataset I often run
>>> >> into issues with a lot of failed stages and dropped executors, ultimately
>>> >> leading to the whole application failing. The errors are like
>>> >> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>>> >> location for shuffle 19" and 
>>> >> "org.apache.spark.shuffle.FetchFailedException:
>>> >> Failed to connect to...". These occur during flatMap, mapPartitions, and
>>> >> aggregate stages. I know that increasing memory fixes this issue, but most
>>> >> of the time my executors are only using a tiny portion of their allocated
>>> >> memory (<10%). Often, the stages run fine until the last iteration or two
>>> >> of ALS, but this could just be a coincidence.
>>> >>
>>> >> I've tried tweaking a lot of settings, but it's time-consuming to do this
>>> >> through guess-and-check. Right now I have these set:
>>> >> spark.shuffle.memoryFraction = 0.3
>>> >> spark.storage.memoryFraction = 0.65
>>> >> spark.executor.heartbeatInterval = 600000
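>>> >> 
>>> >> One way to set these is via SparkConf before creating the context (a
>>> >> sketch; they could equally be passed as --conf flags to spark-submit):
>>> >> 
>>> >>     from pyspark import SparkConf, SparkContext
>>> >> 
>>> >>     conf = (SparkConf()
>>> >>             .set("spark.shuffle.memoryFraction", "0.3")
>>> >>             .set("spark.storage.memoryFraction", "0.65")
>>> >>             .set("spark.executor.heartbeatInterval", "600000"))
>>> >>     sc = SparkContext(conf=conf)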
>>> >>
>>> >> I'm sure these settings aren't optimal - any idea what could be causing
>>> >> my errors, and in which direction I can push these settings to get more
>>> >> out of my memory? I'm currently using 240 GB of memory (on 7 executors)
>>> >> for a 1-billion-record dataset, which seems like too much. Thanks!