Was there any resolution to that problem? I am also seeing it with PySpark 1.4: 380 million observations, 100 factors, and 5 iterations.

Thanks,
Ayman
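P.S. For reference, a minimal sketch of the kind of PySpark ALS call in question (the input path, the parsing, and the lambda_/alpha values are placeholders, and using the implicit variant is an assumption based on the original post below, not something I have confirmed):

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    # In the pyspark shell `sc` already exists; created here only so the
    # sketch is self-contained.
    sc = SparkContext(appName="als-sketch")

    # Placeholder path and layout (user, item, value); the real input has ~380M rows.
    ratings = (sc.textFile("hdfs:///path/to/observations")
                 .map(lambda line: line.split(","))
                 .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2]))))

    # 100 factors and 5 iterations, as described above; lambda_ and alpha are
    # placeholders, not tuned values.
    model = ALS.trainImplicit(ratings, rank=100, iterations=5,
                              lambda_=0.01, alpha=0.01)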
On Jun 23, 2015, at 6:20 PM, Xiangrui Meng <men...@gmail.com> wrote:

> It shouldn't be hard to handle 1 billion ratings in 1.3. Just need
> more information to guess what happened:
>
> 1. Could you share the ALS settings, e.g., number of blocks, rank and
> number of iterations, as well as the number of users/items in your
> dataset?
> 2. If you monitor the progress in the WebUI, how much data is stored
> in memory and how much data is shuffled per iteration?
> 3. Do you have enough disk space for the shuffle files?
> 4. Did you set checkpointDir in SparkContext and checkpointInterval in ALS?
>
> Best,
> Xiangrui
>
> On Fri, Jun 19, 2015 at 11:43 AM, Ravi Mody <rmody...@gmail.com> wrote:
>> Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on fairly
>> large datasets (1+ billion input records). As I grow my dataset I often run
>> into issues with a lot of failed stages and dropped executors, ultimately
>> leading to the whole application failing. The errors are like
>> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>> location for shuffle 19" and "org.apache.spark.shuffle.FetchFailedException:
>> Failed to connect to...". These occur during flatMap, mapPartitions, and
>> aggregate stages. I know that increasing memory fixes this issue, but most
>> of the time my executors are only using a tiny portion of their
>> allocated memory (<10%). Often, the stages run fine until the last iteration
>> or two of ALS, but this could just be a coincidence.
>>
>> I've tried tweaking a lot of settings, but it's time-consuming to do this
>> through guess-and-check. Right now I have these set:
>> spark.shuffle.memoryFraction = 0.3
>> spark.storage.memoryFraction = 0.65
>> spark.executor.heartbeatInterval = 600000
>>
>> I'm sure these settings aren't optimal - any idea of what could be causing
>> my errors, and what direction I can push these settings in to get more out
>> of my memory? I'm currently using 240 GB of memory (on 7 executors) for a 1
>> billion record dataset, which seems like too much. Thanks!
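Following up on the configuration and checkpointing points quoted above, this is how I understand those settings would be wired up from PySpark. It is only a sketch under my assumptions: the checkpoint directory is a placeholder, and the three property values are simply the ones Ravi listed, not recommendations I have verified.

    from pyspark import SparkConf, SparkContext

    # The three values below are the ones listed earlier in the thread, shown
    # only to illustrate where they would be set, not as tuned recommendations.
    conf = (SparkConf()
            .setAppName("als-config-sketch")
            .set("spark.shuffle.memoryFraction", "0.3")
            .set("spark.storage.memoryFraction", "0.65")
            .set("spark.executor.heartbeatInterval", "600000"))
    sc = SparkContext(conf=conf)

    # Checkpoint directory for ALS's intermediate RDDs, per the suggestion above.
    # The path is a placeholder and should point at a fault-tolerant filesystem.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    # As far as I can tell, the per-iteration checkpoint interval Xiangrui mentions
    # is a setter on the Scala/Java ALS (setCheckpointInterval); I have not found an
    # equivalent knob in the PySpark mllib ALS API, so only the directory is set here.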