Was there any resolution to that problem? I am also seeing it with PySpark 1.4: 380 million observations, 100 factors, and 5 iterations.

Thanks,
Ayman
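P.S. For reference, a minimal sketch of the kind of PySpark ALS call in question (the input path, the parsing, and the lambda_/alpha values are placeholders, and using the implicit variant is an assumption based on the original post below, not something I have confirmed):

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    # In the pyspark shell `sc` already exists; created here only so the
    # sketch is self-contained.
    sc = SparkContext(appName="als-sketch")

    # Placeholder path and layout (user, item, value); the real input has ~380M rows.
    ratings = (sc.textFile("hdfs:///path/to/observations")
                 .map(lambda line: line.split(","))
                 .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2]))))

    # 100 factors and 5 iterations, as described above; lambda_ and alpha are
    # placeholders, not tuned values.
    model = ALS.trainImplicit(ratings, rank=100, iterations=5,
                              lambda_=0.01, alpha=0.01)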
On Jun 23, 2015, at 6:20 PM, Xiangrui Meng <men...@gmail.com> wrote:

> It shouldn't be hard to handle 1 billion ratings in 1.3. Just need
> more information to guess what happened:
>
> 1. Could you share the ALS settings, e.g., number of blocks, rank and
> number of iterations, as well as the number of users/items in your
> dataset?
> 2. If you monitor the progress in the WebUI, how much data is stored
> in memory and how much data is shuffled per iteration?
> 3. Do you have enough disk space for the shuffle files?
> 4. Did you set checkpointDir in SparkContext and checkpointInterval in ALS?
>
> Best,
> Xiangrui
>
> On Fri, Jun 19, 2015 at 11:43 AM, Ravi Mody <rmody...@gmail.com> wrote:
>> Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on fairly
>> large datasets (1+ billion input records). As I grow my dataset I often run
>> into issues with a lot of failed stages and dropped executors, ultimately
>> leading to the whole application failing. The errors are like
>> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>> location for shuffle 19" and "org.apache.spark.shuffle.FetchFailedException:
>> Failed to connect to...". These occur during flatMap, mapPartitions, and
>> aggregate stages. I know that increasing memory fixes this issue, but most
>> of the time my executors are only using a tiny portion of their
>> allocated memory (<10%). Often, the stages run fine until the last iteration
>> or two of ALS, but this could just be a coincidence.
>>
>> I've tried tweaking a lot of settings, but it's time-consuming to do this
>> through guess-and-check. Right now I have these set:
>> spark.shuffle.memoryFraction = 0.3
>> spark.storage.memoryFraction = 0.65
>> spark.executor.heartbeatInterval = 600000
>>
>> I'm sure these settings aren't optimal - any idea of what could be causing
>> my errors, and what direction I can push these settings in to get more out
>> of my memory? I'm currently using 240 GB of memory (on 7 executors) for a 1
>> billion record dataset, which seems like too much. Thanks!
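Following up on the configuration and checkpointing points quoted above, this is how I understand those settings would be wired up from PySpark. It is only a sketch under my assumptions: the checkpoint directory is a placeholder, and the three property values are simply the ones Ravi listed, not recommendations I have verified.

    from pyspark import SparkConf, SparkContext

    # The three values below are the ones listed earlier in the thread, shown
    # only to illustrate where they would be set, not as tuned recommendations.
    conf = (SparkConf()
            .setAppName("als-config-sketch")
            .set("spark.shuffle.memoryFraction", "0.3")
            .set("spark.storage.memoryFraction", "0.65")
            .set("spark.executor.heartbeatInterval", "600000"))
    sc = SparkContext(conf=conf)

    # Checkpoint directory for ALS's intermediate RDDs, per the suggestion above.
    # The path is a placeholder and should point at a fault-tolerant filesystem.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    # As far as I can tell, the per-iteration checkpoint interval Xiangrui mentions
    # is a setter on the Scala/Java ALS (setCheckpointInterval); I have not found an
    # equivalent knob in the PySpark mllib ALS API, so only the directory is set here.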