Forgot to mention: a rank of 100 usually works OK, but 120 consistently cannot finish.
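In case it helps, this is roughly what my call looks like (Spark 1.3.1 MLlib Python API). It's just a sketch: the checkpoint directory, input path, lambda, and alpha below are placeholders, not my actual values.

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext(appName="implicit-als")

# ALS only checkpoints its intermediate RDDs (by default every 10 iterations)
# when a checkpoint directory is set; without one, the long lineages can make
# the later iterations fragile.
sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  # placeholder path

# ratings is an RDD of (user, item, confidence) triples built upstream.
ratings = sc.pickleFile("hdfs:///data/implicit-ratings")  # placeholder input

model = ALS.trainImplicit(
    ratings,
    rank=100,       # 100 finishes; 120 consistently does not
    iterations=12,
    lambda_=0.01,   # placeholder regularization
    alpha=40.0,     # placeholder confidence scaling
)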
On Fri, Jun 26, 2015 at 10:18 AM, Ravi Mody <rmody...@gmail.com> wrote:

> 1. These are my settings:
> rank = 100
> iterations = 12
> users = ~20M
> items = ~2M
> training examples = ~500M-1B (I'm running into the issue even with 500M training examples)
>
> 2. The memory storage never seems to go too high. The user blocks may go up to ~10 GB, and each executor has only a few GB in use out of 30 GB free. Everything seems small compared to the amount of memory I'm allocating.
>
> 3. I think I have plenty of disk space - is this on the executors or the driver? Is there a way to tell whether the error is coming from disk space?
>
> 4. I'm not changing the checkpointing settings, but I believe checkpointing defaults to every 10 iterations? Notably, the crashes often start on or after the 9th iteration, so it may be related to checkpointing - but this could just be a coincidence.
>
> Thanks!
>
> On Fri, Jun 26, 2015 at 1:08 AM, Ayman Farahat <ayman.fara...@yahoo.com> wrote:
>
>> Was there any resolution to that problem?
>> I am also hitting it with PySpark 1.4:
>> 380 million observations
>> 100 factors and 5 iterations
>> Thanks
>> Ayman
>>
>> On Jun 23, 2015, at 6:20 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>>> It shouldn't be hard to handle 1 billion ratings in 1.3. I just need more information to guess what happened:
>>>
>>> 1. Could you share the ALS settings, e.g., number of blocks, rank, and number of iterations, as well as the number of users/items in your dataset?
>>> 2. If you monitor the progress in the Web UI, how much data is stored in memory and how much data is shuffled per iteration?
>>> 3. Do you have enough disk space for the shuffle files?
>>> 4. Did you set checkpointDir in SparkContext and checkpointInterval in ALS?
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Fri, Jun 19, 2015 at 11:43 AM, Ravi Mody <rmody...@gmail.com> wrote:
>>>
>>>> Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on fairly large datasets (1+ billion input records). As I grow my dataset I often run into a lot of failed stages and dropped executors, ultimately leading to the whole application failing. The errors look like "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 19" and "org.apache.spark.shuffle.FetchFailedException: Failed to connect to...". They occur during flatMap, mapPartitions, and aggregate stages. I know that increasing memory fixes the issue, but most of the time my executors are only using a tiny portion of their allocated memory (<10%). Often the stages run fine until the last iteration or two of ALS, but this could just be a coincidence.
>>>>
>>>> I've tried tweaking a lot of settings, but it's time-consuming to do this through guess-and-check. Right now I have these set:
>>>> spark.shuffle.memoryFraction = 0.3
>>>> spark.storage.memoryFraction = 0.65
>>>> spark.executor.heartbeatInterval = 600000
>>>>
>>>> I'm sure these settings aren't optimal - any idea what could be causing my errors, and which direction I should push these settings to get more out of my memory? I'm currently using 240 GB of memory (on 7 executors) for a 1-billion-record dataset, which seems like too much. Thanks!
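P.S. For completeness, this is roughly how the spark.* settings quoted at the bottom of the thread get applied on my side - a sketch against the Spark 1.x static memory model, with the app name as a placeholder; the values themselves are exactly what I'm trying to tune.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("implicit-als-tuning")  # placeholder app name
        # Fraction of the heap used for shuffle aggregation buffers
        # (pre-1.6 static memory management).
        .set("spark.shuffle.memoryFraction", "0.3")
        # Fraction of the heap used for cached RDD blocks.
        .set("spark.storage.memoryFraction", "0.65")
        # How often executors report heartbeats to the driver, in ms;
        # raised well above the default here.
        .set("spark.executor.heartbeatInterval", "600000"))

sc = SparkContext(conf=conf)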