I set the number of partitions on the input dataset to 50. The number of CPU cores I'm using is 84 (7 executors with 12 cores each).
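As a quick sanity check on those two numbers (this is just back-of-the-envelope arithmetic with the figures from this thread, not anything Spark-specific):

```python
# With fewer partitions than cores, some cores sit idle during
# each wave of tasks, so the cluster is underutilized.
num_partitions = 50                       # partitions on the input dataset
executors, cores_per_exec = 7, 12
total_cores = executors * cores_per_exec  # 84

idle_cores = max(0, total_cores - num_partitions)
print(total_cores)       # 84
print(idle_cores)        # 34 cores do no work in a single task wave

# Per the advice later in this thread: set the ALS block count to
# roughly the number of cores (or less), not the 50 it inherits here.
suggested_blocks = total_cores
print(suggested_blocks)  # 84
```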
I'll look into getting a full stack trace. Any idea what my errors mean, and why increasing memory causes them to go away? Thanks.

On Fri, Jun 26, 2015 at 11:26 AM, Xiangrui Meng <men...@gmail.com> wrote:
> Please see my comments inline. It would be helpful if you can attach
> the full stack trace. -Xiangrui
>
> On Fri, Jun 26, 2015 at 7:18 AM, Ravi Mody <rmody...@gmail.com> wrote:
> > 1. These are my settings:
> > rank = 100
> > iterations = 12
> > users = ~20M
> > items = ~2M
> > training examples = ~500M-1B (I'm running into the issue even with 500M
> > training examples)
>
> Did you set the number of blocks? If you didn't, could you check how many
> partitions you have in the ratings RDD? Setting a large number of
> blocks would increase shuffle size. If you have enough RAM, try to set
> the number of blocks to the number of CPU cores or less.
>
> > 2. The memory storage never seems to go too high. The user blocks may go up
> > to ~10 GB, and each executor will have a few GB used out of 30 free GB.
> > Everything seems small compared to the amount of memory I'm using.
>
> This looks correct.
>
> > 3. I think I have a lot of disk space - is this on the executors or the
> > driver? Is there a way to know if the error is coming from disk space?
>
> You can see the shuffle data size for each iteration from the WebUI.
> Usually, it should throw an out-of-disk-space exception instead of the
> message you posted. But it is worth checking.
>
> > 4. I'm not changing checkpointing settings, but I think checkpointing
> > defaults to every 10 iterations? One notable thing is the crashes often
> > start on or after the 9th iteration, so it may be related to checkpointing.
> > But this could just be a coincidence.
>
> If you didn't set checkpointDir in SparkContext, the
> checkpointInterval setting in ALS has no effect.
>
> Thanks!
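[For a sense of scale on the numbers above (rank 100, ~20M users, ~2M items), here is a rough estimate of the raw size of the two factor matrices in plain Python. The 8-bytes-per-value figure assumes double-precision factors and ignores JVM object overhead, so treat it as a lower bound:]

```python
# Back-of-the-envelope size of the ALS factor matrices.
# Assumes 8-byte (double-precision) values; actual in-memory
# footprint in Spark will be larger due to object overhead.
rank = 100
users, items = 20_000_000, 2_000_000
bytes_per_value = 8

user_gb = users * rank * bytes_per_value / 1e9
item_gb = items * rank * bytes_per_value / 1e9
print(round(user_gb, 1))  # 16.0 GB of user factors
print(round(item_gb, 1))  # 1.6 GB of item factors
```

This is consistent with the ~10 GB user blocks reported above, and it is the data that gets shuffled every iteration, which is why block count matters so much.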
> >
> > On Fri, Jun 26, 2015 at 1:08 AM, Ayman Farahat <ayman.fara...@yahoo.com>
> > wrote:
> >>
> >> was there any resolution to that problem?
> >> I am also having that with PySpark 1.4:
> >> 380 million observations
> >> 100 factors and 5 iterations
> >> Thanks
> >> Ayman
> >>
> >> On Jun 23, 2015, at 6:20 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>
> >> > It shouldn't be hard to handle 1 billion ratings in 1.3. Just need
> >> > more information to guess what happened:
> >> >
> >> > 1. Could you share the ALS settings, e.g., number of blocks, rank, and
> >> > number of iterations, as well as the number of users/items in your
> >> > dataset?
> >> > 2. If you monitor the progress in the WebUI, how much data is stored
> >> > in memory and how much data is shuffled per iteration?
> >> > 3. Do you have enough disk space for the shuffle files?
> >> > 4. Did you set checkpointDir in SparkContext and checkpointInterval in
> >> > ALS?
> >> >
> >> > Best,
> >> > Xiangrui
> >> >
> >> > On Fri, Jun 19, 2015 at 11:43 AM, Ravi Mody <rmody...@gmail.com> wrote:
> >> >> Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on
> >> >> fairly large datasets (1+ billion input records). As I grow my dataset
> >> >> I often run into issues with a lot of failed stages and dropped
> >> >> executors, ultimately leading to the whole application failing. The
> >> >> errors are like
> >> >> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
> >> >> output location for shuffle 19" and
> >> >> "org.apache.spark.shuffle.FetchFailedException:
> >> >> Failed to connect to...". These occur during flatMap, mapPartitions,
> >> >> and aggregate stages. I know that increasing memory fixes this issue,
> >> >> but most of the time my executors are only using a tiny portion of
> >> >> their allocated memory (<10%).
> >> >> Often, the stages run fine until the last iteration
> >> >> or two of ALS, but this could just be a coincidence.
> >> >>
> >> >> I've tried tweaking a lot of settings, but it's time-consuming to do
> >> >> this through guess-and-check. Right now I have these set:
> >> >> spark.shuffle.memoryFraction = 0.3
> >> >> spark.storage.memoryFraction = 0.65
> >> >> spark.executor.heartbeatInterval = 600000
> >> >>
> >> >> I'm sure these settings aren't optimal - any idea of what could be
> >> >> causing my errors, and what direction I can push these settings in to
> >> >> get more out of my memory? I'm currently using 240 GB of memory (on 7
> >> >> executors) for a 1 billion record dataset, which seems like too much.
> >> >> Thanks!
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> > For additional commands, e-mail: user-h...@spark.apache.org
> >> >
> >
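[One observation on the fractions quoted above: under the static memory management used in Spark 1.3, storage (0.65) and shuffle (0.3) together claim 95% of the heap, leaving very little slack for task working memory. A rough split, in plain Python, using the 240 GB / 7 executors from the thread; this is illustrative only and ignores Spark's internal safety fractions:]

```python
# Rough per-executor heap split under Spark 1.3-era static memory
# management, using the settings quoted in this thread.
heap_gb = 240 / 7            # ~34.3 GB per executor
storage_fraction = 0.65      # spark.storage.memoryFraction
shuffle_fraction = 0.3       # spark.shuffle.memoryFraction

storage_gb = heap_gb * storage_fraction
shuffle_gb = heap_gb * shuffle_fraction
other_gb = heap_gb - storage_gb - shuffle_gb

print(round(storage_gb, 1))  # 22.3
print(round(shuffle_gb, 1))  # 10.3
print(round(other_gb, 1))    # 1.7 GB left for everything else
```

If storage is mostly unused (as reported, <10% of allocated memory), lowering spark.storage.memoryFraction would free room for shuffle and task execution without adding hardware.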