1. These are my settings:
rank = 100
iterations = 12
users = ~20M
items = ~2M
training examples = ~500M-1B (I'm running into the issue even with 500M
training examples)

2. The in-memory storage shown in the WebUI never seems to get very high.
The user blocks may grow to ~10 GB, and each executor uses only a few GB
of its 30 GB. Everything looks small relative to the amount of memory I've
allocated.

3. I think I have plenty of disk space - do the shuffle files go on the
executors or the driver? Is there a way to tell whether the errors are
caused by running out of disk space?

4. I'm not changing any checkpointing settings, but I believe checkpointing
defaults to every 10 iterations? One notable thing is that the crashes often
start on or after the 9th iteration, so it may be related to checkpointing,
but this could just be a coincidence. (See the sketch below, where I spell
out the checkpoint settings explicitly.)

Thanks!

On Fri, Jun 26, 2015 at 1:08 AM, Ayman Farahat <ayman.fara...@yahoo.com>
wrote:

> Was there any resolution to that problem?
> I am also seeing it with PySpark 1.4:
> 380 million observations
> 100 factors and 5 iterations
> Thanks
> Ayman
>
> On Jun 23, 2015, at 6:20 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
> > It shouldn't be hard to handle 1 billion ratings in 1.3. Just need
> > more information to guess what happened:
> >
> > 1. Could you share the ALS settings, e.g., number of blocks, rank and
> > number of iterations, as well as number of users/items in your
> > dataset?
> > 2. If you monitor the progress in the WebUI, how much data is stored
> > in memory and how much data is shuffled per iteration?
> > 3. Do you have enough disk space for the shuffle files?
> > 4. Did you set checkpointDir in SparkContext and checkpointInterval
> > in ALS?
> >
> > Best,
> > Xiangrui
> >
> > On Fri, Jun 19, 2015 at 11:43 AM, Ravi Mody <rmody...@gmail.com> wrote:
> >> Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on
> >> fairly large datasets (1+ billion input records). As I grow my dataset
> >> I often run into issues with a lot of failed stages and dropped
> >> executors, ultimately leading to the whole application failing. The
> >> errors are like
> >> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
> >> output location for shuffle 19" and
> >> "org.apache.spark.shuffle.FetchFailedException: Failed to connect
> >> to...". These occur during flatMap, mapPartitions, and aggregate
> >> stages. I know that increasing memory fixes this issue, but most of
> >> the time my executors are only using a tiny portion of their
> >> allocated memory (<10%). Often, the stages run fine until the last
> >> iteration or two of ALS, but this could just be a coincidence.
> >>
> >> I've tried tweaking a lot of settings, but it's time-consuming to do
> >> this through guess-and-check. Right now I have these set:
> >> spark.shuffle.memoryFraction = 0.3
> >> spark.storage.memoryFraction = 0.65
> >> spark.executor.heartbeatInterval = 600000
> >>
> >> I'm sure these settings aren't optimal - any idea of what could be
> >> causing my errors, and what direction I can push these settings in to
> >> get more out of my memory? I'm currently using 240 GB of memory (on 7
> >> executors) for a 1 billion record dataset, which seems like too much.
> >> Thanks!
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>
