I've been able to almost halve my memory usage with no instability issues.

I lowered my storage.memoryFraction and increased my shuffle.memoryFraction
(essentially swapping them). I set spark.yarn.executor.memoryOverhead to
6 GB. I also lowered executor-cores in case other jobs are using the
available cores. I'm not sure which of these fixed the issue, but things
are much more stable now.
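
In case it helps anyone hitting the same errors, here is roughly what that
configuration looks like in code - a sketch for Spark 1.x on YARN, where the
values are just what worked on my cluster (spark.yarn.executor.memoryOverhead
is given in MB):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the configuration described above (Spark 1.x on YARN).
    // Values are what worked for my job, not general recommendations.
    val conf = new SparkConf()
      .setAppName("implicit-als")
      .set("spark.storage.memoryFraction", "0.3")        // lowered
      .set("spark.shuffle.memoryFraction", "0.65")       // raised (swapped with storage)
      .set("spark.yarn.executor.memoryOverhead", "6144") // 6 GB, in MB
    val sc = new SparkContext(conf)

    // executor-cores was lowered on the spark-submit command line, e.g.
    //   spark-submit --executor-cores 4 ...   (4 is a placeholder value)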

On Fri, Jun 26, 2015 at 11:26 AM, Xiangrui Meng <men...@gmail.com> wrote:

> Please see my comments inline. It would be helpful if you can attach
> the full stack trace. -Xiangrui
>
> On Fri, Jun 26, 2015 at 7:18 AM, Ravi Mody <rmody...@gmail.com> wrote:
> > 1. These are my settings:
> > rank = 100
> > iterations = 12
> > users = ~20M
> > items = ~2M
> > training examples = ~500M-1B (I'm running into the issue even with 500M
> > training examples)
> >
>
> Did you set the number of blocks? If you didn't, could you check how many
> partitions you have in the ratings RDD? Setting a large number of
> blocks would increase the shuffle size. If you have enough RAM, try to set
> the number of blocks to the number of CPU cores or less.
>
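[For the archive, a rough sketch of that check with the mllib API in 1.3 -
"ratings" is a stand-in for the ratings RDD, and the lambda, alpha, and
block-count values below are placeholders:]

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
    import org.apache.spark.rdd.RDD

    // Check how many partitions the ratings RDD has, and pass an explicit
    // block count to ALS, per the suggestion above. rank and iterations
    // match the numbers used elsewhere in this thread.
    def trainWithExplicitBlocks(ratings: RDD[Rating]): MatrixFactorizationModel = {
      println(s"ratings partitions: ${ratings.partitions.length}")
      val numBlocks = 56  // placeholder: roughly the number of CPU cores, or less
      // args: ratings, rank, iterations, lambda, blocks, alpha
      ALS.trainImplicit(ratings, 100, 12, 0.01, numBlocks, 1.0)
    }
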
> > 2. The storage memory never seems to go too high. The user blocks may go
> > up to ~10 GB, and each executor has only a few GB used out of 30 free GB.
> > Everything seems small compared to the amount of memory I'm using.
> >
>
> This looks correct.
>
> > 3. I think I have a lot of disk space - is this on the executors or the
> > driver? Is there a way to know whether the error is coming from disk space?
> >
>
> You can see the shuffle data size for each iteration from the WebUI.
> Usually, it should throw an out of disk space exception instead of the
> message you posted. But it is worth checking.
>
> > 4. I'm not changing checkpointing settings, but I think checkpointing
> > defaults to every 10 iterations? One notable thing is that the crashes
> > often start on or after the 9th iteration, so it may be related to
> > checkpointing. But this could just be a coincidence.
> >
>
> If you didn't set checkpointDir in SparkContext, the
> checkpointInterval setting in ALS has no effect.
>
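[A minimal sketch of wiring that up - the HDFS path below is only a
placeholder:]

    // Without a checkpoint directory, ALS's periodic checkpointing is a no-op.
    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  // placeholder path
    // With it set, ALS checkpoints its factor RDDs periodically (every 10
    // iterations by default, per this thread), truncating the long lineage
    // that builds up across iterations.
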
> > Thanks!
> >
> >
> >
> >
> >
> > On Fri, Jun 26, 2015 at 1:08 AM, Ayman Farahat <ayman.fara...@yahoo.com>
> > wrote:
> >>
> >> Was there any resolution to that problem?
> >> I am also hitting this with PySpark 1.4:
> >> 380 million observations
> >> 100 factors and 5 iterations
> >> Thanks
> >> Ayman
> >>
> >> On Jun 23, 2015, at 6:20 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>
> >> > It shouldn't be hard to handle 1 billion ratings in 1.3. Just need
> >> > more information to guess what happened:
> >> >
> >> > 1. Could you share the ALS settings, e.g., number of blocks, rank and
> >> > number of iterations, as well as number of users/items in your
> >> > dataset?
> >> > 2. If you monitor the progress in the WebUI, how much data is stored
> >> > in memory and how much data is shuffled per iteration?
> >> > 3. Do you have enough disk space for the shuffle files?
> >> > 4. Did you set checkpointDir in SparkContext and checkpointInterval in
> >> > ALS?
> >> >
> >> > Best,
> >> > Xiangrui
> >> >
> >> > On Fri, Jun 19, 2015 at 11:43 AM, Ravi Mody <rmody...@gmail.com> wrote:
> >> >> Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on
> >> >> fairly large datasets (1+ billion input records). As I grow my dataset
> >> >> I often run into issues with a lot of failed stages and dropped
> >> >> executors, ultimately leading to the whole application failing. The
> >> >> errors are like
> >> >> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
> >> >> output location for shuffle 19" and
> >> >> "org.apache.spark.shuffle.FetchFailedException: Failed to connect
> >> >> to...". These occur during flatMap, mapPartitions, and aggregate
> >> >> stages. I know that increasing memory fixes this issue, but most of
> >> >> the time my executors are only using a tiny portion of their
> >> >> allocated memory (<10%). Often, the stages run fine until the last
> >> >> iteration or two of ALS, but this could just be a coincidence.
> >> >>
> >> >> I've tried tweaking a lot of settings, but it's time-consuming to do
> >> >> this through guess-and-check. Right now I have these set:
> >> >> spark.shuffle.memoryFraction = 0.3
> >> >> spark.storage.memoryFraction = 0.65
> >> >> spark.executor.heartbeatInterval = 600000
> >> >>
> >> >> I'm sure these settings aren't optimal - any idea of what could be
> >> >> causing my errors, and what direction I can push these settings in to
> >> >> get more out of my memory? I'm currently using 240 GB of memory (on 7
> >> >> executors) for a 1 billion record dataset, which seems like too much.
> >> >> Thanks!
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> > For additional commands, e-mail: user-h...@spark.apache.org
> >> >
> >>
> >
>
