I set the number of partitions on the input dataset to 50. The number of CPU cores I'm using is 84 (7 executors with 12 cores each).
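As a quick sanity check on those two numbers (this is just back-of-the-envelope arithmetic with the figures from this thread, not anything Spark-specific):

```python
# With fewer partitions than cores, some cores sit idle during
# each wave of tasks, so the cluster is underutilized.
num_partitions = 50                       # partitions on the input dataset
executors, cores_per_exec = 7, 12
total_cores = executors * cores_per_exec  # 84

idle_cores = max(0, total_cores - num_partitions)
print(total_cores)       # 84
print(idle_cores)        # 34 cores do no work in a single task wave

# Per the advice later in this thread: set the ALS block count to
# roughly the number of cores (or less), not the 50 it inherits here.
suggested_blocks = total_cores
print(suggested_blocks)  # 84
```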
I'll look into getting a full stack trace. Any idea what my errors mean, and why increasing memory causes them to go away? Thanks.

On Fri, Jun 26, 2015 at 11:26 AM, Xiangrui Meng <men...@gmail.com> wrote:
> Please see my comments inline. It would be helpful if you can attach
> the full stack trace. -Xiangrui
>
> On Fri, Jun 26, 2015 at 7:18 AM, Ravi Mody <rmody...@gmail.com> wrote:
> > 1. These are my settings:
> > rank = 100
> > iterations = 12
> > users = ~20M
> > items = ~2M
> > training examples = ~500M-1B (I'm running into the issue even with 500M
> > training examples)
>
> Did you set the number of blocks? If you didn't, could you check how many
> partitions you have in the ratings RDD? Setting a large number of
> blocks would increase shuffle size. If you have enough RAM, try to set
> the number of blocks to the number of CPU cores or less.
>
> > 2. The memory storage never seems to go too high. The user blocks may go up
> > to ~10 GB, and each executor will have a few GB used out of 30 free GB.
> > Everything seems small compared to the amount of memory I'm using.
>
> This looks correct.
>
> > 3. I think I have a lot of disk space - is this on the executors or the
> > driver? Is there a way to know if the error is coming from disk space?
>
> You can see the shuffle data size for each iteration from the WebUI.
> Usually, it should throw an out-of-disk-space exception instead of the
> message you posted. But it is worth checking.
>
> > 4. I'm not changing checkpointing settings, but I think checkpointing
> > defaults to every 10 iterations? One notable thing is the crashes often
> > start on or after the 9th iteration, so it may be related to checkpointing.
> > But this could just be a coincidence.
>
> If you didn't set checkpointDir in SparkContext, the
> checkpointInterval setting in ALS has no effect.
>
> Thanks!
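[For a sense of scale on the numbers above (rank 100, ~20M users, ~2M items), here is a rough estimate of the raw size of the two factor matrices in plain Python. The 8-bytes-per-value figure assumes double-precision factors and ignores JVM object overhead, so treat it as a lower bound:]

```python
# Back-of-the-envelope size of the ALS factor matrices.
# Assumes 8-byte (double-precision) values; actual in-memory
# footprint in Spark will be larger due to object overhead.
rank = 100
users, items = 20_000_000, 2_000_000
bytes_per_value = 8

user_gb = users * rank * bytes_per_value / 1e9
item_gb = items * rank * bytes_per_value / 1e9
print(round(user_gb, 1))  # 16.0 GB of user factors
print(round(item_gb, 1))  # 1.6 GB of item factors
```

This is consistent with the ~10 GB user blocks reported above, and it is the data that gets shuffled every iteration, which is why block count matters so much.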
> >
> > On Fri, Jun 26, 2015 at 1:08 AM, Ayman Farahat <ayman.fara...@yahoo.com>
> > wrote:
> >>
> >> was there any resolution to that problem?
> >> I am also having that with PySpark 1.4:
> >> 380 million observations
> >> 100 factors and 5 iterations
> >> Thanks
> >> Ayman
> >>
> >> On Jun 23, 2015, at 6:20 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>
> >> > It shouldn't be hard to handle 1 billion ratings in 1.3. Just need
> >> > more information to guess what happened:
> >> >
> >> > 1. Could you share the ALS settings, e.g., number of blocks, rank, and
> >> > number of iterations, as well as the number of users/items in your
> >> > dataset?
> >> > 2. If you monitor the progress in the WebUI, how much data is stored
> >> > in memory and how much data is shuffled per iteration?
> >> > 3. Do you have enough disk space for the shuffle files?
> >> > 4. Did you set checkpointDir in SparkContext and checkpointInterval in
> >> > ALS?
> >> >
> >> > Best,
> >> > Xiangrui
> >> >
> >> > On Fri, Jun 19, 2015 at 11:43 AM, Ravi Mody <rmody...@gmail.com> wrote:
> >> >> Hi, I'm running implicit matrix factorization/ALS in Spark 1.3.1 on
> >> >> fairly large datasets (1+ billion input records). As I grow my dataset
> >> >> I often run into issues with a lot of failed stages and dropped
> >> >> executors, ultimately leading to the whole application failing. The
> >> >> errors are like
> >> >> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
> >> >> output location for shuffle 19" and
> >> >> "org.apache.spark.shuffle.FetchFailedException:
> >> >> Failed to connect to...". These occur during flatMap, mapPartitions,
> >> >> and aggregate stages. I know that increasing memory fixes this issue,
> >> >> but most of the time my executors are only using a tiny portion of
> >> >> their allocated memory (<10%).
> >> >> Often, the stages run fine until the last iteration
> >> >> or two of ALS, but this could just be a coincidence.
> >> >>
> >> >> I've tried tweaking a lot of settings, but it's time-consuming to do
> >> >> this through guess-and-check. Right now I have these set:
> >> >> spark.shuffle.memoryFraction = 0.3
> >> >> spark.storage.memoryFraction = 0.65
> >> >> spark.executor.heartbeatInterval = 600000
> >> >>
> >> >> I'm sure these settings aren't optimal - any idea of what could be
> >> >> causing my errors, and what direction I can push these settings in to
> >> >> get more out of my memory? I'm currently using 240 GB of memory (on 7
> >> >> executors) for a 1 billion record dataset, which seems like too much.
> >> >> Thanks!
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> > For additional commands, e-mail: user-h...@spark.apache.org
> >> >
> >
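[One observation on the fractions quoted above: under the static memory management used in Spark 1.3, storage (0.65) and shuffle (0.3) together claim 95% of the heap, leaving very little slack for task working memory. A rough split, in plain Python, using the 240 GB / 7 executors from the thread; this is illustrative only and ignores Spark's internal safety fractions:]

```python
# Rough per-executor heap split under Spark 1.3-era static memory
# management, using the settings quoted in this thread.
heap_gb = 240 / 7            # ~34.3 GB per executor
storage_fraction = 0.65      # spark.storage.memoryFraction
shuffle_fraction = 0.3       # spark.shuffle.memoryFraction

storage_gb = heap_gb * storage_fraction
shuffle_gb = heap_gb * shuffle_fraction
other_gb = heap_gb - storage_gb - shuffle_gb

print(round(storage_gb, 1))  # 22.3
print(round(shuffle_gb, 1))  # 10.3
print(round(other_gb, 1))    # 1.7 GB left for everything else
```

If storage is mostly unused (as reported, <10% of allocated memory), lowering spark.storage.memoryFraction would free room for shuffle and task execution without adding hardware.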