Thanks for the update! I've also run into the block manager timeout issue;
it might be a good idea to increase the default significantly (if the TCP
connection itself dropped, the failure would surface almost immediately
anyway, so a large timeout costs little).
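
For example, something along these lines is what I have in mind (I believe
the relevant key around Spark 1.0 is spark.storage.blockManagerSlaveTimeoutMs
-- please correct me if that has changed; the 5-minute value is only
illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("wordcount")
      // assumed key name; raises the block manager timeout to 5 minutes
      .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")
    val sc = new SparkContext(conf)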


On Sun, Jun 1, 2014 at 9:48 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:

> Hi all,
>
> This is what I found:
>
> 1. As Aaron suggested, an executor will be killed silently when the
> OS runs out of memory. I've seen this enough times to conclude that
> it's real. Adding swap and increasing the JVM heap solved the problem
> (rough settings are sketched below, after item 3), but you will then
> run into OS paging and full GCs.
>
> 2. OS paging and full GCs did not affect my benchmark much while
> processing data from HDFS, but Akka processes were randomly killed
> during network-heavy stages (for example, sorting). I've found that an
> Akka process cannot fetch results fast enough; increasing the block
> manager timeout helped a lot. I doubled the value several times, as
> the network of our ARM cluster is quite slow.
>
> 3. We'd like to collect the time spent in every stage of our
> benchmark, so we always re-run when some tasks fail. Failures happened
> a lot, but that's understandable, as Spark is designed on top of
> Akka's let-it-crash philosophy. To make the benchmark run cleanly
> (without a task failure), I called .cache() before invoking the
> transformation of the next stage (sketched below), and it helped a lot.
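>
> Roughly what that looked like (simplified; a hypothetical sortByKey()
> stands in for our network-heavy stage):
>
>     import org.apache.spark.SparkContext._  // pair-RDD operations
>
>     val counts = sc.textFile("hdfs:///data/input")
>       .flatMap(_.split("\\s+"))
>       .map(word => (word, 1))
>       .reduceByKey(_ + _)
>     counts.cache()  // materialized on first use, so the sort stage
>                     // re-reads from memory instead of recomputing
>     val sorted = counts.sortByKey()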
>
> Combining the above with other tuning, we have now made our ARM
> cluster 2.8 times faster than in our first report.
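>
> For reference, the memory side of that tuning (item 1) looked roughly
> like this -- the value is only illustrative of what worked on our
> boards, and swap itself has to be added at the OS level:
>
>     val conf = new SparkConf()
>       // illustrative executor heap for our ARM nodes; tune per hardware
>       .set("spark.executor.memory", "1g")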
>
> Best regards,
>
> -chanwit
>
> --
> Chanwit Kaewkasi
> linkedin.com/in/chanwit
>
>
> On Wed, May 28, 2014 at 1:13 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:
> > Maybe that explains mine too.
> > Thank you very much, Aaron!!
> >
> > Best regards,
> >
> > -chanwit
> >
> > --
> > Chanwit Kaewkasi
> > linkedin.com/in/chanwit
> >
> >
> > On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson <ilike...@gmail.com> wrote:
> >> Spark should effectively turn Akka's failure detector off, because we
> >> historically had problems with GCs and other issues causing
> >> disassociations. The only thing that should cause these messages
> >> nowadays is if the TCP connection (which Akka sustains between Actor
> >> Systems on different machines) actually drops. TCP connections are
> >> pretty resilient, so one common cause of this is actual Executor
> >> failure -- recently, I have experienced a similar-sounding problem due
> >> to my machine's OOM killer terminating my Executors, such that they
> >> didn't produce any error output.
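> >>
> >> (For reference, these are the knobs I recall Spark exposing for this
> >> around 1.0 -- names from memory, so please double-check them against
> >> your version's docs:)
> >>
> >>     val conf = new SparkConf()
> >>       // a large threshold effectively disables the failure detector
> >>       .set("spark.akka.failure-detector.threshold", "300.0")
> >>       // tolerate very long heartbeat pauses (e.g. GC); seconds
> >>       .set("spark.akka.heartbeat.pauses", "600")
> >>       .set("spark.akka.heartbeat.interval", "1000")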
> >>
> >>
> >> On Thu, May 22, 2014 at 9:19 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> On an ARM cluster, I have been testing a wordcount program with JRE 7,
> >>> and everything is OK. But when I change to the embedded version of
> >>> Java SE (Oracle's eJRE), the same program cannot complete all of its
> >>> computing stages.
> >>>
> >>> It fails with many Akka disassociations.
> >>>
> >>> - I've been trying to increase Akka's timeout but am still stuck. I am
> >>> not sure what the right way to do so is. (I suspect that stop-the-world
> >>> GC pauses are causing this.)
> >>>
> >>> - Another question: how can I properly turn on Akka's logging to see
> >>> the root cause of this disassociation problem (in case my guess about
> >>> GC is wrong)?
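> >>>
> >>> For reference, what I've been trying looks roughly like this (I'm not
> >>> sure these are the right knobs, which is exactly my question):
> >>>
> >>>     val conf = new SparkConf()
> >>>       // a longer Akka communication timeout, in seconds
> >>>       .set("spark.akka.timeout", "300")
> >>>       // found in the docs for logging Akka lifecycle events; not yet
> >>>       // confirmed it reveals the disassociation cause
> >>>       .set("spark.akka.logLifecycleEvents", "true")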
> >>>
> >>> Best regards,
> >>>
> >>> -chanwit
> >>>
> >>> --
> >>> Chanwit Kaewkasi
> >>> linkedin.com/in/chanwit
> >>
> >>
>
