Hi all,

This is what I found:

1. As Aaron suggested, an executor is killed silently when the OS
runs out of memory.
I have seen this often enough to conclude that it is real. Adding swap
and increasing the JVM heap solved the problem, but you will then
encounter OS paging and full GCs.

2. OS paging and full GCs are not likely to affect my benchmark much
while it processes data from HDFS. But Akka processes were randomly
killed during the network-related stages (for example, sorting). I
found that an Akka process could not fetch results fast enough.
Increasing the block manager timeout helped a lot. I doubled the
value several times, as the network of our ARM cluster is quite slow.

3. We'd like to collect the time spent in every stage of our
benchmark, so we always re-run when some tasks fail. Failures happened
a lot, but that is understandable, as Spark is designed on top of
Akka's let-it-crash philosophy. To make the benchmark run cleanly
(without a task failure), I called .cache() before calling the
transformation of the next stage, and it helped a lot.
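
For point 1, a silent kill by the OOM killer usually leaves a trace in
the kernel log even when the executor itself prints nothing. Below is a
minimal Python sketch that scans such a log for killed JVMs. The
function name is illustrative, and the "Killed process <pid> (<name>)"
wording is an assumption about the kernel's message format, which
varies between kernel versions.

```python
import re

# Assumed format of the kernel's OOM-killer message; adjust for your kernel.
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines, name="java"):
    """Return PIDs of processes called `name` that the OOM killer terminated."""
    pids = []
    for line in log_lines:
        match = KILLED.search(line)
        if match and match.group(2) == name:
            pids.append(int(match.group(1)))
    return pids
```

The input can be any iterable of lines, for example the saved output
of dmesg on the worker node where the executor disappeared.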
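
For point 2, the doubling can be written down as a small helper that
emits candidate settings in the whitespace-separated key/value form
used by conf/spark-defaults.conf. This is only a sketch: the message
above does not name the property, so
spark.storage.blockManagerSlaveTimeoutMs is an assumption (check the
configuration page of your Spark version), and the starting value in
the usage note is illustrative.

```python
# Assumed property name; the original message only says "block manager timeout".
TIMEOUT_KEY = "spark.storage.blockManagerSlaveTimeoutMs"


def doubled_timeouts(start_ms, doublings):
    """Return start_ms followed by `doublings` successive doublings."""
    return [start_ms * 2 ** i for i in range(doublings + 1)]


def conf_line(timeout_ms, key=TIMEOUT_KEY):
    """Format one key/value line for a Spark properties file."""
    return "%s %d" % (key, timeout_ms)
```

For example, doubled_timeouts(45000, 3) yields 45000, 90000, 180000
and 360000 ms, and each value can be tried in turn until the
network-heavy stages stop failing.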
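
For point 3, the caching pattern might look like the following
PySpark-style sketch. The function name and the word-count pipeline
are illustrative rather than the benchmark's actual code, sc is
assumed to be an existing SparkContext, and the explicit count() is an
added assumption: cache() is lazy, so an action has to run before the
data is actually held in memory.

```python
def sorted_word_counts(sc, path):
    """Count words, cache the counts, then sort them in a later stage."""
    counts = (sc.textFile(path)
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.cache()   # mark the intermediate RDD for in-memory persistence
    counts.count()   # assumption: run an action so the cache is populated
    # The sort starts from the cached counts instead of recomputing them.
    return counts.sortByKey()
```

If a task in the sort stage fails and is retried, it can read the
cached partitions where they are still available rather than
re-running the earlier transformations from the input.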

Combining the above with other tuning, we have now made our ARM
cluster 2.8 times faster than in our first report.

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


On Wed, May 28, 2014 at 1:13 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:
> Maybe that explains my problem too.
> Thank you very much, Aaron!
>
> Best regards,
>
> -chanwit
>
> --
> Chanwit Kaewkasi
> linkedin.com/in/chanwit
>
>
> On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>> Spark should effectively turn Akka's failure detector off, because we
>> historically had problems with GCs and other issues causing disassociations.
>> The only thing that should cause these messages nowadays is if the TCP
>> connection (which Akka sustains between Actor Systems on different machines)
>> actually drops. TCP connections are pretty resilient, so one common cause of
>> this is actual Executor failure -- recently, I have experienced a
>> similar-sounding problem due to my machine's OOM killer terminating my
>> Executors, such that they didn't produce any error output.
>>
>>
>> On Thu, May 22, 2014 at 9:19 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> On an ARM cluster, I have been testing a wordcount program with JRE 7
>>> and everything is OK. But when changing to the embedded version of
>>> Java SE (Oracle's eJRE), the same program cannot complete all
>>> computing stages.
>>>
>>> It fails with many Akka disassociation errors.
>>>
>>> - I've been trying to increase Akka's timeout but am still stuck. I
>>> am not sure what the right way to do so is. (I suspect that
>>> stop-the-world GC pauses are causing this.)
>>>
>>> - Another question: how can I properly turn on Akka's logging to see
>>> the root cause of this disassociation problem (in case my guess
>>> about GC is wrong)?
>>>
>>> Best regards,
>>>
>>> -chanwit
>>>
>>> --
>>> Chanwit Kaewkasi
>>> linkedin.com/in/chanwit
>>
>>
