Thanks for the update! I've also run into the block manager timeout issue; it might be a good idea to increase the default significantly (it would probably time out immediately anyway if the TCP connection itself dropped).
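For anyone hitting the same thing, raising the timeouts amounts to a couple of properties in spark-defaults.conf. A rough sketch — the property names below are from the Spark ~1.0 configuration docs as I remember them, and the values are arbitrary examples, so please verify both against the docs for your version:

```
# Raise the block manager timeout (milliseconds; the default is much lower)
spark.storage.blockManagerSlaveTimeoutMs   300000
# Raise Akka's general ask timeout (seconds)
spark.akka.timeout                         300
```

The same properties can be set programmatically via SparkConf.set(...) before creating the SparkContext.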
On Sun, Jun 1, 2014 at 9:48 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:
> Hi all,
>
> This is what I found:
>
> 1. Like Aaron suggested, an executor will be killed silently when the
> OS is running out of memory. I've seen this enough times to conclude
> it's real. Adding swap and increasing the JVM heap solved the problem,
> but you will then encounter OS paging and full GCs.
>
> 2. OS paging and full GCs did not affect my benchmark much while
> processing data from HDFS, but Akka processes were randomly killed
> during the network-heavy stages (for example, sorting). I found that
> an Akka process could not fetch results fast enough. Increasing the
> block manager timeout helped a lot; I doubled the value several times,
> as the network of our ARM cluster is quite slow.
>
> 3. We'd like to collect the times spent in all stages of our
> benchmark, so we always re-run when some tasks fail. Failures happened
> a lot, but that's understandable as Spark is designed on top of Akka's
> let-it-crash philosophy. To make the benchmark run more cleanly
> (without task failures), I called .cache() before calling the
> transformation of the next stage, and that helped a lot.
>
> Combining the above with other tuning, we can now boost the
> performance of our ARM cluster to 2.8 times faster than our first
> report.
>
> Best regards,
>
> -chanwit
>
> --
> Chanwit Kaewkasi
> linkedin.com/in/chanwit
>
>
> On Wed, May 28, 2014 at 1:13 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:
> > Maybe that explains mine too.
> > Thank you very much, Aaron!!
> >
> > Best regards,
> >
> > -chanwit
> >
> > --
> > Chanwit Kaewkasi
> > linkedin.com/in/chanwit
> >
> >
> > On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson <ilike...@gmail.com> wrote:
> >> Spark should effectively turn Akka's failure detector off, because we
> >> historically had problems with GCs and other issues causing
> >> disassociations.
> >> The only thing that should cause these messages nowadays is if the TCP
> >> connection (which Akka sustains between Actor Systems on different
> >> machines) actually drops. TCP connections are pretty resilient, so one
> >> common cause of this is actual Executor failure -- recently, I have
> >> experienced a similar-sounding problem due to my machine's OOM killer
> >> terminating my Executors, such that they didn't produce any error
> >> output.
> >>
> >>
> >> On Thu, May 22, 2014 at 9:19 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:
> >>> Hi all,
> >>>
> >>> On an ARM cluster, I have been testing a wordcount program with JRE 7
> >>> and everything is OK. But when I change to the embedded version of
> >>> Java SE (Oracle's eJRE), the same program cannot complete all
> >>> computing stages. It fails with many Akka disassociations.
> >>>
> >>> - I've been trying to increase Akka's timeout but I'm still stuck.
> >>> I'm not sure what the right way to do so is. (I suspect that GC
> >>> pausing the world is causing this.)
> >>>
> >>> - Another question: how can I properly turn on Akka's logging to see
> >>> the root cause of this disassociation problem? (In case my guess
> >>> about GC is wrong.)
> >>>
> >>> Best regards,
> >>>
> >>> -chanwit
> >>>
> >>> --
> >>> Chanwit Kaewkasi
> >>> linkedin.com/in/chanwit
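For archive readers: the .cache() trick from point 3 above amounts to materializing the output of one stage before kicking off the next, network-heavy one, so a failed task re-reads cached partitions instead of recomputing the whole lineage. A minimal Scala sketch — the paths and RDD names are hypothetical, not taken from Chanwit's benchmark:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheBetweenStages {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))

    // CPU-bound stage: count words read from HDFS
    val counts = sc.textFile("hdfs:///input")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Materialize before the network-heavy stage, so a task failure
    // during the sort does not recompute everything above
    counts.cache()

    // Network-heavy stage reads the cached partitions
    counts.sortByKey().saveAsTextFile("hdfs:///output")

    sc.stop()
  }
}
```

Without the cache() call, any task that fails during the sort forces a recomputation back through the reduceByKey, which is exactly where a slow network and a flaky failure detector hurt the most.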