Thanks Matt for pointing out the problems that can happen, I will look into
it. And thanks Neil for sharing more details about your infrastructure; it
has been a great help.
I will run some tests with these instances and make my choice according to
our needs. Thank you all for the time spent. :)

2012/1/24 Neil Yalowitz <[email protected]>

> Hi Leonardo, excuse the late response.
>
> I read the link that Matt sent below RE: instance type and hardware
> isolation a while ago and struggled with the same problem with c1.xlarge
> and memory.  Another issue there, as Matt mentions with network throughput,
> is the type of network connection.  We decided to go with cluster compute
> instances (cc1.4xlarge) instead since the larger memory and fatter pipe
> (10Gbit) suited our needs (more MR daemons/children and large DB rows,
> respectively).  The c1.xlarge also seemed like a bad match as the best
> trait of that instance, the CPU units, isn't really our bottleneck (it's
> more an issue of RAM and I/O).
>
> While the cluster compute instances improved performance and stability
> somewhat, the pain hasn't stopped there.  Creating/terminating instances
> seems to be a lottery, possibly due to bad neighbors on the physical host
> or network.  Some cluster instances are rock solid for days and weeks while
> we run our tests, others are problematic within hours of creation despite
> having an identical setup.  Even with the cluster compute instances, we
> have test clusters where we will run benchmarks, wipe the data, and rerun
> the benchmarks with wildly different performance (off by 400%).
>  Occasionally, an instance will become unresponsive to pings and SSH and
> will completely fall out of the cluster.
>
> It seems the strategy for EC2 deployment is to expect everything to fail
> and plan accordingly.  It hasn't been a good experience.
>
>
>
> Neil Yalowitz
>
> On Mon, Jan 23, 2012 at 1:37 PM, Matt Corgan <[email protected]> wrote:
>
> > You could always try going with a little smaller heap and see how it
> > works for your particular workload, maybe 4G.  1G block cache, 1G
> > memstores, ~1G GC overhead(?), leaving 1G for active program data.
> >
> > If trying to squeeze memory, you should be aware there is a limitation in
> > 0.90 where storefile indexes come out of that remaining 1G as opposed to
> > being stored in the block cache.  If you have big indexes, you would need
> > to shrink block cache and memstore limits to compensate.
> >
> >
> > http://search-hadoop.com/m/OH4cT1LiN4Q1/corgan&subj=Re+a+question+storefileIndexSize
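Matt's 4G budget above is simple partition arithmetic; here is a rough sketch of it in Python (the fraction names echo hbase-site.xml settings such as hfile.block.cache.size and hbase.regionserver.global.memstore.upperLimit, but the exact values are illustrative assumptions, not taken from the thread):

```python
# Rough sketch of the 4G regionserver heap budget described above.
# Fractions are assumed for illustration; tune them per workload.

def heap_budget(heap_mb, block_cache_frac=0.25, memstore_frac=0.25,
                gc_overhead_frac=0.25):
    """Split a regionserver heap into the buckets Matt describes:
    block cache, memstores, GC overhead, and active program data."""
    block_cache = int(heap_mb * block_cache_frac)
    memstore = int(heap_mb * memstore_frac)
    gc_overhead = int(heap_mb * gc_overhead_frac)
    remaining = heap_mb - block_cache - memstore - gc_overhead
    return {"block_cache": block_cache, "memstore": memstore,
            "gc_overhead": gc_overhead, "remaining": remaining}

budget = heap_budget(4096)  # a 4G heap yields roughly 1G per bucket
```

Note the 0.90 caveat above: storefile indexes also come out of the remaining bucket, so large indexes eat directly into what is left for active program data, and the block cache and memstore fractions would need shrinking to compensate.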
> >
> >
> > On Mon, Jan 23, 2012 at 4:32 AM, Leonardo Gamas
> > <[email protected]>wrote:
> >
> > > Thanks again Matt! I will try out this instance type, but I'm concerned
> > > about the MapReduce cluster running apart from HBase in my case, since
> > > we have some MapReduce jobs running and plan to run more. It feels like
> > > losing the great strength of MapReduce by running it far from the data.
> > >
> > > 2012/1/21 Matt Corgan <[email protected]>
> > >
> > > > We actually don't run map/reduce on the same machines (most of our
> > > > jobs are on an old message based system), so don't have much
> > > > experience there.  We run only HDFS (1G heap) and HBase (5.5G heap)
> > > > with 12 * 100GB EBS volumes per regionserver, and ~350 regions/server
> > > > at the moment.  5.5G is already a small heap in the hbase world, so I
> > > > wouldn't recommend decreasing it to fit M/R.  You could always run
> > > > map/reduce on separate servers, adding or removing servers as needed
> > > > (more at night?), or use Amazon's Elastic M/R.
> > > >
> > > >
> > > > On Sat, Jan 21, 2012 at 5:04 AM, Leonardo Gamas
> > > > <[email protected]>wrote:
> > > >
> > > > > Thanks Matt for this insightful article, I will run my cluster with
> > > > > c1.xlarge to test its performance. But I'm concerned about this
> > > > > machine because of the amount of RAM available, only 7GB. How many
> > > > > map/reduce slots do you configure? And the amount of heap for HBase?
> > > > > How many regions per RegionServer could my cluster support?
> > > > >
> > > > > 2012/1/20 Matt Corgan <[email protected]>
> > > > >
> > > > > > I run c1.xlarge servers and have found them very stable.  I see
> > > > > > 100 Mbit/s sustained bi-directional network throughput (200 Mbit/s
> > > > > > total), sometimes up to 150 * 2 Mbit/s.
> > > > > >
> > > > > > Here's a pretty thorough examination of the underlying hardware:
> > > > > >
> > > > > > http://huanliu.wordpress.com/2010/06/14/amazons-physical-hardware-and-ec2-compute-unit/
> > > > > >
> > > > > >
> > > > > > *High-CPU instances*
> > > > > >
> > > > > > The high-CPU instances (c1.medium, c1.xlarge) run on systems with
> > > > > > dual-socket Intel Xeon E5410 2.33GHz processors. It is dual-socket
> > > > > > because we see APIC IDs 0 to 7, and E5410 only has 4 cores. A
> > > > > > c1.xlarge instance almost takes up the whole physical machine.
> > > > > > However, we frequently observe steal cycle on a c1.xlarge instance
> > > > > > ranging from 0% to 25% with an average of about 10%. The amount of
> > > > > > steal cycle is not enough to host another smaller VM, i.e., a
> > > > > > c1.medium. Maybe those steal cycles are used to run Amazon’s
> > > > > > software firewall (security group). On Passmark-CPU mark, a
> > > > > > c1.xlarge machine achieves 7,962.6, actually higher than an
> > > > > > average dual-socket E5410 system is able to achieve (average is
> > > > > > 6,903).
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Jan 20, 2012 at 8:03 AM, Leonardo Gamas
> > > > > > <[email protected]>wrote:
> > > > > >
> > > > > > > Thanks Neil for sharing your experience with AWS! Could you tell
> > > > > > > us what instance type you are using?
> > > > > > > We are using m1.xlarge, which has 4 virtual cores, but I normally
> > > > > > > see recommendations for machines with 8 cores, like c1.xlarge,
> > > > > > > m2.4xlarge, etc. In principle these 8-core machines don't suffer
> > > > > > > as much from I/O problems since they don't share the physical
> > > > > > > server. Is there any piece of information from Amazon or another
> > > > > > > source that confirms that, or is it based on empirical analysis?
> > > > > > >
> > > > > > > 2012/1/19 Neil Yalowitz <[email protected]>
> > > > > > >
> > > > > > > > We have experienced many problems with our cluster on EC2.
> > > > > > > > The blunt solution was to increase the Zookeeper timeout to 5
> > > > > > > > minutes or even more.
> > > > > > > >
> > > > > > > > Even with a long timeout, however, it's not uncommon for us to
> > > > > > > > see an EC2 instance become unresponsive to pings and SSH
> > > > > > > > several times during a week.  It's been a very bad environment
> > > > > > > > for clusters.
> > > > > > > >
> > > > > > > >
> > > > > > > > Neil
> > > > > > > >
> > > > > > > > On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas
> > > > > > > > <[email protected]>wrote:
> > > > > > > >
> > > > > > > > > Hi Guys,
> > > > > > > > >
> > > > > > > > > I have tested the parameters provided by Sandy, and it
> > > > > > > > > solved the GC problems with the -XX:+UseParallelOldGC,
> > > > > > > > > thanks for the help Sandy.
> > > > > > > > > I'm still experiencing some difficulties: the RegionServer
> > > > > > > > > continues to shut down, but it seems related to I/O. It
> > > > > > > > > starts to time out many connections, new connections to/from
> > > > > > > > > the machine time out too, and finally the RegionServer dies
> > > > > > > > > because of a YouAreDeadException. I will collect more data,
> > > > > > > > > but I think it's an issue inherent to the Amazon/virtualized
> > > > > > > > > environment.
> > > > > > > > >
> > > > > > > > > Thanks for the great help provided so far.
> > > > > > > > >
> > > > > > > > > 2012/1/5 Leonardo Gamas <[email protected]>
> > > > > > > > >
> > > > > > > > > > I don't think so; if Amazon stopped the machine it would
> > > > > > > > > > cause a stop of minutes, not seconds, and the DataNode,
> > > > > > > > > > TaskTracker and Zookeeper continue to work normally.
> > > > > > > > > > But it can be related to the shared environment nature of
> > > > > > > > > > Amazon, maybe some spike in I/O caused by another
> > > > > > > > > > virtualized server in the same physical machine.
> > > > > > > > > >
> > > > > > > > > > But the instance type I'm using is:
> > > > > > > > > >
> > > > > > > > > > *Extra Large Instance*
> > > > > > > > > >
> > > > > > > > > > 15 GB memory
> > > > > > > > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> > > > > > > > > > 1,690 GB instance storage
> > > > > > > > > > 64-bit platform
> > > > > > > > > > I/O Performance: High
> > > > > > > > > > API name: m1.xlarge
> > > > > > > > > >
> > > > > > > > > > I was not expecting to suffer from these problems, or at
> > > > > > > > > > least not this much.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 2012/1/5 Sandy Pratt <[email protected]>
> > > > > > > > > >
> > > > > > > > > >> You think it's an Amazon problem maybe?  Like they paused
> > > > > > > > > >> or migrated your virtual machine, and it just happens to
> > > > > > > > > >> be during GC, leaving us to think the GC ran long when it
> > > > > > > > > >> didn't?  I don't have a lot of experience with Amazon so I
> > > > > > > > > >> don't know if that sort of thing is common.
> > > > > > > > > >>
> > > > > > > > > >> > -----Original Message-----
> > > > > > > > > >> > From: Leonardo Gamas [mailto:[email protected]]
> > > > > > > > > >> > Sent: Thursday, January 05, 2012 13:15
> > > > > > > > > >> > To: [email protected]
> > > > > > > > > >> > Subject: Re: RegionServer dying every two or three days
> > > > > > > > > >> >
> > > > > > > > > >> > I checked the CPU Utilization graphics provided by
> > > > > > > > > >> > Amazon (it's not accurate, since the sample time is
> > > > > > > > > >> > about 5 minutes) and don't see any abnormality. I will
> > > > > > > > > >> > set up TSDB with Nagios to have a more reliable source
> > > > > > > > > >> > of performance data.
> > > > > > > > > >> >
> > > > > > > > > >> > The machines don't have swap space; if I run:
> > > > > > > > > >> >
> > > > > > > > > >> > $ swapon -s
> > > > > > > > > >> >
> > > > > > > > > >> > to display a swap usage summary, it returns an empty
> > > > > > > > > >> > list.
> > > > > > > > > >> >
> > > > > > > > > >> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in
> > > > > > > > > >> > my tests.
> > > > > > > > > >> >
> > > > > > > > > >> > I haven't paid much attention to the value of the new
> > > > > > > > > >> > size param.
> > > > > > > > > >> >
> > > > > > > > > >> > Thanks again for the help!!
> > > > > > > > > >> >
> > > > > > > > > >> > 2012/1/5 Sandy Pratt <[email protected]>
> > > > > > > > > >> >
> > > > > > > > > >> > > That size heap doesn't seem like it should cause a 36
> > > > > > > > > >> > > second GC (a minor GC even if I remember your logs
> > > > > > > > > >> > > correctly), so I tend to think that other things are
> > > > > > > > > >> > > probably going on.
> > > > > > > > > >> > >
> > > > > > > > > >> > > This line here:
> > > > > > > > > >> > >
> > > > > > > > > >> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs] 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01, real=36.96 secs]
> > > > > > > > > >> > >
> > > > > > > > > >> > > is really mysterious to me.  It seems to indicate that
> > > > > > > > > >> > > the process was blocked for almost 37 seconds during a
> > > > > > > > > >> > > minor collection.  Note the CPU times are very low but
> > > > > > > > > >> > > the wall time is very high.  If it was actually doing
> > > > > > > > > >> > > GC work, I'd expect to see user time higher than real
> > > > > > > > > >> > > time, as it is in other parallel collections (see your
> > > > > > > > > >> > > log snippet).  Were you really so CPU starved that it
> > > > > > > > > >> > > took 37 seconds to get in 50ms of work?  I can't make
> > > > > > > > > >> > > sense of that.  I'm trying to think of something that
> > > > > > > > > >> > > would block you for that long while all your threads
> > > > > > > > > >> > > are stopped for GC, other than being in swap, but I
> > > > > > > > > >> > > can't come up with anything.  You're certain you're
> > > > > > > > > >> > > not in swap?
> > > > > > > > > >> > >
> > > > > > > > > >> > > Maybe drop these guys -XX:+DoEscapeAnalysis
> > > > > > > > > >> > > -XX:+AggressiveOpts while you troubleshoot?
> > > > > > > > > >> > >
> > > > > > > > > >> > > Why is your new size so small?  This generally means
> > > > > > > > > >> > > that relatively more objects are being tenured than
> > > > > > > > > >> > > would be with a larger new size.  This could make
> > > > > > > > > >> > > collections of the old gen worse (GC time is said to
> > > > > > > > > >> > > be proportional to the number of live objects in the
> > > > > > > > > >> > > generation, and CMS does indeed cause STW pauses).  A
> > > > > > > > > >> > > typical new to tenured ratio might be 1:3.  Were the
> > > > > > > > > >> > > new gen GCs taking too long?  This is probably
> > > > > > > > > >> > > orthogonal to your immediate issue, though.
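Sandy's heuristic above (a stop-the-world entry whose wall-clock "real" time dwarfs the "user" + "sys" CPU time points at swapping or hypervisor stalls rather than actual GC work) can be mechanized. A minimal sketch, assuming `[Times: user=... sys=..., real=... secs]` lines as printed by -XX:+PrintGCDetails, with the thresholds chosen arbitrarily:

```python
import re

# Scan a -XX:+PrintGCDetails log for pauses where the JVM was blocked
# far longer than it spent doing GC work. Matches lines like:
#   [Times: user=0.05 sys=0.01, real=36.96 secs]
TIMES = re.compile(r"user=([\d.]+) sys=([\d.]+), real=([\d.]+)")

def suspicious_pauses(lines, real_factor=10.0, min_real=1.0):
    """Return (user, sys, real) tuples where wall time far exceeds
    CPU time, i.e. the pause was spent waiting (swap, steal, ...)
    rather than collecting."""
    hits = []
    for line in lines:
        m = TIMES.search(line)
        if not m:
            continue
        user, sy, real = map(float, m.groups())
        if real >= min_real and real > real_factor * max(user + sy, 0.01):
            hits.append((user, sy, real))
    return hits

log = ["[Times: user=0.05 sys=0.01, real=36.96 secs]",
       "[Times: user=0.40 sys=0.02, real=0.21 secs]"]
print(suspicious_pauses(log))  # only the 36.96s pause is flagged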
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> > > -----Original Message-----
> > > > > > > > > >> > > From: Leonardo Gamas [mailto:[email protected]]
> > > > > > > > > >> > > Sent: Thursday, January 05, 2012 5:33 AM
> > > > > > > > > >> > > To: [email protected]
> > > > > > > > > >> > > Subject: Re: RegionServer dying every two or three days
> > > > > > > > > >> > >
> > > > > > > > > >> > > St.Ack,
> > > > > > > > > >> > >
> > > > > > > > > >> > > I haven't made any attempt at GC tuning yet.
> > > > > > > > > >> > > I will read the perf section as suggested.
> > > > > > > > > >> > > I'm currently using Nagios + JMX to monitor the
> > > > > > > > > >> > > cluster, but it's currently used for alerting only;
> > > > > > > > > >> > > the perfdata is not being stored, so it's kind of
> > > > > > > > > >> > > useless right now. I was thinking of using TSDB to
> > > > > > > > > >> > > store it; any known cases of integration?
> > > > > > > > > >> > > ---
> > > > > > > > > >> > >
> > > > > > > > > >> > > Sandy,
> > > > > > > > > >> > >
> > > > > > > > > >> > > Yes, my timeout is 30 seconds:
> > > > > > > > > >> > >
> > > > > > > > > >> > > <property>
> > > > > > > > > >> > >   <name>zookeeper.session.timeout</name>
> > > > > > > > > >> > >   <value>30000</value>
> > > > > > > > > >> > > </property>
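One caveat when raising zookeeper.session.timeout this way: the ZooKeeper server clamps the negotiated session timeout to the range [2 * tickTime, 20 * tickTime] (the minSessionTimeout/maxSessionTimeout defaults), so with the default tickTime of 2000 ms any request above 40 seconds is silently reduced. A small sketch of that negotiation:

```python
# Sketch of how a ZooKeeper server negotiates the session timeout:
# the client's requested value is clamped to [min_mult * tickTime,
# max_mult * tickTime], which default to 2x and 20x tickTime.
def effective_session_timeout(requested_ms, tick_ms=2000,
                              min_mult=2, max_mult=20):
    return max(min_mult * tick_ms, min(requested_ms, max_mult * tick_ms))

print(effective_session_timeout(30000))   # honored: 30000
print(effective_session_timeout(300000))  # clamped to 40000
```

So the 30000 ms above is honored, but a 5-minute timeout like the one Neil mentions would also require raising maxSessionTimeout (or tickTime) in zoo.cfg on the quorum.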
> > > > > > > > > >> > >
> > > > > > > > > >> > > To our application it's a tolerable time to wait in
> > > > > > > > > >> > > case a RegionServer goes offline.
> > > > > > > > > >> > >
> > > > > > > > > >> > > My heap is 4GB and my JVM params are:
> > > > > > > > > >> > >
> > > > > > > > > >> > > -Xmx4096m -server -XX:+UseParNewGC
> > > > > > > > > >> > > -XX:+UseConcMarkSweepGC
> > > > > > > > > >> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> > > > > > > > > >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis
> > > > > > > > > >> > > -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails
> > > > > > > > > >> > > -XX:+PrintGCTimeStamps
> > > > > > > > > >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> > > > > > > > > >> > >
> > > > > > > > > >> > > I will try the -XX:+UseParallelOldGC param and post my
> > > > > > > > > >> > > feedback here.
> > > > > > > > > >> > > ---
> > > > > > > > > >> > >
> > > > > > > > > >> > > Ramkrishna,
> > > > > > > > > >> > >
> > > > > > > > > >> > > Seems the GC is the root of all evil in this case.
> > > > > > > > > >> > > ----
> > > > > > > > > >> > >
> > > > > > > > > >> > > Thank you all for the answers. I will try out the
> > > > > > > > > >> > > valuable advice given here and post my results.
> > > > > > > > > >> > >
> > > > > > > > > >> > > Leo Gamas.
> > > > > > > > > >> > >
> > > > > > > > > >> > > 2012/1/5 Ramkrishna S Vasudevan <[email protected]>
> > > > > > > > > >> > >
> > > > > > > > > >> > > > Recently we faced a similar problem and it was due
> > > > > > > > > >> > > > to GC config.
> > > > > > > > > >> > > > Pls check your GC.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > Regards
> > > > > > > > > >> > > > Ram
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > -----Original Message-----
> > > > > > > > > >> > > > From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> > > > > > > > > >> > > > Sent: Thursday, January 05, 2012 2:50 AM
> > > > > > > > > >> > > > To: [email protected]
> > > > > > > > > >> > > > Subject: Re: RegionServer dying every two or three days
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> > > > > > > > > >> > > > <[email protected]> wrote:
> > > > > > > > > >> > > > > The third line took 36.96 seconds to execute, can
> > > > > > > > > >> > > > > this be causing this problem?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > Probably.  Have you made any attempt at GC tuning?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Reading the code a little it seems that, even if
> > > > > > > > > >> > > > > it's disabled, if all files are targeted in a
> > > > > > > > > >> > > > > compaction, it's considered a major compaction. Is
> > > > > > > > > >> > > > > that right?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > That is right.  They get 'upgraded' from minor to
> > > > > > > > > >> > > > major.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > This should be fine though.  What you avoid by
> > > > > > > > > >> > > > setting major compactions to 0 is all regions being
> > > > > > > > > >> > > > major compacted on a period, a heavyweight effective
> > > > > > > > > >> > > > rewrite of all your data (unless already major
> > > > > > > > > >> > > > compacted).   It looks like you have this disabled,
> > > > > > > > > >> > > > which is good until you've wrestled your cluster
> > > > > > > > > >> > > > into submission.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > The machines don't have swap, so the swappiness
> > > > > > > > > >> > > > > parameter doesn't seem to apply here. Any other
> > > > > > > > > >> > > > > suggestions?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > See the perf section of the hbase manual.  It has
> > > > > > > > > >> > > > our current list.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > Are you monitoring your cluster w/ ganglia or tsdb?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > St.Ack
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Thanks.
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > 2012/1/4 Leonardo Gamas <
> > [email protected]>
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > >> I will investigate this, thanks for the
> response.
> > > > > > > > > >> > > > >>
> > > > > > > > > >> > > > >>
> > > > > > > > > >> > > > >> 2012/1/3 Sandy Pratt <[email protected]>
> > > > > > > > > >> > > > >>
> > > > > > > > > >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> > > > > > > > > >> > > > >>>
> > > > > > > > > >> > > > >>> It looks like the process has been unresponsive
> > > > > > > > > >> > > > >>> for some time, so ZK has terminated the session.
> > > > > > > > > >> > > > >>> Did you experience a long GC pause right before
> > > > > > > > > >> > > > >>> this?  If you don't have GC logging enabled for
> > > > > > > > > >> > > > >>> the RS, you can sometimes tell by noticing a gap
> > > > > > > > > >> > > > >>> in the timestamps of the log statements leading
> > > > > > > > > >> > > > >>> up to the crash.
> > > > > > > > > >> > > > >>>
> > > > > > > > > >> > > > >>> If it turns out to be GC, you might want to look
> > > > > > > > > >> > > > >>> at your kernel swappiness setting (set it to 0)
> > > > > > > > > >> > > > >>> and your JVM params.
> > > > > > > > > >> > > > >>>
> > > > > > > > > >> > > > >>> Sandy
> > > > > > > > > >> > > > >>>
> > > > > > > > > >> > > > >>>
> > > > > > > > > >> > > > >>> > -----Original Message-----
> > > > > > > > > >> > > > >>> > From: Leonardo Gamas [mailto:[email protected]]
> > > > > > > > > >> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> > > > > > > > > >> > > > >>> > To: [email protected]
> > > > > > > > > >> > > > >>> > Subject: RegionServer dying every two or three days
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > Hi,
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4
> > > > > > > > > >> > > > >>> > machines (1 Master + 3 Slaves), running on
> > > > > > > > > >> > > > >>> > Amazon EC2. The master is a High-Memory Extra
> > > > > > > > > >> > > > >>> > Large Instance (m2.xlarge) with NameNode,
> > > > > > > > > >> > > > >>> > JobTracker, HMaster and Zookeeper.
> > > > > > > > > >> > > > >>> > The slaves are Extra Large Instances
> > > > > > > > > >> > > > >>> > (m1.xlarge) running Datanode, TaskTracker,
> > > > > > > > > >> > > > >>> > RegionServer and Zookeeper.
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > From time to time, every two or three days,
> > > > > > > > > >> > > > >>> > one of the RegionServer processes goes down,
> > > > > > > > > >> > > > >>> > but the other processes (DataNode,
> > > > > > > > > >> > > > >>> > TaskTracker, Zookeeper) continue normally.
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > Reading the logs:
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > The connection with Zookeeper timed out:
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > ---------------------------
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
> > > > > > > > > >> > > > >>> > ---------------------------
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > And the Handlers start to fail:
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > ---------------------------
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
> > > > > > > > > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > > > > > > > > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > > > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > > > > > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > > > > > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > > > > > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
> > > > > > > > > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > > > > > > > > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > > > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > > > > > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > > > > > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > > > > > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > > > > > > > > >> > > > >>> > ---------------------------
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > And finally the server throws a YouAreDeadException :( :
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > ---------------------------
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > > > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > > > > > > >> > > > >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > > > > > > >> > > > >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > >
> > > > >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Delegating
> > > > > > > > > >> > > > Co
> > > > > > > > > >> > > > n
> > > > > > > > > >> > > > >>> > structorAccessorImpl.java:27)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>>
> > > > > > > > >
> > java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.ipc.RemoteException.instantiateException(Rem
> > > > > > > > > >> > > > >>> > ot
> > > > > > > > > >> > > > >>> > eExce
> > > > > > > > > >> > > > >>> > ption.java:95)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Re
> > > > > > > > > >> > > > >>> > mo
> > > > > > > > > >> > > > >>> > te
> > > > > > > > > >> > > > >>> > Exception.java:79)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > > > > > > > > >> > > > >>> > rv
> > > > > > > > > >> > > > >>> > erRep
> > > > > > > > > >> > > > >>> > ort(HRegionServer.java:735)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > > >> >
> > > > > > >
> > > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer
> > > > > > > > > >> > > > .j
> > > > > > > > > >> > > > >>> > ava:596)
> > > > > > > > > >> > > > >>> >         at
> > java.lang.Thread.run(Thread.java:662)
> > > > > > > > > >> > > > >>> > Caused by:
> > > org.apache.hadoop.ipc.RemoteException:
> > > > > > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > > > Server
> > > > > > > REPORT
> > > > > > > > > >> > > > >>> > rejected; currently processing
> > > > > > > > > >> > > > >>> >
> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > > > > > > > >> > > > as
> > > > > > > > > >> > > > >>> > dead server
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(Serve
> > > > > > > > > >> > > > >>> > rM
> > > > > > > > > >> > > > >>> > ana
> > > > > > > > > >> > > > >>> > ger.java:204)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.master.ServerManager.regionServerRepor
> > > > > > > > > >> > > > >>> > t(
> > > > > > > > > >> > > > >>> > Serv
> > > > > > > > > >> > > > >>> > erManager.java:262)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMas
> > > > > > > > > >> > > > >>> > te
> > > > > > > > > >> > > > >>> > r.jav
> > > > > > > > > >> > > > >>> > a:669)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> > > > > > > > > >> > > > Source)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMeth
> > > > > > > > > >> > > > >>> > od
> > > > > > > > > >> > > > >>> > Acces
> > > > > > > > > >> > > > >>> > sorImpl.java:25)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > java.lang.reflect.Method.invoke(Method.java:597)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > > > >
> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > > >> >
> > > > > > >
> > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > > > > > > >> > > > :1
> > > > > > > > > >> > > > >>> > 039)
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > > >
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.j
> > > > > > > > > >> > > > >>> > av
> > > > > > > > > >> > > > >>> > a:257
> > > > > > > > > >> > > > >>> > )
> > > > > > > > > >> > > > >>> >         at
> $Proxy6.regionServerReport(Unknown
> > > > > Source)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > > > > > > > > >> > > > >>> > rv
> > > > > > > > > >> > > > >>> > erRep
> > > > > > > > > >> > > > >>> > ort(HRegionServer.java:729)
> > > > > > > > > >> > > > >>> >         ... 2 more
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:03 INFO
> > > regionserver.HRegionServer:
> > > > > > Dump
> > > > > > > of
> > > > > > > > > >> > metrics:
> > > > > > > > > >> > > > >>> > requests=66, regions=206, stores=2078,
> > > > > storefiles=970,
> > > > > > > > > >> > > > >>> > storefileIndexSize=78, memstoreSize=796,
> > > > > > > > > >> > > > >>> > compactionQueueSize=0, flushQueueSize=0,
> > > > > > usedHeap=1672,
> > > > > > > > > >> > > > >>> > maxHeap=4083, blockCacheSize=705907552,
> > > > > > > > > >> > > > >>> > blockCacheFree=150412064,
> > blockCacheCount=10648,
> > > > > > > > > >> > > > >>> > blockCacheHitCount=79578618,
> > > > > > > blockCacheMissCount=3036335,
> > > > > > > > > >> > > > >>> > blockCacheEvictedCount=1401352,
> > > > > blockCacheHitRatio=96,
> > > > > > > > > >> > > > >>> > blockCacheHitCachingRatio=98
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:03 INFO
> > > regionserver.HRegionServer:
> > > > > > > > STOPPED:
> > > > > > > > > >> > > > >>> > Unhandled
> > > > > > > > > >> > > > >>> > exception:
> > > > > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > > > > > >> > Server
> > > > > > > > > >> > > > >>> > REPORT rejected; currently processing
> > > > > > > > > >> > > > >>> >
> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > as
> > > > > > dead
> > > > > > > > > server
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer:
> > Stopping
> > > > > > server
> > > > > > > on
> > > > > > > > > >> > > > >>> > 60020
> > > > > > > > > >> > > > >>> > ---------------------------
>
> Then I restart the RegionServer and everything is back to normal.
> Reading the DataNode, Zookeeper and TaskTracker logs, I don't see any
> abnormality in the same time window.
> I think it was caused by the loss of connection to Zookeeper. Is it
> advisable to run Zookeeper on the same machines?
> If the RegionServer loses its connection to Zookeeper, is there a way
> (a configuration perhaps) to re-join the cluster, and not simply die?
>
> Any idea what is causing this? Or how to prevent it from happening?
>
> Any help is appreciated.
>
> Best Regards,
>
> --
>
> *Leonardo Gamas*
> Software Engineer
> +557134943514
> +557581347440
> [email protected]
> www.jusbrasil.com.br



-- 

*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514
C +55 (75) 8134-7440
[email protected]
www.jusbrasil.com.br
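
P.S. For anyone who finds this thread with the same session-expiry abort:
the knob that governs how long ZooKeeper waits before declaring the
RegionServer dead is `zookeeper.session.timeout` in hbase-site.xml. A
sketch follows; the 120000 ms value is purely illustrative, and note that
the ZooKeeper ensemble caps the negotiated timeout at 20x its tickTime, so
raising only the HBase side may have no effect:

```xml
<!-- hbase-site.xml: illustrative values, not a recommendation -->
<property>
  <name>zookeeper.session.timeout</name>
  <!-- milliseconds; the ZK server bounds this between 2x and 20x tickTime -->
  <value>120000</value>
</property>
```

A longer timeout only masks the symptom (usually long GC pauses or noisy
EC2 neighbors); once the session has actually expired, the RegionServer
aborts by design and must be restarted, e.g. by a supervisor process.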
