We actually don't run map/reduce on the same machines (most of our jobs are
on an old message-based system), so we don't have much experience there.  We
run only HDFS (1G heap) and HBase (5.5G heap) with 12 * 100GB EBS volumes
per regionserver, and ~350 regions/server at the moment.  5.5G is already a
small heap in the HBase world, so I wouldn't recommend decreasing it to fit
M/R.  You could always run map/reduce on separate servers, adding or
removing servers as needed (more at night?), or use Amazon's Elastic M/R.
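
For reference, the heap sizes above are just the usual env settings; a rough
sketch with illustrative values (adjust for your own hardware):

# hbase-env.sh -- regionserver heap, in MB
export HBASE_HEAPSIZE=5500

# hadoop-env.sh -- HDFS daemon heap, in MB
export HADOOP_HEAPSIZE=1000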


On Sat, Jan 21, 2012 at 5:04 AM, Leonardo Gamas
<[email protected]> wrote:

> Thanks Matt for this insightful article, I will run my cluster with
> c1.xlarge to test its performance. But I'm concerned about this machine
> because of the amount of RAM available, only 7GB. How many map/reduce slots
> do you configure? And how much heap for HBase? How many regions per
> RegionServer could my cluster support?
>
> 2012/1/20 Matt Corgan <[email protected]>
>
> > I run c1.xlarge servers and have found them very stable.  I see 100 Mbit/s
> > sustained bi-directional network throughput (200 Mbit/s total), sometimes
> > up to 150 * 2 Mbit/s.
> >
> > Here's a pretty thorough examination of the underlying hardware:
> >
> > http://huanliu.wordpress.com/2010/06/14/amazons-physical-hardware-and-ec2-compute-unit/
> >
> >
> > *High-CPU instances*
> >
> > The high-CPU instances (c1.medium, c1.xlarge) run on systems with
> > dual-socket Intel Xeon E5410 2.33GHz processors. It is dual-socket because
> > we see APIC IDs 0 to 7, and E5410 only has 4 cores. A c1.xlarge instance
> > almost takes up the whole physical machine. However, we frequently observe
> > steal cycles on a c1.xlarge instance ranging from 0% to 25% with an average
> > of about 10%. The amount of steal cycles is not enough to host another
> > smaller VM, i.e., a c1.medium. Maybe those steal cycles are used to run
> > Amazon's software firewall (security group). On Passmark-CPU mark, a
> > c1.xlarge machine achieves 7,962.6, actually higher than an average
> > dual-socket E5410 system is able to achieve (average is 6,903).
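> >
> > A quick way to see that steal for yourself on a Linux guest, assuming a
> > stock procps install, is the "st" column in vmstat or the %st figure in
> > top's CPU summary line, e.g.:
> >
> > $ vmstat 5 3    # last column, "st", is CPU time stolen by the hypervisor
> >
> > so you can check whether your instances see the same 0-25% range.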
> >
> >
> >
> > On Fri, Jan 20, 2012 at 8:03 AM, Leonardo Gamas
> > <[email protected]> wrote:
> >
> > > Thanks Neil for sharing your experience with AWS! Could you tell us what
> > > instance type you are using?
> > > We are using m1.xlarge, which has 4 virtual cores, but I normally see
> > > recommendations for machines with 8 cores like c1.xlarge, m2.4xlarge, etc.
> > > In principle these 8-core machines don't suffer as much from I/O problems
> > > since they don't share the physical server. Is there any piece of
> > > information from Amazon or another source that confirms that, or is it
> > > based on empirical analysis?
> > >
> > > 2012/1/19 Neil Yalowitz <[email protected]>
> > >
> > > > We have experienced many problems with our cluster on EC2.  The blunt
> > > > solution was to increase the Zookeeper timeout to 5 minutes or even more.
> > > >
> > > > Even with a long timeout, however, it's not uncommon for us to see an EC2
> > > > instance become unresponsive to pings and SSH several times during a
> > > > week.  It's been a very bad environment for clusters.
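> > > >
> > > > For reference, that bump is just the usual hbase-site.xml property; a
> > > > sketch with 300000 ms as an illustrative 5 minute value (the ZK server's
> > > > own maxSessionTimeout may also need to allow a value that large, if I
> > > > remember right):
> > > >
> > > > <property>
> > > >   <name>zookeeper.session.timeout</name>
> > > >   <value>300000</value>
> > > > </property>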
> > > >
> > > >
> > > > Neil
> > > >
> > > > On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi Guys,
> > > > >
> > > > > I have tested the parameters provided by Sandy, and -XX:+UseParallelOldGC
> > > > > solved the GC problems, thanks for the help Sandy.
> > > > > I'm still experiencing some difficulties: the RegionServer continues to
> > > > > shut down, but it seems related to I/O. It starts to time out many
> > > > > connections, new connections to/from the machine time out too, and
> > > > > finally the RegionServer dies because of a YouAreDeadException. I will
> > > > > collect more data, but I think it's an issue inherent to Amazon's
> > > > > virtualized environment.
> > > > >
> > > > > Thanks for the great help provided so far.
> > > > >
> > > > > 2012/1/5 Leonardo Gamas <[email protected]>
> > > > >
> > > > > > I don't think so; if Amazon stopped the machine it would cause a stop
> > > > > > of minutes, not seconds, and the DataNode, TaskTracker and Zookeeper
> > > > > > continue to work normally.
> > > > > > But it could be related to the shared nature of Amazon's environment,
> > > > > > maybe some spike in I/O caused by another virtualized server on the
> > > > > > same physical machine.
> > > > > >
> > > > > > This is the instance type I'm using:
> > > > > >
> > > > > > *Extra Large Instance*
> > > > > >
> > > > > > 15 GB memory
> > > > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> > > > > > 1,690 GB instance storage
> > > > > > 64-bit platform
> > > > > > I/O Performance: High
> > > > > > API name: m1.xlarge
> > > > > >
> > > > > > I was not expecting to suffer from these problems, or at least not this
> > > > > > much.
> > > > > >
> > > > > >
> > > > > > 2012/1/5 Sandy Pratt <[email protected]>
> > > > > >
> > > > > >> You think it's an Amazon problem maybe?  Like they paused or migrated
> > > > > >> your virtual machine, and it just happens to be during GC, leaving us
> > > > > >> to think the GC ran long when it didn't?  I don't have a lot of
> > > > > >> experience with Amazon so I don't know if that sort of thing is common.
> > > > > >>
> > > > > >> > -----Original Message-----
> > > > > >> > From: Leonardo Gamas [mailto:[email protected]]
> > > > > >> > Sent: Thursday, January 05, 2012 13:15
> > > > > >> > To: [email protected]
> > > > > >> > Subject: Re: RegionServer dying every two or three days
> > > > > >> >
> > > > > >> > I checked the CPU utilization graphs provided by Amazon (they're not
> > > > > >> > accurate, since the sample interval is about 5 minutes) and don't see
> > > > > >> > any abnormality. I will set up TSDB with Nagios to have a more
> > > > > >> > reliable source of performance data.
> > > > > >> >
> > > > > >> > The machines don't have swap space; if I run:
> > > > > >> >
> > > > > >> > $ swapon -s
> > > > > >> >
> > > > > >> > to display a swap usage summary, it returns an empty list.
> > > > > >> >
> > > > > >> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my next tests.
> > > > > >> >
> > > > > >> > I hadn't paid much attention to the value of the new size param.
> > > > > >> >
> > > > > >> > Thanks again for the help!!
> > > > > >> >
> > > > > >> > 2012/1/5 Sandy Pratt <[email protected]>
> > > > > >> >
> > > > > >> > > That size heap doesn't seem like it should cause a 36 second GC (a
> > > > > >> > > minor GC even if I remember your logs correctly), so I tend to think
> > > > > >> > > that other things are probably going on.
> > > > > >> > >
> > > > > >> > > This line here:
> > > > > >> > >
> > > > > >> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
> > > > > >> > > 954388K->849478K(1705776K), 0.0364200 secs]
> > > > > >> > > [Times: user=0.05 sys=0.01, real=36.96 secs]
> > > > > >> > >
> > > > > >> > > is really mysterious to me.  It seems to indicate that the process was
> > > > > >> > > blocked for almost 37 seconds during a minor collection.  Note the CPU
> > > > > >> > > times are very low but the wall time is very high.  If it was actually
> > > > > >> > > doing GC work, I'd expect to see user time higher than real time, as
> > > > > >> > > it is in other parallel collections (see your log snippet).  Were you
> > > > > >> > > really so CPU starved that it took 37 seconds to get in 50ms of work?
> > > > > >> > > I can't make sense of that.  I'm trying to think of something that
> > > > > >> > > would block you for that long while all your threads are stopped for
> > > > > >> > > GC, other than being in swap, but I can't come up with anything.
> > > > > >> > > You're certain you're not in swap?
> > > > > >> > >
> > > > > >> > > Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts while
> > > > > >> > > you troubleshoot?
> > > > > >> > >
> > > > > >> > > Why is your new size so small?  This generally means that relatively
> > > > > >> > > more objects are being tenured than would be with a larger new size.
> > > > > >> > > This could make collections of the old gen worse (GC time is said to
> > > > > >> > > be proportional to the number of live objects in the generation, and
> > > > > >> > > CMS does indeed cause STW pauses).  A typical new to tenured ratio
> > > > > >> > > might be 1:3.  Were the new gen GCs taking too long?  This is probably
> > > > > >> > > orthogonal to your immediate issue, though.
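> > > > > >> > >
> > > > > >> > > To illustrate that 1:3 ratio with made-up numbers: on a 4G heap it
> > > > > >> > > would be roughly
> > > > > >> > >
> > > > > >> > > -Xmx4096m -XX:NewSize=1024m -XX:MaxNewSize=1024m
> > > > > >> > >
> > > > > >> > > or equivalently -XX:NewRatio=3.  Illustrative sizing only, not a
> > > > > >> > > recommendation for your workload.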
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > -----Original Message-----
> > > > > >> > > From: Leonardo Gamas [mailto:[email protected]]
> > > > > >> > > Sent: Thursday, January 05, 2012 5:33 AM
> > > > > >> > > To: [email protected]
> > > > > >> > > Subject: Re: RegionServer dying every two or three days
> > > > > >> > >
> > > > > >> > > St.Ack,
> > > > > >> > >
> > > > > >> > > I haven't made any attempt at GC tuning yet.
> > > > > >> > > I will read the perf section as suggested.
> > > > > >> > > I'm currently using Nagios + JMX to monitor the cluster, but it's
> > > > > >> > > currently used for alerting only; the perf data is not being stored,
> > > > > >> > > so it's kind of useless right now. I was thinking of using TSDB to
> > > > > >> > > store it; any known case of integration?
> > > > > >> > > ---
> > > > > >> > >
> > > > > >> > > Sandy,
> > > > > >> > >
> > > > > >> > > Yes, my timeout is 30 seconds:
> > > > > >> > >
> > > > > >> > > <property>
> > > > > >> > >   <name>zookeeper.session.timeout</name>
> > > > > >> > >   <value>30000</value>
> > > > > >> > > </property>
> > > > > >> > >
> > > > > >> > > For our application that's a tolerable time to wait in case a
> > > > > >> > > RegionServer goes offline.
> > > > > >> > >
> > > > > >> > > My heap is 4GB and my JVM params are:
> > > > > >> > >
> > > > > >> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> > > > > >> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> > > > > >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> > > > > >> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > > > > >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> > > > > >> > >
> > > > > >> > > I will try the -XX:+UseParallelOldGC param and post my feedback here.
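> > > > > >> > >
> > > > > >> > > If I understand the suggestion, the flags would then look roughly like
> > > > > >> > > this (just a sketch, not verified on my cluster yet):
> > > > > >> > >
> > > > > >> > > -Xmx4096m -server -XX:+UseParallelGC -XX:+UseParallelOldGC
> > > > > >> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > > > > >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> > > > > >> > >
> > > > > >> > > i.e. dropping the ParNew/CMS flags, since CMS doesn't combine with the
> > > > > >> > > parallel old collector.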
> > > > > >> > > ---
> > > > > >> > >
> > > > > >> > > Ramkrishna,
> > > > > >> > >
> > > > > >> > > Seems the GC is the root of all evil in this case.
> > > > > >> > > ----
> > > > > >> > >
> > > > > >> > > Thank you all for the answers. I will try out the valuable advice
> > > > > >> > > given here and post my results.
> > > > > >> > >
> > > > > >> > > Leo Gamas.
> > > > > >> > >
> > > > > >> > > 2012/1/5 Ramkrishna S Vasudevan <[email protected]>
> > > > > >> > >
> > > > > >> > > > Recently we faced a similar problem and it was due to GC
> > > config.
> > > > > >> > > > Pls check your GC.
> > > > > >> > > >
> > > > > >> > > > Regards
> > > > > >> > > > Ram
> > > > > >> > > >
> > > > > >> > > > -----Original Message-----
> > > > > >> > > > From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> > > > > >> > > > Sent: Thursday, January 05, 2012 2:50 AM
> > > > > >> > > > To: [email protected]
> > > > > >> > > > Subject: Re: RegionServer dying every two or three days
> > > > > >> > > >
> > > > > >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> > > > > >> > > > <[email protected]> wrote:
> > > > > >> > > > > The third line took 36.96 seconds to execute, can this be causing
> > > > > >> > > > > this problem?
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > > > Probably.  Have you made any attempt at GC tuning?
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > > Reading the code a little, it seems that, even if it's disabled,
> > > > > >> > > > > if all files are targeted in a compaction, it's considered a major
> > > > > >> > > > > compaction. Is that right?
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > > > That is right.  They get 'upgraded' from minor to major.
> > > > > >> > > >
> > > > > >> > > > This should be fine though.  What you are avoiding by setting major
> > > > > >> > > > compactions to 0 is all regions being major compacted on a schedule,
> > > > > >> > > > a heavyweight, effective rewrite of all your data (unless already
> > > > > >> > > > major compacted).  It looks like you have this disabled, which is
> > > > > >> > > > good until you've wrestled your cluster into submission.
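> > > > > >> > > >
> > > > > >> > > > For anyone following along, disabling the periodic major compactions
> > > > > >> > > > is the hbase.hregion.majorcompaction interval set to 0 in
> > > > > >> > > > hbase-site.xml, something like:
> > > > > >> > > >
> > > > > >> > > > <property>
> > > > > >> > > >   <name>hbase.hregion.majorcompaction</name>
> > > > > >> > > >   <value>0</value>
> > > > > >> > > > </property>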
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > > The machines don't have swap, so the swappiness parameter doesn't
> > > > > >> > > > > seem to apply here. Any other suggestions?
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > > > See the perf section of the hbase manual.  It has our current list.
> > > > > >> > > >
> > > > > >> > > > Are you monitoring your cluster w/ ganglia or tsdb?
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > St.Ack
> > > > > >> > > >
> > > > > >> > > > > Thanks.
> > > > > >> > > > >
> > > > > >> > > > > 2012/1/4 Leonardo Gamas <[email protected]>
> > > > > >> > > > >
> > > > > >> > > > >> I will investigate this, thanks for the response.
> > > > > >> > > > >>
> > > > > >> > > > >>
> > > > > >> > > > >> 2012/1/3 Sandy Pratt <[email protected]>
> > > > > >> > > > >>
> > > > > >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> > > > > >> > > > >>> timed out, have not heard from server in 61103ms for sessionid
> > > > > >> > > > >>> 0x23462a4cf93a8fc, closing socket connection and attempting
> > > > > >> > > > >>> reconnect
> > > > > >> > > > >>>
> > > > > >> > > > >>> It looks like the process has been unresponsive for some time,
> > > > > >> > > > >>> so ZK has terminated the session.  Did you experience a long GC
> > > > > >> > > > >>> pause right before this?  If you don't have GC logging enabled
> > > > > >> > > > >>> for the RS, you can sometimes tell by noticing a gap in the
> > > > > >> > > > >>> timestamps of the log statements leading up to the crash.
> > > > > >> > > > >>>
> > > > > >> > > > >>> If it turns out to be GC, you might want to look at your kernel
> > > > > >> > > > >>> swappiness setting (set it to 0) and your JVM params.
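> > > > > >> > > > >>>
> > > > > >> > > > >>> Setting swappiness to 0 is the usual sysctl on a stock Linux
> > > > > >> > > > >>> kernel, e.g.:
> > > > > >> > > > >>>
> > > > > >> > > > >>> $ sysctl -w vm.swappiness=0   # or put vm.swappiness=0 in /etc/sysctl.conf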
> > > > > >> > > > >>>
> > > > > >> > > > >>> Sandy
> > > > > >> > > > >>>
> > > > > >> > > > >>>
> > > > > >> > > > >>> > -----Original Message-----
> > > > > >> > > > >>> > From: Leonardo Gamas [mailto:[email protected]]
> > > > > >> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> > > > > >> > > > >>> > To: [email protected]
> > > > > >> > > > >>> > Subject: RegionServer dying every two or three days
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > Hi,
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > I have an HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
> > > > > >> > > > >>> > Slaves), running on Amazon EC2. The master is a High-Memory Extra
> > > > > >> > > > >>> > Large Instance (m2.xlarge) with NameNode, JobTracker, HMaster and
> > > > > >> > > > >>> > Zookeeper.
> > > > > >> > > > >>> > The slaves are Extra Large Instances (m1.xlarge) running DataNode,
> > > > > >> > > > >>> > TaskTracker, RegionServer and Zookeeper.
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > From time to time, every two or three days, one of the
> > > > > >> > > > >>> > RegionServer processes goes down, but the other processes
> > > > > >> > > > >>> > (DataNode, TaskTracker, Zookeeper) continue normally.
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > Reading the logs:
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > The connection with Zookeeper timed out:
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > ---------------------------
> > > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
> > > > > >> > > > >>> > have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
> > > > > >> > > > >>> > closing socket connection and attempting reconnect
> > > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
> > > > > >> > > > >>> > have not heard from server in 61205ms for sessionid 0x346c561a55953e,
> > > > > >> > > > >>> > closing socket connection and attempting reconnect
> > > > > >> > > > >>> > ---------------------------
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > And the Handlers start to fail:
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > ---------------------------
> > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> > > > > >> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
> > > > > >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
> > > > > >> > > > >>> > caught: java.nio.channels.ClosedChannelException
> > > > > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > > > > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> > > > > >> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
> > > > > >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
> > > > > >> > > > >>> > caught: java.nio.channels.ClosedChannelException
> > > > > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > > > > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > > > > >> > > > >>> > ---------------------------
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > And finally the server throws a YouAreDeadException :( :
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > ---------------------------
> > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection
> > > > > >> > > > >>> > to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > > > > >> > > > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> > > > > >> > > > >>> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing
> > > > > >> > > > >>> > socket connection
> > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection
> > > > > >> > > > >>> > to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > > > > >> > > > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> > > > > >> > > > >>> > ZooKeeper service, session 0x346c561a55953e has expired, closing
> > > > > >> > > > >>> > socket connection
> > > > > >> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region
> > > > > >> > > > >>> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> > > > > >> > > > >>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
> > > > > >> > > > >>> > Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
> > > > > >> > > > >>> > Server REPORT rejected; currently processing
> > > > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > > > > >> > > > >>> > rejected; currently processing
> > > > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > > >> > > > >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > > >> > > > >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > > > >> > > > >>> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > > > > >> > > > >>> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > > > >> > > > >>> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> > > > > >> > > > >>> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> > > > > >> > > > >>> >         at java.lang.Thread.run(Thread.java:662)
> > > > > >> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
> > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > > > > >> > > > >>> > rejected; currently processing
> > > > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> > > > > >> > > > >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> > > > > >> > > > >>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > > >> > > > >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> > > > > >> > > > >>> >
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> > > > > >> > > > >>> >         at $Proxy6.regionServerReport(Unknown Source)
> > > > > >> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> > > > > >> > > > >>> >         ... 2 more
> > > > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
> > > > > >> > > > >>> > requests=66, regions=206, stores=2078, storefiles=970,
> > > > > >> > > > >>> > storefileIndexSize=78, memstoreSize=796,
> > > > > >> > > > >>> > compactionQueueSize=0, flushQueueSize=0, usedHeap=1672,
> > > > > >> > > > >>> > maxHeap=4083, blockCacheSize=705907552,
> > > > > >> > > > >>> > blockCacheFree=150412064, blockCacheCount=10648,
> > > > > >> > > > >>> > blockCacheHitCount=79578618, blockCacheMissCount=3036335,
> > > > > >> > > > >>> > blockCacheEvictedCount=1401352, blockCacheHitRatio=96,
> > > > > >> > > > >>> > blockCacheHitCachingRatio=98
> > > > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
> > > > > >> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server
> > > > > >> > > > >>> > REPORT rejected; currently processing
> > > > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > > >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> > > > > >> > > > >>> > ---------------------------
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > Then I restart the RegionServer and everything is back to normal.
> > > > > >> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, I don't see
> > > > > >> > > > >>> > any abnormality in the same time window.
> > > > > >> > > > >>> > I think it was caused by the loss of the connection to Zookeeper.
> > > > > >> > > > >>> > Is it advisable to run Zookeeper on the same machines?
> > > > > >> > > > >>> > If the RegionServer loses its connection to Zookeeper, is there a
> > > > > >> > > > >>> > way (a configuration perhaps) for it to re-join the cluster
> > > > > >> > > > >>> > instead of just dying?
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > Any idea what is causing this? Or how to prevent it from happening?
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > Any help is appreciated.
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > Best Regards,
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > --
> > > > > >> > > > >>> >
> > > > > >> > > > >>> > *Leonardo Gamas*
> > > > > >> > > > >>> > Software Engineer
> > > > > >> > > > >>> > +557134943514
> > > > > >> > > > >>> > +557581347440
> > > > > >> > > > >>> > [email protected]
> > > > > >> > > > >>> > www.jusbrasil.com.br
> > > > > >> > > > >>>
> > > > > >> > > > >>
> > > > > >> > > > >>
> > > > > >> > > > >>
> > > > > >> > > > >> --
> > > > > >> > > > >>
> > > > > >> > > > >> *Leonardo Gamas*
> > > > > >> > > > >> Software Engineer/Chaos Monkey Engineer
> > > > > >> > > > >> T (71) 3494-3514
> > > > > >> > > > >> C (75) 8134-7440
> > > > > >> > > > >> [email protected]
> > > > > >> > > > >> www.jusbrasil.com.br
> > > > > >> > > > >>
> > > > > >> > > > >>
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > --
> > > > > >> > > > >
> > > > > >> > > > > *Leonardo Gamas*
> > > > > >> > > > > Software Engineer/Chaos Monkey Engineer
> > > > > >> > > > > T (71) 3494-3514
> > > > > >> > > > > C (75) 8134-7440
> > > > > >> > > > > [email protected]
> > > > > >> > > > > www.jusbrasil.com.br
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > --
> > > > > >> > >
> > > > > >> > > *Leonardo Gamas*
> > > > > >> > > Software Engineer
> > > > > >> > > +557134943514
> > > > > >> > > +557581347440
> > > > > >> > > [email protected]
> > > > > >> > > www.jusbrasil.com.br
> > > > > >> > >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > --
> > > > > >> >
> > > > > >> > *Leonardo Gamas*
> > > > > >> > Software Engineer
> > > > > >> > T +55 (71) 3494-3514
> > > > > >> > C +55 (75) 8134-7440
> > > > > >> > [email protected]
> > > > > >> > www.jusbrasil.com.br
> > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > *Leonardo Gamas*
> > > > > >
> > > > > > Software Engineer
> > > > > > T +55 (71) 3494-3514
> > > > > > C +55 (75) 8134-7440
> > > > > > [email protected]
> > > > > >
> > > > > > www.jusbrasil.com.br
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > *Leonardo Gamas*
> > > > > Software Engineer
> > > > > T +55 (71) 3494-3514
> > > > > C +55 (75) 8134-7440
> > > > > [email protected]
> > > > > www.jusbrasil.com.br
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > *Leonardo Gamas*
> > > Software Engineer
> > > T +55 (71) 3494-3514
> > > C +55 (75) 8134-7440
> > > [email protected]
> > > www.jusbrasil.com.br
> > >
> >
>
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer
> T +55 (71) 3494-3514
> C +55 (75) 8134-7440
> [email protected]
> www.jusbrasil.com.br
>
