Hi Guys,

I have tested the parameters provided by Sandy, and they solved the GC
problems with -XX:+UseParallelOldGC. Thanks for the help, Sandy.
I'm still experiencing some difficulties: the RegionServer continues to
shut down, but it now seems related to I/O. It starts timing out many
connections, new connections to/from the machine time out too, and finally
the RegionServer dies because of a YouAreDeadException. I will collect more
data, but I think it's an issue inherent to the Amazon/virtualized environment.
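To gather that data I will probably start with something like this (just a
sketch, assuming the standard sysstat tools are installed on the instances):

$ iostat -dxm 5   # per-device utilization, await and throughput every 5 seconds
$ vmstat 5        # the "wa" column shows CPU time stalled waiting on I/O

and correlate the numbers with the RegionServer log around the time of the
connection timeouts.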

Thanks for the great help provided so far.

2012/1/5 Leonardo Gamas <[email protected]>

> I don't think so. If Amazon had stopped the machine it would cause a pause of
> minutes, not seconds, and the DataNode, TaskTracker and Zookeeper continue to
> work normally.
> But it could be related to the shared nature of the Amazon environment; maybe
> a spike in I/O caused by another virtualized server on the same physical
> machine.
>
> But this is the instance type I'm using:
>
> *Extra Large Instance*
>
> 15 GB memory
> 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> 1,690 GB instance storage
> 64-bit platform
> I/O Performance: High
> API name: m1.xlarge
> I was not expecting to suffer from these problems, or at least not this much.
>
>
> 2012/1/5 Sandy Pratt <[email protected]>
>
>> You think it's an Amazon problem maybe?  Like they paused or migrated
>> your virtual machine, and it just happens to be during GC, leaving us to
>> think the GC ran long when it didn't?  I don't have a lot of experience
>> with Amazon so I don't know if that sort of thing is common.
>>
>> > -----Original Message-----
>> > From: Leonardo Gamas [mailto:[email protected]]
>> > Sent: Thursday, January 05, 2012 13:15
>> > To: [email protected]
>> > Subject: Re: RegionServer dying every two or three days
>> >
>> > I checked the CPU utilization graphs provided by Amazon (they're not very
>> > accurate, since the sample interval is about 5 minutes) and didn't see any
>> > abnormality. I will set up TSDB with Nagios to have a more reliable source
>> > of performance data.
>> >
>> > The machines don't have swap space; if I run:
>> >
>> > $ swapon -s
>> >
>> > to display the swap usage summary, it returns an empty list.
>> >
>> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my next tests.
>> >
>> > I haven't paid much attention to the value of the new size param.
>> >
>> > Thanks again for the help!!
>> >
>> > 2012/1/5 Sandy Pratt <[email protected]>
>> >
>> > > That size heap doesn't seem like it should cause a 36 second GC (a
>> > > minor GC even if I remember your logs correctly), so I tend to think
>> > > that other things are probably going on.
>> > >
>> > > This line here:
>> > >
>> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
>> > > 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01,
>> > > real=36.96 secs]
>> > >
>> > > is really mysterious to me.  It seems to indicate that the process was
>> > > blocked for almost 37 seconds during a minor collection.  Note the CPU
>> > > times are very low but the wall time is very high.  If it was actually
>> > > doing GC work, I'd expect to see user time higher than real time, as
>> > > it is in other parallel collections (see your log snippet).  Were you
>> > > really so CPU starved that it took 37 seconds to get in 50ms of work?
>> > > I can't make sense of that.  I'm trying to think of something that
>> > > would block you for that long while all your threads are stopped for
>> > > GC, other than being in swap, but I can't come up with anything.  You're
>> > > certain you're not in swap?
>> > >
>> > > Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts while
>> > > you troubleshoot?
>> > >
>> > > Why is your new size so small?  This generally means that relatively
>> > > more objects are being tenured than would be with a larger new size.
>> > > This could make collections of the old gen worse (GC time is said to
>> > > be proportional to the number of live objects in the generation, and
>> > > CMS does indeed cause STW pauses).  A typical new to tenured ratio
>> > > might be 1:3.  Were the new gen GCs taking too long?  This is probably
>> > > orthogonal to your immediate issue, though.
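>> > >
>> > > Just to illustrate the arithmetic (an example, not a recommendation for
>> > > your exact workload): with your 4 GB heap, a 1:3 new:tenured ratio would
>> > > put the new gen at roughly 1 GB, i.e. something like
>> > >
>> > > -XX:NewSize=1g -XX:MaxNewSize=1g
>> > >
>> > > instead of the current 128m.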
>> > >
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Leonardo Gamas [mailto:[email protected]]
>> > > Sent: Thursday, January 05, 2012 5:33 AM
>> > > To: [email protected]
>> > > Subject: Re: RegionServer dying every two or three days
>> > >
>> > >  St.Ack,
>> > >
>> > > I haven't made any attempt at GC tuning yet.
>> > > I will read the perf section as suggested.
>> > > I'm currently using Nagios + JMX to monitor the cluster, but it's used for
>> > > alerting only; the perfdata is not being stored, so it's kind of useless
>> > > right now. I was thinking of using TSDB to store it, any known case of
>> > > integration?
>> > > ---
>> > >
>> > > Sandy,
>> > >
>> > > Yes, my timeout is 30 seconds:
>> > >
>> > > <property>
>> > >   <name>zookeeper.session.timeout</name>
>> > >   <value>30000</value>
>> > > </property>
>> > >
>> > > For our application that's a tolerable time to wait if a
>> > > RegionServer goes offline.
>> > >
>> > > My heap is 4GB and my JVM params are:
>> > >
>> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
>> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
>> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
>> > >
>> > > I will try the -XX:+UseParallelOldGC param and post my feedback here.
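>> > > If I understand it right, -XX:+UseParallelOldGC belongs to the throughput
>> > > (parallel) collector and can't be combined with the CMS flags, so my plan
>> > > (just a guess at the flag set for now, not something I've tested) is
>> > > roughly:
>> > >
>> > > -Xmx4096m -server -XX:+UseParallelGC -XX:+UseParallelOldGC
>> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log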
>> > > ---
>> > >
>> > > Ramkrishna,
>> > >
>> > > Seems the GC is the root of all evil in this case.
>> > > ----
>> > >
>> > > Thank you all for the answers. I will try out the valuable advice
>> > > given here and post my results.
>> > >
>> > > Leo Gamas.
>> > >
>> > > 2012/1/5 Ramkrishna S Vasudevan <[email protected]>
>> > >
>> > > > Recently we faced a similar problem and it was due to GC config.
>> > > > Pls check your GC.
>> > > >
>> > > > Regards
>> > > > Ram
>> > > >
>> > > > -----Original Message-----
>> > > > From: [email protected] [mailto:[email protected]] On Behalf Of
>> > > > Stack
>> > > > Sent: Thursday, January 05, 2012 2:50 AM
>> > > > To: [email protected]
>> > > > Subject: Re: RegionServer dying every two or three days
>> > > >
>> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
>> > > > <[email protected]> wrote:
>> > > > > The third line took 36.96 seconds to execute; can this be causing
>> > > > > this problem?
>> > > > >
>> > > >
>> > > > Probably.  Have you made any attempt at GC tuning?
>> > > >
>> > > >
>> > > > > Reading the code a little, it seems that, even if it's disabled, if
>> > > > > all files are targeted in a compaction, it's considered a major
>> > > > > compaction. Is that right?
>> > > > >
>> > > >
>> > > > That is right.  They get 'upgraded' from minor to major.
>> > > >
>> > > > This should be fine though.  What you are avoiding by setting major
>> > > > compactions to 0 is all regions being major compacted on a period, a
>> > > > heavyweight effective rewrite of all your data (unless already major
>> > > > compacted).  It looks like you have this disabled, which is good until
>> > > > you've wrestled your cluster into submission.
>> > > >
>> > > >
>> > > > > The machines don't have swap, so the swappiness parameter doesn't
>> > > > > seem to apply here. Any other suggestion?
>> > > > >
>> > > >
>> > > > See the perf section of the hbase manual.  It has our current list.
>> > > >
>> > > > Are you monitoring your cluster w/ ganglia or tsdb?
>> > > >
>> > > >
>> > > > St.Ack
>> > > >
>> > > > > Thanks.
>> > > > >
>> > > > > 2012/1/4 Leonardo Gamas <[email protected]>
>> > > > >
>> > > > >> I will investigate this, thanks for the response.
>> > > > >>
>> > > > >>
>> > > > >> 2012/1/3 Sandy Pratt <[email protected]>
>> > > > >>
>> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
>> > > > >>> timed out, have not heard from server in 61103ms for sessionid
>> > > > >>> 0x23462a4cf93a8fc, closing socket connection and attempting
>> > > > >>> reconnect
>> > > > >>>
>> > > > >>> It looks like the process has been unresponsive for some time, so ZK has
>> > > > >>> terminated the session.  Did you experience a long GC pause right before
>> > > > >>> this?  If you don't have GC logging enabled for the RS, you can sometimes
>> > > > >>> tell by noticing a gap in the timestamps of the log statements leading up
>> > > > >>> to the crash.
>> > > > >>>
>> > > > >>> If it turns out to be GC, you might want to look at your kernel
>> > > > >>> swappiness setting (set it to 0) and your JVM params.
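>> > > > >>> (On most Linux distributions that would be something like, run as root
>> > > > >>> or with sudo:
>> > > > >>>
>> > > > >>> $ sysctl -w vm.swappiness=0                     # apply immediately
>> > > > >>> $ echo 'vm.swappiness = 0' >> /etc/sysctl.conf  # persist across reboots
>> > > > >>>
>> > > > >>> just a sketch, check your distro's conventions.)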
>> > > > >>>
>> > > > >>> Sandy
>> > > > >>>
>> > > > >>>
>> > > > >>> > -----Original Message-----
>> > > > >>> > From: Leonardo Gamas [mailto:[email protected]]
>> > > > >>> > Sent: Thursday, December 29, 2011 07:44
>> > > > >>> > To: [email protected]
>> > > > >>> > Subject: RegionServer dying every two or three days
>> > > > >>> >
>> > > > >>> > Hi,
>> > > > >>> >
>> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 Slaves),
>> > > > >>> > running on Amazon EC2. The master is a High-Memory Extra Large Instance
>> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper.
>> > > > >>> > The slaves are Extra Large Instances (m1.xlarge) running DataNode,
>> > > > >>> > TaskTracker, RegionServer and Zookeeper.
>> > > > >>> >
>> > > > >>> > From time to time, every two or three days, one of the RegionServer
>> > > > >>> > processes goes down, but the other processes (DataNode, TaskTracker,
>> > > > >>> > Zookeeper) continue normally.
>> > > > >>> >
>> > > > >>> > Reading the logs:
>> > > > >>> >
>> > > > >>> > The connection with Zookeeper timed out:
>> > > > >>> >
>> > > > >>> > ---------------------------
>> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have
>> > > > >>> > not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing
>> > > > >>> > socket connection and attempting reconnect
>> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have
>> > > > >>> > not heard from server in 61205ms for sessionid 0x346c561a55953e, closing
>> > > > >>> > socket connection and attempting reconnect
>> > > > >>> > ---------------------------
>> > > > >>> >
>> > > > >>> > And the Handlers start to fail:
>> > > > >>> >
>> > > > >>> > ---------------------------
>> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
>> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
>> > > > >>> > xx.xx.xx.xx:xxxx: output error
>> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
>> > > > >>> > caught: java.nio.channels.ClosedChannelException
>> > > > >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>> > > > >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>> > > > >>> >
>> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
>> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
>> > > > >>> > xx.xx.xx.xx:xxxx: output error
>> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
>> > > > >>> > caught: java.nio.channels.ClosedChannelException
>> > > > >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>> > > > >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
>> > > > >>> > ---------------------------
>> > > > >>> >
>> > > > >>> > And finally the server throws a YouAreDeadException :( :
>> > > > >>> >
>> > > > >>> > ---------------------------
>> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
>> > > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established
>> > > > >>> > to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
>> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
>> > > > >>> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket
>> > > > >>> > connection
>> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
>> > > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established
>> > > > >>> > to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
>> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
>> > > > >>> > ZooKeeper service, session 0x346c561a55953e has expired, closing socket
>> > > > >>> > connection
>> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server
>> > > > >>> > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
>> > > > >>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled
>> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>> > > > >>> > rejected; currently processing
>> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
>> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>> > > > >>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
>> > > > >>> > dead server
>> > > > >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> > > > >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>> > > > >>> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>> > > > >>> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>> > > > >>> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>> > > > >>> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
>> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
>> > > > >>> >         at java.lang.Thread.run(Thread.java:662)
>> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
>> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>> > > > >>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
>> > > > >>> > dead server
>> > > > >>> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
>> > > > >>> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
>> > > > >>> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
>> > > > >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>> > > > >>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> > > > >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
>> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>> > > > >>> >
>> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>> > > > >>> >         at $Proxy6.regionServerReport(Unknown Source)
>> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
>> > > > >>> >         ... 2 more
>> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
>> > > > >>> > requests=66, regions=206, stores=2078, storefiles=970,
>> > > > >>> > storefileIndexSize=78, memstoreSize=796,
>> > > > >>> > compactionQueueSize=0, flushQueueSize=0, usedHeap=1672,
>> > > > >>> > maxHeap=4083, blockCacheSize=705907552,
>> > > > >>> > blockCacheFree=150412064, blockCacheCount=10648,
>> > > > >>> > blockCacheHitCount=79578618, blockCacheMissCount=3036335,
>> > > > >>> > blockCacheEvictedCount=1401352, blockCacheHitRatio=96,
>> > > > >>> > blockCacheHitCachingRatio=98
>> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
>> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>> > > > >>> > rejected; currently processing
>> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
>> > > > >>> > ---------------------------
>> > > > >>> >
>> > > > >>> > Then I restart the RegionServer and everything is back to normal.
>> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, I don't
>> > > > >>> > see any abnormality in the same time window.
>> > > > >>> > I think it was caused by the loss of connection to Zookeeper. Is it
>> > > > >>> > advisable to run Zookeeper on the same machines?
>> > > > >>> > If the RegionServer loses its connection to Zookeeper, is there a way
>> > > > >>> > (a configuration perhaps) to re-join the cluster, and not just die?
>> > > > >>> >
>> > > > >>> > Any idea what is causing this? Or how to prevent it from happening?
>> > > > >>> >
>> > > > >>> > Any help is appreciated.
>> > > > >>> >
>> > > > >>> > Best Regards,
>> > > > >>> >
>> > > > >>> > --
>> > > > >>> >
>> > > > >>> > *Leonardo Gamas*
>> > > > >>> > Software Engineer
>> > > > >>> > +557134943514
>> > > > >>> > +557581347440
>> > > > >>> > [email protected]
>> > > > >>> > www.jusbrasil.com.br
>> > > > >>>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> --
>> > > > >>
>> > > > >> *Leonardo Gamas*
>> > > > >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
>> > > > >> 8134-7440 [email protected] www.jusbrasil.com.br
>> > > > >>
>> > > > >>
>> > > > >
>> > > > >
>> > > > > --
>> > > > >
>> > > > > *Leonardo Gamas*
>> > > > > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
>> > > > > 8134-7440 [email protected] www.jusbrasil.com.br
>> > > >
>> > > >
>> > >
>> > >
>> > > --
>> > >
>> > > *Leonardo Gamas*
>> > > Software Engineer
>> > > +557134943514
>> > > +557581347440
>> > > [email protected]
>> > > www.jusbrasil.com.br
>> > >
>> >
>> >
>> >
>> > --
>> >
>> > *Leonardo Gamas*
>> > Software Engineer
>> > T +55 (71) 3494-3514
>> > C +55 (75) 8134-7440
>> > [email protected]
>> > www.jusbrasil.com.br
>>
>
>
>
> --
>
> *Leonardo Gamas*
>
> Software Engineer
> T +55 (71) 3494-3514
> C +55 (75) 8134-7440
> [email protected]
>
> www.jusbrasil.com.br
>
>


-- 

*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514
C +55 (75) 8134-7440
[email protected]
www.jusbrasil.com.br
