bq. there is a bulkPut available in Spark-HBase module

bulkPut is an abstraction over the HBaseContext.foreachPartition method.

Please take a look at hbaseBulkLoad() in HBaseRDDFunctions.

Note: if you use Spark 2.0, you would need the latest patch on HBASE-16179.
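
If it helps, a minimal bulkPut sketch against the hbase-spark module would look
roughly like the following. Treat it as a sketch rather than a drop-in: method
signatures vary a bit between hbase-spark releases, and the table name, column
family and qualifier here are made up.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only -- "my_table", family "d" and qualifier "q" are placeholders.
val sc = new SparkContext(new SparkConf().setAppName("bulkPutSketch"))
val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

val rdd = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))

// bulkPut is built on HBaseContext.foreachPartition, so the HBase connection is
// set up once per partition rather than once per Put.
hbaseContext.bulkPut[(String, String)](
  rdd,
  TableName.valueOf("my_table"),
  { case (rowKey, value) =>
    new Put(Bytes.toBytes(rowKey))
      .addColumn(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes(value))
  })

hbaseBulkLoad() goes one step further: it writes HFiles which are then handed to
the bulk load tooling, so the writes bypass the memstore / WAL path entirely.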

Cheers

On Thu, Sep 22, 2016 at 9:31 PM, Proust(Proust/Feng Guizhou) [Saipan] <
pf...@coupang.com> wrote:

> Thanks Ted and Vlad
>
> I uploaded the screenshots to Flickr; they seem clearer there:
> 1. MutateCount(Put Count) at table level(null as gap)
> https://flic.kr/p/LxLa6d
> <http://img0.ph.126.net/coIWlwP2w44jnl1kU5CbcA==/6631704382933867248.png>
> 2. MutateCount(Put Count) at table level(null as line connected)
> https://flic.kr/p/LxREvK
> 3. MutateCount for the table with Max MemStoreSize
> https://flic.kr/p/MnUkCt
> 4. Max Region MemStoreSize at table level
> https://flic.kr/p/LxR4TP
> <http://img0.ph.126.net/rEQhCk-VWx3HGbcZCM51iA==/6631798940933860065.png>
> 5. RegionCount for each RegionServer
> https://flic.kr/p/LxL9Zw
> <http://img0.ph.126.net/JKPA4-7NJEkFc3_TxOv63A==/6631679094166431135.png>
>
> > In the graph, there was a purple line with spikes - I assume this was not
> > for the table with max MemStoreSize.
> > The legend in the graph is hard to read.
> > If the sizes of the Puts vary, that may explain what you observed w.r.t.
> > the two graphs.
> >
> The purple line with spikes is a table fed by a scheduled Spark batch job; it
> has a low MemStoreSize and its activity does not coincide with the problem
> time, so it is clearly not the cause.
> From https://flic.kr/p/MnUkCt (item 3 above), the metrics do not show a
> significant MutateCount change for the table that has the max MemStoreSize,
> which is why I'm not confident in calling that table the root cause, even
> though it does have high IO and is the largest table.
>
> > 1. Tweak CMS config: -XX:CMSInitiatingOccupancyFraction=50
> > -XX:+UseCMSInitiatingOccupancyOnly.
> > Plus increase heap size accordingly to accommodate decreasing of a working
> > set size (now CMS starts when 50% of heap is occupied).  If you see a lot
> > of "promotion failure" messages in GC log - this may help you.
> >
> There are some "promotion failed" entries, but not that many (see the GC log
> below; an adjusted set of CMS options is sketched right after it):
>
> 529874.782: [GC[YG occupancy: 725422 K (1380160 K)]529874.782: [Rescan
> (parallel) , 0.0105120 secs]529874.793: [weak refs processing,
> 0.0031030 secs]529874.796: [scrub string table, 0.0008700 secs] [1
> CMS-remark: 13341310K(15243712K)] 14066733K(16623872K), 0.0149260
> secs] [Times: user=0.18 sys=0.00, real=0.01 secs]
> 529874.797: [CMS-concurrent-sweep-start]
> 529874.858: [GC529874.858: [ParNew: 1380160K->153344K(1380160K),
> 0.0406620 secs] 14721537K->13708366K(16623872K), 0.0408930 secs]
> [Times: user=0.57 sys=0.01, real=0.04 secs]
> 529875.005: [GC529875.005: [ParNew: 1380160K->153344K(1380160K),
> 0.0388060 secs] 14922932K->13919282K(16623872K), 0.0390210 secs]
> [Times: user=0.57 sys=0.02, real=0.04 secs]
> 529875.146: [GC529875.146: [ParNew: 1380160K->153344K(1380160K),
> 0.0388120 secs] 15099772K->14090963K(16623872K), 0.0390400 secs]
> [Times: user=0.56 sys=0.01, real=0.04 secs]
> 529875.285: [GC529875.285: [ParNew: 1380160K->153344K(1380160K),
> 0.0368240 secs] 15211572K->14209325K(16623872K), 0.0370880 secs]
> [Times: user=0.55 sys=0.01, real=0.04 secs]
> 529875.419: [GC529875.419: [ParNew: 1380160K->153344K(1380160K),
> 0.0392130 secs] 15403252K->14384973K(16623872K), 0.0395230 secs]
> [Times: user=0.56 sys=0.00, real=0.04 secs]
> 529875.557: [GC529875.557: [ParNew: 1380160K->153344K(1380160K),
> 0.0373710 secs] 15554872K->14541229K(16623872K), 0.0375980 secs]
> [Times: user=0.56 sys=0.01, real=0.04 secs]
> 529875.699: [GC529875.699: [ParNew (*promotion failed*):
> 1380160K->1380160K(1380160K), 0.0922840 secs]
> 15726469K->15831128K(16623872K), 0.0924960 secs] [Times: user=0.53
> sys=0.00, real=0.10 secs]
> GC locker: Trying a full collection because scavenge failed
> 529875.791: [Full GC529875.791: [CMS529876.249: [CMS-concurrent-sweep:
> 1.080/1.452 secs] [Times: user=15.83 sys=1.81, real=1.45 secs]
>  (concurrent mode failure): 14450968K->8061155K(15243712K), 7.4250640
> secs] 15831128K->8061155K(16623872K), [CMS Perm :
> 55346K->55346K(83968K)], 7.4252340 secs] [Times: user=7.43 sys=0.00,
> real=7.42 secs]
> 529883.326: [GC529883.326: [ParNew: 1226816K->153344K(1380160K),
> 0.0339490 secs] 9287971K->8288344K(16623872K), 0.0341880 secs] [Times:
> user=0.52 sys=0.01, real=0.03 secs]
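>
> For reference, my reading of suggestion 1 is that the relevant part of our
> region server options (currently -Xmx34359738368 with
> -XX:CMSInitiatingOccupancyFraction=70, see the process info further below)
> would become roughly:
>
>     -Xms34359738368 -Xmx34359738368    (or a larger heap if needed)
>     -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>     -XX:CMSInitiatingOccupancyFraction=50
>     -XX:+UseCMSInitiatingOccupancyOnly
>     -XX:+CMSParallelRemarkEnabled
>
> Please correct me if that is not what was meant.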
>
>
>
>
> > 2. Migrate to 1.8 Java and switch to G1GC. Search "tuning G1GC for HBase"
> > on the Internet
> >
> Thanks for this suggestion; I believe this is the direction to move in. For
> now, I need to identify the problem and get a deeper understanding of HBase,
> so that I can address it with confidence.
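>
> From searching around, the commonly cited starting point for G1 on a region
> server seems to be something like the following (illustrative only; I
> understand it has to be tuned against GC logs rather than copied blindly):
>
>     -XX:+UseG1GC
>     -XX:MaxGCPauseMillis=100
>     -XX:+ParallelRefProcEnabled
>     -XX:InitiatingHeapOccupancyPercent=65
>     -XX:+PrintGCDetails -XX:+PrintGCDateStamps    (keep GC logging on)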
>
>
> > 3. Analyze what has triggered long GC in RS. Usually it is some long
> > running M/R jobs or data import. Consider other approaches which are not so
> > intrusive (ex. bulk load of data)
> >
> According to
> http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/,
> bulk load is a good fit for ETL. The biggest table (the one with the max
> MemStoreSize) gets data from RabbitMQ and then Puts it to HBase one row at a
> time; I'm not clear yet how to change that to bulk load. Any suggestions, or
> am I misunderstanding bulk load? I can see there is a bulkPut available in the
> Spark-HBase module (only available at 2.0 :(); is that the same thing as bulk
> load? It seems I need to learn more.
>
> > For question #2, were there region splits corresponding to the increase?
> > Please check master log. You can pastebin master log if needed.
> >
> Below is the master log at the problem time; it says nothing except that the
> RegionServer is dead 😂
>
> 2016-09-21 11:01:56,746 WARN
> org.apache.hadoop.hbase.master.cleaner.TimeToLiveHFileCleaner: Found a
> hfile
> (hdfs://nameservice1/hbase/archive/data/prod/variable/
> 650cefeb142f5a6558adb82f89151d8d/d/bd487f2dd9424fda98c35e8230982f9c)
> newer than current time (1474423316746 < 1474423316747), probably a clock
> skew
> 2016-09-21 11:46:19,786 ERROR
> org.apache.hadoop.hbase.master.MasterRpcServices: Region server
> fds-hadoop-prod06-mp,60020,1473551672107 reported a fatal error:
> ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead
> server
>     at
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.
> java:382)
>     at
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(
> ServerManager.java:287)
>     at
> org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(
> MasterRpcServices.java:287)
>     at
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$
> RegionServerStatusService$2.callBlockingMethod(
> RegionServerStatusProtos.java:7912)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>     at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>     at java.lang.Thread.run(Thread.java:745)
>
> Cause:
> org.apache.hadoop.hbase.YouAreDeadException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead
> server
>     at
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.
> java:382)
>     at
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(
> ServerManager.java:287)
>     at
> org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(
> MasterRpcServices.java:287)
>     at
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$
> RegionServerStatusService$2.callBlockingMethod(
> RegionServerStatusProtos.java:7912)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>     at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>     at java.lang.Thread.run(Thread.java:745)
>
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>     at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:57)
>     at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at
> org.apache.hadoop.ipc.RemoteException.instantiateException(
> RemoteException.java:106)
>     at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(
> RemoteException.java:95)
>     at
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(
> ProtobufUtil.java:328)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(
> HRegionServer.java:1092)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.
> run(HRegionServer.java:900)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.
> apache.hadoop.hbase.YouAreDeadException):
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead
> server
>     at
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.
> java:382)
>     at
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(
> ServerManager.java:287)
>     at
> org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(
> MasterRpcServices.java:287)
>     at
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$
> RegionServerStatusService$2.callBlockingMethod(
> RegionServerStatusProtos.java:7912)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>     at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>     at java.lang.Thread.run(Thread.java:745)
>
>     at
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1219)
>     at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(
> AbstractRpcClient.java:216)
>     at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$
> BlockingRpcChannelImplementation.callBlockingMethod(
> AbstractRpcClient.java:300)
>     at
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$
> RegionServerStatusService$BlockingStub.regionServerReport(
> RegionServerStatusProtos.java:8289)
>     at
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(
> HRegionServer.java:1090)
>     ... 2 more
>
> 2016-09-21 11:46:19,801 ERROR
> org.apache.hadoop.hbase.master.MasterRpcServices: Region server
> fds-hadoop-prod06-mp,60020,1473551672107 reported a fatal error:
> ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107:
> regionserver:60020-0x255c4a5f76332f6,
> quorum=fds-hadoop-prod05-mp:2181,fds-hadoop-prod07-mp:
> 2181,fds-hadoop-prod06-mp:2181,
> baseZNode=/hbase regionserver:60020-0x255c4a5f76332f6 received expired
> from
> ZooKeeper, aborting
> Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>     at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(
> ZooKeeperWatcher.java:525)
>     at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.
> process(ZooKeeperWatcher.java:436)
>     at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:
> 522)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(
> ClientCnxn.java:498)
>
> 2016-09-21 11:46:19,801 ERROR
> org.apache.hadoop.hbase.master.MasterRpcServices: Region server
> fds-hadoop-prod06-mp,60020,1473551672107 reported a fatal error:
> ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107: IOE in
> log
> roller
> Cause:
> java.io.IOException: cannot get log writer
>     at
> org.apache.hadoop.hbase.wal.DefaultWALProvider.createWriter(
> DefaultWALProvider.java:365)
>     at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.
> createWriterInstance(FSHLog.java:757)
>     at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.
> rollWriter(FSHLog.java:722)
>     at
> org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:148)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: Parent directory doesn't exist:
> /hbase/WALs/fds-hadoop-prod06-mp,60020,1473551672107
>     at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
> verifyParentDir(FSNamesystem.java:2466)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(
> FSNamesystem.java:2760)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
> startFileInt(FSNamesystem.java:2674)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
> startFile(FSNamesystem.java:2559)
>     at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.
> create(NameNodeRpcServer.java:592)
>     at
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClie
> ntProtocol.create(AuthorizationProviderProxyClientProtocol.java:110)
>     at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSi
> deTranslatorPB.create(ClientNamenodeProtocolServerSi
> deTranslatorPB.java:395)
>     at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$
> ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.
> java)
>     at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(
> ProtobufRpcEngine.java:617)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at
> org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1671)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>     at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:57)
>     at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at
> org.apache.hadoop.ipc.RemoteException.instantiateException(
> RemoteException.java:106)
>     at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(
> RemoteException.java:73)
>     at
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(
> DFSOutputStream.java:1872)
>     at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1737)
>     at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1697)
>     at
> org.apache.hadoop.hdfs.DistributedFileSystem$7.
> doCall(DistributedFileSystem.java:449)
>     at
> org.apache.hadoop.hdfs.DistributedFileSystem$7.
> doCall(DistributedFileSystem.java:445)
>     at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(
> FileSystemLinkResolver.java:81)
>     at
> org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(
> DistributedFileSystem.java:445)
>     at
> org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:1121)
>     at
> org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:1097)
>     at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.init(
> ProtobufLogWriter.java:90)
>     at
> org.apache.hadoop.hbase.wal.DefaultWALProvider.createWriter(
> DefaultWALProvider.java:361)
>     ... 4 more
> Caused by:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException):
> Parent directory doesn't exist:
> /hbase/WALs/fds-hadoop-prod06-mp,60020,1473551672107
>     at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
> verifyParentDir(FSNamesystem.java:2466)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(
> FSNamesystem.java:2760)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
> startFileInt(FSNamesystem.java:2674)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
> startFile(FSNamesystem.java:2559)
>     at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.
> create(NameNodeRpcServer.java:592)
>     at
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClie
> ntProtocol.create(AuthorizationProviderProxyClientProtocol.java:110)
>     at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSi
> deTranslatorPB.create(ClientNamenodeProtocolServerSi
> deTranslatorPB.java:395)
>     at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$
> ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.
> java)
>     at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(
> ProtobufRpcEngine.java:617)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at
> org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1671)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>
>     at org.apache.hadoop.ipc.Client.call(Client.java:1466)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>     at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.
> invoke(ProtobufRpcEngine.java:230)
>     at com.sun.proxy.$Proxy23.create(Unknown Source)
>     at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslat
> orPB.create(ClientNamenodeProtocolTranslatorPB.java:295)
>     at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>     at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(
> RetryInvocationHandler.java:256)
>     at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(
> RetryInvocationHandler.java:104)
>     at com.sun.proxy.$Proxy24.create(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>     at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(
> HFileSystem.java:279)
>     at com.sun.proxy.$Proxy25.create(Unknown Source)
>     at
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(
> DFSOutputStream.java:1867)
>     ... 14 more
>
> Appreciate all your help
> Proust
>
>
>
> On 23 September 2016 at 01:39, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > For tuning G1GC, see the head of https://blogs.apache.org/hbase/
> >
> > FYI
> >
> > On Thu, Sep 22, 2016 at 10:30 AM, Vladimir Rodionov <
> > vladrodio...@gmail.com>
> > wrote:
> >
> > > Your RS was declared dead because of a long GC.
> > >
> > > What you can do:
> > >
> > > 1. Tweak CMS config: -XX:CMSInitiatingOccupancyFraction=50
> > > -XX:+UseCMSInitiatingOccupancyOnly.
> > > Plus increase heap size accordingly to accommodate decreasing of a
> > working
> > > set size (now CMS starts when 50% of heap is occupied).  If you see a
> lot
> > > of "promotion failure" messages in GC log - this may help you.
> > >
> > > 2. Migrate to 1.8 Java and switch to G1GC. Search "tuning G1GC for HBase"
> > > on the Internet
> > >
> > > 3. Analyze what has triggered long GC in RS. Usually it is some long
> > > running M/R jobs or data import. Consider other approaches which are
> not
> > so
> > > intrusive (ex. bulk load of data)
> > >
> > >
> > > -Vlad
> > >
> > > On Thu, Sep 22, 2016 at 10:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >
> > > > bq. the MutateCount metrics didn't show there is a peak time before
> and
> > > > after the problem on that table
> > > >
> > > > In the graph, there was a purple line with spikes - I assume this was
> > not
> > > > for the table with max MemStoreSize.
> > > > The legend in the graph is hard to read.
> > > > If the sizes of the Puts vary, that may explain what you observed
> > w.r.t.
> > > > the two graphs.
> > > >
> > > > For question #2, were there region splits corresponding to the
> > increase ?
> > > > Please check master log. You can pastebin master log if needed.
> > > >
> > > > For #4, if JMX doesn't respond, one option is to check the region server
> > > > log - this is a workaround.
> > > >
> > > > On Thu, Sep 22, 2016 at 9:39 AM, Proust(Proust/Feng Guizhou) [Saipan] <
> > > > pf...@coupang.com> wrote:
> > > >
> > > > > Hi, HBase Users
> > > > >
> > > > > I encountered a RegionServer dead case, and I want to identify which
> > > > > HBase table actually caused the problem.
> > > > >
> > > > > Based on HBase JMX Restful Service, what I have on hand is:
> > > > > 1. Table level Grafana monitoring for Put, Get, Scan, MemStoreSize
> > > > > 2. RegionServer level Grafana monitoring for GC and RegionCount
> > > > > 3. RegionServer logs
> > > > >
> > > > > *HRegionServer java process information*
> > > > > /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver
> > > > > -XX:OnOutOfMemoryError=kill -9 %p -Djava.net.preferIPv4Stack=true
> > > > > -Xms34359738368 -Xmx34359738368 -XX:+UseParNewGC
> > > -XX:+UseConcMarkSweepGC
> > > > > -XX:CMSInitiatingOccupancyFraction=70
> -XX:+CMSParallelRemarkEnabled
> > > > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> > > > > -Xloggc:/pang/log/hbase/gc-hbase.log -XX:+UseGCLogFileRotation
> > > > > -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M
> > > > > -XX:OnOutOfMemoryError=/usr/
> > > > > lib64/cmf/service/common/killparent.sh -Dhbase.log.dir=/pang/log/
> > hbase
> > > > > -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-fds-hadoop-
> > > > prod06-mp.log.out
> > > > > -Dhbase.home.dir=/pang/cloudera/parcels/CDH-5.5.2-1.
> > > > > cdh5.5.2.p0.4/lib/hbase
> > > > > -Dhbase.id.str= -Dhbase.root.logger=WARN,RFA
> > -Djava.library.path=/pang/
> > > > > cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hadoop/lib/
> > > > > native:/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/
> > > > > hbase/lib/native/Linux-amd64-64 -Dhbase.security.logger=INFO,RFAS
> > > > > org.apache.hadoop.hbase.regionserver.HRegionServer start
> > > > >
> > > > > *RegionServer logs show a GC pause*
> > > > > 2016-09-21 11:46:19,263 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > > DFSOutputStream ResponseProcessor exception  for block
> > > > > BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582
> > > > > java.io.EOFException: Premature EOF: no length prefix available
> > > > >     at org.apache.hadoop.hdfs.protocolPB.PBHelper.
> > > > > vintPrefixed(PBHelper.java:2241)
> > > > >     at org.apache.hadoop.hdfs.protocol.datatransfer.
> > > > > PipelineAck.readFields(
> > > > > PipelineAck.java:235)
> > > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$
> > > > > ResponseProcessor.run(DFSOutputStream.java:971)
> > > > > 2016-09-21 11:46:19,274 WARN org.apache.hadoop.hdfs.DFSClient:
> Error
> > > > > Recovery for block
> > > > > BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582
> > > > > in pipeline DatanodeInfoWithStorage[10.10.9.15:50010
> > ,DS-4209cc16-e676-
> > > > > 4226-be0b-1328075bbdd7,DISK],
> > > > > DatanodeInfoWithStorage[10.10.9.19:50010,DS-c06678d6-3f0d-
> > > > > 40a2-82e2-62387e26af7b,DISK],
> > > > > DatanodeInfoWithStorage[10.10.9.21:50010,DS-7bf90b1e-7862-
> > > > > 4cc1-a121-31768021a3aa,DISK]:
> > > > > bad datanode DatanodeInfoWithStorage[10.10.9.15:50010
> > > ,DS-4209cc16-e676-
> > > > > 4226-be0b-1328075bbdd7,DISK]
> > > > > 2016-09-21 11:46:19,266 WARN org.apache.hadoop.hbase.util.
> > > > JvmPauseMonitor:
> > > > >
> > > > > *Detected pause in JVM or host machine (eg GC): pause of approximately
> > > > > 152322ms. GC pool 'ParNew' had collection(s): count=1 time=465ms. GC pool
> > > > > 'ConcurrentMarkSweep' had collection(s): count=2 time=152238ms*
> > > > > 2016-09-21 11:46:19,263 WARN org.apache.hadoop.hbase.util.Sleeper:
> > We
> > > > > slept
> > > > > 153423ms instead of 3000ms, this is likely due to a long garbage
> > > > collecting
> > > > > pause and it's usually bad, see http://hbase.apache.org/book.
> > > > > html#trouble.rs.runtime.zkexpired
> > > > > 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn:
> Client
> > > > > session timed out, have not heard from server in 156121ms for
> > sessionid
> > > > > 0x255c4a5f76332f6, closing socket connection and attempting
> reconnect
> > > > > 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn:
> Client
> > > > > session timed out, have not heard from server in 153414ms for
> > sessionid
> > > > > 0x355c4a5f3c2469f, closing socket connection and attempting
> reconnect
> > > > > 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn:
> Client
> > > > > session timed out, have not heard from server in 156121ms for
> > sessionid
> > > > > 0x255c4a5f763332c, closing socket connection and attempting
> reconnect
> > > > >
> > > > > *RegionServer logs show server dead:*
> > > > > 2016-09-21 11:46:19,485 FATAL
> > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > > > ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107:
> > > > > *org.apache.hadoop.hbase.YouAreDeadException*: Server REPORT
> > rejected;
> > > > > currently processing fds-hadoop-prod06-mp,60020,1473551672107 as
> > *dead
> > > > > server*
> > > > >     at org.apache.hadoop.hbase.master.ServerManager.
> > > > > checkIsDead(ServerManager.java:382)
> > > > >     at org.apache.hadoop.hbase.master.ServerManager.
> > > regionServerReport(
> > > > > ServerManager.java:287)
> > > > >     at org.apache.hadoop.hbase.master.MasterRpcServices.
> > > > > regionServerReport(
> > > > > MasterRpcServices.java:287)
> > > > >     at org.apache.hadoop.hbase.protobuf.generated.
> > > > > RegionServerStatusProtos$
> > > > > RegionServerStatusService$2.callBlockingMethod(
> > > > > RegionServerStatusProtos.java:7912)
> > > > >     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:
> > 2034)
> > > > >     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.
> > java:107)
> > > > >     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(
> > > > > RpcExecutor.java:130)
> > > > >     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.
> > > > java:107)
> > > > >     at java.lang.Thread.run(Thread.java:745)
> > > > >
> > > > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > Server REPORT rejected; currently processing
> > > > > fds-hadoop-prod06-mp,60020,1473551672107
> > > > > as dead server
> > > > >     at org.apache.hadoop.hbase.master.ServerManager.
> > > > > checkIsDead(ServerManager.java:382)
> > > > >     at org.apache.hadoop.hbase.master.ServerManager.
> > > regionServerReport(
> > > > > ServerManager.java:287)
> > > > >     at org.apache.hadoop.hbase.master.MasterRpcServices.
> > > > > regionServerReport(
> > > > > MasterRpcServices.java:287)
> > > > >     at org.apache.hadoop.hbase.protobuf.generated.
> > > > > RegionServerStatusProtos$
> > > > > RegionServerStatusService$2.callBlockingMethod(
> > > > > RegionServerStatusProtos.java:7912)
> > > > >     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:
> > 2034)
> > > > >     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.
> > java:107)
> > > > >     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(
> > > > > RpcExecutor.java:130)
> > > > >     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.
> > > > java:107)
> > > > >     at java.lang.Thread.run(Thread.java:745)
> > > > >
> > > > >     at sun.reflect.NativeConstructorAccessorImpl.
> newInstance0(Native
> > > > > Method)
> > > > >     at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> > > > > NativeConstructorAccessorImpl.java:57)
> > > > >     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> > > > > DelegatingConstructorAccessorImpl.java:45)
> > > > >     at java.lang.reflect.Constructor.newInstance(Constructor.java:
> > 526)
> > > > >     at org.apache.hadoop.ipc.RemoteException.instantiateException(
> > > > > RemoteException.java:106)
> > > > >     at org.apache.hadoop.ipc.RemoteException.
> unwrapRemoteException(
> > > > > RemoteException.java:95)
> > > > >     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.
> > > getRemoteException(
> > > > > ProtobufUtil.java:328)
> > > > >     at org.apache.hadoop.hbase.regionserver.HRegionServer.
> > > > > tryRegionServerReport(HRegionServer.java:1092)
> > > > >     at org.apache.hadoop.hbase.regionserver.HRegionServer.
> > > > > run(HRegionServer.java:900)
> > > > >     at java.lang.Thread.run(Thread.java:745)
> > > > > Caused by: org.apache.hadoop.hbase.ipc.
> > RemoteWithExtrasException(org.
> > > > > apache.hadoop.hbase.YouAreDeadException):
> > > > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > Server REPORT rejected; currently processing
> > > > > fds-hadoop-prod06-mp,60020,1473551672107
> > > > > as dead server
> > > > >     at org.apache.hadoop.hbase.master.ServerManager.
> > > > > checkIsDead(ServerManager.java:382)
> > > > >     at org.apache.hadoop.hbase.master.ServerManager.
> > > regionServerReport(
> > > > > ServerManager.java:287)
> > > > >     at org.apache.hadoop.hbase.master.MasterRpcServices.
> > > > > regionServerReport(
> > > > > MasterRpcServices.java:287)
> > > > >     at org.apache.hadoop.hbase.protobuf.generated.
> > > > > RegionServerStatusProtos$
> > > > > RegionServerStatusService$2.callBlockingMethod(
> > > > > RegionServerStatusProtos.java:7912)
> > > > >     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:
> > 2034)
> > > > >     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.
> > java:107)
> > > > >     at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(
> > > > > RpcExecutor.java:130)
> > > > >     at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.
> > > > java:107)
> > > > >     at java.lang.Thread.run(Thread.java:745)
> > > > >
> > > > >     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(
> > > > > RpcClientImpl.java:1219)
> > > > >     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.
> > > callBlockingMethod(
> > > > > AbstractRpcClient.java:216)
> > > > >     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$
> > > > > BlockingRpcChannelImplementation.callBlockingMethod(
> > > > > AbstractRpcClient.java:300)
> > > > >     at org.apache.hadoop.hbase.protobuf.generated.
> > > > > RegionServerStatusProtos$
> > > > > RegionServerStatusService$BlockingStub.regionServerReport(
> > > > > RegionServerStatusProtos.java:8289)
> > > > >     at org.apache.hadoop.hbase.regionserver.HRegionServer.
> > > > > tryRegionServerReport(HRegionServer.java:1090)
> > > > >     ... 2 more
> > > > > 2016-09-21 11:46:19,505 FATAL
> > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > > > RegionServer abort: loaded coprocessors are: [org.apache.phoenix.
> > > > > coprocessor.ServerCachingEndpointImpl,
> > > > > org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> > > > > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> > > > > org.apache.phoenix.
> > > > > coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.
> > > > > coprocessor.GroupedAggregateRegionObserver,
> > > > > org.apache.phoenix.coprocessor.ScanRegionObserver,
> > > > > org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.
> > > coprocessor.
> > > > > SequenceRegionObserver]
> > > > > 2016-09-21 11:46:19,498 FATAL org.apache.hadoop.hbase.
> > > > > regionserver.LogRoller:
> > > > > Aborting
> > > > > java.io.IOException: cannot get log writer
> > > > >
> > > > >
> > > > > *Error level RegionServer logs when the problem happened*
> > > > > 2016-09-21 11:46:19,347 ERROR org.apache.hadoop.hbase.
> regionserver.wal.FSHLog:
> > > > Error
> > > > > syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,348 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.wal.FSHLog:
> > > > > Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,350 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.wal.FSHLog:
> > > > > Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,349 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.wal.FSHLog:
> > > > > Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,385 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.wal.FSHLog:
> > > > > Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.wal.FSHLog:
> > > > > Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.wal.FSHLog:
> > > > > Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.wal.FSHLog:
> > > > > Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.wal.FSHLog:
> > > > > Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,384 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.wal.FSHLog:
> > > > > Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:20,048 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 692752
> > > > > 2016-09-21 11:46:20,103 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 1831544
> > > > > 2016-09-21 11:46:20,111 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 44967984
> > > > > 2016-09-21 11:46:20,131 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 2334208
> > > > > 2016-09-21 11:46:20,308 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 4393560
> > > > > 2016-09-21 11:46:20,334 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 2905432
> > > > > 2016-09-21 11:46:20,373 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 2310808
> > > > > 2016-09-21 11:46:20,402 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 1112648
> > > > > 2016-09-21 11:46:20,422 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 6578048
> > > > > 2016-09-21 11:46:20,437 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 59242936
> > > > > 2016-09-21 11:46:20,608 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 4482224
> > > > > 2016-09-21 11:46:20,624 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 20144
> > > > > 2016-09-21 11:46:20,652 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 23568
> > > > > 2016-09-21 11:46:20,687 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 14395336
> > > > > 2016-09-21 11:46:20,706 ERROR org.apache.hadoop.hbase.
> > > > > regionserver.HRegion:
> > > > > Memstore size is 14025824
> > > > >
> > > > > Attached are some metrics screenshots captured when the problem happened.
> > > > > All the metrics have gaps because the RegionServer was not able to handle
> > > > > /jmx requests at all.
> > > > >
> > > > > 1. One screenshot for MutateCount(Put Count) at table level
> > > > > http://img0.ph.126.net/coIWlwP2w44jnl1kU5CbcA==/
> > > 6631704382933867248.png
> > > > > 2. One screenshot for RegionServer level GC Time
> > > > > http://img0.ph.126.net/ouciVyCvpqEkMPV3phnx5A==/
> > > 6631661501980397103.png
> > > > > 3. One screenshot for Max Region MemStoreSize at table level
> > > > > http://img0.ph.126.net/rEQhCk-VWx3HGbcZCM51iA==/
> > > 6631798940933860065.png
> > > > > 4. One screenshot for RegionCount for each RegionServer
> > > > > http://img0.ph.126.net/JKPA4-7NJEkFc3_TxOv63A==/
> > > 6631679094166431135.png
> > > > >
> > > > > By the way, all the legends in the screenshots are ordered by max value,
> > > > > so the metric series with the max value is shown last.
> > > > >
> > > > > With all this information, what I have so far is:
> > > > > 1. A lot of tables show a sudden Put increase after the RegionServer came
> > > > > back into service; my understanding is that the streaming data source could
> > > > > not process all the data while the problem was happening, so roughly twice
> > > > > the usual data accumulated and was written a bit later. Those tables seem
> > > > > not to be the root cause.
> > > > > 2. From the GC metrics, it is pretty easy to identify which region server
> > > > > had the problem, and all the logs above were collected from that server.
> > > > > 3. All the region servers had a sudden region count increase and decrease
> > > > > except the dead one.
> > > > > 4. From the max MemStoreSize metric screenshot, I am inclined to think the
> > > > > table with the max MemStoreSize is the root cause, but the MutateCount
> > > > > metrics didn't show a peak before or after the problem on that table, so
> > > > > I am not confident about this judgement.
> > > > >
> > > > > I need help on:
> > > > > 1. Identifying which table caused the RegionServer death
> > > > > 2. Why RegionCount had a sudden increase and decrease on all the servers
> > > > > that did not die
> > > > > 3. Any other metrics I have missed that are important for debugging
> > > > > hotspotting cases
> > > > > 4. How I can get metrics when the RegionServer is not able to handle the
> > > > > JMX REST service
> > > > >
> > > > > Thanks a lot in advance, and please let me know if you need any other
> > > > > information
> > > > >
> > > >
> > >
> >
>
>
>
> --
>
> Address: 7th floor, No#3, Jinke road 2889, Pudong district, Shanghai,
> China.
> Mobile: +86 13621672634
>
