bq. there is a bulkPut available in Spark-HBase module

bulkPut is an abstraction over the HBaseContext.foreachPartition method: it batches Puts per partition, but each Put still goes through the regular write path (WAL + memstore), so it is not the same thing as bulk load.
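For illustration, a minimal Scala sketch of bulkPut against the hbase-spark
module (the table name, column family, and input RDD are made up for the
example):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.spark.HBaseContext
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    object BulkPutSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("bulkPut-sketch"))
        val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

        // Hypothetical RDD of (rowKey, value) pairs to write.
        val rdd = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))

        // Batches Puts per partition via a BufferedMutator, but each Put
        // still goes through the normal write path (WAL + memstore).
        hbaseContext.bulkPut[(String, String)](
          rdd,
          TableName.valueOf("my_table"),
          { case (row, value) =>
            val put = new Put(Bytes.toBytes(row))
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("c"),
              Bytes.toBytes(value))
            put
          })
      }
    }

By contrast, hbaseBulkLoad() writes HFiles to a staging directory which you
then hand to LoadIncrementalHFiles, bypassing the memstore and WAL entirely.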
Please take a look at hbaseBulkLoad() in HBaseRDDFunctions. Note: if you use Spark 2.0, you would need the latest patch on HBASE-16179.

Cheers

On Thu, Sep 22, 2016 at 9:31 PM, Proust(Proust/Feng Guizhou) [Saipan] <
pf...@coupang.com> wrote:

> Thanks Ted and Vlad
>
> I uploaded the screenshots to flickr; they are clearer there:
> 1. MutateCount (Put count) at table level (null shown as gap)
> https://flic.kr/p/LxLa6d
> 2. MutateCount (Put count) at table level (null shown as connected line)
> https://flic.kr/p/LxREvK
> 3. MutateCount for the table with the max MemStoreSize
> https://flic.kr/p/MnUkCt
> 4. Max region MemStoreSize at table level
> https://flic.kr/p/LxR4TP
> 5. RegionCount for each RegionServer
> https://flic.kr/p/LxL9Zw
>
> > In the graph, there was a purple line with spikes - I assume this was not
> > for the table with max MemStoreSize.
> > The legend in the graph is hard to read.
> > If the sizes of the Puts vary, that may explain what you observed w.r.t.
> > the two graphs.
>
> The purple line with spikes belongs to a table fed by a scheduled Spark
> batch job; it has a low MemStoreSize and its spikes do not coincide with
> the problem time, so it is clearly not the problem.
> From https://flic.kr/p/MnUkCt (item 3 above), the metrics show no
> significant MutateCount change for the table with the max MemStoreSize.
> That is why I'm not confident the table is the root cause, although it
> does have high IO and is the biggest table.
>
> > 1. Tweak CMS config: -XX:CMSInitiatingOccupancyFraction=50
> > -XX:+UseCMSInitiatingOccupancyOnly.
> > Plus increase the heap size accordingly to accommodate the decrease of
> > the working set size (now CMS starts when 50% of the heap is occupied).
> > If you see a lot of "promotion failure" messages in the GC log - this
> > may help you.
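For concreteness, those CMS settings would look roughly like this in
hbase-env.sh (a sketch only: the 50% fraction and the heap size have to be
sized for your own working set, and on a CDH-managed cluster like the one
later in this thread you would set the equivalent Java options through
Cloudera Manager instead; the process info below shows it currently runs
with -XX:CMSInitiatingOccupancyFraction=70 and without
-XX:+UseCMSInitiatingOccupancyOnly):

    # Start CMS cycles at 50% old-gen occupancy, and only at that
    # threshold, so concurrent collections begin well before the heap
    # fills up and promotion failures force a stop-the-world full GC.
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=50 \
      -XX:+UseCMSInitiatingOccupancyOnly \
      -XX:+CMSParallelRemarkEnabled"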
>
> There are some "promotion failed" entries, but not that many:
>
> 529874.782: [GC[YG occupancy: 725422 K (1380160 K)]529874.782: [Rescan (parallel) , 0.0105120 secs]529874.793: [weak refs processing, 0.0031030 secs]529874.796: [scrub string table, 0.0008700 secs] [1 CMS-remark: 13341310K(15243712K)] 14066733K(16623872K), 0.0149260 secs] [Times: user=0.18 sys=0.00, real=0.01 secs]
> 529874.797: [CMS-concurrent-sweep-start]
> 529874.858: [GC529874.858: [ParNew: 1380160K->153344K(1380160K), 0.0406620 secs] 14721537K->13708366K(16623872K), 0.0408930 secs] [Times: user=0.57 sys=0.01, real=0.04 secs]
> 529875.005: [GC529875.005: [ParNew: 1380160K->153344K(1380160K), 0.0388060 secs] 14922932K->13919282K(16623872K), 0.0390210 secs] [Times: user=0.57 sys=0.02, real=0.04 secs]
> 529875.146: [GC529875.146: [ParNew: 1380160K->153344K(1380160K), 0.0388120 secs] 15099772K->14090963K(16623872K), 0.0390400 secs] [Times: user=0.56 sys=0.01, real=0.04 secs]
> 529875.285: [GC529875.285: [ParNew: 1380160K->153344K(1380160K), 0.0368240 secs] 15211572K->14209325K(16623872K), 0.0370880 secs] [Times: user=0.55 sys=0.01, real=0.04 secs]
> 529875.419: [GC529875.419: [ParNew: 1380160K->153344K(1380160K), 0.0392130 secs] 15403252K->14384973K(16623872K), 0.0395230 secs] [Times: user=0.56 sys=0.00, real=0.04 secs]
> 529875.557: [GC529875.557: [ParNew: 1380160K->153344K(1380160K), 0.0373710 secs] 15554872K->14541229K(16623872K), 0.0375980 secs] [Times: user=0.56 sys=0.01, real=0.04 secs]
> 529875.699: [GC529875.699: [ParNew (*promotion failed*): 1380160K->1380160K(1380160K), 0.0922840 secs] 15726469K->15831128K(16623872K), 0.0924960 secs] [Times: user=0.53 sys=0.00, real=0.10 secs]
> GC locker: Trying a full collection because scavenge failed
> 529875.791: [Full GC529875.791: [CMS529876.249: [CMS-concurrent-sweep: 1.080/1.452 secs] [Times: user=15.83 sys=1.81, real=1.45 secs] (concurrent mode failure): 14450968K->8061155K(15243712K), 7.4250640 secs] 15831128K->8061155K(16623872K), [CMS Perm : 55346K->55346K(83968K)], 7.4252340 secs] [Times: user=7.43 sys=0.00, real=7.42 secs]
> 529883.326: [GC529883.326: [ParNew: 1226816K->153344K(1380160K), 0.0339490 secs] 9287971K->8288344K(16623872K), 0.0341880 secs] [Times: user=0.52 sys=0.01, real=0.03 secs]
>
> > 2. Migrate to 1.8 Java and switch to G1GC. Search "tuning G1GC for
> > HBase" on the Internet
>
> Thanks for this suggestion; I believe this is the direction to move in.
> For now, I need to identify the problem and get a deeper understanding
> of HBase, so that I can address it with confidence.
>
> > 3. Analyze what has triggered long GC in RS. Usually it is some long
> > running M/R jobs or data import. Consider other approaches which are
> > not so intrusive (ex. bulk load of data)
>
> According to
> http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/,
> bulk load is good for ETL. The biggest table, the one with the max
> MemStoreSize, gets its data from RabbitMQ and then Puts to HBase row by
> row. I'm not clear yet how to change that to bulk load - any
> suggestions, or do I misunderstand bulk load? I can see there is a
> bulkPut available in the Spark-HBase module (available in 2.0 :(); is
> that the same thing as bulk load? It seems I still need to learn more.
>
> > For question #2, were there region splits corresponding to the increase?
> > Please check master log. You can pastebin master log if needed.
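On the RabbitMQ-to-Put pipeline above: true bulk load (writing HFiles and
handing them to LoadIncrementalHFiles) fits scheduled batch jobs better than
a continuous consumer. A lighter first step is to stop issuing one RPC per
message and batch the writes through a BufferedMutator instead. A minimal
sketch, where the table name, column family, and the consume loop are
stand-ins for your real consumer:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{BufferedMutatorParams, ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object BufferedPutSketch {
      def main(args: Array[String]): Unit = {
        val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
        // Buffer up to ~4 MB of Puts client-side before flushing one batched RPC.
        val params = new BufferedMutatorParams(TableName.valueOf("my_table"))
          .writeBufferSize(4 * 1024 * 1024)
        val mutator = connection.getBufferedMutator(params)
        try {
          // In the real application this loop would drain RabbitMQ messages.
          for (i <- 1 to 10000) {
            val put = new Put(Bytes.toBytes(s"row-$i"))
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("c"),
              Bytes.toBytes(s"value-$i"))
            mutator.mutate(put) // buffered, not sent immediately
          }
          mutator.flush()       // push any remaining buffered Puts
        } finally {
          mutator.close()
          connection.close()
        }
      }
    }

The buffer turns thousands of single-row RPCs into a few multi-row ones,
which cuts RPC overhead on the RegionServer; unlike bulk load it still goes
through the WAL and memstore.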
>
> Below is the master log at the problem time; it says nothing except that
> the RegionServer is dead 😂
>
> 2016-09-21 11:01:56,746 WARN
> org.apache.hadoop.hbase.master.cleaner.TimeToLiveHFileCleaner: Found a hfile
> (hdfs://nameservice1/hbase/archive/data/prod/variable/650cefeb142f5a6558adb82f89151d8d/d/bd487f2dd9424fda98c35e8230982f9c)
> newer than current time (1474423316746 < 1474423316747), probably a clock skew
> 2016-09-21 11:46:19,786 ERROR
> org.apache.hadoop.hbase.master.MasterRpcServices: Region server
> fds-hadoop-prod06-mp,60020,1473551672107 reported a fatal error:
> ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
>   at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
>   at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
>   at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
>   at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>   at java.lang.Thread.run(Thread.java:745)
>
> Cause:
> org.apache.hadoop.hbase.YouAreDeadException:
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
>   at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
>   at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
>   at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
>   at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>   at java.lang.Thread.run(Thread.java:745)
>
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>   at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:328)
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1092)
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:900)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException):
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> currently processing fds-hadoop-prod06-mp,60020,1473551672107 as dead server
>   at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
>   at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
>   at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
>   at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>   at java.lang.Thread.run(Thread.java:745)
>
>   at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1219)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>   at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289)
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1090)
>   ... 2 more
>
> 2016-09-21 11:46:19,801 ERROR
> org.apache.hadoop.hbase.master.MasterRpcServices: Region server
> fds-hadoop-prod06-mp,60020,1473551672107 reported a fatal error:
> ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107:
> regionserver:60020-0x255c4a5f76332f6,
> quorum=fds-hadoop-prod05-mp:2181,fds-hadoop-prod07-mp:2181,fds-hadoop-prod06-mp:2181,
> baseZNode=/hbase regionserver:60020-0x255c4a5f76332f6 received expired from
> ZooKeeper, aborting
> Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>   at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:525)
>   at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:436)
>   at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>
> 2016-09-21 11:46:19,801 ERROR
> org.apache.hadoop.hbase.master.MasterRpcServices: Region server
> fds-hadoop-prod06-mp,60020,1473551672107 reported a fatal error:
> ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107: IOE in log roller
> Cause:
> java.io.IOException: cannot get log writer
>   at org.apache.hadoop.hbase.wal.DefaultWALProvider.createWriter(DefaultWALProvider.java:365)
>   at org.apache.hadoop.hbase.regionserver.wal.FSHLog.createWriterInstance(FSHLog.java:757)
>   at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:722)
>   at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:148)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: Parent directory doesn't exist:
> /hbase/WALs/fds-hadoop-prod06-mp,60020,1473551672107
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:2466)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2760)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2674)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2559)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:592)
>   at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:110)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:395)
>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1872)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1737)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1697)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:449)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:445)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:445)
>   at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:1121)
>   at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:1097)
>   at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.init(ProtobufLogWriter.java:90)
>   at org.apache.hadoop.hbase.wal.DefaultWALProvider.createWriter(DefaultWALProvider.java:361)
>   ... 4 more
> Caused by:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException):
> Parent directory doesn't exist:
> /hbase/WALs/fds-hadoop-prod06-mp,60020,1473551672107
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:2466)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2760)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2674)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2559)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:592)
>   at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:110)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:395)
>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>
>   at org.apache.hadoop.ipc.Client.call(Client.java:1466)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>   at com.sun.proxy.$Proxy23.create(Unknown Source)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:295)
>   at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy24.create(Unknown Source)
>   at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
>   at com.sun.proxy.$Proxy25.create(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1867)
>   ... 14 more
>
> Appreciate all your help
> Proust
>
>
> On 23 September 2016 at 01:39, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > For tuning G1GC, see the head of https://blogs.apache.org/hbase/
> >
> > FYI
> >
> > On Thu, Sep 22, 2016 at 10:30 AM, Vladimir Rodionov <
> > vladrodio...@gmail.com> wrote:
> >
> > > Your RS was declared dead because of a long GC.
> > >
> > > What you can do:
> > >
> > > 1. Tweak CMS config: -XX:CMSInitiatingOccupancyFraction=50
> > > -XX:+UseCMSInitiatingOccupancyOnly.
> > > Plus increase the heap size accordingly to accommodate the decrease
> > > of the working set size (now CMS starts when 50% of the heap is
> > > occupied). If you see a lot of "promotion failure" messages in the GC
> > > log - this may help you.
> > >
> > > 2. Migrate to 1.8 Java and switch to G1GC. Search "tuning G1GC for
> > > HBase" on the Internet
> > >
> > > 3. Analyze what has triggered long GC in RS. Usually it is some long
> > > running M/R jobs or data import. Consider other approaches which are
> > > not so intrusive (ex. bulk load of data)
> > >
> > > -Vlad
> > >
> > > On Thu, Sep 22, 2016 at 10:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >
> > > > bq. the MutateCount metrics didn't show there is a peak time before
> > > > and after the problem on that table
> > > >
> > > > In the graph, there was a purple line with spikes - I assume this
> > > > was not for the table with max MemStoreSize.
> > > > The legend in the graph is hard to read.
> > > > If the sizes of the Puts vary, that may explain what you observed
> > > > w.r.t. the two graphs.
> > > >
> > > > For question #2, were there region splits corresponding to the
> > > > increase? Please check master log. You can pastebin master log if
> > > > needed.
> > > >
> > > > For #4, if JMX doesn't respond, one option is to check the region
> > > > server log - this is a workaround.
> > > >
> > > > On Thu, Sep 22, 2016 at 9:39 AM, Proust(Proust/Feng Guizhou)
> > > > [Saipan] < pf...@coupang.com> wrote:
> > > >
> > > > > Hi, HBase Users
> > > > >
> > > > > I encountered a dead-RegionServer case and want to identify which
> > > > > HBase table actually caused the problem.
> > > > >
> > > > > Based on the HBase JMX RESTful service, what I have on hand is:
> > > > > 1. Table-level Grafana monitoring for Put, Get, Scan, MemStoreSize
> > > > > 2. RegionServer-level Grafana monitoring for GC and RegionCount
> > > > > 3. RegionServer logs
> > > > >
> > > > > *HRegionServer java process information*
> > > > > /usr/java/jdk1.7.0_67-cloudera/bin/java -Dproc_regionserver
> > > > > -XX:OnOutOfMemoryError=kill -9 %p -Djava.net.preferIPv4Stack=true
> > > > > -Xms34359738368 -Xmx34359738368 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> > > > > -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
> > > > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> > > > > -Xloggc:/pang/log/hbase/gc-hbase.log -XX:+UseGCLogFileRotation
> > > > > -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M
> > > > > -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh
> > > > > -Dhbase.log.dir=/pang/log/hbase
> > > > > -Dhbase.log.file=hbase-cmf-hbase-REGIONSERVER-fds-hadoop-prod06-mp.log.out
> > > > > -Dhbase.home.dir=/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hbase
> > > > > -Dhbase.id.str= -Dhbase.root.logger=WARN,RFA
> > > > > -Djava.library.path=/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hadoop/lib/native:/pang/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hbase/lib/native/Linux-amd64-64
> > > > > -Dhbase.security.logger=INFO,RFAS
> > > > > org.apache.hadoop.hbase.regionserver.HRegionServer start
> > > > >
> > > > > *RegionServer logs show a GC pause*
> > > > > 2016-09-21 11:46:19,263 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > > DFSOutputStream ResponseProcessor exception for block
> > > > > BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582
> > > > > java.io.EOFException: Premature EOF: no length prefix available
> > > > >   at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
> > > > >   at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
> > > > >   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:971)
> > > > > 2016-09-21 11:46:19,274 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > > > > Recovery for block
> > > > > BP-45016343-10.20.30.13-1446197023889:blk_1141417801_67682582
> > > > > in pipeline DatanodeInfoWithStorage[10.10.9.15:50010,DS-4209cc16-e676-4226-be0b-1328075bbdd7,DISK],
> > > > > DatanodeInfoWithStorage[10.10.9.19:50010,DS-c06678d6-3f0d-40a2-82e2-62387e26af7b,DISK],
> > > > > DatanodeInfoWithStorage[10.10.9.21:50010,DS-7bf90b1e-7862-4cc1-a121-31768021a3aa,DISK]:
> > > > > bad datanode DatanodeInfoWithStorage[10.10.9.15:50010,DS-4209cc16-e676-4226-be0b-1328075bbdd7,DISK]
> > > > > 2016-09-21 11:46:19,266 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor:
> > > > > *Detected pause in JVM or host machine (eg GC): pause of approximately 152322ms
> > > > > GC pool 'ParNew' had collection(s): count=1 time=465ms
> > > > > GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=152238ms*
> > > > > 2016-09-21 11:46:19,263 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > > slept 153423ms instead of 3000ms, this is likely due to a long
> > > > > garbage collecting pause and it's usually bad, see
> > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client
> > > > > session timed out, have not heard from server in 156121ms for
> > > > > sessionid 0x255c4a5f76332f6, closing socket connection and
> > > > > attempting reconnect
> > > > > 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client
> > > > > session timed out, have not heard from server in 153414ms for
> > > > > sessionid 0x355c4a5f3c2469f, closing socket connection and
> > > > > attempting reconnect
> > > > > 2016-09-21 11:46:19,263 INFO org.apache.zookeeper.ClientCnxn: Client
> > > > > session timed out, have not heard from server in 156121ms for
> > > > > sessionid 0x255c4a5f763332c, closing socket connection and
> > > > > attempting reconnect
> > > > >
> > > > > *RegionServer logs show server dead:*
> > > > > 2016-09-21 11:46:19,485 FATAL
> > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > > > ABORTING region server fds-hadoop-prod06-mp,60020,1473551672107:
> > > > > *org.apache.hadoop.hbase.YouAreDeadException*: Server REPORT rejected;
> > > > > currently processing fds-hadoop-prod06-mp,60020,1473551672107 as
> > > > > *dead server*
> > > > >   at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
> > > > >   at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
> > > > >   at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
> > > > >   at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
> > > > >   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
> > > > >   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > > >   at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > > >   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > > >   at java.lang.Thread.run(Thread.java:745)
> > > > >
> > > > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > Server REPORT rejected; currently processing
> > > > > fds-hadoop-prod06-mp,60020,1473551672107 as dead server
> > > > >   at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
> > > > >   at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
> > > > >   at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
> > > > >   at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
> > > > >   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
> > > > >   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > > >   at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > > >   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > > >   at java.lang.Thread.run(Thread.java:745)
> > > > >
> > > > >   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > >   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > >   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > >   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > > >   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> > > > >   at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
> > > > >   at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:328)
> > > > >   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1092)
> > > > >   at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:900)
> > > > >   at java.lang.Thread.run(Thread.java:745)
> > > > > Caused by:
> > > > > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException):
> > > > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > Server REPORT rejected; currently processing
> > > > > fds-hadoop-prod06-mp,60020,1473551672107 as dead server
> > > > >   at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:382)
> > > > >   at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:287)
> > > > >   at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:287)
> > > > >   at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7912)
> > > > >   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
> > > > >   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > > >   at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > > >   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > > >   at java.lang.Thread.run(Thread.java:745)
> > > > >
> > > > >   at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1219)
> > > > >   at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
> > > > >   at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
> > > > >   at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:8289)
> > > > >   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1090)
> > > > >   ... 2 more
> > > > > 2016-09-21 11:46:19,505 FATAL
> > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > > > RegionServer abort: loaded coprocessors are:
> > > > > [org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> > > > > org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> > > > > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> > > > > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> > > > > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> > > > > org.apache.phoenix.coprocessor.ScanRegionObserver,
> > > > > org.apache.phoenix.hbase.index.Indexer,
> > > > > org.apache.phoenix.coprocessor.SequenceRegionObserver]
> > > > > 2016-09-21 11:46:19,498 FATAL
> > > > > org.apache.hadoop.hbase.regionserver.LogRoller: Aborting
> > > > > java.io.IOException: cannot get log writer
> > > > >
> > > > > *Error level RegionServer logs when the problem happened*
> > > > > 2016-09-21 11:46:19,347 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,348 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,350 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,349 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,385 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,351 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,388 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:19,384 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of WAL
> > > > > 2016-09-21 11:46:20,048 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 692752
> > > > > 2016-09-21 11:46:20,103 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1831544
> > > > > 2016-09-21 11:46:20,111 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 44967984
> > > > > 2016-09-21 11:46:20,131 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2334208
> > > > > 2016-09-21 11:46:20,308 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 4393560
> > > > > 2016-09-21 11:46:20,334 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2905432
> > > > > 2016-09-21 11:46:20,373 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 2310808
> > > > > 2016-09-21 11:46:20,402 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 1112648
> > > > > 2016-09-21 11:46:20,422 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 6578048
> > > > > 2016-09-21 11:46:20,437 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 59242936
> > > > > 2016-09-21 11:46:20,608 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 4482224
> > > > > 2016-09-21 11:46:20,624 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 20144
> > > > > 2016-09-21 11:46:20,652 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 23568
> > > > > 2016-09-21 11:46:20,687 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 14395336
> > > > > 2016-09-21 11:46:20,706 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Memstore size is 14025824
> > > > >
> > > > > Attached are some metrics screenshots captured when the problem
> > > > > happened. All the metrics have gaps because the RegionServer was
> > > > > not able to handle /jmx requests at all.
> > > > >
> > > > > 1. One screenshot for MutateCount (Put count) at table level
> > > > > http://img0.ph.126.net/coIWlwP2w44jnl1kU5CbcA==/6631704382933867248.png
> > > > > 2. One screenshot for RegionServer-level GC time
> > > > > http://img0.ph.126.net/ouciVyCvpqEkMPV3phnx5A==/6631661501980397103.png
> > > > > 3. One screenshot for max region MemStoreSize at table level
> > > > > http://img0.ph.126.net/rEQhCk-VWx3HGbcZCM51iA==/6631798940933860065.png
> > > > > 4. One screenshot for RegionCount for each RegionServer
> > > > > http://img0.ph.126.net/JKPA4-7NJEkFc3_TxOv63A==/6631679094166431135.png
> > > > >
> > > > > By the way, all the legends in the screenshots are ordered by max
> > > > > value, so the metric series with the max value shows last.
> > > > >
> > > > > With all this information, what I see now is:
> > > > > 1. A lot of tables had a sudden Put increase after the RegionServer
> > > > > came back into service. My understanding is that the data-source
> > > > > streaming could not process all the data when the problem happened,
> > > > > so roughly twice the data accumulated and arrived later; those
> > > > > tables do not seem to be the root cause.
> > > > > 2. From the GC metrics, it is easy to identify which region server
> > > > > had the problem, and all the logs above came from that server.
> > > > > 3. All the region servers had a sudden region-count increase and
> > > > > decrease, except the dead one.
> > > > > 4. From the max MemStoreSize metric screenshot, I lean toward the
> > > > > table with the max MemStoreSize being the root cause, but the
> > > > > MutateCount metrics do not show a peak before or after the problem
> > > > > on that table, so I'm not confident in this judgement.
> > > > >
> > > > > I need help with:
> > > > > 1. Identifying which table caused the RegionServer death
> > > > > 2. Why the RegionCount had a sudden increase and decrease on all
> > > > > the servers that did not die
> > > > > 3. Any other metrics I missed that are important for debugging
> > > > > hotspotting cases
> > > > > 4. How to get metrics when the RegionServer is not able to serve
> > > > > the JMX RESTful service
> > > > >
> > > > > Thanks a lot in advance, and please let me know if you need any
> > > > > other information
> > > >
> >
> --
> 地址: 上海市浦东新区金科路2889弄长泰广场C座7楼
> Address: 7th floor, No#3, Jinke road 2889, Pudong district, Shanghai,
> China.
> Mobile: +86 13621672634
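P.S. On question #4 above: a GC-stalled server cannot answer /jmx at all, so
the practical fix is on the collector side - poll with a short timeout so a
stall shows up as a recorded gap instead of a hung scraper, and fall back to
the region server log as Ted suggests. A rough Scala sketch; the hostname is
the one from this thread, the info port (60030) is the default, and the bean
name follows HBase 1.x metrics naming, so verify all three for your version:

    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    object JmxPoll {
      def main(args: Array[String]): Unit = {
        // The qry parameter narrows the response to the per-region
        // metrics bean instead of dumping every MBean.
        val url = new URL("http://fds-hadoop-prod06-mp:60030/jmx" +
          "?qry=Hadoop:service=HBase,name=RegionServer,sub=Regions")
        val conn = url.openConnection().asInstanceOf[HttpURLConnection]
        conn.setConnectTimeout(5000) // fail fast on a GC-stalled server
        conn.setReadTimeout(5000)    // instead of hanging the collector
        try {
          val json = Source.fromInputStream(conn.getInputStream).mkString
          println(json.take(400))    // hand the JSON to your metrics pipeline
        } finally conn.disconnect()
      }
    }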