Hi,

In our production cluster we have been seeing a strange issue where a few RegionServers hit the exceptions below; their memory usage and GC times then climb, and eventually the RegionServer gets killed because of the long GC pauses that the JvmPauseMonitor reports. Here are the RegionServer logs:
2018-12-19 08:54:25,474 ERROR [RpcServer.FifoWFPBQ.default.handler=37,queue=1,port=60020] ipc.RpcServer: Unexpected throwable object
java.lang.NullPointerException
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2143)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
2018-12-19 08:54:25,475 WARN [RpcServer.FifoWFPBQ.default.handler=37,queue=1,port=60020] ipc.RpcServer: RpcServer.FifoWFPBQ.default.handler=37,queue=1,port=60020: caught a ClosedChannelException, this means that the server /10.130.157.159:60020 was processing a request but the client went away. The error message was: null
2018-12-19 08:54:25,662 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2018-12-19 08:54:25,685 ERROR [RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
java.lang.NegativeArraySizeException
        at org.apache.hadoop.hbase.CellUtil.cloneFamily(CellUtil.java:76)
        at org.apache.hadoop.hbase.client.Put.add(Put.java:315)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:580)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:734)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:710)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2146)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
2018-12-19 08:55:27,482 ERROR [RpcServer.FifoWFPBQ.default.handler=14,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
java.lang.NegativeArraySizeException
        at org.apache.hadoop.hbase.CellUtil.cloneFamily(CellUtil.java:76)
        at org.apache.hadoop.hbase.client.Put.add(Put.java:315)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:580)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:734)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:710)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2146)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
2018-12-19 09:06:46,527 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6491ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=6972ms
2018-12-19 09:07:04,207 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2018-12-19 09:07:04,207 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker missed its start time
2018-12-19 09:07:04,207 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.HRegionServer$PeriodicMemstoreFlusher: Chore: ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892-MemstoreFlusherChore missed its start time
2018-12-19 09:07:04,208 WARN [ResponseProcessor for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747649_120007914] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747649_120007914
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2468)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:849)
2018-12-19 09:07:04,208 WARN [regionserver/ip-10-130-157-159.us-west-2.compute.internal/10.130.157.159:60020.append-pool1-t1] wal.MetricsWAL: regionserver/ip-10-130-157-159.us-west-2.compute.internal/10.130.157.159:60020.append-pool1-t1 took 41693 ms appending an edit to wal; len~=439.1 K
2018-12-19 09:07:09,916 WARN [ResponseProcessor for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747648_120007913] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747648_120007913
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2468)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:849)
2018-12-19 09:07:41,667 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 17182ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=17667ms
2018-12-19 09:07:41,667 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_4] regionserver.HRegionServer$PeriodicMemstoreFlusher: Chore: ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892-MemstoreFlusherChore missed its start time
2018-12-19 09:07:41,667 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2018-12-19 09:07:49,701 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_3] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker missed its start time
2018-12-19 09:10:36,457 WARN [main] util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
2018-12-19 09:10:38,993 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-235--1, built on 10/30/2017 02:17 GMT
2018-12-19 09:10:38,994 INFO [main] zookeeper.ZooKeeper: Client environment:host.name=ip-10-130-157-159.us-west-2.compute.internal
2018-12-19 09:10:38,994 INFO [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_151
2018-12-19 09:10:38,994 INFO [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
Logs from another RegionServer (10.130.146.146), from an earlier occurrence on 2018-12-15:

2018-12-15 15:02:12,875 WARN [RpcServer.FifoWFPBQ.default.handler=12,queue=0,port=60020] ipc.RpcServer: RpcServer.FifoWFPBQ.default.handler=12,queue=0,port=60020: caught a ClosedChannelException, this means that the server /10.130.146.146:60020 was processing a request but the client went away. The error message was: null
2018-12-15 15:02:13,824 ERROR [RpcServer.FifoWFPBQ.default.handler=18,queue=0,port=60020] ipc.RpcServer: Unexpected throwable object
java.lang.IllegalArgumentException: offset (369668079) + length (8) exceed the capacity of the array: 1640976
        at org.apache.hadoop.hbase.util.Bytes.explainWrongLengthOrOffset(Bytes.java:631)
        at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:605)
        at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:590)
        at org.apache.hadoop.hbase.KeyValue.keyToString(KeyValue.java:1241)
        at org.apache.hadoop.hbase.KeyValue.toString(KeyValue.java:1181)
        at org.apache.hadoop.hbase.client.Put.add(Put.java:321)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:580)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:734)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:710)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2146)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)

We initially suspected that some packet corruption was happening over the network, but that does not seem to be the case. However, we have found that a few of the HBase clients that do Put operations (in batches) were running into OutOfMemoryError: Direct buffer memory and failing to write to HBase.
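For reference, a minimal sketch of that kind of batched Put path, written with a BufferedMutator and a bounded write buffer so that no single flush gets too large, might look like this (the column family, qualifier, row keys and buffer size are placeholders, not our actual schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public final class BatchedMetricWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Cap the client-side write buffer so each flush turns into a reasonably
    // small multi() RPC (2 MB here is an arbitrary placeholder value).
    BufferedMutatorParams params =
        new BufferedMutatorParams(TableName.valueOf("BLITZ_METRIC_NODE"))
            .writeBufferSize(2L * 1024 * 1024);
    try (Connection conn = ConnectionFactory.createConnection(conf);
         BufferedMutator mutator = conn.getBufferedMutator(params)) {
      for (int i = 0; i < 100_000; i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));            // placeholder row key
        put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("v"),    // placeholder CF/qualifier
            Bytes.toBytes(System.currentTimeMillis()));
        mutator.mutate(put);  // buffered locally, flushed in bounded batches
      }
      mutator.flush();        // push out whatever is still buffered
    }
  }
}

The idea is simply that smaller flushes mean smaller RPC payloads, and therefore smaller NIO buffers on both the client and the RegionServer side.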
Here are the logs from one of the HBase clients that hit the OutOfMemory errors:

2019-01-10 07:58:16.892 [HBase WRITE-3] INFO o.a.hadoop.hbase.client.AsyncProcess - Left over 9 task(s) are processed on server(s): [ip-10-130-157-159.us-west-2.compute.internal,60020,1545252703594, ip-10-130-146-101.us-west-2.compute.internal,60020,1545258402199, ip-10-130-146-48.us-west-2.compute.internal,60020,1545251399029, ip-10-130-158-137.us-west-2.compute.internal,60020,1545261163484, ip-10-130-146-223.us-west-2.compute.internal,60020,1546248582003, ip-10-130-158-232.us-west-2.compute.internal,60020,1545256203253, ip-10-130-157-244.us-west-2.compute.internal,60020,1545273682160]
2019-01-10 07:58:27.452 [HBase WRITE-2] INFO o.a.hadoop.hbase.client.AsyncProcess - , tableName=BLITZ_METRIC_NODE
2019-01-10 07:58:27.503 [HBase WRITE-4] INFO o.a.hadoop.hbase.client.AsyncProcess - , tableName=BLITZ_METRIC_NODE
2019-01-10 07:58:29.535 [htable-pool3231971-t2] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=239ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-248.us-west-2.compute.internal,60020,1545253418192, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 239 - final failure
2019-01-10 07:58:29.540 [htable-pool3231971-t7] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2567ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-158-232.us-west-2.compute.internal,60020,1545256203253, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2567 - final failure
2019-01-10 07:58:29.542 [htable-pool3231971-t6] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2694ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-223.us-west-2.compute.internal,60020,1546248582003, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2694 - final failure
2019-01-10 07:58:29.542 [htable-pool3231971-t3] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2820ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-159.us-west-2.compute.internal,60020,1545252703594, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2820 - final failure
2019-01-10 07:58:29.550 [htable-pool3231972-t7] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1620ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-158-232.us-west-2.compute.internal,60020,1545256203253, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1620 - final failure
2019-01-10 07:58:29.551 [htable-pool3231971-t4] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=907ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-101.us-west-2.compute.internal,60020,1545258402199, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 907 - final failure
2019-01-10 07:58:29.551 [htable-pool3231971-t8] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2657ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-244.us-west-2.compute.internal,60020,1545273682160, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2657 - final failure
2019-01-10 07:58:29.561 [htable-pool3231972-t8] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1637ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-244.us-west-2.compute.internal,60020,1545273682160, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1637 - final failure
2019-01-10 07:58:29.562 [htable-pool3231972-t2] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=140ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-248.us-west-2.compute.internal,60020,1545253418192, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 140 - final failure
2019-01-10 07:58:29.563 [htable-pool3231972-t5] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2769ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-48.us-west-2.compute.internal,60020,1545251399029, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2769 - final failure
2019-01-10 07:58:29.564 [htable-pool3231972-t6] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1623ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-223.us-west-2.compute.internal,60020,1546248582003, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1623 - final failure
2019-01-10 07:58:29.565 [htable-pool3231972-t4] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=625ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-101.us-west-2.compute.internal,60020,1545258402199, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 625 - final failure
2019-01-10 07:58:29.567 [htable-pool3231972-t3] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1721ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-159.us-west-2.compute.internal,60020,1545252703594, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1721 - final failure
2019-01-10 07:58:29.574 [htable-pool3231971-t5] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=4782ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-48.us-west-2.compute.internal,60020,1545251399029, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 4782 - final failure
2019-01-10 07:58:29.629 [HBase WRITE-4] ERROR c.a.b.p.c.h.w.AbstractHBaseWriter - Failed 10135 actions: IOException: 10135 times,

The errors on the RegionServers seem to happen right around the time the clients see the OutOfMemory errors, and from the timestamps alone it is not clear which is the cause and which is the effect.
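One thing we can do to get harder data is to watch the client's NIO direct-buffer usage while the writes are running, for example by polling the JDK's BufferPoolMXBean. A small diagnostic sketch (the class name, polling interval and logging are arbitrary):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class DirectBufferMonitor {
  public static void start() {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      for (BufferPoolMXBean pool :
          ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
        if ("direct".equals(pool.getName())) {
          // Used bytes vs total capacity of NIO direct buffers in this JVM;
          // compare against -XX:MaxDirectMemorySize to see how close to the
          // limit the client gets right before the batches start failing.
          System.out.printf("direct buffers: count=%d used=%d capacity=%d%n",
              pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }
      }
    }, 0, 30, TimeUnit.SECONDS);
  }
}

If the reported capacity is pinned near -XX:MaxDirectMemorySize (which on HotSpot roughly defaults to the -Xmx value when not set) right before the final failures, that would at least confirm the client is exhausting its direct memory at that moment.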
*Question:* Is it actually possible for an HBase client to send malformed/partial RPC messages to the RegionServer if the client JVM is hitting OutOfMemory errors because it is short on direct memory? It would be helpful to know the root cause here so that we can remediate the issue accordingly. These