Hi,

In our production cluster we have been seeing a strange issue where a few RegionServers hit the exceptions below; their memory usage and GC times then climb, and eventually the RegionServer gets killed because of the long GC pauses that the JvmPauseMonitor reports. Here are the RegionServer logs:
2018-12-19 08:54:25,474 ERROR [RpcServer.FifoWFPBQ.default.handler=37,queue=1,port=60020] ipc.RpcServer: Unexpected throwable object
java.lang.NullPointerException
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2143)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
2018-12-19 08:54:25,475 WARN [RpcServer.FifoWFPBQ.default.handler=37,queue=1,port=60020] ipc.RpcServer: RpcServer.FifoWFPBQ.default.handler=37,queue=1,port=60020: caught a ClosedChannelException, this means that the server /10.130.157.159:60020 was processing a request but the client went away. The error message was: null
2018-12-19 08:54:25,662 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2018-12-19 08:54:25,685 ERROR [RpcServer.FifoWFPBQ.default.handler=20,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
java.lang.NegativeArraySizeException
        at org.apache.hadoop.hbase.CellUtil.cloneFamily(CellUtil.java:76)
        at org.apache.hadoop.hbase.client.Put.add(Put.java:315)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:580)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:734)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:710)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2146)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)
2018-12-19 08:55:27,482 ERROR [RpcServer.FifoWFPBQ.default.handler=14,queue=2,port=60020] ipc.RpcServer: Unexpected throwable object
java.lang.NegativeArraySizeException
        at org.apache.hadoop.hbase.CellUtil.cloneFamily(CellUtil.java:76)
        at org.apache.hadoop.hbase.client.Put.add(Put.java:315)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:580)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:734)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:710)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2146)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
2018-12-19 09:06:46,527 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6491ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=6972ms
2018-12-19 09:07:04,207 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2018-12-19 09:07:04,207 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker missed its start time
2018-12-19 09:07:04,207 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.HRegionServer$PeriodicMemstoreFlusher: Chore: ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892-MemstoreFlusherChore missed its start time
2018-12-19 09:07:04,208 WARN [ResponseProcessor for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747649_120007914] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747649_120007914
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2468)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:849)
2018-12-19 09:07:04,208 WARN [regionserver/ip-10-130-157-159.us-west-2.compute.internal/10.130.157.159:60020.append-pool1-t1] wal.MetricsWAL: regionserver/ip-10-130-157-159.us-west-2.compute.internal/10.130.157.159:60020.append-pool1-t1 took 41693 ms appending an edit to wal; len~=439.1 K
2018-12-19 09:07:09,916 WARN [ResponseProcessor for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747648_120007913] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-2085229460-10.130.146.52-1479167003253:blk_1193747648_120007913
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2468)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:849)
2018-12-19 09:07:41,667 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 17182ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=17667ms
2018-12-19 09:07:41,667 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_4] regionserver.HRegionServer$PeriodicMemstoreFlusher: Chore: ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892-MemstoreFlusherChore missed its start time
2018-12-19 09:07:41,667 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2018-12-19 09:07:49,701 INFO [ip-10-130-157-159.us-west-2.compute.internal,60020,1537240977892_ChoreService_3] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker missed its start time
2018-12-19 09:10:36,457 WARN [main] util.HeapMemorySizeUtil: hbase.regionserver.global.memstore.upperLimit is deprecated by hbase.regionserver.global.memstore.size
2018-12-19 09:10:38,993 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-235--1, built on 10/30/2017 02:17 GMT
2018-12-19 09:10:38,994 INFO [main] zookeeper.ZooKeeper: Client environment:host.name=ip-10-130-157-159.us-west-2.compute.internal
2018-12-19 09:10:38,994 INFO [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_151
2018-12-19 09:10:38,994 INFO [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
Logs from another RegionServer (10.130.146.146), from an earlier occurrence on 2018-12-15:

2018-12-15 15:02:12,875 WARN [RpcServer.FifoWFPBQ.default.handler=12,queue=0,port=60020] ipc.RpcServer: RpcServer.FifoWFPBQ.default.handler=12,queue=0,port=60020: caught a ClosedChannelException, this means that the server /10.130.146.146:60020 was processing a request but the client went away. The error message was: null
2018-12-15 15:02:13,824 ERROR [RpcServer.FifoWFPBQ.default.handler=18,queue=0,port=60020] ipc.RpcServer: Unexpected throwable object
java.lang.IllegalArgumentException: offset (369668079) + length (8) exceed the capacity of the array: 1640976
        at org.apache.hadoop.hbase.util.Bytes.explainWrongLengthOrOffset(Bytes.java:631)
        at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:605)
        at org.apache.hadoop.hbase.util.Bytes.toLong(Bytes.java:590)
        at org.apache.hadoop.hbase.KeyValue.keyToString(KeyValue.java:1241)
        at org.apache.hadoop.hbase.KeyValue.toString(KeyValue.java:1181)
        at org.apache.hadoop.hbase.client.Put.add(Put.java:321)
        at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:580)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:734)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:710)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2146)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32393)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2150)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:187)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:167)

We initially suspected that some packet corruption was happening over the network, but that does not seem to be the case. However, we have found that a few of the HBase clients that do Put operations (in batches) were running into OutOfMemoryError: Direct buffer memory and failing to write to HBase.
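For reference, a minimal sketch of that kind of batched Put path, written with a BufferedMutator and a bounded write buffer so that no single flush gets too large, might look like this (the column family, qualifier, row keys and buffer size are placeholders, not our actual schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public final class BatchedMetricWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Cap the client-side write buffer so each flush turns into a reasonably
    // small multi() RPC (2 MB here is an arbitrary placeholder value).
    BufferedMutatorParams params =
        new BufferedMutatorParams(TableName.valueOf("BLITZ_METRIC_NODE"))
            .writeBufferSize(2L * 1024 * 1024);
    try (Connection conn = ConnectionFactory.createConnection(conf);
         BufferedMutator mutator = conn.getBufferedMutator(params)) {
      for (int i = 0; i < 100_000; i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));            // placeholder row key
        put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("v"),    // placeholder CF/qualifier
            Bytes.toBytes(System.currentTimeMillis()));
        mutator.mutate(put);  // buffered locally, flushed in bounded batches
      }
      mutator.flush();        // push out whatever is still buffered
    }
  }
}

The idea is simply that smaller flushes mean smaller RPC payloads, and therefore smaller NIO buffers on both the client and the RegionServer side.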
Here are the logs from one of the HBase clients that hit the OutOfMemory errors:

2019-01-10 07:58:16.892 [HBase WRITE-3] INFO o.a.hadoop.hbase.client.AsyncProcess - Left over 9 task(s) are processed on server(s): [ip-10-130-157-159.us-west-2.compute.internal,60020,1545252703594, ip-10-130-146-101.us-west-2.compute.internal,60020,1545258402199, ip-10-130-146-48.us-west-2.compute.internal,60020,1545251399029, ip-10-130-158-137.us-west-2.compute.internal,60020,1545261163484, ip-10-130-146-223.us-west-2.compute.internal,60020,1546248582003, ip-10-130-158-232.us-west-2.compute.internal,60020,1545256203253, ip-10-130-157-244.us-west-2.compute.internal,60020,1545273682160]
2019-01-10 07:58:27.452 [HBase WRITE-2] INFO o.a.hadoop.hbase.client.AsyncProcess - , tableName=BLITZ_METRIC_NODE
2019-01-10 07:58:27.503 [HBase WRITE-4] INFO o.a.hadoop.hbase.client.AsyncProcess - , tableName=BLITZ_METRIC_NODE
2019-01-10 07:58:29.535 [htable-pool3231971-t2] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=239ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-248.us-west-2.compute.internal,60020,1545253418192, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 239 - final failure
2019-01-10 07:58:29.540 [htable-pool3231971-t7] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2567ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-158-232.us-west-2.compute.internal,60020,1545256203253, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2567 - final failure
2019-01-10 07:58:29.542 [htable-pool3231971-t6] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2694ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-223.us-west-2.compute.internal,60020,1546248582003, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2694 - final failure
2019-01-10 07:58:29.542 [htable-pool3231971-t3] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2820ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-159.us-west-2.compute.internal,60020,1545252703594, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2820 - final failure
2019-01-10 07:58:29.550 [htable-pool3231972-t7] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1620ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-158-232.us-west-2.compute.internal,60020,1545256203253, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1620 - final failure
2019-01-10 07:58:29.551 [htable-pool3231971-t4] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=907ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-101.us-west-2.compute.internal,60020,1545258402199, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 907 - final failure
2019-01-10 07:58:29.551 [htable-pool3231971-t8] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2657ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-244.us-west-2.compute.internal,60020,1545273682160, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2657 - final failure
2019-01-10 07:58:29.561 [htable-pool3231972-t8] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1637ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-244.us-west-2.compute.internal,60020,1545273682160, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1637 - final failure
2019-01-10 07:58:29.562 [htable-pool3231972-t2] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=140ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-248.us-west-2.compute.internal,60020,1545253418192, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 140 - final failure
2019-01-10 07:58:29.563 [htable-pool3231972-t5] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=2769ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-48.us-west-2.compute.internal,60020,1545251399029, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 2769 - final failure
2019-01-10 07:58:29.564 [htable-pool3231972-t6] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1623ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-223.us-west-2.compute.internal,60020,1546248582003, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1623 - final failure
2019-01-10 07:58:29.565 [htable-pool3231972-t4] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=625ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-101.us-west-2.compute.internal,60020,1545258402199, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 625 - final failure
2019-01-10 07:58:29.567 [htable-pool3231972-t3] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070504, table=BLITZ_METRIC_NODE, attempt=5/5 failed=1721ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-157-159.us-west-2.compute.internal,60020,1545252703594, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 1721 - final failure
2019-01-10 07:58:29.574 [htable-pool3231971-t5] WARN o.a.hadoop.hbase.client.AsyncProcess - #3070503, table=BLITZ_METRIC_NODE, attempt=5/5 failed=4782ops, last exception: java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Direct buffer memory on ip-10-130-146-48.us-west-2.compute.internal,60020,1545251399029, tracking started Thu Jan 10 07:55:21 UTC 2019; not retrying 4782 - final failure
2019-01-10 07:58:29.629 [HBase WRITE-4] ERROR c.a.b.p.c.h.w.AbstractHBaseWriter - Failed 10135 actions: IOException: 10135 times,

The errors on the RegionServers seem to happen right around the time the clients see the OutOfMemory errors, and from the timestamps alone it is not clear which is the cause and which is the effect.
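One thing we can do to get harder data is to watch the client's NIO direct-buffer usage while the writes are running, for example by polling the JDK's BufferPoolMXBean. A small diagnostic sketch (the class name, polling interval and logging are arbitrary):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class DirectBufferMonitor {
  public static void start() {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      for (BufferPoolMXBean pool :
          ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
        if ("direct".equals(pool.getName())) {
          // Used bytes vs total capacity of NIO direct buffers in this JVM;
          // compare against -XX:MaxDirectMemorySize to see how close to the
          // limit the client gets right before the batches start failing.
          System.out.printf("direct buffers: count=%d used=%d capacity=%d%n",
              pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }
      }
    }, 0, 30, TimeUnit.SECONDS);
  }
}

If the reported capacity is pinned near -XX:MaxDirectMemorySize (which on HotSpot roughly defaults to the -Xmx value when not set) right before the final failures, that would at least confirm the client is exhausting its direct memory at that moment.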
*Question:* Is it actually possible for an HBase client to send malformed/partial RPC messages to the RegionServer if the client JVM is hitting OutOfMemory errors because it is short on direct memory? It would be helpful to know the root cause here so that we can remediate the issue accordingly. These