Can you show the complete stack trace for StackOverflowException (using pastebin) ?
Thanks On Aug 23, 2014, at 12:11 PM, Johannes Schaback <[email protected]> wrote: > Hi, > > we had to reduce load on the cluster yesterday night which reduced the > frequency of the phenomenon. That is why I could not get a jstack dump yet > because it did not occur since a couple hours. We will now get the load > back up hoping to trigger it again. > > Yes, I cut out the properties from the /debug dump because they are all > standard. We have hbase.ipc.server.callqueue.handler.factor to the standard > > <property> > <name>hbase.ipc.server.callqueue.handler.factor</name> > <value>0.1</value> > <source>hbase-default.xml</source> > </property> > > You find the complete config of the RS here: http://pastebin.com/iF1ibFb1 > > The hint about the .out files was a great one (I never really looked at > them actually). The .out files are flooded with StackOverflowExceptions: > > Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020" > java.lang.StackOverflowError > at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) > at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) > at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) > at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) > (and so on...) > > Filtering the .out file for "Exception" shows that several handlers crashed > like that: > > Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020" > java.lang.StackOverflowError > > Unfortunately, the exceptions are not timestamped so that I can not > correlate their occurrence with the exact time when the RS starts filling > up the queue. > > > On Sat, Aug 23, 2014 at 8:28 PM, Stack <[email protected]> wrote: > >> Anything in your .out that could help explain our losing handlers if you >> can't find anything in the logs? >> >> You did the 'snipp' in the below, right Johannes? >> >> RS Configuration: >> =========================================================== >> <?xml version="1.0" encoding="UTF-8" standalone="no"?><configuration> >> [snipp] no fancy stuff, all default, except absolute necessary settings >> [snipp] >> </configuration> >> >> As per Qian, if hbase.ipc.server.callqueue.handler.factor, that could help >> explain why we have handlers but they are not 'taking' from the call queue, >> they are stuck taking on those queues that do not have calls queued. >> >> St.Ack >> >> >> >> >> >> >> >> On Sat, Aug 23, 2014 at 2:56 AM, Qiang Tian <[email protected]> wrote: >> >>> Did you set hbase.ipc.server.callqueue.handler.factor? >>> it looks there are 3 queues, handlers on queue 1 are all gone as Stack >>> mentioned. jstack and pastebin regions server log would help. >>> >>> >>> >>> >>> >>> On Sat, Aug 23, 2014 at 7:02 AM, Stack <[email protected]> wrote: >>> >>>> On Fri, Aug 22, 2014 at 3:24 PM, Johannes Schaback < >>>> [email protected]> wrote: >>>> >>>>> ... >>>>> I grep'ed "defaultRpcServer.handler=" on the log from that particular >>> RS. >>>>> The >>>>> RS started at 15:35. After that, the handlers >>>>> >>>>> 6, 24, 0, 15, 28, 26, 7, 19, 21, 3, 5 and 23 >>>>> >>>>> make an appearance in error messages of HDFS related exceptions: >>>>> >>>>> 2014-08-22 13:54:57,470 WARN >>>>> [defaultRpcServer.handler=6,queue=0,port=60020] >>> hdfs.BlockReaderFactory: >>>>> I/O error constructing remote block reader. >>>>> 2014-08-22 13:54:57,472 WARN >>>>> [defaultRpcServer.handler=6,queue=0,port=60020] hdfs.DFSClient: >>>> Connection >>>>> failure: Failed to connect to /192.168.3.233:50010 for file >>>>> /hbase/data/default/image/dae1b8e3bfdb608571d09916bf0f >>>>> a156/cf/866a773857654b0d83275dc4e4558be6 for block >> BP-1157637685-192.168.3.192-1382642140917:blk_1294949058_255264636:java.io.IOException: >>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:37835, remote=/ >>>>> 192.168.3.233:50 010, for file >> /hbase/data/default/image/dae1b8e3bfdb608571d09916bf0fa156/cf/866a773857654b0d83275dc4e4558be6, >>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block >>>>> 1294949058_255264636 >>>>> 2014-08-22 13:56:58,525 WARN >>>>> [defaultRpcServer.handler=24,queue=0,port=60020] >>>> hdfs.BlockReaderFactory: >>>>> I/O error constructing remote block reader. >>>>> 2014-08-22 13:56:58,799 WARN >>>>> [defaultRpcServer.handler=24,queue=0,port=60020] hdfs.DFSClient: >>>>> Connection failure: Failed to connect to /192.168.3.142:50010 for >> file >>>>> /hbase/data/default/image/cf493cbd9921ae6ca5e5281cc07 >>>>> 18ca2/cf/7f9618dcdeae40ddbff21165d08e0a83 for block >> BP-1157637685-192.168.3.192-1382642140917:blk_1294072754_254292863:java.io.IOException: >>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:59268, remote=/ >>>>> 192.168.3.142:5 0010, for file >> /hbase/data/default/image/cf493cbd9921ae6ca5e5281cc0718ca2/cf/7f9618dcdeae40ddbff21165d08e0a83, >>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block >>>>> 1294072754_254292863 >>>>> 2014-08-22 14:08:01,632 WARN >>>>> [defaultRpcServer.handler=0,queue=0,port=60020] >>> hdfs.BlockReaderFactory: >>>>> I/O error constructing remote block reader. >>>>> 2014-08-22 14:08:01,633 WARN >>>>> [defaultRpcServer.handler=0,queue=0,port=60020] hdfs.DFSClient: >>>> Connection >>>>> failure: Failed to connect to /192.168.3.53:50010 for file >>>>> /hbase/data/default/image/af02bc7fb404f4c054dcd64b44b0e >>>>> 2a9/cf/a8881b6e2d5d41b3b56fd34fd4ca8ffd for block >> BP-1157637685-192.168.3.192-1382642140917:blk_1294876123_255182439:java.io.IOException: >>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:60737, remote=/ >>>>> 192.168.3.53:5001 0, for file >> /hbase/data/default/image/af02bc7fb404f4c054dcd64b44b0e2a9/cf/a8881b6e2d5d41b3b56fd34fd4ca8ffd, >>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block >>>>> 1294876123_255182439 >>>>> 2014-08-22 14:11:34,192 WARN >>>>> [defaultRpcServer.handler=15,queue=0,port=60020] ipc.RpcServer: >>>>> (responseTooSlow): >> {"processingtimems":15432,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiReque >>>>> st)","client":"192.168.3.54:52838 >> ","starttimems":1408709478713,"queuetimems":0,"class":"HRegionServer","responsesize":99638,"method":"Multi"} >>>>> 2014-08-22 14:22:21,847 WARN >>>>> [defaultRpcServer.handler=28,queue=1,port=60020] >>>> hdfs.BlockReaderFactory: >>>>> I/O error constructing remote block reader. >>>>> 2014-08-22 14:22:21,848 WARN >>>>> [defaultRpcServer.handler=28,queue=1,port=60020] hdfs.DFSClient: >>>>> Connection failure: Failed to connect to /192.168.3.233:50010 for >> file >>>>> /hbase/data/default/image/1d33a251862055cd999078bbd10 >>>>> aa44c/cf/a7470351725f4cc192b6210bac9b7c44 for block >> BP-1157637685-192.168.3.192-1382642140917:blk_1294949316_255264902:java.io.IOException: >>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:44787, remote=/ >>>>> 192.168.3.233:5 0010, for file >> /hbase/data/default/image/1d33a251862055cd999078bbd10aa44c/cf/a7470351725f4cc192b6210bac9b7c44, >>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block >>>>> 1294949316_255264902 >>>>> 2014-08-22 14:23:22,628 WARN >>>>> [defaultRpcServer.handler=26,queue=2,port=60020] >>>> hdfs.BlockReaderFactory: >>>>> I/O error constructing remote block reader. >>>>> 2014-08-22 14:23:22,628 WARN >>>>> [defaultRpcServer.handler=26,queue=2,port=60020] hdfs.DFSClient: >>>>> Connection failure: Failed to connect to /192.168.3.142:50010 for >> file >>>>> /hbase/data/default/image/4a9c830a4fe006f0a6af7418164 >>>>> dd86d/cf/47d6f2c3f7054d40aed66cae9787c464 for block >> BP-1157637685-192.168.3.192-1382642140917:blk_1293459594_253612452:java.io.IOException: >>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:37660, remote=/ >>>>> 192.168.3.142:5 0010, for file >> /hbase/data/default/image/4a9c830a4fe006f0a6af7418164dd86d/cf/47d6f2c3f7054d40aed66cae9787c464, >>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block >>>>> 1293459594_253612452 >>>>> 2014-08-22 14:25:35,003 WARN >>>>> [defaultRpcServer.handler=7,queue=1,port=60020] >>> hdfs.BlockReaderFactory: >>>>> I/O error constructing remote block reader. >>>>> 2014-08-22 14:25:35,004 WARN >>>>> [defaultRpcServer.handler=7,queue=1,port=60020] hdfs.DFSClient: >>>> Connection >>>>> failure: Failed to connect to /192.168.3.51:50010 for file >>>>> /hbase/data/default/run_automaton_cache/da03f8123004be3 >>>>> 2659e1c8a51afbbf8/cf/1daf234e740f4b00889bb60e574dc79b for block >> BP-1157637685-192.168.3.192-1382642140917:blk_1294614349_254896345:java.io.IOException: >>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:53627, remote=/192 >>>>> .168.3.51:50010, for file >> /hbase/data/default/run_automaton_cache/da03f8123004be32659e1c8a51afbbf8/cf/1daf234e740f4b00889bb60e574dc79b, >>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block >>>>> 1294614349_254896345 >>>>> 2014-08-22 14:25:46,831 WARN >>>>> [defaultRpcServer.handler=19,queue=1,port=60020] >>>> hdfs.BlockReaderFactory: >>>>> I/O error constructing remote block reader. >>>>> 2014-08-22 14:25:46,832 WARN >>>>> [defaultRpcServer.handler=19,queue=1,port=60020] hdfs.DFSClient: >>>>> Connection failure: Failed to connect to /192.168.3.30:50010 for >> file >>>>> /hbase/data/default/image/1c3bab43e260ddb46a06cf04e293 >>>>> e386/cf/740f8240ac9a4abc9e5fcf6ec7df18bc for block >> BP-1157637685-192.168.3.192-1382642140917:blk_1294151564_254380545:java.io.IOException: >>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:55660, remote=/ >>>>> 192.168.3.30:500 10, for file >> /hbase/data/default/image/1c3bab43e260ddb46a06cf04e293e386/cf/740f8240ac9a4abc9e5fcf6ec7df18bc, >>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block >>>>> 1294151564_254380545 >>>>> 2014-08-22 14:28:22,395 WARN >>>>> [defaultRpcServer.handler=26,queue=2,port=60020] >>>> hdfs.BlockReaderFactory: >>>>> I/O error constructing remote block reader. >>>>> 2014-08-22 14:28:22,397 WARN >>>>> [defaultRpcServer.handler=26,queue=2,port=60020] hdfs.DFSClient: >>>>> Connection failure: Failed to connect to /192.168.3.143:50010 for >> file >>>>> /hbase/data/default/image/b4b48e20a606c431e393b674f92 >>>>> 79daf/cf/76ff3628306a48f187a285b2a21d9ac9 for block >> BP-1157637685-192.168.3.192-1382642140917:blk_1294308080_254554506:java.io.IOException: >>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:51759, remote=/ >>>>> 192.168.3.143:5 0010, for file >> /hbase/data/default/image/b4b48e20a606c431e393b674f9279daf/cf/76ff3628306a48f187a285b2a21d9ac9, >>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block >>>>> 1294308080_254554506 >>>>> 2014-08-22 14:30:29,395 WARN >>>>> [defaultRpcServer.handler=21,queue=0,port=60020] >>>> hdfs.BlockReaderFactory: >>>>> I/O error constructing remote block reader. >>>>> 2014-08-22 14:30:29,396 WARN >>>>> [defaultRpcServer.handler=21,queue=0,port=60020] hdfs.DFSClient: >>>>> Connection failure: Failed to connect to /192.168.3.30:50010 for >> file >>>>> /hbase/data/default/image/77ceb80d73c065f1a4ba2ad3b7cf >>>>> 04a2/cf/151c244177ac42b5996af0c9052660cc for block >> BP-1157637685-192.168.3.192-1382642140917:blk_1294365296_254618150:java.io.IOException: >>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:57238, remote=/ >>>>> 192.168.3.30:500 10, for file >> /hbase/data/default/image/77ceb80d73c065f1a4ba2ad3b7cf04a2/cf/151c244177ac42b5996af0c9052660cc, >>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block >>>>> 1294365296_254618150 >>>>> 2014-08-22 14:31:34,016 WARN >>>>> [defaultRpcServer.handler=6,queue=0,port=60020] >>> hdfs.BlockReaderFactory: >>>>> I/O error constructing remote block reader. >>>>> 2014-08-22 14:31:34,018 WARN >>>>> [defaultRpcServer.handler=6,queue=0,port=60020] hdfs.DFSClient: >>>> Connection >>>>> failure: Failed to connect to /192.168.3.212:50010 for file >>>>> /hbase/data/default/image/45578853a5c807919578043c7715 >>>>> 1efa/cf/40c00d49a7494929a319467d82b383da for block >> BP-1157637685-192.168.3.192-1382642140917:blk_1294930505_255244337:java.io.IOException: >>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:33973, remote=/ >>>>> 192.168.3.212:50 010, for file >> /hbase/data/default/image/45578853a5c807919578043c77151efa/cf/40c00d49a7494929a319467d82b383da, >>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block >>>>> 1294930505_255244337 >>>>> 2014-08-22 14:33:55,994 WARN >>>>> [defaultRpcServer.handler=3,queue=0,port=60020] >>> hdfs.BlockReaderFactory: >>>>> I/O error constructing remote block reader. >>>>> 2014-08-22 14:33:55,995 WARN >>>>> [defaultRpcServer.handler=3,queue=0,port=60020] hdfs.DFSClient: >>>> Connection >>>>> failure: Failed to connect to /192.168.3.52:50010 for file >>>>> /hbase/data/default/image/1f07e0160fbba29a2c26104dd5966 >>>>> 39f/cf/7e36136497fc4a5a9889797a1dfb5d3b for block >> BP-1157637685-192.168.3.192-1382642140917:blk_1294915607_255226575:java.io.IOException: >>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:41017, remote=/ >>>>> 192.168.3.52:5001 0, for file >> /hbase/data/default/image/1f07e0160fbba29a2c26104dd596639f/cf/7e36136497fc4a5a9889797a1dfb5d3b, >>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block >>>>> 1294915607_255226575 >>>>> 2014-08-22 14:44:14,301 WARN >>>>> [defaultRpcServer.handler=3,queue=0,port=60020] >>> hdfs.BlockReaderFactory: >>>>> I/O error constructing remote block reader. >>>>> 2014-08-22 14:44:14,302 WARN >>>>> [defaultRpcServer.handler=3,queue=0,port=60020] hdfs.DFSClient: >>>> Connection >>>>> failure: Failed to connec
