One of our clusters running HBase 0.98.6-cdh5.3.0 used to work (relatively) smoothly until a couple of days ago, when out of the sudden jobs stated grinding to a halt and getting killed upon reporting a massive amount of errors of form:
org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout? at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:410) at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:230) at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483) at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76) at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at org.apache.hadoop.mapred.Child$4.run(Child.java:268) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642) at org.apache.hadoop.mapred.Child.main(Child.java:262) HBase regionservers contain a bunch of: WARN [B.defaultRpcServer.handler=16,queue=1,port=60020] ipc.RpcServer: B.defaultRpcServer.handler=16,queue=1,port=60020: caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null and: INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner 1086 lease expired on region table,8bf8fc3cd0e842c00fb4e556bbbdcd0f,1420155383100.19f5ed7c735d33b2cf8997e0b373a1a7 in addition there are reports of compactions (not sure if relevant at all): regionserver.HStore: Completed major compaction of 3 file(s) in cf of table,fc0caf49fa871a61702fa3781e160101,1420728621152.9ccc317ca180cabde13864d4600c8693. into efd8bec4dbf54ccca5f1351bfe9890c3(size=5.9 G), total size for store is 5.9 G. This selection was in queue for 0sec, and took 1mins, 57sec to execute. I've adjusted the following, thinking it might be scanner cache size issue (we're dealing with docs of circa 100kb): hbase.rpc.timeout - 900000 hbase.regionserver.lease.period - 450000 hbase.client.scanner.timeout.period - 450000 hbase.client.scanner.caching - (down to) 50 To no avail. So I stripped the hbase config from hbase-site.xml to bare minumum but I can reproduce it with a striking accuracy. The minimalistic job reads from a table(c 3500 regions, 17 nodes), uses NullOutputFormat but doesn't write to it, mappers's map function doesn't do anything. It starts pretty fast getting through 1.75% of the specified scan in ~1 minute. Then hits 2.5% in ~2m, 3% in ~3m. Then around 4m20s, a massive wave of aforementioned OutOfOrderScannerNextException starts pouring in, slowing the job down until it fails ~1h later. I checked the nodes memory and disk usage on the individual nodes - all good, open file permissions are set relatively high, we're clearly not hitting the limit. I'm running out of sanity and was wondering if anyone might have any ideas? -- *Mateusz Kaczynski*
