Have you read this thread http://search-hadoop.com/m/YGbb1sOLh2W9Z9z ?
Cheers On Thu, Jun 25, 2015 at 10:10 AM, Mateusz Kaczynski <[email protected]> wrote: > One of our clusters running HBase 0.98.6-cdh5.3.0 used to work (relatively) > smoothly until a couple of days ago, when out of the sudden jobs stated > grinding to a halt and getting killed upon reporting a massive amount of > errors of form: > > org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of > OutOfOrderScannerNextException: was there a rpc timeout? > at > org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:410) > at > > org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:230) > at > > org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138) > at > > org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483) > at > > org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76) > at > > org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) > at org.apache.hadoop.mapred.Child$4.run(Child.java:268) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642) > at org.apache.hadoop.mapred.Child.main(Child.java:262) > > HBase regionservers contain a bunch of: > WARN [B.defaultRpcServer.handler=16,queue=1,port=60020] ipc.RpcServer: > B.defaultRpcServer.handler=16,queue=1,port=60020: caught a > ClosedChannelException, this means that the server was processing a request > but the client went away. The error message was: null > > and: > INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner > 1086 lease expired on region > > table,8bf8fc3cd0e842c00fb4e556bbbdcd0f,1420155383100.19f5ed7c735d33b2cf8997e0b373a1a7 > > in addition there are reports of compactions (not sure if relevant at all): > regionserver.HStore: Completed major compaction of 3 file(s) in cf of > > table,fc0caf49fa871a61702fa3781e160101,1420728621152.9ccc317ca180cabde13864d4600c8693. > into efd8bec4dbf54ccca5f1351bfe9890c3(size=5.9 G), total size for store is > 5.9 G. This selection was in queue for 0sec, and took 1mins, 57sec to > execute. > > I've adjusted the following, thinking it might be scanner cache size issue > (we're dealing with docs of circa 100kb): > hbase.rpc.timeout - 900000 > hbase.regionserver.lease.period - 450000 > hbase.client.scanner.timeout.period - 450000 > hbase.client.scanner.caching - (down to) 50 > > To no avail. So I stripped the hbase config from hbase-site.xml to bare > minumum but I can reproduce it with a striking accuracy. The minimalistic > job reads from a table(c 3500 regions, 17 nodes), uses NullOutputFormat but > doesn't write to it, mappers's map function doesn't do anything. > > It starts pretty fast getting through 1.75% of the specified scan in ~1 > minute. Then hits 2.5% in ~2m, 3% in ~3m. Then around 4m20s, a massive wave > of aforementioned OutOfOrderScannerNextException starts pouring in, slowing > the job down until it fails ~1h later. > > I checked the nodes memory and disk usage on the individual nodes - all > good, open file permissions are set relatively high, we're clearly not > hitting the limit. > > I'm running out of sanity and was wondering if anyone might have any ideas? > > > -- > *Mateusz Kaczynski* >
