OutOfOrderScannerNextException consistently killing mapreduce jobs

Mateusz Kaczynski Thu, 25 Jun 2015 10:12:49 -0700

One of our clusters running HBase 0.98.6-cdh5.3.0 used to work (relatively)
smoothly until a couple of days ago, when out of the sudden jobs stated
grinding to a halt and getting killed upon reporting a massive amount of
errors of form:


org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of
OutOfOrderScannerNextException: was there a rpc timeout?
at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:410)
at
org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:230)
at
org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
at
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.Child.main(Child.java:262)

HBase regionservers contain a bunch of:
WARN  [B.defaultRpcServer.handler=16,queue=1,port=60020] ipc.RpcServer:
B.defaultRpcServer.handler=16,queue=1,port=60020: caught a
ClosedChannelException, this means that the server was processing a request
but the client went away. The error message was: null

and:
INFO  [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner
1086 lease expired on region
table,8bf8fc3cd0e842c00fb4e556bbbdcd0f,1420155383100.19f5ed7c735d33b2cf8997e0b373a1a7

in addition there are reports of compactions (not sure if relevant at all):
regionserver.HStore: Completed major compaction of 3 file(s) in cf of
table,fc0caf49fa871a61702fa3781e160101,1420728621152.9ccc317ca180cabde13864d4600c8693.
into efd8bec4dbf54ccca5f1351bfe9890c3(size=5.9 G), total size for store is
5.9 G. This selection was in queue for 0sec, and took 1mins, 57sec to
execute.

I've adjusted the following, thinking it might be scanner cache size issue
(we're dealing with docs of circa 100kb):
hbase.rpc.timeout - 900000
hbase.regionserver.lease.period - 450000
hbase.client.scanner.timeout.period - 450000
hbase.client.scanner.caching - (down to) 50

To no avail. So I stripped the hbase config from hbase-site.xml to bare
minumum but I can reproduce it with a striking accuracy. The minimalistic
job reads from a table(c 3500 regions, 17 nodes), uses NullOutputFormat but
doesn't write to it, mappers's map function doesn't do anything.

It starts pretty fast getting through 1.75% of the specified scan in ~1
minute. Then hits 2.5% in ~2m, 3% in ~3m. Then around 4m20s, a massive wave
of aforementioned OutOfOrderScannerNextException starts pouring in, slowing
the job down until it fails ~1h later.

I checked the nodes memory and disk usage on the individual nodes - all
good, open file permissions are set relatively high, we're clearly not
hitting the limit.

I'm running out of sanity and was wondering if anyone might have any ideas?


-- 
*Mateusz Kaczynski*

OutOfOrderScannerNextException consistently killing mapreduce jobs

Reply via email to