Mateusz,

- How many regions do you have in your table?
- What is the cluster size?
- What is the scan spec in your M/R job (time range, filters)?
- What is the RS node spec (CPUs, RAM, disks)?

Ted's link is a good starting point.

-Vlad

On Thu, Jun 25, 2015 at 10:19 AM, Ted Yu <[email protected]> wrote:

> Have you read this thread http://search-hadoop.com/m/YGbb1sOLh2W9Z9z ?
>
> Cheers
>
> On Thu, Jun 25, 2015 at 10:10 AM, Mateusz Kaczynski <[email protected]> wrote:
>
> > One of our clusters running HBase 0.98.6-cdh5.3.0 used to work (relatively)
> > smoothly until a couple of days ago, when all of a sudden jobs started
> > grinding to a halt and getting killed after reporting a massive number of
> > errors of the form:
> >
> > org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
> >     at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:410)
> >     at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:230)
> >     at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
> >     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
> >     at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
> >     at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
> >     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:415)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
> >     at org.apache.hadoop.mapred.Child.main(Child.java:262)
> >
> > The HBase regionserver logs contain a bunch of:
> >
> > WARN [B.defaultRpcServer.handler=16,queue=1,port=60020] ipc.RpcServer: B.defaultRpcServer.handler=16,queue=1,port=60020: caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null
> >
> > and:
> >
> > INFO [regionserver60020.leaseChecker] regionserver.HRegionServer: Scanner 1086 lease expired on region table,8bf8fc3cd0e842c00fb4e556bbbdcd0f,1420155383100.19f5ed7c735d33b2cf8997e0b373a1a7
> >
> > In addition, there are reports of compactions (not sure if relevant at all):
> >
> > regionserver.HStore: Completed major compaction of 3 file(s) in cf of table,fc0caf49fa871a61702fa3781e160101,1420728621152.9ccc317ca180cabde13864d4600c8693. into efd8bec4dbf54ccca5f1351bfe9890c3(size=5.9 G), total size for store is 5.9 G. This selection was in queue for 0sec, and took 1mins, 57sec to execute.
> >
> > I've adjusted the following, thinking it might be a scanner cache size
> > issue (we're dealing with docs of circa 100kb):
> >
> > hbase.rpc.timeout - 900000
> > hbase.regionserver.lease.period - 450000
> > hbase.client.scanner.timeout.period - 450000
> > hbase.client.scanner.caching - (down to) 50
> >
> > To no avail. So I stripped the HBase config in hbase-site.xml down to the
> > bare minimum, but I can still reproduce it with striking accuracy. The
> > minimalistic job reads from a table (c. 3500 regions, 17 nodes), uses
> > NullOutputFormat but doesn't write to it, and the mappers' map function
> > doesn't do anything.
> >
> > It starts pretty fast, getting through 1.75% of the specified scan in ~1
> > minute. Then it hits 2.5% in ~2m and 3% in ~3m. Then, around 4m20s, a
> > massive wave of the aforementioned OutOfOrderScannerNextException errors
> > starts pouring in, slowing the job down until it fails ~1h later.
> >
> > I checked memory and disk usage on the individual nodes - all good. The
> > open file limits are set relatively high and we're clearly not hitting
> > them.
> >
> > I'm running out of sanity and was wondering if anyone might have any
> > ideas?
> >
> > --
> > *Mateusz Kaczynski*

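As a rough sanity check on the settings quoted in the original post, the sketch below works out how much data a single scanner next() call has to move and how long the client waits before giving up. It is only a back-of-the-envelope estimate: the row size assumes "circa 100kb" means about 100 KiB per row, and the helper names (batch_bytes, to_seconds) are made up for illustration and are not part of HBase.

```python
# Settings as quoted in the original post (time values are in milliseconds).
settings = {
    "hbase.rpc.timeout": 900_000,
    "hbase.regionserver.lease.period": 450_000,
    "hbase.client.scanner.timeout.period": 450_000,
    "hbase.client.scanner.caching": 50,
}

# Assumption: "circa 100kb" per document is read as roughly 100 KiB per row.
APPROX_ROW_BYTES = 100 * 1024


def batch_bytes(caching: int, row_bytes: int) -> int:
    """Approximate payload of one next() call: rows per call times row size."""
    return caching * row_bytes


def to_seconds(millis: int) -> float:
    """Convert a millisecond setting to seconds."""
    return millis / 1000.0


if __name__ == "__main__":
    caching = settings["hbase.client.scanner.caching"]
    payload = batch_bytes(caching, APPROX_ROW_BYTES)
    print(f"rows per next() call: {caching}")
    print(f"approx. payload per next() call: {payload / (1024 * 1024):.1f} MiB")
    for key in (
        "hbase.rpc.timeout",
        "hbase.regionserver.lease.period",
        "hbase.client.scanner.timeout.period",
    ):
        print(f"{key}: {to_seconds(settings[key]):.0f} s")
```

Under those assumptions, one next() call carries a bit under 5 MiB and the scanner timeout is 450 s. That is only arithmetic on the posted numbers; it says nothing about how long the regionserver actually takes to assemble a batch, which depends on the scan spec and filters asked about above.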