I was experiencing aborted scans under certain conditions. In these cases I was simply missing so many rows that only a fraction of them was read, without any warning. After a lot of testing I was able to pinpoint and reproduce the error when scanning over a single region, single column family, single store file; so really just a single (major-compacted) storefile. I scan over this region using a single Scan in a local jobtracker context (so not MapReduce, although that shows exactly the same behaviour). Finally, I noticed that the number of input rows depends on the hbase.client.scanner.caching property. See the following example runs that scan over this region with a specific start and stop key:
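For reference, the reproduction is essentially a plain client-side scan and count. A minimal sketch against the 0.90 client API (the table name, column family, and row keys below are made up; the caching value is set in code here, equivalent to the -D flag):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Same effect as passing -Dhbase.client.scanner.caching=... on the command line.
        conf.setInt("hbase.client.scanner.caching", 1241);

        HTable table = new HTable(conf, "mytable");        // hypothetical table name
        Scan scan = new Scan(Bytes.toBytes("startkey"),    // hypothetical start key
                             Bytes.toBytes("stopkey"));    // hypothetical stop key
        scan.addFamily(Bytes.toBytes("cf"));               // hypothetical column family

        long inputRows = 0;
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                inputRows++;
            }
        } finally {
            scanner.close();
        }
        table.close();
        System.out.println("inputrows=" + inputRows);
    }
}
```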
-Dhbase.client.scanner.caching=1      inputrows=1506
-Dhbase.client.scanner.caching=10000  inputrows=1240
-Dhbase.client.scanner.caching=1240   inputrows=1506
-Dhbase.client.scanner.caching=1241   inputrows=1240

This is weird, huh? So setting the cache to 1241 in this case aborts the scan silently. Removing the stop row yields the same amount. Setting the caching to 1 with no stop row yields all rows (several hundreds of thousands). Neither the client nor the regionserver logs any warning whatsoever.

I had hbase.client.scanner.max.result.size set to 90100100. After removing this property it all works like a charm! All rows are properly read, regardless of hbase.client.scanner.caching. As an extra verification I checked the regionserver for the responseTooLarge warnings I would expect without this property, and they are indeed there:

2012-07-25 11:46:52,889 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020, responseTooLarge for: next(-1937592840574159040, 10000) from x.x.x.x:39398: Size: 338.1m
2012-07-25 11:47:14,359 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020, responseTooLarge for: next(-1937592840574159040, 10000) from x.x.x.x:39407: Size: 186.6m

So, does anyone know what this could be? I am willing to debug this behaviour at the regionserver level, but before I do I want to make sure I am not running into something that has already been solved. This is on hbase-0.90.6-cdh3u4, using Snappy compression.
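For completeness, the property in question was presumably set along these lines in the client-side hbase-site.xml; removing this block is what made all rows come through again:

```xml
<!-- hbase-site.xml (client side): cap on the total size of a scanner next() response.
     With this property present, scans were silently truncated; removing it fixed them. -->
<property>
  <name>hbase.client.scanner.max.result.size</name>
  <value>90100100</value>
</property>
```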
