I wrote myself a Scanner wrapper that uses a producer/consumer queue to keep the client fed with a full buffer as much as possible. When scanning my table with scanner caching at 100 records, I see about a 24% uplift in performance (~35k records/sec with the ClientScanner and ~44k records/sec with my P/C scanner). However, when I set scanner caching to 5000, it's more of a wash compared to the standard ClientScanner: ~53k records/sec with the ClientScanner and ~60k records/sec with the P/C scanner.
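For readers who want to picture the producer/consumer arrangement described above, here is a minimal sketch. It is not the wrapper from this message, whose code is not shown. It assumes the underlying scanner can be treated as a plain `java.util.Iterator`, and the class name `PrefetchingIterator` and the `capacity` parameter are hypothetical. A background thread pulls records from the source into a bounded queue so the consumer rarely waits on a fetch.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Hypothetical prefetching wrapper (illustration only): a producer thread
 * drains the source iterator into a bounded queue while the caller consumes.
 * Null elements are not supported because ArrayBlockingQueue rejects them.
 */
final class PrefetchingIterator<T> implements Iterator<T> {
    // Sentinel placed on the queue once the source is exhausted or has failed.
    private static final Object END = new Object();

    private final BlockingQueue<Object> queue;
    private volatile Throwable failure; // set by the producer before END is queued
    private Object lookahead;           // next item already taken from the queue, or null

    PrefetchingIterator(Iterator<T> source, int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    queue.put(source.next()); // blocks while the buffer is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (RuntimeException e) {
                failure = e;
            } finally {
                try {
                    queue.put(END); // tell the consumer there is nothing more
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "prefetch-producer");
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (lookahead == null) {
            try {
                lookahead = queue.take(); // blocks only when the buffer is empty
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for records", e);
            }
        }
        if (lookahead == END) {
            if (failure != null) {
                throw new IllegalStateException("producer failed", failure);
            }
            return false;
        }
        return true;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T item = (T) lookahead;
        lookahead = null;
        return item;
    }
}
```

In this sketch the queue capacity plays the role of the buffer the consumer is kept fed from. With a larger scanner caching value, each fetch from the source already returns many records, which would be consistent with the smaller gain reported at 5000.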
I'm not sure what to make of those results. I think next I'll shut down HBase and read the HFiles directly, to see if there's a drop-off in performance between reading them directly vs. via the RegionServer.

I still think that to really solve this there needs to be a sliding window of records in flight between disk and RS, and between RS and client. I suspect there's only a single batch of records in flight between RS and client at the moment.

Sandy

On 5/23/13 8:45 AM, "Bryan Keller" <[email protected]> wrote:

>I am considering scanning a snapshot instead of the table. I believe this
>is what the ExportSnapshot class does. If I could use the scanning code
>from ExportSnapshot then I will be able to scan the HDFS files directly
>and bypass the regionservers. This could potentially give me a huge boost
>in performance for full table scans. However, it doesn't really address
>the poor scan performance against a table.
