Hi, I am trying to read all the data out of an HBase table using a scan, and it is extremely slow.
Here are some characteristics of the data:

1. The total table size is tiny (~200 MB).
2. The table has ~100 rows and ~200,000 columns in a SINGLE column family, so each cell is ~10 bytes and each row is ~2 MB.
3. Scanning the whole table currently takes ~400 s (both in a distributed setting with 12 nodes or so and on a single node), i.e. roughly 4-5 s per row.
4. The row keys are unique 8-byte cryptographic hashes of sequential numbers.
5. The scanner is set to fetch a FULL row at a time (scan.setBatch) and to fetch ~100 MB of data at a time (scan.setCaching); see the sketch at the end of this message.
6. Changing the caching size seems to have no effect on the total scan time at all.
7. The column family is set up to keep a single version of each cell, with no compression and no block cache.

Am I missing something? Is there a way to optimize this?

I guess a more general question I have is whether HBase is a good datastore for storing many medium-sized (~50 GB), dense datasets with lots of columns, when a lot of the queries require full table scans.

Thanks!
Gurjeet
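P.S. For reference, here is a minimal sketch of how the scan is set up. The table name "mytable" is a placeholder, and this uses the Connection/Table client API; on an older HTable-based client the calls are essentially the same.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;

    public class FullTableScan {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("mytable"))) {  // placeholder name

                Scan scan = new Scan();
                scan.setBatch(200_000);      // max columns per Result; >= the column count, so a full row comes back at once
                scan.setCaching(50);         // rows buffered per RPC; setCaching counts rows, and 50 rows x ~2 MB is the ~100 MB per fetch
                scan.setCacheBlocks(false);  // skip the block cache, matching the column-family settings

                long cells = 0;
                long start = System.currentTimeMillis();
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        cells += row.rawCells().length;  // touch every cell so the whole row is actually read
                    }
                }
                System.out.println("Read " + cells + " cells in "
                        + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }

The timing I quoted above is measured around the scanner loop like this, so it includes RPC and deserialization but no processing of the cells.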
