Hi HBase users,
I'm working with HBase for the first time and I'm trying to sort out a performance issue. HBase is the data store for a small, focused web crawl I'm performing with Apache Nutch. I'm running in pseudo-distributed mode, meaning that Nutch, HBase and Hadoop are all on the same machine. The machine is a few years old and has only 4 GB of RAM - much less than most HBase installs, I know.
When I first start my HBase processes I get about 60 seconds of fast performance. HBase reads quickly and uses a healthy portion of the CPU. After a minute or so, though, HBase slows dramatically. Reads sink to a glacial pace, and the CPU sits mostly idle.
I notice this pattern when I run Nutch - particularly during read-heavy operations - but also when I run a simple row counter from the shell. At the moment, "count 'my_table'" takes almost 4 hours to read through 500,000 rows. Reading is much faster at the start than at the end: in the first 30 seconds HBase counts 37,000 rows, but in the 30 seconds between 8:00 and 8:30 it counts only 1,000. Looking through my Ganglia report, I see a brief return to high performance around 3 hours into the count. I don't know what's causing this spike.
Can anyone suggest which configuration parameters I should change to improve read performance? Or which reference materials I should consult to better understand the problem? Again, I'm totally new to HBase.
I'm using HBase 0.90.4 and Hadoop 1.2.2. HBase has a heap size of 1.5 GB.
Here's a Ganglia report covering the 4 hours of "count 'my_table'": http://imgur.com/Aa3eukZ
Please let me know if I can provide any more information.
Many thanks,
Dave
