I have a small six-node cluster. One node runs the master and the NameNode, and another runs the secondary NameNode. The other 4 nodes are DataNodes and region servers. Each node has 16 GB of memory and a 4-core CPU.
My application is very simple. I use HBase to store data for a web spider. The tables are:

1. url_db: row key is MD5(url), plus other columns describing the URL. The average row length is about 1 KB.
2. out_link: row key is MD5(url1)+MD5(url2), plus anchor text and other columns. The average row length is also under 1 KB.
3. in_link: row key is MD5(url2)+MD5(url1).
4. Other tables with very few rows.

When a URL is fetched by the fetcher, a link extractor extracts all the URLs in that web page. So for each URL, I need to insert the newly found URLs into url_db, url+childurl into out_link, and childurl+url into in_link.

As for reading, there are a few MapReduce jobs that select priority URLs from url_db. They do full table scans of url_db and out_link. The MapReduce jobs run every hour and take tens of minutes to complete.

At the beginning it was fast, but once url_db grew to tens of millions of URLs, it slowed down. I found that two of the 4 nodes have very high load while the other two have low load. Using top, I see that the load average on two nodes is above 50 and on the other two it is below 1.

I tried splitting the regions and moving them manually, but after some time the load becomes unbalanced again.

I am using HBase 0.94.11 with Hadoop 1.0.0. Would the balancer in HBase 0.96/0.98 work better for me, or should I adjust some settings?
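To make the row key layout described above concrete, here is a minimal Python sketch of how the three keys are built. The function names are hypothetical, and it assumes the keys are the raw 16-byte MD5 digests of the UTF-8 encoded URLs (a hex-encoded digest would be 32 bytes per URL instead).

```python
import hashlib


def md5_bytes(url: str) -> bytes:
    """Raw 16-byte MD5 digest of a URL (assumption: UTF-8 encoding, raw bytes)."""
    return hashlib.md5(url.encode("utf-8")).digest()


def url_db_key(url: str) -> bytes:
    """Row key for url_db: MD5(url)."""
    return md5_bytes(url)


def out_link_key(url: str, child_url: str) -> bytes:
    """Row key for out_link: MD5(url) + MD5(child_url)."""
    return md5_bytes(url) + md5_bytes(child_url)


def in_link_key(url: str, child_url: str) -> bytes:
    """Row key for in_link: MD5(child_url) + MD5(url)."""
    return md5_bytes(child_url) + md5_bytes(url)


if __name__ == "__main__":
    # Hypothetical example URLs, used only to show the key sizes.
    page = "http://example.com/"
    child = "http://example.com/about"
    print(len(url_db_key(page)), len(out_link_key(page, child)))
```

With this layout, all out_link rows for one page share the same 16-byte prefix, so they sit next to each other in the table, and the same holds for all in_link rows that point at one page.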