Hi all,

We're running HBase 0.90.3 for a read-intensive application.
We find that after running for a long time (two weeks, a month, or longer), reads become much slower. For example, a Thrift get_rows operation that fetches 20 rows (about 4 KB per row) can take more than 2 seconds, and sometimes more than 5 seconds. When this happens, cpu_wio stays at about 10.

If we restart HBase (only the master and region servers) with stop-hbase.sh and start-hbase.sh, read speed returns to normal immediately: every get_rows operation takes less than 200 ms, and cpu_wio drops to about 2.

When the problem appears, there are no exceptions in the logs and no flushes or compactions running. Nothing looks abnormal except an occasional warning like the one below:

2011-12-27 15:50:20,307 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 52 on 60020 took 1546 ms appending an edit to hlog; editcount=1, len~=9.8k

Our cluster has 10 region servers, each with a 25 GB heap, 64% of which is used for cache. Some MapReduce jobs run continuously in another cluster to feed data into this HBase cluster. Every night we do a flush and a major compaction. There is usually no flush or compaction in the daytime.

Could anybody explain why reads become slower after HBase has been running for a long time, and why they return to normal immediately after a restart?

Any advice will be highly appreciated.

Thanks,
Yi
