We have been preparing to enable replication between two large clusters. For the past couple of weeks, replication has been enabled via hbase-site.xml, but the replication state has been false (set false by issuing a stop_replication command).
The master is no longer cleaning any logs from /hbase/.oldlogs. The directory reached more than 2 million logs using 140TB before we noticed that the HBase master heap was growing (about 2GB in use by the LogCleaner, from the FileStatus objects of this directory).

Looking at ReplicationLogCleaner, the first check it makes is whether replication is stopped. If it is, the cleaner prevents all logs from being cleaned, which can lead to the master going OOM or HDFS running out of space. I would have expected that once replication is stopped, logs would be allowed to be cleaned and expired.

Looking through JIRAs, I suspect this is the cause of https://issues.apache.org/jira/browse/HBASE-3489

I believe our fix will be to start_replication with no peers enabled, but I think the ReplicationLogCleaner should be changed. Anyone else care to weigh in with an opinion? (JD?)

There's also some discussion about the "kill switch" that may be relevant here: https://issues.apache.org/jira/browse/HBASE-5222

Dave
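
For illustration, the cleaner behaviour described above can be modelled roughly as follows. This is a simplified Python sketch, not the actual HBase Java code: the function names, their parameters, and the queued_logs set (standing in for the logs still queued for a replication peer) are all hypothetical. It only contrasts the check as described in this message with the behaviour I would have expected.

```python
from typing import AbstractSet


def is_log_deletable_as_described(replication_running: bool,
                                  log_name: str,
                                  queued_logs: AbstractSet[str]) -> bool:
    """Model of the behaviour described above (hypothetical, simplified)."""
    # First check: if replication is stopped, nothing is ever deletable,
    # so .oldlogs grows without bound.
    if not replication_running:
        return False
    # Otherwise keep only logs that a peer still needs.
    return log_name not in queued_logs


def is_log_deletable_expected(replication_running: bool,
                              log_name: str,
                              queued_logs: AbstractSet[str]) -> bool:
    """Model of the expected behaviour (an assumption, not a patch)."""
    # If replication is stopped, let the logs be cleaned and expired.
    if not replication_running:
        return True
    return log_name not in queued_logs


if __name__ == "__main__":
    queued = {"hlog.0001"}
    for running in (False, True):
        for name in ("hlog.0001", "hlog.0002"):
            print(running, name,
                  is_log_deletable_as_described(running, name, queued),
                  is_log_deletable_expected(running, name, queued))
```

The two functions differ only in the stopped case. With replication running, both keep logs that are still queued and release the rest.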
