There is a class BookKeeperTools that has methods for the complete recovery
of a node. Recovering a dead bookie involves first updating zk with the
replacement bookie and then replicating the necessary ledger entries. So, if
the recovery process or the target bookie dies before the entries are
actually copied, data inconsistency can result.
Copying the data can take time, which widens the window during which a node
can fail. Is this an issue that needs to be addressed?
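To make the concern concrete, here is a small illustrative sketch (not
BookKeeper code; the function and structure names are made up for this
example). It models the two-step recovery described above: metadata (the zk
side) is updated to point at the replacement bookie first, and the ledger
entries are copied second. If the process dies between the two steps, the
metadata references a bookie that does not actually hold the data.

```python
def recover(metadata, bookies, dead, replacement, crash_after_metadata=False):
    """Toy model of the two-step recovery: metadata first, data second."""
    # Step 1: update metadata (stand-in for the zk ensemble change).
    for ledger, owner in metadata.items():
        if owner == dead:
            metadata[ledger] = replacement
    if crash_after_metadata:
        return  # recovery process dies before any entries are copied
    # Step 2: replicate entries from the dead bookie to the replacement.
    for ledger, owner in metadata.items():
        if owner == replacement and ledger in bookies.get(dead, {}):
            bookies.setdefault(replacement, {})[ledger] = bookies[dead][ledger]

def missing_entries(metadata, bookies):
    """Ledgers whose metadata owner does not actually hold the entries."""
    return [l for l, o in metadata.items() if l not in bookies.get(o, {})]

metadata = {"L1": "bookie1"}
bookies = {"bookie1": {"L1": b"entry-data"}}
recover(metadata, bookies, "bookie1", "bookie2", crash_after_metadata=True)
print(missing_entries(metadata, bookies))  # L1 now points at an empty bookie2
```

In this toy model the inconsistency is only repaired once recovery is re-run
to completion; until then, reads routed via the updated metadata find no data.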
Also, this tool must be triggered manually to perform node recovery. Are
there any plans for automatic node recovery (similar to Hadoop HDFS), in
which, if a node goes down, a background process replicates its data to
maintain the replication factor (quorum)?