Hi, there is a class, BookKeeperTools, that provides methods for the complete recovery of a node. Recovering a dead bookie involves first updating ZooKeeper with the replacement bookie and then replicating the necessary ledger entries. So, if the recovery process or the target bookie dies before the entries are actually copied, data inconsistency can result.
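To make the concern concrete, here is a minimal sketch (hypothetical names and simplified maps, not BookKeeper's actual classes) of the ordering issue: metadata is switched to the replacement bookie first, so a crash before the entry copy leaves metadata pointing at a bookie that holds no data.

```java
import java.util.*;

// Illustrative sketch only: hypothetical names, not BookKeeper's real API.
// Models the window where ZK metadata already points at the replacement
// bookie but the ledger entries have not yet been copied to it.
public class RecoveryWindowDemo {
    // ledgerId -> bookie currently recorded in (simulated) ZooKeeper
    static Map<String, String> zkMetadata = new HashMap<>();
    // bookie -> entry ids it actually stores
    static Map<String, Set<Long>> bookieEntries = new HashMap<>();

    static void recover(String ledgerId, String deadBookie, String replacement,
                        boolean crashBeforeCopy) {
        // Step 1: point metadata at the replacement bookie first.
        zkMetadata.put(ledgerId, replacement);
        if (crashBeforeCopy) {
            return; // recovery dies here: metadata updated, data never copied
        }
        // Step 2: replicate the entries (the slow part).
        bookieEntries.put(replacement,
                new HashSet<>(bookieEntries.getOrDefault(deadBookie, Collections.emptySet())));
    }

    public static void main(String[] args) {
        bookieEntries.put("bookieA", new HashSet<>(Arrays.asList(1L, 2L, 3L)));
        zkMetadata.put("ledger-1", "bookieA");

        // Crash after the metadata update but before the data copy.
        recover("ledger-1", "bookieA", "bookieB", true);

        String owner = zkMetadata.get("ledger-1");
        int stored = bookieEntries.getOrDefault(owner, Collections.emptySet()).size();
        // Metadata names bookieB, but bookieB holds zero entries.
        System.out.println("owner=" + owner + " entries=" + stored);
    }
}
```

Reversing the order (copy entries first, update metadata last) would shrink this particular window, at the cost of the copied data being orphaned if the metadata update never happens.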
Copying the data can take time, which widens the window during which a node can fail. Is this an issue that needs to be addressed? Also, this tool must be triggered manually to recover a node. Are there any plans for automatic node recovery (similar to Hadoop HDFS), where, if a machine goes down, a background process replicates its data to maintain the replication factor (quorum)? -regards Amit