Sijie,

Yes, that's precisely what I meant: we're running separate autorecovery processes, not daemons on all nodes.

The autorecovery processes run quietly until I stop a node. As soon as I stop one, they are flooded with log entries like the following, where the stopped node is 10.3.2.56:

2016-11-22 17:34:06,085 - ERROR - [bookkeeper-io-1-1:PerChannelBookieClient$2@284] - Could not connect to bookie: [id: 0xd3b0c759, L:/10.3.3.42:45164]/10.3.2.56:3181, current state CONNECTING : java.net.ConnectException: syscall:getsockopt(...): /10.3.2.56:3181

These come in waves of thousands while some data movement seems to be occurring, but it's really weird that it keeps trying to connect to the failed node. Couldn't it realize the node is down because it's no longer listed as available in ZooKeeper?

We also see a few of these logs, but far fewer than the previous ones:

2016-11-23 14:28:01,661 - WARN - [ReplicationWorker:RackawareEnsemblePlacementPolicy@543] - Failed to choose a bookie: excluded [<Bookie:10.3.2.57:3181>, <Bookie:10.3.2.195:3181> , <Bookie:10.3.2.158:3181>], fallback to choose bookie randomly from the cluster.

The cluster currently has 6 nodes and, as I said before, we're using ensemble size 3, write quorum 3 and ack quorum 2.

Thanks,
Sebastian

On Tue, Nov 22, 2016 at 2:10 PM Sijie Guo <[email protected]> wrote:

I think what Sebastian said is that manual recovery didn't even work. This seems a bit strange to me.

The autorecovery will check whether the bookie is available or not. After that, it should re-replicate the data from other nodes in the ensemble. This seems to indicate something is broken.

Sebastian, can you point us to some logs?

Sijie

On Nov 19, 2016 9:46 AM, "Rithin Shetty" <[email protected]> wrote:

A few things to note:

Make sure 'autoRecoveryDaemonEnabled' is set to true in all the bookie conf files; by default it is false. Otherwise recovery will not work.

The auto recovery process tries to make sure that the data doesn't exist on the source node before replicating to the destination. That might be the reason why it is trying to talk to the source node.

--Rithin

On Fri, Nov 18, 2016 at 12:00 PM, Sebastián Schepens <[email protected]> wrote:

Hi guys,

I'm running into some issues while trying to recover a decommissioned node. Both the recovery command and the autorecovery processes fail trying to connect to the failing node, which seems reasonable because the node is down. But I don't get why they're trying to connect to that node; according to the documentation, they should pull ledger data from the other nodes in the ensemble (3) and replicate it. The recovery command also seems to completely ignore the destination node given as the third argument.

Could someone give us some help?

Thanks,
Sebastian
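
To make the expectation in this thread concrete, here is a minimal Python sketch of why the remaining ensemble members should be enough to re-replicate from. It assumes entries are striped round-robin across the ensemble, which is a simplified model and not BookKeeper's actual code. The function names and the ensemble membership are hypothetical; the thread only tells us that 10.3.2.56 is the stopped node and that ensemble size is 3, write quorum 3 and ack quorum 2.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies expected to hold an entry, assuming round-robin striping."""
    size = len(ensemble)
    start = entry_id % size
    return [ensemble[(start + i) % size] for i in range(write_quorum)]


def surviving_replicas(entry_id, ensemble, write_quorum, failed):
    """Bookies in the entry's write set other than the failed one."""
    return [b for b in write_set(entry_id, ensemble, write_quorum) if b != failed]


# Hypothetical ensemble for illustration; only the failed address is from the thread.
FAILED = "10.3.2.56:3181"
ENSEMBLE = [FAILED, "10.3.2.57:3181", "10.3.2.195:3181"]
WRITE_QUORUM = 3

for entry_id in range(3):
    readable_from = surviving_replicas(entry_id, ENSEMBLE, WRITE_QUORUM, FAILED)
    print(entry_id, readable_from)
```

Under these assumptions, with write quorum equal to ensemble size, every entry is expected to have copies on both surviving bookies, so in this model the stopped node is never the only source for an entry.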
