Thanks! I will give this a shot.
On Thu, Jul 31, 2014 at 1:12 PM, Bryan Beaudreault <[email protected]> wrote:

> We've done this a number of times without issue. Here's the general flow:
>
> 1) Shut down namenode and zkfc on SNN
> 2) Stop zkfc on ANN (ANN will remain active because there is no other
> zkfc instance running to fail over to)
> 3) Run hdfs zkfc -formatZK on ANN
> 4) Start zkfc on ANN (will sync up with ANN and write state to zk)
> 5) Push new configs to the new SNN, bootstrap namenode there
> 6) Start namenode and zkfc on SNN
> 7) Push updated configs to all other hdfs services (datanodes, etc.)
> 8) Restart the HBase master if you are running HBase, and the jobtracker for MR
> 9) Rolling restart datanodes
> 10) Done
>
> You'll have to handle any other consumers of DFSClient, like your own code
> or other Apache projects.
>
>
> On Thu, Jul 31, 2014 at 3:35 PM, Colin Kincaid Williams <[email protected]>
> wrote:
>
>> Hi Jing,
>>
>> Thanks for the response. I will try this out, and file an Apache jira.
>>
>> Best,
>>
>> Colin Williams
>>
>>
>> On Thu, Jul 31, 2014 at 11:19 AM, Jing Zhao <[email protected]> wrote:
>>
>>> Hi Colin,
>>>
>>> I guess currently we may have to restart almost all the
>>> daemons/services in order to swap out a standby NameNode (SBN):
>>>
>>> 1. The current active NameNode (ANN) needs to know the new SBN, since in
>>> the current implementation the SBN sends a rollEditLog RPC request to the
>>> ANN periodically (thus if a NN failover happens later, the original ANN
>>> needs to send this RPC to the correct NN).
>>> 2. It looks like the DataNode currently cannot actually refresh its NN list.
>>> Look at the code in BPOfferService:
>>>
>>>   void refreshNNList(ArrayList<InetSocketAddress> addrs) throws IOException {
>>>     Set<InetSocketAddress> oldAddrs = Sets.newHashSet();
>>>     for (BPServiceActor actor : bpServices) {
>>>       oldAddrs.add(actor.getNNSocketAddress());
>>>     }
>>>     Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs);
>>>
>>>     if (!Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty()) {
>>>       // Keep things simple for now -- we can implement this at a later date.
>>>       throw new IOException(
>>>           "HA does not currently support adding a new standby to a running DN. " +
>>>           "Please do a rolling restart of DNs to reconfigure the list of NNs.");
>>>     }
>>>   }
>>>
>>> 3. If you're using automatic failover, you also need to update the
>>> configuration of the ZKFC on the current ANN machine, since the ZKFC does
>>> graceful fencing by sending an RPC to the other NN.
>>> 4. It looks like we do not need to restart the JournalNodes for the new SBN,
>>> but I have not tried this before.
>>>
>>> Thus in general we may still have to restart all the services
>>> (except JNs) and update their configurations. But I guess this can be a
>>> rolling restart process:
>>> 1. Shut down the old SBN, bootstrap the new SBN, and start the new SBN.
>>> 2. Keep the ANN and its corresponding ZKFC running, and do a rolling restart
>>> of all the DNs to update their configurations.
>>> 3. After restarting all the DNs, stop the ANN and its ZKFC, and update their
>>> configuration. The new SBN should become active.
>>>
>>> I have not tried the steps above, so please let me know whether they
>>> work. I also think we should document the correct steps in
>>> Apache. Could you please file an Apache jira?
>>>
>>> Thanks,
>>> -Jing
>>>
>>>
>>> On Thu, Jul 31, 2014 at 9:37 AM, Colin Kincaid Williams <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm trying to swap out a standby NameNode in a QJM / HA configuration.
>>>> I believe the steps to achieve this would be something similar to:
>>>>
>>>> Use the bootstrap standby command to prep the replacement standby, or
>>>> rsync if the command fails.
>>>>
>>>> Somehow update the datanodes, so they push the heartbeat / journal to
>>>> the new standby.
>>>>
>>>> Update the xml configuration on all nodes to reflect the replacement
>>>> standby.
>>>>
>>>> Start the replacement standby.
>>>>
>>>> Use some hadoop command to refresh the datanodes to the new NameNode
>>>> configuration.
>>>>
>>>> I am not sure how to deal with the Journal switch, or if I am going
>>>> about this the right way. Can anybody give me some suggestions here?
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Colin Williams
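Jing's second point explains why the datanodes need a rolling restart (Bryan's step 9): the quoted refreshNNList code collects the NameNode addresses a DataNode currently serves, compares them with the new list, and throws on any difference. The sketch below is a dependency-free re-creation of that comparison, not the Hadoop code itself. It swaps Guava's Sets.symmetricDifference for plain java.util sets, and the host names and the NnListCheck class are hypothetical.

```java
import java.net.InetSocketAddress;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical re-creation of the NN-list check quoted from BPOfferService,
 * using only java.util instead of Guava. Not the actual Hadoop implementation.
 */
public class NnListCheck {

    /** Elements that are in exactly one of the two sets (union minus intersection). */
    static <T> Set<T> symmetricDifference(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a);
        result.addAll(b);              // union
        Set<T> both = new HashSet<>(a);
        both.retainAll(b);             // intersection
        result.removeAll(both);        // union minus intersection
        return result;
    }

    /** True if a running DataNode would reject the updated NN list, i.e. it needs a restart. */
    static boolean needsRestart(List<InetSocketAddress> current, List<InetSocketAddress> updated) {
        return !symmetricDifference(new HashSet<>(current), new HashSet<>(updated)).isEmpty();
    }

    public static void main(String[] args) {
        // createUnresolved avoids DNS lookups for these made-up hosts.
        InetSocketAddress ann = InetSocketAddress.createUnresolved("nn-a.example.com", 8020);
        InetSocketAddress oldSbn = InetSocketAddress.createUnresolved("nn-b.example.com", 8020);
        InetSocketAddress newSbn = InetSocketAddress.createUnresolved("nn-c.example.com", 8020);

        // Same NameNodes listed in a different order: sets are equal, no restart needed.
        System.out.println(needsRestart(List.of(ann, oldSbn), List.of(oldSbn, ann)));
        // Standby swapped out: the sets differ, so the DataNode must be restarted.
        System.out.println(needsRestart(List.of(ann, oldSbn), List.of(ann, newSbn)));
    }
}
```

Because the comparison works on sets, reordering the configured NameNodes is harmless, but replacing any one address makes the difference non-empty. That is why a config push alone is not enough and each DataNode has to be restarted with the new list.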

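Bryan's steps 5 and 7 (and Colin's "update the xml configuration on all nodes") boil down to pushing an hdfs-site.xml whose HA address properties point at the replacement standby. The sketch below renders those properties, assuming the replacement reuses the old standby's NameNode ID so that only its address values change. The nameservice "mycluster", the IDs nn1/nn2, the hosts, the HaConfigSketch class, and the ports (8020 RPC, 50070 HTTP, common Hadoop 2.x defaults) are all placeholders, and a real cluster may set further per-NameNode keys that would need the same update.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper: renders the per-NameNode HA address properties of
 * hdfs-site.xml for one nameservice. All IDs, hosts and ports are placeholders.
 */
public class HaConfigSketch {

    static Map<String, String> haProperties(String nameservice, Map<String, String> nnIdToHost,
                                            int rpcPort, int httpPort) {
        Map<String, String> props = new LinkedHashMap<>();  // keeps output order stable
        props.put("dfs.nameservices", nameservice);
        props.put("dfs.ha.namenodes." + nameservice, String.join(",", nnIdToHost.keySet()));
        for (Map.Entry<String, String> nn : nnIdToHost.entrySet()) {
            String suffix = nameservice + "." + nn.getKey();
            props.put("dfs.namenode.rpc-address." + suffix, nn.getValue() + ":" + rpcPort);
            props.put("dfs.namenode.http-address." + suffix, nn.getValue() + ":" + httpPort);
        }
        return props;
    }

    /** Formats each entry as a Hadoop-style <property> element. */
    static String toXml(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> p : props.entrySet()) {
            sb.append("<property>\n")
              .append("  <name>").append(p.getKey()).append("</name>\n")
              .append("  <value>").append(p.getValue()).append("</value>\n")
              .append("</property>\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> nns = new LinkedHashMap<>();
        nns.put("nn1", "nn-a.example.com");  // active NameNode, unchanged
        nns.put("nn2", "nn-c.example.com");  // nn2 now points at the replacement standby
        System.out.print(toXml(haProperties("mycluster", nns, 8020, 50070)));
    }
}
```

Keeping the NameNode ID stable is a design choice made for this sketch: dfs.ha.namenodes stays the same and only the address values differ. Every consumer still has to reload the file, which matches the restarts described in the thread.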