Thanks a lot, German. Now I can understand its strange behavior, so we decided to use the IP addresses themselves in the server list instead of hostnames. The problem went away.
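For reference, the fix described above amounts to listing ensemble members by IP in zoo.cfg rather than by hostname (the addresses and ports below are hypothetical examples, not from this thread):

```
# Hypothetical ensemble addresses -- replace with your own.
# Listing raw IPs sidesteps the stale-DNS-cache problem discussed
# below, at the cost of editing configs if an address ever changes.
server.1=10.0.1.11:2888:3888
server.2=10.0.1.12:2888:3888
server.3=10.0.1.13:2888:3888
```

The trade-off is that an elastic IP moved to a replacement instance keeps these entries valid, whereas a hostname entry is only as fresh as each server's resolver cache.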
On Wed, Nov 6, 2013 at 8:34 PM, German Blanco <[email protected]> wrote:

> Hello again,
>
> I don't think it is a good idea to start a new thread with the same
> issue. Please continue in the latest thread.
>
> Could this be a DNS resolution caching problem?
> See https://issues.apache.org/jira/browse/ZOOKEEPER-1506
>
> The new server has the lowest sid. It is able to connect to all other
> servers, but the rest of the servers don't seem able to connect to it.
> Connections from this server to the rest are useless, since they are
> dropped because of the sid comparison that you see in the log.
>
> You could try changing the server addresses in the configuration to the
> AWS public IP addresses of the peers, just to test whether that works.
> Or try replacing the server with the highest sid; that should also work.
> Otherwise (assuming the problem is DNS resolution), the only current
> workaround that I can think of is the rolling restart, as you have noticed.
>
>
> On Wed, Nov 6, 2013 at 6:39 PM, Diego Oliveira <[email protected]> wrote:
>
>> Bae,
>>
>> Just a note: when using ZooKeeper on Amazon AWS, the instance IP
>> relocation at restart is a nightmare. One solution is to do as you said,
>> using an elastic IP, but the maximum of 5 is limiting. Another option
>> is to configure a VPC. I ran into these problems last year.
>>
>> Att,
>> Diego.
>>
>>
>> On Tue, Nov 5, 2013 at 4:18 PM, Bae, Jae Hyeon <[email protected]> wrote:
>>
>>> I am attaching the log file. Could you take a look at why the new
>>> instance cannot join the quorum?
>>>
>>>
>>> On Tue, Nov 5, 2013 at 9:52 AM, Bae, Jae Hyeon <[email protected]> wrote:
>>>
>>>> Thanks a lot, Ben.
>>>>
>>>> The reason I asked this question is that when the bad ZooKeeper EC2
>>>> instance is terminated and a new instance is launched with the previous
>>>> elastic IP, it cannot join the quorum without any specific error messages.
>>>> But when I did a rolling restart, the new instance started normally,
>>>> synchronized, and joined the quorum.
>>>>
>>>> As I understand German's response, the new instance should start,
>>>> synchronize, and join the quorum successfully without any impact on
>>>> existing instances, but it didn't. I will investigate further.
>>>>
>>>> Thank you
>>>> Best, Jae
>>>>
>>>>
>>>> On Tue, Nov 5, 2013 at 8:24 AM, Ben Hall <[email protected]> wrote:
>>>>
>>>>> Hi Jae,
>>>>>
>>>>> I wrote that article several years ago. (tbh - I hope it is not totally
>>>>> out of date by now). I agree with German's points.
>>>>>
>>>>> The issue it was solving was replacing a bad server without having to
>>>>> shut down the ensemble and without having to update the config files on
>>>>> each server. I would also add that this only works as long as the server
>>>>> names and ports are the same - iirc at the time the article was written
>>>>> we were using servers in AWS and referencing them either by assigned
>>>>> hostnames such as zookeeper-[01|11] or by elastic IPs that could be
>>>>> moved from server to server.
>>>>>
>>>>> If I understand your question correctly, if you are "adding a new
>>>>> server" such as going from 7 to 9 servers, then this approach won't
>>>>> benefit you.
>>>>>
>>>>> We also used this approach when we would upgrade the servers, but like
>>>>> German said we did it one server at a time so that the leader election
>>>>> could be natural. This allowed us to upgrade a pool of 11 servers that
>>>>> were responsible for many thousands of client connections without any
>>>>> downtime.
>>>>>
>>>>> Thanks
>>>>> Ben
>>>>>
>>>>>
>>>>> On 11/5/13 6:51 AM, "German Blanco" <[email protected]> wrote:
>>>>>
>>>>>> ... and make sure that there is no rubbish in the data dir of the new
>>>>>> server.
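German's earlier point about the sid comparison can be sketched roughly like this (a simplification of the rule ZooKeeper's QuorumCnxManager applies to leader-election connections, not its actual code): a server only keeps an incoming election connection from a peer with a higher sid; otherwise it drops the connection and is expected to dial back itself. A replaced server holding the lowest sid therefore depends entirely on its peers being able to reach it.

```python
def accepts_connection(my_sid: int, remote_sid: int) -> bool:
    """Simplified sketch of the tie-breaking rule: keep an incoming
    election connection only if the remote peer has a higher sid;
    otherwise drop it and initiate a connection the other way."""
    return remote_sid > my_sid

# The replaced server (lowest sid, say 1) dials peers 2 and 3,
# and both drop its connections ...
assert not accepts_connection(2, 1)
assert not accepts_connection(3, 1)
# ... so they must connect back to it instead, which fails if they
# still resolve its hostname to the old, dead address.
assert accepts_connection(1, 2)
```

This is why only the lowest-sid server is stuck: its outgoing connections are always rejected, and the peers' return connections go to a stale address until something (e.g. a rolling restart) forces a fresh resolution.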
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 5, 2013 at 3:49 PM, German Blanco
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hello Jae,
>>>>>>>
>>>>>>> I think that the answer to your question is "no, there is no benefit
>>>>>>> in a rolling restart in that case".
>>>>>>> If you remove a machine that was hosting a ZooKeeper server that was
>>>>>>> part of a cluster, and replace it with a new machine running a
>>>>>>> ZooKeeper server with the same software version and listening on the
>>>>>>> same IP and ports, then this new server will join the cluster,
>>>>>>> synchronize, and start working normally.
>>>>>>> I wouldn't recommend replacing more than one server at a time, and I
>>>>>>> think it is better if the new server joins while the existing quorum
>>>>>>> is stable (avoid leader elections while the new server joins, i.e.
>>>>>>> avoid restarts or disconnections of the existing servers).
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Germán.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 5, 2013 at 6:42 AM, Bae, Jae Hyeon <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I read an article
>>>>>>>>
>>>>>>>> http://www.benhallbenhall.com/2011/07/rolling-restart-in-apache-zookeeper-to-dynamically-add-servers-to-the-ensemble/
>>>>>>>>
>>>>>>>> My question is, even though failed hardware is replaced with the same
>>>>>>>> IP address, do I need to do a rolling restart to add the replaced
>>>>>>>> hardware to the quorum?
>>>>>>>>
>>>>>>>> I am using ZooKeeper ver3.4.5.
>>>>>>>>
>>>>>>>> Thank you
>>>>>>>> Best, Jae
>>
>>
>> --
>> Att.
>> Diego de Oliveira
>> System Architect
>> [email protected]
>> www.diegooliveira.com
>> Never argue with a fool -- people might not be able to tell the difference
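The DNS angle in ZOOKEEPER-1506 comes down to servers holding on to the first resolution of a peer's hostname. A quick way to check what a fresh lookup would return for a replaced peer (the hostname below is just an illustration) is to ask the resolver directly instead of relying on anything cached in a long-running process:

```python
import socket

def current_address(host: str, port: int = 3888) -> str:
    """Resolve host freshly on each call. If a replaced peer kept its
    hostname but received a new address, a fresh lookup returns the
    new address while a process-level cache may still hold the old one."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    return infos[0][4][0]  # (ip, port) tuple of the first result

# Example with a name that resolves the same everywhere:
print(current_address("localhost"))  # typically 127.0.0.1
```

Comparing this output against the address a stuck ensemble member is actually dialing (visible in its logs) would confirm whether stale resolution, rather than the sid rule alone, is keeping the replacement out of the quorum.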
