Right, but I assume you could still see a race condition where the server just failed and zookeeper has not detected the failure yet. So zookeeper still thinks it's up, thus the client still thinks it's up, thus the client makes an RPC request that is doomed to fail :)
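
To make the retry idea concrete, here is a rough Java sketch of a
client-side failover wrapper. This is only a sketch under assumptions:
the server list would come from zookeeper or static config, and the
Function stands in for "connect and issue one thrift RPC against this
host:port" - none of these names come from a real library.

import java.util.List;
import java.util.function.Function;

// Sketch of a client that falls back to the next server when an RPC
// fails. The server list ("host:port" strings) is assumed to come from
// zookeeper or configuration.
public class FailoverCaller {
    private final List<String> servers;

    public FailoverCaller(List<String> servers) {
        this.servers = servers;
    }

    // Try each server in turn until one call succeeds. Blind retries
    // like this are only safe if the RPC is idempotent - see below.
    public <T> T call(Function<String, T> rpc) {
        RuntimeException last = null;
        for (String server : servers) {
            try {
                return rpc.apply(server);
            } catch (RuntimeException e) {
                last = e;  // connection refused, or server died mid-call
            }
        }
        throw last != null ? last : new IllegalStateException("no servers");
    }
}

In use you'd route every RPC through it, e.g.
caller.call(hostPort -> pingVia(hostPort)) with pingVia being whatever
opens a transport and invokes your generated Thrift client.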
Using zookeeper is probably still a good idea, as it gives clients an
almost-always-correct view of which servers are up and reduces the need
for RPC retries. But the need is still there - taking the argument to
its logical conclusion, a server can always crash in the middle of
processing your (retry-able) request. Of course if your requests have
side effects and are not idempotent then you face the fun task of
determining if the side effects happened or not before the server died,
but that's a whole other set of problems :)

-- Ilya

Sent from my iPhone

On Jan 24, 2011, at 13:56, Bryan Duxbury <[email protected]> wrote:

> If you use zookeeper, you can have the servers keep ephemeral nodes
> and have clients watch those nodes. Then you can detect server failure.
>
> On Mon, Jan 24, 2011 at 1:38 PM, Ilya Maykov <[email protected]> wrote:
>
>> As others have mentioned, multiple servers + rolling upgrade + clients
>> that retry to a different server when an RPC call fails. You could use
>> zookeeper to keep clients informed about the set of servers that are
>> *supposed* to be up so you don't try connecting to one that's in the
>> process of upgrading, but you still have to handle the
>> randomly-crashed-server case so zookeeper alone is likely not
>> sufficient.
>>
>> -- Ilya
>>
>> Sent from my iPhone
>>
>> On Jan 24, 2011, at 8:08, Ted Dunning <[email protected]> wrote:
>>
>>> Zookeeper uses a similar strategy but allows for more forceful
>>> movement of connections.
>>>
>>> I have used a similar strategy with other services with good results.
>>>
>>> On Mon, Jan 24, 2011 at 7:36 AM, Bryan Duxbury <[email protected]>
>>> wrote:
>>>
>>>> The strategy Rapleaf uses for purposes like this is to run multiple
>>>> servers. The client is aware of all the possible servers, but
>>>> usually only connects to one. When a connection becomes stale, you
>>>> reconnect to another server. Then, to make your deploys less
>>>> painful, you just deploy one server at a time.
>>>>
>>>> On Mon, Jan 24, 2011 at 1:33 AM, Phillip B Oldham
>>>> <[email protected]> wrote:
>>>>
>>>>> We have a number of Python & Java thrift services which we are
>>>>> manually deploying on a regular basis; usually early in the AM
>>>>> while it's "quiet" since deployment causes service interruption.
>>>>>
>>>>> We'd like to move to continuous deployment, so that when our
>>>>> commits successfully pass all the tests on our Hudson/Jenkins CI
>>>>> server something (Hudson/Jenkins, Puppet, custom scripts) will
>>>>> deploy the services without human intervention. The problem is
>>>>> that, in this scenario, the services may be deployed multiple
>>>>> times a day. Since each deployment causes service interruption
>>>>> we've held back.
>>>>>
>>>>> So, my question is: how would one avoid service interruption
>>>>> during deployment? Is there a common tool/strategy for such tasks?
>>>>>
>>>>> --
>>>>> Phillip B Oldham
>>>>> [email protected]
>>>>> +44 (0) 7525 01 09 01
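
P.S. For anyone wiring up the ephemeral-node scheme Bryan describes,
here is a minimal sketch against the ZooKeeper Java client API. The
/services/myservice path, the 5-second session timeout, and the no-op
watchers are assumptions for illustration, and the persistent parent
znodes are assumed to already exist.

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.ZooDefs.Ids;

public class ServiceRegistry {
    private final ZooKeeper zk;

    public ServiceRegistry(String connectString) throws Exception {
        // The watcher passed to the constructor only sees session-level
        // events; it is a no-op in this sketch.
        this.zk = new ZooKeeper(connectString, 5000, event -> {});
    }

    // Server side: an ephemeral node is deleted by zookeeper itself when
    // the creating session dies, so a crashed server drops out of the
    // list once its session times out.
    public void register(String hostPort) throws Exception {
        zk.create("/services/myservice/" + hostPort, new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Client side: read the live server list and leave a watch behind so
    // we get a callback when membership changes.
    public List<String> liveServers() throws Exception {
        return zk.getChildren("/services/myservice",
                event -> { /* membership changed: re-read and re-watch */ });
    }
}

Two caveats: zookeeper watches fire only once, so the client must
re-read and re-arm after every notification; and, per the race at the
top of this thread, a crashed server keeps showing up in the list until
its session expires, which is exactly why the retry loop above is still
needed.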
