If you use zookeeper, you can have each server keep an ephemeral node and have clients watch those nodes. When a server goes away (crash or shutdown for a deploy), its session expires, the ephemeral node disappears, and the watching clients find out about the failure right away.
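Roughly something like this (Java, plain zookeeper client; the /servers parent path and the host:port payload are just placeholders, and /servers has to be created as a normal persistent node beforehand -- you'd obviously want real error handling):

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class ZkMembership {
    // Server side: register an ephemeral node; it vanishes automatically
    // when the server's session dies (crash or shutdown for a deploy).
    static void register(ZooKeeper zk, String hostPort) throws Exception {
        zk.create("/servers/" + hostPort, hostPort.getBytes(),
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Client side: fetch the live server list and leave a watch so we are
    // notified the moment a node appears or disappears.
    static List<String> liveServers(final ZooKeeper zk) throws Exception {
        return zk.getChildren("/servers", new Watcher() {
            public void process(WatchedEvent event) {
                try {
                    liveServers(zk); // re-read and re-arm the watch; update the client's server list here
                } catch (Exception e) {
                    // reconnect / back off in real code
                }
            }
        });
    }
}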
On Mon, Jan 24, 2011 at 1:38 PM, Ilya Maykov <[email protected]> wrote:
> As others have mentioned, multiple servers + rolling upgrade + clients that
> retry to a different server when an RPC call fails. You could use zookeeper
> to keep clients informed about the set of servers that are *supposed* to be
> up so you don't try connecting to one that's in the process of upgrading,
> but you still have to handle the randomly-crashed-server case so zookeeper
> alone is likely not sufficient.
>
> -- Ilya
>
> Sent from my iPhone
>
> On Jan 24, 2011, at 8:08, Ted Dunning <[email protected]> wrote:
>
> > Zookeeper uses a similar strategy but allows for more forceful movement of
> > connections.
> >
> > I have used a similar strategy with other services with good results.
> >
> > On Mon, Jan 24, 2011 at 7:36 AM, Bryan Duxbury <[email protected]> wrote:
> >
> >> The strategy Rapleaf uses for purposes like this is to run multiple
> >> servers.
> >> The client is aware of all the possible servers, but usually only connects
> >> to one. When a connection becomes stale, you reconnect to another server.
> >> Then, to make your deploys less painful, you just deploy one server at a
> >> time.
> >>
> >> On Mon, Jan 24, 2011 at 1:33 AM, Phillip B Oldham
> >> <[email protected]> wrote:
> >>
> >>> We have a number of Python & Java thrift services which we are
> >>> manually deploying on a regular basis; usually early in the AM while
> >>> it's "quiet" since deployment causes service interruption.
> >>>
> >>> We'd like to move to continuous deployment, so that when our commits
> >>> successfully pass all the tests on our Hudson/Jenkins CI server
> >>> something (Hudson/Jenkins, Puppet, custom scripts) will deploy the
> >>> services without human intervention. The problem is that, in this
> >>> scenario, the services may be deployed multiple times a day. Since
> >>> each deployment causes service interruption we've held back.
> >>>
> >>> So, my question is: how would one avoid service interruption during
> >>> deployment? Is there a common tool/strategy for such tasks?
> >>>
> >>> --
> >>> Phillip B Oldham
> >>> [email protected]
> >>> +44 (0) 7525 01 09 01
> >>>
> >>
>
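To flesh out Bryan's suggestion above a little: the client-side part can be as simple as iterating over the known servers and moving on when a call fails. A rough sketch with the Thrift Java bindings -- MyService.Client and doSomething() stand in for whatever your thrift compiler generates, and the hard-coded address list is just for illustration:

import java.util.Arrays;
import java.util.List;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class FailoverClient {
    // The client knows all possible servers but talks to one at a time.
    private final List<String> hosts = Arrays.asList("app1:9090", "app2:9090", "app3:9090");

    public String callWithFailover(String arg) throws TException {
        TException last = null;
        for (String host : hosts) {
            String[] parts = host.split(":");
            TTransport transport = new TSocket(parts[0], Integer.parseInt(parts[1]));
            try {
                transport.open();
                // MyService.Client is the thrift-generated client for your service.
                MyService.Client client = new MyService.Client(new TBinaryProtocol(transport));
                return client.doSomething(arg);
            } catch (TException e) {
                last = e; // stale or refused connection: fall through and try the next server
            } finally {
                transport.close();
            }
        }
        if (last != null) {
            throw last; // every server failed
        }
        throw new TException("no servers configured");
    }
}

With that in place, a rolling deploy (one server at a time) just looks like a few failed calls that get retried elsewhere.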
