As others have mentioned: run multiple servers, do rolling upgrades, and have clients retry against a different server when an RPC call fails. You could use ZooKeeper to keep clients informed about the set of servers that are *supposed* to be up, so they don't try connecting to one that's in the process of upgrading. You still have to handle the randomly-crashed-server case, though, so ZooKeeper alone is likely not sufficient.
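
To make the client-side retry concrete, here is a rough sketch of the failover loop. It is not tied to Thrift: `call_with_failover`, `AllServersFailed`, and the `rpc` callable are names made up for illustration, and it assumes a failed call surfaces as an `OSError` (with Thrift you would pass the transport exception class in `retry_on` instead). It also assumes the call is safe to repeat on another server.

```python
import random


class AllServersFailed(Exception):
    """Raised when no server in the list accepted the call (hypothetical name)."""


def call_with_failover(servers, rpc, retry_on=(OSError,)):
    """Run rpc(server) against the servers in random order until one succeeds.

    servers  -- iterable of server addresses, e.g. "host:port" strings
    rpc      -- callable that takes one address and performs the call
    retry_on -- exception classes that mean "this server is unavailable"
    """
    candidates = list(servers)
    # Shuffle so that clients do not all pile onto the same server.
    random.shuffle(candidates)

    last_error = None
    for server in candidates:
        try:
            return rpc(server)
        except retry_on as exc:
            # Server is upgrading or crashed: remember why and try the next one.
            last_error = exc

    raise AllServersFailed(
        "no server available out of %d tried" % len(candidates)
    ) from last_error
```

The server list passed in could come from a static config or from ZooKeeper; either way the loop still covers a server that crashed without being deregistered.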
-- Ilya

Sent from my iPhone

On Jan 24, 2011, at 8:08, Ted Dunning <[email protected]> wrote:

> Zookeeper uses a similar strategy but allows for more forceful movement of
> connections.
>
> I have used a similar strategy with other services with good results.
>
> On Mon, Jan 24, 2011 at 7:36 AM, Bryan Duxbury <[email protected]> wrote:
>
>> The strategy Rapleaf uses for purposes like this is to run multiple
>> servers.
>> The client is aware of all the possible servers, but usually only connects
>> to one. When a connection becomes stale, you reconnect to another server.
>> Then, to make your deploys less painful, you just deploy one server at a
>> time.
>>
>> On Mon, Jan 24, 2011 at 1:33 AM, Phillip B Oldham
>> <[email protected]> wrote:
>>
>>> We have a number of Python & Java thrift services which we are
>>> manually deploying on a regular basis; usually early in the AM while
>>> it's "quiet" since deployment causes service interruption.
>>>
>>> We'd like to move to continuous deployment, so that when our commits
>>> successfully pass all the tests on our Hudson/Jenkins CI server
>>> something (Hudson/Jenkins, Puppet, custom scripts) will deploy the
>>> services without human intervention. The problem is that, in this
>>> scenario, the services may be deployed multiple times a day. Since
>>> each deployment causes service interruption we've held back.
>>>
>>> So, my question is: how would one avoid service interruption during
>>> deployment? Is there a common tool/strategy for such tasks?
>>>
>>> --
>>> Phillip B Oldham
>>> [email protected]
>>> +44 (0) 7525 01 09 01
>>>
>>
