If you use ZooKeeper, you can have each server register an ephemeral node and
have clients watch those nodes. An ephemeral node goes away when its server's
session ends, so clients can detect server failure as well as planned upgrades.
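
For the client side, here is a rough sketch of the retry-on-another-server
idea from this thread. It is only an illustration under a few assumptions:
the names (call_with_failover, AllServersFailed, retry_on) are made up for the
example, the server list could come from a ZooKeeper watch or from static
config, and the rpc callable stands in for whatever Thrift call you make.

```python
import random


class AllServersFailed(Exception):
    """Raised when every candidate server was tried without success."""


def call_with_failover(live_servers, rpc, retry_on=(OSError,),
                       shuffle=random.shuffle):
    """Try rpc(server) on each server in turn until one succeeds.

    live_servers: iterable of (host, port) tuples, for example the set a
        ZooKeeper watch last reported (hypothetical source; any list works).
    rpc: callable that takes one server and performs the actual call.
    retry_on: exception types that mean "this server is unreachable".
    shuffle: in-place shuffle function, injectable to make tests deterministic.
    """
    candidates = list(live_servers)
    shuffle(candidates)  # spread clients across servers
    errors = []
    for server in candidates:
        try:
            return rpc(server)
        except retry_on as exc:
            # A crashed server can still be in the list for a short while,
            # so record the failure and move on to the next candidate.
            errors.append((server, exc))
    raise AllServersFailed(errors)
```

With Thrift you would probably add the transport exception type to retry_on,
and you should only retry calls that are safe to repeat.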

On Mon, Jan 24, 2011 at 1:38 PM, Ilya Maykov <[email protected]> wrote:

> As others have mentioned: multiple servers, rolling upgrades, and clients
> that retry against a different server when an RPC call fails. You could use
> ZooKeeper to keep clients informed about the set of servers that are
> *supposed* to be up, so you don't try connecting to one that's in the
> process of upgrading. You still have to handle the randomly-crashed-server
> case, though, so ZooKeeper alone is likely not sufficient.
>
> -- Ilya
>
> On Jan 24, 2011, at 8:08, Ted Dunning <[email protected]> wrote:
>
> > Zookeeper uses a similar strategy but allows for more forceful movement
> > of connections.
> >
> > I have used a similar strategy with other services with good results.
> >
> > On Mon, Jan 24, 2011 at 7:36 AM, Bryan Duxbury <[email protected]> wrote:
> >
> >> The strategy Rapleaf uses for purposes like this is to run multiple
> >> servers. The client is aware of all the possible servers, but usually
> >> only connects to one. When a connection becomes stale, you reconnect to
> >> another server. Then, to make your deploys less painful, you just deploy
> >> one server at a time.
> >>
> >> On Mon, Jan 24, 2011 at 1:33 AM, Phillip B Oldham <[email protected]> wrote:
> >>
> >>> We have a number of Python & Java thrift services which we are
> >>> manually deploying on a regular basis; usually early in the AM while
> >>> it's "quiet" since deployment causes service interruption.
> >>>
> >>> We'd like to move to continuous deployment, so that when our commits
> >>> successfully pass all the tests on our Hudson/Jenkins CI server
> >>> something (Hudson/Jenkins, Puppet, custom scripts) will deploy the
> >>> services without human intervention. The problem is that, in this
> >>> scenario, the services may be deployed multiple times a day. Since
> >>> each deployment causes service interruption we've held back.
> >>>
> >>> So, my question is: how would one avoid service interruption during
> >>> deployment? Is there a common tool/strategy for such tasks?
> >>>
> >>> --
> >>> Phillip B Oldham
> >>> [email protected]
> >>> +44 (0) 7525 01 09 01
> >>>
> >>
>
