Right, but I assume you could still see a race condition where the server
has just failed and ZooKeeper has not detected the failure yet. ZooKeeper
still thinks it's up, so the client still thinks it's up, so the client
makes an RPC request that is doomed to fail :)
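
(For concreteness, the ephemeral-node setup Bryan describes below looks
roughly like this with the plain ZooKeeper Java API. The /servers path,
the server ID, and the session timeout are made-up values, and it assumes
the /servers parent node already exists.)

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch only: path, IDs, and timeout are made up for illustration.
    public class ZkMembership {

        // Server side: register an ephemeral node. It disappears when the
        // session dies, but only after the session timeout expires.
        public static void register(ZooKeeper zk, String serverId)
                throws Exception {
            zk.create("/servers/" + serverId, new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }

        // Client side: read the membership and leave a watch. Watches are
        // one-shot, so the handler must call getChildren again to stay
        // current.
        public static List<String> watchServers(ZooKeeper zk)
                throws Exception {
            return zk.getChildren("/servers",
                    event -> System.out.println("membership changed: " + event));
        }
    }

The node only vanishes once the session timeout expires, which is exactly
the detection-lag window above.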

Using ZooKeeper is probably still a good idea, as it gives clients an
almost-always-correct view of which servers are up and reduces the need for
RPC retries. But the need is still there - taking the argument to its
logical conclusion, a server can always crash in the middle of processing
your (retryable) request.
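
Something like the following is usually enough on the client side. The
RpcCall interface and the fixed port are placeholders of mine, not
anything from Thrift itself:

    import java.util.List;
    import org.apache.thrift.TException;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.protocol.TProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    // Sketch only: rotate through the known servers until a call succeeds.
    public class RetryingCaller {

        public interface RpcCall<T> {
            T invoke(TProtocol protocol) throws TException;
        }

        public static <T> T callWithRetry(List<String> hosts, int port,
                                          RpcCall<T> call, int maxAttempts)
                throws TException {
            TException last = null;
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                // Pick the next server; it may be mid-upgrade or crashed.
                TTransport transport =
                    new TSocket(hosts.get(attempt % hosts.size()), port);
                try {
                    transport.open();
                    try {
                        return call.invoke(new TBinaryProtocol(transport));
                    } finally {
                        transport.close();
                    }
                } catch (TException e) {
                    last = e; // connection refused/dropped: try the next host
                }
            }
            throw last;
        }
    }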

Of course, if your requests have side effects and are not idempotent, then
you face the fun task of determining whether the side effects happened
before the server died, but that's a whole other set of problems :)
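
The usual workaround is to tag each mutating request with a
client-generated ID so the server can deduplicate retries. Very roughly
(the names and the unbounded in-memory map are just for illustration):

    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Sketch only: the client picks a UUID per logical request and reuses
    // it on every retry; the server caches results by that ID, so a replay
    // whose side effect already ran just returns the earlier result. A
    // real version would bound the map or persist it.
    public class DedupingHandler {

        private final ConcurrentMap<UUID, String> completed =
            new ConcurrentHashMap<>();

        public String applySideEffect(UUID requestId, String payload) {
            return completed.computeIfAbsent(requestId,
                    id -> doSideEffect(payload));
        }

        private String doSideEffect(String payload) {
            // The actual non-idempotent work would go here.
            return "applied:" + payload;
        }
    }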

-- Ilya

Sent from my iPhone

On Jan 24, 2011, at 13:56, Bryan Duxbury <[email protected]> wrote:

> If you use zookeeper, you can have the servers keep ephemeral nodes and have
> clients watch those nodes. Then you can detect server failure.
> 
> On Mon, Jan 24, 2011 at 1:38 PM, Ilya Maykov <[email protected]> wrote:
> 
>> As others have mentioned, multiple servers + rolling upgrade + clients that
>> retry to a different server when an RPC call fails. You could use ZooKeeper
>> to keep clients informed about the set of servers that are *supposed* to be
>> up so you don't try connecting to one that's in the process of upgrading,
>> but you still have to handle the randomly-crashed-server case, so ZooKeeper
>> alone is likely not sufficient.
>> 
>> -- Ilya
>> 
>> Sent from my iPhone
>> 
>> On Jan 24, 2011, at 8:08, Ted Dunning <[email protected]> wrote:
>> 
>>> ZooKeeper uses a similar strategy but allows for more forceful movement
>>> of connections.
>>> 
>>> I have used a similar strategy with other services with good results.
>>> 
>>> On Mon, Jan 24, 2011 at 7:36 AM, Bryan Duxbury <[email protected]>
>>> wrote:
>>> 
>>>> The strategy Rapleaf uses for purposes like this is to run multiple
>>>> servers. The client is aware of all the possible servers, but usually
>>>> only connects to one. When a connection becomes stale, you reconnect
>>>> to another server. Then, to make your deploys less painful, you just
>>>> deploy one server at a time.
>>>> 
>>>> On Mon, Jan 24, 2011 at 1:33 AM, Phillip B Oldham
>>>> <[email protected]> wrote:
>>>> 
>>>>> We have a number of Python & Java Thrift services which we are
>>>>> manually deploying on a regular basis, usually early in the AM while
>>>>> it's "quiet", since deployment causes service interruption.
>>>>> 
>>>>> We'd like to move to continuous deployment, so that when our commits
>>>>> successfully pass all the tests on our Hudson/Jenkins CI server,
>>>>> something (Hudson/Jenkins, Puppet, custom scripts) will deploy the
>>>>> services without human intervention. The problem is that, in this
>>>>> scenario, the services may be deployed multiple times a day. Since
>>>>> each deployment causes service interruption, we've held back.
>>>>> 
>>>>> So, my question is: how would one avoid service interruption during
>>>>> deployment? Is there a common tool/strategy for such tasks?
>>>>> 
>>>>> --
>>>>> Phillip B Oldham
>>>>> [email protected]
>>>>> +44 (0) 7525 01 09 01
>>>>> 
>>>> 
>> 
