On Jun 16, 2012, at 2:16 AM, Dan Creswell wrote:

> On 14 June 2012 15:37, Gregg Wonderly <[email protected]> wrote:
> 
>> If you use a smart proxy, and put the lease renewal call inside the smart
>> proxy, and register a listener, you can see the renewal fail.  But, you
>> still have to know what that means based on how the service and the lease
>> interact.  To get a legitimate, two way, liveness test, you really have to
>> have a conversation with the server, from the client, and have a view of
>> the endpoint activities on the server.
>> 
>> There are lots of ways to engineer this, and both leasing or transactions
>> can be part of the solution.  But, in the end, you must decide what you
>> need to know, and then think through what you are expecting vs what is
>> actually achievable using the facilities you can deploy.
>> 
>> Most of the time, the true test is merely to be able to use the
>> endpoint(s) end to end by making a call from the client to the service for
>> that liveness test.
>> 
>> Given all the possible forms of partial failure that can occur in a
>> distributed system, you can't rely on detached functionality, such as
>> leases, as the "only" way to know that something is working on the other
>> end.
>> 
> 
> Indeed, options are ultimately limited by the fact that one cannot tell the
> difference between genuine machine failure and slowness due to excessive
> load or packet loss or network breakage (there is a proof for this, think
> it's due to Lynch but...).
> 
> One often tackles this sort of problem with a Failure Detector (
> http://www.cs.cornell.edu/home/sam/FDpapers.html). Leases are somewhat
> related in that they help form a view that something is wrong, what they
> don't (and can't) tell you is _what_ is wrong. They essentially rely on a
> form of active ping (the extension of the lease) to detect failure. Most
> importantly the Lease forms a contract between client and server such that
> _both_ can make an independent assumption about failure/loss after a period
> of time.
> 
> When one detects a failure, one can attempt to diagnose more accurately
> what is broken but it's tricky. Let's say we want to connect to a server
> using a TCP-based protocol. When connecting we can fail for several reasons
> including packet loss, excessive server load or simply because the
> connection queue is too big. Deducing which of those is the culprit is much
> more a debugging exercise than something one attempts to deal with in the
> system code.
> 
> To summarise:
> 
> (1) Build a model that can eventually deduce there has been a failure of
> some sort.
> (2) Build a recovery model that, given a failure, can restore whatever
> state is required and continue to make progress.

I've found this to be the most reliable and simple way to do complex system
design.  Think about it from the perspective of keeping your APIs stateless
while allowing the service to understand its state, and how to reconfigure
itself based on API calls into it.
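
As a rough illustration of step (1) in the summary above, here is a minimal
sketch in Java of a timeout-based suspicion check.  The class name, method
names and the idea of passing the clock value in as a parameter are all my
own assumptions for the sake of the example; nothing here is a Jini or River
API.  It only tells you that something looks wrong, never what is wrong.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: suspect a peer once it has been silent for too long.
public class SuspicionSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeard = new ConcurrentHashMap<>();

    public SuspicionSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record evidence of life, e.g. a successful end-to-end call or a lease renewal.
    public void heard(String peer, long nowMillis) {
        lastHeard.put(peer, nowMillis);
    }

    // True if the peer was never heard from or has been silent longer than the
    // timeout. A crash, an overloaded server and a broken network all look the
    // same here, so callers should treat this as "suspected", not "dead".
    public boolean suspected(String peer, long nowMillis) {
        Long last = lastHeard.get(peer);
        return last == null || nowMillis - last > timeoutMillis;
    }
}
```

Once suspected() returns true, the recovery model from step (2) takes over:
the client re-sends a fully specified request rather than relying on
anything the server may or may not remember.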

What's the term I'm searching for?  The APIs should always be "successful" 
(unless the data is wrong), and the arguments should fully specify what is 
needed.  RESTful web services do this well.  I've had countless arguments about
RPC being RESTful, but I can't seem to get anyone to agree that "invoke" is the
operation.  They always say that the method name represents the operation.  I 
assert the method name is part of the data...
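
To make that last point concrete, here is a small Java sketch of what I mean
by "invoke" being the operation.  Everything in it (InvokeSketch, Invocation,
the "add" and "echo" handlers) is made up for illustration and is not taken
from any real RPC framework.  The one operation is invoke(); the method name
and the arguments travel together as the data, and the call only fails when
that data is wrong.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class InvokeSketch {
    // Hypothetical request shape: the method name is just another field of the data.
    public static final class Invocation {
        final String method;
        final List<Object> args;

        public Invocation(String method, List<Object> args) {
            this.method = method;
            this.args = args;
        }
    }

    // Stateless handlers: each result depends only on the arguments supplied.
    private static final Map<String, Function<List<Object>, Object>> HANDLERS = new HashMap<>();
    static {
        HANDLERS.put("add", a -> (Integer) a.get(0) + (Integer) a.get(1));
        HANDLERS.put("echo", a -> a.get(0));
    }

    // The single operation. It rejects a request only when the data is wrong,
    // here meaning an unknown method name.
    public static Object invoke(Invocation inv) {
        Function<List<Object>, Object> handler = HANDLERS.get(inv.method);
        if (handler == null) {
            throw new IllegalArgumentException("unknown method: " + inv.method);
        }
        return handler.apply(inv.args);
    }

    public static void main(String[] args) {
        System.out.println(invoke(new Invocation("add", List.of(2, 3))));
        System.out.println(invoke(new Invocation("echo", List.of("hello"))));
    }
}
```

Because each Invocation fully specifies what is needed, a client that
suspects a failure can simply send the same request again.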

Gregg Wonderly
