The limiting factor will be how quickly the crash of a processor can be detected.
Typically this takes at least a few seconds and, depending on how costly false
positives are for you and how often your program pauses for things like garbage
collection, it might take a few tens of seconds.  Normal program exit should
be detected almost instantly, as should load shedding.
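
To make the trade-off concrete, here is a minimal simulation of a
timeout-based failure detector. It is only a sketch: the heartbeat times,
the 2 and 10 second timeouts, and the function names are made up for
illustration and are not taken from any ZooKeeper API.

```python
def suspected(last_heartbeat: float, now: float, timeout: float) -> bool:
    """Return True if the peer has been silent for longer than `timeout` seconds."""
    return now - last_heartbeat > timeout


def first_detection_time(heartbeat_times, timeout, horizon, step=0.5):
    """Scan simulated time and return the first instant the peer is suspected, or None."""
    t = 0.0
    while t <= horizon:
        past = [h for h in heartbeat_times if h <= t]
        last = max(past) if past else 0.0
        if suspected(last, t, timeout):
            return t
        t += step
    return None


# Hypothetical healthy process: heartbeats every second, but stalls for
# 5 seconds (say, a long garbage collection) between t=3 and t=8.
gc_pause = [0, 1, 2, 3, 8, 9, 10, 11, 12]
# Hypothetical process that really crashes right after t=3.
crashed = [0, 1, 2, 3]

# Short timeout: fast detection, but the GC pause looks like a crash.
false_alarm_short = first_detection_time(gc_pause, timeout=2.0, horizon=12)
detect_short = first_detection_time(crashed, timeout=2.0, horizon=30)

# Long timeout: no false alarm, but the real crash is noticed much later.
false_alarm_long = first_detection_time(gc_pause, timeout=10.0, horizon=12)
detect_long = first_detection_time(crashed, timeout=10.0, horizon=30)
```

With the short timeout the stalled-but-alive process is declared dead; with
the long one it is not, but the real crash goes unnoticed for several more
seconds.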

On Thu, Dec 16, 2010 at 10:17 PM, Samuel Rash <[email protected]> wrote:

> Can these approaches respond in under a few seconds? If a traffic source
> remains unclaimed for even a short while, we have a problem.
>

It can be made so.  This isn't so much a ZooKeeper problem as the general
problem of detecting failures when real-time guarantees are poor, for all
manner of reasons (garbage collection pauses among them).


> Also, a host may "shed" traffic manually by releasing a subset of its
> paths.  In this way, all the other hosts watching only its location does
> prevent against the herd when it dies, but how do they know when it
> releases 50/625 traffic buckets?
>

In the design I proposed, load shedding would be indicated by the
disappearance of the "I am handling this source" file.  That would
immediately notify the hosts waiting to handle that source, which would
then all try to claim it.

Response time in this scenario would be a few milliseconds.
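
The release-and-reclaim flow described above can be sketched with a small
in-memory stand-in. ClaimRegistry, its methods, and the host and bucket names
are hypothetical and exist only for this illustration; this is not the
ZooKeeper client API. In ZooKeeper the claim marker would presumably be an
ephemeral node and the notification a watch on it.

```python
class ClaimRegistry:
    """In-memory stand-in for the coordination service (hypothetical)."""

    def __init__(self):
        self.owner = {}    # source -> host currently handling it
        self.waiters = {}  # source -> callbacks of hosts waiting for it

    def claim(self, source, host):
        """Create the claim marker; only the first caller succeeds."""
        if source in self.owner:
            return False
        self.owner[source] = host
        return True

    def wait_for(self, source, callback):
        """Register a one-shot notification for when the marker disappears."""
        self.waiters.setdefault(source, []).append(callback)

    def release(self, source, host):
        """Delete the claim marker and notify every waiting host."""
        if self.owner.get(source) != host:
            return
        del self.owner[source]
        for callback in self.waiters.pop(source, []):
            callback(source)


registry = ClaimRegistry()
registry.claim("bucket-17", "host-a")

attempts = []  # (host, did the claim succeed?) in notification order


def make_waiter(host):
    def on_released(source):
        attempts.append((host, registry.claim(source, host)))
    return on_released


for host in ("host-b", "host-c"):
    registry.wait_for("bucket-17", make_waiter(host))

registry.release("bucket-17", "host-a")  # host-a sheds this bucket
```

Every waiting host is notified and tries to claim the source, but only one
claim succeeds. A host that loses would register to wait again.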


>
> I agree we might be able to make a more intelligent design that trades
> latency for watch efficiency, but the idea was that we'd use the simplest
> approach that gave us the lowest latency *if* the throughput of watches
> from zookeeper was sufficient (and it seems like it is from Mahadev's link)
>

It should work, but you have a quadratic scaling factor in your design.
That almost always becomes a problem before too long.  The quadratic term
can be avoided pretty easily.
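
A rough way to see the quadratic term, assuming it comes from every host
watching every source and the number of sources growing with the number of
hosts. The host counts and the "two standbys per source" alternative below
are assumptions made for this arithmetic only; the 625 buckets and the
release of 50 come from the question above.

```python
def watches_all_to_all(hosts: int, sources: int) -> int:
    """Total watches when every host watches every source."""
    return hosts * sources


def watches_with_standbys(sources: int, standbys_per_source: int) -> int:
    """Total watches when each source is watched by a fixed number of standbys."""
    return sources * standbys_per_source


def notifications_on_release(released: int, watchers_per_source: int) -> int:
    """Notifications sent when `released` sources are given up at once."""
    return released * watchers_per_source


# Hypothetical cluster: 25 hosts sharing 625 buckets.
small = watches_all_to_all(hosts=25, sources=625)
# Double both hosts and buckets: the watch count quadruples.
large = watches_all_to_all(hosts=50, sources=1250)

# Assumed alternative: 2 standby hosts per bucket, so growth is linear.
small_standby = watches_with_standbys(sources=625, standbys_per_source=2)
large_standby = watches_with_standbys(sources=1250, standbys_per_source=2)

# One host sheds 50 buckets.
herd = notifications_on_release(released=50, watchers_per_source=25)
herd_standby = notifications_on_release(released=50, watchers_per_source=2)
```

Doubling the cluster quadruples the all-to-all watch count, while the
fixed-standby count merely doubles.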
