Stuart Bishop <stuart.bis...@canonical.com> writes:

> I find destroy-service/remove-application is particularly problematic,
> because the doomed units don't know they are being destroyed but rather are
> informed about departing one relation at a time (which is inherently racy,
> because the units the doomed service is related to will process their
> relation-departed hooks almost immediately and stop talking to the doomed
> service, while the doomed service still thinks it can access their
> resources while it falls apart one piece at a time).

Yes. I noticed this issue too, and I think it's a valid Juju bug. I'm
not sure what the best fix would be, but it probably involves some
streamlining of the stop-unit logic (and associated hook sequencing).

[...]
> One of the reasons test suites are currently flaky is that there are
> race conditions we have no reasonable way of solving, such as a
> database restarting itself while a hook on another unit is attempting
> to use it.

In theory this should be rule 0 of programming: handle errors (such as
your code failing to talk to a database). This is of course easier said
than done, but it's been the case forever.

Blind retries are by no means a silver bullet, simply because (at least
conceptually) there's no way around looking at the actual issue at hand
when deciding how to handle it (e.g. whether to retry).

If you are 100% confident that your code is "idempotent" (for some
definition that makes sense in your case), a blind retry mechanism might
simply mean that your code will take a bit longer to bubble up a failure
(for instance because it's stubbornly retrying a failure condition that
has no way out).

However, it's often difficult to judge whether some piece of logic is
really idempotent (especially if the logic encompasses a lot of moving
parts, like a hook run, as opposed to some granular API call). So there's
always *some* risk that a blind retry could do something unwanted or
even harmful.

If you want to be perfectly safe you should look at the failure at hand
and make sure you understand it before doing anything.
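
To illustrate, here is a minimal Python sketch of a retry helper that
looks at the failure before deciding whether to retry. The exception
classes and the function name are hypothetical, made up for this
example; they are not part of Juju or of any charm library, and a real
hook would need its own way of classifying failures.

```python
import time


class TransientError(Exception):
    """Hypothetical: a failure that may clear by itself (e.g. a database restarting)."""


class PermanentError(Exception):
    """Hypothetical: a failure that retrying cannot fix (e.g. bad credentials)."""


def call_with_selective_retry(operation, attempts=3, delay=1.0, sleep=time.sleep):
    """Run operation(), retrying only failures classified as transient."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: bubble the failure up
            sleep(delay * attempt)  # simple linear back-off before the next try
        # PermanentError (and any other exception) is not caught here,
        # so it propagates immediately without being retried.
```

This only moves the problem into deciding which failures are transient,
and it still assumes that operation() is safe to run more than once.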

YMMV as to how often this argument actually matters in practice (e.g.
"blind retry is good enough for me").

This is by no means an easy topic and it's one of the hard parts of
programming, as exemplified by this recent juju-dev thread:

https://lists.ubuntu.com/archives/juju-dev/2016-October/006091.html

It's also an area where some standardization of failure modes in
distributed systems would probably help in developing automation that is
better than blind retry, or even some form of AI/learning (the HTTP
spec and RESTful architectures were arguably designed with that in
mind).
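
As a rough sketch of what standardized failure modes buy you, HTTP lets
a client decide mechanically whether a retry is reasonable. The set of
idempotent methods below comes from the HTTP spec; the set of status
codes treated as transient is an assumption of this example (a common
convention, not something the spec mandates), and should_retry is a
hypothetical name.

```python
# Methods the HTTP spec (RFC 9110) defines as idempotent.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})

# Assumption: statuses this sketch treats as likely transient.
TRANSIENT_STATUSES = frozenset({408, 429, 502, 503, 504})


def should_retry(method, status):
    """Retry only an idempotent request that failed with a transient status."""
    return method.upper() in IDEMPOTENT_METHODS and status in TRANSIENT_STATUSES
```

With this deliberately conservative policy a GET answered with 503 is
retried, while a POST answered with 503 or a GET answered with 404 is
not.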

-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev
