On Tue, Feb 17, 2015 at 4:20 PM, Avery Payne <avery.p.pa...@gmail.com> wrote:
>
> On 2/17/2015 11:02 AM, Buck Evan wrote:
>>
>> I think there's only three cases here:
>>
>>  1. Users that would have gotten immediate failure, and no amount of
>>     spinning would help. These users will see their error delayed by $SVWAIT
>>     seconds, but no other difference.
>>  2. Users that would have gotten immediate failure, but could have gotten
>>     a success within $SVWAIT seconds. All of these users will of course be glad
>>     of the change.
>>  3. Users that would not have gotten immediate failure. None of these
>>     users will see the slightest change in behavior.
>>
>> Do you have a particular scenario in mind when you mention "breaking lots
>> of existing installations elsewhere due to a default behavior change"? I
>> don't see that there is any case this change would break.

<snip>

Thanks for the thoughtful reply, Avery. My background is also "maintaining
business software", although putting it in those terms gives me horrific
visions of Java servlets and SOAP protocols.

> I have to look at it from a viewpoint of "what is everything else in the
> system expecting when this code is called". This means thinking in terms of
> code-as-API, so that calls elsewhere don't break.

As a matter of API, sv-check does sometimes take up to $SVWAIT seconds to
fail. Any caller of sv-check will be expecting this (strictly limited) delay
in the exceptional case. My patch just extends this existing, documented
behavior to the special case of "unable to open supervise/ok". The API is
unchanged; only the amount of time it takes to return the result changes.

> This happens because the use of "sv check (child)" follows the convention of
> "check, and either succeed fast or fail fast", ...

Either you're confused about what sv-check does, or I'm confused about what
you're saying. sv-check generally doesn't fail fast (except in the special
case I'm trying to make no longer fail fast: runsv is not started).
Generally it will spin for $SVWAIT seconds before failing.

> Without that fast-fail, the logged hint never occurs; the sysadmin now has to
> figure out which of three possible services in a dependency chain are causing
> the hang.

Even if I put the above issue aside, you wouldn't get a hang. You'd get the
failure message you're familiar with, just several seconds (default: 7)
later. The sysadmin wouldn't have to search any more than previously. He
would, however, find that the system fails less often, since it now has
those 7 seconds of tolerance. This is how sv-check already behaves when a
./check script exits nonzero.

> While this is
> implemented differently from other installations, there are known cases
> similar to what I am doing, where people have ./run scripts like this:
>
> #!/bin/sh
> sv check child-service || exit 1
> exec parent-service

This would still work just fine, just strictly more often.
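
To make "spin for up to $SVWAIT seconds, then fail" concrete, here is a rough
shell sketch of that retry pattern. It is only an illustration:
check_with_wait is a hypothetical helper, not part of runit, the one-second
polling interval is my assumption, and sv itself is not implemented this way.

```sh
#!/bin/sh
# Hypothetical helper (NOT part of runit): run a check command repeatedly
# until it succeeds or until roughly wait_secs seconds have elapsed.
# Usage: check_with_wait SECONDS COMMAND [ARGS...]
check_with_wait() {
    wait_secs=$1
    shift
    elapsed=0
    while :; do
        # Success at any point ends the wait immediately (cases 2 and 3).
        "$@" && return 0
        # Out of time: report the same failure, just later (case 1).
        [ "$elapsed" -ge "$wait_secs" ] && return 1
        sleep 1   # assumed polling interval
        elapsed=$((elapsed + 1))
    done
}
```

A call such as `check_with_wait "${SVWAIT:-7}" test -e ./child-service/supervise/ok`
would only be a crude stand-in for what sv really does with supervise/ok, but
it shows the point: a check that would have passed immediately still returns
immediately, and a check that can never pass still fails, only after the wait.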