Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.

2016-05-06 Thread William Reade
On Fri, May 6, 2016 at 11:26 PM, Eric Snow  wrote:

> On Fri, May 6, 2016 at 11:37 AM, William Reade
>  wrote:
> > On Fri, May 6, 2016 at 5:50 PM, Eric Snow 
> wrote:
> >>
> >> See https://bugs.launchpad.net/juju-core/+bug/1514874.
> >
> >
> > Mainly, we just shouldn't do it. The only *actual reason* for us to do
> > this is to support the manual provider; any other machine that happens
> > to be dead will be cleaned up by the responsible provisioner in good
> > time. There's a file we write to enable the suicidal behaviour when we
> > enlist a manual machine, and we *shouldn't* have ever written it for
> > any other reason.
>
> So how about, other than the container bits, we drop the "uninstall"
> code entirely.  To address the manual provider case, during that
> provider's StartInstance() we add a "clean-up-juju" script somewhere
> on the machine?  Then folks can use that to confidently clean up after
> the fact.
>

I'm not keen to regress that without *very* compelling reasons to do so,
and I'm a bit suspicious that installing a clean-up-juju script is itself going
to be a bit of a rabbit hole (once juju's upgraded, the things that need to
be cleaned up might change; at the very least that's an upgrade step to
maintain, but it's also way too easy to just forget that the stored script
could go out of date). Keeping it (as much as possible) in-process strikes
me as more generally reliable; the main thing is dialing way back on the
set of scenarios that can trigger it.

> Yeah, if we simply did no automatic cleanup and (on manual provider)
> gave users a script that would let them clean up afterward, then they
> could happily(?) manually repair their machines with much less
> trouble.
>

Still -1 myself: as you imply, I'm not sure anyone using this feature will
take on a per-machine manual step particularly "happily" ;).

Cheers
William
-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev


Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.

2016-05-06 Thread Eric Snow
On Fri, May 6, 2016 at 11:37 AM, William Reade
 wrote:
> On Fri, May 6, 2016 at 5:50 PM, Eric Snow  wrote:
>>
>> See https://bugs.launchpad.net/juju-core/+bug/1514874.
>
>
> Mainly, we just shouldn't do it. The only *actual reason* for us to do this
> is to support the manual provider; any other machine that happens to be dead
> will be cleaned up by the responsible provisioner in good time. There's a
> file we write to enable the suicidal behaviour when we enlist a manual
> machine, and we *shouldn't* have ever written it for any other reason.

So how about, other than the container bits, we drop the "uninstall"
code entirely.  To address the manual provider case, during that
provider's StartInstance() we add a "clean-up-juju" script somewhere
on the machine?  Then folks can use that to confidently clean up after
the fact.

> So, narrowly, fixing this involves relocating the more-widely-useful
> (in-container loop device detach IIRC?) code inside MachineAgent.uninstall;

Yep.

> I don't think we want any of this -- it's all a consequence of
> inappropriately triggering the uninstall stuff in the first place.

Yeah, if we simply did no automatic cleanup and (on manual provider)
gave users a script that would let them clean up afterward, then they
could happily(?) manually repair their machines with much less
trouble.

>> * Could this lead to collisions if the instance is re-purposed as a
>> different machine?  I suppose it could also expose sensitive data when
>> likewise re-purposed, since it won't necessarily be in the same model
>> or controller.  However, given the need for admin access that probably
>> isn't a likely problem.
>
> If substrates are leaking data between separate instances, the substrate is
> screwed up, and there's not much we can do about it. The manual provider has
> no such guarantees, and that's the only justification for trying to clean up
> so drastically in the first place.

Fair enough.

-eric



Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.

2016-05-06 Thread William Reade
On Fri, May 6, 2016 at 5:50 PM, Eric Snow  wrote:

> See https://bugs.launchpad.net/juju-core/+bug/1514874.
>

Mainly, we just shouldn't do it. The only *actual reason* for us to do this
is to support the manual provider; any other machine that happens to be
dead will be cleaned up by the responsible provisioner in good time.
There's a file we write to enable the suicidal behaviour when we enlist a
manual machine, and we *shouldn't* have ever written it for any other
reason.

But then people started adding post-death cleanup logic to the agent, and
the only way to trigger that was to write that file; so they started
writing that file on normal shutdown paths, so that they could trigger the
post-death logic in all cases, and I can only assume they decided that
nuking the installation was acceptable precisely *because* the provisioner
would be taking down the machine anyway. And note that because there *is* a
responsible provisioner that will be taking the machine down, that shutdown
logic might or might not happen, rendering it ultimately pretty unreliable
[0] as well as spectacularly terrible in the lp:1514874 cases. /sigh

So, narrowly, fixing this involves relocating the more-widely-useful
(in-container loop device detach IIRC?) code inside MachineAgent.uninstall;
and then just not ever writing the file that triggers actual uninstall,
putting it back to where it was intended: a nasty but constrained hack to
avoid polluting manual machines (which are only "borrowed") any more than
necessary.

(A bit more broadly: ErrTerminateAgent is kinda dumb, but not
*particularly* harmful in practice [1] without the file of death backing it
up. The various ways in which, e.g. api connection failure can trigger it
are a bit enthusiastic, perhaps, but if *all it did was terminate the
agent* it'd be fine: someone who, e.g. changed their agent.conf's password
would invoke jujud and see it stop -- can't do anything with this config --
but be perfectly able to work again if the conf were fixed. Don't know what
the deal is with the correct-password-refused thing, though: that should be
investigated further.)

The tricky bit is relocating the cleanup code in uninstall -- I suspect the
reason this whole sorry saga kicked off is because whoever added it was in
a hurry and thought they didn't have a place to put this logic (inside the
machiner, *before* setting Dead, would be closest to correct -- but that'd
require us to synchronise with any other workers that might use the
resources we want to clean up, and I suspect that would have been the
roadblock). Andrew, I think you had more detail last time we discussed
this: is there anything else in uninstall (besides loop-device stuff) that
needs to run *anywhere* except a manual machine? and, what will we actually
need to sync with in the machiner? (or, do you have alternative ideas?)

Cheers
William

[0] although with the watcher resolution of 5s, it's actually got a pretty
good chance of running.
[1] still utterly terrible in theory, it would be lovely to see explicit
handover between contexts -- e.g. it is *not* the machiner's job to return
ErrTerminateAgent, it is whatever's-responsible-for-the-machiner's job to
handle a local ErrMachineDead and convert that as necessary for the context
it's running in.

PS Oh, uh, couple of comments on the below:

> 1. do nothing
>    This would be easy :) but does not help with the pain point.
> 2. be smarter about retrying (e.g. retry connecting to API) when
>    running into fatal errors
>    This would probably be good to do but the effort might not pay for
>    itself.
> 3. do not clean up (data dir, init service, or either)
>    Leaving the init service installed really isn't an option because we
>    don't want the init service to try restarting the agent over and over.
>    Leaving the data dir in place isn't good because it will conflict with
>    any new agent dir the controller tries to put on the instance in the
>    future.
>

we don't need to worry about the init service because we're set up not to
be restarted if we return 0, which we should on ErrTerminateAgent.


> 4. add a DO_NOT_UNINSTALL or DO_NOT_CLEAN_UP flag (e.g. in the
>    machine/model config or as a sentinel file on instances) and do not
>    clean up if it is set to true (default?)
>    This would provide a reasonable quick fix for the above bug, even if
>    temporary and even if it defaults to false.
> 5. disable (instead of uninstall) the init services and move the data
>    dir to some obvious but out-of-the-way place (e.g.
>    /home/ubuntu/agent-data-dir-XXMODELUUIDXX-machine-X) instead of
>    deleting it
>    This is a reasonable longer-term solution, as the concerns described
>    for 3 are addressed and manual intervention becomes more feasible.
> 6. same as 5, but also provide an "enable-juju" (or "revive-juju",
>    "repair-juju", etc.) command *on the machine* that would re-enable the
>    init services and restore (or rebuild) the agent's data dir
>    This would make it even easier to manually recover.

Machine agents uninstall themselves upon worker.ErrTerminateAgent.

2016-05-06 Thread Eric Snow
See https://bugs.launchpad.net/juju-core/+bug/1514874.

When starting, the machine agent starts up several workers in a runner
(more or less the same in 1.25 and 2.0).  Then it waits for the runner
to stop.  If one of the workers has a "fatal" error then the runner
stops and the machine agent handles the error.  If the error is
worker.ErrTerminateAgent then the machine agent uninstalls (cleans up)
itself.  This amounts to deleting its data dir and uninstalling its
jujud and mongo init services.  However, all other files are left in
place* and the machine/instance is not removed from Juju (agent stays
"lost").

Notably, the unit agent handles fatal errors a bit differently, with
some retry logic (e.g. for API connections) and never cleans up after
itself.

The problem here is that a fatal error *may* be recoverable with
manual intervention or with retries.  Cleaning up like we do makes it
harder to manually recover.  Such errors are theoretically extremely
uncommon so we don't worry about doing anything other than stop and
"uninstall".  However, as seen with the above-referenced bug report,
bugs can lead to fatal errors happening with enough frequency that the
agent's clean-up becomes a pain point.

The history of this behavior isn't particularly clear so I'd first
like to hear what the original rationale was for cleaning up the agent
when "terminating" it.  Then I'd like to know if perhaps that
rationale has changed.  Finally, I'd like us to consider alternatives
that allow for better recoverability.

Regarding that third part, we have a number of options.  I introduced
several in the above-referenced bug report and will expand on them
here:

1. do nothing
   This would be easy :) but does not help with the pain point.
2. be smarter about retrying (e.g. retry connecting to API) when
   running into fatal errors
   This would probably be good to do but the effort might not pay for
   itself.
3. do not clean up (data dir, init service, or either)
   Leaving the init service installed really isn't an option because we
   don't want the init service to try restarting the agent over and over.
   Leaving the data dir in place isn't good because it will conflict with
   any new agent dir the controller tries to put on the instance in the
   future.
4. add a DO_NOT_UNINSTALL or DO_NOT_CLEAN_UP flag (e.g. in the
   machine/model config or as a sentinel file on instances) and do not
   clean up if it is set to true (default?)
   This would provide a reasonable quick fix for the above bug, even if
   temporary and even if it defaults to false.
5. disable (instead of uninstall) the init services and move the data
   dir to some obvious but out-of-the-way place (e.g.
   /home/ubuntu/agent-data-dir-XXMODELUUIDXX-machine-X) instead of
   deleting it
   This is a reasonable longer-term solution, as the concerns described
   for 3 are addressed and manual intervention becomes more feasible.
6. same as 5, but also provide an "enable-juju" (or "revive-juju",
   "repair-juju", etc.) command *on the machine* that would re-enable the
   init services and restore (or rebuild) the agent's data dir
   This would make it even easier to manually recover.
7. first try to automatically recover (from machine or controller)
   This would require a sizable effort for a problem that shouldn't
   normally happen.
8. do what the unit agent does
   I haven't looked closely enough to see if this is a good fit.

I'd consider 4 with a default of false to be an acceptable quick fix.
Additionally, I'll advocate for 6 (or 5) as the most appropriate
solution in support of manual recovery from "fatal" errors.

-eric


* Could this lead to collisions if the instance is re-purposed as a
different machine?  I suppose it could also expose sensitive data when
likewise re-purposed, since it won't necessarily be in the same model
or controller.  However, given the need for admin access that probably
isn't a likely problem.
