Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.
On Mon, May 9, 2016 at 10:50 AM, Eric Snowwrote: > I'll check it with 1.25 too, though I expect that the result will be the same. Under 1.25, using the local provider, I was able to reproduce the behavior. On the and even got the same confusing log entry: 2016-05-09 16:48:51 DEBUG juju.cmd.jujud machine.go:1741 uninstall file "/var/lib/juju/uninstall-agent" does not exist So it's not readily apparent what is deleting the files and removing the init services. However, it *is* happening. -eric -- Juju-dev mailing list Juju-dev@lists.ubuntu.com Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.
On Mon, May 9, 2016 at 1:56 AM, Andrew Wilkinswrote: > But... I've just looked at the 1.25 branch again, and the fix *was* made > there. And from Jorge's comment > https://bugs.launchpad.net/juju-core/+bug/1514874/comments/4, we can see > that the uninstall logic isn't actually running (see `uninstall file > "/var/lib/juju/uninstall-agent" does not exist` > https://github.com/juju/juju/blob/1.25/cmd/jujud/agent/machine.go#L1741) > > I'm not sure what to make of that. Eric, have you confirmed that that code > is what's causing the issue? Are we sure we're not barking up the wrong > tree? Those logs certainly don't line up with the reported bug. :/ It's possible that they aren't the right logs. Regardless, my initial investigation was based on inspecting code that "uninstalls" the agent. The only such code I found was machineAgent.uninstallAgent() (and in the manual provder's Destroy()). To be sure about it, I verified the issue using the LXD provider under 2.0, following the steps Jorge specified in the bug: 1) Modify the apipassword on the /var/lib/juju/agents/machine-8/agent.conf 2) Restart jujud-machine-8 The agent is uninstalled and the the uninstall file is there: 2016-05-09 16:21:48 INFO juju.agent uninstall.go:47 agent already marked ready for uninstall However, the machine is still there: $ juju status ... 0 started 10.235.227.113 juju-af8e08f2-3953-4bcb-84de-767a15f46b4f-machine-0 xenial I'll check it with 1.25 too, though I expect that the result will be the same. -eric -- Juju-dev mailing list Juju-dev@lists.ubuntu.com Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.
On Mon, May 9, 2016 at 11:03 AM, Andrew Wilkins < andrew.wilk...@canonical.com> wrote: Meta: a name like "uninstall-agent" is really misleading if it actually >> means "machine-X is definitely dead". I never had the slightest indication >> that was what it was meant to mean. But in that case, why *don't* we write >> the file on certain API connection failures? There is logic that >> *explicitly* checks if the machine is Dead -- surely that's a trigger too? >> > > We do, but it looks like we've regressed since the MADE branch landed. > api.ErrConnectImpossible now causes uninstall-agent, which means a > (possibly temporary) authorization error will once again cause an uninstall: > > https://github.com/juju/juju/blob/master/worker/apicaller/connect.go#L176 > Yeah, we discussed that at the time and evidently something went wrong -- I was still misconceiving uninstall-agent as a manual-specific, provision-time-only switch, which had been incorrectly overloaded with is-it-dead-no-matter-the-provider meaning. > (1) record (in agent config?) that a machine is manual >>> (2) only ever do anything uninstall-related for manual machines >>> (3) only ever do uninstall-related things if the machine actually is >>> Dead >>> (4) drop lxc-specific logic from uninstall *when LXC support is removed* >>> >> >> Generally SGTM, but confused re (4) -- doesn't that have to be >> fixed/moved/removed first earlier, so that we can switch off the suicide >> behaviour in other cases without regressing? >> > > We can remove the LXC/loop bits now if we're fine with leaking loop > devices, which is probably OK assuming we are actually removing LXC before > 2.0. Loop devices has to be explicitly enabled anyway. > Sounds like a plan. Agent config seems like a decent place to store manual-ness; and having a specific ErrAgentEntityDead STM to be the best signal for triggering a check and potential uninstall. (in fact: for great safety, it can and probably should be cmd/jujud/agent.errEntityDead, which gets set in a manifold- or engine-config Filter field -- so it only ever gets injected in response to specific errors from specific workers. It's unquestionably the agent's decision to make; and it's the workers' responsibility to return precise errors, rooted in their own contexts, that don't prejudge how the client might want to respond.) Cheers William -- Juju-dev mailing list Juju-dev@lists.ubuntu.com Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.
On Mon, May 9, 2016 at 4:38 PM William Readewrote: > On Mon, May 9, 2016 at 9:56 AM, Andrew Wilkins < > andrew.wilk...@canonical.com> wrote: > >> The reason it's done at the last moment is to avoid having dangling >>> database entries. If we uninstall the agent (i.e. delete /var/lib/juju, >>> remove systemd/upstart), then if the agent fails before we get to >>> EnsureDead, then the entity will never be removed from state. >>> >> >> The *only* thing that should happen after setting dead is the uninstall >> -- anything else that's required to happen before cleanup *must* happen >> before setting dead, which *means* "all my responsibilities are 100% >> fulfilled". >> > > > I don't think I suggested above that we should do anything else other > than uninstall? > > Sorry I misunderstood -- I was focused on the loop-device stuff that had > grown in the uninstall logic. > > >> The *only* justification for the post-death logic in the manual case is >>> because there's no responsible provisioner component to hand over to -- and >>> frankly I wish we'd just written that to SSH in and clean up, instead of >>> taking on this ongoing hassle. >>> >> >>> >> As an alternative, we could (should) only ever write the /var/lib/juju/uninstall-agent file from worker/machiner, first checking there's no assigned units, and no storage attached. >>> >>> Why would we *ever* want to write it at runtime? We know if it's a >>> manual machine at provisioning time, so we can write the File Of Death >>> OAOO. All the other mucking about with it is the source of these (serious!) >>> bugs. >>> >> >> The point is not to distinguish between manual vs. non-manual. Yes, we >> can write something that records that fact OAOO. >> >> The point of "write from the machiner" was to signal that the machine is >> actually dead, and removed from state, vs. "my credentials are invalid, >> better shut down now". >> > > Yeah, we definitely shouldn't return ErrTerminateAgent from apicaller on > mere invalid-credentials. (No worker should return ErrTerminateAgent *at > all*, really -- not their job. Having a couple of different workers that > can return ErrAgentEntityDead, which can then be interpreted by the next > layer? Fine :).) > > >> So we can write a file to confine uninstall to manual machines -- that >> much is easy, I don't think anyone will disagree with doing that. But we >> should not ignore the bug that prompted this thread, even if it's confined >> to manual machines. >> > > Right; and, yeah, it's almost certainly better to leak manual machines > than it is to uninstall them accidentally -- so long as we know that's the > tradeoff we're making ;). > > >> Andrew, I think you had more detail last time we discussed this: is there > anything else in uninstall (besides loop-device stuff) that needs to run > *anywhere* except a manual machine? and, what will we actually need to > sync > with in the machiner? (or, do you have alternative ideas?) > No, I don't think there is anything else to be done in uninstall, apart from loop detach and manual machine cleanup. I'm not sure about moving the uninstall logic to the machiner, for reasons described above. We could improve the current state of affairs, though, by only writing the uninstall-agent file from the machiner >>> >>> Strong -1 on moving uninstall logic: if it has to happen (which it does, >>> in *rare* cases that are *always* detectable pre-provisioning), uninstall >>> is where it should happen, post-machine-death; and also strong -1 on >>> writing uninstall-agent in *any* circumstances except manual machine >>> provisioning, we have had *way* too many problems with this "clever" >>> feature being invoked when it shouldn't be. >>> >> >> I don't want to belabour the point, but just to be clear: the >> uninstall-agent file exists to record the fact that he machine is in fact >> Dead, and uninstall should go ahead. That logic was put in specifically to >> prevent the referenced bug. We can and should improve it to only do this >> for manual machines. >> > > Meta: a name like "uninstall-agent" is really misleading if it actually > means "machine-X is definitely dead". I never had the slightest indication > that was what it was meant to mean. But in that case, why *don't* we write > the file on certain API connection failures? There is logic that > *explicitly* checks if the machine is Dead -- surely that's a trigger too? > We do, but it looks like we've regressed since the MADE branch landed. api.ErrConnectImpossible now causes uninstall-agent, which means a (possibly temporary) authorization error will once again cause an uninstall: https://github.com/juju/juju/blob/master/worker/apicaller/connect.go#L176 FWIW, the loop stuff can be dropped when the LXC container support is removed. Nobody ever added support for loop in the LXD provider, and I think we should implement support for it
Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.
On Mon, May 9, 2016 at 9:56 AM, Andrew Wilkinswrote: > The reason it's done at the last moment is to avoid having dangling >> database entries. If we uninstall the agent (i.e. delete /var/lib/juju, >> remove systemd/upstart), then if the agent fails before we get to >> EnsureDead, then the entity will never be removed from state. >> > > The *only* thing that should happen after setting dead is the uninstall -- > anything else that's required to happen before cleanup *must* happen before > setting dead, which *means* "all my responsibilities are 100% fulfilled". > > I don't think I suggested above that we should do anything else other than uninstall? Sorry I misunderstood -- I was focused on the loop-device stuff that had grown in the uninstall logic. > The *only* justification for the post-death logic in the manual case is >> because there's no responsible provisioner component to hand over to -- and >> frankly I wish we'd just written that to SSH in and clean up, instead of >> taking on this ongoing hassle. >> > >> > As an alternative, we could (should) only ever write the >>> /var/lib/juju/uninstall-agent file from worker/machiner, first checking >>> there's no assigned units, and no storage attached. >>> >> >> Why would we *ever* want to write it at runtime? We know if it's a manual >> machine at provisioning time, so we can write the File Of Death OAOO. All >> the other mucking about with it is the source of these (serious!) bugs. >> > > The point is not to distinguish between manual vs. non-manual. Yes, we can > write something that records that fact OAOO. > > The point of "write from the machiner" was to signal that the machine is > actually dead, and removed from state, vs. "my credentials are invalid, > better shut down now". > Yeah, we definitely shouldn't return ErrTerminateAgent from apicaller on mere invalid-credentials. (No worker should return ErrTerminateAgent *at all*, really -- not their job. Having a couple of different workers that can return ErrAgentEntityDead, which can then be interpreted by the next layer? Fine :).) > So we can write a file to confine uninstall to manual machines -- that > much is easy, I don't think anyone will disagree with doing that. But we > should not ignore the bug that prompted this thread, even if it's confined > to manual machines. > Right; and, yeah, it's almost certainly better to leak manual machines than it is to uninstall them accidentally -- so long as we know that's the tradeoff we're making ;). > Andrew, I think you had more detail last time we discussed this: is there anything else in uninstall (besides loop-device stuff) that needs to run *anywhere* except a manual machine? and, what will we actually need to sync with in the machiner? (or, do you have alternative ideas?) >>> >>> No, I don't think there is anything else to be done in uninstall, apart >>> from loop detach and manual machine cleanup. I'm not sure about moving the >>> uninstall logic to the machiner, for reasons described above. We could >>> improve the current state of affairs, though, by only writing the >>> uninstall-agent file from the machiner >>> >> >> Strong -1 on moving uninstall logic: if it has to happen (which it does, >> in *rare* cases that are *always* detectable pre-provisioning), uninstall >> is where it should happen, post-machine-death; and also strong -1 on >> writing uninstall-agent in *any* circumstances except manual machine >> provisioning, we have had *way* too many problems with this "clever" >> feature being invoked when it shouldn't be. >> > > I don't want to belabour the point, but just to be clear: the > uninstall-agent file exists to record the fact that he machine is in fact > Dead, and uninstall should go ahead. That logic was put in specifically to > prevent the referenced bug. We can and should improve it to only do this > for manual machines. > Meta: a name like "uninstall-agent" is really misleading if it actually means "machine-X is definitely dead". I never had the slightest indication that was what it was meant to mean. But in that case, why *don't* we write the file on certain API connection failures? There is logic that *explicitly* checks if the machine is Dead -- surely that's a trigger too? FWIW, the loop stuff can be dropped when the LXC container support is >>> removed. Nobody ever added support for loop in the LXD provider, and I >>> think we should implement support for it differently to how it was done for >>> LXC anyway (losetup on host, expose to container; as opposed to expose all >>> loop devices to all LXD containers and losetup in container). >>> >> >> +1000 to that. So... can't we just (1) fix the manual provisioning to >> write the file; (2) drop all other use of uninstall-agent; (3) drop the >> lxc-specific logic in uninstall -- and then we're done? >> > > For first steps, I think so. But I do think we should fix >
Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.
On Mon, May 9, 2016 at 2:28 PM William Readewrote: > On Mon, May 9, 2016 at 3:28 AM, Andrew Wilkins < > andrew.wilk...@canonical.com> wrote: > >> On Sat, May 7, 2016 at 1:37 AM William Reade >> wrote: >> >>> On Fri, May 6, 2016 at 5:50 PM, Eric Snow >>> wrote: >>> See https://bugs.launchpad.net/juju-core/+bug/1514874. >>> >>> >> So I think this issue is fixed in 2.0, but looks like the changes never >> got backported to 1.25. From your options, we do have (the opposite of) a >> DO_NOT_UNINSTALL file (it's actually called >> "/var/lib/juju/uninstall-agent"; only if it exists do we uninstall). >> >> (And now that I think of it, we're only writing uninstall-agent for the >> manual provider's bootstrap machine, and not other manual machines, so >> we're currently leaving Juju bits behind on manual machines added to an >> environment.) >> > > Except we're *also* writing it on every machine, for Very Bad Reasons, > right? So we *are* still cleaning up all machines, but there's a latent > manual provider bug that'll need addressing. > Yes, sorry, it does appear that we're doing it on all machines. Disregard my parenthetical remark. And yes, we should really only write that file for manual machines. But... I've just looked at the 1.25 branch again, and the fix *was* made there. And from Jorge's comment https://bugs.launchpad.net/juju-core/+bug/1514874/comments/4, we can see that the uninstall logic isn't actually running (see `uninstall file "/var/lib/juju/uninstall-agent" does not exist` https://github.com/juju/juju/blob/1.25/cmd/jujud/agent/machine.go#L1741) I'm not sure what to make of that. Eric, have you confirmed that that code is what's causing the issue? Are we sure we're not barking up the wrong tree? The reason it's done at the last moment is to avoid having dangling >> database entries. If we uninstall the agent (i.e. delete /var/lib/juju, >> remove systemd/upstart), then if the agent fails before we get to >> EnsureDead, then the entity will never be removed from state. >> > > The *only* thing that should happen after setting dead is the uninstall -- > anything else that's required to happen before cleanup *must* happen before > setting dead, which *means* "all my responsibilities are 100% fulfilled". > I don't think I suggested above that we should do anything else other than uninstall? The *only* justification for the post-death logic in the manual case is > because there's no responsible provisioner component to hand over to -- and > frankly I wish we'd just written that to SSH in and clean up, instead of > taking on this ongoing hassle. > > As an alternative, we could (should) only ever write the >> /var/lib/juju/uninstall-agent file from worker/machiner, first checking >> there's no assigned units, and no storage attached. >> > > Why would we *ever* want to write it at runtime? We know if it's a manual > machine at provisioning time, so we can write the File Of Death OAOO. All > the other mucking about with it is the source of these (serious!) bugs. > The point is not to distinguish between manual vs. non-manual. Yes, we can write something that records that fact OAOO. The point of "write from the machiner" was to signal that the machine is actually dead, and removed from state, vs. "my credentials are invalid, better shut down now". So we can write a file to confine uninstall to manual machines -- that much is easy, I don't think anyone will disagree with doing that. But we should not ignore the bug that prompted this thread, even if it's confined to manual machines. Andrew, I think you had more detail last time we discussed this: is there >>> anything else in uninstall (besides loop-device stuff) that needs to run >>> *anywhere* except a manual machine? and, what will we actually need to sync >>> with in the machiner? (or, do you have alternative ideas?) >>> >> >> No, I don't think there is anything else to be done in uninstall, apart >> from loop detach and manual machine cleanup. I'm not sure about moving the >> uninstall logic to the machiner, for reasons described above. We could >> improve the current state of affairs, though, by only writing the >> uninstall-agent file from the machiner >> > > Strong -1 on moving uninstall logic: if it has to happen (which it does, > in *rare* cases that are *always* detectable pre-provisioning), uninstall > is where it should happen, post-machine-death; and also strong -1 on > writing uninstall-agent in *any* circumstances except manual machine > provisioning, we have had *way* too many problems with this "clever" > feature being invoked when it shouldn't be. > I don't want to belabour the point, but just to be clear: the uninstall-agent file exists to record the fact that he machine is in fact Dead, and uninstall should go ahead. That logic was put in specifically to prevent the referenced bug. We can and should improve it to only do this for manual
Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.
On Mon, May 9, 2016 at 3:28 AM, Andrew Wilkinswrote: > On Sat, May 7, 2016 at 1:37 AM William Reade > wrote: > >> On Fri, May 6, 2016 at 5:50 PM, Eric Snow >> wrote: >> >>> See https://bugs.launchpad.net/juju-core/+bug/1514874. >> >> > So I think this issue is fixed in 2.0, but looks like the changes never > got backported to 1.25. From your options, we do have (the opposite of) a > DO_NOT_UNINSTALL file (it's actually called > "/var/lib/juju/uninstall-agent"; only if it exists do we uninstall). > > (And now that I think of it, we're only writing uninstall-agent for the > manual provider's bootstrap machine, and not other manual machines, so > we're currently leaving Juju bits behind on manual machines added to an > environment.) > Except we're *also* writing it on every machine, for Very Bad Reasons, right? So we *are* still cleaning up all machines, but there's a latent manual provider bug that'll need addressing. > The reason it's done at the last moment is to avoid having dangling > database entries. If we uninstall the agent (i.e. delete /var/lib/juju, > remove systemd/upstart), then if the agent fails before we get to > EnsureDead, then the entity will never be removed from state. > The *only* thing that should happen after setting dead is the uninstall -- anything else that's required to happen before cleanup *must* happen before setting dead, which *means* "all my responsibilities are 100% fulfilled". The *only* justification for the post-death logic in the manual case is because there's no responsible provisioner component to hand over to -- and frankly I wish we'd just written that to SSH in and clean up, instead of taking on this ongoing hassle. As an alternative, we could (should) only ever write the > /var/lib/juju/uninstall-agent file from worker/machiner, first checking > there's no assigned units, and no storage attached. > Why would we *ever* want to write it at runtime? We know if it's a manual machine at provisioning time, so we can write the File Of Death OAOO. All the other mucking about with it is the source of these (serious!) bugs. Andrew, I think you had more detail last time we discussed this: is there >> anything else in uninstall (besides loop-device stuff) that needs to run >> *anywhere* except a manual machine? and, what will we actually need to sync >> with in the machiner? (or, do you have alternative ideas?) >> > > No, I don't think there is anything else to be done in uninstall, apart > from loop detach and manual machine cleanup. I'm not sure about moving the > uninstall logic to the machiner, for reasons described above. We could > improve the current state of affairs, though, by only writing the > uninstall-agent file from the machiner > Strong -1 on moving uninstall logic: if it has to happen (which it does, in *rare* cases that are *always* detectable pre-provisioning), uninstall is where it should happen, post-machine-death; and also strong -1 on writing uninstall-agent in *any* circumstances except manual machine provisioning, we have had *way* too many problems with this "clever" feature being invoked when it shouldn't be. FWIW, the loop stuff can be dropped when the LXC container support is > removed. Nobody ever added support for loop in the LXD provider, and I > think we should implement support for it differently to how it was done for > LXC anyway (losetup on host, expose to container; as opposed to expose all > loop devices to all LXD containers and losetup in container). > +1000 to that. So... can't we just (1) fix the manual provisioning to write the file; (2) drop all other use of uninstall-agent; (3) drop the lxc-specific logic in uninstall -- and then we're done? Cheers William -- Juju-dev mailing list Juju-dev@lists.ubuntu.com Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev