Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.

2016-05-09 Thread Eric Snow
On Mon, May 9, 2016 at 10:50 AM, Eric Snow  wrote:
> I'll check it with 1.25 too, though I expect that the result will be the same.

Under 1.25, using the local provider, I was able to reproduce the
behavior.  On the  and even got the same confusing log entry:

  2016-05-09 16:48:51 DEBUG juju.cmd.jujud machine.go:1741 uninstall
file "/var/lib/juju/uninstall-agent" does not exist

So it's not readily apparent what is deleting the files and removing
the init services.  However, it *is* happening.

-eric

-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev


Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.

2016-05-09 Thread Eric Snow
On Mon, May 9, 2016 at 1:56 AM, Andrew Wilkins
 wrote:
> But... I've just looked at the 1.25 branch again, and the fix *was* made
> there. And from Jorge's comment
> https://bugs.launchpad.net/juju-core/+bug/1514874/comments/4, we can see
> that the uninstall logic isn't actually running (see `uninstall file
> "/var/lib/juju/uninstall-agent" does not exist`
> https://github.com/juju/juju/blob/1.25/cmd/jujud/agent/machine.go#L1741)
>
> I'm not sure what to make of that. Eric, have you confirmed that that code
> is what's causing the issue? Are we sure we're not barking up the wrong
> tree?

Those logs certainly don't line up with the reported bug. :/  It's
possible that they aren't the right logs.  Regardless, my initial
investigation was based on inspecting code that "uninstalls" the
agent.  The only such code I found was machineAgent.uninstallAgent()
(and in the manual provder's Destroy()).

To be sure about it, I verified the issue using the LXD provider under
2.0, following the steps Jorge specified in the bug:

1) Modify the apipassword on the /var/lib/juju/agents/machine-8/agent.conf
2) Restart jujud-machine-8

The agent is uninstalled and the the uninstall file is there:

  2016-05-09 16:21:48 INFO juju.agent uninstall.go:47 agent already
marked ready for uninstall

However, the machine is still there:

  $ juju status
  ...
  0  started 10.235.227.113
juju-af8e08f2-3953-4bcb-84de-767a15f46b4f-machine-0 xenial

I'll check it with 1.25 too, though I expect that the result will be the same.

-eric

-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev


Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.

2016-05-09 Thread William Reade
On Mon, May 9, 2016 at 11:03 AM, Andrew Wilkins <
andrew.wilk...@canonical.com> wrote:

Meta: a name like "uninstall-agent" is really misleading if it actually
>> means "machine-X is definitely dead". I never had the slightest indication
>> that was what it was meant to mean. But in that case, why *don't* we write
>> the file on certain API connection failures? There is logic that
>> *explicitly* checks if the machine is Dead -- surely that's a trigger too?
>>
>
> We do, but it looks like we've regressed since the MADE branch landed.
> api.ErrConnectImpossible now causes uninstall-agent, which means a
> (possibly temporary) authorization error will once again cause an uninstall:
>
> https://github.com/juju/juju/blob/master/worker/apicaller/connect.go#L176
>

Yeah, we discussed that at the time and evidently something went wrong -- I
was still misconceiving uninstall-agent as a manual-specific,
provision-time-only switch, which had been incorrectly overloaded with
is-it-dead-no-matter-the-provider meaning.


>  (1) record (in agent config?) that a machine is manual
>>>  (2) only ever do anything uninstall-related for manual machines
>>>  (3) only ever do uninstall-related things if the machine actually is
>>> Dead
>>>  (4) drop lxc-specific logic from uninstall *when LXC support is removed*
>>>
>>
>> Generally SGTM, but confused re (4) -- doesn't that have to be
>> fixed/moved/removed first earlier, so that we can switch off the suicide
>> behaviour in other cases without regressing?
>>
>
> We can remove the LXC/loop bits now if we're fine with leaking loop
> devices, which is probably OK assuming we are actually removing LXC before
> 2.0. Loop devices has to be explicitly enabled anyway.
>

Sounds like a plan. Agent config seems like a decent place to store
manual-ness; and having a specific ErrAgentEntityDead STM to be the best
signal for triggering a check and potential uninstall.

(in fact: for great safety, it can and probably should be
cmd/jujud/agent.errEntityDead, which gets set in a manifold- or
engine-config Filter field -- so it only ever gets injected in response to
specific errors from specific workers. It's unquestionably the agent's
decision to make; and it's the workers' responsibility to return precise
errors, rooted in their own contexts, that don't prejudge how the client
might want to respond.)

Cheers
William
-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev


Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.

2016-05-09 Thread Andrew Wilkins
On Mon, May 9, 2016 at 4:38 PM William Reade 
wrote:

> On Mon, May 9, 2016 at 9:56 AM, Andrew Wilkins <
> andrew.wilk...@canonical.com> wrote:
>
>> The reason it's done at the last moment is to avoid having dangling
>>> database entries. If we uninstall the agent (i.e. delete /var/lib/juju,
>>> remove systemd/upstart), then if the agent fails before we get to
>>> EnsureDead, then the entity will never be removed from state.
>>>
>>
>> The *only* thing that should happen after setting dead is the uninstall
>> -- anything else that's required to happen before cleanup *must* happen
>> before setting dead, which *means* "all my responsibilities are 100%
>> fulfilled".
>>
>
> > I don't think I suggested above that we should do anything else other
> than uninstall?
>
> Sorry I misunderstood -- I was focused on the loop-device stuff that had
> grown in the uninstall logic.
>
>
>> The *only* justification for the post-death logic in the manual case is
>>> because there's no responsible provisioner component to hand over to -- and
>>> frankly I wish we'd just written that to SSH in and clean up, instead of
>>> taking on this ongoing hassle.
>>>
>>
>>>
>> As an alternative, we could (should) only ever write the
 /var/lib/juju/uninstall-agent file from worker/machiner, first checking
 there's no assigned units, and no storage attached.

>>>
>>> Why would we *ever* want to write it at runtime? We know if it's a
>>> manual machine at provisioning time, so we can write the File Of Death
>>> OAOO. All the other mucking about with it is the source of these (serious!)
>>> bugs.
>>>
>>
>> The point is not to distinguish between manual vs. non-manual. Yes, we
>> can write something that records that fact OAOO.
>>
>> The point of "write from the machiner" was to signal that the machine is
>> actually dead, and removed from state, vs. "my credentials are invalid,
>> better shut down now".
>>
>
> Yeah, we definitely shouldn't return ErrTerminateAgent from apicaller on
> mere invalid-credentials. (No worker should return ErrTerminateAgent *at
> all*, really -- not their job. Having a couple of different workers that
> can return ErrAgentEntityDead, which can then be interpreted by the next
> layer? Fine :).)
>
>
>> So we can write a file to confine uninstall to manual machines -- that
>> much is easy, I don't think anyone will disagree with doing that. But we
>> should not ignore the bug that prompted this thread, even if it's confined
>> to manual machines.
>>
>
> Right; and, yeah, it's almost certainly better to leak manual machines
> than it is to uninstall them accidentally -- so long as we know that's the
> tradeoff we're making ;).
>
>
>> Andrew, I think you had more detail last time we discussed this: is there
> anything else in uninstall (besides loop-device stuff) that needs to run
> *anywhere* except a manual machine? and, what will we actually need to 
> sync
> with in the machiner? (or, do you have alternative ideas?)
>

 No, I don't think there is anything else to be done in uninstall, apart
 from loop detach and manual machine cleanup. I'm not sure about moving the
 uninstall logic to the machiner, for reasons described above. We could
 improve the current state of affairs, though, by only writing the
 uninstall-agent file from the machiner

>>>
>>> Strong -1 on moving uninstall logic: if it has to happen (which it does,
>>> in *rare* cases that are *always* detectable pre-provisioning), uninstall
>>> is where it should happen, post-machine-death; and also strong -1 on
>>> writing uninstall-agent in *any* circumstances except manual machine
>>> provisioning, we have had *way* too many problems with this "clever"
>>> feature being invoked when it shouldn't be.
>>>
>>
>> I don't want to belabour the point, but just to be clear: the
>> uninstall-agent file exists to record the fact that he machine is in fact
>> Dead, and uninstall should go ahead. That logic was put in specifically to
>> prevent the referenced bug. We can and should improve it to only do this
>> for manual machines.
>>
>
> Meta: a name like "uninstall-agent" is really misleading if it actually
> means "machine-X is definitely dead". I never had the slightest indication
> that was what it was meant to mean. But in that case, why *don't* we write
> the file on certain API connection failures? There is logic that
> *explicitly* checks if the machine is Dead -- surely that's a trigger too?
>

We do, but it looks like we've regressed since the MADE branch landed.
api.ErrConnectImpossible now causes uninstall-agent, which means a
(possibly temporary) authorization error will once again cause an uninstall:

https://github.com/juju/juju/blob/master/worker/apicaller/connect.go#L176

FWIW, the loop stuff can be dropped when the LXC container support is
 removed. Nobody ever added support for loop in the LXD provider, and I
 think we should implement support for it 

Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.

2016-05-09 Thread William Reade
On Mon, May 9, 2016 at 9:56 AM, Andrew Wilkins  wrote:

> The reason it's done at the last moment is to avoid having dangling
>> database entries. If we uninstall the agent (i.e. delete /var/lib/juju,
>> remove systemd/upstart), then if the agent fails before we get to
>> EnsureDead, then the entity will never be removed from state.
>>
>
> The *only* thing that should happen after setting dead is the uninstall --
> anything else that's required to happen before cleanup *must* happen before
> setting dead, which *means* "all my responsibilities are 100% fulfilled".
>

> I don't think I suggested above that we should do anything else other
than uninstall?

Sorry I misunderstood -- I was focused on the loop-device stuff that had
grown in the uninstall logic.


> The *only* justification for the post-death logic in the manual case is
>> because there's no responsible provisioner component to hand over to -- and
>> frankly I wish we'd just written that to SSH in and clean up, instead of
>> taking on this ongoing hassle.
>>
>
>>
> As an alternative, we could (should) only ever write the
>>> /var/lib/juju/uninstall-agent file from worker/machiner, first checking
>>> there's no assigned units, and no storage attached.
>>>
>>
>> Why would we *ever* want to write it at runtime? We know if it's a manual
>> machine at provisioning time, so we can write the File Of Death OAOO. All
>> the other mucking about with it is the source of these (serious!) bugs.
>>
>
> The point is not to distinguish between manual vs. non-manual. Yes, we can
> write something that records that fact OAOO.
>
> The point of "write from the machiner" was to signal that the machine is
> actually dead, and removed from state, vs. "my credentials are invalid,
> better shut down now".
>

Yeah, we definitely shouldn't return ErrTerminateAgent from apicaller on
mere invalid-credentials. (No worker should return ErrTerminateAgent *at
all*, really -- not their job. Having a couple of different workers that
can return ErrAgentEntityDead, which can then be interpreted by the next
layer? Fine :).)


> So we can write a file to confine uninstall to manual machines -- that
> much is easy, I don't think anyone will disagree with doing that. But we
> should not ignore the bug that prompted this thread, even if it's confined
> to manual machines.
>

Right; and, yeah, it's almost certainly better to leak manual machines than
it is to uninstall them accidentally -- so long as we know that's the
tradeoff we're making ;).


> Andrew, I think you had more detail last time we discussed this: is there
 anything else in uninstall (besides loop-device stuff) that needs to run
 *anywhere* except a manual machine? and, what will we actually need to sync
 with in the machiner? (or, do you have alternative ideas?)

>>>
>>> No, I don't think there is anything else to be done in uninstall, apart
>>> from loop detach and manual machine cleanup. I'm not sure about moving the
>>> uninstall logic to the machiner, for reasons described above. We could
>>> improve the current state of affairs, though, by only writing the
>>> uninstall-agent file from the machiner
>>>
>>
>> Strong -1 on moving uninstall logic: if it has to happen (which it does,
>> in *rare* cases that are *always* detectable pre-provisioning), uninstall
>> is where it should happen, post-machine-death; and also strong -1 on
>> writing uninstall-agent in *any* circumstances except manual machine
>> provisioning, we have had *way* too many problems with this "clever"
>> feature being invoked when it shouldn't be.
>>
>
> I don't want to belabour the point, but just to be clear: the
> uninstall-agent file exists to record the fact that he machine is in fact
> Dead, and uninstall should go ahead. That logic was put in specifically to
> prevent the referenced bug. We can and should improve it to only do this
> for manual machines.
>

Meta: a name like "uninstall-agent" is really misleading if it actually
means "machine-X is definitely dead". I never had the slightest indication
that was what it was meant to mean. But in that case, why *don't* we write
the file on certain API connection failures? There is logic that
*explicitly* checks if the machine is Dead -- surely that's a trigger too?

FWIW, the loop stuff can be dropped when the LXC container support is
>>> removed. Nobody ever added support for loop in the LXD provider, and I
>>> think we should implement support for it differently to how it was done for
>>> LXC anyway (losetup on host, expose to container; as opposed to expose all
>>> loop devices to all LXD containers and losetup in container).
>>>
>>
>> +1000 to that. So... can't we just (1) fix the manual provisioning to
>> write the file; (2) drop all other use of uninstall-agent; (3) drop the
>> lxc-specific logic in uninstall -- and then we're done?
>>
>
> For first steps, I think so. But I do think we should fix
> 

Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.

2016-05-09 Thread Andrew Wilkins
On Mon, May 9, 2016 at 2:28 PM William Reade 
wrote:

> On Mon, May 9, 2016 at 3:28 AM, Andrew Wilkins <
> andrew.wilk...@canonical.com> wrote:
>
>> On Sat, May 7, 2016 at 1:37 AM William Reade 
>> wrote:
>>
>>> On Fri, May 6, 2016 at 5:50 PM, Eric Snow 
>>> wrote:
>>>
 See https://bugs.launchpad.net/juju-core/+bug/1514874.
>>>
>>>
>> So I think this issue is fixed in 2.0, but looks like the changes never
>> got backported to 1.25. From your options, we do have (the opposite of) a
>> DO_NOT_UNINSTALL file (it's actually called
>> "/var/lib/juju/uninstall-agent"; only if it exists do we uninstall).
>>
>> (And now that I think of it, we're only writing uninstall-agent for the
>> manual provider's bootstrap machine, and not other manual machines, so
>> we're currently leaving Juju bits behind on manual machines added to an
>> environment.)
>>
>
> Except we're *also* writing it on every machine, for Very Bad Reasons,
> right? So we *are* still cleaning up all machines, but there's a latent
> manual provider bug that'll need addressing.
>

Yes, sorry, it does appear that we're doing it on all machines. Disregard
my parenthetical remark. And yes, we should really only write that file for
manual machines.

But... I've just looked at the 1.25 branch again, and the fix *was* made
there. And from Jorge's comment
https://bugs.launchpad.net/juju-core/+bug/1514874/comments/4, we can see
that the uninstall logic isn't actually running (see `uninstall file
"/var/lib/juju/uninstall-agent" does not exist`
https://github.com/juju/juju/blob/1.25/cmd/jujud/agent/machine.go#L1741)

I'm not sure what to make of that. Eric, have you confirmed that that code
is what's causing the issue? Are we sure we're not barking up the wrong
tree?

The reason it's done at the last moment is to avoid having dangling
>> database entries. If we uninstall the agent (i.e. delete /var/lib/juju,
>> remove systemd/upstart), then if the agent fails before we get to
>> EnsureDead, then the entity will never be removed from state.
>>
>
> The *only* thing that should happen after setting dead is the uninstall --
> anything else that's required to happen before cleanup *must* happen before
> setting dead, which *means* "all my responsibilities are 100% fulfilled".
>

I don't think I suggested above that we should do anything else other than
uninstall?

The *only* justification for the post-death logic in the manual case is
> because there's no responsible provisioner component to hand over to -- and
> frankly I wish we'd just written that to SSH in and clean up, instead of
> taking on this ongoing hassle.
>

>
As an alternative, we could (should) only ever write the
>> /var/lib/juju/uninstall-agent file from worker/machiner, first checking
>> there's no assigned units, and no storage attached.
>>
>
> Why would we *ever* want to write it at runtime? We know if it's a manual
> machine at provisioning time, so we can write the File Of Death OAOO. All
> the other mucking about with it is the source of these (serious!) bugs.
>

The point is not to distinguish between manual vs. non-manual. Yes, we can
write something that records that fact OAOO.

The point of "write from the machiner" was to signal that the machine is
actually dead, and removed from state, vs. "my credentials are invalid,
better shut down now".

So we can write a file to confine uninstall to manual machines -- that much
is easy, I don't think anyone will disagree with doing that. But we should
not ignore the bug that prompted this thread, even if it's confined to
manual machines.

Andrew, I think you had more detail last time we discussed this: is there
>>> anything else in uninstall (besides loop-device stuff) that needs to run
>>> *anywhere* except a manual machine? and, what will we actually need to sync
>>> with in the machiner? (or, do you have alternative ideas?)
>>>
>>
>> No, I don't think there is anything else to be done in uninstall, apart
>> from loop detach and manual machine cleanup. I'm not sure about moving the
>> uninstall logic to the machiner, for reasons described above. We could
>> improve the current state of affairs, though, by only writing the
>> uninstall-agent file from the machiner
>>
>
> Strong -1 on moving uninstall logic: if it has to happen (which it does,
> in *rare* cases that are *always* detectable pre-provisioning), uninstall
> is where it should happen, post-machine-death; and also strong -1 on
> writing uninstall-agent in *any* circumstances except manual machine
> provisioning, we have had *way* too many problems with this "clever"
> feature being invoked when it shouldn't be.
>

I don't want to belabour the point, but just to be clear: the
uninstall-agent file exists to record the fact that he machine is in fact
Dead, and uninstall should go ahead. That logic was put in specifically to
prevent the referenced bug. We can and should improve it to only do this
for manual 

Re: Machine agents uninstall themselves upon worker.ErrTerminateAgent.

2016-05-09 Thread William Reade
On Mon, May 9, 2016 at 3:28 AM, Andrew Wilkins  wrote:

> On Sat, May 7, 2016 at 1:37 AM William Reade 
> wrote:
>
>> On Fri, May 6, 2016 at 5:50 PM, Eric Snow 
>> wrote:
>>
>>> See https://bugs.launchpad.net/juju-core/+bug/1514874.
>>
>>
> So I think this issue is fixed in 2.0, but looks like the changes never
> got backported to 1.25. From your options, we do have (the opposite of) a
> DO_NOT_UNINSTALL file (it's actually called
> "/var/lib/juju/uninstall-agent"; only if it exists do we uninstall).
>
> (And now that I think of it, we're only writing uninstall-agent for the
> manual provider's bootstrap machine, and not other manual machines, so
> we're currently leaving Juju bits behind on manual machines added to an
> environment.)
>

Except we're *also* writing it on every machine, for Very Bad Reasons,
right? So we *are* still cleaning up all machines, but there's a latent
manual provider bug that'll need addressing.


> The reason it's done at the last moment is to avoid having dangling
> database entries. If we uninstall the agent (i.e. delete /var/lib/juju,
> remove systemd/upstart), then if the agent fails before we get to
> EnsureDead, then the entity will never be removed from state.
>

The *only* thing that should happen after setting dead is the uninstall --
anything else that's required to happen before cleanup *must* happen before
setting dead, which *means* "all my responsibilities are 100% fulfilled".
The *only* justification for the post-death logic in the manual case is
because there's no responsible provisioner component to hand over to -- and
frankly I wish we'd just written that to SSH in and clean up, instead of
taking on this ongoing hassle.

As an alternative, we could (should) only ever write the
> /var/lib/juju/uninstall-agent file from worker/machiner, first checking
> there's no assigned units, and no storage attached.
>

Why would we *ever* want to write it at runtime? We know if it's a manual
machine at provisioning time, so we can write the File Of Death OAOO. All
the other mucking about with it is the source of these (serious!) bugs.

Andrew, I think you had more detail last time we discussed this: is there
>> anything else in uninstall (besides loop-device stuff) that needs to run
>> *anywhere* except a manual machine? and, what will we actually need to sync
>> with in the machiner? (or, do you have alternative ideas?)
>>
>
> No, I don't think there is anything else to be done in uninstall, apart
> from loop detach and manual machine cleanup. I'm not sure about moving the
> uninstall logic to the machiner, for reasons described above. We could
> improve the current state of affairs, though, by only writing the
> uninstall-agent file from the machiner
>

Strong -1 on moving uninstall logic: if it has to happen (which it does, in
*rare* cases that are *always* detectable pre-provisioning), uninstall is
where it should happen, post-machine-death; and also strong -1 on writing
uninstall-agent in *any* circumstances except manual machine provisioning,
we have had *way* too many problems with this "clever" feature being
invoked when it shouldn't be.

FWIW, the loop stuff can be dropped when the LXC container support is
> removed. Nobody ever added support for loop in the LXD provider, and I
> think we should implement support for it differently to how it was done for
> LXC anyway (losetup on host, expose to container; as opposed to expose all
> loop devices to all LXD containers and losetup in container).
>

+1000 to that. So... can't we just (1) fix the manual provisioning to write
the file; (2) drop all other use of uninstall-agent; (3) drop the
lxc-specific logic in uninstall -- and then we're done?

Cheers
William
-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev