[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-07 Thread Mark Bannister
At Thu May 7 15:13:14 UTC 2020, Kumar Kartikeya Dwivedi  wrote:
> Well, maybe you're mixing two issues into one. The original report was
> about session scopes failing due to a conflict of job types, which you
> should not be seeing on 219-67

Sorry if I confused the thread by mentioning systemd-run, in my
ignorance I presumed that systemd-run was doing more or less the same
as opening a new SSH session would have done inside the machinery of
systemd.  So let's forget about that for a moment, other than
recording the fact that I reproduced it once with systemd-run on
219-67.  Prior to my last response this entire thread was about one
single issue occurring when SSH sessions were opening and closing.

The original report is definitely happening on 219-67 which means that
the github issue you found is not the cause (because that patch is not
included in 219-67).  This much is for certain.

I wasn't sure how to follow your exact steps because I wasn't clear
when you wrote session-N.scope whether I was expected to replace N
with a number and if so where that number comes from?  Please confirm.

Thanks
Mark
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-07 Thread Mark Bannister
On Wed May 6 17:49:02 UTC 2020, Kumar Kartikeya Dwivedi  wrote:
> This issue is NOT reproducible with the version you reported
> originally. The stock CentOS image on my cloud provider comes with the
> same version (219-67.el7_7.4) that you said you see this issue with,
> but that can't be, as it doesn't include this change:
> https://github.com/systemd-rhel/rhel-7/commit/0600681f04e3818282a2d518ec3e6afee85f7978,
> unless I'm missing something. It took me a while to figure this out...

You're missing something, although I don't know what, because we're
definitely seeing this on 219-67 from
http://vault.centos.org/7.7.1908/os/Source/SPackages/ and it
definitely does not contain the change that you've linked to.  The
only custom patch we've applied is
https://github.com/systemd/systemd/commit/cf99f8eacf1c864b19a6a02edea78c43f3185cb7.
Something else is clearly going on here.

Thanks for the suggestions on how to reproduce.  I managed to
reproduce it once (first time lucky) with:

# service=$(systemd-run --uid=$RUNUSER /tmp/myscript 2>&1 | awk
'{sub("\\.$", ""); print $NF}'); systemctl --no-block stop $service &
systemctl --no-block stop user-$RUNUID.slice; sleep 1; systemctl
list-jobs

... where /tmp/myscript is:

#!/bin/sh
trap '' INT TERM HUP
sleep 1d

However, after the first success, repeated attempts failed to reproduce it.

> All in all, you can't do much about these errors, they're mostly
> harmless (at worst you need to retry the login)

While most of the time this is triggered by SSH session, I have
discovered that we have one script that is using systemd-run and which
failed to launch a command with this 'transaction is destructive'
error.  Could it be that when this problem is triggered, a systemd-run
could fail?

> and upgrading to v228
> (the patchset mentioned in the previous exchange) or higher should fix
> all of this for you.

Alas systemd is so big and so baked into everything these days that we
daren't do more than add a handful of custom patches to the version
shipped with the release.  So if there is a small patch we can add
that fixes it in both 219-67 and 219-73, that would be immensely
helpful.

Best regards
Mark
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-06 Thread Mark Bannister
> On Tue, 5 May 2020 at 16:27, Mark Bannister  wrote:
> > Perhaps I'm missing something or oversimplifying, but I tried to
> > reproduce the problem as follows: I created a script that ignores INT,
> > TERM and HUP and then loops indefinitely writing the time every second
> > to a log file.  I SSHed into the box, ran the script, and then from
> > another terminal window killed the parent (bash) and also retried the
> > experiment by killing the bash parent (sshd).  In both cases the
> > parent bash and sshd processes died (as expected), my script kept
> > running but the parent PID of the script was changed to 1 (systemd).
> > I did not receive any error messages in /var/log/messages.
>
> It has to be the last remaining session of the user, and you should
> inspect with systemctl list-jobs if systemd is trying to stop the
> scope and slice units, just to be sure, before attempting to login.
> Otherwise, if you're logged in from two places on the machine for a
> user, you won't hit the issue, obviously.

I repeated the exact same test with a different user account, one
set-up specifically for this test and nothing else, so there was only
one SSH session.  But 'systemctl list-jobs' always reported 'No jobs
running' even after the SSH session was terminated.  My script kept
running with the parent PID of 1.  Thanks for your help so far, but I
have still been unable to reproduce the problem.  What am I doing
wrong?

Thanks
Mark
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-05 Thread Mark Bannister
On Mon May 4 12:44:12 UTC 2020, Kumar Kartikeya Dwivedi  wrote:
> For your case, it could be that some process running under
> session-N.scope is not responding to SIGTERM, and this in turn is
> making session-N.scope take time to stop (it has a default timeout of
> 90s), because it has to wait for all processes run under it to die
> before exiting. This also means that your user-UID.slice is waiting
> for session-N.scope before being stopped.

So if I'm understanding correctly, your suggestion is that if an SSH
session runs a process that ignores SIGTERM and the session is closed
and then within 90 seconds the same user attempts to open a new SSH
session, it should trigger this error condition?

Perhaps I'm missing something or oversimplifying, but I tried to
reproduce the problem as follows: I created a script that ignores INT,
TERM and HUP and then loops indefinitely writing the time every second
to a log file.  I SSHed into the box, ran the script, and then from
another terminal window killed the parent (bash) and also retried the
experiment by killing the bash parent (sshd).  In both cases the
parent bash and sshd processes died (as expected), my script kept
running but the parent PID of the script was changed to 1 (systemd).
I did not receive any error messages in /var/log/messages.

Any ideas how I might reproduce this problem at will?

Thanks
Mark
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-04 Thread Mark Bannister
On Thu Apr 30 19:01:51 UTC 2020, Kumar Kartikeya Dwivedi  wrote:
> My educated guess is that the session slice is taking time to clean up
> because of the default stop timeout due to processes in the session
> not responding to SIGTERM (it is waiting for dependent units to stop
> before stopping itself by virtue of being ordered after them), and in
> the meantime a new login is initiated. Now, the stop job on the user
> slice is still waiting, and looking at the code in logind,
> manager_start_scope uses the job mode fail, which means the start job
> for the user slice won't be able to cancel the sitting stop job, hence
> the transaction cannot be applied.

When you say the session slice is taking time to clean up, please
excuse my ignorance as
I'm not really up to speed on how session slices are being managed by
systemd, but why
would it matter if a slice takes time to clean up?  Is there a limit
on how many SSH session
slices a user can have or something?  I don't see how that would cause
this particular error.

This is happening a few times a day.  Looking at the logs, we have SSH
sessions opening
and closing successfully all day long for various different accounts.
Occasionally we see
the problematic error message.  I've correlated with other messages
and discovered that
at the same time that we get this in /var/log/messages (the original
reported error):

2020-05-03T16:09:38.735265-04:00 jupiter systemd[1]: Requested
transaction contradicts existing jobs: Transaction for
session-240481.scope/start is destructive (user-17132.slice has 'stop'
job queued, but 'start' is included in transaction).
2020-05-03T16:09:38.735430-04:00 jupiter systemd-logind[1588]: Failed
to start session scope session-240481.scope: Transaction for
session-240481.scope/start is destructive (user-17132.slice has 'stop'
job queued, but 'start' is included in transaction).
2020-05-03T16:09:38.735642-04:00 jupiter systemd[1]: Removed slice
User Slice of jupiter-user.

... we also see this message in /var/log/secure:

020-05-03T16:09:38.737122-04:00 jupiter sshd[11031]:
pam_systemd(sshd:session): Failed to create session: Resource deadlock
avoided
2020-05-03T16:09:38.737380-04:00 jupiter sshd[11031]:
pam_unix(sshd:session): session opened for user jupiter-user by
(uid=0)

Does this 'Resource deadlock avoided' message from pam_systemd help
identify the root
cause, or is that just a side-effect?

Thanks
Mark
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-04-30 Thread Mark Bannister
> On Mon, 2020-04-27 at 11:40 +0100, Mark Bannister wrote:

> > One of the error messages I've been trying to explain is reported like this:
> > 2020-04-22T00:45:47.047710-04:00 jupiter systemd[1]: Requested transaction
> > contradicts existing jobs: Transaction for session-752473.scope/start is
> > destructive (user-16027.slice has 'stop' job queued, but 'start' is
> > included in transaction).

> > This also coincides with a message from sshd:
> > 2020-04-22T00:45:47.049949-04:00 jupiter sshd[20984]: 
> > pam_unix(sshd:session): session opened for user userlp by (uid=0)
> > 2020-04-22T00:45:51.181665-04:00 jupiter sshd[20984]: 
> > pam_unix(sshd:session): session closed for user userlp

> > I'm not sure where to begin to troubleshoot the problem.  The systemd
> > error is occurring sporadically and I haven't yet found a way to
> > reproduce it.  I also can't say I actually understand the error
> > message or what I'm supposed to do about it.  How is it even possible
> > for stop and start jobs to clash like this?

On Mon Apr 27 19:59:11 UTC 2020, Uoti Urpala wrote;
> See the systemctl documentation for the --job-mode option. Basically
> this should be due to an external request from somewhere to start the
> unit while systemd is in the middle of processing a previously started
> stop command for it.

The documentation says that the default option for --job-mode is "replace", i.e.
any conflicting pending job will be replaced, as necessary.

How does it make sense for systemd to allow stop and start requests to
contradict anyway?  This is the bit I'm finding very hard to understand.
I don't think your explanation can be right.  Surely, if a start command is
issued while a previous stop is still being processed, that is not an error.
The start request should be followed as soon as the stop is finished.

Also I'm unclear on what stop/start job is involved here, it's not sshd.service
that's going to stop and start when a session is opened for a user.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-04-27 Thread Mark Bannister
I've been trying to find the cause of a 'transaction is destructive' error
from systemd.  I'm testing systemd-219-67.el7_7.4 with the following patch
applied:

https://github.com/systemd/systemd/commit/cf99f8eacf1c864b19a6a02edea78c43f3185cb7

One of the error messages I've been trying to explain is reported like this:

2020-04-22T00:45:47.047710-04:00 jupiter systemd[1]: Requested transaction
contradicts existing jobs: Transaction for session-752473.scope/start is
destructive (user-16027.slice has 'stop' job queued, but 'start' is
included in transaction).
2020-04-22T00:45:47.047985-04:00 jupiter systemd[1]: Removed slice User
Slice of userlp.
2020-04-22T00:45:47.048159-04:00 jupiter systemd-logind[1402]: Failed to
start session scope session-752473.scope: Transaction for
session-752473.scope/start is destructive (user-16027.slice has 'stop' job
queued, but 'start' is included in transaction).


This also coincides with a message from sshd:

2020-04-22T00:45:47.049949-04:00 jupiter sshd[20984]:
pam_unix(sshd:session): session opened for user userlp by (uid=0)
2020-04-22T00:45:51.181665-04:00 jupiter sshd[20984]:
pam_unix(sshd:session): session closed for user userlp


I'm not sure where to begin to troubleshoot the problem.  The systemd error
is occurring sporadically and I haven't yet found a way to reproduce it.  I
also can't say I actually understand the error message or what I'm supposed
to do about it.  How is it even possible for stop and start jobs to clash
like this?

If someone could please provide a little guidance here, your help will be
much appreciated.

Thanks in advance!
Mark
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel