Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-14 Thread Reindl Harald



Am 30.04.20 um 13:20 schrieb Mark Bannister:
>> On Mon, 2020-04-27 at 11:40 +0100, Mark Bannister wrote:
> 
>>> One of the error messages I've been trying to explain is reported like this:
>>> 2020-04-22T00:45:47.047710-04:00 jupiter systemd[1]: Requested transaction
>>> contradicts existing jobs: Transaction for session-752473.scope/start is
>>> destructive (user-16027.slice has 'stop' job queued, but 'start' is
>>> included in transaction).

my problem with all these scopes, sessions and slices is that the logging
is completely useless without any username or service name, and that it's
often split across all three of them
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-07 Thread Kumar Kartikeya Dwivedi
On Thu, 7 May 2020 at 21:48, Mark Bannister  wrote:
> I wasn't sure how to follow your exact steps because I wasn't clear
> when you wrote session-N.scope whether I was expected to replace N
> with a number and if so where that number comes from?  Please confirm.

The 'N' in session-N.scope is replaced by the active session ID; it's
what you see in loginctl under the SESSION column for your user.

For example: $ loginctl

SESSION  UID USER SEAT  TTY
 75 1000 kkd  seat0 tty2

1 sessions listed.

Here, my session scope is session-75.scope. However, there may be
multiple sessions, so the best way to find out on CentOS is to cat
/proc/self/cgroup; you will see something like
"0::/user.slice/user-1000.slice/session-75.scope", where the scope
unit name is present. `loginctl user-status` should list the active
session with an asterisk (*) in the Sessions: field.

You can also use:
$ busctl get-property org.freedesktop.login1
/org/freedesktop/login1/session/self org.freedesktop.login1.Session Id
which returns: s "75".
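For scripting this, the cgroup line can be parsed directly. A minimal Python sketch (the helper name and sample lines are my own, not from systemd):

```python
def scope_from_cgroup(cgroup_line):
    """Return the session scope unit name from one /proc/self/cgroup line,
    or None if the process is not inside a session scope."""
    # A cgroup v2 line looks like "0::/user.slice/user-1000.slice/session-75.scope";
    # the path is everything after the second colon.
    path = cgroup_line.split(":", 2)[-1]
    for part in path.strip().split("/"):
        if part.startswith("session-") and part.endswith(".scope"):
            return part
    return None

print(scope_from_cgroup("0::/user.slice/user-1000.slice/session-75.scope"))
# -> session-75.scope
```

The same extraction works for v1-style lines such as "1:name=systemd:/user.slice/...", since only the path component is inspected.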

If you are able to reproduce this on 219-67, please also attach the output of:
systemctl show session-N.scope
where 'N' is your session ID.

--
Kartikeya


[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-07 Thread Mark Bannister
On Thu May 7 15:13:14 UTC 2020, Kumar Kartikeya Dwivedi  wrote:
> Well, maybe you're mixing two issues into one. The original report was
> about session scopes failing due to a conflict of job types, which you
> should not be seeing on 219-67

Sorry if I confused the thread by mentioning systemd-run, in my
ignorance I presumed that systemd-run was doing more or less the same
as opening a new SSH session would have done inside the machinery of
systemd.  So let's forget about that for a moment, other than
recording the fact that I reproduced it once with systemd-run on
219-67.  Prior to my last response this entire thread was about one
single issue occurring when SSH sessions were opening and closing.

The original report is definitely happening on 219-67 which means that
the github issue you found is not the cause (because that patch is not
included in 219-67).  This much is for certain.

I wasn't sure how to follow your exact steps because I wasn't clear
when you wrote session-N.scope whether I was expected to replace N
with a number and if so where that number comes from?  Please confirm.

Thanks
Mark


Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-07 Thread Kumar Kartikeya Dwivedi
On Thu, 7 May 2020 at 19:31, Mark Bannister  wrote:
>
> You're missing something, although I don't know what, because we're
> definitely seeing this on 219-67 from
> http://vault.centos.org/7.7.1908/os/Source/SPackages/ and it
> definitely does not contain the change that you've linked to.  The
> only custom patch we've applied is
> https://github.com/systemd/systemd/commit/cf99f8eacf1c864b19a6a02edea78c43f3185cb7.
> Something else is clearly going on here.
>

Well, maybe you're mixing two issues into one. The original report was
about session scopes failing due to a conflict of job types, which you
should not be seeing on 219-67 (I have no idea how you would without
the session scope having a Requires= dependency on the slice, which
wouldn't be the case on that version), but it is certainly possible to
get this message otherwise in other cases where two jobs conflict and
one of those cannot be deleted due to the relationship constraints. I
cannot tell without more context why, but it would be wise to first
try and reproduce the problem with the latest systemd version to see
if it is already fixed. These errors are less common now as a lot of
effort has been spent on finding and fixing some of these corner cases
over the years, especially after v219.

> Thanks for the suggestions on how to reproduce.  I managed to
> reproduce it once (first time lucky) with:
>
> # service=$(systemd-run --uid=$RUNUSER /tmp/myscript 2>&1 | awk
> '{sub("\\.$", ""); print $NF}'); systemctl --no-block stop $service &
> systemctl --no-block stop user-$RUNUID.slice; sleep 1; systemctl
> list-jobs
>
> ... where /tmp/myscript is:
>
> #!/bin/sh
> trap '' INT TERM HUP
> sleep 1d
>
> However, after the first success, repeated attempts failed to reproduce it.
>

Sigh...
What you posted above is very different from what I asked you to
follow (where the conflicting job was queued by logind). I can't
really tell what you did there. How did you trigger the conflicting
job, and what was the job mode used?

If you tried logging in to see if it triggers the problem: since this
unit runs under the system instance (it very likely does, judging from
your systemd-run invocation), it never creates any dependency on the
user slice at all, and therefore does *not* keep it waiting; hence the
race is still just as narrow, unless you specify
--slice=user-UID.slice.

Moreover, on 219-67, it would never create a conflict (given that
systemctl show lists user-UID.slice under Wants= and not Requires=).

Also, were you able to reproduce with the steps I listed?

> While most of the time this is triggered by an SSH session, I have
> discovered that we have one script that uses systemd-run and which
> failed to launch a command with this 'transaction is destructive'
> error.  Could it be that when this problem is triggered, a systemd-run
> invocation could also fail?

You should see which service is said to have conflicting job types
(given that you applied the verbose logging patch). However, if you
can't reproduce them with the latest systemd, it would be best to ask
your distribution maintainers to integrate the fixes back into the
systemd version they maintain.

> Alas systemd is so big and so baked into everything these days that we
> daren't do more than add a handful of custom patches to the version
> shipped with the release.  So if there is a small patch we can add
> that fixes it in both 219-67 and 219-73, that would be immensely
> helpful.

I think the PR I mentioned before should help with the logind issue,
but I can't tell if it applies cleanly to 219 (and it really only
helps post 219-69; as far as I am concerned, the logind message at
least cannot appear on 219-67, where the scope Wants= the slice;
convince me by showing evidence to the contrary). If you're seeing
this with other cases, those changes won't help you.

--
Kartikeya


[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-07 Thread Mark Bannister
On Wed May 6 17:49:02 UTC 2020, Kumar Kartikeya Dwivedi  wrote:
> This issue is NOT reproducible with the version you reported
> originally. The stock CentOS image on my cloud provider comes with the
> same version (219-67.el7_7.4) that you said you see this issue with,
> but that can't be, as it doesn't include this change:
> https://github.com/systemd-rhel/rhel-7/commit/0600681f04e3818282a2d518ec3e6afee85f7978,
> unless I'm missing something. It took me a while to figure this out...

You're missing something, although I don't know what, because we're
definitely seeing this on 219-67 from
http://vault.centos.org/7.7.1908/os/Source/SPackages/ and it
definitely does not contain the change that you've linked to.  The
only custom patch we've applied is
https://github.com/systemd/systemd/commit/cf99f8eacf1c864b19a6a02edea78c43f3185cb7.
Something else is clearly going on here.

Thanks for the suggestions on how to reproduce.  I managed to
reproduce it once (first time lucky) with:

# service=$(systemd-run --uid=$RUNUSER /tmp/myscript 2>&1 | awk
'{sub("\\.$", ""); print $NF}'); systemctl --no-block stop $service &
systemctl --no-block stop user-$RUNUID.slice; sleep 1; systemctl
list-jobs

... where /tmp/myscript is:

#!/bin/sh
trap '' INT TERM HUP
sleep 1d

However, after the first success, repeated attempts failed to reproduce it.

> All in all, you can't do much about these errors, they're mostly
> harmless (at worst you need to retry the login)

While most of the time this is triggered by an SSH session, I have
discovered that we have one script that uses systemd-run and which
failed to launch a command with this 'transaction is destructive'
error.  Could it be that when this problem is triggered, a systemd-run
invocation could also fail?

> and upgrading to v228
> (the patchset mentioned in the previous exchange) or higher should fix
> all of this for you.

Alas systemd is so big and so baked into everything these days that we
daren't do more than add a handful of custom patches to the version
shipped with the release.  So if there is a small patch we can add
that fixes it in both 219-67 and 219-73, that would be immensely
helpful.

Best regards
Mark


Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-06 Thread Kumar Kartikeya Dwivedi
On Wed, 6 May 2020 at 15:58, Mark Bannister  wrote:
>
> I repeated the exact same test with a different user account, one
> set-up specifically for this test and nothing else, so there was only
> one SSH session.  But 'systemctl list-jobs' always reported 'No jobs
> running' even after the SSH session was terminated.  My script kept
> running with the parent PID of 1.  Thanks for your help so far, but I
> have still been unable to reproduce the problem.  What am I doing
> wrong?
>

Alright, so I dug into this, and managed to reproduce it at last.

This issue is NOT reproducible with the version you reported
originally. The stock CentOS image on my cloud provider comes with the
same version (219-67.el7_7.4) that you said you see this issue with,
but that can't be, as it doesn't include this change:
https://github.com/systemd-rhel/rhel-7/commit/0600681f04e3818282a2d518ec3e6afee85f7978,
unless I'm missing something. It took me a while to figure this out...

This was added in 219-69 (see the tag). I guess you built your copy
with some custom patches, which pulled it in. This is essentially why
the transaction becomes destructive. Prior to this change, the session
scope added a Wants= dependency, instead of Requires=, on
user-UID.slice, so the conflicting start job could simply be *dropped*
instead of failing the transaction on a login attempt. The change
upgrades Slice= deps to Requires=, so logins start failing when a stop
job is waiting and a start job is included in a newly generated
transaction. You can verify this with systemctl show session-N.scope:
219-67 says Wants=user-UID.slice, while 219-69 and upwards say
Requires=user-UID.slice.
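The Wants= vs Requires= difference can be sketched with a toy model (illustrative only; this is not systemd's transaction code, and the function name is invented):

```python
# Toy model: a queued job conflicts with a requested one; whether the
# transaction fails depends on how "hard" the dependency that pulled
# the requested job in is.

def build_transaction(requested, queued, dep_strength):
    """requested/queued: (unit, job_type) pairs.
    dep_strength: 'Wants' (soft) or 'Requires' (hard)."""
    unit, new_job = requested
    q_unit, q_job = queued
    if unit != q_unit or new_job == q_job:
        return [requested]   # no conflict: different unit, or mergeable job
    if dep_strength == "Wants":
        return []            # soft dep: drop the conflicting job, keep going
    raise RuntimeError(
        f"Transaction is destructive ({q_unit} has '{q_job}' job queued, "
        f"but '{new_job}' is included in transaction)")

# 219-67 behaviour: scope Wants= the slice, conflicting start is dropped
assert build_transaction(("user-1000.slice", "start"),
                         ("user-1000.slice", "stop"), "Wants") == []
# 219-69+: Requires= makes the same conflict fail the whole transaction
try:
    build_transaction(("user-1000.slice", "start"),
                      ("user-1000.slice", "stop"), "Requires")
except RuntimeError as e:
    print(e)
```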

One key thing I missed was that systemd on CentOS wouldn't clean up
the session for the user since KillUserProcesses=no by default, and
also the fact that the user slice's stop job is not added until all
session scopes finish cleaning up. So the scope is simply abandoned,
and should exist until all processes die themselves.

So, observing this issue is not easy: it only triggers if you try to
log in right when systemd tries to stop the user slice and the job
exists in the queue (which is a very short window), and it took a lot
of tries before I could reproduce it (for a start, I manually queued
the stop job for the slice with a hung session scope and tried logging
in to verify the transaction problem).

So this is racy in nature. The sequence of events can be that your
processes die, the session scopes are gone, and now the manager will
try to shut down the user slice; if you happen to be in the middle of
a login at this moment, pam_systemd's call to CreateSession would
conflict with the already waiting stop job. It doesn't matter whether
KillUserProcesses= is set or not; that just decides if the scopes are
cleaned up on logout (they would become inactive when all session
processes are gone anyway). It does decide when user-UID.slice is
stopped, however.

The reason you didn't see any jobs was that the scope was abandoned,
and your long-lived process pinned it. So the session scope never
stops, and as a consequence neither does the slice.

Here are a series of steps to widen the race (by queuing a stop job
for the user slice) and reproduce it:
1. Create a script badproc.sh; chmod +x badproc.sh
#!/bin/bash
trap "" SIGHUP
trap "" SIGINT
trap "" SIGTERM
exec "$@"

2. From the user's shell, run: ./badproc.sh sleep infinity & disown; then ^D (logout)
The scope is now abandoned, or being stopped (if KillUserProcesses=yes)
3. From root's shell, do:
systemctl --no-block stop session-N.scope; systemctl --no-block stop
user-UID.slice;
systemctl list-jobs should list both jobs, the scope job as running,
and the slice job as waiting.

The sequence that actually happens is processes dying, (last) session
scope stopping, user slice stopping, in order. The point at which it
fails is the user slice stopping, which is a very very short time
window, so it makes sense if you only see these sporadically.

4. Try a login, which gives me:
May 06 17:03:12 systemd[1]: Requested transaction contradicts existing
jobs: Transaction is destructive.
May 06 17:03:12 sshd[3134]: pam_systemd(sshd:session): Failed to
create session: Resource deadlock avoided
May 06 17:03:12 systemd-logind[3078]: Failed to start session scope
session-37.scope: Transaction is destructive.

All in all, you can't do much about these errors, they're mostly
harmless (at worst you need to retry the login), and upgrading to v228
(the patchset mentioned in the previous exchange) or higher should fix
all of this for you.

--
Kartikeya


[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-06 Thread Mark Bannister
> On Tue, 5 May 2020 at 16:27, Mark Bannister  wrote:
> > Perhaps I'm missing something or oversimplifying, but I tried to
> > reproduce the problem as follows: I created a script that ignores INT,
> > TERM and HUP and then loops indefinitely writing the time every second
> > to a log file.  I SSHed into the box, ran the script, and then from
> > another terminal window killed the parent (bash) and also retried the
> > experiment by killing the bash parent (sshd).  In both cases the
> > parent bash and sshd processes died (as expected), my script kept
> > running but the parent PID of the script was changed to 1 (systemd).
> > I did not receive any error messages in /var/log/messages.
>
> It has to be the last remaining session of the user, and you should
> inspect with systemctl list-jobs if systemd is trying to stop the
> scope and slice units, just to be sure, before attempting to login.
> Otherwise, if you're logged in from two places on the machine for a
> user, you won't hit the issue, obviously.

I repeated the exact same test with a different user account, one
set-up specifically for this test and nothing else, so there was only
one SSH session.  But 'systemctl list-jobs' always reported 'No jobs
running' even after the SSH session was terminated.  My script kept
running with the parent PID of 1.  Thanks for your help so far, but I
have still been unable to reproduce the problem.  What am I doing
wrong?

Thanks
Mark


Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-05 Thread Kumar Kartikeya Dwivedi
On Tue, 5 May 2020 at 16:27, Mark Bannister  wrote:
>
> So if I'm understanding correctly, your suggestion is that if an SSH
> session runs a process that ignores SIGTERM and the session is closed
> and then within 90 seconds the same user attempts to open a new SSH
> session, it should trigger this error condition?

Given the session you just logged out of was the last session, yes,
otherwise logind won't GC your user. The only requirement is for
something to delay the session scope's stopping, which should keep the
user slice's stop job waiting, which would allow you to reproduce this
at will.

>
> Perhaps I'm missing something or oversimplifying, but I tried to
> reproduce the problem as follows: I created a script that ignores INT,
> TERM and HUP and then loops indefinitely writing the time every second
> to a log file.  I SSHed into the box, ran the script, and then from
> another terminal window killed the parent (bash) and also retried the
> experiment by killing the bash parent (sshd).  In both cases the
> parent bash and sshd processes died (as expected), my script kept
> running but the parent PID of the script was changed to 1 (systemd).
> I did not receive any error messages in /var/log/messages.
>

It has to be the last remaining session of the user, and you should
inspect with systemctl list-jobs if systemd is trying to stop the
scope and slice units, just to be sure, before attempting to login.
Otherwise, if you're logged in from two places on the machine for a
user, you won't hit the issue, obviously.

--
Kartikeya


[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-05 Thread Mark Bannister
On Mon May 4 12:44:12 UTC 2020, Kumar Kartikeya Dwivedi  wrote:
> For your case, it could be that some process running under
> session-N.scope is not responding to SIGTERM, and this in turn is
> making session-N.scope take time to stop (it has a default timeout of
> 90s), because it has to wait for all processes run under it to die
> before exiting. This also means that your user-UID.slice is waiting
> for session-N.scope before being stopped.

So if I'm understanding correctly, your suggestion is that if an SSH
session runs a process that ignores SIGTERM and the session is closed
and then within 90 seconds the same user attempts to open a new SSH
session, it should trigger this error condition?

Perhaps I'm missing something or oversimplifying, but I tried to
reproduce the problem as follows: I created a script that ignores INT,
TERM and HUP and then loops indefinitely writing the time every second
to a log file.  I SSHed into the box, ran the script, and then from
another terminal window killed the parent (bash) and also retried the
experiment by killing the bash parent (sshd).  In both cases the
parent bash and sshd processes died (as expected), my script kept
running but the parent PID of the script was changed to 1 (systemd).
I did not receive any error messages in /var/log/messages.

Any ideas how I might reproduce this problem at will?

Thanks
Mark


Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-04 Thread Kumar Kartikeya Dwivedi
On Mon, 4 May 2020 at 16:02, Mark Bannister  wrote:
>
> ...
>
> When you say the session slice is taking time to clean up, please
> excuse my ignorance as I'm not really up to speed on how session
> slices are being managed by systemd, but why would it matter if a
> slice takes time to clean up?  Is there a limit on how many SSH
> session slices a user can have or something?  I don't see how that
> would cause this particular error.
>

I think 'session slice' was poorly worded; I only realize that now. I
should have said user slice.

This is the cgroup hierarchy after a few logins:

user-UID.slice
 |- user@UID.service
 |- session-1.scope
 |- ...
 `- session-N.scope

I see that CentOS 7 doesn't really support systemd --user, so
user@UID.service isn't present on your system, but I'll include it in
the complete picture anyway.

You can have multiple session scopes, one per login. Unless linger is
enabled, when the last session goes away, systemd-logind will GC the
user object it keeps by stopping the session-N.scope(s),
user@UID.service and user-UID.slice, and then freeing the internal
bookkeeping structures. This means the session scopes, user@UID.service
and user-UID.slice will have stop jobs enqueued, and since session
scopes and user@UID.service are ordered after their parent slice, they
will be stopped first, and user-UID.slice will wait for them to stop
before its own stop job is run by systemd.

For your case, it could be that some process running under
session-N.scope is not responding to SIGTERM, and this in turn is
making session-N.scope take time to stop (it has a default timeout of
90s), because it has to wait for all processes run under it to die
before exiting. This also means that your user-UID.slice is waiting
for session-N.scope before being stopped.

Now, if you try to log in at this point, v219 has no checks to prevent
this problem:
UNIT            : JOB  : STATUS
session-N.scope : stop : running
user-UID.slice  : stop : waiting

and logind tries to start session-N+1.scope, which needs user-UID.slice/start.

The job mode used is "fail", which means if a conflicting job which
start cannot merge into is queued (like stop), fail the transaction.
This is why you see "user-17132.slice has 'stop' job queued, but
'start' is included in transaction". The transaction being mentioned
in this error is the one generated from session-N+1.scope/start.
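As a rough sketch of what the job modes do with a unit's single job slot (again a toy model, not systemd's implementation):

```python
# Toy model: each unit has one job slot; an incoming job either merges,
# replaces the queued job, or fails the request, depending on the mode.

def enqueue(slot, new_job, mode):
    """slot: currently queued job type (or None). Returns the resulting slot."""
    if slot is None or slot == new_job:
        return new_job    # empty slot, or same-type job that can merge
    if mode == "replace":
        return new_job    # cancel the queued job and seat the new one
    raise RuntimeError("Transaction is destructive")  # mode == "fail"

assert enqueue(None, "start", "fail") == "start"
assert enqueue("stop", "start", "replace") == "start"   # stop job cancelled
try:
    enqueue("stop", "start", "fail")   # what logind's "fail" mode runs into
except RuntimeError as e:
    print("login fails:", e)
```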

The fixes landed in v228 (see
https://github.com/systemd/systemd/pull/1918): the job mode used by
manager_start_unit was changed to "replace", which resolves the
problem because by the time manager_start_scope is called (which still
uses "fail"), user-UID.slice already has a start job waiting, which
cancelled the sitting stop job from the previous call. So a login may
be delayed, but it won't fail. The problem you're seeing only happens
if you log in again within 90s of logging out of the last user session
(considering the default timeout).

So, upgrading systemd should solve your problem. If that is not an
option, I am not sure there is much you can do. You'd need to modify
the TimeoutStopSec= value for scopes, which isn't configurable as the
scope is created at runtime by logind, short of changing the default
for PID 1, which affects every other service on the system.

Enabling linger is another solution. It should fix the problem by not
tearing down the user slice on last logout, but it also means the user
slice is created on boot. Since you don't have systemd --user, it
should not take up any resources (unless I'm missing something);
otherwise it wouldn't be desirable at all, of course. Note that you'd
have to enable it for every user, or allow users to do so through
polkit, if you want this to work for everyone. Creating a file for
each username under /var/lib/systemd/linger (like
/var/lib/systemd/linger/kkd), or allowing users to run loginctl
enable-linger themselves, should suffice.

> ...
>
> ... we also see this message in /var/log/secure:
>
2020-05-03T16:09:38.737122-04:00 jupiter sshd[11031]:
> pam_systemd(sshd:session): Failed to create session: Resource deadlock
> avoided

Yes, this is the string for the errno returned from the bus call,
EDEADLOCK, as the transaction is destructive in nature.

> Does this 'Resource deadlock avoided' message from pam_systemd help
> identify the root
> cause, or is that just a side-effect?

A side-effect, see above.

--
Kartikeya


[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-05-04 Thread Mark Bannister
On Thu Apr 30 19:01:51 UTC 2020, Kumar Kartikeya Dwivedi  wrote:
> My educated guess is that the session slice is taking time to clean up
> because of the default stop timeout due to processes in the session
> not responding to SIGTERM (it is waiting for dependent units to stop
> before stopping itself by virtue of being ordered after them), and in
> the meantime a new login is initiated. Now, the stop job on the user
> slice is still waiting, and looking at the code in logind,
> manager_start_scope uses the job mode fail, which means the start job
> for the user slice won't be able to cancel the sitting stop job, hence
> the transaction cannot be applied.

When you say the session slice is taking time to clean up, please
excuse my ignorance as I'm not really up to speed on how session
slices are being managed by systemd, but why would it matter if a
slice takes time to clean up?  Is there a limit on how many SSH
session slices a user can have or something?  I don't see how that
would cause this particular error.

This is happening a few times a day.  Looking at the logs, we have
SSH sessions opening and closing successfully all day long for
various different accounts.  Occasionally we see the problematic
error message.  I've correlated with other messages and discovered
that at the same time that we get this in /var/log/messages (the
original reported error):

2020-05-03T16:09:38.735265-04:00 jupiter systemd[1]: Requested
transaction contradicts existing jobs: Transaction for
session-240481.scope/start is destructive (user-17132.slice has 'stop'
job queued, but 'start' is included in transaction).
2020-05-03T16:09:38.735430-04:00 jupiter systemd-logind[1588]: Failed
to start session scope session-240481.scope: Transaction for
session-240481.scope/start is destructive (user-17132.slice has 'stop'
job queued, but 'start' is included in transaction).
2020-05-03T16:09:38.735642-04:00 jupiter systemd[1]: Removed slice
User Slice of jupiter-user.

... we also see this message in /var/log/secure:

2020-05-03T16:09:38.737122-04:00 jupiter sshd[11031]:
pam_systemd(sshd:session): Failed to create session: Resource deadlock
avoided
2020-05-03T16:09:38.737380-04:00 jupiter sshd[11031]:
pam_unix(sshd:session): session opened for user jupiter-user by
(uid=0)

Does this 'Resource deadlock avoided' message from pam_systemd help
identify the root cause, or is that just a side-effect?

Thanks
Mark


Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-04-30 Thread Kumar Kartikeya Dwivedi
On Thu, 30 Apr 2020 at 23:43, Uoti Urpala  wrote:
>
> On Thu, 2020-04-30 at 22:18 +0530, Kumar Kartikeya Dwivedi wrote:
> > waiting for the stop request initiated previously to finish). Even if
> > you use fail as the job mode, the error you get back is "Transaction
> > is destructive" or so, not this.
>
> IIRC the patch he mentioned in his mail was one that changed the error
> message from "Transaction is destructive" to this.
>

I see, it seems I overlooked that part. Thanks for correcting me.

Then this is indeed the stage where it tries to apply the transaction,
rather than conflicting jobs in the same transaction, so what I said
above probably does not apply at all.

I used a test case locally to try and reproduce the problem, and it
only manifests when the JOB_FAIL job mode is used, otherwise the stop
job on the unit is cancelled and replaced with the start job.

Something like this:

# anchor.service
[Unit]
Requires=that.slice
After=that.slice
[Service]
ExecStart=/bin/true

# other.service
[Service]
# put this unit in the slice so its slow stop keeps the slice's stop job queued
Slice=that.slice
ExecStart=/bin/sleep infinity
# delay the stop a bit
ExecStop=/bin/sleep 10

Now, after:
$ systemctl start other.service --no-block; systemctl stop that.slice --no-block

$ systemctl start anchor.service --job-mode fail
fails, while
$ systemctl start anchor.service --job-mode replace
succeeds.

Either that, or the stop job installed is irreversible (you can
emulate that by using --job-mode replace-irreversibly when stopping
that.slice). So these are the only possibilities. The code hasn't
changed at all since v219, either.

My educated guess is that the session slice is taking time to clean up
because of the default stop timeout due to processes in the session
not responding to SIGTERM (it is waiting for dependent units to stop
before stopping itself by virtue of being ordered after them), and in
the meantime a new login is initiated. Now, the stop job on the user
slice is still waiting, and looking at the code in logind,
manager_start_scope uses the job mode fail, which means the start job
for the user slice won't be able to cancel the sitting stop job, hence
the transaction cannot be applied.

--
Kartikeya


Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-04-30 Thread Uoti Urpala
On Thu, 2020-04-30 at 22:18 +0530, Kumar Kartikeya Dwivedi wrote:
> waiting for the stop request initiated previously to finish). Even if
> you use fail as the job mode, the error you get back is "Transaction
> is destructive" or so, not this.

IIRC the patch he mentioned in his mail was one that changed the error
message from "Transaction is destructive" to this.




Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-04-30 Thread Kumar Kartikeya Dwivedi
On Thu, 30 Apr 2020 at 16:56, Mark Bannister  wrote:
>
> The documentation says that the default option for --job-mode is "replace", 
> i.e.
> any conflicting pending job will be replaced, as necessary.
>
> How does it make sense for systemd to allow stop and start requests to
> contradict anyway?  This is the bit I'm finding very hard to understand.
> I don't think your explanation can be right.  Surely, if a start command is
> issued while a previous stop is still being processed, that is not an error.
> The start request should be followed as soon as the stop is finished.

Yes, this error is not raised by external jobs. When you use the
replace job mode, the job sitting in the unit's job slot is always
unseated if possible, and the other job either becomes 'runnable'
instantly, or in some cases waits for the other job to finish before
becoming runnable, emulating a queue (e.g. a service unit's start job
waiting for the previously initiated stop request to finish). Even if
you use fail as the job mode, the error you get back is "Transaction
is destructive" or so, not this.

The problem you're seeing occurs when the same transaction pulls in
two conflicting, unmergeable jobs: for example, a unit being started
adds a start job, and some unit in the dependency chain has Conflicts=
on it, which pulls in a stop job for the same unit. These are
contradictory state-change requests, and there's no easy way to
resolve them automatically (unlike merging, where jobs that bring the
unit to a compatible state are merged into a single job).
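A minimal sketch of that failure mode, as a toy model. This is an assumed simplification of the transaction builder, not systemd's code; the function, the MERGEABLE table, and the dependency maps are all invented for illustration:

```python
# Assumed sketch of why one transaction can be "destructive":
# dependency expansion pulls both a start job (via Requires=) and a
# stop job (via Conflicts=) for the same unit, and those two job
# types cannot be merged.

MERGEABLE = {("start", "start"), ("stop", "stop")}  # illustrative subset

def build_transaction(requested, requires, conflicts):
    """Expand one start request into a {unit: job} mapping."""
    jobs = {}

    def add(unit, job):
        prev = jobs.get(unit)
        if prev is None or (prev, job) in MERGEABLE:
            jobs[unit] = job
        else:
            raise RuntimeError(
                f"Transaction for {requested}/start is destructive "
                f"({unit} has '{prev}' job queued, but '{job}' is included "
                f"in transaction)")

    add(requested, "start")
    for dep in requires.get(requested, []):       # Requires= pulls starts
        add(dep, "start")
    for unit, victims in conflicts.items():       # Conflicts= pulls stops
        if jobs.get(unit) == "start":
            for v in victims:
                add(v, "stop")
    return jobs

# The scope both requires the slice and (hypothetically) conflicts
# with it, so the same transaction carries start *and* stop for it:
try:
    build_transaction(
        "session-752473.scope",
        requires={"session-752473.scope": ["user-16027.slice"]},
        conflicts={"session-752473.scope": ["user-16027.slice"]})
except RuntimeError as e:
    print(e)
```

Running the sketch prints an error of the same shape as the one in the original report, which is the whole point: the contradiction is internal to one transaction, not a race between two requests.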

This has been reported before:
* The first one explains in detail when and why this happens:
https://github.com/systemd/systemd/issues/11440
* This one is more of a real life example:
https://github.com/systemd/systemd/issues/2830

The easiest way to debug problems like these is to enable debug
logging for the service manager, examine the transaction in the logs,
and inspect your units in the dependency chain to pin down the culprit.
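One way to turn on that debug logging persistently is via the manager configuration file; this fragment uses the LogLevel= option documented in systemd-system.conf(5), and takes effect after the manager re-reads its configuration (at runtime, `systemd-analyze set-log-level debug` should achieve the same without editing files — verify both against your systemd version, e.g. 219 on CentOS 7):

```ini
# /etc/systemd/system.conf
[Manager]
LogLevel=debug
```

With debug logging enabled, the journal will show each transaction being built, including which dependency pulled in which job, which is usually enough to pin down the conflicting unit.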

--
Kartikeya


[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-04-30 Thread Mark Bannister
> On Mon, 2020-04-27 at 11:40 +0100, Mark Bannister wrote:

> > One of the error messages I've been trying to explain is reported like this:
> > 2020-04-22T00:45:47.047710-04:00 jupiter systemd[1]: Requested transaction
> > contradicts existing jobs: Transaction for session-752473.scope/start is
> > destructive (user-16027.slice has 'stop' job queued, but 'start' is
> > included in transaction).

> > This also coincides with a message from sshd:
> > 2020-04-22T00:45:47.049949-04:00 jupiter sshd[20984]: 
> > pam_unix(sshd:session): session opened for user userlp by (uid=0)
> > 2020-04-22T00:45:51.181665-04:00 jupiter sshd[20984]: 
> > pam_unix(sshd:session): session closed for user userlp

> > I'm not sure where to begin to troubleshoot the problem.  The systemd
> > error is occurring sporadically and I haven't yet found a way to
> > reproduce it.  I also can't say I actually understand the error
> > message or what I'm supposed to do about it.  How is it even possible
> > for stop and start jobs to clash like this?

On Mon Apr 27 19:59:11 UTC 2020, Uoti Urpala wrote:
> See the systemctl documentation for the --job-mode option. Basically
> this should be due to an external request from somewhere to start the
> unit while systemd is in the middle of processing a previously started
> stop command for it.

The documentation says that the default option for --job-mode is "replace", i.e.
any conflicting pending job will be replaced, as necessary.

How does it make sense for systemd to allow stop and start requests to
contradict anyway?  This is the bit I'm finding very hard to understand.
I don't think your explanation can be right.  Surely, if a start command is
issued while a previous stop is still being processed, that is not an error.
The start request should be followed as soon as the stop is finished.

Also, I'm unclear on which stop/start job is involved here; it's not
sshd.service that stops and starts when a session is opened for a user.


Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-04-27 Thread Uoti Urpala
On Mon, 2020-04-27 at 11:40 +0100, Mark Bannister wrote:
> I'm not sure where to begin to troubleshoot the problem.  The systemd
> error is occurring sporadically and I haven't yet found a way to
> reproduce it.  I also can't say I actually understand the error
> message or what I'm supposed to do about it.  How is it even possible
> for stop and start jobs to clash like this?

See the systemctl documentation for the --job-mode option. Basically
this should be due to an external request from somewhere to start the
unit while systemd is in the middle of processing a previously started
stop command for it.

The log messages suggest that this might be related to trying to start
a new ssh session while the per-user slice is still being shut down
after closing all previous login sessions; I'm not familiar enough with
this area to tell how things are supposed to work in this situation.




[systemd-devel] Requested transaction contradicts existing jobs: start is destructive

2020-04-27 Thread Mark Bannister
I've been trying to find the cause of a 'transaction is destructive' error
from systemd.  I'm testing systemd-219-67.el7_7.4 with the following patch
applied:

https://github.com/systemd/systemd/commit/cf99f8eacf1c864b19a6a02edea78c43f3185cb7

One of the error messages I've been trying to explain is reported like this:

2020-04-22T00:45:47.047710-04:00 jupiter systemd[1]: Requested transaction
contradicts existing jobs: Transaction for session-752473.scope/start is
destructive (user-16027.slice has 'stop' job queued, but 'start' is
included in transaction).
2020-04-22T00:45:47.047985-04:00 jupiter systemd[1]: Removed slice User
Slice of userlp.
2020-04-22T00:45:47.048159-04:00 jupiter systemd-logind[1402]: Failed to
start session scope session-752473.scope: Transaction for
session-752473.scope/start is destructive (user-16027.slice has 'stop' job
queued, but 'start' is included in transaction).


This also coincides with a message from sshd:

2020-04-22T00:45:47.049949-04:00 jupiter sshd[20984]:
pam_unix(sshd:session): session opened for user userlp by (uid=0)
2020-04-22T00:45:51.181665-04:00 jupiter sshd[20984]:
pam_unix(sshd:session): session closed for user userlp


I'm not sure where to begin to troubleshoot the problem.  The systemd error
is occurring sporadically and I haven't yet found a way to reproduce it.  I
also can't say I actually understand the error message or what I'm supposed
to do about it.  How is it even possible for stop and start jobs to clash
like this?

If someone could please provide a little guidance here, your help will be
much appreciated.

Thanks in advance!
Mark