[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
At Thu May 7 15:13:14 UTC 2020, Kumar Kartikeya Dwivedi wrote: > Well, maybe you're mixing two issues into one. The original report was > about session scopes failing due to a conflict of job types, which you > should not be seeing on 219-67 Sorry if I confused the thread by mentioning systemd-run, in my ignorance I presumed that systemd-run was doing more or less the same as opening a new SSH session would have done inside the machinery of systemd. So let's forget about that for a moment, other than recording the fact that I reproduced it once with systemd-run on 219-67. Prior to my last response this entire thread was about one single issue occurring when SSH sessions were opening and closing. The original report is definitely happening on 219-67 which means that the github issue you found is not the cause (because that patch is not included in 219-67). This much is for certain. I wasn't sure how to follow your exact steps because I wasn't clear when you wrote session-N.scope whether I was expected to replace N with a number and if so where that number comes from? Please confirm. Thanks Mark ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Wed May 6 17:49:02 UTC 2020, Kumar Kartikeya Dwivedi wrote: > This issue is NOT reproducible with the version you reported > originally. The stock CentOS image on my cloud provider comes with the > same version (219-67.el7_7.4) that you said you see this issue with, > but that can't be, as it doesn't include this change: > https://github.com/systemd-rhel/rhel-7/commit/0600681f04e3818282a2d518ec3e6afee85f7978, > unless I'm missing something. It took me a while to figure this out... You're missing something, although I don't know what, because we're definitely seeing this on 219-67 from http://vault.centos.org/7.7.1908/os/Source/SPackages/ and it definitely does not contain the change that you've linked to. The only custom patch we've applied is https://github.com/systemd/systemd/commit/cf99f8eacf1c864b19a6a02edea78c43f3185cb7. Something else is clearly going on here. Thanks for the suggestions on how to reproduce. I managed to reproduce it once (first time lucky) with: # service=$(systemd-run --uid=$RUNUSER /tmp/myscript 2>&1 | awk '{sub("\\.$", ""); print $NF}'); systemctl --no-block stop $service & systemctl --no-block stop user-$RUNUID.slice; sleep 1; systemctl list-jobs ... where /tmp/myscript is: #!/bin/sh trap '' INT TERM HUP sleep 1d However, after the first success, repeated attempts failed to reproduce it. > All in all, you can't do much about these errors, they're mostly > harmless (at worst you need to retry the login) While most of the time this is triggered by SSH session, I have discovered that we have one script that is using systemd-run and which failed to launch a command with this 'transaction is destructive' error. Could it be that when this problem is triggered, a systemd-run could fail? > and upgrading to v228 > (the patchset mentioned in the previous exchange) or higher should fix > all of this for you. Alas systemd is so big and so baked into everything these days that we daren't do more than add a handful of custom patches to the version shipped with the release. So if there is a small patch we can add that fixes it in both 219-67 and 219-73, that would be immensely helpful. Best regards Mark ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
> On Tue, 5 May 2020 at 16:27, Mark Bannister wrote: > > Perhaps I'm missing something or oversimplifying, but I tried to > > reproduce the problem as follows: I created a script that ignores INT, > > TERM and HUP and then loops indefinitely writing the time every second > > to a log file. I SSHed into the box, ran the script, and then from > > another terminal window killed the parent (bash) and also retried the > > experiment by killing the bash parent (sshd). In both cases the > > parent bash and sshd processes died (as expected), my script kept > > running but the parent PID of the script was changed to 1 (systemd). > > I did not receive any error messages in /var/log/messages. > > It has to be the last remaining session of the user, and you should > inspect with systemctl list-jobs if systemd is trying to stop the > scope and slice units, just to be sure, before attempting to login. > Otherwise, if you're logged in from two places on the machine for a > user, you won't hit the issue, obviously. I repeated the exact same test with a different user account, one set-up specifically for this test and nothing else, so there was only one SSH session. But 'systemctl list-jobs' always reported 'No jobs running' even after the SSH session was terminated. My script kept running with the parent PID of 1. Thanks for your help so far, but I have still been unable to reproduce the problem. What am I doing wrong? Thanks Mark ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Mon May 4 12:44:12 UTC 2020, Kumar Kartikeya Dwivedi wrote: > For your case, it could be that some process running under > session-N.scope is not responding to SIGTERM, and this in turn is > making session-N.scope take time to stop (it has a default timeout of > 90s), because it has to wait for all processes run under it to die > before exiting. This also means that your user-UID.slice is waiting > for session-N.scope before being stopped. So if I'm understanding correctly, your suggestion is that if an SSH session runs a process that ignores SIGTERM and the session is closed and then within 90 seconds the same user attempts to open a new SSH session, it should trigger this error condition? Perhaps I'm missing something or oversimplifying, but I tried to reproduce the problem as follows: I created a script that ignores INT, TERM and HUP and then loops indefinitely writing the time every second to a log file. I SSHed into the box, ran the script, and then from another terminal window killed the parent (bash) and also retried the experiment by killing the bash parent (sshd). In both cases the parent bash and sshd processes died (as expected), my script kept running but the parent PID of the script was changed to 1 (systemd). I did not receive any error messages in /var/log/messages. Any ideas how I might reproduce this problem at will? Thanks Mark ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Thu Apr 30 19:01:51 UTC 2020, Kumar Kartikeya Dwivedi wrote: > My educated guess is that the session slice is taking time to clean up > because of the default stop timeout due to processes in the session > not responding to SIGTERM (it is waiting for dependent units to stop > before stopping itself by virtue of being ordered after them), and in > the meantime a new login is initiated. Now, the stop job on the user > slice is still waiting, and looking at the code in logind, > manager_start_scope uses the job mode fail, which means the start job > for the user slice won't be able to cancel the sitting stop job, hence > the transaction cannot be applied. When you say the session slice is taking time to clean up, please excuse my ignorance as I'm not really up to speed on how session slices are being managed by systemd, but why would it matter if a slice takes time to clean up? Is there a limit on how many SSH session slices a user can have or something? I don't see how that would cause this particular error. This is happening a few times a day. Looking at the logs, we have SSH sessions opening and closing successfully all day long for various different accounts. Occasionally we see the problematic error message. I've correlated with other messages and discovered that at the same time that we get this in /var/log/messages (the original reported error): 2020-05-03T16:09:38.735265-04:00 jupiter systemd[1]: Requested transaction contradicts existing jobs: Transaction for session-240481.scope/start is destructive (user-17132.slice has 'stop' job queued, but 'start' is included in transaction). 2020-05-03T16:09:38.735430-04:00 jupiter systemd-logind[1588]: Failed to start session scope session-240481.scope: Transaction for session-240481.scope/start is destructive (user-17132.slice has 'stop' job queued, but 'start' is included in transaction). 2020-05-03T16:09:38.735642-04:00 jupiter systemd[1]: Removed slice User Slice of jupiter-user. ... we also see this message in /var/log/secure: 020-05-03T16:09:38.737122-04:00 jupiter sshd[11031]: pam_systemd(sshd:session): Failed to create session: Resource deadlock avoided 2020-05-03T16:09:38.737380-04:00 jupiter sshd[11031]: pam_unix(sshd:session): session opened for user jupiter-user by (uid=0) Does this 'Resource deadlock avoided' message from pam_systemd help identify the root cause, or is that just a side-effect? Thanks Mark ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
> On Mon, 2020-04-27 at 11:40 +0100, Mark Bannister wrote: > > One of the error messages I've been trying to explain is reported like this: > > 2020-04-22T00:45:47.047710-04:00 jupiter systemd[1]: Requested transaction > > contradicts existing jobs: Transaction for session-752473.scope/start is > > destructive (user-16027.slice has 'stop' job queued, but 'start' is > > included in transaction). > > This also coincides with a message from sshd: > > 2020-04-22T00:45:47.049949-04:00 jupiter sshd[20984]: > > pam_unix(sshd:session): session opened for user userlp by (uid=0) > > 2020-04-22T00:45:51.181665-04:00 jupiter sshd[20984]: > > pam_unix(sshd:session): session closed for user userlp > > I'm not sure where to begin to troubleshoot the problem. The systemd > > error is occurring sporadically and I haven't yet found a way to > > reproduce it. I also can't say I actually understand the error > > message or what I'm supposed to do about it. How is it even possible > > for stop and start jobs to clash like this? On Mon Apr 27 19:59:11 UTC 2020, Uoti Urpala wrote; > See the systemctl documentation for the --job-mode option. Basically > this should be due to an external request from somewhere to start the > unit while systemd is in the middle of processing a previously started > stop command for it. The documentation says that the default option for --job-mode is "replace", i.e. any conflicting pending job will be replaced, as necessary. How does it make sense for systemd to allow stop and start requests to contradict anyway? This is the bit I'm finding very hard to understand. I don't think your explanation can be right. Surely, if a start command is issued while a previous stop is still being processed, that is not an error. The start request should be followed as soon as the stop is finished. Also I'm unclear on what stop/start job is involved here, it's not sshd.service that's going to stop and start when a session is opened for a user. ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel
[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
I've been trying to find the cause of a 'transaction is destructive' error from systemd. I'm testing systemd-219-67.el7_7.4 with the following patch applied: https://github.com/systemd/systemd/commit/cf99f8eacf1c864b19a6a02edea78c43f3185cb7 One of the error messages I've been trying to explain is reported like this: 2020-04-22T00:45:47.047710-04:00 jupiter systemd[1]: Requested transaction contradicts existing jobs: Transaction for session-752473.scope/start is destructive (user-16027.slice has 'stop' job queued, but 'start' is included in transaction). 2020-04-22T00:45:47.047985-04:00 jupiter systemd[1]: Removed slice User Slice of userlp. 2020-04-22T00:45:47.048159-04:00 jupiter systemd-logind[1402]: Failed to start session scope session-752473.scope: Transaction for session-752473.scope/start is destructive (user-16027.slice has 'stop' job queued, but 'start' is included in transaction). This also coincides with a message from sshd: 2020-04-22T00:45:47.049949-04:00 jupiter sshd[20984]: pam_unix(sshd:session): session opened for user userlp by (uid=0) 2020-04-22T00:45:51.181665-04:00 jupiter sshd[20984]: pam_unix(sshd:session): session closed for user userlp I'm not sure where to begin to troubleshoot the problem. The systemd error is occurring sporadically and I haven't yet found a way to reproduce it. I also can't say I actually understand the error message or what I'm supposed to do about it. How is it even possible for stop and start jobs to clash like this? If someone could please provide a little guidance here, your help will be much appreciated. Thanks in advance! Mark ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/systemd-devel