Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On 30.04.20 at 13:20, Mark Bannister wrote:
>> On Mon, 2020-04-27 at 11:40 +0100, Mark Bannister wrote:
>>> One of the error messages I've been trying to explain is reported like this:
>>> 2020-04-22T00:45:47.047710-04:00 jupiter systemd[1]: Requested transaction
>>> contradicts existing jobs: Transaction for session-752473.scope/start is
>>> destructive (user-16027.slice has 'stop' job queued, but 'start' is
>>> included in transaction).

My problem with all these scopes, sessions and slices is that the logging is
completely useless without any username or service name, and that it's often
split across all three of them.

___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Thu, 7 May 2020 at 21:48, Mark Bannister wrote:
> I wasn't sure how to follow your exact steps because I wasn't clear
> when you wrote session-N.scope whether I was expected to replace N
> with a number and if so where that number comes from?  Please confirm.

The 'N' in session-N.scope is replaced by the active session ID; N is what
you see in loginctl under the SESSION column for your user. For example:

$ loginctl
SESSION  UID   USER  SEAT   TTY
     75  1000  kkd   seat0  tty2

1 sessions listed.

Here, my session scope is session-75.scope. However, there may be multiple
sessions, so the best way to find out on CentOS is to cat /proc/self/cgroup;
you will see something like
"0::/user.slice/user-1000.slice/session-75.scope", where the scope unit name
is the last component. `loginctl user-status` should list the active session
with an asterisk (*) in the Sessions: field. You can also use:

$ busctl get-property org.freedesktop.login1 /org/freedesktop/login1/session/self org.freedesktop.login1.Session Id

which returns: s "75".

If you are able to reproduce this on 219-67, please also attach the output of:

systemctl show session-N.scope

where 'N' is your session ID.

--
Kartikeya
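[Editor's note: the /proc/self/cgroup approach above is easy to script. A
minimal sketch; the sample cgroup line is hard-coded for illustration, in
practice you would read it from /proc/self/cgroup:]

```shell
# Derive the session scope unit name from a cgroup path like the one quoted
# above. The sample line is a stand-in for $(cat /proc/self/cgroup).
cgroup_line='0::/user.slice/user-1000.slice/session-75.scope'
scope=${cgroup_line##*/}   # keep only the last path component
echo "$scope"              # prints: session-75.scope
```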
[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
At Thu May 7 15:13:14 UTC 2020, Kumar Kartikeya Dwivedi wrote:
> Well, maybe you're mixing two issues into one. The original report was
> about session scopes failing due to a conflict of job types, which you
> should not be seeing on 219-67

Sorry if I confused the thread by mentioning systemd-run; in my ignorance I
presumed that systemd-run was doing more or less the same as opening a new
SSH session would have done inside the machinery of systemd. So let's forget
about that for a moment, other than recording the fact that I reproduced it
once with systemd-run on 219-67.

Prior to my last response this entire thread was about one single issue
occurring when SSH sessions were opening and closing. The original report is
definitely happening on 219-67, which means that the github issue you found
is not the cause (because that patch is not included in 219-67). This much is
for certain.

I wasn't sure how to follow your exact steps because I wasn't clear when you
wrote session-N.scope whether I was expected to replace N with a number and
if so where that number comes from?  Please confirm.

Thanks
Mark
Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Thu, 7 May 2020 at 19:31, Mark Bannister wrote:
>
> You're missing something, although I don't know what, because we're
> definitely seeing this on 219-67 from
> http://vault.centos.org/7.7.1908/os/Source/SPackages/ and it
> definitely does not contain the change that you've linked to. The
> only custom patch we've applied is
> https://github.com/systemd/systemd/commit/cf99f8eacf1c864b19a6a02edea78c43f3185cb7.
> Something else is clearly going on here.
>

Well, maybe you're mixing two issues into one. The original report was about
session scopes failing due to a conflict of job types, which you should not
be seeing on 219-67 (I have no idea how you would without the session scope
having a Requires= dependency on the slice, which wouldn't be the case on
that version), but it is certainly possible to get this message in other
cases where two jobs conflict and one of them cannot be deleted due to the
relationship constraints. I cannot tell without more context why, but it
would be wise to first try to reproduce the problem with the latest systemd
version to see if it is already fixed. These errors are less common now, as a
lot of effort has been spent on finding and fixing some of these corner cases
over the years, especially after v219.

> Thanks for the suggestions on how to reproduce. I managed to
> reproduce it once (first time lucky) with:
>
> # service=$(systemd-run --uid=$RUNUSER /tmp/myscript 2>&1 | awk
> '{sub("\\.$", ""); print $NF}'); systemctl --no-block stop $service &
> systemctl --no-block stop user-$RUNUID.slice; sleep 1; systemctl
> list-jobs
>
> ... where /tmp/myscript is:
>
> #!/bin/sh
> trap '' INT TERM HUP
> sleep 1d
>
> However, after the first success, repeated attempts failed to reproduce it.
>

Sigh... What you posted above is very different from what I asked you to
follow (where the conflicting job was queued by logind). I can't really tell
what you did there. How did you trigger the conflicting job, and what was the
job mode used?
If you tried logging in to see if it triggers the problem: well, this unit
runs under the system instance (it very likely does, as it appears from your
systemd-run invocation), it never creates any dependency on the user slice at
all, and therefore does *not* keep it waiting, so the race is still just as
narrow, unless you specify --slice=user-UID.slice. Moreover, on 219-67, it
would never create a conflict (given that systemctl show lists user-UID.slice
under Wants= and not Requires=). Also, were you able to reproduce with the
steps I listed?

> While most of the time this is triggered by SSH session, I have
> discovered that we have one script that is using systemd-run and which
> failed to launch a command with this 'transaction is destructive'
> error. Could it be that when this problem is triggered, a systemd-run
> could fail?

You should see which service is said to have conflicting job types (given
that you applied the verbose logging patch). However, if you can't reproduce
them with the latest systemd, it would be best to ask your distribution
maintainers to integrate the fixes back into the systemd version they
maintain.

> Alas systemd is so big and so baked into everything these days that we
> daren't do more than add a handful of custom patches to the version
> shipped with the release. So if there is a small patch we can add
> that fixes it in both 219-67 and 219-73, that would be immensely
> helpful.

I think the PR I mentioned before should help with the logind issue, but I
can't tell if it applies cleanly to 219 (and it really only helps post
219-69; as far as I am concerned, the logind message at least cannot appear
on 219-67, where the scope Wants= the slice; convince me by showing evidence
to the contrary). If you're seeing this with other cases, those changes won't
help you.

--
Kartikeya
[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Wed May 6 17:49:02 UTC 2020, Kumar Kartikeya Dwivedi wrote:
> This issue is NOT reproducible with the version you reported
> originally. The stock CentOS image on my cloud provider comes with the
> same version (219-67.el7_7.4) that you said you see this issue with,
> but that can't be, as it doesn't include this change:
> https://github.com/systemd-rhel/rhel-7/commit/0600681f04e3818282a2d518ec3e6afee85f7978,
> unless I'm missing something. It took me a while to figure this out...

You're missing something, although I don't know what, because we're
definitely seeing this on 219-67 from
http://vault.centos.org/7.7.1908/os/Source/SPackages/ and it definitely does
not contain the change that you've linked to. The only custom patch we've
applied is
https://github.com/systemd/systemd/commit/cf99f8eacf1c864b19a6a02edea78c43f3185cb7.
Something else is clearly going on here.

Thanks for the suggestions on how to reproduce. I managed to reproduce it
once (first time lucky) with:

# service=$(systemd-run --uid=$RUNUSER /tmp/myscript 2>&1 | awk '{sub("\\.$", ""); print $NF}'); systemctl --no-block stop $service & systemctl --no-block stop user-$RUNUID.slice; sleep 1; systemctl list-jobs

... where /tmp/myscript is:

#!/bin/sh
trap '' INT TERM HUP
sleep 1d

However, after the first success, repeated attempts failed to reproduce it.

> All in all, you can't do much about these errors, they're mostly
> harmless (at worst you need to retry the login)

While most of the time this is triggered by an SSH session, I have discovered
that we have one script that is using systemd-run and which failed to launch
a command with this 'transaction is destructive' error. Could it be that when
this problem is triggered, a systemd-run could fail?

> and upgrading to v228
> (the patchset mentioned in the previous exchange) or higher should fix
> all of this for you.
Alas, systemd is so big and so baked into everything these days that we
daren't do more than add a handful of custom patches to the version shipped
with the release. So if there is a small patch we can add that fixes it in
both 219-67 and 219-73, that would be immensely helpful.

Best regards
Mark
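[Editor's note: the awk fragment in the reproduction command above strips the
trailing period from systemd-run's "Running as unit ..." message and prints
the unit name. In isolation, with a stand-in message since the real unit name
is generated at runtime:]

```shell
# Stand-in for systemd-run's output; the unit name here is invented.
msg='Running as unit run-12345.service.'
# sub() removes the trailing "." from the line, $NF is the last field.
service=$(printf '%s\n' "$msg" | awk '{sub("\\.$", ""); print $NF}')
echo "$service"   # prints: run-12345.service
```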
Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Wed, 6 May 2020 at 15:58, Mark Bannister wrote:
>
> I repeated the exact same test with a different user account, one
> set-up specifically for this test and nothing else, so there was only
> one SSH session. But 'systemctl list-jobs' always reported 'No jobs
> running' even after the SSH session was terminated. My script kept
> running with the parent PID of 1. Thanks for your help so far, but I
> have still been unable to reproduce the problem. What am I doing
> wrong?
>

Alright, so I dug into this, and managed to reproduce it at last. This issue
is NOT reproducible with the version you reported originally. The stock
CentOS image on my cloud provider comes with the same version
(219-67.el7_7.4) that you said you see this issue with, but that can't be, as
it doesn't include this change:
https://github.com/systemd-rhel/rhel-7/commit/0600681f04e3818282a2d518ec3e6afee85f7978,
unless I'm missing something. It took me a while to figure this out... This
was added in 219-69 (see the tag). I guess you built your copy with some
custom patches, which pulled it in.

This is essentially the reason why the transaction becomes destructive. Prior
to this change, the session scope added a Wants= dependency, instead of
Requires=, on user-UID.slice, so the conflicting start job could simply be
*dropped* instead of failing the transaction on a login attempt; but the
change upgrades Slice= deps to Requires=, so they start failing when stop is
waiting and start is included in a newly generated transaction. You can
verify this with systemctl show session-N.scope: 219-67 says
Wants=user-UID.slice, and 219-69 and upwards say Requires=user-UID.slice.

One key thing I missed was that systemd on CentOS wouldn't clean up the
session for the user since KillUserProcesses=no by default, and also the fact
that the user slice's stop job is not added until all session scopes finish
cleaning up. So the scope is simply abandoned, and should exist until all
processes die themselves.
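[Editor's note: the Wants= vs Requires= difference described above shows up
in the generated scope unit roughly as follows. A sketch only: the session
and UID numbers are illustrative, and the scope is created at runtime by
logind rather than stored on disk.]

```ini
# session-75.scope as generated on 219-67: a conflicting start job for
# the slice may simply be dropped from the transaction.
[Unit]
Wants=user-1000.slice
After=user-1000.slice

# The same scope on 219-69 and later: the slice's start job is now
# mandatory, so a queued stop job on the slice makes the transaction
# destructive.
[Unit]
Requires=user-1000.slice
After=user-1000.slice
```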
So, observing this issue is not easy: it only triggers if you try to log in
right when systemd tries to stop the user slice and the job exists in the
queue (which is a very short window), and it took a lot of tries before I
could reproduce it (for a start, I manually queued the stop job for the slice
with a hung session scope and tried logging in to verify the transaction
problem). So this is racy in nature; the sequence of events can be that your
processes die, session scopes are gone, and now the manager will try to shut
down the user slice, and if you happen to be amidst a login at this moment,
pam_systemd's call to CreateSession would conflict with the already waiting
stop job.

It doesn't matter if KillUserProcesses= is set or not; that just decides if
the scopes are cleaned up on logout (they would become inactive when all
session processes are gone anyway). It does decide when the user-UID.slice
would be stopped, however. The reason you didn't see any jobs was that the
scope was abandoned, and your long-lived process pinned it. So the session
scope would never stop, and as a consequence neither would the slice.

Here are a series of steps to widen the race (by queuing a stop job for the
user slice) and reproduce it:

1. Create a script badproc.sh; chmod +x badproc.sh

#!/bin/bash
trap "" SIGHUP
trap "" SIGINT
trap "" SIGTERM
exec "$@"

2. From the user's shell, do:

./badproc.sh sleep infinity &
disown
^D (logout)

The scope is now abandoned, or being stopped (if KillUserProcesses=yes).

3. From root's shell, do:

systemctl --no-block stop session-N.scope
systemctl --no-block stop user-UID.slice

systemctl list-jobs should list both jobs: the scope job as running, and the
slice job as waiting. The sequence that actually happens is processes dying,
(last) session scope stopping, user slice stopping, in that order. The point
at which it fails is the user slice stopping, which is a very, very short
time window, so it makes sense if you only see these sporadically.

4.
Try a login, which gives me:

May 06 17:03:12 systemd[1]: Requested transaction contradicts existing jobs: Transaction is destructive.
May 06 17:03:12 sshd[3134]: pam_systemd(sshd:session): Failed to create session: Resource deadlock avoided
May 06 17:03:12 systemd-logind[3078]: Failed to start session scope session-37.scope: Transaction is destructive.

All in all, you can't do much about these errors; they're mostly harmless (at
worst you need to retry the login), and upgrading to v228 (the patchset
mentioned in the previous exchange) or higher should fix all of this for you.

--
Kartikeya
[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
> On Tue, 5 May 2020 at 16:27, Mark Bannister wrote:
> > Perhaps I'm missing something or oversimplifying, but I tried to
> > reproduce the problem as follows: I created a script that ignores INT,
> > TERM and HUP and then loops indefinitely writing the time every second
> > to a log file. I SSHed into the box, ran the script, and then from
> > another terminal window killed the parent (bash) and also retried the
> > experiment by killing the bash parent (sshd). In both cases the
> > parent bash and sshd processes died (as expected), my script kept
> > running but the parent PID of the script was changed to 1 (systemd).
> > I did not receive any error messages in /var/log/messages.
>
> It has to be the last remaining session of the user, and you should
> inspect with systemctl list-jobs if systemd is trying to stop the
> scope and slice units, just to be sure, before attempting to login.
> Otherwise, if you're logged in from two places on the machine for a
> user, you won't hit the issue, obviously.

I repeated the exact same test with a different user account, one set up
specifically for this test and nothing else, so there was only one SSH
session. But 'systemctl list-jobs' always reported 'No jobs running' even
after the SSH session was terminated. My script kept running with the parent
PID of 1. Thanks for your help so far, but I have still been unable to
reproduce the problem. What am I doing wrong?

Thanks
Mark
Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Tue, 5 May 2020 at 16:27, Mark Bannister wrote:
>
> So if I'm understanding correctly, your suggestion is that if an SSH
> session runs a process that ignores SIGTERM and the session is closed
> and then within 90 seconds the same user attempts to open a new SSH
> session, it should trigger this error condition?

Given that the session you just logged out of was the last session, yes;
otherwise logind won't GC your user. The only requirement is for something to
delay the session scope's stopping, which should keep the user slice's stop
job waiting, which would allow you to reproduce this at will.

>
> Perhaps I'm missing something or oversimplifying, but I tried to
> reproduce the problem as follows: I created a script that ignores INT,
> TERM and HUP and then loops indefinitely writing the time every second
> to a log file. I SSHed into the box, ran the script, and then from
> another terminal window killed the parent (bash) and also retried the
> experiment by killing the bash parent (sshd). In both cases the
> parent bash and sshd processes died (as expected), my script kept
> running but the parent PID of the script was changed to 1 (systemd).
> I did not receive any error messages in /var/log/messages.
>

It has to be the last remaining session of the user, and you should inspect
with systemctl list-jobs whether systemd is trying to stop the scope and
slice units, just to be sure, before attempting to login. Otherwise, if
you're logged in from two places on the machine for a user, you won't hit the
issue, obviously.

--
Kartikeya
[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Mon May 4 12:44:12 UTC 2020, Kumar Kartikeya Dwivedi wrote:
> For your case, it could be that some process running under
> session-N.scope is not responding to SIGTERM, and this in turn is
> making session-N.scope take time to stop (it has a default timeout of
> 90s), because it has to wait for all processes run under it to die
> before exiting. This also means that your user-UID.slice is waiting
> for session-N.scope before being stopped.

So if I'm understanding correctly, your suggestion is that if an SSH session
runs a process that ignores SIGTERM, and the session is closed, and then
within 90 seconds the same user attempts to open a new SSH session, it should
trigger this error condition?

Perhaps I'm missing something or oversimplifying, but I tried to reproduce
the problem as follows: I created a script that ignores INT, TERM and HUP and
then loops indefinitely, writing the time every second to a log file. I SSHed
into the box, ran the script, and then from another terminal window killed
the parent (bash) and also retried the experiment by killing the bash parent
(sshd). In both cases the parent bash and sshd processes died (as expected);
my script kept running but the parent PID of the script was changed to 1
(systemd). I did not receive any error messages in /var/log/messages.

Any ideas how I might reproduce this problem at will?

Thanks
Mark
Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Mon, 4 May 2020 at 16:02, Mark Bannister wrote:
>
> ...
>
> When you say the session slice is taking time to clean up, please
> excuse my ignorance as I'm not really up to speed on how session
> slices are being managed by systemd, but why would it matter if a
> slice takes time to clean up? Is there a limit on how many SSH
> session slices a user can have or something? I don't see how that
> would cause this particular error.
>

I think "session slice" was poorly worded; I only realize that now. I should
have said user slice. This is the cgroup hierarchy after a few logins:

user-UID.slice
 `- user@UID.service
 `- session-1.scope
 `- ...
 `- session-N.scope

I see that CentOS 7 doesn't really support systemd --user, so user@UID.service
isn't present on your system, but I'll include it in the complete picture
anyway. You can have multiple session scopes, one per login.

Unless linger is enabled, when the last session goes away, systemd-logind
will GC the user object it keeps, by stopping the session-N.scope(s),
user@UID.service and user-UID.slice, and then freeing the internal
bookkeeping structures. This means the session scopes, user-UID.slice and
user@UID.service will have stop jobs enqueued, and since the session scopes
and user@UID.service are ordered after their parent slice, they will be
stopped first, and user-UID.slice will wait for them to stop before its own
stop job is run by systemd.

For your case, it could be that some process running under session-N.scope is
not responding to SIGTERM, and this in turn is making session-N.scope take
time to stop (it has a default timeout of 90s), because it has to wait for
all processes run under it to die before exiting. This also means that your
user-UID.slice is waiting for session-N.scope before being stopped.
Now, if you try to login at this point, v219 has no checks to prevent this
problem:

UNIT             JOB   STATUS
session-N.scope  stop  running
user-UID.slice   stop  waiting

and logind tries to start session-N+1.scope, which needs
user-UID.slice/start. The job mode used is "fail", which means that if a
conflicting job which start cannot merge into is queued (like stop), the
transaction fails. This is why you see "user-17132.slice has 'stop' job
queued, but 'start' is included in transaction". The transaction mentioned in
this error is the one generated from session-N+1.scope/start.

The fixes landed in v228 (see https://github.com/systemd/systemd/pull/1918),
as the job mode used by manager_start_unit was changed to "replace", which
ends up resolving the problem: by the time manager_start_scope is called
(which still uses "fail"), user-UID.slice already has a start job waiting,
which cancelled the sitting stop job from the previous call, so login may be
delayed but will not fail.

The problem you're seeing only happens if you log in again within 90s or less
after logging out of the last user session (considering the default timeout).
So, upgrading systemd should solve your problem. If that is not an option, I
am not sure there is much you can do. You'd need to modify the
TimeoutStopSec= value for scopes, which isn't configurable as the scope is
created at runtime by logind, short of changing the default for PID 1, which
affects every other service on the system.

Enabling linger is another solution. It should fix the problem by not tearing
down the user slice on last logout, but it also means the user slice is
created on boot. Though since you don't have systemd --user, it should not
take up any resources (unless I'm missing something), otherwise it wouldn't
be desirable at all, of course. Note that you'd have to enable it for every
user, or allow users to do so through policykit, if you want this to work for
everyone.
Creating a file for each username under /var/lib/systemd/linger (like
/var/lib/systemd/linger/kkd), or allowing users to call loginctl
enable-linger themselves, should suffice.

> ...
>
> ... we also see this message in /var/log/secure:
>
> 020-05-03T16:09:38.737122-04:00 jupiter sshd[11031]:
> pam_systemd(sshd:session): Failed to create session: Resource deadlock
> avoided

Yes, this is the string for the errno returned from the bus call, EDEADLOCK,
as the transaction is destructive in nature.

> Does this 'Resource deadlock avoided' message from pam_systemd help
> identify the root cause, or is that just a side-effect?

A side-effect, see above.

--
Kartikeya
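[Editor's note: the per-user linger flag mentioned above is just an empty
file named after the user. A sketch; the real location is
/var/lib/systemd/linger, but this demo writes to a scratch directory so it
can run unprivileged, and the username is the one from the earlier loginctl
example:]

```shell
# Real directory is /var/lib/systemd/linger; /tmp/linger-demo is a
# stand-in so this can run without root.
LINGER_DIR=${LINGER_DIR:-/tmp/linger-demo}
user=kkd                      # illustrative username
mkdir -p "$LINGER_DIR"
: > "$LINGER_DIR/$user"       # an empty file per user enables linger
ls "$LINGER_DIR"              # prints: kkd
```

The supported route is `loginctl enable-linger <user>`, which creates the
same flag file on your behalf.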
[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Thu Apr 30 19:01:51 UTC 2020, Kumar Kartikeya Dwivedi wrote:
> My educated guess is that the session slice is taking time to clean up
> because of the default stop timeout due to processes in the session
> not responding to SIGTERM (it is waiting for dependent units to stop
> before stopping itself by virtue of being ordered after them), and in
> the meantime a new login is initiated. Now, the stop job on the user
> slice is still waiting, and looking at the code in logind,
> manager_start_scope uses the job mode fail, which means the start job
> for the user slice won't be able to cancel the sitting stop job, hence
> the transaction cannot be applied.

When you say the session slice is taking time to clean up, please excuse my
ignorance, as I'm not really up to speed on how session slices are being
managed by systemd, but why would it matter if a slice takes time to clean
up? Is there a limit on how many SSH session slices a user can have or
something? I don't see how that would cause this particular error.

This is happening a few times a day. Looking at the logs, we have SSH
sessions opening and closing successfully all day long for various different
accounts. Occasionally we see the problematic error message. I've correlated
with other messages and discovered that at the same time that we get this in
/var/log/messages (the original reported error):

2020-05-03T16:09:38.735265-04:00 jupiter systemd[1]: Requested transaction contradicts existing jobs: Transaction for session-240481.scope/start is destructive (user-17132.slice has 'stop' job queued, but 'start' is included in transaction).
2020-05-03T16:09:38.735430-04:00 jupiter systemd-logind[1588]: Failed to start session scope session-240481.scope: Transaction for session-240481.scope/start is destructive (user-17132.slice has 'stop' job queued, but 'start' is included in transaction).
2020-05-03T16:09:38.735642-04:00 jupiter systemd[1]: Removed slice User Slice of jupiter-user.
...
we also see this message in /var/log/secure:

020-05-03T16:09:38.737122-04:00 jupiter sshd[11031]: pam_systemd(sshd:session): Failed to create session: Resource deadlock avoided
2020-05-03T16:09:38.737380-04:00 jupiter sshd[11031]: pam_unix(sshd:session): session opened for user jupiter-user by (uid=0)

Does this 'Resource deadlock avoided' message from pam_systemd help identify
the root cause, or is that just a side-effect?

Thanks
Mark
Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Thu, 30 Apr 2020 at 23:43, Uoti Urpala wrote:
>
> On Thu, 2020-04-30 at 22:18 +0530, Kumar Kartikeya Dwivedi wrote:
> > waiting for the stop request initiated previously to finish). Even if
> > you use fail as the job mode, the error you get back is "Transaction
> > is destructive" or so, not this.
>
> IIRC the patch he mentioned in his mail was one that changed the error
> message from "Transaction is destructive" to this.
>

I see, it seems I overlooked that part. Thanks for correcting me. Then this
is indeed the stage where it tries to apply the transaction, rather than
conflicting jobs in the same transaction, so what I said above probably does
not apply at all.

I used a test case locally to try and reproduce the problem, and it only
manifests when the JOB_FAIL job mode is used; otherwise the stop job on the
unit is cancelled and replaced with the start job. Something like this:

# anchor.service
[Unit]
Requires=that.slice
After=that.slice

[Service]
ExecStart=/bin/true

# other.service
[Service]
ExecStart=/bin/sleep infinity
# delay the stop a bit
ExecStop=/bin/sleep 10

Now, after:

$ systemctl start other.service --no-block; systemctl stop that.slice --no-block

this fails:

$ systemctl start anchor.service --job-mode fail

while this succeeds:

$ systemctl start anchor.service --job-mode replace

Either that, or the stop job installed is irreversible (you can emulate that
by using --job-mode replace-irreversibly when stopping that.slice). So these
are the only possibilities. The code hasn't changed at all since v219,
either.

My educated guess is that the session slice is taking time to clean up
because of the default stop timeout, due to processes in the session not
responding to SIGTERM (it is waiting for dependent units to stop before
stopping itself, by virtue of being ordered after them), and in the meantime
a new login is initiated.
Now, the stop job on the user slice is still waiting, and looking at the code
in logind, manager_start_scope uses the job mode fail, which means the start
job for the user slice won't be able to cancel the sitting stop job, hence
the transaction cannot be applied.

--
Kartikeya
Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Thu, 2020-04-30 at 22:18 +0530, Kumar Kartikeya Dwivedi wrote:
> waiting for the stop request initiated previously to finish). Even if
> you use fail as the job mode, the error you get back is "Transaction
> is destructive" or so, not this.

IIRC the patch he mentioned in his mail was one that changed the error
message from "Transaction is destructive" to this.
Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Thu, 30 Apr 2020 at 16:56, Mark Bannister wrote:
>
> The documentation says that the default option for --job-mode is
> "replace", i.e. any conflicting pending job will be replaced, as
> necessary.
>
> How does it make sense for systemd to allow stop and start requests to
> contradict anyway? This is the bit I'm finding very hard to understand.
> I don't think your explanation can be right. Surely, if a start command
> is issued while a previous stop is still being processed, that is not
> an error. The start request should be followed as soon as the stop is
> finished.

Yes, this error is not raised by external jobs. When you use the replace job
mode, the job sitting in the unit's job slot is always unseated if possible,
and the other job either becomes 'runnable' instantly, or in some cases waits
for the other job to finish before becoming runnable, emulating a queue
(e.g. a service unit's start job waiting for a previously initiated stop
request to finish). Even if you use fail as the job mode, the error you get
back is "Transaction is destructive" or so, not this.

The problem you're seeing is caused when the same transaction pulls in two
conflicting/unmergeable jobs: for example, a unit being started adds a start
job, and some unit in the dependency chain has Conflicts= on it, which would
pull in a stop job for the same unit. These are contradictory state change
requests, and there's no easy way to resolve this automatically (unlike
merging, where jobs that make the unit reach a compatible state are merged
into the same job).

This has been reported before:

* The first one explains in detail when and why this happens:
  https://github.com/systemd/systemd/issues/11440
* This one is more of a real-life example:
  https://github.com/systemd/systemd/issues/2830

The easiest way to debug problems like these is to enable debug logging for
the service manager, examine the transaction in the logs, and inspect your
units in the dependency chain to pin down the culprit.
--
Kartikeya
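[Editor's note: the Conflicts= scenario described in the message above can be
sketched with a hypothetical pair of units; the names a.service and b.service
are illustrative, not from the thread:]

```ini
# a.service (hypothetical): starting it pulls in b.service via Requires=...
[Unit]
Requires=b.service
After=b.service

[Service]
ExecStart=/bin/true

# b.service (hypothetical): ...but b.service conflicts with a.service, so
# the transaction generated for a.service/start also pulls in
# a.service/stop. Start and stop cannot merge, so the transaction is
# rejected as contradictory.
[Unit]
Conflicts=a.service

[Service]
ExecStart=/bin/true
```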
Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive
> On Mon, 2020-04-27 at 11:40 +0100, Mark Bannister wrote:
> > One of the error messages I've been trying to explain is reported like this:
> > 2020-04-22T00:45:47.047710-04:00 jupiter systemd[1]: Requested transaction
> > contradicts existing jobs: Transaction for session-752473.scope/start is
> > destructive (user-16027.slice has 'stop' job queued, but 'start' is
> > included in transaction).
> >
> > This also coincides with a message from sshd:
> > 2020-04-22T00:45:47.049949-04:00 jupiter sshd[20984]:
> > pam_unix(sshd:session): session opened for user userlp by (uid=0)
> > 2020-04-22T00:45:51.181665-04:00 jupiter sshd[20984]:
> > pam_unix(sshd:session): session closed for user userlp
> >
> > I'm not sure where to begin to troubleshoot the problem. The systemd
> > error is occurring sporadically and I haven't yet found a way to
> > reproduce it. I also can't say I actually understand the error
> > message or what I'm supposed to do about it. How is it even possible
> > for stop and start jobs to clash like this?

On Mon Apr 27 19:59:11 UTC 2020, Uoti Urpala wrote:
> See the systemctl documentation for the --job-mode option. Basically
> this should be due to an external request from somewhere to start the
> unit while systemd is in the middle of processing a previously started
> stop command for it.

The documentation says that the default option for --job-mode is "replace", i.e. any conflicting pending job will be replaced, as necessary.

How does it make sense for systemd to allow stop and start requests to contradict anyway? This is the bit I'm finding very hard to understand. I don't think your explanation can be right. Surely, if a start command is issued while a previous stop is still being processed, that is not an error. The start request should be followed as soon as the stop is finished.

Also, I'm unclear on what stop/start job is involved here; it's not sshd.service that's going to stop and start when a session is opened for a user.
Re: [systemd-devel] Requested transaction contradicts existing jobs: start is destructive
On Mon, 2020-04-27 at 11:40 +0100, Mark Bannister wrote:
> I'm not sure where to begin to troubleshoot the problem. The systemd
> error is occurring sporadically and I haven't yet found a way to
> reproduce it. I also can't say I actually understand the error
> message or what I'm supposed to do about it. How is it even possible
> for stop and start jobs to clash like this?

See the systemctl documentation for the --job-mode option. Basically this should be due to an external request from somewhere to start the unit while systemd is in the middle of processing a previously started stop command for it.

The log messages suggest that this might be related to trying to start a new ssh session while the per-user slice is still being shut down after closing all previous login sessions; I'm not familiar enough with this area to tell how things are supposed to work in this situation.
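As a reference for the --job-mode behaviour described above, a sketch of the relevant commands (shown commented since they need a running systemd; 'foo.service' is a placeholder unit name):

```shell
# Default mode: a conflicting queued job is replaced where possible
# (equivalent to a plain 'start'):
#   systemctl --job-mode=replace start foo.service

# Refuse the request instead, if it conflicts with a queued job:
#   systemctl --job-mode=fail start foo.service

# Inspect what the manager currently has queued:
#   systemctl list-jobs
```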
[systemd-devel] Requested transaction contradicts existing jobs: start is destructive
I've been trying to find the cause of a 'transaction is destructive' error from systemd. I'm testing systemd-219-67.el7_7.4 with the following patch applied:

https://github.com/systemd/systemd/commit/cf99f8eacf1c864b19a6a02edea78c43f3185cb7

One of the error messages I've been trying to explain is reported like this:

2020-04-22T00:45:47.047710-04:00 jupiter systemd[1]: Requested transaction
contradicts existing jobs: Transaction for session-752473.scope/start is
destructive (user-16027.slice has 'stop' job queued, but 'start' is
included in transaction).
2020-04-22T00:45:47.047985-04:00 jupiter systemd[1]: Removed slice User
Slice of userlp.
2020-04-22T00:45:47.048159-04:00 jupiter systemd-logind[1402]: Failed to
start session scope session-752473.scope: Transaction for
session-752473.scope/start is destructive (user-16027.slice has 'stop' job
queued, but 'start' is included in transaction).

This also coincides with a message from sshd:

2020-04-22T00:45:47.049949-04:00 jupiter sshd[20984]:
pam_unix(sshd:session): session opened for user userlp by (uid=0)
2020-04-22T00:45:51.181665-04:00 jupiter sshd[20984]:
pam_unix(sshd:session): session closed for user userlp

I'm not sure where to begin to troubleshoot the problem. The systemd error is occurring sporadically and I haven't yet found a way to reproduce it. I also can't say I actually understand the error message or what I'm supposed to do about it. How is it even possible for stop and start jobs to clash like this?

If someone could please provide a little guidance here, your help will be much appreciated. Thanks in advance!

Mark