Don, as I stated earlier, I am sorry that this broke things for you, but ...
There was one major and one minor reason for submitting the set of job control patches.

The minor one is such that I would have no problem if the SLURM developers withdrew job control; it is: without job control there will always be situations where users are forced to use kill -9 or similar to get rid of salloc, since salloc is then not in control of the sub-processes it runs.

The major reason for keeping this is the proprietary system we use; the technical details are below. I am not writing these to enter into a discussion of Cray features, but hope that the work at ORNL proceeds to replace ALPS with SLURM. Then the problem described below would not exist. That problem basically forced us to either
 * disable interactive sessions on our systems (unacceptable, since both researchers and newcomers rely on interactive sessions to test how a particular combination of parameters works), or
 * use job control as a trade-off.

The situation in December was such that it took only a single mistake by one user to make an entire multi-cabinet system unusable; i.e. if SLURM were a firewall or malware protection program, we would not even be having this discussion. In December, before job control had been added, we twice had stretches on the order of 10-12 hours where the machine was blocked and no new jobs would run; both of the messed-up salloc sessions happened in the evening.

> Perhaps I was not clear in my description of the problem, but the patch
> you supplied most emphatically does *not* solve the problem that the bash
> shell is crashing and getting into the SIGTTIN loop before ever issuing
> its own prompt to the user! The "/bin/bash" command should have
> executed the shell and allowed the user to enter commands, and not
> immediately terminated.
>
> Have you tried this with other shells?

This is coming from bash itself, which tries to become the foreground process in order to perform its own job control.
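To make the SIGTTIN mechanics concrete, here is a minimal sketch (generic shell, nothing SLURM-specific): the default disposition of SIGTTIN, like SIGSTOP, is to stop the receiving process, which is what keeps happening to bash in the loop quoted above.

```shell
#!/bin/sh
# Minimal sketch: SIGTTIN's default action stops a process.
# (Generic illustration only; no SLURM involved.)
set -m                        # job control: background jobs get their own pgrp
sleep 30 &
pid=$!
kill -TTIN "$pid"             # the signal bash keeps receiving in the loop
sleep 1
state=$(ps -o stat= -p "$pid")
case "$state" in
  *T*) echo "stopped by SIGTTIN" ;;   # "T" = stopped
esac
kill -KILL "$pid"             # clean up
```

Without "set -m" the background job would stay in the script's own process group; with it, the sleep becomes the leader of a fresh, non-orphaned group, so the stop signal takes effect just as it does for an interactive shell's children.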
When run within a job script, salloc does not allow another process to come into the foreground; it does so only when run in interactive mode. The behaviour is expected; it may be that other shells try less aggressively to move into the foreground.

> You seem to imply that it is somehow illegitimate to execute the "salloc"
> within a script. I submit that it is almost second nature for
> Linux/Unix programmers to create various "wrapper" scripts to invoke
> commands (including "salloc") with certain fixed parameters, while
> allowing easy substitution of other parameters.

If I understand you correctly, you mean a shell invoking a shell, such as

  % cat a
  #!/bin/bash
  /bin/sh -c /bin/bash

  % ./a
  % ps f -o pid,ppid,sid,pgid,tpgid,stat,wchan,cmd
    PID  PPID   SID  PGID TPGID STAT WCHAN CMD
  25480 25477 25480 25480 26024 Ss   wait  -bash
  25739 25480 25480 25739 26024 S    wait   \_ sh a
  25740 25739 25480 25740 26024 S    wait       \_ /bin/bash
  26024 25740 25480 26024 26024 R+   -              \_ ps f -o pid,ppid,sid,pgid,tpgid,stat,wchan,cmd

The three nested shells all have the same session ID; the login shell 25480 remains session leader. The bottom-level shell 25740 has backgrounded itself so that the ps command it forked can be the foreground process (tpgid = 26024).

> The "salloc" command itself is essentially a wrapper which allows a user to
> invoke a specific command or shell after calling SLURM to reserve some
> resources. I see no reason that salloc should not be able to be executed
> within a script.

The above use case uses shell scripts to construct interactive sessions. I have found no non-ugly way of allowing this, and I also believe that sh and bash resort to tricks and heuristics to make it possible (i.e. I am not sure that sh/bash will always get this right).

The reason is the "some resources". When a user kills the salloc process, the remote processes running via aprun on the compute nodes are not terminated at the same time.
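The last point can be shown with a local sketch (plain shell, no Cray commands): killing a parent by PID removes only the parent, while the children it started keep running as orphans, which is exactly what a "kill -9" on salloc does to the remote aprun processes. The temporary file is just a made-up way to pass the child's PID back.

```shell
#!/bin/sh
# Sketch: killing a parent process does not kill its children.
# (Local analogue of a killed salloc leaving aprun processes behind.)
tmp=$(mktemp)
sh -c "sleep 30 & echo \$! > $tmp; wait" &
parent=$!
sleep 1
child=$(cat "$tmp"); rm -f "$tmp"
kill "$parent"                # remove the parent only, as "kill -9" would
sleep 1
survived=no
if kill -0 "$child" 2>/dev/null; then
    survived=yes
    echo "child survived its parent"
fi
kill "$child" 2>/dev/null     # clean up the orphan by hand
```

On a Cray the manual clean-up in the last line is not even possible from the login node, which is why the orphans accumulate until an operator intervenes.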
Once the salloc session has, in addition, either been terminated via scancel or naturally timed out, it is time to free up the resources reserved for that session. Very likely SLURM itself could handle this situation. However, on Cray the compute nodes are not under the control of SLURM, but of the Basil batch system layer. At the end of the salloc session, this layer receives a notification that the job is done. However, it will refuse to launch any new reservations until it has cleaned up the existing ones, and it cannot clean up the existing ones while the remote orphaned processes are still executing. Until an operator comes in (e.g. in the middle of the night) and cleans up the orphaned processes, no new jobs will run, since the old reservation sits in "pending cancel" state.

Hence we do not allow other processes to obtain control over salloc unless salloc is running in interactive mode.

We run SLURM on 3 XT systems, including our main production system, and one XE system. When we migrated we had many novice users. Since the introduction of job control into salloc, we have not had a single case of machine-unusable time caused by orphaned child processes of salloc. Hence, if you have your way, we would need to fork in order to keep things running.
