Don,

As I stated earlier, I am sorry that this broke things for you, but ...

There was one major and one minor reason for submitting the set of job
control patches. The minor one is such that I would have no problem if
the SLURM developers withdrew job control; it is this: without job
control, there will always be situations where users are forced to use
kill -9 or similar to get rid of salloc, since salloc is then not in
control of the sub-processes it runs.

The major reason for keeping this is the proprietary system we use; the
technical details are below. I am not writing these to enter into a
discussion of Cray features, but hope that the work at ORNL proceeds to
replace ALPS with SLURM, in which case the problem described below would
not exist. That problem basically forced us to either
 * disable interactive sessions on our systems (unacceptable, since both
   researchers and newcomers rely on interactive sessions to test out
   how a particular combination of parameters works), or
 * use job control as a trade-off.

The situation in December was such that a single mistake by one user was
enough to make an entire multi-cabinet system unusable, i.e. if SLURM
were a firewall or malware-protection program, we would not even be
having this discussion.

In December, before job control had been added, we twice had stretches
on the order of 10-12 hours where the machine became blocked and no new
jobs would run; in both cases the messed-up salloc sessions happened in
the evening.

> Perhaps I was not clear in my description of the problem,  but the patch 
> you supplied most emphatically does *not*  solve the problem that the bash 
> shell is crashing and getting into the SIGTTIN loop before ever issuing 
> its own prompt to the user!    The "/bin/bash" command should have 
> executed the shell and allowed the user to enter commands,  and not 
> immediately terminated.
> 
Have you tried this with other shells? This is coming from bash itself,
which tries to become the foreground process in order to perform its own
job control. salloc allows another process to come into the foreground
only when run in interactive mode, not when run within a job script. The
behaviour is expected; it may be that other shells try less aggressively
to move into the foreground.
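To illustrate the mechanism behind that loop, here is a minimal sketch
in plain Python (nothing salloc- or bash-specific; the child and signals
are contrived for the demonstration): the default disposition of SIGTTIN
is to stop the receiving process, which is what happens to a background
shell that keeps trying to read from the terminal.

```python
import os
import signal

# Sketch: the default action of SIGTTIN is to *stop* the process.
# A backgrounded shell reading from the terminal receives SIGTTIN
# from the kernel; here we deliver it by hand to a child instead.
pid = os.fork()
if pid == 0:
    os.setpgid(0, 0)     # child: own process group, same session
    signal.pause()       # wait passively for signals
    os._exit(0)

os.setpgid(pid, pid)     # parent does the same to avoid a race
os.kill(pid, signal.SIGTTIN)

# WUNTRACED makes waitpid also report stopped (not exited) children.
_, status = os.waitpid(pid, os.WUNTRACED)
stopped_by_ttin = (os.WIFSTOPPED(status)
                   and os.WSTOPSIG(status) == signal.SIGTTIN)
print("child stopped by SIGTTIN:", stopped_by_ttin)

os.kill(pid, signal.SIGCONT)   # clean up: continue, then kill
os.kill(pid, signal.SIGKILL)
os.waitpid(pid, 0)
```

bash handles (rather than lets itself be stopped by) these signals while
it tries to grab the terminal, which is where the busy loop comes from.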

> You seem to imply that it is somehow illegitimate to execute the "salloc" 
> within a script.    I submit that it is almost second nature for 
> Linux/Unix programmers to create various "wrapper" scripts to invoke 
> commands (including "salloc") with certain fixed parameters, while 
> allowing easy substitution of other parameters. 
If I understand you correctly, you mean a shell invoking a shell such as
% cat a
#!/bin/bash
/bin/sh -c /bin/bash
% ./a
% ps f -o pid,ppid,sid,pgid,tpgid,stat,wchan,cmd
  PID  PPID   SID  PGID TPGID STAT WCHAN  CMD
25480 25477 25480 25480 26024 Ss   wait   -bash
25739 25480 25480 25739 26024 S    wait    \_ sh a
25740 25739 25480 25740 26024 S    wait        \_ /bin/bash
26024 25740 25480 26024 26024 R+   -               \_ ps f -o pid,ppid,sid,pgid,tpgid,stat,wchan,cmd

The three nested shells all have the same session ID; the login shell
25480 remains the session leader. The bottom-level shell 25740 has put
itself into the background so that the ps command it forked can be the
foreground process (tpgid = 26024).
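The same relationships can be reproduced programmatically. A small
sketch (plain Python, with an arbitrary sleep standing in for the
bottom-level command): a child placed in its own process group still
shares its parent's session.

```python
import os
import subprocess

# Start a child in a fresh process group (like a shell job), but do
# not create a new session -- os.setpgrp runs in the child pre-exec.
proc = subprocess.Popen(["sleep", "30"], preexec_fn=os.setpgrp)

same_session = os.getsid(proc.pid) == os.getsid(0)   # shared SID
own_group = os.getpgid(proc.pid) == proc.pid         # fresh PGID
print("same session:", same_session)        # True
print("own process group:", own_group)      # True

proc.kill()
proc.wait()
```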

> The "salloc" command itself is essentially a wrapper which allows a user to 
> invoke a specific 
> command or shell after calling SLURM to reserve some resources. I see  no 
> reason that salloc 
> should not be able to be executed within a script. 
The above use case uses shell scripts to construct interactive sessions.
I have found no clean way of allowing this, and I believe that sh and
bash resort to tricks and heuristics to make it possible (i.e. I am not
sure that sh/bash will always get this right).

The reason lies in the "some resources". When a user kills the salloc
process, the remote processes running via aprun on the compute nodes are
not terminated at the same time. Once the salloc session has in addition
either been terminated via scancel or naturally timed out, it would be
time to free up the resources reserved for this session. Very likely
SLURM itself could handle this situation.
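A local analogue of this can be sketched in a few lines of Python (a
hypothetical sh wrapper stands in for salloc, a backgrounded sleep for
the aprun-launched remote processes): kill -9 on the middle process
leaves its child running as an orphan.

```python
import os
import signal
import subprocess
import time

# The sh process plays the role of salloc; the backgrounded sleep
# plays the role of the remote processes it started.
wrapper = subprocess.Popen(
    ["sh", "-c", "sleep 30 & echo $!; wait"],
    stdout=subprocess.PIPE, text=True)
child_pid = int(wrapper.stdout.readline())

wrapper.send_signal(signal.SIGKILL)   # the user's "kill -9"
wrapper.wait()
time.sleep(0.2)

# Signal 0 delivers nothing; it only checks the process exists.
try:
    os.kill(child_pid, 0)
    orphan_alive = True
except ProcessLookupError:
    orphan_alive = False
print("orphaned child still running:", orphan_alive)

if orphan_alive:
    os.kill(child_pid, signal.SIGKILL)   # clean up
```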

However, on Cray the compute nodes are not under the control of SLURM,
but of the BASIL batch-system layer. At the end of the salloc session,
this layer receives a notification that the job is done. However, it
will refuse to launch any new reservations until it has cleaned up the
existing ones. But it can not clean up the existing ones while the
orphaned remote processes are still executing. Until an operator comes
in (e.g. in the middle of the night) and cleans up the orphaned
processes, no new jobs will run, since the old reservation sits in the
"pending cancel" state.

Hence we do not allow other processes to obtain control over salloc
unless salloc is running in interactive mode.
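A minimal sketch of such a check (assuming, as I believe salloc's test
amounts to, that "interactive" means stdin is attached to a terminal):

```python
import subprocess
import sys

# Run the isatty check in a child whose stdin is a pipe: the child
# reports non-interactive. The same check at a real terminal gives
# True, which is when foreground handover would be allowed.
check = "import sys; print(sys.stdin.isatty())"
result = subprocess.run(
    [sys.executable, "-c", check],
    stdin=subprocess.PIPE, capture_output=True, text=True)
piped_is_interactive = result.stdout.strip()
print("interactive with piped stdin:", piped_is_interactive)  # False
```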

We run SLURM on 3 XT systems, including our main production system, and
on one XE system. When we migrated, we had many novice users. Since the
introduction of job control into salloc, we have not had a single case
of machine-unusable time caused by orphaned child processes of salloc.

Hence, if you have your way, we would need to fork the code in order to
keep things running.
