Answering my own question, I got private email which points to
<https://bugs.schedmd.com/show_bug.cgi?id=4412>, describing both the
problem and the solution. (Thanks Matthieu!)
Andy
On 12/08/2017 11:06 AM, Andy Riebs wrote:
I've gathered more information, and it looks like I am fighting with
PAM. Notably, the problem can be reproduced with a single-node,
single-task job, such as
$ sbatch -N1 --reservation awr
#!/bin/bash
hostname
Submitted batch job 90436
$ sinfo -R
batch job complete f slurm 2017-12-08T15:34:37 node017
$
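(As an aside, sinfo truncates the Reason column by default, which is why the message above is cut off at "batch job complete f". Standard sinfo format specifiers print it in full; nothing here is specific to this bug:

```shell
# Show the untruncated drain reason (%E), the user who set it (%u),
# the timestamp (%H), and the node list (%N)
sinfo -R -o "%50E %9u %19H %N"
```
)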
With SlurmdDebug=debug5, the only thing interesting in slurmd.log is
[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic signature
plugin loaded
[2017-12-08T15:34:37.778] [90436.batch] error: pam_open_session:
Cannot make/remove an entry for the specified session
[2017-12-08T15:34:37.779] [90436.batch] error: error in pam_setup
[2017-12-08T15:34:37.804] [90436.batch] error: job_manager exiting
abnormally, rc = 4020
[2017-12-08T15:34:37.804] [90436.batch] job 90436 completed with
slurm_rc = 4020, job_rc = 0
/etc/pam.d/slurm is defined as
auth required pam_localuser.so
auth required pam_shells.so
account required pam_unix.so
account required pam_access.so
session required pam_unix.so
session required pam_loginuid.so
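If pam_loginuid is what is actually failing (as /var/log/secure below suggests), one possible workaround -- untested on our cluster, and I can't say whether it's the right fix for 17.11 -- would be to mark that module optional, which is standard PAM syntax for tolerating a module failure rather than aborting the session:

```
session    required    pam_unix.so
session    optional    pam_loginuid.so
```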
/var/log/secure reports
Dec 8 15:34:37 node017 : pam_unix(slurm:session): open_session -
error recovering username
Dec 8 15:34:37 node017 : pam_loginuid(slurm:session): unexpected
response from failed conversation function
Dec 8 15:34:37 node017 : pam_loginuid(slurm:session): error
recovering login user-name
The message "error recovering username" seems likely to be at the
heart of the problem here. This worked just fine with Slurm 16.05.8,
and I think it was also working with Slurm 17.11.0-0pre2.
Any thoughts about where I should go from here?
Andy
On 11/30/2017 08:40 AM, Andy Riebs wrote:
We've just installed 17.11.0 on our 100+ node x86_64 cluster running
CentOS 7.4 this afternoon, and periodically see a single node
(perhaps the first node in an allocation?) get drained with the
message "batch job complete failure".
On one node in question, slurmd.log reports
pam_unix(slurm:session): open_session - error recovering username
pam_loginuid(slurm:session): unexpected response from failed
conversation function
On another node drained for the same reason,
error: pam_open_session: Cannot make/remove an entry for the
specified session
error: error in pam_setup
error: job_manager exiting abnormally, rc = 4020
sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
slurmctld has logged
error: slurmd error running JobId=33 on node(s)=node048: Slurmd
could not execve job
drain_nodes: node node048 state set to DRAIN
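For what it's worth, once the underlying problem is resolved, a drained node can be returned to service without restarting anything; this is standard scontrol usage:

```shell
# Clear the DRAIN state on the affected node
scontrol update NodeName=node048 State=RESUME
```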
If anyone can shed some light on where I should start looking, I
shall be most obliged!
Andy
--
Andy Riebs
andy.ri...@hpe.com
Hewlett Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!