Answering my own question: I received a private email pointing to <https://bugs.schedmd.com/show_bug.cgi?id=4412>, which describes both the problem and the solution. (Thanks Matthieu!)

Andy


On 12/08/2017 11:06 AM, Andy Riebs wrote:

I've gathered more information, and it looks like I'm fighting with PAM. Of note, this problem can be reproduced with a single-node, single-task job, such as

$ sbatch -N1 --reservation awr
#!/bin/bash
hostname
Submitted batch job 90436
$ sinfo -R
batch job complete f slurm     2017-12-08T15:34:37 node017
$
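
(Side note: sinfo -R truncates the REASON column, so "batch job complete f" above is really the "batch job complete failure" message mentioned below. The full drain reason can be read with something along these lines, using node017 from the run above:)

    scontrol show node node017 | grep -i reason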

With SlurmdDebug=debug5, the only thing interesting in slurmd.log is

[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic signature plugin loaded
[2017-12-08T15:34:37.778] [90436.batch] error: pam_open_session: Cannot make/remove an entry for the specified session
[2017-12-08T15:34:37.779] [90436.batch] error: error in pam_setup
[2017-12-08T15:34:37.804] [90436.batch] error: job_manager exiting abnormally, rc = 4020
[2017-12-08T15:34:37.804] [90436.batch] job 90436 completed with slurm_rc = 4020, job_rc = 0
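
(For reference, raising SlurmdDebug to debug5 amounts to roughly the following; the slurm.conf path is site-specific.)

    # in slurm.conf on the compute nodes (path varies by site):
    #   SlurmdDebug=debug5
    scontrol reconfigure    # or restart slurmd on the affected nodes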

/etc/pam.d/slurm is defined as

auth            required        pam_localuser.so
auth            required        pam_shells.so
account         required        pam_unix.so
account         required        pam_access.so
session         required        pam_unix.so
session         required        pam_loginuid.so
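
(For what it's worth, the same stack can be exercised outside of slurmd with the pamtester utility, if it happens to be installed; "someuser" below is just a placeholder account.)

    # run the session phase of the "slurm" PAM service by hand
    pamtester slurm someuser open_session close_session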

/var/log/secure reports

Dec  8 15:34:37 node017 : pam_unix(slurm:session): open_session - error recovering username
Dec  8 15:34:37 node017 : pam_loginuid(slurm:session): unexpected response from failed conversation function
Dec  8 15:34:37 node017 : pam_loginuid(slurm:session): error recovering login user-name

The message "error recovering username" seems likely to be at the heart of the problem here. This worked just fine with Slurm 16.05.8, and I think it was also working with Slurm 17.11.0-0pre2.
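
(Given that the complaint is about recovering the username, one sanity check (placeholder names below) is to confirm that the account actually resolves on the compute node, since pam_localuser.so wants an entry in the local /etc/passwd:)

    # run on the drained node; "someuser" is a placeholder
    getent passwd someuser
    grep '^someuser:' /etc/passwd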

Any thoughts about where I should go from here?

Andy

On 11/30/2017 08:40 AM, Andy Riebs wrote:
We installed Slurm 17.11.0 on our 100+ node x86_64 cluster running CentOS 7.4 this afternoon, and we periodically see a single node (perhaps the first node in an allocation?) get drained with the message "batch job complete failure".

On one node in question, slurmd.log reports

    pam_unix(slurm:session): open_session - error recovering username
    pam_loginuid(slurm:session): unexpected response from failed conversation function
On another node drained for the same reason, slurmd.log shows

    error: pam_open_session: Cannot make/remove an entry for the
    specified session
    error: error in pam_setup
    error: job_manager exiting abnormally, rc = 4020
    sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0

slurmctld has logged

    error: slurmd error running JobId=33 on node(s)=node048: Slurmd
    could not execve job

    drain_nodes: node node048 state set to DRAIN
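
(For reference, a node drained this way can be returned to service with something like the following; node048 is just the node from the log above.)

    scontrol update NodeName=node048 State=RESUME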

If anyone can shed some light on where I should start looking, I shall be most obliged!

Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
     May the source be with you!

