Answering my own question, I got private email which points to
<https://bugs.schedmd.com/show_bug.cgi?id=4412>, describing both the
problem and the solution. (Thanks Matthieu!)
Andy
On 12/08/2017 11:06 AM, Andy Riebs wrote:
I've gathered more information, and it looks like I am fighting with
PAM. Notably, the problem can be reproduced with a single-node,
single-task job, such as
$ sbatch -N1 --reservation awr
#!/bin/bash
hostname
Submitted batch job 90436
$ sinfo -R
batch job complete f slurm 2017-12-08T15:34:37 node017
$
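(As an aside, sinfo truncates the Reason column by default, which is why the message above is cut off at "batch job complete f". Standard sinfo format specifiers print it in full; nothing here is specific to this bug:

```shell
# Show the untruncated drain reason (%E), the user who set it (%u),
# the timestamp (%H), and the node list (%N)
sinfo -R -o "%50E %9u %19H %N"
```
)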
With SlurmdDebug=debug5, the only thing interesting in slurmd.log is
[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic signature
plugin loaded
[2017-12-08T15:34:37.778] [90436.batch] error: pam_open_session:
Cannot make/remove an entry for the specified session
[2017-12-08T15:34:37.779] [90436.batch] error: error in pam_setup
[2017-12-08T15:34:37.804] [90436.batch] error: job_manager exiting
abnormally, rc = 4020
[2017-12-08T15:34:37.804] [90436.batch] job 90436 completed with
slurm_rc = 4020, job_rc = 0
/etc/pam.d/slurm is defined as
auth required pam_localuser.so
auth required pam_shells.so
account required pam_unix.so
account required pam_access.so
session required pam_unix.so
session required pam_loginuid.so
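If pam_loginuid is what is actually failing (as /var/log/secure below suggests), one possible workaround -- untested on our cluster, and I can't say whether it's the right fix for 17.11 -- would be to mark that module optional, which is standard PAM syntax for tolerating a module failure rather than aborting the session:

```
session    required    pam_unix.so
session    optional    pam_loginuid.so
```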
/var/log/secure reports
Dec 8 15:34:37 node017 : pam_unix(slurm:session): open_session -
error recovering username
Dec 8 15:34:37 node017 : pam_loginuid(slurm:session): unexpected
response from failed conversation function
Dec 8 15:34:37 node017 : pam_loginuid(slurm:session): error
recovering login user-name
The message "error recovering username" seems likely to be at the
heart of the problem here. This worked just fine with Slurm 16.05.8,
and I think it was also working with Slurm 17.11.0-0pre2.
Any thoughts about where I should go from here?
Andy
On 11/30/2017 08:40 AM, Andy Riebs wrote:
We've just installed 17.11.0 on our 100+ node x86_64 cluster running
CentOS 7.4 this afternoon, and periodically see a single node
(perhaps the first node in an allocation?) get drained with the
message "batch job complete failure".
On one node in question, slurmd.log reports
pam_unix(slurm:session): open_session - error recovering username
pam_loginuid(slurm:session): unexpected response from failed
conversation function
On another node drained for the same reason,
error: pam_open_session: Cannot make/remove an entry for the
specified session
error: error in pam_setup
error: job_manager exiting abnormally, rc = 4020
sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
slurmctld has logged
error: slurmd error running JobId=33 on node(s)=node048: Slurmd
could not execve job
drain_nodes: node node048 state set to DRAIN
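For what it's worth, once the underlying problem is resolved, a drained node can be returned to service without restarting anything; this is standard scontrol usage:

```shell
# Clear the DRAIN state on the affected node
scontrol update NodeName=node048 State=RESUME
```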
If anyone can shed some light on where I should start looking, I
shall be most obliged!
Andy
--
Andy Riebs
andy.ri...@hpe.com
Hewlett Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!