Re: [slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-12-08 Thread Andy Riebs
Answering my own question: I got a private email which points to 
, describing both the 
problem and the solution. (Thanks, Matthieu!)


Andy


On 12/08/2017 11:06 AM, Andy Riebs wrote:


I've gathered more information, and I am probably having a fight with 
PAM. Of note, this problem can be reproduced with a single-node, 
single-task job, such as


$ sbatch -N1 --reservation awr
#!/bin/bash
hostname
Submitted batch job 90436
$ sinfo -R
batch job complete f slurm 2017-12-08T15:34:37 node017
$
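
(For completeness: between test runs the drained node can be put back in 
service with something along these lines, substituting whichever node was 
drained, so repeating the experiment is cheap:

$ scontrol update NodeName=node017 State=RESUME
)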

With SlurmdDebug=debug5, the only interesting thing in slurmd.log is

[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic signature plugin loaded
[2017-12-08T15:34:37.778] [90436.batch] error: pam_open_session: Cannot make/remove an entry for the specified session
[2017-12-08T15:34:37.779] [90436.batch] error: error in pam_setup
[2017-12-08T15:34:37.804] [90436.batch] error: job_manager exiting abnormally, rc = 4020
[2017-12-08T15:34:37.804] [90436.batch] job 90436 completed with slurm_rc = 4020, job_rc = 0


/etc/pam.d/slurm is defined as

auth    required    pam_localuser.so
auth    required    pam_shells.so
account required    pam_unix.so
account required    pam_access.so
session required    pam_unix.so
session required    pam_loginuid.so
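
(One thing I may try, purely to narrow things down and not as a fix, is to 
temporarily downgrade the two session modules to "optional" on a single 
test node, e.g.

session optional    pam_unix.so
session optional    pam_loginuid.so

just to see whether the job then launches, and revert afterwards.)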

/var/log/secure reports

Dec  8 15:34:37 node017 : pam_unix(slurm:session): open_session - error recovering username
Dec  8 15:34:37 node017 : pam_loginuid(slurm:session): unexpected response from failed conversation function
Dec  8 15:34:37 node017 : pam_loginuid(slurm:session): error recovering login user-name


The message "error recovering username" seems likely to be at the 
heart of the problem here. This worked just fine with Slurm 16.05.8, 
and I think it was also working with Slurm 17.11.0-0pre2.


Any thoughts about where I should go from here?

Andy

On 11/30/2017 08:40 AM, Andy Riebs wrote:
We installed 17.11.0 on our 100+ node x86_64 cluster running CentOS 7.4 
this afternoon, and we periodically see a single node (perhaps the first 
node in an allocation?) get drained with the message "batch job complete 
failure".


On one node in question, slurmd.log reports

pam_unix(slurm:session): open_session - error recovering username
pam_loginuid(slurm:session): unexpected response from failed conversation function


On another node drained for the same reason,

error: pam_open_session: Cannot make/remove an entry for the specified session
error: error in pam_setup
error: job_manager exiting abnormally, rc = 4020
sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0

slurmctld has logged

error: slurmd error running JobId=33 on node(s)=node048: Slurmd could not execve job

drain_nodes: node node048 state set to DRAIN

If anyone can shine some light on where I should start looking, I 
shall be most obliged!


Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
 May the source be with you!





