Re: [slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

Alan Orth Sun, 04 Feb 2018 07:12:19 -0800

I came here looking for this! The last time I tried it, in early December
2017, it was still "broken" with Slurm 17.11.0. Glad to see that it was fixed
in 17.11.1 (and to know why). I've now got PAM limits being applied correctly
on my cluster. Thanks for the link, Andy.


Cheers,

On Fri, Dec 8, 2017 at 10:25 PM Andy Riebs <andy.ri...@hpe.com> wrote:

> Answering my own question: I got a private email pointing to
> <https://bugs.schedmd.com/show_bug.cgi?id=4412>, which describes both the
> problem and the solution. (Thanks Matthieu!)
>
> Andy
>
> On 12/08/2017 11:06 AM, Andy Riebs wrote:
>
> I've gathered more information, and I am probably fighting with PAM.
> Notably, this problem can be reproduced with a single-node, single-task
> job, such as
>
> $ sbatch -N1 --reservation awr
> #!/bin/bash
> hostname
> Submitted batch job 90436
> $ sinfo -R
> batch job complete f slurm     2017-12-08T15:34:37 node017
> $
>
> With SlurmdDebug=debug5, the only thing interesting in slurmd.log is
>
> [2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic signature
> plugin loaded
> [2017-12-08T15:34:37.778] [90436.batch] error: pam_open_session: Cannot
> make/remove an entry for the specified session
> [2017-12-08T15:34:37.779] [90436.batch] error: error in pam_setup
> [2017-12-08T15:34:37.804] [90436.batch] error: job_manager exiting
> abnormally, rc = 4020
> [2017-12-08T15:34:37.804] [90436.batch] job 90436 completed with slurm_rc
> = 4020, job_rc = 0
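
For anyone sifting through a much longer slurmd.log for the same failure, here is a minimal sketch of pulling out just the "error:" records for one job. It assumes the "[timestamp] [jobid.step] message" layout seen in the excerpt above and rejoins lines that the mailer wrapped. The function names (parse_slurmd_log, errors_for_job) are made up for illustration and are not part of Slurm.

```python
import re

# Start of a slurmd log record as it appears in this thread:
# "[timestamp] [jobid.step] message". Assumed from the excerpt only.
RECORD = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<step>\d+\.\w+)\] (?P<msg>.*)$")


def parse_slurmd_log(text):
    """Return a list of dicts with keys ts, step, msg.

    Lines that do not start a new record are treated as wrapped
    continuations and appended to the previous record's message.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = RECORD.match(line)
        if match:
            records.append(match.groupdict())
        elif records:
            records[-1]["msg"] += " " + line
    return records


def errors_for_job(records, job_id):
    """Keep only messages starting with 'error:' for the given job id."""
    prefix = f"{job_id}."
    return [
        r["msg"]
        for r in records
        if r["step"].startswith(prefix) and r["msg"].startswith("error:")
    ]


# Sample input copied from the log excerpt quoted in this message.
SAMPLE = """\
[2017-12-08T15:34:37.770] [90436.batch] Munge cryptographic signature
plugin loaded
[2017-12-08T15:34:37.778] [90436.batch] error: pam_open_session: Cannot
make/remove an entry for the specified session
[2017-12-08T15:34:37.779] [90436.batch] error: error in pam_setup
[2017-12-08T15:34:37.804] [90436.batch] error: job_manager exiting
abnormally, rc = 4020
"""

if __name__ == "__main__":
    for msg in errors_for_job(parse_slurmd_log(SAMPLE), 90436):
        print(msg)
```

Run against the sample, this prints the three error messages for job 90436, with pam_open_session first, which is the one that points at the PAM session stack.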
>
> /etc/pam.d/slurm is defined as
>
> auth            required        pam_localuser.so
> auth            required        pam_shells.so
> account         required        pam_unix.so
> account         required        pam_access.so
> session         required        pam_unix.so
> session         required        pam_loginuid.so
>
> /var/log/secure reports
>
> Dec  8 15:34:37 node017 : pam_unix(slurm:session): open_session - error
> recovering username
> Dec  8 15:34:37 node017 : pam_loginuid(slurm:session): unexpected response
> from failed conversation function
> Dec  8 15:34:37 node017 : pam_loginuid(slurm:session): error recovering
> login user-name
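
The two modules named in /var/log/secure are exactly the two "session" entries in /etc/pam.d/slurm, so every module in the session stack is failing rather than a single misconfigured one. A rough sketch of that cross-check follows. It assumes simple one-word control fields (bracketed controls such as "[success=1 default=ignore]" are not handled) and the "pam_module(service:type)" tag format seen in the log lines above. The helper names are hypothetical.

```python
import re


def parse_pam_stack(text):
    """Parse 'type control module [args...]' rules from a pam.d file.

    Blank lines, comments, and lines with fewer than three fields are
    skipped. Bracketed multi-word controls are not supported here.
    """
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        fields = line.split()
        if len(fields) < 3:
            continue
        entries.append(
            {"type": fields[0], "control": fields[1],
             "module": fields[2], "args": fields[3:]}
        )
    return entries


def modules_for(entries, pam_type):
    """List module file names configured for one PAM type, in order."""
    return [e["module"] for e in entries if e["type"] == pam_type]


# Matches tags such as "pam_unix(slurm:session)" in syslog output.
LOG_TAG = re.compile(r"(pam_\w+)\(([\w-]+):(\w+)\)")


def failing_modules(log_text):
    """Return unique (module.so, service, type) tuples, in first-seen order."""
    seen = []
    for module, service, pam_type in LOG_TAG.findall(log_text):
        key = (module + ".so", service, pam_type)
        if key not in seen:
            seen.append(key)
    return seen


# Both samples are copied from this message.
PAM_CONF = """\
auth            required        pam_localuser.so
auth            required        pam_shells.so
account         required        pam_unix.so
account         required        pam_access.so
session         required        pam_unix.so
session         required        pam_loginuid.so
"""

SECURE_LOG = """\
Dec  8 15:34:37 node017 : pam_unix(slurm:session): open_session - error
recovering username
Dec  8 15:34:37 node017 : pam_loginuid(slurm:session): unexpected response
from failed conversation function
Dec  8 15:34:37 node017 : pam_loginuid(slurm:session): error recovering
login user-name
"""

if __name__ == "__main__":
    stack = parse_pam_stack(PAM_CONF)
    for module, service, pam_type in failing_modules(SECURE_LOG):
        configured = module in modules_for(stack, pam_type)
        print(f"{service}:{pam_type} {module} configured={configured}")
```
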
>
> The message "error recovering username" seems likely to be at the heart of
> the problem here. This worked just fine with Slurm 16.05.8, and I think it
> was also working with Slurm 17.11.0-0pre2.
>
> Any thoughts about where I should go from here?
>
> Andy
>
> On 11/30/2017 08:40 AM, Andy Riebs wrote:
>
> We've just installed 17.11.0 on our 100+ node x86_64 cluster running
> CentOS 7.4 this afternoon, and periodically see a single node (perhaps the
> first node in an allocation?) get drained with the message "batch job
> complete failure".
>
> On one node in question, slurmd.log reports
>
> pam_unix(slurm:session): open_session - error recovering username
> pam_loginuid(slurm:session): unexpected response from failed conversation
> function
>
> On another node drained for the same reason,
>
> error: pam_open_session: Cannot make/remove an entry for the specified
> session
> error: error in pam_setup
> error: job_manager exiting abnormally, rc = 4020
> sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
>
> slurmctld has logged
>
> error: slurmd error running JobId=33 on node(s)=node048: Slurmd could not
> execve job
>
> drain_nodes: node node048 state set to DRAIN
>
> If anyone can shine some light on where I should start looking, I shall be
> most obliged!
>
> Andy
>
> --
> Andy Riebs
> andy.ri...@hpe.com
> Hewlett-Packard Enterprise
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HPE
>     May the source be with you!
>
>
>
> --

Alan Orth
alan.o...@gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
