On Wed, 11 Jan 2012 20:26:10 +0100, Carlos Aguado Sanchez 
<carlos.agu...@epfl.ch> wrote:
> Dear devteam,
> 
> 
> We are experiencing some issues with slurm-2.3.1 and the AUKS plugin for 
> integration with NFSv4 and Kerberos. Although that is an external tool, 
> I thought it might be related to this mailing list as it uses the spank 
> interface, hence the question.
> 
> When submitting a job whose working directory sits on NFS, I observe 
> that the job step does not get Kerberos credentials before opening the 
> output file and consequently fails.
> 
> Specifying an alternate output file makes the kernel cache the gssrpc 
> context, and all subsequent jobs succeed for as long as that context 
> remains cached [1].
> 
> With high debug in slurmd logs (below), I observe that the AUKS spank 
> plugin has not been called yet.
> 
> 
> Is this known/desired behavior? Perhaps a workaround could be 
> envisioned using the SrunProlog without too much hassle? What would you 
> advise?

As you found, output files are opened before the call to spank_init().
I wonder if there is any reason not to move the spank_init() call
up to where all the other plugins are initialized (at the beginning
of job_manager()).
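
To make the ordering issue concrete, here is a minimal, self-contained sketch in C. It is not Slurm code: the struct and every function name below (job_state, fake_spank_init, fake_setup_io, run_job_io_first, run_job_spank_first) are hypothetical stand-ins, and the only assumption carried over from the logs is that opening the output file on Kerberized NFS fails until the plugin has obtained credentials.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for per-job state; not a real Slurm structure. */
struct job_state {
    bool has_krb5_creds;  /* set once the credential plugin has run */
    bool output_opened;   /* set once the output file has been opened */
};

/* Stands in for spank_init(): the point where a plugin such as AUKS
 * would fetch the user's Kerberos credentials. */
static int fake_spank_init(struct job_state *job)
{
    assert(job != NULL);
    job->has_krb5_creds = true;
    return 0;
}

/* Stands in for the I/O setup step: opening a file on Kerberized NFS
 * is assumed to fail when no credentials are available yet. */
static int fake_setup_io(struct job_state *job)
{
    assert(job != NULL);
    if (!job->has_krb5_creds) {
        fprintf(stderr, "Could not open stdout file: Permission denied\n");
        return -1;
    }
    job->output_opened = true;
    return 0;
}

/* Ordering seen in the failing log: I/O setup first, plugins second. */
int run_job_io_first(void)
{
    struct job_state job = { false, false };
    if (fake_setup_io(&job) != 0)
        return -1;  /* fails before the plugin is ever called */
    return fake_spank_init(&job);
}

/* Ordering suggested above: plugins first, I/O setup second. */
int run_job_spank_first(void)
{
    struct job_state job = { false, false };
    if (fake_spank_init(&job) != 0)
        return -1;
    return fake_setup_io(&job);
}
```

In this toy model, run_job_io_first() returns -1 and run_job_spank_first() returns 0. Whether the real spank_init() can safely run that early is exactly the open question.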

mark

 
> 
> Best regards,
> Carlos
> 
> 
> 
> [caguado@node01:/nfs4/cluster.san/user/caguado/public]$ sbatch 
> --auks=yes -pbatch -n5 test.sh
> Submitted batch job 52
> [caguado@node01:/nfs4/cluster.san/user/caguado/public]$ scontrol show job
> JobId=52 Name=test.sh
>     UserId=caguado(118403) GroupId=bbp(10067)
>     Priority=4294901712 Account=infrastructure QOS=normal
>     JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
>     Requeue=1 Restarts=0 BatchFlag=1 ExitCode=1:0
>     RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
>     SubmitTime=2012-01-11T18:35:54 EligibleTime=2012-01-11T18:35:54
>     StartTime=2012-01-11T18:35:54 EndTime=2012-01-11T18:35:54
>     PreemptTime=None SuspendTime=None SecsPreSuspend=0
>     Partition=batch AllocNode:Sid=bbplinsrv2:7893
>     ReqNodeList=(null) ExcNodeList=(null)
>     NodeList=node[02-06]
>     BatchHost=node02
>     NumNodes=5 NumCPUs=50 CPUs/Task=1 ReqS:C:T=*:*:*
>     MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>     Features=(null) Gres=(null) Reservation=(null)
>     Shared=OK Contiguous=0 Licenses=(null) Network=(null)
>     Command=/nfs4/main.cluster/user/caguado/public/test.sh
>     WorkDir=/nfs4/main.cluster/user/caguado/public
> 
> 
> And from /var/log/slurm/slurmd.log:
> 
> Entering _setup_normal_io
> [2012-01-11T18:35:54] [52] eio: handling events for 1 objects
> [2012-01-11T18:35:54] [52] Called _msg_socket_readable
> [2012-01-11T18:35:54] [52] Uncached user/gid: caguado/10067
> [2012-01-11T18:35:54] [52] eio: handling events for 1 objects
> [2012-01-11T18:35:54] [52] Called _msg_socket_readable
> [2012-01-11T18:35:54] [52]   stdin file name = /dev/null
> [2012-01-11T18:35:54] [52]   stdout file name = 
> /nfs4/cluster.san/user/caguado/public/slurm-52.out
> [2012-01-11T18:35:54] [52] Could not open stdout file 
> /nfs4/cluster.san/user/caguado/public/slurm-52.out: Permission denied
> [2012-01-11T18:35:54] [52] eio: handling events for 1 objects
> [2012-01-11T18:35:54] [52] Called _msg_socket_readable
> [2012-01-11T18:35:54] [52] Leaving  _setup_normal_io
> [2012-01-11T18:35:54] [52] IO setup failed: Permission denied
> [2012-01-11T18:35:54] [52] Before call to spank_fini()
> [2012-01-11T18:35:54] [52] After call to spank_fini()
> 
> 
> Excerpt of /var/log/slurm/slurmd.log in a successful case:
> 
> [2012-01-11T20:04:06] [1147] Entering _setup_normal_io
> [2012-01-11T20:04:06] [1147] eio: handling events for 1 objects
> [2012-01-11T20:04:06] [1147] Called _msg_socket_readable
> [2012-01-11T20:04:06] [1147] Uncached user/gid: caguado/10067
> [2012-01-11T20:04:06] [1147] eio: handling events for 1 objects
> [2012-01-11T20:04:06] [1147] Called _msg_socket_readable
> [2012-01-11T20:04:06] [1147]   stdin file name = /dev/null
> [2012-01-11T20:04:06] [1147]   stdout file name = /tmp/output
> [2012-01-11T20:04:06] [1147]   stderr file name = /tmp/output
> [2012-01-11T20:04:06] [1147] eio: handling events for 1 objects
> [2012-01-11T20:04:06] [1147] Called _msg_socket_readable
> [2012-01-11T20:04:06] [1147] Leaving  _setup_normal_io
> [2012-01-11T20:04:06] [1147] debug level = 2
> [2012-01-11T20:04:06] [1147] Before call to spank_init()
> [2012-01-11T20:04:06] [1147] spank: opening plugin stack 
> /etc/slurm/plugstack.conf
> [2012-01-11T20:04:06] [1147] /etc/slurm/plugstack.conf: 1: include 
> "/etc/slurm/plugstack.conf.d/*"
> [2012-01-11T20:04:06] [1147] spank: opening plugin stack 
> /etc/slurm/plugstack.conf.d/99-lua
> [2012-01-11T20:04:06] [1147] spank: opening plugin stack 
> /etc/slurm/plugstack.conf.d/auks.conf
> [2012-01-11T20:04:06] [1147] Couldn't find sym 
> 'slurm_spank_task_init_privileged' in the plugin
> [2012-01-11T20:04:06] [1147] Couldn't find sym 'slurm_spank_task_init' 
> in the plugin
> [2012-01-11T20:04:06] [1147] Couldn't find sym 
> 'slurm_spank_task_post_fork' in the plugin
> [2012-01-11T20:04:06] [1147] spank: 
> /etc/slurm/plugstack.conf.d/auks.conf:40: Loaded plugin auks.so
> [2012-01-11T20:04:06] [1147] spank: opening plugin stack 
> /etc/slurm/plugstack.conf.d/auks.conf.example
> [2012-01-11T20:04:06] [1147] SPANK: appending plugin option "auks"
> [2012-01-11T20:04:07] [1147] spank-auks: user '118403' cred stored in 
> /tmp/krb5cc_118403_1147_UUzVxf
> [2012-01-11T20:04:07] [1147] spank: auks.so: init = 0
> [2012-01-11T20:04:07] [1147] spank: auks.so: init_post_opt = 0
> [2012-01-11T20:04:07] [1147] After call to spank_init()
> 
> 
> Ref:
> [1] http://www.citi.umich.edu/projects/nfsv4/linux/faq/#krb5_006
> 
