On Wed, 11 Jan 2012 20:26:10 +0100, Carlos Aguado Sanchez <carlos.agu...@epfl.ch> wrote:
> Dear devteam,
>
> We are experiencing some issues with slurm-2.3.1 and the AUKS plugin for
> integration with NFSv4 and Kerberos. Although that is an external tool,
> I thought it might be related to this mailing list as it uses the spank
> interface, hence the question.
>
> When submitting a job whose working directory sits on NFS, I observe
> that the job step does not get Kerberos credentials before opening the
> output file and consequently fails.
>
> Specifying an alternate output file makes the kernel cache the gssrpc
> context, and all subsequent jobs are successful for as long as it
> remains cached [1].
>
> With high debug in slurmd logs (below), I observe that the AUKS spank
> plugin has not been called yet.
>
> Is this a known/desired behavior? Perhaps some workaround could be
> envisioned using the SrunProlog without too much hassle? What would you
> advise?

As you found, output files are opened before the call to spank_init().
I wonder if there is any reason not to move the spank_init() call up
to where all the other plugins are initialized (at the beginning of
job_manager()).

mark

>
> Best regards,
> Carlos
>
>
> [caguado@node01:/nfs4/cluster.san/user/caguado/public]$ sbatch
> --auks=yes -pbatch -n5 test.sh
> Submitted batch job 52
> [caguado@node01:/nfs4/cluster.san/user/caguado/public]$ scontrol show job
> JobId=52 Name=test.sh
> UserId=caguado(118403) GroupId=bbp(10067)
> Priority=4294901712 Account=infrastructure QOS=normal
> JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 ExitCode=1:0
> RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
> SubmitTime=2012-01-11T18:35:54 EligibleTime=2012-01-11T18:35:54
> StartTime=2012-01-11T18:35:54 EndTime=2012-01-11T18:35:54
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> Partition=batch AllocNode:Sid=bbplinsrv2:7893
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=node[02-06]
> BatchHost=node02
> NumNodes=5 NumCPUs=50 CPUs/Task=1 ReqS:C:T=*:*:*
> MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
> Features=(null) Gres=(null) Reservation=(null)
> Shared=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/nfs4/main.cluster/user/caguado/public/test.sh
> WorkDir=/nfs4/main.cluster/user/caguado/public
>
>
> And from /var/log/slurm/slurmd.log:
>
> Entering _setup_normal_io
> [2012-01-11T18:35:54] [52] eio: handling events for 1 objects
> [2012-01-11T18:35:54] [52] Called _msg_socket_readable
> [2012-01-11T18:35:54] [52] Uncached user/gid: caguado/10067
> [2012-01-11T18:35:54] [52] eio: handling events for 1 objects
> [2012-01-11T18:35:54] [52] Called _msg_socket_readable
> [2012-01-11T18:35:54] [52] stdin file name = /dev/null
> [2012-01-11T18:35:54] [52] stdout file name =
> /nfs4/cluster.san/user/caguado/public/slurm-52.out
> [2012-01-11T18:35:54] [52] Could not open stdout file
> /nfs4/cluster.san/user/caguado/public/slurm-52.out: Permission denied
> [2012-01-11T18:35:54] [52] eio: handling events for 1 objects
> [2012-01-11T18:35:54] [52] Called _msg_socket_readable
> [2012-01-11T18:35:54] [52] Leaving _setup_normal_io
> [2012-01-11T18:35:54] [52] IO setup failed: Permission denied
> [2012-01-11T18:35:54] [52] Before call to spank_fini()
> [2012-01-11T18:35:54] [52] After call to spank_fini()
>
>
> Excerpt of /var/log/slurm/slurmd.log in a successful case:
>
> [2012-01-11T20:04:06] [1147] Entering _setup_normal_io
> [2012-01-11T20:04:06] [1147] eio: handling events for 1 objects
> [2012-01-11T20:04:06] [1147] Called _msg_socket_readable
> [2012-01-11T20:04:06] [1147] Uncached user/gid: caguado/10067
> [2012-01-11T20:04:06] [1147] eio: handling events for 1 objects
> [2012-01-11T20:04:06] [1147] Called _msg_socket_readable
> [2012-01-11T20:04:06] [1147] stdin file name = /dev/null
> [2012-01-11T20:04:06] [1147] stdout file name = /tmp/output
> [2012-01-11T20:04:06] [1147] stderr file name = /tmp/output
> [2012-01-11T20:04:06] [1147] eio: handling events for 1 objects
> [2012-01-11T20:04:06] [1147] Called _msg_socket_readable
> [2012-01-11T20:04:06] [1147] Leaving _setup_normal_io
> [2012-01-11T20:04:06] [1147] debug level = 2
> [2012-01-11T20:04:06] [1147] Before call to spank_init()
> [2012-01-11T20:04:06] [1147] spank: opening plugin stack
> /etc/slurm/plugstack.conf
> [2012-01-11T20:04:06] [1147] /etc/slurm/plugstack.conf: 1: include
> "/etc/slurm/plugstack.conf.d/*"
> [2012-01-11T20:04:06] [1147] spank: opening plugin stack
> /etc/slurm/plugstack.conf.d/99-lua
> [2012-01-11T20:04:06] [1147] spank: opening plugin stack
> /etc/slurm/plugstack.conf.d/auks.conf
> [2012-01-11T20:04:06] [1147] Couldn't find sym
> 'slurm_spank_task_init_privileged' in the plugin
> [2012-01-11T20:04:06] [1147] Couldn't find sym 'slurm_spank_task_init'
> in the plugin
> [2012-01-11T20:04:06] [1147] Couldn't find sym
> 'slurm_spank_task_post_fork' in the plugin
> [2012-01-11T20:04:06] [1147] spank:
> /etc/slurm/plugstack.conf.d/auks.conf:40: Loaded plugin auks.so
> [2012-01-11T20:04:06] [1147] spank: opening plugin stack
> /etc/slurm/plugstack.conf.d/auks.conf.example
> [2012-01-11T20:04:06] [1147] SPANK: appending plugin option "auks"
> [2012-01-11T20:04:07] [1147] spank-auks: user '118403' cred stored in
> /tmp/krb5cc_118403_1147_UUzVxf
> [2012-01-11T20:04:07] [1147] spank: auks.so: init = 0
> [2012-01-11T20:04:07] [1147] spank: auks.so: init_post_opt = 0
> [2012-01-11T20:04:07] [1147] After call to spank_init()
>
>
> Ref:
> [1] http://www.citi.umich.edu/projects/nfsv4/linux/faq/#krb5_006
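
For readers following the thread, the ordering issue can be pictured with a toy Python model. This is only a sketch, not Slurm code: the functions below are hypothetical stand-ins named after the log messages, and the model assumes nothing beyond what the logs show, namely that IO setup runs before spank_init() and that the output file on the Kerberized NFSv4 mount cannot be opened until the AUKS plugin has obtained a credential.

```python
# Toy model only: hypothetical stand-ins, not the slurmstepd implementation.

class PermissionDenied(Exception):
    """Models the 'Permission denied' seen when opening the NFSv4 output file."""


def run_step(order):
    """Run the named setup phases in order; return the phases that completed."""
    state = {"has_krb5_cred": False}
    done = []

    def spank_init():
        # Stand-in for the AUKS spank plugin fetching the user's credential.
        state["has_krb5_cred"] = True

    def setup_io():
        # Stand-in for opening slurm-<jobid>.out on a Kerberized NFSv4 mount.
        if not state["has_krb5_cred"]:
            raise PermissionDenied("Could not open stdout file")

    phases = {"spank_init": spank_init, "setup_io": setup_io}
    for name in order:
        phases[name]()
        done.append(name)
    return done


def try_order(order):
    """Return (completed_phases, None) on success or (None, error) on failure."""
    try:
        return run_step(order), None
    except PermissionDenied as exc:
        return None, str(exc)


# Order observed in the failing log: IO setup first, plugins afterwards.
current = try_order(["setup_io", "spank_init"])
# Order suggested in the reply: plugins first, then IO setup.
proposed = try_order(["spank_init", "setup_io"])

print("current order: ", current)
print("proposed order:", proposed)
```

In this model the first ordering fails at IO setup and never reaches the plugin, which matches the failing log; the second ordering completes both phases. Whether moving the real spank_init() call has other side effects is exactly the open question raised in the reply.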