On Wed, 11 Jan 2012 17:23:34 -0700, Moe Jette <je...@schedmd.com> wrote:
> Matthieu,
> 
> I do not see any problem with moving the code as you suggest, although it
> will prevent error messages from being sent to the srun command under some
> possible failure conditions. The change could also result in failures
> that are not obvious to us looking at the code right now.
> 
> Something we might consider is adding a new spank plugin function (hook)
> to be called at an earlier point in the slurmstepd code. That should be
> a simple and low-risk change.
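
On the plugin side, the sort of hook Moe is suggesting might look like the
stub below; note that the name slurm_spank_init_pre_io is invented here
purely for illustration, and plugstack.c would also need a matching call
placed before the stdio files are opened.

    #include <slurm/spank.h>

    /*
     * Hypothetical early hook -- neither this callback nor its name exist
     * in SLURM 2.3.x.  The idea is that slurmstepd would run it right
     * after loading the plugin stack and before opening stdout/stderr,
     * so a plugin like auks could install credentials in time.
     */
    int slurm_spank_init_pre_io(spank_t sp, int ac, char **av)
    {
            /* fetch/install credentials here */
            return ESPANK_SUCCESS;
    }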

spank_init() is where the plugins are actually loaded by plugstack.c,
so it would have to be that call which is relocated.

Matthieu, is the relevant auks call made from a slurm_spank_init()
callback, or from slurm_spank_init_post_opt()?

One thing I might want to do is break spank_init_post_opt() out of
spank_init() on the slurmd side, so that it could be run later
in the code (after IO files are set up). However, if auks or other
security-related plugins need to handle options from the user
before IO can be set up, this might not work.
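
For reference, a minimal sketch of the two callbacks in question as a
plugin would implement them (plugin name and bodies are illustrative, not
the actual auks code):

    #include <slurm/spank.h>

    SPANK_PLUGIN(example, 1);

    /* Runs in slurmstepd when the plugin stack is loaded, i.e. from
     * wherever spank_init() ends up being called in job_manager(). */
    int slurm_spank_init(spank_t sp, int ac, char **av)
    {
            if (!spank_remote(sp))
                    return ESPANK_SUCCESS;  /* only act in the stepd context */
            /* a credential plugin would fetch/install its credential here */
            return ESPANK_SUCCESS;
    }

    /* Runs after user-supplied plugin options have been processed; in 2.3
     * this happens within the same spank_init() pass on the stepd side. */
    int slurm_spank_init_post_opt(spank_t sp, int ac, char **av)
    {
            return ESPANK_SUCCESS;
    }

If auks does its work from slurm_spank_init(), splitting the post_opt pass
out as described above might be enough; if it acts from
slurm_spank_init_post_opt() (e.g. to honour the --auks=... option), the
post_opt pass would still have to run before the IO files are opened.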

mark

 
> Moe
> 
> 
> Quoting Matthieu Hautreux <matthieu.hautr...@cea.fr>:
> 
> > On 01/11/2012 10:03 PM, Mark A. Grondona wrote:
> >> On Wed, 11 Jan 2012 20:26:10 +0100, Carlos Aguado  
> >> Sanchez<carlos.agu...@epfl.ch>  wrote:
> >>> Dear devteam,
> >>>
> >>>
> >>> We are experiencing some issues with slurm-2.3.1 and the AUKS plugin for
> >>> integration with NFSv4 and Kerberos. Although that is an external tool,
> >>> I thought it might be relevant to this mailing list since it uses the
> >>> spank interface, hence the question.
> >>>
> >>> When submitting a job whose working directory sits on NFS, I observe that
> >>> the job step does not get Kerberos credentials before opening the output
> >>> file and consequently fails.
> >>>
> >>> Specifying an alternate output file makes the kernel cache the gssrpc
> >>> context, and all subsequent jobs succeed for as long as those contexts
> >>> remain cached [1].
> >>>
> >>> With a high debug level in the slurmd logs (below), I observe that the
> >>> AUKS spank plugin has not yet been called at that point.
> >>>
> >>>
> >>> Is this a known/desired behavior? Could a workaround perhaps be devised
> >>> using the SrunProlog without too much hassle? What would you
> >>> advise?
> >> As you found, output files are opened before the call to spank_init().
> >> I wonder if there is any reason not to move the spank_init() call
> >> up to where all the other plugins are initialized (at the beginning
> >> of job_manager()).
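
The ordering the logs below show, roughly sketched (simplified, not the
actual slurmstepd code):

    job_manager():
        _setup_normal_io()  /* stdout/stderr opened as the job user */
        ...
        spank_init()        /* plugin stack loaded; auks installs the
                               credential only at this point */
        /* fork/exec of the tasks follows */

so a credential set up by a spank plugin arrives only after the open()
that needed it.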
> >>
> >> mark
> >
> > This problem is also of great interest to Andreas Davour, who was
> > looking to get something working with AFS (see a previous message from
> > today). I am pretty sure that regardless of the workaround he may find
> > to generate an AFS token, he will run into the exact same problem.
> >
> > Looking at the code in job_manager(), I was wondering the same thing
> > concerning the spank stack.
> >
> > If we consider that one needs a security token before accessing one's
> > output/error files, then the initialization of this token must happen
> > before trying to open those files. As the Kerberos credential is
> > currently managed by SLURM in a spank plugin, the spank plugin should
> > thus be initialized first. Since a secured FS could require an
> > intermediate token (as in AFS), and since that kind of token is commonly
> > managed by PAM modules, pam_setup should also be executed after the
> > spank init and before any attempt to open a file in the user's context.
> > To be sure that spank plugins are executed inside the job's container,
> > I also think that slurm_container_create should be moved so that it
> > runs before spank_init.
> >
> > This would then require moving slurm_container_create, spank_init and
> > pam_setup from _fork_all_tasks to job_manager, before the IO setup.
> > That will probably prevent errors raised during these calls from being
> > printed to srun's stdout/stderr. As pam_setup is tightly coupled with
> > the privilege dropping in _fork_all_tasks, it will probably also
> > require splitting the internal logic of that function.
> > Another option could be to move the IO setup, as well as the subsequent
> > log initialization, to just after pam_setup in _fork_all_tasks.
> > That is clearly a big modification and I am not sure of the
> > consequences. Moe, do you think this could be feasible, or do you see
> > blocking problems with these two possibilities?
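
A rough sketch of the reordering being proposed here, for reference only
(simplified; error paths, the privilege handling and the _fork_all_tasks
split are left out):

    job_manager():
        slurm_container_create()  /* container exists before any plugin runs   */
        spank_init()              /* auks/AFS plugins install credentials here */
        pam_setup()               /* PAM-managed tokens (e.g. AFS) set up      */
        _setup_normal_io()        /* opening stdout/stderr now sees the creds  */
        _fork_all_tasks()         /* privilege drop and task launch            */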
> >
> > Regards,
> > Matthieu
> >
> >>
> >>
> >>> Best regards,
> >>> Carlos
> >>>
> >>>
> >>>
> >>> [caguado@node01:/nfs4/cluster.san/user/caguado/public]$ sbatch
> >>> --auks=yes -pbatch -n5 test.sh
> >>> Submitted batch job 52
> >>> [caguado@node01:/nfs4/cluster.san/user/caguado/public]$ scontrol show job
> >>> JobId=52 Name=test.sh
> >>>     UserId=caguado(118403) GroupId=bbp(10067)
> >>>     Priority=4294901712 Account=infrastructure QOS=normal
> >>>     JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
> >>>     Requeue=1 Restarts=0 BatchFlag=1 ExitCode=1:0
> >>>     RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
> >>>     SubmitTime=2012-01-11T18:35:54 EligibleTime=2012-01-11T18:35:54
> >>>     StartTime=2012-01-11T18:35:54 EndTime=2012-01-11T18:35:54
> >>>     PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >>>     Partition=batch AllocNode:Sid=bbplinsrv2:7893
> >>>     ReqNodeList=(null) ExcNodeList=(null)
> >>>     NodeList=node[02-06]
> >>>     BatchHost=node02
> >>>     NumNodes=5 NumCPUs=50 CPUs/Task=1 ReqS:C:T=*:*:*
> >>>     MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
> >>>     Features=(null) Gres=(null) Reservation=(null)
> >>>     Shared=OK Contiguous=0 Licenses=(null) Network=(null)
> >>>     Command=/nfs4/main.cluster/user/caguado/public/test.sh
> >>>     WorkDir=/nfs4/main.cluster/user/caguado/public
> >>>
> >>>
> >>> And from /var/log/slurm/slurmd.log:
> >>>
> >>> Entering _setup_normal_io
> >>> [2012-01-11T18:35:54] [52] eio: handling events for 1 objects
> >>> [2012-01-11T18:35:54] [52] Called _msg_socket_readable
> >>> [2012-01-11T18:35:54] [52] Uncached user/gid: caguado/10067
> >>> [2012-01-11T18:35:54] [52] eio: handling events for 1 objects
> >>> [2012-01-11T18:35:54] [52] Called _msg_socket_readable
> >>> [2012-01-11T18:35:54] [52]   stdin file name = /dev/null
> >>> [2012-01-11T18:35:54] [52]   stdout file name =
> >>> /nfs4/cluster.san/user/caguado/public/slurm-52.out
> >>> [2012-01-11T18:35:54] [52] Could not open stdout file
> >>> /nfs4/cluster.san/user/caguado/public/slurm-52.out: Permission denied
> >>> [2012-01-11T18:35:54] [52] eio: handling events for 1 objects
> >>> [2012-01-11T18:35:54] [52] Called _msg_socket_readable
> >>> [2012-01-11T18:35:54] [52] Leaving  _setup_normal_io
> >>> [2012-01-11T18:35:54] [52] IO setup failed: Permission denied
> >>> [2012-01-11T18:35:54] [52] Before call to spank_fini()
> >>> [2012-01-11T18:35:54] [52] After call to spank_fini()
> >>>
> >>>
> >>> Excerpt of /var/log/slurm/slurmd.log in a successful case:
> >>>
> >>> [2012-01-11T20:04:06] [1147] Entering _setup_normal_io
> >>> [2012-01-11T20:04:06] [1147] eio: handling events for 1 objects
> >>> [2012-01-11T20:04:06] [1147] Called _msg_socket_readable
> >>> [2012-01-11T20:04:06] [1147] Uncached user/gid: caguado/10067
> >>> [2012-01-11T20:04:06] [1147] eio: handling events for 1 objects
> >>> [2012-01-11T20:04:06] [1147] Called _msg_socket_readable
> >>> [2012-01-11T20:04:06] [1147]   stdin file name = /dev/null
> >>> [2012-01-11T20:04:06] [1147]   stdout file name = /tmp/output
> >>> [2012-01-11T20:04:06] [1147]   stderr file name = /tmp/output
> >>> [2012-01-11T20:04:06] [1147] eio: handling events for 1 objects
> >>> [2012-01-11T20:04:06] [1147] Called _msg_socket_readable
> >>> [2012-01-11T20:04:06] [1147] Leaving  _setup_normal_io
> >>> [2012-01-11T20:04:06] [1147] debug level = 2
> >>> [2012-01-11T20:04:06] [1147] Before call to spank_init()
> >>> [2012-01-11T20:04:06] [1147] spank: opening plugin stack
> >>> /etc/slurm/plugstack.conf
> >>> [2012-01-11T20:04:06] [1147] /etc/slurm/plugstack.conf: 1: include
> >>> "/etc/slurm/plugstack.conf.d/*"
> >>> [2012-01-11T20:04:06] [1147] spank: opening plugin stack
> >>> /etc/slurm/plugstack.conf.d/99-lua
> >>> [2012-01-11T20:04:06] [1147] spank: opening plugin stack
> >>> /etc/slurm/plugstack.conf.d/auks.conf
> >>> [2012-01-11T20:04:06] [1147] Couldn't find sym
> >>> 'slurm_spank_task_init_privileged' in the plugin
> >>> [2012-01-11T20:04:06] [1147] Couldn't find sym 'slurm_spank_task_init'
> >>> in the plugin
> >>> [2012-01-11T20:04:06] [1147] Couldn't find sym
> >>> 'slurm_spank_task_post_fork' in the plugin
> >>> [2012-01-11T20:04:06] [1147] spank:
> >>> /etc/slurm/plugstack.conf.d/auks.conf:40: Loaded plugin auks.so
> >>> [2012-01-11T20:04:06] [1147] spank: opening plugin stack
> >>> /etc/slurm/plugstack.conf.d/auks.conf.example
> >>> [2012-01-11T20:04:06] [1147] SPANK: appending plugin option "auks"
> >>> [2012-01-11T20:04:07] [1147] spank-auks: user '118403' cred stored in
> >>> /tmp/krb5cc_118403_1147_UUzVxf
> >>> [2012-01-11T20:04:07] [1147] spank: auks.so: init = 0
> >>> [2012-01-11T20:04:07] [1147] spank: auks.so: init_post_opt = 0
> >>> [2012-01-11T20:04:07] [1147] After call to spank_init()
> >>>
> >>>
> >>> Ref:
> >>> [1] http://www.citi.umich.edu/projects/nfsv4/linux/faq/#krb5_006
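
For reference, the plugin stack configuration implied by the successful
log above looks roughly like this (the auks plugin arguments, and whether
it is declared required or optional, are not visible in the log):

    # /etc/slurm/plugstack.conf
    include "/etc/slurm/plugstack.conf.d/*"

    # /etc/slurm/plugstack.conf.d/auks.conf (auks.so loaded at line 40)
    # general line format: required|optional <plugin> [arguments]
    optional auks.so ...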
> >>>
> >
> >
> 
> 
> 
