It turns out all my woes were caused by iptables. iptables was running on several nodes and blocking some of the traffic. Once I shut it down, everything started working perfectly.
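For anyone who hits the same thing, here is roughly what the fix looks like (a sketch for a RHEL/CentOS-style node; it assumes the default SlurmctldPort=6817 and SlurmdPort=6818 from slurm.conf, so adjust for your distro and ports):

  # Quick-and-dirty: turn the firewall off on the affected nodes
  service iptables stop
  chkconfig iptables off

  # Or, if you want to keep iptables running, open the SLURM ports instead
  iptables -A INPUT -p tcp --dport 6817:6818 -j ACCEPT
  service iptables save

Note that srun also listens on ephemeral ports on the submitting host for task I/O, so opening only 6817/6818 may not be enough for job launch; that is why simply stopping iptables was the easiest test.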
On Wed, Jan 18, 2012 at 2:12 PM, Michel Bourget <mic...@sgi.com> wrote:

> On 01/18/2012 01:17 PM, Davis Ford wrote:
>
>> Thanks Michel.
>>
>> Time is not out of sync. I set up ntpd and verified that the time is
>> coordinated.
>> Uids are not out of sync. On each node I did groupadd -g 777 slurm;
>> useradd -g 777 -u 777 slurm; and specified the slurm user in slurm.conf.
>> They are all using the same slurm.conf -- it is linked through NFS.
>> Running as root while not allowing it in slurm.conf? That is an
>> interesting one.
>>
>> If I run something like:
>>
>> srun -n5 whoami
>> root
>> root
>> root
>> root
>> root
>>
>> Why is this? Shouldn't it be executing as the slurm user? In slurm.conf =>
>>
>> SlurmUser=slurm
>>
>> ...but I am executing srun as root.
>
> Ah. That makes sense, no? The submitting user is root, so slurmd switches
> to root. slurmd runs as root (well, as SlurmdUser) and switches to the
> invoking user between the fork() and the execve(). You would not want it
> to switch to SlurmUser, right?
>
> Just tried it and I see that. If I do it as michel, whoami reports michel.
> My understanding is that the SlurmUser concept is about when
> slurmctld/slurmd/slurmstepd (to some degree) perform file actions on the
> checkpoint dir, epilog, prolog, etc.: these are performed as the SlurmUser.
> And slurmctld executes as SlurmUser, but probably (I didn't dig deeper)
> switches to root to perform some actions, then switches back to SlurmUser.
> This also relates to the PrivateData visibility knob, since slurmctld
> executes as SlurmUser: it defines what each user can see of other users'
> jobs, etc.
>
> Btw, it works well. If, from root, I "su - michel" or "su michel", then
> srun -n5 whoami reports michel in both cases. I hope this is what you get ...
>
>> On Wed, Jan 18, 2012 at 1:09 PM, Michel Bourget <mic...@sgi.com> wrote:
>>>
>>> On 01/18/2012 12:44 PM, Davis Ford wrote:
>>>
>>>> I'm seeing another error pop up when I try to run jobs:
>>>>
>>>> srun: error: Task launch for 67.0 failed on node ORLGAS2: Invalid job
>>>> credential
>>>> srun: error: Application launch failed: Invalid job credential
>>>>
>>>> The job submitted was: srun -n10 hostname
>>>>
>>>> Is this related to auth? I'm using munge, and I validated that munge is
>>>> working properly on the node that has the error. If it isn't auth/munge
>>>> related, what does "Invalid job credential" mean?
>>>
>>> I could think of:
>>> - time is out-of-sync by more than 5 minutes
>>> - running as root while not allowing it in slurm.conf
>>> - uids out-of-sync?
>>>
>>> Of course, make sure munge.key is the _same_ on all the nodes :)
>>>
>>> Also, munged could have been oom-killed. I've seen that many times.
>>> Not sure it would result in that symptom but ...
>>>
>>> --
>>> -------------------------------------------------------------
>>> Michel Bourget - SGI - Linux Software Engineering
>>> "Past BIOS POST, everything else is extra" (travis)
>>> -------------------------------------------------------------
>
> --
> -------------------------------------------------------------
> Michel Bourget - SGI - Linux Software Engineering
> "Past BIOS POST, everything else is extra" (travis)
> -------------------------------------------------------------
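For reference, a quick way to sanity-check the munge side of the "Invalid job credential" errors discussed above (a generic sketch; it assumes munge and unmunge are installed in the default locations and that you can ssh to the failing node):

  # The key must be byte-identical on every node (run as root)
  md5sum /etc/munge/munge.key

  # Encode a credential on the submit host and decode it on the failing node;
  # unmunge should report STATUS: Success (0)
  munge -n | ssh ORLGAS2 unmunge

If the cross-node unmunge fails, suspect a mismatched munge.key, clock skew beyond the allowed window, or a munged daemon that is not running (e.g. OOM-killed, as mentioned above).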