It turns out all my woes were caused by iptables.  iptables was running on
several nodes and blocking some traffic.  Once I shut it down everything
started working perfectly.
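
For anyone who hits the same symptom: rather than disabling iptables
outright, the relevant ports can be opened on each node instead.  A rough
sketch, assuming the default SlurmctldPort/SlurmdPort of 6817/6818 (check
slurm.conf if yours differ); srun also listens on ephemeral ports, so a
blanket rule for the cluster's private subnet is often simpler:

    # insert ahead of any existing REJECT rules (illustrative only)
    iptables -I INPUT -p tcp --dport 6817 -j ACCEPT   # slurmctld
    iptables -I INPUT -p tcp --dport 6818 -j ACCEPT   # slurmd
    # or just trust the cluster subnet (example address range)
    iptables -I INPUT -s 192.168.1.0/24 -j ACCEPT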

On Wed, Jan 18, 2012 at 2:12 PM, Michel Bourget <mic...@sgi.com> wrote:

> On 01/18/2012 01:17 PM, Davis Ford wrote:
>
>> Thanks Michael.
>>
>> - time is not out of sync.  I set up ntpd and verified that the clocks
>>   are coordinated.
>> - uids are not out of sync.  On each node I did groupadd -g 777 slurm;
>>   useradd -g 777 -u 777 slurm; and specified the slurm user in
>>   slurm.conf.  They are all using the same slurm.conf -- it is linked
>>   through NFS.
>> - running as root while not allowing it in slurm.conf?  That is an
>>   interesting one.
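>>
>> For reference, both checks are easy to repeat from the head node with a
>> loop like this (node names here are just placeholders):
>>
>>     for h in node01 node02 node03; do
>>       echo "== $h"; ssh $h 'date +%s; id slurm'
>>     done
>>
>> The epoch times should agree to within a second or two, and id should
>> report uid=777 gid=777 for slurm on every node.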
>>
>> If I run something like:
>>
>> srun -n5 whoami
>> root
>> root
>> root
>> root
>> root
>>
>> Why is this?  Shouldn't it be executing as the slurm user?  In slurm.conf
>> =>
>>
>> SlurmUser=slurm
>>
>> ...but I am executing srun as root
>>
>
>
> Ah. That makes sense, no ?  The submitting user is root, so slurmd
> switches to root.  slurmd runs as root (well, SlurmdUser) and switches to
> the invoking user between the fork() and the execve().  You would not
> want it to switch to SlurmUser, right ?
>
> Just tried it and I see that.  If I do it as michel, whoami reports
> michel.  My understanding is that the SlurmUser concept matters when
> slurmctld/slurmd/slurmstepd (to some degree) perform file actions on the
> checkpoint dir, epilog, prolog, etc.: these are performed as the
> SlurmUser.  And slurmctld executes as SlurmUser ... but probably (I
> didn't dig deeper) it switches to root for some actions, then switches
> back to SlurmUser.  This also relates to the PrivateData visibility knob,
> since slurmctld executes as SlurmUser: it defines what users can see of
> each other's jobs, etc.
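>
> To make the split concrete, the two knobs look like this in slurm.conf
> (the values just mirror what is being used in this thread; SlurmdUser
> defaults to root):
>
>     SlurmUser=slurm     # account slurmctld runs as
>     SlurmdUser=root     # slurmd stays root so the forked task can
>                         # setuid() to whoever invoked srun
>
> The job itself always runs as the submitting user, never as SlurmUser.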
>
>
> Btw, it works well.  If, from root, I "su - michel" or "su michel",
> srun -n5 whoami reports michel in both cases.  I hope this is what you
> get ...
>
>> On Wed, Jan 18, 2012 at 1:09 PM, Michel Bourget <mic...@sgi.com> wrote:
>>
>>  On 01/18/2012 12:44 PM, Davis Ford wrote:
>>>
>>>  I'm seeing another error pop up when I try to run jobs:
>>>>
>>>> srun: error: Task launch for 67.0 failed on node ORLGAS2: Invalid job
>>>> credential
>>>> srun: error: Application launch failed: Invalid job credential
>>>>
>>>> The jobs submitted was: srun -n10 hostname
>>>>
>>>> Is this related to auth?  I'm using munge, and I validated that munge is
>>>> working properly on the node that has the error.  If it isn't auth/munge
>>>> related, what does "Invalid job credential" mean?
>>>>
>>>>
>>>>  I could think of:
>>> - time is out-of-sync by more than 5 minutes
>>> - running as root while not allowing it in slurm.conf
>>> - uids out-of-sync ?
>>>
>>> Of course, make sure munge.key is the _same_ on all the nodes :)
>>>
>>> Also, munged itself could have been oom-killed; I've seen that many
>>> times.  Not sure it would result in that exact symptom, but ...
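>>>
>>> Re the munge.key point: the standard cross-node check from the munge
>>> docs is a quick way to rule out a key or clock problem (replace
>>> "othernode" with a real compute node):
>>>
>>>     munge -n | unmunge                 # local encode/decode round trip
>>>     munge -n | ssh othernode unmunge   # succeeds only if the keys match
>>>                                        # and the clocks are close enough
>>>
>>> If the remote unmunge fails, credential checks between those nodes will
>>> fail as well.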
>>>
>>>
>>>
>>> --
>>>
>>> -------------------------------------------------------------
>>>
>>>     Michel Bourget - SGI - Linux Software Engineering
>>>    "Past BIOS POST, everything else is extra" (travis)
>>> -------------------------------------------------------------
>>>
>>>
>>>
>
> --
>
> -------------------------------------------------------------
>     Michel Bourget - SGI - Linux Software Engineering
>    "Past BIOS POST, everything else is extra" (travis)
> -------------------------------------------------------------
>
>
