On Tue, Sep 2, 2014 at 8:38 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> Matt: Random thought -- is your "srun" a shell script, perchance?  (it
> shouldn't be, but perhaps there's some kind of local override...?)
>
> Ralph's point on the call today is that it doesn't matter *how* this
> problem is happening.  It *is* happening to real users, and so we need to
> account for it.
>
> But it really bothers me that we don't understand *how/why* this is
> happening (e.g., is this OMPI's fault somehow?  I don't think so, but then
> again, we don't understand how it's happening).  *Somewhere* in there, a
> shell is getting invoked.  But "srun" shouldn't be invoking a shell on the
> remote side -- it should be directly fork/exec'ing the tokens with no shell
> interpretation at all.
>

Jeff,

Just saw this, sorry. Our srun is indeed a shell script. It seems to be a
wrapper around the regular srun that runs a --task-prolog. What it
does...that's beyond my ken, but I could ask. My guess is that it probably
does something that helps keep our old PBS scripts running (sets
$PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The
admins would, of course, prefer all future scripts be SLURM-native scripts,
but there are a lot of production runs that use many, many PBS scripts.
Converting those would need slow, careful QC to make sure any "pure SLURM"
versions act as expected.
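
For what it's worth, here is a rough sketch of how a shell wrapper can change
the tokens that a direct fork/exec would pass through untouched. This is NOT
our actual srun wrapper (I haven't read it); the helper names
launch_direct/launch_via_wrapper and the sample tokens are made up for
illustration, and it assumes a POSIX "sh" is on the PATH.

```python
import os
import subprocess
import sys

# Child program that prints each argument it received on its own line.
PRINT_ARGS = "import sys; print('\\n'.join(sys.argv[1:]))"


def launch_direct(tokens):
    """fork/exec style launch: every token reaches the child unchanged."""
    result = subprocess.run(
        [sys.executable, "-c", PRINT_ARGS, *tokens],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def launch_via_wrapper(tokens, quoted):
    """Launch through a hypothetical sh wrapper that forwards its arguments.

    quoted=True forwards with "$@" (tokens preserved);
    quoted=False forwards with a bare $* (the shell re-splits on whitespace).
    """
    forward = '"$@"' if quoted else "$*"
    script = f'exec "$PYTHON" -c "$PRINT_ARGS" {forward}'
    env = {**os.environ, "PYTHON": sys.executable, "PRINT_ARGS": PRINT_ARGS}
    result = subprocess.run(
        ["sh", "-c", script, "wrapper", *tokens],
        capture_output=True, text=True, check=True, env=env,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    tokens = ["-x", "FOO=a b"]  # made-up example tokens
    print(launch_direct(tokens))                     # ['-x', 'FOO=a b']
    print(launch_via_wrapper(tokens, quoted=False))  # ['-x', 'FOO=a', 'b']
    print(launch_via_wrapper(tokens, quoted=True))   # ['-x', 'FOO=a b']
```

If our wrapper does something like the unquoted case, that would be one way
for a shell to end up re-interpreting the command line, but that is only a
guess until someone reads the script.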

Matt


-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick
