On Tue, Sep 2, 2014 at 8:38 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Matt: Random thought -- is your "srun" a shell script, perchance? (it
> shouldn't be, but perhaps there's some kind of local override...?)
>
> Ralph's point on the call today is that it doesn't matter *how* this
> problem is happening. It *is* happening to real users, and so we need to
> account for it.
>
> But it really bothers me that we don't understand *how/why* this is
> happening (e.g., is this OMPI's fault somehow? I don't think so, but then
> again, we don't understand how it's happening). *Somewhere* in there, a
> shell is getting invoked. But "srun" shouldn't be invoking a shell on the
> remote side -- it should be directly fork/exec'ing the tokens with no shell
> interpretation at all.
>

Jeff,

Just saw this, sorry. Our srun is indeed a shell script. It seems to be a
wrapper around the regular srun that runs a --task-prolog. What it
does...that's beyond my ken, but I could ask. My guess is that it probably
does something that helps keep our old PBS scripts running (sets
$PBS_NODEFILE, say). We used to run PBS but switched to SLURM recently. The
admins would, of course, prefer that all future scripts be SLURM-native, but
there are a lot of production runs that use many, many PBS scripts.
Converting those would need slow, careful QC to make sure any "pure SLURM"
versions act as expected.

Matt

-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
get is one trick: rational thinking. But when you're good and crazy,
oooh, oooh, oooh, the sky is the limit!" -- The Tick