Ralph Castain wrote:
We would consider it a "feature" that OpenMPI is integrated with Torque. We
actually read the PBS_NODEFILE internally ourselves. I believe the problem
here is that specifying the "machinefile" prevents us from using that
Torque-integrated code and forces us down a different code path that doesn't
correctly interpret the PBS_NODEFILE format.

We probably should consider your observation a "bug" - frankly, it wasn't
something anyone anticipated a user ever doing, so nobody I know of ever
tested it. I'd have to dig into the internals to understand how you wound up
in that particular error mode.

I'd say that this behavior of mpirun under Torque TM should be considered
a bug. Ideally, users should not have to write their scripts differently
depending on whether the sysadmin decided to build in TM support or not.
Also, for interactive tests one doesn't have TM.  I think that mpirun just
ought to work no matter what...

So I'd strongly propose that "-machinefile" should at least be tolerated
when mpirun executes under TM.  You might issue a warning about -machinefile
being ignored under TM, but the code should never bomb out, IMHO.
Such behavior would be much easier for users (and sysadmins :-) to
understand than the present situation.
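
To make the proposal concrete, here is a rough sketch in Python of the
decision I have in mind. It is not Open MPI's actual code: the function
name, its arguments and the return values are all made up for illustration,
and I am assuming the launcher already knows whether TM support is available.

```python
import warnings


def choose_node_source(machinefile, pbs_nodefile, tm_available):
    """Hypothetical helper: decide where the node list should come from.

    machinefile   -- path given with -machinefile, or None
    pbs_nodefile  -- value of $PBS_NODEFILE, or None outside a Torque job
    tm_available  -- True if the launcher was built with TM support (assumed known)
    """
    if tm_available and pbs_nodefile:
        # Under TM the allocation comes from Torque; tolerate -machinefile
        # by warning about it instead of failing.
        if machinefile:
            warnings.warn("-machinefile ignored: running under Torque TM")
        return ("tm", pbs_nodefile)
    if machinefile:
        # No TM: honor the machinefile as usual.
        return ("machinefile", machinefile)
    # Nothing specified: fall back to the local host.
    return ("localhost", None)
```

With this logic the same job script would run whether or not TM was built
in; the only visible difference under TM is the warning.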

Thanks again,
Ole
