All solved, and everything now works well! The culprit was a missing line in the "maui.cfg" file:
JOBNODEMATCHPOLICY EXACTNODE

The default value for this variable is EXACTPROC and, in its presence, Maui
completely ignores the "-l nodes=N:ppn=M" PBS instruction and allocates the
first M available cores inside the first free node.

Andy.

2017-08-09 23:55 GMT+02:00 A M <amm.p...@gmail.com>:
>
> Thanks!
>
> In fact there seems to be a problem with Maui's node allocation settings.
> I have checked the $PBS_NODEFILE contents (this may also be seen with
> "qstat -n1"): while the default Torque scheduler correctly allocates one
> slot on node1 and another slot on node2, with Maui I always see that it
> allocates two slots on one of the nodes. Will now take a closer look at
> the maui.cfg file. Apparently my allocation policy is not correct. Will
> dig into it further.
>
> Andy.
>
> 2017-08-09 21:49 GMT+02:00 r...@open-mpi.org <r...@open-mpi.org>:
>
>> Sounds to me like your Maui scheduler didn't provide any allocated slots
>> on the nodes - did you check $PBS_NODEFILE?
>>
>> > On Aug 9, 2017, at 12:41 PM, A M <amm.p...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > I have just run into a strange issue with "mpirun". Here is what
>> > happened:
>> >
>> > I successfully installed Torque 6.1.1.1 with the plain pbs_sched on a
>> > minimal set of 2 IB nodes. Then I added Open MPI 2.1.1 compiled with
>> > verbs and tm, and verified that mpirun works as it should with a small
>> > "pingpong" program.
>> >
>> > Here is the minimal Torque jobscript I used to check IB message
>> > passing:
>> >
>> > #!/bin/sh
>> > #PBS -o Out
>> > #PBS -e Err
>> > #PBS -l nodes=2:ppn=1
>> > cd $PBS_O_WORKDIR
>> > mpirun -np 2 -pernode ./pingpong 4000000
>> >
>> > The job correctly used IB as the default message-passing interface and
>> > resulted in 3.6 Gb/sec "pingpong" bandwidth, which is correct in my
>> > case, since the two batch nodes have QDR HCAs.
>> >
>> > I then stopped "pbs_sched" and started the Maui 3.3.1 scheduler
>> > instead. Serial jobs work without any problem, but the same jobscript
>> > now fails with the following message:
>> >
>> > --------
>> > Your job has requested more processes than the ppr for this topology
>> > can support:
>> >   App: /lustre/work/user/testus/pingpong
>> >   Number of procs: 2
>> >   PPR: 1:node
>> > Please revise the conflict and try again.
>> > --------
>> >
>> > I then tried to play with the --nooversubscribe and "--pernode 2"
>> > options, but the error persisted. It looks like the newest "mpirun"
>> > picks up some information from the running Maui scheduler: it is
>> > enough to go back to "pbs_sched", and everything works like a charm. I
>> > used the preexisting "maui.cfg" file, which still works well on an
>> > oldish CentOS 6 system with an old 1.8.5 version of Open MPI.
>> >
>> > Thanks ahead for any hint/comment on how to address this. Are there
>> > any other mpirun options to try? Should I downgrade Open MPI to the
>> > latest 1.X series?
>> >
>> > Andy.
>> >
>> > mpirun -np 2 -pernode --mca btl ^tcp ./pingpong 4000000
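For reference, the fix described above amounts to this one-line maui.cfg fragment (a sketch; the comment is mine, summarizing the EXACTPROC behaviour explained in the thread):

```
# Honor the job's "-l nodes=N:ppn=M" layout exactly, instead of the
# default EXACTPROC behaviour of packing the first M free cores onto
# the first free node.
JOBNODEMATCHPOLICY EXACTNODE
```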
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
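The intermediate debugging step in the thread - inspecting $PBS_NODEFILE to see how slots were laid out - can be scripted as a quick sanity check. A minimal sketch (the fallback filename and the node1/node1 example contents are hypothetical, so it can run outside a real PBS job):

```shell
#!/bin/sh
# Count slots and distinct nodes in the PBS nodefile: the file contains
# one hostname per allocated slot, so "-l nodes=2:ppn=1" should yield
# 2 slots spread over 2 distinct nodes.
if [ -z "$PBS_NODEFILE" ]; then
    # Outside a real job: fake the mispacked allocation Maui produced,
    # with both slots on the same node (hypothetical hostnames).
    PBS_NODEFILE=./example_nodefile
    printf 'node1\nnode1\n' > "$PBS_NODEFILE"
fi

slots=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
nodes=$(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')
echo "slots=$slots nodes=$nodes"
```

With JOBNODEMATCHPOLICY EXACTNODE in place, the same check inside the job reports two distinct nodes, matching what "qstat -n1" shows.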