All solved, and everything now works well! The culprit was a missing line in the "maui.cfg" file:
JOBNODEMATCHPOLICY EXACTNODE

The default value for this variable is EXACTPROC and, in its presence, Maui
completely ignores the "-l nodes=N:ppn=M" PBS instruction and allocates the
first M available cores inside the first free node.

Andy.

2017-08-09 23:55 GMT+02:00 A M <amm.p...@gmail.com>:
>
> Thanks!
>
> In fact there seems to be a problem with Maui's node allocation settings.
> I have checked the $PBS_NODEFILE contents (this may also be seen with
> "qstat -n1"): while the default Torque scheduler correctly allocates one
> slot on node1 and another slot on node2, with Maui I always see that it
> allocates two slots on one of the nodes. Will now take a closer look at
> the maui.cfg file. Apparently my allocation policy is not correct. Will
> dig into it further.
>
> Andy.
>
> 2017-08-09 21:49 GMT+02:00 r...@open-mpi.org <r...@open-mpi.org>:
>
>> Sounds to me like your Maui scheduler didn't provide any allocated slots
>> on the nodes - did you check $PBS_NODEFILE?
>>
>> > On Aug 9, 2017, at 12:41 PM, A M <amm.p...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > I have just run into a strange issue with "mpirun". Here is what
>> > happened:
>> >
>> > I successfully installed Torque 6.1.1.1 with the plain pbs_sched on a
>> > minimal set of 2 IB nodes. Then I added Open MPI 2.1.1 compiled with
>> > verbs and tm, and verified that mpirun works as it should with a small
>> > "pingpong" program.
>> >
>> > Here is the minimal Torque jobscript I used to check IB message
>> > passing:
>> >
>> > #!/bin/sh
>> > #PBS -o Out
>> > #PBS -e Err
>> > #PBS -l nodes=2:ppn=1
>> > cd $PBS_O_WORKDIR
>> > mpirun -np 2 -pernode ./pingpong 4000000
>> >
>> > The job correctly used IB as the default message-passing interface and
>> > resulted in 3.6 Gb/sec "pingpong" bandwidth, which is correct in my
>> > case, since the two batch nodes have QDR HCAs.
>> >
>> > I then stopped "pbs_sched" and started the Maui 3.3.1 scheduler
>> > instead. Serial jobs work without any problem, but the same jobscript
>> > now fails with the following message:
>> >
>> > --------
>> > Your job has requested more processes than the ppr for this topology
>> > can support:
>> >   App: /lustre/work/user/testus/pingpong
>> >   Number of procs: 2
>> >   PPR: 1:node
>> > Please revise the conflict and try again.
>> > --------
>> >
>> > I then tried to play with the --nooversubscribe and "--pernode 2"
>> > options, but the error persisted. It looks like the newest "mpirun"
>> > picks up some information from the running Maui scheduler: it is
>> > enough to go back to "pbs_sched", and everything works like a charm. I
>> > used the preexisting "maui.cfg" file, which still works well on an
>> > oldish CentOS 6 system with an old 1.8.5 version of Open MPI.
>> >
>> > Thanks ahead for any hint/comment on how to address this. Are there
>> > any other mpirun options to try? Should I downgrade Open MPI to the
>> > latest 1.X series?
>> >
>> > Andy.
>> >
>> > mpirun -np 2 -pernode --mca btl ^tcp ./pingpong 4000000
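For reference, the fix described above amounts to this one-line maui.cfg fragment (a sketch; the comment is mine, summarizing the EXACTPROC behaviour explained in the thread):

```
# Honor the job's "-l nodes=N:ppn=M" layout exactly, instead of the
# default EXACTPROC behaviour of packing the first M free cores onto
# the first free node.
JOBNODEMATCHPOLICY EXACTNODE
```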
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
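The intermediate debugging step in the thread - inspecting $PBS_NODEFILE to see how slots were laid out - can be scripted as a quick sanity check. A minimal sketch (the fallback filename and the node1/node1 example contents are hypothetical, so it can run outside a real PBS job):

```shell
#!/bin/sh
# Count slots and distinct nodes in the PBS nodefile: the file contains
# one hostname per allocated slot, so "-l nodes=2:ppn=1" should yield
# 2 slots spread over 2 distinct nodes.
if [ -z "$PBS_NODEFILE" ]; then
    # Outside a real job: fake the mispacked allocation Maui produced,
    # with both slots on the same node (hypothetical hostnames).
    PBS_NODEFILE=./example_nodefile
    printf 'node1\nnode1\n' > "$PBS_NODEFILE"
fi

slots=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
nodes=$(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')
echo "slots=$slots nodes=$nodes"
```

With JOBNODEMATCHPOLICY EXACTNODE in place, the same check inside the job reports two distinct nodes, matching what "qstat -n1" shows.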