Anthony,

in your script, can you run

   set -x
   env
   pbsdsh hostname
   mpirun --display-map --display-allocation \
          --mca ess_base_verbose 10 \
          --mca plm_base_verbose 10 \
          --mca ras_base_verbose 10 hostname

and then compress and send the output?
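
To be clear, those lines would sit in your "pbs_hello" batch script
something like this (a sketch only - the #PBS line just mirrors the
qsub options you posted):

   #!/bin/sh
   #PBS -l nodes=5:ppn=1:dualcore
   set -x
   env
   pbsdsh hostname
   mpirun ...       # the full mpirun line above, verbatim

and once the job finishes, something like

   gzip pbs_hello.o*

will do (the exact stdout file name depends on the job id Torque
assigns).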


Cheers,


Gilles

On 10/3/2017 1:19 PM, Anthony Thyssen wrote:
I noticed that too.  Though the submitting host for torque is a different host (the main head node, "shrek"), "node21" is the host on which torque runs the batch script (and thus the mpirun command), it being the first node in the "dualcore" resource group.

Adding the suggested "--mca orte_keep_fqdn_hostnames 1" option fixed the hostname in the allocation map, though it had no effect on the outcome.  The allocation is still simply ignored.
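
For reference, the mpirun line in "pbs_hello" now reads:

   mpirun --nooversubscribe --display-allocation \
          --mca orte_keep_fqdn_hostnames 1 hostname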

=======8<--------CUT HERE----------
PBS Job Number       9000
PBS batch run on     node21.emperor
Time it was started  2017-10-03_14:11:20
Current Directory    /net/shrek.emperor/home/shrek/anthony
Submitted work dir   /home/shrek/anthony/mpi-pbs
Number of Nodes      5
Nodefile List       /var/lib/torque/aux//9000.shrek.emperor
node21.emperor
node25.emperor
node24.emperor
node23.emperor
node22.emperor
---------------------------------------

======================  ALLOCATED NODES   ======================
node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
node25.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
node24.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
node22.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================
node21.emperor
node21.emperor
node21.emperor
node21.emperor
node21.emperor
=======8<--------CUT HERE----------


  Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
 --------------------------------------------------------------------------
   The equivalent of an armoured car should always be used to
   protect any secret kept in a cardboard box.
   -- Anthony Thyssen, On the use of Encryption
 --------------------------------------------------------------------------


On Tue, Oct 3, 2017 at 2:00 PM, r...@open-mpi.org wrote:

    One thing I can see is that the local host (where mpirun executed)
    shows as “node21” in the allocation, while all others show their
    FQDN. This might be causing some confusion.

    You might try adding "--mca orte_keep_fqdn_hostnames 1” to your
    cmd line and see if that helps.


    On Oct 2, 2017, at 8:14 PM, Anthony Thyssen
    <a.thys...@griffith.edu.au> wrote:

    Update...  The problem of all processes running on the first node
    (oversubscribing a dual-core machine) is NOT resolved.

    Changing the mpirun command in the Torque batch script
    ("pbs_hello" - see previous) to

       mpirun --nooversubscribe --display-allocation hostname

    and then submitting to PBS/Torque using

       qsub -l nodes=5:ppn=1:dualcore pbs_hello

    to run on 5 dual-core machines produces the following result...

    =======8<--------CUT HERE----------
    PBS Job Number       8996
    PBS batch run on  node21.emperor
    Time it was started 2017-10-03_13:04:07
    Current Directory /net/shrek.emperor/home/shrek/anthony
    Submitted work dir  /home/shrek/anthony/mpi-pbs
    Number of Nodes      5
    Nodefile List /var/lib/torque/aux//8996.shrek.emperor
    node21.emperor
    node25.emperor
    node24.emperor
    node23.emperor
    node22.emperor
    ---------------------------------------

    ======================  ALLOCATED NODES  ======================
            node21: slots=1 max_slots=0 slots_inuse=0 state=UP
            node25.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
            node24.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
            node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
            node22.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
    =================================================================
    node21.emperor
    node21.emperor
    node21.emperor
    node21.emperor
    node21.emperor
    =======8<--------CUT HERE----------

    The $PBS_NODEFILE shows torque requesting 5 processes on 5
    separate machines.
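
    (For the record, a quick way to double-check the nodefile:

       sort $PBS_NODEFILE | uniq -c

    which here lists each of the five nodes exactly once.)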

    The mpirun command's "ALLOCATED NODES" shows it picked up the
    request correctly from torque.

    But the "hostname" output still shows ALL processes were run on
    the first node only!

    Even though I requested it not to over subscribe.


    I am at a complete loss as to how to solve this problem.

    ANY and all suggestions, or even ways I can get other information
    as to what is causing this, will be most welcome.
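
    (One thing I can still check myself - e.g. whether this Open MPI
    was built with Torque/TM launch support:

       ompi_info | grep tm

    which should list the "tm" plm and ras components if it was.)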


      Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
     --------------------------------------------------------------------------
       Using encryption on the Internet is the equivalent of arranging
       an armored car to deliver credit-card information from someone
       living in a cardboard box to someone living on a park bench.
                           -- Gene Spafford
     --------------------------------------------------------------------------






_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
