No problem - glad you were able to work it out!

> On Oct 5, 2017, at 11:22 PM, Anthony Thyssen <a.thys...@griffith.edu.au> 
> wrote:
> 
> Sorry, r...@open-mpi.org; as Gilles Gouaillardet pointed out to me, the
> problem wasn't with OpenMPI but with the specific EPEL implementation
> (see Red Hat Bugzilla 1321154).
> 
> Today the server was able to be taken down for maintenance, and I wanted
> to try a few things.
> 
> After installing torque-4.2.10-11.el7 from the EPEL Testing repo, however,
> I found that all the nodes were 'down', even though everything appeared to
> be running, with no errors in the error logs.
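> 
> For anyone wanting to try the same build, it can be pulled in with
> something like the following (assuming EPEL is already configured on the
> machine, with the standard 'epel-testing' repo id):
> 
>     yum --enablerepo=epel-testing update torque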
> 
> After a lot of trial, error and research, I eventually (on a whim) decided
> to remove the "num_node_boards=1" entry from the "torque/server_priv/nodes"
> file and restart the server & scheduler.   Suddenly the nodes were "free"
> and my initial test job ran.
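> 
> For reference, the change was along these lines (a sketch only; the "np=2"
> count is illustrative, "dualcore" being the resource group on my cluster):
> 
>     # before: one NUMA node board declared per node (np count illustrative)
>     node21.emperor  np=2  dualcore  num_node_boards=1
>     # after: plain entry, no NUMA node boards
>     node21.emperor  np=2  dualcore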
> 
> Perhaps the EPEL Testing Torque 4.2.10-11 was not built with NUMA support?
> 
> All later tests (with OpenMPI - the RHEL SRPM 1.10.6-2 re-compiled
> "--with-tm") now respond to the Torque node allocation correctly and no
> longer simply run all the jobs on the first node.
> 
> That is, $PBS_NODEFILE, pbsdsh hostname, and mpirun hostname are all in
> agreement.
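> 
> A minimal batch script to check that agreement, assuming a standard
> Torque/OpenMPI setup:
> 
>     #!/bin/sh
>     # illustrative resource request:
>     #PBS -l nodes=5
>     # All three views of the allocation should report the same hosts.
>     echo "== PBS_NODEFILE =="; sort -u $PBS_NODEFILE
>     echo "== pbsdsh ==";       pbsdsh hostname | sort -u
>     echo "== mpirun ==";       mpirun hostname | sort -u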
> 
> Thank you all for your help, and for putting up with me.
> 
>   Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>  --------------------------------------------------------------------------
>   "Around here we've got a name for people what talks to dragons."
>   "Traitor?"  Wiz asked apprehensively.
>   "No.  Lunch."                     -- Rick Cook, "Wizardry Consulted"
>  --------------------------------------------------------------------------
> 
> 
> On Wed, Oct 4, 2017 at 11:43 AM, r...@open-mpi.org wrote:
> Can you try a newer version of OMPI, say the 3.0.0 release? Just curious to 
> know if we perhaps “fixed” something relevant.
> 
> 
>> On Oct 3, 2017, at 5:33 PM, Anthony Thyssen <a.thys...@griffith.edu.au> wrote:
>> 
>> FYI...
>> 
>> The problem is discussed further in 
>> 
>> Red Hat Bugzilla: Bug 1321154 - numa enabled torque don't work
>>    https://bugzilla.redhat.com/show_bug.cgi?id=1321154
>> 
>> I'd seen this previously, as it required me to add "num_node_boards=1" to
>> each node in /var/lib/torque/server_priv/nodes just to get Torque to work
>> at all.  Specifically, I worked around it by munging $PBS_NODEFILE (which
>> comes out correct) into a host list containing the correct "slots=" counts.
>> But of course, now that I have compiled OpenMPI using "--with-tm", that
>> should not have been needed, as such a host list is in fact ignored by
>> OpenMPI in a Torque-PBS environment.
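>> 
>> The munging itself was along these lines (a sketch; the hostfile and
>> program names are hypothetical):
>> 
>>     # collapse the one-line-per-slot $PBS_NODEFILE into an OpenMPI
>>     # hostfile with explicit "slots=" counts
>>     sort $PBS_NODEFILE | uniq -c |
>>         awk '{ printf "%s slots=%d\n", $2, $1 }' > my_hostfile
>>     mpirun --hostfile my_hostfile ./my_program   # names hypothetical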
>> 
>> However, it seems that ever since NUMA support went into the Torque RPMs,
>> it has caused the current problems, which are still continuing.   The last
>> action on the bug is a new EPEL 'testing' version (August 2017), which I
>> will try shortly.
>> 
>> Thank you for your help, though I am still open to suggestions for a
>> replacement.
>> 
>>   Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>  --------------------------------------------------------------------------
>>    Encryption... is a powerful defensive weapon for free people.
>>    It offers a technical guarantee of privacy, regardless of who is
>>    running the government... It's hard to think of a more powerful,
>>    less dangerous tool for liberty.            --  Esther Dyson
>>  --------------------------------------------------------------------------
>> 
>> 
>> 
>> On Wed, Oct 4, 2017 at 9:02 AM, Anthony Thyssen <a.thys...@griffith.edu.au> wrote:
>> Thank you, Gilles.  At least I now have something to follow through with.
>> 
>> As an FYI, the Torque in use is the pre-built version from the Red Hat
>> Extras (EPEL) archive:   torque-4.2.10-10.el7.x86_64
>> 
>> Normally pre-built packages have no problems, but not in this case.
>> 
>> On Tue, Oct 3, 2017 at 3:39 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>> Anthony,
>> 
>> 
>> we had a similar issue reported some time ago (e.g. Open MPI ignores
>> torque allocation),
>> 
>> and after quite some troubleshooting, we ended up with the same behavior 
>> (e.g. pbsdsh is not working as expected).
>> 
>> see https://www.mail-archive.com/users@lists.open-mpi.org/msg29952.html
>> for the last email.
>> 
>> 
>> from an Open MPI point of view, I would consider the root cause to be with
>> your torque install.
>> 
>> this case was reported at
>> http://www.clusterresources.com/pipermail/torqueusers/2016-September/018858.html
>> 
>> and no conclusion was reached.
>> 
>> 
>> Cheers,
>> 
>> 
>> Gilles
>> 
>> 
>> On 10/3/2017 2:02 PM, Anthony Thyssen wrote:
>> The stdout and stderr are saved to separate channels.
>> 
>> It is interesting that the output from pbsdsh is node21.emperor 5 times,
>> even though $PBS_NODEFILE lists the 5 individual nodes.
>> 
>> Attached are the two compressed files, as well as the pbs_hello batch used.
>> 
>> Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>  --------------------------------------------------------------------------
>>   There are two types of encryption:
>>     One that will prevent your sister from reading your diary, and
>>     One that will prevent your government.           -- Bruce Schneier
>>  --------------------------------------------------------------------------
>> 
>> On Tue, Oct 3, 2017 at 2:39 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>> 
>>     Anthony,
>> 
>> 
>>     in your script, can you
>> 
>> 
>>     set -x
>> 
>>     env
>> 
>>     pbsdsh hostname
>> 
>>     mpirun --display-map --display-allocation --mca ess_base_verbose 10 \
>>         --mca plm_base_verbose 10 --mca ras_base_verbose 10 hostname
>> 
>> 
>>     and then compress and send the output ?
>> 
>> 
>>     Cheers,
>> 
>> 
>>     Gilles
>> 
>> 
>>         I noticed that too.  Though the submitting host for torque is
>>         a different host (the main head node, "shrek"), "node21" is the
>>         host on which torque runs the batch script (and the mpirun
>>         command), it being the first node in the "dualcore" resource group.
>> 
>>         Adding option...
>> 
>>         It fixed the hostname in the allocation map, though it had no
>>         effect on the outcome.  The allocation is still simply ignored.
>> 
>>         =======8<--------CUT HERE----------
>>         PBS Job Number       9000
>>         PBS batch run on     node21.emperor
>>         Time it was started  2017-10-03_14:11:20
>>         Current Directory    /net/shrek.emperor/home/shrek/anthony
>>         Submitted work dir   /home/shrek/anthony/mpi-pbs
>>         Number of Nodes      5
>>         Nodefile List       /var/lib/torque/aux//9000.shrek.emperor
>>         node21.emperor
>>         node25.emperor
>>         node24.emperor
>>         node23.emperor
>>         node22.emperor
>>         ---------------------------------------
>> 
>>         ======================  ALLOCATED NODES  ======================
>>         node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>         node25.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>         node24.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>         node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>         node22.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>         =================================================================
>>         node21.emperor
>>         node21.emperor
>>         node21.emperor
>>         node21.emperor
>>         node21.emperor
>>         =======8<--------CUT HERE----------
>> 
>> 
>>           Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>          --------------------------------------------------------------------------
>>            The equivalent of an armoured car should always be used to
>>            protect any secret kept in a cardboard box.
>>            -- Anthony Thyssen, On the use of Encryption
>>          --------------------------------------------------------------------------
>> 
> 

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
