No problem - glad you were able to work it out!
> On Oct 5, 2017, at 11:22 PM, Anthony Thyssen <a.thys...@griffith.edu.au> wrote:
>
> Sorry r...@open-mpi.org, as Gilles Gouaillardet pointed out to me, the
> problem wasn't OpenMPI, but the specific EPEL implementation (see
> Redhat Bugzilla 1321154).
>
> Today the server was able to be taken down for maintenance, and I wanted
> to try a few things.
>
> After installing torque-4.2.10-11.el7 from the EPEL Testing Repo, however,
> I found that all the nodes were 'down' even though everything appeared to
> be running, with no errors in the error logs.
>
> After a lot of trial, error and research, I eventually (on a whim) decided
> to remove the "num_node_boards=1" entry from the "torque/server_priv/nodes"
> file and restart the server & scheduler.  Suddenly the nodes were "free" and
> my initial test job ran.
>
> Perhaps the EPEL-Test Torque 4.2.10-11 does not contain NUMA support?
>
> ALL later tests (with OpenMPI - RHEL SRPM 1.10.6-2 re-compiled "--with-tm")
> now respond to the Torque node allocation correctly, and no longer simply
> run all the jobs on the first node.
>
> That is, $PBS_NODEFILE, pbsdsh hostname, and mpirun hostname are
> all in agreement.
>
> Thank you all for your help, and for putting up with me.
>
>   Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>  --------------------------------------------------------------------------
>    "Around here we've got a name for people what talks to dragons."
>    "Traitor?"  Wiz asked apprehensively.
>    "No.  Lunch."                     -- Rick Cook, "Wizardry Consulted"
>  --------------------------------------------------------------------------
>
> On Wed, Oct 4, 2017 at 11:43 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> Can you try a newer version of OMPI, say the 3.0.0 release? Just curious to
> know if we perhaps "fixed" something relevant.
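[Editor's note: the workaround Anthony describes above — stripping the
"num_node_boards=1" attribute from the nodes file and restarting the
server and scheduler — might be sketched as below.  This is not his
actual session; the `strip_numa` helper name is invented here, and the
path and service names are the EPEL defaults, which may differ.]

```shell
# Strip the num_node_boards attribute from a Torque nodes-file entry.
# Shown on a sample line so the transformation is visible:
strip_numa() { sed 's/ *num_node_boards=[0-9]*//'; }

echo 'node21.emperor np=2 num_node_boards=1' | strip_numa
# -> node21.emperor np=2

# Applied in place, followed by a restart so the change is picked up:
#   sed -i 's/ *num_node_boards=[0-9]*//' /var/lib/torque/server_priv/nodes
#   systemctl restart pbs_server pbs_sched
#   pbsnodes -l free    # nodes should now report "free", not "down"
```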
>> On Oct 3, 2017, at 5:33 PM, Anthony Thyssen <a.thys...@griffith.edu.au> wrote:
>>
>> FYI...
>>
>> The problem is discussed further in
>>
>>   Redhat Bugzilla: Bug 1321154 - numa enabled torque don't work
>>   https://bugzilla.redhat.com/show_bug.cgi?id=1321154
>>
>> I'd seen this previously, as it required me to add "num_node_boards=1" to
>> each node in /var/lib/torque/server_priv/nodes to get torque to at least
>> work.  Specifically, by munging the $PBS_NODEFILE (which comes out correct)
>> into a host list containing the correct "slots=" counts.  But of course, now
>> that I have compiled OpenMPI using "--with-tm", that should not have been
>> needed, as the host list is now ignored by OpenMPI in a Torque-PBS
>> environment.
>>
>> However, it seems that ever since "NUMA" support was added to the Torque
>> RPMs, it has caused the current problems, which are still continuing.  The
>> latest development is a new EPEL "test" version (August 2017), which I will
>> try shortly.
>>
>> Thank you for your help, though I am still open to suggestions for a
>> replacement.
>>
>>   Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>  --------------------------------------------------------------------------
>>   Encryption... is a powerful defensive weapon for free people.
>>   It offers a technical guarantee of privacy, regardless of who is
>>   running the government... It's hard to think of a more powerful,
>>   less dangerous tool for liberty.              -- Esther Dyson
>>  --------------------------------------------------------------------------
>>
>> On Wed, Oct 4, 2017 at 9:02 AM, Anthony Thyssen <a.thys...@griffith.edu.au> wrote:
>> Thank you Gilles.  At least I now have something to follow through with.
>>
>> As a FYI, the torque is the pre-built version from the Redhat Extras (EPEL)
>> archive.
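[Editor's note: the "munging" mentioned above — turning the repeated
host lines of $PBS_NODEFILE into per-host "slots=" counts for an
OpenMPI hostfile — might look like the sketch below.  This is not the
author's actual script; the `nodefile_to_hostfile` name is invented,
and the sample input stands in for a real nodefile.]

```shell
# Collapse repeated host lines into "host slots=N" hostfile entries:
nodefile_to_hostfile() {
    sort "$1" | uniq -c | awk '{ printf "%s slots=%d\n", $2, $1 }'
}

printf 'node21\nnode21\nnode22\n' > sample_nodefile
nodefile_to_hostfile sample_nodefile
# -> node21 slots=2
#    node22 slots=1
```

As the email notes, with an OpenMPI built "--with-tm" this step is
unnecessary: mpirun takes the allocation directly from Torque.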
>>   torque-4.2.10-10.el7.x86_64
>>
>> Normally pre-built packages have no problems, but not in this case.
>>
>> On Tue, Oct 3, 2017 at 3:39 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>> Anthony,
>>
>> we had a similar issue reported some time ago (e.g. Open MPI ignores torque
>> allocation), and after quite some troubleshooting, we ended up with the same
>> behavior (e.g. pbsdsh is not working as expected).
>>
>> see https://www.mail-archive.com/users@lists.open-mpi.org/msg29952.html
>> for the last email.
>>
>> from an Open MPI point of view, i would consider the root cause is with your
>> torque install.
>>
>> this case was reported at
>> http://www.clusterresources.com/pipermail/torqueusers/2016-September/018858.html
>> and no conclusion was reached.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 10/3/2017 2:02 PM, Anthony Thyssen wrote:
>> The stdout and stderr are saved to separate channels.
>>
>> It is interesting that the output from pbsdsh is node21.emperor 5 times,
>> even though $PBS_NODEFILE lists the 5 individual nodes.
>>
>> Attached are the two compressed files, as well as the pbs_hello batch
>> script used.
>>
>>   Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>  --------------------------------------------------------------------------
>>   There are two types of encryption:
>>     One that will prevent your sister from reading your diary, and
>>     One that will prevent your government.
>>                                           -- Bruce Schneier
>>  --------------------------------------------------------------------------
>>
>> On Tue, Oct 3, 2017 at 2:39 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>
>> Anthony,
>>
>> in your script, can you
>>
>>   set -x
>>
>>   env
>>
>>   pbsdsh hostname
>>
>>   mpirun --display-map --display-allocation --mca ess_base_verbose 10 \
>>          --mca plm_base_verbose 10 --mca ras_base_verbose 10 hostname
>>
>> and then compress and send the output ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On 10/3/2017 1:19 PM, Anthony Thyssen wrote:
>>
>> I noticed that too.  Though the submitting host for torque is a different
>> host (main head node, "shrek"), "node21" is the host on which torque runs
>> the batch script (and the mpirun command), it being the first node in the
>> "dualcore" resource group.
>>
>> Adding option...
>>
>> It fixed the hostname in the allocation map, though it had no effect on
>> the outcome.  The allocation is still simply ignored.
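[Editor's note: the diagnostics Gilles requests above could be gathered
with a batch script along these lines.  This is a hypothetical sketch,
not the pbs_hello script from the thread; the job name and node count
are illustrative.  It is written to a file and only syntax-checked
here, since it can actually run only inside a Torque job.]

```shell
# Write a PBS batch script that collects the requested diagnostics:
cat > pbs_debug.sh <<'EOF'
#!/bin/sh
#PBS -N mpi_debug
#PBS -l nodes=5
set -x                  # echo every command as it runs
env                     # dump the PBS_* environment
pbsdsh hostname         # what Torque's own launcher does with the allocation
mpirun --display-map --display-allocation \
       --mca ess_base_verbose 10 --mca plm_base_verbose 10 \
       --mca ras_base_verbose 10 hostname
EOF
sh -n pbs_debug.sh && echo 'pbs_debug.sh parses'
```

Submitted with `qsub pbs_debug.sh`, the output lands in the job's .o/.e
files, which can then be compressed and posted to the list.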
>> =======8<--------CUT HERE----------
>> PBS Job Number       9000
>> PBS batch run on     node21.emperor
>> Time it was started  2017-10-03_14:11:20
>> Current Directory    /net/shrek.emperor/home/shrek/anthony
>> Submitted work dir   /home/shrek/anthony/mpi-pbs
>> Number of Nodes      5
>> Nodefile List        /var/lib/torque/aux//9000.shrek.emperor
>> node21.emperor
>> node25.emperor
>> node24.emperor
>> node23.emperor
>> node22.emperor
>> ---------------------------------------
>>
>> ======================   ALLOCATED NODES   ======================
>>         node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>         node25.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>         node24.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>         node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>         node22.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>> =================================================================
>> node21.emperor
>> node21.emperor
>> node21.emperor
>> node21.emperor
>> node21.emperor
>> =======8<--------CUT HERE----------
>>
>>   Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>  --------------------------------------------------------------------------
>>   The equivalent of an armoured car should always be used to
>>   protect any secret kept in a cardboard box.
>>        -- Anthony Thyssen, On the use of Encryption
>>  --------------------------------------------------------------------------
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users