Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause
Hi, okay, let's reboot, even though Gilles' last mail was onto something. The problem is that I failed to start programs with mpirun when more than one node was involved. I mentioned that it is likely some configuration problem with my server, especially authentication (we have some Kerberos ni

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread r...@open-mpi.org
I’m pruning this email thread so I can actually read the blasted thing :-) Guys: you are off in the wilderness chasing ghosts! Please stop. When I say that Torque uses an “ordered” file, I am _not_ saying that all the host entries of the same name have to be listed consecutively. I am saying tha

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet
Oswin, One more thing: can you run pbsdsh -v hostname before invoking mpirun? Hopefully this should print the three hostnames. Then you can run ldd `which pbsdsh` and see which libtorque.so is linked with it. Cheers, Gilles Oswin Krause wrote: >Hi Gilles, > >There you go: > >[zbh251@a00551 ~]$ cat $

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet
Oswin, So it seems that Open MPI thinks it tm_spawn'ed orted on the remote nodes, but orted ends up running on the same node as mpirun. On your compute nodes, can you run ldd /.../lib/openmpi/mca_plm_tm.so and confirm it is linked with the same libtorque.so that was built/provided with Torque? Chec

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause
Hi Gilles, There you go: [zbh251@a00551 ~]$ cat $PBS_NODEFILE a00551.science.domain a00554.science.domain a00553.science.domain [zbh251@a00551 ~]$ mpirun --mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca ras_base_verbose 10 hostname [a00551.science.domain:18889] mca: base: components_re

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet
Oswin, can you please run again (one task per physical node) with mpirun --mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca ras_base_verbose 10 hostname Cheers, Gilles On 9/8/2016 6:42 PM, Oswin Krause wrote: Hi, i reconfigured to only have one physical node. Still no success,

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause
Hi, i reconfigured to only have one physical node. Still no success, but the nodefile now looks better. I still get the errors: [a00551.science.domain:18021] [[34768,0],1] bind() failed on error Address already in use (98) [a00551.science.domain:18021] [[34768,0],1] ORTE_ERROR_LOG: Error in

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause
Hi Gilles, Hi Ralph, I have just rebuilt Open MPI. There is quite a lot more information now. As I said, I did not tinker with the PBS_NODEFILE. I think the issue might be NUMA here. I can try to go through the process and reconfigure to non-NUMA and see whether this works. The issue might be that the no

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet
Oswin, that might be off topic and/or premature ... PBS Pro has been made free (and open source too) and is available at http://www.pbspro.org/. This is something you might be interested in (unless you are using Torque because of the MOAB scheduler), and it might be more friendly (e.g. alwa

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet
Ralph, I am not sure I am reading you correctly, so let me clarify. I did not hack $PBS_NODEFILE for fun nor profit; I was simply trying to reproduce an issue I could not reproduce otherwise. /* my job submitted with -l nodes=3:ppn=1 does not start if there are only two nodes available, wher

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause
Hi, Thanks for all the hints. The only issue is: this is the file generated by Torque. Torque - or at least the Torque 4.2 provided by my Red Hat version - gives me an unordered file. Should I rebuild Torque? Best, Oswin I am currently rebuilding the package with --enable-debug. On 2016-09-08 09

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread r...@open-mpi.org
If you are correctly analyzing things, then there would be an issue in the code. When we get an allocation from a resource manager, we set a flag indicating that it is “gospel” - i.e., that we do not directly sense the number of cores on a node and set the #slots equal to that value. Instead, we

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread r...@open-mpi.org
Someone has done some work there since I last did, but I can see the issue. Torque indeed always provides an ordered file - the only way you can get an unordered one is for someone to edit it, and that is forbidden - i.e., you get what you deserve because you are messing around with a system-def

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Gilles Gouaillardet
Oswin, unfortunately some important info is missing. I guess the root cause is that Open MPI was not configure'd with --enable-debug. Could you please update your Torque script and simply add the following snippet before invoking mpirun: echo PBS_NODEFILE; cat $PBS_NODEFILE; echo --- as I wrote
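
For readers following along, the requested snippet can be laid out as a few lines in the job script. This is only a sketch: the function name dump_nodefile and the fallback message are illustrative additions, not part of the original mail, and the guard assumes PBS_NODEFILE is unset outside a Torque job.

```shell
# Print the nodefile Torque generated for this job, between two markers.
# dump_nodefile is a hypothetical name; the mail only asked for echo/cat/echo.
dump_nodefile() {
    echo PBS_NODEFILE
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        cat "$PBS_NODEFILE"
    else
        # Assumption: outside a Torque job the variable is unset, so say so.
        echo "(PBS_NODEFILE is not set or not readable)"
    fi
    echo ---
}

dump_nodefile
# The mpirun invocation would follow here in the job script.
```

Placing the call immediately before mpirun means the job output shows exactly the host list that mpirun was handed.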

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Gilles Gouaillardet
Ralph, there might be an issue within Open MPI. On the cluster I used, hostname returns the FQDN, and $PBS_NODEFILE uses the FQDN too. My $PBS_NODEFILE has one line per task, and it is ordered, e.g. n0.cluster n0.cluster n1.cluster n1.cluster. In my Torque script, I rewrote the machine

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Oswin Krause
Hi, You are right. Yes the library is there and is linking to libtorque.so. Sorry for the confusion. Is there any other information I can provide? I am seriously new to all of this. Best, Oswin On 2016-09-07 17:16, r...@open-mpi.org wrote: You aren’t looking in the right place - there is

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread r...@open-mpi.org
You aren’t looking in the right place - there is an “openmpi” directory underneath that one, and the mca_xxx libraries are down there > On Sep 7, 2016, at 7:43 AM, Oswin Krause > wrote: > > Hi Gilles, > > I do not have this library. Maybe this helps already... > > libmca_common_sm.so libmpi

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Jeff Squyres (jsquyres)
You can also run: ompi_info | grep 'plm: tm' (note the quotes, because you need to include the space) If you see a line listing the TM PLM plugin, then you have Torque / PBS support built in to Open MPI. If you don't, then you don't. :-) > On Sep 7, 2016, at 11:01 AM, Gilles Gouaillardet >
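
The check above can be wrapped so that a job script reports the result explicitly. A minimal sketch, assuming ompi_info is on the PATH; the function name has_tm_plm and the messages are made up for illustration.

```shell
# Hypothetical wrapper: reads `ompi_info` output on stdin and succeeds
# if a line listing the TM PLM component is present.
has_tm_plm() {
    # The quotes matter: the pattern includes the space after "plm:".
    grep -q 'plm: tm'
}

# Only run the check where ompi_info is actually installed.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_tm_plm; then
        echo "tm PLM found: Torque/PBS support is built in"
    else
        echo "no tm PLM: Open MPI was built without Torque/PBS support"
    fi
fi
```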

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Gilles Gouaillardet
I will double check the name. If you did not configure with --disable-dlopen, then mpirun only links with opal and orte. At run time, these libs will dlopen the plugins (from the openmpi sub directory, they are named mca_abc_xyz.so). If you have support for tm, then one of the plugins will be linke

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Oswin Krause
Hi Gilles, I do not have this library. Maybe this helps already... libmca_common_sm.so libmpi_mpifh.so libmpi_usempif08.so libompitrace.so libopen-rte.so libmpi_cxx.so libmpi.so libmpi_usempi_ignore_tkr.so libopen-pal.so liboshmem.so and mpirun does only link to

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Oswin Krause
Hi, Thanks for looking into it. Also thanks to rhc. I tried to be very consistent with the naming after being asked to do so by our IT department. [zbh251@a00551 ~]$ hostname a00551.science.domain [zbh251@a00551 ~]$ hostname -f a00551.science.domain this is afair the same name as given in th

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Gilles Gouaillardet
Note the torque library will only show up if you configure'd with --disable-dlopen. Otherwise, you can ldd /.../lib/openmpi/mca_plm_tm.so Cheers, Gilles Bennet Fauber wrote: >Oswin, > >Does the torque library show up if you run > >$ ldd mpirun > >That would indicate that Torque support is comp

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread r...@open-mpi.org
The usual cause of this problem is that the nodename in the machinefile is given as a00551, while Torque is assigning the node name as a00551.science.domain. Thus, mpirun thinks those are two separate nodes and winds up spawning an orted on its own node. You might try ensuring that your machine
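
The name mismatch described here (a00551 in the machinefile versus a00551.science.domain from Torque) can be spotted by comparing the two files line by line. A rough sketch, assuming both files list one hostname per line with the name in the first column; names_not_in_nodefile is a hypothetical helper, not an Open MPI or Torque tool.

```shell
# Hypothetical helper: print machinefile entries that do not appear
# verbatim in the Torque nodefile (e.g. "a00551" vs "a00551.science.domain").
names_not_in_nodefile() {
    machinefile=$1
    nodefile=$2
    # Keep only the first column (drops optional "slots=N" style fields),
    # then print names that have no exact whole-line match in the nodefile.
    awk '{ print $1 }' "$machinefile" | sort -u | grep -v -x -F -f "$nodefile"
}

# Example call inside a job (the machinefile path is illustrative):
#   names_not_in_nodefile ./machinefile "$PBS_NODEFILE"
```

Any name printed is one that mpirun would treat as a different node from the one Torque allocated; an empty result means the two files agree on naming.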

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Gilles Gouaillardet
Thanks for the logs. From what I see now, it looks like a00551 is running both mpirun and orted, though it should only run mpirun, and orted should run only on a00553. I will check the code and see what could be happening here. Btw, what is the output of hostname and hostname -f on a00551? Out of

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Bennet Fauber
Oswin, Does the torque library show up if you run $ ldd mpirun That would indicate that Torque support is compiled in. Also, what happens if you use the same hostfile, or some hostfile as an explicit argument when you run mpirun from within the torque job? -- bennet On Wed, Sep 7, 2016 at

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Oswin Krause
Hi, Sorry, I forgot: The node allocation seems to be correct as the nodes are NUMA. The node allocation in torque is a00551.science.domain-0 a00551.science.domain-1 a00553.science.domain-0 On 2016-09-07 14:41, Gilles Gouaillardet wrote: Hi, Which version of Open MPI are you running ? I not

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Oswin Krause
Hi Gilles, Thanks for the hint with the machinefile. I know it is not equivalent and i do not intend to use that approach. I just wanted to know whether I could start the program successfully at all. Outside torque(4.2), rsh seems to be used which works fine, querying a password if no kerber

Re: [OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Gilles Gouaillardet
Hi, Which version of Open MPI are you running? I noted that though you are asking for three nodes and one task per node, you have been allocated 2 nodes only. I do not know if this is related to this issue. Note if you use the machinefile, a00551 has two slots (since it appears twice in the machi
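
The slot count mentioned here follows from how often a host appears in the file, which is easy to tally. A minimal sketch, assuming a plain file with one hostname per line; the function name slots_per_host is illustrative only.

```shell
# Hypothetical helper: count how often each host appears in a machinefile
# that lists one hostname per line (one line per slot).
slots_per_host() {
    # sort groups identical names; uniq -c prefixes each name with its count;
    # awk swaps the columns so the output reads "host count".
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Example usage inside a Torque job, guarded so it is a no-op elsewhere.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    slots_per_host "$PBS_NODEFILE"
fi
```

A host listed twice is reported with a count of 2, which matches the two slots described above.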

[OMPI users] Unable to mpirun from within torque

2016-09-07 Thread Oswin Krause
Hi, I am currently trying to set up Open MPI in Torque. Open MPI is built with tm support. Torque is correctly assigning nodes and I can run MPI programs on single nodes just fine. The problem starts when processes are split between nodes. For example, I create an interactive session with torq