Re: [OMPI users] OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause
Hi, okay, let's reboot, even though Gilles' last mail was onto something. The problem is that I failed to start programs with mpirun when more than one node was involved. I mentioned that it is likely some configuration problem with my server, especially authentication (we have some Kerberos ni

Re: [OMPI users] OMPI users] Unable to mpirun from within torque

2016-09-08 Thread r...@open-mpi.org
I’m pruning this email thread so I can actually read the blasted thing :-) Guys: you are off in the wilderness chasing ghosts! Please stop. When I say that Torque uses an “ordered” file, I am _not_ saying that all the host entries of the same name have to be listed consecutively. I am saying tha

Re: [OMPI users] OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet
Oswin, One more thing: can you run pbsdsh -v hostname before invoking mpirun? Hopefully this prints the three hostnames. Then you can run ldd `which pbsdsh` and see which libtorque.so it is linked with. Cheers, Gilles Oswin Krause wrote: >Hi Gilles, > >There you go: > >[zbh251@a00551 ~]$ cat $

Re: [OMPI users] OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet
Oswin, So it seems that Open MPI thinks it tm_spawns orted on the remote nodes, but orted ends up running on the same node as mpirun. On your compute nodes, can you run ldd /.../lib/openmpi/mca_plm_tm.so and confirm it is linked with the same libtorque.so that was built/provided with torque? Chec

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause
Hi Gilles, There you go: [zbh251@a00551 ~]$ cat $PBS_NODEFILE a00551.science.domain a00554.science.domain a00553.science.domain [zbh251@a00551 ~]$ mpirun --mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca ras_base_verbose 10 hostname [a00551.science.domain:18889] mca: base: components_re
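When comparing node files like the one quoted above, it can help to summarise how many entries each host has. The following is only a minimal sketch, assuming the usual Torque convention that each line of $PBS_NODEFILE names one allocated slot; the function name `slots_per_host` is hypothetical and not part of Open MPI or Torque.

```python
import os
from collections import Counter


def slots_per_host(nodefile_text):
    """Count how many lines each host has, keeping first-seen order.

    Assumption: one line per allocated slot, as Torque usually writes it.
    """
    counts = Counter()
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host:  # skip blank lines
            counts[host] += 1
    return dict(counts)


if __name__ == "__main__":
    path = os.environ.get("PBS_NODEFILE")
    if path:
        with open(path) as fh:
            for host, n in slots_per_host(fh.read()).items():
                print(f"{host}: {n} slot(s)")
    else:
        print("PBS_NODEFILE is not set; run this inside a Torque job")
```

For the three-line file in the message above, this would report one slot for each of the three hosts, which is a quick way to check that the allocation matches what was requested before looking at the mpirun verbose output.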

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet
Oswin, can you please run again (one task per physical node) with mpirun --mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca ras_base_verbose 10 hostname Cheers, Gilles On 9/8/2016 6:42 PM, Oswin Krause wrote: Hi, i reconfigured to only have one physical node. Still no success,

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause
Hi, i reconfigured to only have one physical node. Still no success, but the nodefile now looks better. I still get the errors: [a00551.science.domain:18021] [[34768,0],1] bind() failed on error Address already in use (98) [a00551.science.domain:18021] [[34768,0],1] ORTE_ERROR_LOG: Error in

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause
Hi Gilles, Hi Ralph, I have just rebuilt openmpi. Quite a lot more information. As I said, I did not tinker with the PBS_NODEFILE. I think the issue might be NUMA here. I can try to go through the process and reconfigure to non-NUMA and see whether this works. The issue might be that the no

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet
Oswin, that might be off topic and/or premature ... PBS Pro has been made free (and open source too) and is available at http://www.pbspro.org/ This is something you might be interested in (unless you are using torque because of the MOAB scheduler), and it might be more friendly (e.g. alwa

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet
Ralph, I am not sure I am reading you correctly, so let me clarify. I did not hack $PBS_NODEFILE for fun nor profit; I was simply trying to reproduce an issue I could not reproduce otherwise. /* my job submitted with -l nodes=3:ppn=1 does not start if there are only two nodes available, wher

Re: [OMPI users] OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause
Hi, Thanks for all the hints. Only issue is: this is the file generated by torque. Torque - or at least the torque 4.2 provided by my redhat version - gives me an unordered file. Should I rebuild torque? Best, Oswin I am currently rebuilding the package with --enable-debug. On 2016-09-08 09

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread r...@open-mpi.org
If you are correctly analyzing things, then there would be an issue in the code. When we get an allocation from a resource manager, we set a flag indicating that it is “gospel” - i.e., that we do not directly sense the number of cores on a node and set the #slots equal to that value. Instead, we

Re: [OMPI users] OMPI users] Unable to mpirun from within torque

2016-09-08 Thread r...@open-mpi.org
Someone has done some work there since I last did, but I can see the issue. Torque indeed always provides an ordered file - the only way you can get an unordered one is for someone to edit it, and that is forbidden - i.e., you get what you deserve because you are messing around with a system-def