Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Yeah, I'm surprised by that - they used to build --with-tm as it only activates if/when it finds itself in the appropriate environment. No harm in building the support, so the distros always built with all the RM components. No idea why this happened - you might mention it to them as I suspect

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Crni Gorac via users
Indeed, I realized in the meantime that changing the hostfile to:

    node1 slots=1
    node2 slots=1

works as I expected. Thanks once again for the clarification, got it now. I'll see if we can live this way (the job submission scripts are mostly automatically
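For automatically generated job scripts, a hostfile with explicit slot counts could be derived from the PBS node file instead of being written by hand. The following is only a sketch: the function name `hostfile_from_nodefile` and the output file name are hypothetical, and it assumes the node file lists each host once per requested MPI process.

```shell
# Print one "host slots=N" line per host, preserving first-seen order.
# Reads the file(s) given as arguments, or stdin if none are given.
hostfile_from_nodefile() {
    awk 'NF {
            # Remember the order in which hosts first appear.
            if (!($1 in count)) order[++n] = $1
            count[$1]++
         }
         END {
            for (i = 1; i <= n; i++)
                printf "%s slots=%d\n", order[i], count[order[i]]
         }' "$@"
}
```

Inside a job script this might be used as `hostfile_from_nodefile "$PBS_NODEFILE" > hosts.txt`, followed by `mpirun --hostfile hosts.txt` with the usual program arguments.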

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Hostfile isn't being ignored - it is doing precisely what it is supposed to do (and is documented to do). The problem is that without tm support, we don't read the external allocation. So we use hostfile to identify the hosts, and then we discover the #slots on each host as being the #cores on
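To illustrate why both processes end up on the first node in this situation, here is a deliberately simplified model of filling hosts slot by slot. It is not Open MPI's actual mapper; the function name `fill_by_slot` is hypothetical, and the detected core count is expressed as an explicit `slots=` value on each line.

```shell
# Simplified model: assign ranks 0..NP-1 to hosts in hostfile order,
# moving to the next host only when the current one has no free slots.
# Input lines: "host slots=N" (slots defaults to 1 in this model).
fill_by_slot() {
    awk -v np="$1" '
        BEGIN { rank = 0 }
        NF {
            slots = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=/) slots = substr($i, 7) + 0
            for (s = 0; s < slots && rank < np; s++) {
                printf "rank %d -> %s\n", rank, $1
                rank++
            }
        }'
}
```

With two hosts that each report, say, 8 slots (a made-up core count), `fill_by_slot 2` places both ranks on the first host; with `slots=1` on each line it places one rank per host.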

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Crni Gorac via users
OK, just checked and you're right: both processes get run on the first node. So it seems that the "hostfile" option in mpirun, that in my case refers to a file properly listing two nodes, like:

    node1
    node2

is ignored. I also tried logging in to node1,

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Crni Gorac via users
I'm launching using qsub, the line from my previous message is from the corresponding qsub job submission script. FWIW, here is the whole script:

    #!/bin/bash
    #PBS -N FOO
    #PBS -l select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
    #PBS -l Walltime=00:01:00

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Are you launching the job with "mpirun"? I'm not familiar with that cmd line and don't know what it does. Most likely explanation is that the mpirun from the prebuilt versions doesn't have TM support, and therefore doesn't understand the 1ppn directive in your cmd line. My guess is that you
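One way to check whether a given Open MPI build includes TM support is to look for the `tm` components in the output of `ompi_info`. The helper below is a sketch; the function name `has_tm_support` is hypothetical, and the pattern assumes component lines of the form `MCA ras: tm (...)` or `MCA plm: tm (...)`.

```shell
# Succeeds (exit status 0) if the text on stdin lists a tm component
# for the ras or plm framework; fails otherwise.
has_tm_support() {
    grep -Eq 'MCA (ras|plm): tm( |$)'
}
```

Typical use would be `ompi_info | has_tm_support && echo "tm support present"`, run with the same installation that provides the `mpirun` in question.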

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Crni Gorac via users
I have one process per node, here is the corresponding line from my job submission script (with compute nodes named "node1" and "node2"):

    #PBS -l select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Afraid I can't understand your scenario - when you say you "submit a job" to run on two nodes, how many processes are you running on each node?

> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users wrote:
>
> Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
> have PBS