Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Yeah, I'm surprised by that - they used to build with --with-tm, as it only 
activates if/when it finds itself in the appropriate environment. There is no 
harm in building the support, so the distros always built with all the RM 
components. No idea why this happened - you might mention it to them, as I 
suspect it was an error/oversight.
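
For reference, building Open MPI with tm support is mostly a matter of pointing 
configure at the PBS installation. A minimal sketch (the /opt/pbs prefix and 
install prefix below are only placeholders for the local paths, not taken from 
the original posts):

  ./configure --prefix=$HOME/openmpi-tm --with-tm=/opt/pbs
  make -j 8 all
  make install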



> On Jan 18, 2022, at 3:05 PM, Crni Gorac via users  
> wrote:
> 
> Indeed, I realized in the meantime that changing the hostfile to:
> 
> node1 slots=1
> node2 slots=1
> 
> works as I expected.
> 
> Thanks once again for the clarification - got it now.  I'll see if we
> can live with this (the job submission scripts are mostly automatically
> generated from an auxiliary, site-specific shell script, and I can
> change this one to simply add "slots=1" to the hostfile generated by
> PBS before passing it to mpirun), but it's a pity that tm support is
> not included in these pre-built OpenMPI installations.
> 
> On Tue, Jan 18, 2022 at 11:56 PM Ralph Castain via users
>  wrote:
>> 
>> Hostfile isn't being ignored - it is doing precisely what it is supposed to 
>> do (and is documented to do). The problem is that without tm support, we 
>> don't read the external allocation. So we use hostfile to identify the 
>> hosts, and then we discover the #slots on each host as being the #cores on 
>> that node.
>> 
>> In contrast, the -host option is doing what it is supposed to do - it 
>> assigns one slot for each mention of the hostname. You can increase the slot 
>> allocation using the colon qualifier - i.e., "-host node1:5" assigns 5 slots 
>> to node1.
>> 
>> If tm support is included, then we read the PBS allocation and see one slot 
>> on each node - and launch accordingly.
>> 
>> 
>>> On Jan 18, 2022, at 2:44 PM, Crni Gorac via users 
>>>  wrote:
>>> 
>>> OK, just checked and you're right: both processes get run on the first
>>> node.  So it seems that the "hostfile" option in mpirun, which in my
>>> case refers to a file properly listing two nodes, like:
>>> 
>>> node1
>>> node2
>>> 
>>> is ignored.
>>> 
>>> I also tried logging in to node1, and launching using mpirun directly,
>>> without PBS, and the same thing happens.  However, if I specify "host"
>>> options instead, then ranks get started on different nodes, and it all
>>> works properly.  Then I tried the same from within the PBS script, and
>>> it worked.
>>> 
>>> Thus, to summarize, instead of:
>>> mpirun -n 2 -hostfile $PBS_NODEFILE ./foo
>>> one should use:
>>> mpirun -n 2 --host node1,node2 ./foo
>>> 
>>> Rather strange, but it's important that it works somehow.  Thanks for your 
>>> help!
>>> 
>>> On Tue, Jan 18, 2022 at 10:54 PM Ralph Castain via users
>>>  wrote:
 
 Are you launching the job with "mpirun"? I'm not familiar with that cmd 
 line and don't know what it does.
 
 Most likely explanation is that the mpirun from the prebuilt versions 
 doesn't have TM support, and therefore doesn't understand the 1ppn 
 directive in your cmd line. My guess is that you are using the ssh 
 launcher - what is odd is that you should wind up with two procs on the 
 first node, in which case those envars are correct. If you are seeing one 
 proc on each node, then something is wrong.
 
 
> On Jan 18, 2022, at 1:33 PM, Crni Gorac via users 
>  wrote:
> 
> I have one process per node; here is the corresponding line from my job
> submission script (with compute nodes named "node1" and "node2"):
> 
> #PBS -l 
> select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
> 
> On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
>  wrote:
>> 
>> Afraid I can't understand your scenario - when you say you "submit a 
>> job" to run on two nodes, how many processes are you running on each 
>> node??
>> 
>> 
>>> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users 
>>>  wrote:
>>> 
>>> Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
>>> have PBS 18.1.4 installed on my cluster (cluster nodes are running
>>> CentOS 7.9).  When I try to submit a job that will run on two nodes in
>>> the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
>>> instead of 1, and OMPI_COMM_WORLD_LOCAL_RANK are set to 0 and 1,
>>> instead of both being 0.  At the same time, the hostfile generated by
>>> PBS ($PBS_NODEFILE) properly contains two nodes listed.
>>> 
>>> I've tried with OpenMPI 3 from HPC-X, and the same thing happens too.
>>> However, when I build OpenMPI myself (notable difference from above
>>> mentioned pre-built MPI versions is that I use "--with-tm" option to
>>> point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE and
>>> OMPI_COMM_WORLD_LOCAL_RANK are set properly.
>>> 
>>> I'm not sure how to debug the problem, and whether it is possible to
>>> fix it at all with a pre-built OpenMPI version, so any suggestion is
>>> welcome.
>>> 
>>> Thanks.

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Crni Gorac via users
Indeed, I realized in the meantime that changing the hostfile to:

node1 slots=1
node2 slots=1

works as I expected.

Thanks once again for the clarification - got it now.  I'll see if we
can live with this (the job submission scripts are mostly automatically
generated from an auxiliary, site-specific shell script, and I can
change this one to simply add "slots=1" to the hostfile generated by
PBS before passing it to mpirun), but it's a pity that tm support is
not included in these pre-built OpenMPI installations.
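
One way the site script could do that is to rewrite $PBS_NODEFILE into a
temporary hostfile with explicit slot counts - an untested sketch, assuming
each host appears exactly once in the nodefile (as in the one-process-per-node
case here):

  hosts=$(mktemp)
  awk '{print $0 " slots=1"}' "$PBS_NODEFILE" > "$hosts"   # append slots=1 to every host line
  mpirun -n 2 -hostfile "$hosts" ./foo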

On Tue, Jan 18, 2022 at 11:56 PM Ralph Castain via users
 wrote:
>
> Hostfile isn't being ignored - it is doing precisely what it is supposed to 
> do (and is documented to do). The problem is that without tm support, we 
> don't read the external allocation. So we use hostfile to identify the hosts, 
> and then we discover the #slots on each host as being the #cores on that node.
>
> In contrast, the -host option is doing what it is supposed to do - it assigns 
> one slot for each mention of the hostname. You can increase the slot 
> allocation using the colon qualifier - i.e., "-host node1:5" assigns 5 slots 
> to node1.
>
> If tm support is included, then we read the PBS allocation and see one slot 
> on each node - and launch accordingly.
>
>
> > On Jan 18, 2022, at 2:44 PM, Crni Gorac via users 
> >  wrote:
> >
> > OK, just checked and you're right: both processes get run on the first
> > node.  So it seems that the "hostfile" option in mpirun, which in my
> > case refers to a file properly listing two nodes, like:
> > 
> > node1
> > node2
> > 
> > is ignored.
> >
> > I also tried logging in to node1, and launching using mpirun directly,
> > without PBS, and the same thing happens.  However, if I specify "host"
> > options instead, then ranks get started on different nodes, and it all
> > works properly.  Then I tried the same from within the PBS script, and
> > it worked.
> >
> > Thus, to summarize, instead of:
> > mpirun -n 2 -hostfile $PBS_NODEFILE ./foo
> > one should use:
> > mpirun -n 2 --host node1,node2 ./foo
> >
> > Rather strange, but it's important that it works somehow.  Thanks for your 
> > help!
> >
> > On Tue, Jan 18, 2022 at 10:54 PM Ralph Castain via users
> >  wrote:
> >>
> >> Are you launching the job with "mpirun"? I'm not familiar with that cmd 
> >> line and don't know what it does.
> >>
> >> Most likely explanation is that the mpirun from the prebuilt versions 
> >> doesn't have TM support, and therefore doesn't understand the 1ppn 
> >> directive in your cmd line. My guess is that you are using the ssh 
> >> launcher - what is odd is that you should wind up with two procs on the 
> >> first node, in which case those envars are correct. If you are seeing one 
> >> proc on each node, then something is wrong.
> >>
> >>
> >>> On Jan 18, 2022, at 1:33 PM, Crni Gorac via users 
> >>>  wrote:
> >>>
> >>> I have one process per node; here is the corresponding line from my job
> >>> submission script (with compute nodes named "node1" and "node2"):
> >>>
> >>> #PBS -l 
> >>> select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
> >>>
> >>> On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
> >>>  wrote:
> 
>  Afraid I can't understand your scenario - when you say you "submit a 
>  job" to run on two nodes, how many processes are you running on each 
>  node??
> 
> 
> > On Jan 18, 2022, at 1:07 PM, Crni Gorac via users 
> >  wrote:
> >
> > Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
> > have PBS 18.1.4 installed on my cluster (cluster nodes are running
> > CentOS 7.9).  When I try to submit a job that will run on two nodes in
> > the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
> > instead of 1, and OMPI_COMM_WORLD_LOCAL_RANK are set to 0 and 1,
> > instead of both being 0.  At the same time, the hostfile generated by
> > PBS ($PBS_NODEFILE) properly contains two nodes listed.
> >
> > I've tried with OpenMPI 3 from HPC-X, and the same thing happens too.
> > However, when I build OpenMPI myself (notable difference from above
> > mentioned pre-built MPI versions is that I use "--with-tm" option to
> > point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE and
> > OMPI_COMM_WORLD_LOCAL_RANK are set properly.
> >
> > I'm not sure how to debug the problem, and whether it is possible to
> > fix it at all with a pre-built OpenMPI version, so any suggestion is
> > welcome.
> >
> > Thanks.
> 
> 
> >>
> >>
>
>


Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Hostfile isn't being ignored - it is doing precisely what it is supposed to do 
(and is documented to do). The problem is that without tm support, we don't 
read the external allocation. So we use hostfile to identify the hosts, and 
then we discover the #slots on each host as being the #cores on that node.

In contrast, the -host option is doing what it is supposed to do - it assigns 
one slot for each mention of the hostname. You can increase the slot allocation 
using the colon qualifier - i.e., "-host node1:5" assigns 5 slots to node1.

If tm support is included, then we read the PBS allocation and see one slot on 
each node - and launch accordingly.
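
To make the difference concrete with the node names used in this thread (an
illustrative sketch, assuming a two-node allocation and an executable ./foo):

  # Bare hostnames: without tm support, each host is assumed to have as many
  # slots as it has cores, so both ranks can land on node1.
  printf 'node1\nnode2\n' > myhosts
  mpirun -n 2 -hostfile myhosts ./foo

  # Explicit slot counts: one slot per host, so one rank per node.
  printf 'node1 slots=1\nnode2 slots=1\n' > myhosts
  mpirun -n 2 -hostfile myhosts ./foo

  # -host: one slot per mention, or a count via the colon qualifier.
  mpirun -n 2 --host node1,node2 ./foo
  mpirun -n 5 --host node1:5 ./foo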


> On Jan 18, 2022, at 2:44 PM, Crni Gorac via users  
> wrote:
> 
> OK, just checked and you're right: both processes get run on the first
> node.  So it seems that the "hostfile" option in mpirun, which in my
> case refers to a file properly listing two nodes, like:
> 
> node1
> node2
> 
> is ignored.
> 
> I also tried logging in to node1, and launching using mpirun directly,
> without PBS, and the same thing happens.  However, if I specify "host"
> options instead, then ranks get started on different nodes, and it all
> works properly.  Then I tried the same from within the PBS script, and
> it worked.
> 
> Thus, to summarize, instead of:
> mpirun -n 2 -hostfile $PBS_NODEFILE ./foo
> one should use:
> mpirun -n 2 --host node1,node2 ./foo
> 
> Rather strange, but it's important that it works somehow.  Thanks for your 
> help!
> 
> On Tue, Jan 18, 2022 at 10:54 PM Ralph Castain via users
>  wrote:
>> 
>> Are you launching the job with "mpirun"? I'm not familiar with that cmd line 
>> and don't know what it does.
>> 
>> Most likely explanation is that the mpirun from the prebuilt versions 
>> doesn't have TM support, and therefore doesn't understand the 1ppn directive 
>> in your cmd line. My guess is that you are using the ssh launcher - what is 
>> odd is that you should wind up with two procs on the first node, in which 
>> case those envars are correct. If you are seeing one proc on each node, then 
>> something is wrong.
>> 
>> 
>>> On Jan 18, 2022, at 1:33 PM, Crni Gorac via users 
>>>  wrote:
>>> 
>>> I have one process per node; here is the corresponding line from my job
>>> submission script (with compute nodes named "node1" and "node2"):
>>> 
>>> #PBS -l 
>>> select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
>>> 
>>> On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
>>>  wrote:
 
 Afraid I can't understand your scenario - when you say you "submit a job" 
 to run on two nodes, how many processes are you running on each node??
 
 
> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users 
>  wrote:
> 
> Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
> have PBS 18.1.4 installed on my cluster (cluster nodes are running
> CentOS 7.9).  When I try to submit a job that will run on two nodes in
> the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
> instead of 1, and OMPI_COMM_WORLD_LOCAL_RANK are set to 0 and 1,
> instead of both being 0.  At the same time, the hostfile generated by
> PBS ($PBS_NODEFILE) properly contains two nodes listed.
> 
> I've tried with OpenMPI 3 from HPC-X, and the same thing happens too.
> However, when I build OpenMPI myself (notable difference from above
> mentioned pre-built MPI versions is that I use "--with-tm" option to
> point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE and
> OMPI_COMM_WORLD_LOCAL_RANK are set properly.
> 
> I'm not sure how to debug the problem, and whether it is possible to
> fix it at all with a pre-built OpenMPI version, so any suggestion is
> welcome.
> 
> Thanks.
 
 
>> 
>> 




Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Crni Gorac via users
OK, just checked and you're right: both processes get run on the first
node.  So it seems that the "hostfile" option in mpirun, which in my
case refers to a file properly listing two nodes, like:

node1
node2

is ignored.

I also tried logging in to node1, and launching using mpirun directly,
without PBS, and the same thing happens.  However, if I specify "host"
options instead, then ranks get started on different nodes, and it all
works properly.  Then I tried the same from within the PBS script, and
it worked.

Thus, to summarize, instead of:
mpirun -n 2 -hostfile $PBS_NODEFILE ./foo
one should use:
mpirun -n 2 --host node1,node2 ./foo

Rather strange, but it's important that it works somehow.  Thanks for your help!
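
If hand-listing the nodes is inconvenient, the --host list can also be built
from $PBS_NODEFILE on the fly - a sketch, assuming standard coreutils on the
node where mpirun runs:

  mpirun -n 2 --host "$(sort -u "$PBS_NODEFILE" | paste -sd, -)" ./foo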

On Tue, Jan 18, 2022 at 10:54 PM Ralph Castain via users
 wrote:
>
> Are you launching the job with "mpirun"? I'm not familiar with that cmd line 
> and don't know what it does.
>
> Most likely explanation is that the mpirun from the prebuilt versions doesn't 
> have TM support, and therefore doesn't understand the 1ppn directive in your 
> cmd line. My guess is that you are using the ssh launcher - what is odd is 
> that you should wind up with two procs on the first node, in which case those 
> envars are correct. If you are seeing one proc on each node, then something 
> is wrong.
>
>
> > On Jan 18, 2022, at 1:33 PM, Crni Gorac via users 
> >  wrote:
> >
> > I have one process per node; here is the corresponding line from my job
> > submission script (with compute nodes named "node1" and "node2"):
> >
> > #PBS -l 
> > select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
> >
> > On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
> >  wrote:
> >>
> >> Afraid I can't understand your scenario - when you say you "submit a job" 
> >> to run on two nodes, how many processes are you running on each node??
> >>
> >>
> >>> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users 
> >>>  wrote:
> >>>
> >>> Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
> >>> have PBS 18.1.4 installed on my cluster (cluster nodes are running
> >>> CentOS 7.9).  When I try to submit a job that will run on two nodes in
> >>> the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
> >>> instead of 1, and OMPI_COMM_WORLD_LOCAL_RANK are set to 0 and 1,
> >>> instead of both being 0.  At the same time, the hostfile generated by
> >>> PBS ($PBS_NODEFILE) properly contains two nodes listed.
> >>>
> >>> I've tried with OpenMPI 3 from HPC-X, and the same thing happens too.
> >>> However, when I build OpenMPI myself (notable difference from above
> >>> mentioned pre-built MPI versions is that I use "--with-tm" option to
> >>> point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE and
> >>> OMPI_COMM_WORLD_LOCAL_RANK are set properly.
> >>>
> >>> I'm not sure how to debug the problem, and whether it is possible to
> >>> fix it at all with a pre-built OpenMPI version, so any suggestion is
> >>> welcome.
> >>>
> >>> Thanks.
> >>
> >>
>
>


Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Crni Gorac via users
I'm launching using qsub; the line from my previous message is from
the corresponding qsub job submission script.  FWIW, here is the whole
script:


#!/bin/bash
#PBS -N FOO
#PBS -l select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
#PBS -l Walltime=00:01:00
#PBS -o "foo.out"
#PBS -e "foo.err"
#PBS -q myqueue
#PBS -V

cd $PBS_O_WORKDIR
cat $PBS_NODEFILE >nodes.txt

mpipath=/usr/mpi/gcc/openmpi-4.1.2rc2
mpibinpath=$mpipath/bin
mpilibpath=$mpipath/lib64
export PATH=$mpibinpath:$PATH
export LD_LIBRARY_PATH=$mpilibpath:$LD_LIBRARY_PATH

mpirun -n 2 -hostfile $PBS_NODEFILE ./foo


Here, "foo" is a small MPI program that just prints
OMPI_COMM_WORLD_LOCAL_RANK and OMPI_COMM_WORLD_LOCAL_SIZE, and exits.
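
As a side note, the same check works without compiling anything, since mpirun
exports those variables into the environment of whatever it launches - for
example (a sketch, with a plain shell command in place of ./foo):

  mpirun -n 2 -hostfile "$PBS_NODEFILE" bash -c \
    'echo "$(hostname): local rank $OMPI_COMM_WORLD_LOCAL_RANK of $OMPI_COMM_WORLD_LOCAL_SIZE"'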

On Tue, Jan 18, 2022 at 10:54 PM Ralph Castain via users
 wrote:
>
> Are you launching the job with "mpirun"? I'm not familiar with that cmd line 
> and don't know what it does.
>
> Most likely explanation is that the mpirun from the prebuilt versions doesn't 
> have TM support, and therefore doesn't understand the 1ppn directive in your 
> cmd line. My guess is that you are using the ssh launcher - what is odd is 
> that you should wind up with two procs on the first node, in which case those 
> envars are correct. If you are seeing one proc on each node, then something 
> is wrong.
>
>
> > On Jan 18, 2022, at 1:33 PM, Crni Gorac via users 
> >  wrote:
> >
> > I have one process per node; here is the corresponding line from my job
> > submission script (with compute nodes named "node1" and "node2"):
> >
> > #PBS -l 
> > select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
> >
> > On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
> >  wrote:
> >>
> >> Afraid I can't understand your scenario - when you say you "submit a job" 
> >> to run on two nodes, how many processes are you running on each node??
> >>
> >>
> >>> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users 
> >>>  wrote:
> >>>
> >>> Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
> >>> have PBS 18.1.4 installed on my cluster (cluster nodes are running
> >>> CentOS 7.9).  When I try to submit a job that will run on two nodes in
> >>> the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
> >>> instead of 1, and OMPI_COMM_WORLD_LOCAL_RANK are set to 0 and 1,
> >>> instead of both being 0.  At the same time, the hostfile generated by
> >>> PBS ($PBS_NODEFILE) properly contains two nodes listed.
> >>>
> >>> I've tried with OpenMPI 3 from HPC-X, and the same thing happens too.
> >>> However, when I build OpenMPI myself (notable difference from above
> >>> mentioned pre-built MPI versions is that I use "--with-tm" option to
> >>> point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE and
> >>> OMPI_COMM_WORLD_LOCAL_RANK are set properly.
> >>>
> >>> I'm not sure how to debug the problem, and whether it is possible to
> >>> fix it at all with a pre-built OpenMPI version, so any suggestion is
> >>> welcome.
> >>>
> >>> Thanks.
> >>
> >>
>
>


Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Are you launching the job with "mpirun"? I'm not familiar with that cmd line 
and don't know what it does.

Most likely explanation is that the mpirun from the prebuilt versions doesn't 
have TM support, and therefore doesn't understand the 1ppn directive in your 
cmd line. My guess is that you are using the ssh launcher - what is odd is that 
you should wind up with two procs on the first node, in which case those envars 
are correct. If you are seeing one proc on each node, then something is wrong.


> On Jan 18, 2022, at 1:33 PM, Crni Gorac via users  
> wrote:
> 
> I have one process per node; here is the corresponding line from my job
> submission script (with compute nodes named "node1" and "node2"):
> 
> #PBS -l select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
> 
> On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
>  wrote:
>> 
>> Afraid I can't understand your scenario - when you say you "submit a job" to 
>> run on two nodes, how many processes are you running on each node??
>> 
>> 
>>> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users 
>>>  wrote:
>>> 
>>> Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
>>> have PBS 18.1.4 installed on my cluster (cluster nodes are running
>>> CentOS 7.9).  When I try to submit a job that will run on two nodes in
>>> the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
>>> instead of 1, and OMPI_COMM_WORLD_LOCAL_RANK are set to 0 and 1,
>>> instead of both being 0.  At the same time, the hostfile generated by
>>> PBS ($PBS_NODEFILE) properly contains two nodes listed.
>>> 
>>> I've tried with OpenMPI 3 from HPC-X, and the same thing happens too.
>>> However, when I build OpenMPI myself (notable difference from above
>>> mentioned pre-built MPI versions is that I use "--with-tm" option to
>>> point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE and
>>> OMPI_COMM_WORLD_LOCAL_RANK are set properly.
>>> 
>>> I'm not sure how to debug the problem, and whether it is possible to
>>> fix it at all with a pre-built OpenMPI version, so any suggestion is
>>> welcome.
>>> 
>>> Thanks.
>> 
>> 




Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Crni Gorac via users
I have one process per node; here is the corresponding line from my job
submission script (with compute nodes named "node1" and "node2"):

#PBS -l select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2
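
For reference, with mpiprocs=1 in each chunk of that select statement,
$PBS_NODEFILE should normally end up with one entry per requested MPI process,
i.e. the file would simply contain:

node1
node2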

On Tue, Jan 18, 2022 at 10:20 PM Ralph Castain via users
 wrote:
>
> Afraid I can't understand your scenario - when you say you "submit a job" to 
> run on two nodes, how many processes are you running on each node??
>
>
> > On Jan 18, 2022, at 1:07 PM, Crni Gorac via users 
> >  wrote:
> >
> > Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
> > have PBS 18.1.4 installed on my cluster (cluster nodes are running
> > CentOS 7.9).  When I try to submit a job that will run on two nodes in
> > the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
> > instead of 1, and OMPI_COMM_WORLD_LOCAL_RANK are set to 0 and 1,
> > instead of both being 0.  At the same time, the hostfile generated by
> > PBS ($PBS_NODEFILE) properly contains two nodes listed.
> >
> > I've tried with OpenMPI 3 from HPC-X, and the same thing happens too.
> > However, when I build OpenMPI myself (notable difference from above
> > mentioned pre-built MPI versions is that I use "--with-tm" option to
> > point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE and
> > OMPI_COMM_WORLD_LOCAL_RANK are set properly.
> >
> > I'm not sure how to debug the problem, and whether it is possible to
> > fix it at all with a pre-built OpenMPI version, so any suggestion is
> > welcome.
> >
> > Thanks.
>
>


Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Afraid I can't understand your scenario - when you say you "submit a job" to 
run on two nodes, how many processes are you running on each node??


> On Jan 18, 2022, at 1:07 PM, Crni Gorac via users  
> wrote:
> 
> Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and
> have PBS 18.1.4 installed on my cluster (cluster nodes are running
> CentOS 7.9).  When I try to submit a job that will run on two nodes in
> the cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2,
> instead of 1, and OMPI_COMM_WORLD_LOCAL_RANK are set to 0 and 1,
> instead of both being 0.  At the same time, the hostfile generated by
> PBS ($PBS_NODEFILE) properly contains two nodes listed.
> 
> I've tried with OpenMPI 3 from HPC-X, and the same thing happens too.
> However, when I build OpenMPI myself (notable difference from above
> mentioned pre-built MPI versions is that I use "--with-tm" option to
> point to my PBS installation), then OMPI_COMM_WORLD_LOCAL_SIZE and
> OMPI_COMM_WORLD_LOCAL_RANK are set properly.
> 
> I'm not sure how to debug the problem, and whether it is possible to
> fix it at all with a pre-built OpenMPI version, so any suggestion is
> welcome.
> 
> Thanks.