I'm using OpenMPI 4.1.2 from the MLNX_OFED_LINUX-5.5-1.0.3.2 distribution,
and have PBS 18.1.4 installed on my cluster (the cluster nodes run
CentOS 7.9).  When I submit a job that should run on two nodes in the
cluster, both ranks get OMPI_COMM_WORLD_LOCAL_SIZE set to 2 instead
of 1, and OMPI_COMM_WORLD_LOCAL_RANK is set to 0 and 1 instead of
both being 0.  At the same time, the hostfile generated by PBS
($PBS_NODEFILE) correctly lists the two nodes.
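
To be concrete, this is roughly how I observe the values (the script
name is just for illustration; it only echoes what OpenMPI exports to
each rank, and the variables are simply unset outside of mpirun):

```shell
#!/bin/sh
# print_layout.sh: report the per-node layout OpenMPI hands to this rank.
# Launched as e.g.: mpirun --hostfile $PBS_NODEFILE -np 2 ./print_layout.sh
echo "host=$(hostname) local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-unset} local_size=${OMPI_COMM_WORLD_LOCAL_SIZE:-unset}"
```

With two nodes I would expect each line to show local_size=1 and
local_rank=0, but with the pre-built OpenMPI I get local_size=2 and
local_rank 0/1.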

I've also tried OpenMPI 3 from HPC-X, and the same thing happens.
However, when I build OpenMPI myself (the notable difference from the
pre-built versions above is that I configure with the "--with-tm"
option pointing to my PBS installation), OMPI_COMM_WORLD_LOCAL_SIZE
and OMPI_COMM_WORLD_LOCAL_RANK are set properly.
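
For completeness, my self-built configure step looks roughly like this
(a sketch; the prefix and the PBS path are placeholders for my setup,
and --with-tm is the only difference I'm aware of versus the pre-built
packages):

```shell
# Hypothetical paths; the relevant part is --with-tm pointing at PBS.
./configure --prefix=$HOME/opt/openmpi-4.1.2 --with-tm=/opt/pbs
make -j
make install
```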

I'm not sure how to debug this, or whether it can be fixed at all with
a pre-built OpenMPI version, so any suggestion is welcome.

Thanks.
