Re: [OMPI users] [EXTERNAL] [BULK] Re: OpenMPI crashes with TCP connection error

2023-06-17 Thread Mccall, Kurt E. (MSFC-EV41) via users
Gilles, Joachim,

The command line to launch my application is:

mpiexec --mca orte_base_help_aggregate 0 \
--enable-recovery \
--mca mpi_param_check 1 \
--v \
--wdir ${work_dir} \
--hostfile ${MY_NODEFILE} \
--np ${num_proc}  \
--map-by ppr:1:node \
executable \
… application specific args

Thanks,
Kurt

From: users  On Behalf Of Gilles Gouaillardet 
via users
Sent: Friday, June 16, 2023 11:05 PM
To: Open MPI Users 
Cc: Gilles Gouaillardet 
Subject: [EXTERNAL] [BULK] Re: [OMPI users] OpenMPI crashes with TCP connection 
error



Kurt,


I think Joachim was also asking for the command line used to launch your 
application.

Since you are using Slurm and MPI_Comm_spawn(), it is important to understand 
whether you are using mpirun or srun

FWIW, --mpi=pmix is a srun option. you can srun --mpi=list to find the 
available options.
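
For illustration only, a direct srun launch along those lines might look like the sketch below (node and task counts are copied from the sbatch command later in this thread, and the pmix option name should be checked against the local "srun --mpi=list" output; none of this is from the original message):

srun --mpi=pmix --nodes=16 --ntasks-per-node=24 executable ... application specific args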


Cheers,

Gilles

On Sat, Jun 17, 2023 at 2:53 AM Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:
Joachim,

Sorry to make you resort to divination.   My sbatch command is as follows:

sbatch --ntasks-per-node=24 --nodes=16 --ntasks=384  --job-name $job_name  
--exclusive --no-kill --verbose $release_dir/script.bash &

--mpi=pmix isn’t an option recognized by sbatch.   Is there an alternative?   
The slurm doc you mentioned has the following paragraph.  Is it still true with 
OpenMpi 4.1.5?

“NOTE: OpenMPI has a limitation that does not support calls to MPI_Comm_spawn() 
from within a Slurm allocation. If you need to use the MPI_Comm_spawn() 
function you will need to use another MPI implementation combined with PMI-2 
since PMIx doesn't support it either.”

I use MPI_Comm_spawn extensively in my application.

Thanks,
Kurt


From: Jenke, Joachim <je...@itc.rwth-aachen.de>
Sent: Thursday, June 15, 2023 5:33 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [EXTERNAL] Re: OpenMPI crashes with TCP connection error


Hi Kurt,

Without knowing your exact MPI launch command, my crystal orb thinks you might 
want to try the --mpi=pmix flag for srun as documented for slurm+openmpi:
https://slurm.schedmd.com/mpi_guide.html#open_mpi

-Joachim

From: users <users-boun...@lists.open-mpi.org> on behalf of Mccall, Kurt E. (MSFC-EV41) 
via users <users@lists.open-mpi.org>
Sent: Thursday, June 15, 2023 11:56:28 PM
To: users@lists.open-mpi.org
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [OMPI users] OpenMPI crashes with TCP connection error


My job immediately crashes with the error message below.   I don’t know where 
to begin looking for the cause

of the error, or what information to provide to help you understand it.  Maybe 
you could clue me in.



I am on RedHat 4.18.0, using Slurm 20.11.8 and OpenMPI 4.1.5 compiled with gcc 
8.5.0.

I built OpenMPI with the following  “configure” command:



./configure --prefix=/opt/openmpi/4.1.5_gnu --with-slurm --enable-debug







WARNING: Open MPI accepted a TCP connection from what appears to be a

another Open MPI process but cannot find a corresponding process

entry for that peer.



This attempted connection will be ignored; your MPI job may or may not

continue properly.



  Local host: n001

  PID:985481






Re: [OMPI users] OpenMPI crashes with TCP connection error

2023-06-16 Thread Mccall, Kurt E. (MSFC-EV41) via users
Joachim,

Sorry to make you resort to divination.   My sbatch command is as follows:

sbatch --ntasks-per-node=24 --nodes=16 --ntasks=384  --job-name $job_name  
--exclusive --no-kill --verbose $release_dir/script.bash &

--mpi=pmix isn’t an option recognized by sbatch.   Is there an alternative?   
The slurm doc you mentioned has the following paragraph.  Is it still true with 
OpenMpi 4.1.5?

“NOTE: OpenMPI has a limitation that does not support calls to MPI_Comm_spawn() 
from within a Slurm allocation. If you need to use the MPI_Comm_spawn() 
function you will need to use another MPI implementation combined with PMI-2 
since PMIx doesn't support it either.”

I use MPI_Comm_spawn extensively in my application.

Thanks,
Kurt


From: Jenke, Joachim 
Sent: Thursday, June 15, 2023 5:33 PM
To: Open MPI Users 
Cc: Mccall, Kurt E. (MSFC-EV41) 
Subject: [EXTERNAL] Re: OpenMPI crashes with TCP connection error



Hi Kurt,

Without knowing your exact MPI launch command, my crystal orb thinks you might 
want to try the --mpi=pmix flag for srun as documented for slurm+openmpi:
https://slurm.schedmd.com/mpi_guide.html#open_mpi

-Joachim

From: users <users-boun...@lists.open-mpi.org> on behalf of Mccall, Kurt E. (MSFC-EV41) 
via users <users@lists.open-mpi.org>
Sent: Thursday, June 15, 2023 11:56:28 PM
To: users@lists.open-mpi.org
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [OMPI users] OpenMPI crashes with TCP connection error


My job immediately crashes with the error message below.   I don’t know where 
to begin looking for the cause

of the error, or what information to provide to help you understand it.  Maybe 
you could clue me in.



I am on RedHat 4.18.0, using Slurm 20.11.8 and OpenMPI 4.1.5 compiled with gcc 
8.5.0.

I built OpenMPI with the following  “configure” command:



./configure --prefix=/opt/openmpi/4.1.5_gnu --with-slurm --enable-debug







WARNING: Open MPI accepted a TCP connection from what appears to be a

another Open MPI process but cannot find a corresponding process

entry for that peer.



This attempted connection will be ignored; your MPI job may or may not

continue properly.



  Local host: n001

  PID:985481






[OMPI users] OpenMPI crashes with TCP connection error

2023-06-15 Thread Mccall, Kurt E. (MSFC-EV41) via users
My job immediately crashes with the error message below.   I don’t know where 
to begin looking for the cause
of the error, or what information to provide to help you understand it.  Maybe 
you could clue me in.

I am on RedHat 4.18.0, using Slurm 20.11.8 and OpenMPI 4.1.5 compiled with gcc 
8.5.0.
I built OpenMPI with the following  “configure” command:

./configure --prefix=/opt/openmpi/4.1.5_gnu --with-slurm --enable-debug



WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: n001
  PID:985481




[OMPI users] Disabling barrier in MPI_Finalize

2022-09-09 Thread Mccall, Kurt E. (MSFC-EV41) via users
Hi,

If a single process needs to exit, MPI_Finalize will pause at a barrier, 
possibly waiting for pending communications to complete.  Does OpenMPI have any 
means to disable this behavior, so that a single process can exit normally if 
the application calls for it?

Thanks,
Kurt


Re: [OMPI users] OpenMpi crash in MPI_Comm_spawn / developer message

2022-03-18 Thread Mccall, Kurt E. (MSFC-EV41) via users
Just an update: I eliminated the error below by telling MPI_Comm_spawn to 
create non-MPI processes, via the info key:

MPI_Info_set(info, "ompi_non_mpi", "true");

If you still want to pursue this matter, let me know.

Kurt

From: Mccall, Kurt E. (MSFC-EV41) 
Sent: Thursday, March 17, 2022 5:58 PM
To: Open MPI Users 
Cc: Mccall, Kurt E. (MSFC-EV41) 
Subject: OpenMpi crash in MPI_Comm_spawn / developer message

My job successfully spawned a large number of subprocesses via MPI_Comm_spawn, 
filling up the available cores.   When some of those subprocesses terminated, 
it attempted to spawn more.   It appears that the latter calls to 
MPI_Comm_spawn caused this error:

[n022.cluster.com:08996] [[56319,0],0] grpcomm:direct:send_relay proc 
[[56319,0],1] not running - cannot relay: NOT ALIVE

An internal error has occurred in ORTE:

[[56319,0],0] FORCE-TERMINATE AT Unreachable:-12 - error grpcomm_direct.c(601)

This is something that should be reported to the developers.

I would attach the output created by the mpiexec arguments “--mca 
ras_base_verbose 5 --display-devel-map --mca rmaps_base_verbose 5 “, but it is 
22 Mb.  Do you have a location where I can drop the file?

Thanks for any help.
Kurt


[OMPI users] OpenMpi crash in MPI_Comm_spawn / developer message

2022-03-17 Thread Mccall, Kurt E. (MSFC-EV41) via users
My job successfully spawned a large number of subprocesses via MPI_Comm_spawn, 
filling up the available cores.   When some of those subprocesses terminated, 
it attempted to spawn more.   It appears that the latter calls to 
MPI_Comm_spawn caused this error:

[n022.cluster.com:08996] [[56319,0],0] grpcomm:direct:send_relay proc 
[[56319,0],1] not running - cannot relay: NOT ALIVE

An internal error has occurred in ORTE:

[[56319,0],0] FORCE-TERMINATE AT Unreachable:-12 - error grpcomm_direct.c(601)

This is something that should be reported to the developers.

I would attach the output created by the mpiexec arguments “--mca 
ras_base_verbose 5 --display-devel-map --mca rmaps_base_verbose 5 “, but it is 
22 Mb.  Do you have a location where I can drop the file?

Thanks for any help.
Kurt


Re: [OMPI users] MPI_Intercomm_create error

2022-03-16 Thread Mccall, Kurt E. (MSFC-EV41) via users
George,

Thanks, that was it!

Kurt

From: George Bosilca 
Sent: Wednesday, March 16, 2022 4:38 PM
To: Open MPI Users 
Cc: Mccall, Kurt E. (MSFC-EV41) 
Subject: [EXTERNAL] Re: [OMPI users] MPI_Intercomm_create error

I see similar issues on platforms with multiple IP addresses, if some of them 
are not fully connected. In general, specifying which interface OMPI can use 
(with --mca btl_tcp_if_include x.y.z.t/s) solves the problem.
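
As an illustration (the subnet and interface name below are placeholders, not taken from this thread), the parameter accepts either a CIDR subnet or an interface name:

mpirun --mca btl_tcp_if_include 192.168.1.0/24 ...
mpirun --mca btl_tcp_if_include eth0 ...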

  George.


On Wed, Mar 16, 2022 at 5:11 PM Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:
I’m using OpenMpi 4.1.2 under Slurm 20.11.8.  My 2 process job is successfully 
launched, but when the main process rank 0
attempts to create an intercommunicator with process rank 1 on the other node:

MPI_Comm intercom;
MPI_Intercomm_create(MPI_COMM_SELF, 0, MPI_COMM_WORLD, 1, ,   );

OpenMpi spins deep inside the MPI_Intercomm_create code, and the following 
error is reported:

WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

The output resulting from using the mpirun arguments “--mca ras_base_verbose 5 
--display-devel-map --mca rmaps_base_verbose 5” is attached.
Any help would be appreciated.


[OMPI users] MPI_Intercomm_create error

2022-03-16 Thread Mccall, Kurt E. (MSFC-EV41) via users
I'm using OpenMpi 4.1.2 under Slurm 20.11.8.  My 2 process job is successfully 
launched, but when the main process rank 0
attempts to create an intercommunicator with process rank 1 on the other node:

MPI_Comm intercom;
MPI_Intercomm_create(MPI_COMM_SELF, 0, MPI_COMM_WORLD, 1, ,   );

OpenMpi spins deep inside the MPI_Intercomm_create code, and the following 
error is reported:

WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

The output resulting from using the mpirun arguments "--mca ras_base_verbose 5 
--display-devel-map --mca rmaps_base_verbose 5" is attached.
Any help would be appreciated.
SLURM_JOB_NODELIST =  n[001-002]
Calling mpirun for slurm
num_proc =  2
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm: available for 
selection
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base set policy with 
ppr:1:node device NONNULL
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base policy ppr 
modifiers 1:node provided
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component mindist
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component 
[mindist]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component ppr
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [ppr]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component rank_file
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component 
[rank_file]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component resilient
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component 
[resilient]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component round_robin
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component 
[round_robin]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component seq
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [seq]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0]: Final mapper priorities
[n001.cluster.pssclabs.com:3473322] Mapper: ppr Priority: 90
[n001.cluster.pssclabs.com:3473322] Mapper: seq Priority: 60
[n001.cluster.pssclabs.com:3473322] Mapper: resilient Priority: 40
[n001.cluster.pssclabs.com:3473322] Mapper: mindist Priority: 20
[n001.cluster.pssclabs.com:3473322] Mapper: round_robin Priority: 10
[n001.cluster.pssclabs.com:3473322] Mapper: rank_file Priority: 0
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base set policy with 
ppr:1:node device NULL
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base policy ppr 
modifiers 1:node provided
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:allocate
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: 
checking nodelist: n[001-002]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: 
parse range 001-002 (2)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: 
adding node n001 (24 slots)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: 
adding node n002 (24 slots)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate: success
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert 
inserting 2 nodes
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert updating 
HNP [n001] info to 24 slots
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert node 
n002 slots 24

==   ALLOCATED NODES   ==
n001: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
n002: flags=0x10 slots=24 max_slots=0 slots_inuse=0 state=UP
=

==   ALLOCATED NODES   ==
n001: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
n002: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
=
[n001.cluster.pssclabs.com:3473322] mca:rmaps: mapping job [65186,1]
[n001.cluster.pssclabs.com:3473322] mca:rmaps: setting mapping policies for job 
[65186,1] nprocs 2
[n001.cluster.pssclabs.com:3473322] mca:rmaps[303] binding not given - using 
bycore
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: mapping job [65186,1] with 
ppr 1:node
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: job [65186,1] assigned 
policy BYNODE:NOOVERSUBSCRIBE
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting with 2 nodes in list
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Filtering thru apps
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Retained 2 nodes in list

[OMPI users] OpenMPI, Slurm and MPI_Comm_spawn

2022-03-08 Thread Mccall, Kurt E. (MSFC-EV41) via users
The Slurm MPI User's Guide at https://slurm.schedmd.com/mpi_guide.html#open_mpi 
has a note that states:

NOTE: OpenMPI has a limitation that does not support calls to MPI_Comm_spawn() 
from within a Slurm allocation. If you need to use the MPI_Comm_spawn() 
function you will need to use another MPI implementation combined with PMI-2 
since PMIx doesn't support it either.

Is this still true in OpenMPI 4.1?

Thanks,
Kurt


Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

2021-11-05 Thread Mccall, Kurt E. (MSFC-EV41) via users
Ralph,

I changed the host name to n022:* and the problem persisted.  Here is my C++ 
code (modified slightly; the host name is not really hard-coded as it is 
below).  I thought I needed "ppr:1:node" to spawn a single process, but maybe 
that is wrong.

char info_str[64];
sprintf(info_str, "ppr:%d:node", 1);
MPI_Info_create(&info);
MPI_Info_set(info, "host", "n022:*");
MPI_Info_set(info, "map-by", info_str);

MPI_Comm_spawn(manager_cmd_.c_str(), argv_, 1, info, rank_, 
MPI_COMM_SELF, , error_codes);

From: users  On Behalf Of Ralph Castain via 
users
Sent: Friday, November 5, 2021 9:50 AM
To: Open MPI Users 
Cc: Ralph Castain 
Subject: [EXTERNAL] Re: [OMPI users] Reserving slots and filling them after job 
launch with MPI_Comm_spawn

Here is the problem:

[n022.cluster.com:30045] [[36230,0],0] using dash_host n022
[n022.cluster.com:30045] [[36230,0],0] Removing node n022 slots 1 inuse 1
--
All nodes which are allocated for this job are already filled.
--

Looks like your program is passing a "dash-host" MPI info key to the Comm_spawn 
request and listing host "n022". This translates into assigning only one slot 
to that host, which indeed has already been filled. If you want to tell OMPI to 
use that host with _all_ slots available, then you need to change that 
"dash-host" info to be "n022:*", or replace the asterisk with the number of 
procs you want to allow on that node.
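
For illustration, with the "host" info key used in the code above, the three forms described here would be set like this (a sketch, pick one; the count of 8 is just an example, not from this thread):

MPI_Info_set(info, "host", "n022");     /* one slot on n022 */
MPI_Info_set(info, "host", "n022:*");   /* all available slots on n022 */
MPI_Info_set(info, "host", "n022:8");   /* up to 8 procs allowed on n022 */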



On Nov 5, 2021, at 7:37 AM, Mccall, Kurt E. (MSFC-EV41) 
<kurt.e.mcc...@nasa.gov> wrote:

Ralph,

I configured my build with -enable-debug and added "--mca rmaps_base_verbose 5" 
to the mpiexec command line.   I have attached the job output.   Thanks for 
being willing to look at this problem.

My complete configure command is as follows:

$ ./configure --enable-shared --enable-static --with-tm=/opt/torque 
--enable-mpi-cxx --enable-cxx-exceptions --disable-wrapper-runpath 
--prefix=/opt/openmpi_pgc_tm  CC=nvc CXX=nvc++ FC=pgfortran CPP=cpp CFLAGS="-O0 
-tp p7-64 -c99" CXXFLAGS="-O0 -tp p7-64" FCFLAGS="-O0 -tp p7-64" --enable-debug 
--enable-memchecker --with-valgrind=/home/kmccall/valgrind_install

The nvc++ version is "nvc++ 20.9-0 LLVM 64-bit target on x86-64 Linux -tp 
haswell".

Our OS is CentOS 7.

Here is my mpiexec command, minus all of the trailing arguments that don't 
affect mpiexec.

mpiexec --enable-recovery \
--mca rmaps_base_verbose 5 \
--display-allocation \
--merge-stderr-to-stdout \
--mca mpi_param_check 1 \
--v \
--x DISPLAY \
--map-by node \
-np 21  \
-wdir ${work_dir}  ...

Here is my qsub command for the program "Needles".

qsub -V -j oe -e $tmpdir_stdio -o $tmpdir_stdio -f -X -N Needles -l 
nodes=21:ppn=9  RunNeedles.bash;


From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Wednesday, November 3, 2021 11:58 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Reserving slots and filling them after job 
launch with MPI_Comm_spawn

Could you please ensure it was configured with --enable-debug and then add 
"--mca rmaps_base_verbose 5" to the mpirun cmd line?




On Nov 3, 2021, at 9:10 AM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:

Gilles and Ralph,

I did build with -with-tm.   I tried Gilles workaround but the failure still 
occurred.What do I need to provide you so that you can investigate this 
possible bug?

Thanks,
Kurt

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Wednesday, November 3, 2021 8:45 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Reserving slots and filli

Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

2021-11-03 Thread Mccall, Kurt E. (MSFC-EV41) via users
Gilles and Ralph,

I did build with -with-tm.   I tried Gilles workaround but the failure still 
occurred.What do I need to provide you so that you can investigate this 
possible bug?

Thanks,
Kurt

From: users  On Behalf Of Ralph Castain via 
users
Sent: Wednesday, November 3, 2021 8:45 AM
To: Open MPI Users 
Cc: Ralph Castain 
Subject: [EXTERNAL] Re: [OMPI users] Reserving slots and filling them after job 
launch with MPI_Comm_spawn

Sounds like a bug to me - regardless of configuration, if the hostfile contains 
an entry for each slot on a node, OMPI should have added those up.



On Nov 3, 2021, at 2:49 AM, Gilles Gouaillardet via users 
<users@lists.open-mpi.org> wrote:

Kurt,

Assuming you built Open MPI with tm support (default if tm is detected at 
configure time, but you can configure --with-tm to have it abort if tm support 
is not found), you should not need to use a hostfile.

As a workaround, I would suggest you try to
mpirun --map-by node -np 21 ...


Cheers,

Gilles

On Wed, Nov 3, 2021 at 6:06 PM Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:
I’m using OpenMPI 4.1.1 compiled with Nvidia’s nvc++ 20.9, and compiled with 
Torque support.

I want to reserve multiple slots on each node, and then launch a single manager 
process on each node.   The remaining slots would be filled up as the manager 
spawns new processes with MPI_Comm_spawn on its local node.

Here is the abbreviated mpiexec command, which I assume is the source of the 
problem described below (?).   The hostfile was created by Torque and it 
contains many repeated node names, one for each slot that it reserved.

$ mpiexec --hostfile  MyHostFile  -np 21 -npernode 1  (etc.)
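
For reference, a Torque-generated hostfile of the shape described above would look roughly like this (node names are illustrative, with each node repeated once per reserved slot to match "-l nodes=21:ppn=9"):

n001
n001
n001
... (each node repeated 9 times)
n002
n002
...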


When MPI_Comm_spawn is called, MPI is reporting that “All nodes which are 
allocated for this job are already filled."   They don’t appear to be filled as 
it also reports that only one slot is in use for each node:

==   ALLOCATED NODES   ==
n022: flags=0x11 slots=9 max_slots=0 slots_inuse=1 state=UP
n021: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n020: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n018: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n017: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n016: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n015: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n014: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n013: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n012: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n011: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n010: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n009: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n008: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n007: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n006: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n005: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n002: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n001: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

Do you have any idea what I am doing wrong?   My Torque qsub arguments are 
unchanged from when I successfully launched this kind of job structure under 
MPICH.   The relevant argument to qsub is the resource list, which is “-l  
nodes=21:ppn=9”.




[OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

2021-11-03 Thread Mccall, Kurt E. (MSFC-EV41) via users
I'm using OpenMPI 4.1.1 compiled with Nvidia's nvc++ 20.9, and compiled with 
Torque support.

I want to reserve multiple slots on each node, and then launch a single manager 
process on each node.   The remaining slots would be filled up as the manager 
spawns new processes with MPI_Comm_spawn on its local node.

Here is the abbreviated mpiexec command, which I assume is the source of the 
problem described below (?).   The hostfile was created by Torque and it 
contains many repeated node names, one for each slot that it reserved.

$ mpiexec --hostfile  MyHostFile  -np 21 -npernode 1  (etc.)


When MPI_Comm_spawn is called, MPI is reporting that "All nodes which are 
allocated for this job are already filled."   They don't appear to be filled as 
it also reports that only one slot is in use for each node:

==   ALLOCATED NODES   ==
n022: flags=0x11 slots=9 max_slots=0 slots_inuse=1 state=UP
n021: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n020: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n018: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n017: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n016: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n015: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n014: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n013: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n012: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n011: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n010: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n009: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n008: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n007: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n006: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n005: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n002: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
n001: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

Do you have any idea what I am doing wrong?   My Torque qsub arguments are 
unchanged from when I successfully launched this kind of job structure under 
MPICH.   The relevant argument to qsub is the resource list, which is "-l  
nodes=21:ppn=9".



[OMPI users] Memchecker and MPI_Comm_spawn

2020-05-09 Thread Mccall, Kurt E. (MSFC-EV41) via users
How can I run OpenMPI's Memchecker on a process created by MPI_Comm_spawn()?   
I've configured OpenMPI 4.0.3 for Memchecker, along with Valgrind 3.15.0 and it 
works quite well on processes created directly by mpiexec.

I tried to do something analogous by pre-pending "valgrind" onto the command 
passed to MPI_Comm_spawn(), but the process is not launched and it returns no 
error code.

char *argv[n];
MPI_Info info;
MPI_Comm comm;
int error_code[1];

MPI_Comm_spawn ("valgrind   myApp",  argv,  1,  info,  0,  MPI_COMM_SELF,  
,  error_code);

I didn't change the array of myApp arguments argv after adding "valgrind" to 
the command;   maybe it needs to be adjusted somehow.
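
One thing that might be worth trying (a sketch, not verified on this setup): since the command argument of MPI_Comm_spawn may be treated as a single executable name rather than a shell command line, spawn valgrind itself and move the real executable into the argument vector:

char *argv2[] = { "myApp", NULL };   /* valgrind's first argument is the program it should run */
MPI_Comm_spawn("valgrind", argv2, 1, info, 0, MPI_COMM_SELF, &comm, error_code);

A full path to valgrind may be needed if it is not on the PATH seen by the spawned environment.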

Thanks,
Kurt



Re: [OMPI users] Meaning of mpiexec error flags

2020-04-14 Thread Mccall, Kurt E. (MSFC-EV41) via users
Darn, I was hoping the flags would give a clue to the malfunction, which I’ve 
been trying to solve for weeks.  MPI_Comm_spawn() correctly spawns a worker on 
the node the mpirun is executing on, but on other nodes it says the following:



There are no allocated resources for the application:
  /home/kmccall/mav/9.15_mpi/mav
that match the requested mapping:
  -host: n002.cluster.com:3

Verify that you have mapped the allocated resources properly for the
indicated specification.

[n002:08645] *** An error occurred in MPI_Comm_spawn
[n002:08645] *** reported by process [1225916417,4]
[n002:08645] *** on communicator MPI_COMM_SELF
[n002:08645] *** MPI_ERR_SPAWN: could not spawn processes
[n002:08645] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
abort,
[n002:08645] ***and potentially your MPI job)

As you suggested several weeks ago, I added a process count to the host name 
(n001.cluster.com:3)   but it didn’t help.   Here is how I set up the “info” 
argument to MPI_Comm_spawn to spawn a single worker:

char info_str[64], host_str[64];

sprintf(info_str, "ppr:%d:node", 1);
sprintf(host_str, "%s:%d", host_name_.c_str(), 3);// added “:3” to 
host name

MPI_Info_create(&info);
MPI_Info_set(info, "host", host_str);
MPI_Info_set(info, "map-by", info_str);
MPI_Info_set(info, "ompi_non_mpi", "true");


From: users  On Behalf Of Ralph Castain via 
users
Sent: Tuesday, April 14, 2020 8:13 AM
To: Open MPI Users 
Cc: Ralph Castain 
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

Then those flags are correct. I suspect mpirun is executing on n006, yes? The 
"location verified" just means that the daemon of rank N reported back from the 
node we expected it to be on - Slurm and Cray sometimes renumber the ranks. 
Torque doesn't and so you should never see a problem. Since mpirun isn't 
launched by itself, its node is never "verified", though I probably should 
alter that as it is obviously in the "right" place.

I don't know what you mean by your app isn't behaving correctly on the remote 
nodes - best guess is that perhaps some envar they need isn't being forwarded?



On Apr 14, 2020, at 2:04 AM, Mccall, Kurt E. (MSFC-EV41) 
<kurt.e.mcc...@nasa.gov> wrote:

CentOS, Torque.



From: Ralph Castain <r...@open-mpi.org>
Sent: Monday, April 13, 2020 5:44 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

What kind of system are you running on? Slurm? Cray? ...?




On Apr 13, 2020, at 3:11 PM, Mccall, Kurt E. (MSFC-EV41) 
<kurt.e.mcc...@nasa.gov> wrote:

Thanks Ralph.   So the difference between the working node flag (0x11) and the 
non-working nodes’ flags (0x13) is the flag PRRTE_NODE_FLAG_LOC_VERIFIED.
What does that imply?   The location of the daemon has NOT been verified?

Kurt

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Monday, April 13, 2020 4:47 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

I updated the message to explain the flags (instead of a numerical value) for 
OMPI v5. In brief:

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED   0x01   // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED      0x02   // whether or not the location has been verified - used for
                                                 // environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED    0x04   // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED            0x08   // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN       0x10   // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE             0x20   // the node is hosting a tool and is NOT to be used for jobs






On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:

My application is behaving correctly on node n006, and incorrectly on the lower 
numbered nodes.   The flags in the error message below may give a clue as to 
why.   What is the meaning of the flag values 0x11 and 0x13?

==   ALLOCATED NODES   ==
n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=3 max_slots=0 slots_inu

Re: [OMPI users] Meaning of mpiexec error flags

2020-04-13 Thread Mccall, Kurt E. (MSFC-EV41) via users
Thanks Ralph.   So the difference between the working node flag (0x11) and the 
non-working nodes’ flags (0x13) is the flag PRRTE_NODE_FLAG_LOC_VERIFIED.
What does that imply?   The location of the daemon has NOT been verified?

Kurt

From: users  On Behalf Of Ralph Castain via 
users
Sent: Monday, April 13, 2020 4:47 PM
To: Open MPI Users 
Cc: Ralph Castain 
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

I updated the message to explain the flags (instead of a numerical value) for 
OMPI v5. In brief:

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED   0x01   // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED      0x02   // whether or not the location has been verified - used for
                                                 // environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED    0x04   // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED            0x08   // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN       0x10   // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE             0x20   // the node is hosting a tool and is NOT to be used for jobs
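
Applying those definitions to the two values in question gives this decode (worked out here, not part of the original message):

0x11 = 0x01 | 0x10        = DAEMON_LAUNCHED | SLOTS_GIVEN
0x13 = 0x01 | 0x02 | 0x10 = DAEMON_LAUNCHED | LOC_VERIFIED | SLOTS_GIVEN

so the only difference between the two is the LOC_VERIFIED bit.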




On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:

My application is behaving correctly on node n006, and incorrectly on the lower 
numbered nodes.   The flags in the error message below may give a clue as to 
why.   What is the meaning of the flag values 0x11 and 0x13?

==   ALLOCATED NODES   ==
n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP

I’m using OpenMPI 4.0.3.

Thanks,
Kurt



[OMPI users] Meaning of mpiexec error flags

2020-04-13 Thread Mccall, Kurt E. (MSFC-EV41) via users
My application is behaving correctly on node n006, and incorrectly on the lower 
numbered nodes.   The flags in the error message below may give a clue as to 
why.   What is the meaning of the flag values 0x11 and 0x13?

==   ALLOCATED NODES   ==
n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP

I'm using OpenMPI 4.0.3.

Thanks,
Kurt


Re: [OMPI users] [EXTERNAL] Re: Please help me interpret MPI output

2019-11-21 Thread Mccall, Kurt E. (MSFC-EV41) via users
Thanks for responding.   Here are some more details.   I'm using OpenMpi 4.0.2, 
compiled with the Portland Group Compiler, pgc++ 19.5-0, with the build flags

--enable-mpi-cxx  --enable-cxx-exceptions  --with-tm

PBS/Torque version is 5.1.1.

I launched the job with qsub:

qsub -V -j oe -e ./stdio -o ./stdio -f -X -N MyJob -l nodes=2:ppn=3  
RunMyJob.bash

My abbreviated mpiexec command within RunMyJob.bash was:

mpiexec --enable-recovery -display-map --display-allocation --mca 
mpi_param_check 1  --v --x DISPLAY --np 2  --map-by ppr:1:node 


-Original Message-
From: Peter Kjellström  
Sent: Thursday, November 21, 2019 3:40 AM
To: Mccall, Kurt E. (MSFC-EV41) 
Cc: users@lists.open-mpi.org
Subject: [EXTERNAL] Re: [OMPI users] Please help me interpret MPI output

On Wed, 20 Nov 2019 17:38:19 +
"Mccall, Kurt E. \(MSFC-EV41\) via users" 
wrote:

> Hi,
> 
> My job is behaving differently on its two nodes, refusing to
> MPI_Comm_spawn() a process on one of them but succeeding on the other.
...
> Data for node: n002    Num slots: 3    ... Bound: N/A
> Data for node: n001    Num slots: 3    ... Bound:
> socket 0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]
...
> Why is the Bound output different between n001 and n002?

Without knowing more details (like what exact openmpi, how exactly did you try 
to launch) etc. you're not likely to get good answers.

But it does seem clear that the process/rank to hardware (core) pinning 
happened on one but not the other node.

This suggests a broken install and/or environment and/or non-standard launch.

/Peter K


[OMPI users] Please help me interpret MPI output

2019-11-20 Thread Mccall, Kurt E. (MSFC-EV41) via users
Hi,

My job is behaving differently on its two nodes, refusing to MPI_Comm_spawn() a 
process on one of them but succeeding on the other.  Please help me interpret 
the output that MPI is producing - I am hoping it will yield clues as to what 
is different between the two nodes.

Here is one instance:

==   ALLOCATED NODES   ==
n002: flags=0x11 slots=3 max_slots=0 slots_inuse=0 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=0 state=UP


What do the differing flags values mean?   Here is another instance:



   JOB MAP   

Data for node: n002    Num slots: 3    Max slots: 0    Num procs: 1
Process OMPI jobid: [24604,1] App: 0 Process rank: 0 Bound: N/A

Data for node: n001    Num slots: 3    Max slots: 0    Num procs: 1
Process OMPI jobid: [24604,1] App: 0 Process rank: 1 Bound: socket 
0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]


Why is the Bound output different between n001 and n002?

Thanks for your help.

Kurt


[OMPI users] Interpreting the output of --display-map and --display-allocation

2019-11-18 Thread Mccall, Kurt E. (MSFC-EV41) via users
I'm trying to debug a problem with my job, launched with the mpiexec options 
-display-map and -display-allocation, but I don't know how to interpret the 
output.   For example,  mpiexec displays the following when a job is spawned by 
MPI_Comm_spawn():

==   ALLOCATED NODES   ==
n002: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP

Maybe the differing "flags" values have bearing on the problem, but I don't 
know what they mean.   Are the outputs of these two options documented anywhere?

Thanks,
Kurt


Re: [OMPI users] OpenMpi not throwing C++ exceptions

2019-11-07 Thread Mccall, Kurt E. (MSFC-EV41) via users
> Something is odd here, though -- I have two separately compiled OpenMpi 
> directories, one with and one without Torque support
>  (via the -with-tm configure flag).  >Ompi_info chose the one without Torque 
> support.   Why would it choose one over the other?  
> The one with Torque support is what I am using at the moment.

This makes me wonder if I have some sort of inconsistency on my system:  I'm 
linking my code against the With-Torque version of OpenMpi, but the O/S favors 
the Without-Torque version.   Could this cause a problem?

Kurt




Re: [OMPI users] OpenMpi not throwing C++ exceptions

2019-11-07 Thread Mccall, Kurt E. (MSFC-EV41) via users

Just to double check, does ompi_info show that you have C++ exception support?

-
$ ompi_info --all | grep exceptions
  C++ exceptions: yes
-

Indeed it does:

$ ompi_info --all | grep exceptions
  Configure command line: '--prefix=/opt/openmpi_pgc' '--enable-mpi-cxx' 
'--enable-cxx-exceptions'
  C++ exceptions: yes


Something is odd here, though -- I have two separately compiled OpenMpi 
directories, one with and one without Torque support (via the -with-tm 
configure flag).  Ompi_info chose the one without Torque support.   Why would 
it choose one over the other?  The one with Torque support is what I am using 
at the moment.

Kurt



Re: [OMPI users] OpenMpi not throwing C++ exceptions

2019-11-07 Thread Mccall, Kurt E. (MSFC-EV41) via users
> You need to also set the MPI::ERRORS_THROW_EXCEPTIONS error handler in your 
MPI application.


Thanks Jeff.   I double-checked, and yes, I’m calling 
MPI_Comm_set_errhandler(com , MPI::ERRORS_THROW_EXCEPTIONS) for every intra- 
and inter-communicator in the parent and child processes.  Could it be 
something else?

Kurt

From: Jeff Squyres (jsquyres) 
Subject: [EXTERNAL] Re: [OMPI users] OpenMpi not throwing C++ exceptions

On Nov 7, 2019, at 3:02 PM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:

My program is failing in MPI_Comm_spawn, but it seems to simply terminate the 
job rather than throwing an exception that I can catch.   Here is the 
abbreviated error message:

[n001:32127] *** An error occurred in MPI_Comm_spawn
[n001:32127] *** reported by process [1679884289,1]
[n001:32127] *** on communicator MPI_COMM_SELF
[n001:32127] *** MPI_ERR_SPAWN: could not spawn processes
[n001:32127] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
abort,
[n001:32127] ***and potentially your MPI job)
[n002.cluster.com:21375] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923
[n001.cluster.com:32115] PMIX ERROR: BAD-PARAM in file event/pmix_event_notification.c at line 923

When I compiled OpenMpi, I used the following flags:

./configure --prefix=/opt/openmpi_pgc --enable-mpi-cxx  --enable-cxx-exceptions 
 --with-tm

Is --enable-cxx-exceptions  not sufficient by itself to enable exceptions?  I’m 
using the Portland Group Compiler pgc++ 19.5-0 and OpenMpi 4.0.2.

You need to also set the MPI::ERRORS_THROW_EXCEPTIONS error handler in your MPI 
application.

--
Jeff Squyres
jsquy...@cisco.com



[OMPI users] OpenMpi not throwing C++ exceptions

2019-11-07 Thread Mccall, Kurt E. (MSFC-EV41) via users
My program is failing in MPI_Comm_spawn, but it seems to simply terminate the 
job rather than throwing an exception that I can catch.   Here is the 
abbreviated error message:

[n001:32127] *** An error occurred in MPI_Comm_spawn
[n001:32127] *** reported by process [1679884289,1]
[n001:32127] *** on communicator MPI_COMM_SELF
[n001:32127] *** MPI_ERR_SPAWN: could not spawn processes
[n001:32127] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
abort,
[n001:32127] ***and potentially your MPI job)
[n002.cluster.com:21375] PMIX ERROR: BAD-PARAM in file 
event/pmix_event_notification.c at line 923
[n001.cluster.com:32115] PMIX ERROR: BAD-PARAM in file 
event/pmix_event_notification.c at line 923

When I compiled OpenMpi, I used the following flags:

./configure --prefix=/opt/openmpi_pgc --enable-mpi-cxx  --enable-cxx-exceptions 
 --with-tm

Is --enable-cxx-exceptions  not sufficient by itself to enable exceptions?  I'm 
using the Portland Group Compiler pgc++ 19.5-0 and OpenMpi 4.0.2.




[OMPI users] MPI_Comm_spawn: no allocated resources for the application ...

2019-10-25 Thread Mccall, Kurt E. (MSFC-EV41) via users
I am trying to launch a number of manager processes, one per node, and then have
each of those managers spawn, on its own same node, a number of workers.   For 
this example,
I have 2 managers and 2 workers per manager.  I'm following the instructions at 
this link

https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn

to force one manager process per node.


Here is my PBS/Torque qsub command:

$ qsub -V -j oe -e ./stdio -o ./stdio -f -X -N MyManagerJob -l nodes=2:ppn=3  
MyManager.bash

I expect "-l nodes=2:ppn=3" to reserve 2 nodes with 3 slots on each (one slot 
for the manager and the other two for the separately spawned workers).  The 
first  argument
is a lower-case L, not a one.



Here is my mpiexec command within the MyManager.bash script.

mpiexec --enable-recovery --display-map --display-allocation --mca 
mpi_param_check 1 --v --x DISPLAY --np 2  --map-by ppr:1:node  MyManager.exe

I expect "--map-by ppr:1:node" to cause OpenMpi to launch exactly one manager 
on each node.



When the first worker is spawned via MPI_Comm_spawn(), OpenMpi reports:

==   ALLOCATED NODES   ==
n002: flags=0x11 slots=3 max_slots=0 slots_inuse=3 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
=
--
There are no allocated resources for the application:
  ./MyWorker
that match the requested mapping:
  -host: n001.cluster.com

Verify that you have mapped the allocated resources properly for the
indicated specification.
--
[n001:14883] *** An error occurred in MPI_Comm_spawn
[n001:14883] *** reported by process [1897594881,1]
[n001:14883] *** on communicator MPI_COMM_SELF
[n001:14883] *** MPI_ERR_SPAWN: could not spawn processes



In the banner above, it clearly states that node n001 has 3 slots reserved
and only one slot in use at the time of the spawn.  Not sure why it reports
that there are no resources for it.

I've tried compiling OpenMpi 4.0 both with and without Torque support, and
I've tried using a an explicit host file (or not), but the error is unchanged.
Any ideas?

My cluster is running CentOS 7.4 and I am using the Portland Group C++ compiler.


[OMPI users] MPI_Comm_Spawn failure: All nodes already filled

2019-08-07 Thread Mccall, Kurt E. (MSFC-EV41) via users
Ralph,

I got MPI_Comm_spawn to work by making sure that the hostfiles on the head 
(where mpiexec is called) and the remote node are identical.  I had assumed 
that only the one on the head node was read by OpenMPI.   Is this correct?

Thanks,
Kurt

From: Ralph Castain 

Subject: [EXTERNAL] Re: [OMPI users] MPI_Comm_Spawn failure: All nodes already 
filled

I'm afraid I cannot replicate this problem on OMPI master, so it could be 
something different about OMPI 4.0.1 or your environment. Can you download and 
test one of the nightly tarballs from the "master" branch and see if it works 
for you?

https://www.open-mpi.org/nightly/master/

Ralph



On Aug 6, 2019, at 3:58 AM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:

Hi,

MPI_Comm_spawn() is failing with the error message “All nodes which are 
allocated for this job are already filled”.   I compiled OpenMpi 4.0.1 with the 
Portland Group C++  compiler, v. 19.5.0, both with and without Torque/Maui 
support.   I thought that not using Torque/Maui support would give me finer 
control over where MPI_Comm_spawn() places the processes, but the failure 
message was the same in either case.  Perhaps Torque is interfering with 
process creation somehow?

For the pared-down test code, I am following the instructions here to make 
mpiexec create exactly one manager process on a remote node, and then forcing 
that manager to spawn one worker process on the same remote node:

https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn




Here is the full error message.   Note the Max Slots: 0 message therein (?):

Data for JOB [39020,1] offset 0 Total slots allocated 22

   JOB MAP   

Data for node: n001    Num slots: 2    Max slots: 2    Num procs: 1
Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: N/A

=
Data for JOB [39020,1] offset 0 Total slots allocated 22

   JOB MAP   

Data for node: n001    Num slots: 2    Max slots: 0    Num procs: 1
Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]

=
--
All nodes which are allocated for this job are already filled.
--
[n001:08114] *** An error occurred in MPI_Comm_spawn
[n001:08114] *** reported by process [2557214721,0]
[n001:08114] *** on communicator MPI_COMM_SELF
[n001:08114] *** MPI_ERR_SPAWN: could not spawn processes
[n001:08114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
abort,
[n001:08114] ***and potentially your MPI job)




Here is my mpiexec command:

mpiexec --display-map --v --x DISPLAY -hostfile MyNodeFile --np 1 -map-by 
ppr:1:node SpawnTestManager




Here is my hostfile “MyNodeFile”:

n001.cluster.com slots=2 max_slots=2




Here is my SpawnTestManager code:


#include <string>
#include <iostream>
#include <cstdio>

#ifdef SUCCESS
#undef SUCCESS
#endif
#include "/opt/openmpi_pgc_tm/include/mpi.h"

using std::string;
using std::cout;
using std::endl;

int main(int argc, char *argv[])
{
int rank, world_size;
char *argv2[2];
MPI_Comm mpi_comm;
MPI_Info info;
char host[MPI_MAX_PROCESSOR_NAME + 1];
int host_name_len;

string worker_cmd = "SpawnTestWorker";
string host_name = "n001.cluster.com";

argv2[0] = "dummy_arg";
argv2[1] = NULL;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

MPI_Get_processor_name(host, _na

Re: [OMPI users] [EXTERNAL] Re: MPI_Comm_Spawn failure: All nodes already filled

2019-08-07 Thread Mccall, Kurt E. (MSFC-EV41) via users
Ralph,

I downloaded and compiled the August 7 tarball with Torque/Maui support and ran 
my test program – resulting in the same error when MPI_Comm_spawn was called.  
Can you suggest anything the might be wrong with my environment or our Torque 
configuration that might be causing this?

Thanks,
Kurt

From: Ralph Castain 
Sent: Tuesday, August 6, 2019 1:53 PM
To: Open MPI Users 
Cc: Mccall, Kurt E. (MSFC-EV41) 
Subject: [EXTERNAL] Re: [OMPI users] MPI_Comm_Spawn failure: All nodes already 
filled

I'm afraid I cannot replicate this problem on OMPI master, so it could be 
something different about OMPI 4.0.1 or your environment. Can you download and 
test one of the nightly tarballs from the "master" branch and see if it works 
for you?

https://www.open-mpi.org/nightly/master/

Ralph



On Aug 6, 2019, at 3:58 AM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:

Hi,

MPI_Comm_spawn() is failing with the error message “All nodes which are 
allocated for this job are already filled”.   I compiled OpenMpi 4.0.1 with the 
Portland Group C++  compiler, v. 19.5.0, both with and without Torque/Maui 
support.   I thought that not using Torque/Maui support would give me finer 
control over where MPI_Comm_spawn() places the processes, but the failure 
message was the same in either case.  Perhaps Torque is interfering with 
process creation somehow?

For the pared-down test code, I am following the instructions here to make 
mpiexec create exactly one manager process on a remote node, and then forcing 
that manager to spawn one worker process on the same remote node:

https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn




Here is the full error message.   Note the Max Slots: 0 message therein (?):

Data for JOB [39020,1] offset 0 Total slots allocated 22

   JOB MAP   

Data for node: n001    Num slots: 2    Max slots: 2    Num procs: 1
Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: N/A

=
Data for JOB [39020,1] offset 0 Total slots allocated 22

   JOB MAP   

Data for node: n001    Num slots: 2    Max slots: 0    Num procs: 1
Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]

=
--
All nodes which are allocated for this job are already filled.
--
[n001:08114] *** An error occurred in MPI_Comm_spawn
[n001:08114] *** reported by process [2557214721,0]
[n001:08114] *** on communicator MPI_COMM_SELF
[n001:08114] *** MPI_ERR_SPAWN: could not spawn processes
[n001:08114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
abort,
[n001:08114] ***and potentially your MPI job)




Here is my mpiexec command:

mpiexec --display-map --v --x DISPLAY -hostfile MyNodeFile --np 1 -map-by 
ppr:1:node SpawnTestManager




Here is my hostfile “MyNodeFile”:

n001.cluster.com slots=2 max_slots=2




Here is my SpawnTestManager code:


#include <string>
#include <iostream>
#include <cstdio>

#ifdef SUCCESS
#undef SUCCESS
#endif
#include "/opt/openmpi_pgc_tm/include/mpi.h"

using std::string;
using std::cout;
using std::endl;

int main(int argc, char *argv[])
{
int rank, world_size;
char *argv2[2];
MPI_Comm mpi_comm;
MPI_Info info;
char host[MPI_MAX_PROCESSOR_NAME + 1];
int host_name_len;

string worker_cmd = "SpawnTestWorker";
string host_name = "n001.cluster.com";

argv2[0] = "dummy_arg";
argv2[1] =

[OMPI users] MPI_Comm_Spawn failure: All nodes already filled

2019-08-06 Thread Mccall, Kurt E. (MSFC-EV41) via users
Hi,

MPI_Comm_spawn() is failing with the error message "All nodes which are 
allocated for this job are already filled".   I compiled OpenMpi 4.0.1 with the 
Portland Group C++  compiler, v. 19.5.0, both with and without Torque/Maui 
support.   I thought that not using Torque/Maui support would give me finer 
control over where MPI_Comm_spawn() places the processes, but the failure 
message was the same in either case.  Perhaps Torque is interfering with 
process creation somehow?

For the pared-down test code, I am following the instructions here to make 
mpiexec create exactly one manager process on a remote node, and then forcing 
that manager to spawn one worker process on the same remote node:

https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn




Here is the full error message.   Note the Max Slots: 0 message therein (?):

Data for JOB [39020,1] offset 0 Total slots allocated 22

   JOB MAP   

Data for node: n001    Num slots: 2    Max slots: 2    Num procs: 1
Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: N/A

=
Data for JOB [39020,1] offset 0 Total slots allocated 22

   JOB MAP   

Data for node: n001    Num slots: 2    Max slots: 0    Num procs: 1
Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]

=
--
All nodes which are allocated for this job are already filled.
--
[n001:08114] *** An error occurred in MPI_Comm_spawn
[n001:08114] *** reported by process [2557214721,0]
[n001:08114] *** on communicator MPI_COMM_SELF
[n001:08114] *** MPI_ERR_SPAWN: could not spawn processes
[n001:08114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
abort,
[n001:08114] ***and potentially your MPI job)




Here is my mpiexec command:

mpiexec --display-map --v --x DISPLAY -hostfile MyNodeFile --np 1 -map-by 
ppr:1:node SpawnTestManager




Here is my hostfile "MyNodeFile":

n001.cluster.com slots=2 max_slots=2




Here is my SpawnTestManager code:


#include <string>
#include <iostream>
#include <cstdio>

#ifdef SUCCESS
#undef SUCCESS
#endif
#include "/opt/openmpi_pgc_tm/include/mpi.h"

using std::string;
using std::cout;
using std::endl;

int main(int argc, char *argv[])
{
int rank, world_size;
char *argv2[2];
MPI_Comm mpi_comm;
MPI_Info info;
char host[MPI_MAX_PROCESSOR_NAME + 1];
int host_name_len;

string worker_cmd = "SpawnTestWorker";
string host_name = "n001.cluster.com";

argv2[0] = "dummy_arg";
argv2[1] = NULL;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

MPI_Get_processor_name(host, &host_name_len);
cout << "Host name from MPI_Get_processor_name is " << host << endl;

   char info_str[64];
sprintf(info_str, "ppr:%d:node", 1);
MPI_Info_create(&info);
MPI_Info_set(info, "host", host_name.c_str());
MPI_Info_set(info, "map-by", info_str);

MPI_Comm_spawn(worker_cmd.c_str(), argv2, 1, info, rank, MPI_COMM_SELF,
&mpi_comm, MPI_ERRCODES_IGNORE);
MPI_Comm_set_errhandler(mpi_comm, MPI::ERRORS_THROW_EXCEPTIONS);

std::cout << "Manager success!" << std::endl;

MPI_Finalize();
return 0;
}




Here is my SpawnTestWorker code:


#include "/opt/openmpi_pgc_tm/include/mpi.h"
#include <iostream>

int main(int argc, char *argv[])
{
int world_size, rank;
MPI_Comm manager_intercom;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

MPI_Comm_get_parent(&manager_intercom);
MPI_Comm_set_errhandler(manager_intercom, MPI::ERRORS_THROW_EXCEPTIONS);

std::cout << "Worker success!" << std::endl;

MPI_Finalize();
return 0;
}


My config.log can be found here:  
https://gist.github.com/kmccall882/e26bc2ea58c9328162e8959b614a6fce.js

I've attached the other info requested at on the help page, except the output 
of "ompi_info -v ompi full --parsable".   My version of ompi_info doesn't 
accept the "ompi full" arguments, and the "-all" arg doesn't produce much 
output.

Thanks for your help,
Kurt








