I'm using Open MPI 4.1.2 under Slurm 20.11.8.  My 2-process job launches 
successfully, but when the main process (rank 0)
attempts to create an intercommunicator with the rank 1 process on the other node:

MPI_Comm intercom;
MPI_Intercomm_create(MPI_COMM_SELF, 0, MPI_COMM_WORLD, 1, <tag>, &intercom);

Open MPI spins deep inside the MPI_Intercomm_create code, and the following 
error is reported:

WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.
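
In case it helps, here is a minimal self-contained version of the call 
pattern I'm using (the tag value 1000 is a placeholder for the one I 
actually pass, and my real code does more work before freeing the 
intercommunicator):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Comm intercom;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Local group is just this process (MPI_COMM_SELF, leader 0);
           the remote leader is world rank 1 on the other node. */
        MPI_Intercomm_create(MPI_COMM_SELF, 0, MPI_COMM_WORLD, 1,
                             1000 /* placeholder tag */, &intercom);
        MPI_Comm_free(&intercom);
    } else if (rank == 1) {
        /* Mirror-image call: the remote leader is world rank 0. */
        MPI_Intercomm_create(MPI_COMM_SELF, 0, MPI_COMM_WORLD, 0,
                             1000 /* placeholder tag */, &intercom);
        MPI_Comm_free(&intercom);
    }

    MPI_Finalize();
    return 0;
}

Both sides supply the same tag and name the other side's world rank as the 
remote leader, which I understand to be the standard way to join two 
single-process groups into an intercommunicator.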

The output produced by running mpirun with the arguments "--mca ras_base_verbose 5 
--display-devel-map --mca rmaps_base_verbose 5" is attached below.
Any help would be appreciated.
SLURM_JOB_NODELIST =  n[001-002]
Calling mpirun for slurm
num_proc =  2
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm: available for 
selection
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base set policy with 
ppr:1:node device NONNULL
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base policy ppr 
modifiers 1:node provided
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component mindist
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component 
[mindist]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component ppr
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [ppr]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component rank_file
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component 
[rank_file]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component resilient
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component 
[resilient]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component round_robin
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component 
[round_robin]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component seq
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [seq]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0]: Final mapper priorities
[n001.cluster.pssclabs.com:3473322]     Mapper: ppr Priority: 90
[n001.cluster.pssclabs.com:3473322]     Mapper: seq Priority: 60
[n001.cluster.pssclabs.com:3473322]     Mapper: resilient Priority: 40
[n001.cluster.pssclabs.com:3473322]     Mapper: mindist Priority: 20
[n001.cluster.pssclabs.com:3473322]     Mapper: round_robin Priority: 10
[n001.cluster.pssclabs.com:3473322]     Mapper: rank_file Priority: 0
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base set policy with 
ppr:1:node device NULL
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base policy ppr 
modifiers 1:node provided
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:allocate
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: 
checking nodelist: n[001-002]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: 
parse range 001-002 (2)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: 
adding node n001 (24 slots)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: 
adding node n002 (24 slots)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate: success
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert 
inserting 2 nodes
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert updating 
HNP [n001] info to 24 slots
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert node 
n002 slots 24

======================   ALLOCATED NODES   ======================
        n001: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
        n002: flags=0x10 slots=24 max_slots=0 slots_inuse=0 state=UP
=================================================================

======================   ALLOCATED NODES   ======================
        n001: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
        n002: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
=================================================================
[n001.cluster.pssclabs.com:3473322] mca:rmaps: mapping job [65186,1]
[n001.cluster.pssclabs.com:3473322] mca:rmaps: setting mapping policies for job 
[65186,1] nprocs 2
[n001.cluster.pssclabs.com:3473322] mca:rmaps[303] binding not given - using 
bycore
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: mapping job [65186,1] with 
ppr 1:node
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: job [65186,1] assigned 
policy BYNODE:NOOVERSUBSCRIBE
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting with 2 nodes in list
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Filtering thru apps
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Retained 2 nodes in list
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] node n001 has 24 slots 
available
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] node n002 has 24 slots 
available
[n001.cluster.pssclabs.com:3473322] AVAILABLE NODES FOR MAPPING:
[n001.cluster.pssclabs.com:3473322]     node: n001 daemon: 0
[n001.cluster.pssclabs.com:3473322]     node: n002 daemon: 1
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting bookmark at node n001
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting at node n001
[n001.cluster.pssclabs.com:3473322] mca:rmaps: assigning locations for job 
[65186,1]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: assigning locations for job 
[65186,1] with ppr 1:node policy BYNODE:NOOVERSUBSCRIBE
[n001.cluster.pssclabs.com:3473322] RANKING POLICY: SLOT
[n001.cluster.pssclabs.com:3473322] mca:rmaps:base: computing vpids by slot for 
job [65186,1]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:base: assigning rank 0 to node 
n001
[n001.cluster.pssclabs.com:3473322] mca:rmaps:base: assigning rank 1 to node 
n002
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base:compute_usage
[n001.cluster.pssclabs.com:3473322] mca:rmaps: compute bindings for job 
[65186,1] with policy CORE:IF-SUPPORTED[1008]
[n001.cluster.pssclabs.com:3473322] mca:rmaps: computing bindings for job 
[65186,1]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] bind_depth: 6
[n001.cluster.pssclabs.com:3473322] mca:rmaps: bind downward for job [65186,1] 
with bindings CORE:IF-SUPPORTED
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] GOT 1 CPUS
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] PROC [[65186,1],0] BITMAP 0
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] BOUND PROC 
[[65186,1],0][n001] TO socket 0[core 0[hwt 0]]: 
[B/././././././././././.][./././././././././././.]
 Data for JOB [65186,1] offset 0 Total slots allocated 48

 Mapper requested: NULL  Last mapper: ppr  Mapping policy: 
BYNODE:NOOVERSUBSCRIBE  Ranking policy: SLOT
 Binding policy: CORE:IF-SUPPORTED  Cpu set: NULL  PPR: 1:node  Cpus-per-rank: 0
        Num new daemons: 0      New daemon starting vpid INVALID
        Num nodes: 2

 Data for node: n001    State: 3        Flags: 11
        Daemon: [[65186,0],0]   Daemon launched: True
        Num slots: 24   Slots in use: 1 Oversubscribed: FALSE
        Num slots allocated: 24 Max slots: 0
        Num procs: 1    Next node_rank: 1
        Data for proc: [[65186,1],0]
                Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
                State: INITIALIZED      App_context: 0
                Locale:  NODE
                Binding: [B/././././././././././.][./././././././././././.]

 Data for node: n002    State: 3        Flags: 11
        Daemon: [[65186,0],1]   Daemon launched: True
        Num slots: 24   Slots in use: 1 Oversubscribed: FALSE
        Num slots allocated: 24 Max slots: 0
        Num procs: 1    Next node_rank: 1
        Data for proc: [[65186,1],1]
                Pid: 0  Local rank: 0   Node rank: 0    App rank: 1
                State: INITIALIZED      App_context: 0
                Locale:  NODE
                Binding: UNBOUND
MASTER PID 3473332 on n001.cluster.pssclabs.com ready for attach
ARGS passed to MavMpiManager = mpi/MavMpiMM -job_id 103 -head_node 
rocci.ndc.nasa.gov -works_per_man 23 -manager_cmd mpi/MavMpiMM -worker_cmd 
/home/kmccall/mav/9.82_mpi_slurm/mav -seed_start 1 -seed_end 50 -seed_file_out 
/home/kmccall/.mavmpi/seedfile_103.txt -control_dir /home/kmccall/.mavmpi 
-info_file /home/kmccall/.mavmpi/job_info_103.txt -control_file 
/home/kmccall/.mavmpi/cntrl_103.txt -epoch 1647464003 -month 2 -u 6 -c 2 -s -f 
ARGS passed to MavMpiMaster = mpi/MavMpiMM -job_id 103 -head_node 
rocci.ndc.nasa.gov -works_per_man 23 -manager_cmd mpi/MavMpiMM -worker_cmd 
/home/kmccall/mav/9.82_mpi_slurm/mav -seed_start 1 -seed_end 50 -seed_file_out 
/home/kmccall/.mavmpi/seedfile_103.txt -control_dir /home/kmccall/.mavmpi 
-info_file /home/kmccall/.mavmpi/job_info_103.txt -control_file 
/home/kmccall/.mavmpi/cntrl_103.txt -epoch 1647464003 -month 2 -u 6 -c 2 -s -f 

======================   ALLOCATED NODES   ======================
        n001: flags=0x11 slots=24 max_slots=0 slots_inuse=1 state=UP
        n002: flags=0x11 slots=24 max_slots=0 slots_inuse=1 state=UP
=================================================================
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:allocate
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:allocate allocation 
already read

======================   ALLOCATED NODES   ======================
        n001: flags=0x11 slots=24 max_slots=0 slots_inuse=1 state=UP
        n002: flags=0x11 slots=24 max_slots=0 slots_inuse=1 state=UP
=================================================================
[n001.cluster.pssclabs.com:3473322] mca:rmaps: mapping job [65186,2]
[n001.cluster.pssclabs.com:3473322] mca:rmaps: dynamic job [65186,2] will not 
inherit launch directives
[n001.cluster.pssclabs.com:3473322] mca:rmaps: setting mapping policies for job 
[65186,2] nprocs 1
[n001.cluster.pssclabs.com:3473322] mca:rmaps[186] mapping not given - using 
bycore
[n001.cluster.pssclabs.com:3473322] mca:rmaps[332] binding not given - using 
bycore
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: job [65186,2] not using ppr 
mapper PPR NULL policy PPR NOTSET
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:seq called on job 
[65186,2]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:seq: job [65186,2] not using seq 
mapper
[n001.cluster.pssclabs.com:3473322] mca:rmaps:resilient: cannot perform initial 
map of job [65186,2] - no fault groups
[n001.cluster.pssclabs.com:3473322] mca:rmaps:mindist: job [65186,2] not using 
mindist mapper
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: mapping job [65186,2]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] using dash_host n001:24
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] dashhost: parsing args n001:24
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] dashhost: working node n001:24
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] dashhost: added node n001 to 
list - slots 24
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] dashhost: adding node n001 
with 24 slots to final list
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] node n001 has 23 slots 
available
[n001.cluster.pssclabs.com:3473322] AVAILABLE NODES FOR MAPPING:
[n001.cluster.pssclabs.com:3473322]     node: n001 daemon: 0
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting bookmark at node n001
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting at node n001
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: mapping no-span by Core for 
job [65186,2] slots 23 num_procs 1
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: found 24 Core objects on node 
n001
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: calculated nprocs 23
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: assigning nprocs 23
[n001.cluster.pssclabs.com:3473322] mca:rmaps: assigning locations for job 
[65186,2]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: job [65186,2] not using ppr 
assign: round_robin
[n001.cluster.pssclabs.com:3473322] mca:rmaps:resilient: job [65186,2] not 
using resilient assign: round_robin
[n001.cluster.pssclabs.com:3473322] mca:rmaps:mindist: job [65186,2] not using 
mindist mapper
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: assign locations for job 
[65186,2]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: assigning locations by Core 
for job [65186,2]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: found 24 Core objects on node 
n001
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr:assign skipping proc 
[[65186,1],0] - from another job
[n001.cluster.pssclabs.com:3473322] RANKING POLICY: SLOT
[n001.cluster.pssclabs.com:3473322] mca:rmaps:base: computing vpids by slot for 
job [65186,2]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:base: assigning rank 0 to node 
n001
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base:compute_usage
[n001.cluster.pssclabs.com:3473322] mca:rmaps: compute bindings for job 
[65186,2] with policy CORE:IF-SUPPORTED[1008]
[n001.cluster.pssclabs.com:3473322] mca:rmaps: bindings for job [65186,2] - 
bind in place
[n001.cluster.pssclabs.com:3473322] mca:rmaps: bind in place for job [65186,2] 
with bindings CORE:IF-SUPPORTED
[n001.cluster.pssclabs.com:3473322] BINDING PROC [[65186,2],0] TO Core NUMBER 2
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] BOUND PROC [[65186,2],0] TO 
2[Core:2] on node n001
MASTER 3473332 n001.cluster.pssclabs.com 0  t = 59.4706: spawn new manager 
MPI_Comm_spawn success
ARGS passed to MavMpiManager = mpi/MavMpiMM -job_id 103 -head_node 
rocci.ndc.nasa.gov -works_per_man 22 -manager_cmd mpi/MavMpiMM -worker_cmd 
/home/kmccall/mav/9.82_mpi_slurm/mav -seed_start 1 -seed_end 50 -seed_file_out 
/home/kmccall/.mavmpi/seedfile_103.txt -control_dir /home/kmccall/.mavmpi 
-info_file /home/kmccall/.mavmpi/job_info_103.txt -control_file 
/home/kmccall/.mavmpi/cntrl_103.txt -epoch 1647464003 -month 2 -u 6 -c 2 -s -f 
-spawn_manager 
MASTER 3473332 n001.cluster.pssclabs.com 0  t = 134.707: spawned new manager on 
mother superior 
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: n001
  PID:        3473332
--------------------------------------------------------------------------
[n001.cluster.pssclabs.com:3473322] 1 more process has sent help message 
help-mpi-btl-tcp.txt / server accept cannot find guid
[n001.cluster.pssclabs.com:3473322] Set MCA parameter 
"orte_base_help_aggregate" to 0 to see all help / error messages
slurmstepd: error: *** JOB 224103 ON n001 CANCELLED AT 2022-03-16T15:56:44 ***
