I'm using Open MPI 4.1.2 under Slurm 20.11.8. My 2-process job launches successfully, but when the main process (rank 0) attempts to create an intercommunicator with rank 1 on the other node:
    MPI_Comm intercom;
    MPI_Intercomm_create(MPI_COMM_SELF, 0, MPI_COMM_WORLD, 1, <tag>, &intercom);

Open MPI spins deep inside the MPI_Intercomm_create code, and the following warning is reported:

    WARNING: Open MPI accepted a TCP connection from what appears to be a
    another Open MPI process but cannot find a corresponding process entry
    for that peer.

    This attempted connection will be ignored; your MPI job may or may not
    continue properly.

The output produced with the mpirun arguments "--mca ras_base_verbose 5 --display-devel-map --mca rmaps_base_verbose 5" is attached below. Any help would be appreciated.
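For reference, here is a minimal sketch of the call pattern as I understand it, simplified from my real code: the tag value 99 is just a placeholder for the actual <tag>, and rank 1 makes the mirror-image call so the two MPI_Intercomm_create calls can match.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank uses MPI_COMM_SELF as its local communicator and
           names the other world rank as the remote leader; both sides
           must pass the same tag over the peer communicator
           (MPI_COMM_WORLD here). */
        int remote_leader = (rank == 0) ? 1 : 0;

        MPI_Comm intercom;
        MPI_Intercomm_create(MPI_COMM_SELF, 0,              /* local comm, local leader */
                             MPI_COMM_WORLD, remote_leader, /* peer comm, remote leader */
                             99,                            /* placeholder for <tag> */
                             &intercom);

        int remote_size;
        MPI_Comm_remote_size(intercom, &remote_size);
        printf("rank %d: intercomm created, remote group size %d\n", rank, remote_size);

        MPI_Comm_free(&intercom);
        MPI_Finalize();
        return 0;
    }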
SLURM_JOB_NODELIST = n[001-002]
Calling mpirun for slurm num_proc = 2

[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm: available for selection
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base set policy with ppr:1:node device NONNULL
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base policy ppr modifiers 1:node provided
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available component mindist
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [mindist]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available component ppr
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [ppr]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available component rank_file
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [rank_file]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available component resilient
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [resilient]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available component round_robin
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [round_robin]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available component seq
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [seq]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0]: Final mapper priorities
[n001.cluster.pssclabs.com:3473322]     Mapper: ppr Priority: 90
[n001.cluster.pssclabs.com:3473322]     Mapper: seq Priority: 60
[n001.cluster.pssclabs.com:3473322]     Mapper: resilient Priority: 40
[n001.cluster.pssclabs.com:3473322]     Mapper: mindist Priority: 20
[n001.cluster.pssclabs.com:3473322]     Mapper: round_robin Priority: 10
[n001.cluster.pssclabs.com:3473322]     Mapper: rank_file Priority: 0
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base set policy with ppr:1:node device NULL
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base policy ppr modifiers 1:node provided
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:allocate
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: checking nodelist: n[001-002]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: parse range 001-002 (2)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: adding node n001 (24 slots)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: adding node n002 (24 slots)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate: success
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert inserting 2 nodes
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert updating HNP [n001] info to 24 slots
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert node n002 slots 24

======================   ALLOCATED NODES   ======================
    n001: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
    n002: flags=0x10 slots=24 max_slots=0 slots_inuse=0 state=UP
=================================================================

======================   ALLOCATED NODES   ======================
    n001: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
    n002: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
=================================================================

[n001.cluster.pssclabs.com:3473322] mca:rmaps: mapping job [65186,1]
[n001.cluster.pssclabs.com:3473322] mca:rmaps: setting mapping policies for job [65186,1] nprocs 2
[n001.cluster.pssclabs.com:3473322] mca:rmaps[303] binding not given - using bycore
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: mapping job [65186,1] with ppr 1:node
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: job [65186,1] assigned policy BYNODE:NOOVERSUBSCRIBE
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting with 2 nodes in list
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Filtering thru apps
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Retained 2 nodes in list
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] node n001 has 24 slots available
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] node n002 has 24 slots available
[n001.cluster.pssclabs.com:3473322] AVAILABLE NODES FOR MAPPING:
[n001.cluster.pssclabs.com:3473322]     node: n001 daemon: 0
[n001.cluster.pssclabs.com:3473322]     node: n002 daemon: 1
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting bookmark at node n001
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting at node n001
[n001.cluster.pssclabs.com:3473322] mca:rmaps: assigning locations for job [65186,1]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: assigning locations for job [65186,1] with ppr 1:node policy BYNODE:NOOVERSUBSCRIBE
[n001.cluster.pssclabs.com:3473322] RANKING POLICY: SLOT
[n001.cluster.pssclabs.com:3473322] mca:rmaps:base: computing vpids by slot for job [65186,1]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:base: assigning rank 0 to node n001
[n001.cluster.pssclabs.com:3473322] mca:rmaps:base: assigning rank 1 to node n002
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base:compute_usage
[n001.cluster.pssclabs.com:3473322] mca:rmaps: compute bindings for job [65186,1] with policy CORE:IF-SUPPORTED[1008]
[n001.cluster.pssclabs.com:3473322] mca:rmaps: computing bindings for job [65186,1]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] bind_depth: 6
[n001.cluster.pssclabs.com:3473322] mca:rmaps: bind downward for job [65186,1] with bindings CORE:IF-SUPPORTED
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] GOT 1 CPUS
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] PROC [[65186,1],0] BITMAP 0
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] BOUND PROC [[65186,1],0][n001] TO socket 0[core 0[hwt 0]]: [B/././././././././././.][./././././././././././.]

Data for JOB [65186,1] offset 0
Total slots allocated 48
    Mapper requested: NULL  Last mapper: ppr  Mapping policy: BYNODE:NOOVERSUBSCRIBE  Ranking policy: SLOT
    Binding policy: CORE:IF-SUPPORTED  Cpu set: NULL  PPR: 1:node  Cpus-per-rank: 0
    Num new daemons: 0  New daemon starting vpid INVALID
    Num nodes: 2

Data for node: n001    State: 3    Flags: 11
    Daemon: [[65186,0],0]    Daemon launched: True
    Num slots: 24    Slots in use: 1    Oversubscribed: FALSE
    Num slots allocated: 24    Max slots: 0
    Num procs: 1    Next node_rank: 1
    Data for proc: [[65186,1],0]
        Pid: 0    Local rank: 0    Node rank: 0    App rank: 0
        State: INITIALIZED    App_context: 0
        Locale: NODE
        Binding: [B/././././././././././.][./././././././././././.]
Data for node: n002    State: 3    Flags: 11
    Daemon: [[65186,0],1]    Daemon launched: True
    Num slots: 24    Slots in use: 1    Oversubscribed: FALSE
    Num slots allocated: 24    Max slots: 0
    Num procs: 1    Next node_rank: 1
    Data for proc: [[65186,1],1]
        Pid: 0    Local rank: 0    Node rank: 0    App rank: 1
        State: INITIALIZED    App_context: 0
        Locale: NODE
        Binding: UNBOUND

MASTER PID 3473332 on n001.cluster.pssclabs.com ready for attach

ARGS passed to MavMpiManager = mpi/MavMpiMM -job_id 103 -head_node rocci.ndc.nasa.gov -works_per_man 23 -manager_cmd mpi/MavMpiMM -worker_cmd /home/kmccall/mav/9.82_mpi_slurm/mav -seed_start 1 -seed_end 50 -seed_file_out /home/kmccall/.mavmpi/seedfile_103.txt -control_dir /home/kmccall/.mavmpi -info_file /home/kmccall/.mavmpi/job_info_103.txt -control_file /home/kmccall/.mavmpi/cntrl_103.txt -epoch 1647464003 -month 2 -u 6 -c 2 -s -f

ARGS passed to MavMpiMaster = mpi/MavMpiMM -job_id 103 -head_node rocci.ndc.nasa.gov -works_per_man 23 -manager_cmd mpi/MavMpiMM -worker_cmd /home/kmccall/mav/9.82_mpi_slurm/mav -seed_start 1 -seed_end 50 -seed_file_out /home/kmccall/.mavmpi/seedfile_103.txt -control_dir /home/kmccall/.mavmpi -info_file /home/kmccall/.mavmpi/job_info_103.txt -control_file /home/kmccall/.mavmpi/cntrl_103.txt -epoch 1647464003 -month 2 -u 6 -c 2 -s -f

======================   ALLOCATED NODES   ======================
    n001: flags=0x11 slots=24 max_slots=0 slots_inuse=1 state=UP
    n002: flags=0x11 slots=24 max_slots=0 slots_inuse=1 state=UP
=================================================================

[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:allocate
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:allocate allocation already read

======================   ALLOCATED NODES   ======================
    n001: flags=0x11 slots=24 max_slots=0 slots_inuse=1 state=UP
    n002: flags=0x11 slots=24 max_slots=0 slots_inuse=1 state=UP
=================================================================

[n001.cluster.pssclabs.com:3473322] mca:rmaps: mapping job [65186,2]
[n001.cluster.pssclabs.com:3473322] mca:rmaps: dynamic job [65186,2] will not inherit launch directives
[n001.cluster.pssclabs.com:3473322] mca:rmaps: setting mapping policies for job [65186,2] nprocs 1
[n001.cluster.pssclabs.com:3473322] mca:rmaps[186] mapping not given - using bycore
[n001.cluster.pssclabs.com:3473322] mca:rmaps[332] binding not given - using bycore
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: job [65186,2] not using ppr mapper PPR NULL policy PPR NOTSET
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:seq called on job [65186,2]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:seq: job [65186,2] not using seq mapper
[n001.cluster.pssclabs.com:3473322] mca:rmaps:resilient: cannot perform initial map of job [65186,2] - no fault groups
[n001.cluster.pssclabs.com:3473322] mca:rmaps:mindist: job [65186,2] not using mindist mapper
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: mapping job [65186,2]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] using dash_host n001:24
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] dashhost: parsing args n001:24
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] dashhost: working node n001:24
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] dashhost: added node n001 to list - slots 24
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] dashhost: adding node n001 with 24 slots to final list
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] node n001 has 23 slots available
[n001.cluster.pssclabs.com:3473322] AVAILABLE NODES FOR MAPPING:
[n001.cluster.pssclabs.com:3473322]     node: n001 daemon: 0
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting bookmark at node n001
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting at node n001
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: mapping no-span by Core for job [65186,2] slots 23 num_procs 1
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: found 24 Core objects on node n001
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: calculated nprocs 23
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: assigning nprocs 23
[n001.cluster.pssclabs.com:3473322] mca:rmaps: assigning locations for job [65186,2]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: job [65186,2] not using ppr assign: round_robin
[n001.cluster.pssclabs.com:3473322] mca:rmaps:resilient: job [65186,2] not using resilient assign: round_robin
[n001.cluster.pssclabs.com:3473322] mca:rmaps:mindist: job [65186,2] not using mindist mapper
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: assign locations for job [65186,2]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: assigning locations by Core for job [65186,2]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr: found 24 Core objects on node n001
[n001.cluster.pssclabs.com:3473322] mca:rmaps:rr:assign skipping proc [[65186,1],0] - from another job
[n001.cluster.pssclabs.com:3473322] RANKING POLICY: SLOT
[n001.cluster.pssclabs.com:3473322] mca:rmaps:base: computing vpids by slot for job [65186,2]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:base: assigning rank 0 to node n001
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base:compute_usage
[n001.cluster.pssclabs.com:3473322] mca:rmaps: compute bindings for job [65186,2] with policy CORE:IF-SUPPORTED[1008]
[n001.cluster.pssclabs.com:3473322] mca:rmaps: bindings for job [65186,2] - bind in place
[n001.cluster.pssclabs.com:3473322] mca:rmaps: bind in place for job [65186,2] with bindings CORE:IF-SUPPORTED
[n001.cluster.pssclabs.com:3473322] BINDING PROC [[65186,2],0] TO Core NUMBER 2
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] BOUND PROC [[65186,2],0] TO 2[Core:2] on node n001

MASTER 3473332 n001.cluster.pssclabs.com 0 t = 59.4706: spawn new manager
MPI_Comm_spawn success

ARGS passed to MavMpiManager = mpi/MavMpiMM -job_id 103 -head_node rocci.ndc.nasa.gov -works_per_man 22 -manager_cmd mpi/MavMpiMM -worker_cmd /home/kmccall/mav/9.82_mpi_slurm/mav -seed_start 1 -seed_end 50 -seed_file_out /home/kmccall/.mavmpi/seedfile_103.txt -control_dir /home/kmccall/.mavmpi -info_file /home/kmccall/.mavmpi/job_info_103.txt -control_file /home/kmccall/.mavmpi/cntrl_103.txt -epoch 1647464003 -month 2 -u 6 -c 2 -s -f -spawn_manager

MASTER 3473332 n001.cluster.pssclabs.com 0 t = 134.707: spawned new manager on mother superior

--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process entry
for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: n001
  PID:        3473332
--------------------------------------------------------------------------
[n001.cluster.pssclabs.com:3473322] 1 more process has sent help message help-mpi-btl-tcp.txt / server accept cannot find guid
[n001.cluster.pssclabs.com:3473322] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
slurmstepd: error: *** JOB 224103 ON n001 CANCELLED AT 2022-03-16T15:56:44 ***