[OMPI users] mpi program gets stuck
See also: https://pastebin.com/s5tjaUkF

(py3.9) ➜ /share cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1

1. This command now runs correctly using your openmpi-gitclone-pr11096.tar.bz2:

(py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime

2. But this command gets stuck. It seems to be the MPI program that gets stuck.

test.py:

import mpi4py
from mpi4py import MPI

(py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 python test.py
[computer01:47982] mca: base: component_find: searching NULL for plm components
[computer01:47982] mca: base: find_dyn_components: checking NULL for plm components
[computer01:47982] pmix:mca: base: components_register: registering framework plm components
[computer01:47982] pmix:mca: base: components_register: found loaded component slurm
[computer01:47982] pmix:mca: base: components_register: component slurm register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component ssh
[computer01:47982] pmix:mca: base: components_register: component ssh register function successful
[computer01:47982] mca: base: components_open: opening plm components
[computer01:47982] mca: base: components_open: found loaded component slurm
[computer01:47982] mca: base: components_open: component slurm open function successful
[computer01:47982] mca: base: components_open: found loaded component ssh
[computer01:47982] mca: base: components_open: component ssh open function successful
[computer01:47982] mca:base:select: Auto-selecting plm components
[computer01:47982] mca:base:select:( plm) Querying component [slurm]
[computer01:47982] mca:base:select:( plm) Querying component [ssh]
[computer01:47982] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[computer01:47982] mca:base:select:( plm) Query of component [ssh] set priority to 10
[computer01:47982] mca:base:select:( plm) Selected component [ssh]
[computer01:47982] mca: base: close: component slurm closed
[computer01:47982] mca: base: close: unloading component slurm
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive start comm
[computer01:47982] mca: base: component_find: searching NULL for ras components
[computer01:47982] mca: base: find_dyn_components: checking NULL for ras components
[computer01:47982] pmix:mca: base: components_register: registering framework ras components
[computer01:47982] pmix:mca: base: components_register: found loaded component simulator
[computer01:47982] pmix:mca: base: components_register: component simulator register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component pbs
[computer01:47982] pmix:mca: base: components_register: component pbs register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component slurm
[computer01:47982] pmix:mca: base: components_register: component slurm register function successful
[computer01:47982] mca: base: components_open: opening ras components
[computer01:47982] mca: base: components_open: found loaded component simulator
[computer01:47982] mca: base: components_open: found loaded component pbs
[computer01:47982] mca: base: components_open: component pbs open function successful
[computer01:47982] mca: base: components_open: found loaded component slurm
[computer01:47982] mca: base: components_open: component slurm open function successful
[computer01:47982] mca:base:select: Auto-selecting ras components
[computer01:47982] mca:base:select:( ras) Querying component [simulator]
[computer01:47982] mca:base:select:( ras) Querying component [pbs]
[computer01:47982] mca:base:select:( ras) Querying component [slurm]
[computer01:47982] mca:base:select:( ras) No component selected!
[computer01:47982] mca: base: component_find: searching NULL for rmaps components
[computer01:47982] mca: base: find_dyn_components: checking NULL for rmaps components
[computer01:47982] pmix:mca: base: components_register: registering framework rmaps components
[computer01:47982] pmix:mca: base: components_register: found loaded component ppr
[computer01:47982] pmix:mca: base: components_register: component ppr register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component rank_file
[computer01:47982] pmix:mca: base: components_register: component rank_file has no register or open function
[computer01:47982] pmix:mca: base: components_register: found loaded component round_robin
[computer01:47982] pmix:mca: base: components_register: component round_robin register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded component seq
[computer01:47982] pmix:mca: base: components_register: component seq
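One way to narrow down where the hang occurs is to instrument the test script. This is only a sketch (the extra prints and the test_debug.py name are additions for debugging, not part of the original report); if nothing prints after the import line, the hang is inside MPI initialization rather than in the Python code:

test_debug.py:

print("before importing mpi4py", flush=True)
import mpi4py
from mpi4py import MPI  # mpi4py initializes MPI during this import; a hang here points at MPI_Init

print("after import", flush=True)
comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()}", flush=True)

Run it with the same mpirun line as above, substituting test_debug.py for test.py.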
Re: [OMPI users] users Digest, Vol 4818, Issue 1
aliases: 192.168.180.48
hepslustretest03: slots=1 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: 192.168.60.203,hepslustretest03.ihep.ac.cn,172.17.180.203,172.168.10.23,172.168.10.143
=================================

[computer01:39342] mca:rmaps: mapping job prterun-computer01-39342@1
[computer01:39342] mca:rmaps: setting mapping policies for job prterun-computer01-39342@1 inherit TRUE hwtcpus FALSE
[computer01:39342] mca:rmaps[358] mapping not given - using bycore
[computer01:39342] setdefaultbinding[365] binding not given - using bycore
[computer01:39342] mca:rmaps:ppr: job prterun-computer01-39342@1 not using ppr mapper PPR NULL policy PPR NOTSET
[computer01:39342] mca:rmaps:seq: job prterun-computer01-39342@1 not using seq mapper
[computer01:39342] mca:rmaps:rr: mapping job prterun-computer01-39342@1
[computer01:39342] AVAILABLE NODES FOR MAPPING:
[computer01:39342] node: computer01 daemon: 0 slots_available: 1
[computer01:39342] mca:rmaps:rr: mapping by Core for job prterun-computer01-39342@1 slots 1 num_procs 2
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  which

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can launch a
process. The number of slots available are defined by the environment in
which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of processor
     cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the hostname
     (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an RM is
     present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number of
hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to launch.
--------------------------------------------------------------------------

On 2022/11/15 02:04, users-requ...@lists.open-mpi.org wrote:

Message: 1
Date: Mon, 14 Nov 2022 17:04:24 +0000
From: "Jeff Squyres (jsquyres)"
To: Open MPI Users
Subject: Re: [OMPI users] [OMPI devel] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application

Yes, somehow I'm not seeing all the output that I expect to see. Can you ensure that, if you're copy-and-pasting from the email, it's actually using "dash dash" in front of "mca" and "machinefile" (vs. a copy-and-pasted "em dash")?

--
Jeff Squyres
jsquy...@cisco.com

From: users on behalf of Gilles Gouaillardet via users
Sent: Sunday, November 13, 2022 9:18 PM
To: Open MPI Users
Cc: Gilles Gouaillardet
Subject: Re: [OMPI users] [OMPI devel] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application

There is a typo in your command line. You should use --mca (minus minus) instead of -mca. Also, you can try --machinefile instead of -machinefile.

Cheers,

Gilles

> There are not enough slots available in the system to satisfy the 2 slots
> that were requested by the application: –mca

On Mon, Nov 14, 2022 at 11:04 AM timesir via users <users@lists.open-mpi.org> wrote:

> (py3.9) ➜ /share mpirun -n 2 -machinefile hosts –mca rmaps_base_verbose 100 --mca ras_base_verbose 100 which mpirun
> [...]
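Putting Gilles' two fixes together: every long option needs two ASCII hyphens, and the hostfile must actually reach mpirun. A corrected invocation would look like this (a sketch reusing the hosts file shown earlier in the thread):

(py3.9) ➜ /share cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1

(py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime

And if only one slot is genuinely available, the help text above offers an escape hatch. This variant is hypothetical (it combines the --host ":N" suffix and --map-by :OVERSUBSCRIBE options quoted in that help text) and lets 2 processes share a single slot:

(py3.9) ➜ /share mpirun -n 2 --host 192.168.180.48:1 --map-by :OVERSUBSCRIBE uptime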
Re: [OMPI users] users Digest, Vol 4818, Issue 1
Do you receive this email?

On Wednesday, November 23, 2022, timesir wrote:

> 1. This command now runs correctly
>
> (py3.9) ➜ /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime
>
> 2. But this command gets stuck. It seems to be the MPI program that gets stuck.
>
> [...]
Re: [OMPI users] [OMPI devel] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application
*(py3.9) ➜ /share mpirun -n 2 -machinefile hosts –mca rmaps_base_verbose 100 --mca ras_base_verbose 100 which mpirun*
[computer01:04570] mca: base: component_find: searching NULL for ras components
[computer01:04570] mca: base: find_dyn_components: checking NULL for ras components
[computer01:04570] pmix:mca: base: components_register: registering framework ras components
[computer01:04570] pmix:mca: base: components_register: found loaded component simulator
[computer01:04570] pmix:mca: base: components_register: component simulator register function successful
[computer01:04570] pmix:mca: base: components_register: found loaded component pbs
[computer01:04570] pmix:mca: base: components_register: component pbs register function successful
[computer01:04570] pmix:mca: base: components_register: found loaded component slurm
[computer01:04570] pmix:mca: base: components_register: component slurm register function successful
[computer01:04570] mca: base: components_open: opening ras components
[computer01:04570] mca: base: components_open: found loaded component simulator
[computer01:04570] mca: base: components_open: found loaded component pbs
[computer01:04570] mca: base: components_open: component pbs open function successful
[computer01:04570] mca: base: components_open: found loaded component slurm
[computer01:04570] mca: base: components_open: component slurm open function successful
[computer01:04570] mca:base:select: Auto-selecting ras components
[computer01:04570] mca:base:select:( ras) Querying component [simulator]
[computer01:04570] mca:base:select:( ras) Querying component [pbs]
[computer01:04570] mca:base:select:( ras) Querying component [slurm]
[computer01:04570] mca:base:select:( ras) No component selected!

== ALLOCATED NODES ==
computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: 192.168.180.48
192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
Flags: SLOTS_GIVEN
aliases: NONE
=================================

== ALLOCATED NODES ==
computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: 192.168.180.48
hepslustretest03: slots=1 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: 192.168.60.203,172.17.180.203,172.168.10.23,172.168.10.143
=================================

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  –mca

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can launch a
process. The number of slots available are defined by the environment in
which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of processor
     cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the hostname
     (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an RM is
     present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number of
hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to launch.
--------------------------------------------------------------------------

On 2022/11/13 23:42, Jeff Squyres (jsquyres) wrote:

Interesting. It says:

[computer01:106117] AVAILABLE NODES FOR MAPPING:
[computer01:106117] node: computer01 daemon: 0 slots_available: 1

This is why it tells you you're out of slots: you're asking for 2, but it only found 1. This means it's not seeing your hostfile somehow.

I should have asked you to run with *2* variables last time -- can you re-run with "mpirun --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 ..."? Turning on the RAS verbosity should show us what the hostfile component is doing.

--
Jeff Squyres
jsquy...@cisco.com

*From:* 龙龙
*Sent:* Sunday, November 13, 2022 3:13 AM
*To:* Jeff Squyres (jsquyres); Open MPI Users
*Subject:* Re: [OMPI devel] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application

*(py3.9) ➜ /share mpirun –version*
mpirun (Open MPI) 5.0.0rc9

Report bugs to https://www.open-mpi.org/community/help/
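Since this thread hinges on "--" versus a pasted en-dash, a quick character check can settle it. A minimal sketch (any Python 3 will do; the two characters in the string are simply the two look-alikes at issue):

(py3.9) ➜ /share python -c "import unicodedata; [print(hex(ord(c)), unicodedata.name(c)) for c in '–-']"
0x2013 EN DASH
0x2d HYPHEN-MINUS

A genuine "--mca" starts with two HYPHEN-MINUS (0x2d) characters; a mail-client-mangled "–mca" starts with a single EN DASH (0x2013), which mpirun does not recognize as an option and instead treats as the application to launch, which is exactly what the "requested by the application: –mca" error above shows.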