[OMPI users] mpi program gets stuck

2022-11-29 Thread timesir via users
see also: https://pastebin.com/s5tjaUkF

(py3.9) ➜  /share  cat hosts
192.168.180.48 slots=1
192.168.60.203 slots=1
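
As an aside, each "slots=N" clause in the hostfile caps how many
processes PRRTE will map onto that host; a variant allowing both ranks
on the first node alone would read, for example:

    192.168.180.48 slots=2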

1.  This command now runs correctly using your
openmpi-gitclone-pr11096.tar.bz2:
(py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose
100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime


2. But this command gets stuck. It appears to be the MPI program itself
that hangs.
test.py:
import mpi4py
from mpi4py import MPI
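
The hang is consistent with MPI_Init blocking: by default, mpi4py calls
MPI_Init when mpi4py.MPI is imported. A slightly expanded test.py, with
prints added here purely for illustration, makes the hang point visible:

    print("before importing mpi4py.MPI", flush=True)
    from mpi4py import MPI   # MPI_Init runs during this import
    comm = MPI.COMM_WORLD
    print("rank %d of %d" % (comm.Get_rank(), comm.Get_size()), flush=True)

If "before importing mpi4py.MPI" appears for both ranks and nothing else
does, the processes are stuck inside MPI_Init rather than in user code.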

(py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose
100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 python test.py
[computer01:47982] mca: base: component_find: searching NULL for plm
components
[computer01:47982] mca: base: find_dyn_components: checking NULL for plm
components
[computer01:47982] pmix:mca: base: components_register: registering
framework plm components
[computer01:47982] pmix:mca: base: components_register: found loaded
component slurm
[computer01:47982] pmix:mca: base: components_register: component slurm
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded
component ssh
[computer01:47982] pmix:mca: base: components_register: component ssh
register function successful
[computer01:47982] mca: base: components_open: opening plm components
[computer01:47982] mca: base: components_open: found loaded component slurm
[computer01:47982] mca: base: components_open: component slurm open
function successful
[computer01:47982] mca: base: components_open: found loaded component ssh
[computer01:47982] mca: base: components_open: component ssh open function
successful
[computer01:47982] mca:base:select: Auto-selecting plm components
[computer01:47982] mca:base:select:(  plm) Querying component [slurm]
[computer01:47982] mca:base:select:(  plm) Querying component [ssh]
[computer01:47982] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[computer01:47982] mca:base:select:(  plm) Query of component [ssh] set
priority to 10
[computer01:47982] mca:base:select:(  plm) Selected component [ssh]
[computer01:47982] mca: base: close: component slurm closed
[computer01:47982] mca: base: close: unloading component slurm
[computer01:47982] [prterun-computer01-47982@0,0] plm:ssh_setup on agent
ssh : rsh path NULL
[computer01:47982] [prterun-computer01-47982@0,0] plm:base:receive start
comm
[computer01:47982] mca: base: component_find: searching NULL for ras
components
[computer01:47982] mca: base: find_dyn_components: checking NULL for ras
components
[computer01:47982] pmix:mca: base: components_register: registering
framework ras components
[computer01:47982] pmix:mca: base: components_register: found loaded
component simulator
[computer01:47982] pmix:mca: base: components_register: component simulator
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded
component pbs
[computer01:47982] pmix:mca: base: components_register: component pbs
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded
component slurm
[computer01:47982] pmix:mca: base: components_register: component slurm
register function successful
[computer01:47982] mca: base: components_open: opening ras components
[computer01:47982] mca: base: components_open: found loaded component
simulator
[computer01:47982] mca: base: components_open: found loaded component pbs
[computer01:47982] mca: base: components_open: component pbs open function
successful
[computer01:47982] mca: base: components_open: found loaded component slurm
[computer01:47982] mca: base: components_open: component slurm open
function successful
[computer01:47982] mca:base:select: Auto-selecting ras components
[computer01:47982] mca:base:select:(  ras) Querying component [simulator]
[computer01:47982] mca:base:select:(  ras) Querying component [pbs]
[computer01:47982] mca:base:select:(  ras) Querying component [slurm]
[computer01:47982] mca:base:select:(  ras) No component selected!
[computer01:47982] mca: base: component_find: searching NULL for rmaps
components
[computer01:47982] mca: base: find_dyn_components: checking NULL for rmaps
components
[computer01:47982] pmix:mca: base: components_register: registering
framework rmaps components
[computer01:47982] pmix:mca: base: components_register: found loaded
component ppr
[computer01:47982] pmix:mca: base: components_register: component ppr
register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded
component rank_file
[computer01:47982] pmix:mca: base: components_register: component rank_file
has no register or open function
[computer01:47982] pmix:mca: base: components_register: found loaded
component round_robin
[computer01:47982] pmix:mca: base: components_register: component
round_robin register function successful
[computer01:47982] pmix:mca: base: components_register: found loaded
component seq
[computer01:47982] pmix:mca: base: components_register: component seq

Re: [OMPI users] users Digest, Vol 4818, Issue 1

2022-11-25 Thread timesir via users
: 192.168.180.48
    hepslustretest03: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 
192.168.60.203,hepslustretest03.ihep.ac.cn,172.17.180.203,172.168.10.23,172.168.10.143

=
[computer01:39342] mca:rmaps: mapping job prterun-computer01-39342@1
[computer01:39342] mca:rmaps: setting mapping policies for job 
prterun-computer01-39342@1 inherit TRUE hwtcpus FALSE

[computer01:39342] mca:rmaps[358] mapping not given - using bycore
[computer01:39342] setdefaultbinding[365] binding not given - using bycore
[computer01:39342] mca:rmaps:ppr: job prterun-computer01-39342@1 not 
using ppr mapper PPR NULL policy PPR NOTSET
[computer01:39342] mca:rmaps:seq: job prterun-computer01-39342@1 not 
using seq mapper

[computer01:39342] mca:rmaps:rr: mapping job prterun-computer01-39342@1
[computer01:39342] AVAILABLE NODES FOR MAPPING:
[computer01:39342] node: computer01 daemon: 0 slots_available: 1
[computer01:39342] mca:rmaps:rr: mapping by Core for job 
prterun-computer01-39342@1 slots 1 num_procs 2

--
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  which

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
 processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
 hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
 RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--
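
For reference, an oversubscribed variant of the failing command would
look something like this (an illustration; this exact command does not
appear in the thread):

    mpirun -n 2 --machinefile hosts --map-by :OVERSUBSCRIBE which mpirun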

On 2022/11/15 02:04, users-requ...@lists.open-mpi.org wrote:

Send users mailing list submissions to
users@lists.open-mpi.org

To subscribe or unsubscribe via the World Wide Web, visit
https://lists.open-mpi.org/mailman/listinfo/users
or, via email, send a message with subject or body 'help' to
users-requ...@lists.open-mpi.org

You can reach the person managing the list at
users-ow...@lists.open-mpi.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of users digest..."


Today's Topics:

1. Re: [OMPI devel] There are not enough slots available in the
   system to satisfy the 2, slots that were requested by the
   application (Jeff Squyres (jsquyres))
2. Re: Tracing of openmpi internal functions
   (Jeff Squyres (jsquyres))


--

Message: 1
Date: Mon, 14 Nov 2022 17:04:24 +
From: "Jeff Squyres (jsquyres)"
To: Open MPI Users
Subject: Re: [OMPI users] [OMPI devel] There are not enough slots
available in the system to satisfy the 2, slots that were requested by
the application
Message-ID:



Content-Type: text/plain; charset="utf-8"

Yes, somehow I'm not seeing all the output that I expect to see.  Can
you ensure that if you're copy-and-pasting from the email, that it's
actually using "dash dash" in front of "mca" and "machinefile" (vs. a
copy-and-pasted "em dash")?
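
As an aside, one quick way to check for a pasted em dash is to dump the
suspect text as bytes; a plain ASCII "-" prints as itself, while an
em/en dash shows up as a multi-byte UTF-8 sequence:

    printf '%s\n' '–mca' | od -c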

--
Jeff Squyres
jsquy...@cisco.com

From: users  on behalf of Gilles Gouaillardet via 
users
Sent: Sunday, November 13, 2022 9:18 PM
To: Open MPI Users
Cc: Gilles Gouaillardet
Subject: Re: [OMPI users] [OMPI devel] There are not enough slots available in 
the system to satisfy the 2, slots that were requested by the application

There is a typo in your command line.
You should use --mca (minus minus) instead of -mca

Also, you can try --machinefile instead of -machinefile
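
For reference, the fully corrected invocation, with plain ASCII double
dashes throughout, would be:

    mpirun -n 2 --machinefile hosts --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 which mpirun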

Cheers,

Gilles

There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

   –mca

On Mon, Nov 14, 2022 at 11:04 AM timesir via users
<users@lists.open-mpi.org> wrote:

(py3.9) ➜  /share  mpirun -n 2 -machinefile hosts –mca rmaps_base_verbose 100
--mca ras_base_verbose 100  which mpirun
[computer01:04570] mca: base: component_find: searching NULL for ras components
[computer01:04570] mca: base: find_dyn_components: checking N

Re: [OMPI users] users Digest, Vol 4818, Issue 1

2022-11-25 Thread timesir via users
defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
 RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--

On Mon, Nov 14, 2022 at 11:04 AM timesir via users
<users@lists.open-mpi.org> wrote:

(py3.9) ➜  /share  mpirun -n 2 -machinefile hosts –mca
rmaps_base_verbose 100 --mca ras_base_verbose 100  which mpirun
[computer01:04570] mca: base: component_find: searching NULL for ras components
[computer01:04570] mca: base: find_dyn_components: checking NULL for
ras components
[computer01:04570] pmix:mca: base: components_register: registering
framework ras components
[computer01:04570] pmix:mca: base: components_register: found loaded
component simulator
[computer01:04570] pmix:mca: base: components_register: component
simulator register function successful
[computer01:04570] pmix:mca: base: components_register: found loaded
component pbs
[computer01:04570] pmix:mca: base: components_register: component pbs
register function successful
[computer01:04570] pmix:mca: base: components_register: found loaded
component slurm
[computer01:04570] pmix:mca: base: components_register: component
slurm register function successful
[computer01:04570] mca: base: components_open: opening ras components
[computer01:04570] mca: base: components_open: found loaded component simulator
[computer01:04570] mca: base: components_open: found loaded component pbs
[computer01:04570] mca: base: components_open: component pbs open
function successful
[computer01:04570] mca: base: components_open: found loaded component slurm
[computer01:04570] mca: base: components_open: component slurm open
function successful
[computer01:04570] mca:base:select: Auto-selecting ras components
[computer01:04570] mca:base:select:(  ras) Querying component [simulator]
[computer01:04570] mca:base:select:(  ras) Querying component [pbs]
[computer01:04570] mca:base:select:(  ras) Querying component [slurm]
[computer01:04570] mca:base:select:(  ras) No component selected!

==   ALLOCATED NODES   ==

computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: 192.168.180.48
192.168.60.203

Re: [OMPI users] users Digest, Vol 4818, Issue 1

2022-11-25 Thread timesir via users
Did you receive this email?


On Wednesday, November 23, 2022, timesir wrote:

>
> *1.  This command now runs correctly *
>
> *(py3.9) ➜  /share mpirun -n 2 --machinefile hosts --mca plm_base_verbose
> 100 --mca rmaps_base_verbose 100 --mca ras_base_verbose 100 uptime*
>
>
>
> *2. But this command gets stuck. It seems to be the mpi program that gets
> stuck. *

Re: [OMPI users] users Digest, Vol 4818, Issue 1

2022-11-14 Thread timesir via users
 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 
192.168.60.203,hepslustretest03.ihep.ac.cn,172.17.180.203,172.168.10.23,172.168.10.143

=
[computer01:39342] mca:rmaps: mapping job prterun-computer01-39342@1
[computer01:39342] mca:rmaps: setting mapping policies for job 
prterun-computer01-39342@1 inherit TRUE hwtcpus FALSE

[computer01:39342] mca:rmaps[358] mapping not given - using bycore
[computer01:39342] setdefaultbinding[365] binding not given - using bycore
[computer01:39342] mca:rmaps:ppr: job prterun-computer01-39342@1 not 
using ppr mapper PPR NULL policy PPR NOTSET
[computer01:39342] mca:rmaps:seq: job prterun-computer01-39342@1 not 
using seq mapper

[computer01:39342] mca:rmaps:rr: mapping job prterun-computer01-39342@1
[computer01:39342] AVAILABLE NODES FOR MAPPING:
[computer01:39342] node: computer01 daemon: 0 slots_available: 1
[computer01:39342] mca:rmaps:rr: mapping by Core for job 
prterun-computer01-39342@1 slots 1 num_procs 2

--
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  which

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
 processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
 hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
 RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--


Re: [OMPI users] [OMPI devel] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application

2022-11-13 Thread timesir via users
*(py3.9) ➜  /share  mpirun -n 2 -machinefile hosts –mca 
rmaps_base_verbose 100 --mca ras_base_verbose 100 which mpirun*
[computer01:04570] mca: base: component_find: searching NULL for ras 
components
[computer01:04570] mca: base: find_dyn_components: checking NULL for ras 
components
[computer01:04570] pmix:mca: base: components_register: registering 
framework ras components
[computer01:04570] pmix:mca: base: components_register: found loaded 
component simulator
[computer01:04570] pmix:mca: base: components_register: component 
simulator register function successful
[computer01:04570] pmix:mca: base: components_register: found loaded 
component pbs
[computer01:04570] pmix:mca: base: components_register: component pbs 
register function successful
[computer01:04570] pmix:mca: base: components_register: found loaded 
component slurm
[computer01:04570] pmix:mca: base: components_register: component slurm 
register function successful

[computer01:04570] mca: base: components_open: opening ras components
[computer01:04570] mca: base: components_open: found loaded component 
simulator

[computer01:04570] mca: base: components_open: found loaded component pbs
[computer01:04570] mca: base: components_open: component pbs open 
function successful

[computer01:04570] mca: base: components_open: found loaded component slurm
[computer01:04570] mca: base: components_open: component slurm open 
function successful

[computer01:04570] mca:base:select: Auto-selecting ras components
[computer01:04570] mca:base:select:(  ras) Querying component [simulator]
[computer01:04570] mca:base:select:(  ras) Querying component [pbs]
[computer01:04570] mca:base:select:(  ras) Querying component [slurm]
[computer01:04570] mca:base:select:(  ras) No component selected!

==   ALLOCATED NODES   ==
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 192.168.180.48
    192.168.60.203: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
    Flags: SLOTS_GIVEN
    aliases: NONE
=

==   ALLOCATED NODES   ==
    computer01: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 192.168.180.48
    hepslustretest03: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 192.168.60.203,172.17.180.203,172.168.10.23,172.168.10.143
=
--
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  –mca

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
 processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
 hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
 RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--



On 2022/11/13 23:42, Jeff Squyres (jsquyres) wrote:

Interesting.  It says:

[computer01:106117] AVAILABLE NODES FOR MAPPING:
[computer01:106117] node: computer01 daemon: 0 slots_available: 1

This is why it tells you you're out of slots: you're asking for 2, but 
it only found 1.  This means it's not seeing your hostfile somehow.


I should have asked you to run with *2* variables last time -- can
you re-run with "mpirun --mca rmaps_base_verbose 100 --mca
ras_base_verbose 100 ..."?


Turning on the RAS verbosity should show us what the hostfile 
component is doing.


--
Jeff Squyres
jsquy...@cisco.com

*From:* 龙龙 
*Sent:* Sunday, November 13, 2022 3:13 AM
*To:* Jeff Squyres (jsquyres) ; Open MPI Users 

*Subject:* Re: [OMPI devel] There are not enough slots available in 
the system to satisfy the 2, slots that were requested by the application


*(py3.9) ➜ /share mpirun –version*

mpirun (Open MPI) 5.0.0rc9

Report bugs to https://www.open-mpi.org/community/help/