[OMPI users] There are not enough slots available in the system to satisfy the 2, slots that were requested by the application

2022-11-07 Thread mrlong via users

*Two machines, each with 64 cores. The contents of the hosts file are:*

192.168.180.48 slots=1
192.168.60.203 slots=1

*Why do I get the following error when running with Open MPI 5.0.0rc9?*

(py3.9) [user@machine01 share]$  mpirun -n 2 --machinefile hosts hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:

  hostname

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
 processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
 hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
 RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
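
For reference, the two remedies named in the help text above look like this in practice (a hedged sketch; the slot counts of 64 are an assumption based on the 64-core machines described at the top of the thread, not something taken from the output):

192.168.180.48 slots=64
192.168.60.203 slots=64

or, keeping the original hosts file and allowing oversubscription explicitly:

mpirun -n 2 --machinefile hosts --map-by :OVERSUBSCRIBE hostname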



[OMPI users] [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, p2p

2022-11-07 Thread mrlong via users

Running Open MPI 5.0.0rc9 produces the following output:

(py3.9) [user@machine01 share]$  mpirun -n 2 python test.py
[LOG_CAT_ML] component basesmuma is not available but requested in 
hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p

[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in 
hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p

[LOG_CAT_ML] ml_discover_hierarchy exited with error

Why is this message printed?
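
The [LOG_CAT_ML] lines typically come from the HCOLL collective library (Mellanox/NVIDIA) rather than from Open MPI itself. A hedged way to test that theory (an assumption, not a confirmed diagnosis) is to exclude the hcoll collective component and see whether the messages disappear:

mpirun --mca coll ^hcoll -n 2 python test.py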


Re: [OMPI users] [EXTERNAL] OFI, destroy_vni_context(1137).......: OFI domain close failed (ofi_init.c:1137:destroy_vni_context:Device or resource busy)

2022-11-01 Thread mrlong via users
Thanks, what you said seems to be right; I just checked and solved it.
It may have been caused by a conflict between the Open MPI and MPICH libraries.
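
A minimal sketch of how to confirm which MPI implementation mpi4py is actually bound to, which is useful for spotting this kind of Open MPI/MPICH mix-up (both calls are standard mpi4py API):

from mpi4py import MPI

# Report the MPI library mpi4py was built against and is running with,
# e.g. an "Open MPI v5.0.0rc9" banner vs. an MPICH version string.
print(MPI.Get_library_version())
print(MPI.get_vendor())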


On 2022/11/2 02:06, Pritchard Jr., Howard wrote:


Hi,

You are using MPICH or a vendor derivative of MPICH. You probably 
want to resend this email to the MPICH users/help mailing list.


Howard

*From: *users  on behalf of mrlong 
via users 

*Reply-To: *Open MPI Users 
*Date: *Tuesday, November 1, 2022 at 11:26 AM
*To: *"de...@lists.open-mpi.org" , 
"users@lists.open-mpi.org" 

*Cc: *mrlong 
*Subject: *[EXTERNAL] [OMPI users] OFI, 
destroy_vni_context(1137)...: OFI domain close failed 
(ofi_init.c:1137:destroy_vni_context:Device or resource busy)


Hi, teachers

code:

import mpi4py
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
print("rank", rank)


if __name__ == '__main__':
    if rank == 0:
        mem = np.array([0], dtype='i')
        win = MPI.Win.Create(mem, comm=comm)
    else:
        win = MPI.Win.Create(None, comm=comm)
    print(rank, "end")

(py3.6.8) ➜  ~  mpirun -n 2 python -u test.py 

rank 0
rank 1
0 end
1 end
Abort(806449679): Fatal error in internal_Finalize: Other MPI error, 
error stack:

internal_Finalize(50)...: MPI_Finalize failed
MPII_Finalize(345)..:
MPID_Finalize(511)..:
MPIDI_OFI_mpi_finalize_hook(895):
destroy_vni_context(1137)...: OFI domain close failed 
(ofi_init.c:1137:destroy_vni_context:Device or resource busy)


*Why is this happening? How can I debug it? This error is not reported on 
the other machine.*


[OMPI users] OFI, destroy_vni_context(1137).......: OFI domain close failed (ofi_init.c:1137:destroy_vni_context:Device or resource busy)

2022-11-01 Thread mrlong via users

Hi, teachers

code:

import mpi4py
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
print("rank", rank)


if __name__ == '__main__':
    if rank == 0:
        mem = np.array([0], dtype='i')
        win = MPI.Win.Create(mem, comm=comm)
    else:
        win = MPI.Win.Create(None, comm=comm)
    print(rank, "end")


(py3.6.8) ➜  ~  mpirun -n 2 python -u test.py
rank 0
rank 1
0 end
1 end
Abort(806449679): Fatal error in internal_Finalize: Other MPI error, 
error stack:

internal_Finalize(50)...: MPI_Finalize failed
MPII_Finalize(345)..:
MPID_Finalize(511)..:
MPIDI_OFI_mpi_finalize_hook(895):
destroy_vni_context(1137)...: OFI domain close failed 
(ofi_init.c:1137:destroy_vni_context:Device or resource busy)


*Why is this happening? How can I debug it? This error is not reported on 
the other machine.*
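
One hedged thing to rule out while debugging (an assumption, not a confirmed diagnosis): the RMA window created above is never freed before the interpreter exits, and an unreleased window is the kind of resource that could leave the OFI domain busy when MPI_Finalize runs. Freeing it explicitly is cheap to try; the imports, comm, and rank are as in the script above:

if __name__ == '__main__':
    if rank == 0:
        mem = np.array([0], dtype='i')
        win = MPI.Win.Create(mem, comm=comm)
    else:
        win = MPI.Win.Create(None, comm=comm)
    # Release the window explicitly so no RMA resources are still held
    # when MPI_Finalize runs at interpreter exit.
    win.Free()
    print(rank, "end")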


[OMPI users] --mca btl_base_verbose 30 not working in version 5.0

2022-10-30 Thread mrlong via users
mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 -np 2 
--machinefile hostfile  hostname


Why does this command not print the "IP addresses are routable" messages in 
Open MPI 5.0.0rc9?
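
One hedged thing to try (an assumption about verbosity thresholds, not a confirmed fix): raise the verbosity level, since some of the TCP BTL's reachability output may only be emitted at higher levels:

mpirun --mca btl self,sm,tcp --mca btl_base_verbose 100 -np 2 --machinefile hostfile hostname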




[OMPI users] Open MPI 5.0.0rc8 failure but 4.1.4 works well

2022-10-30 Thread mrlong via users

Two machines.
A: 192.168.180.48
B: 192.168.60.203

The hostfile content is
192.168.60.203 slots=2

1. Using Open MPI 4.1.4, executing "mpirun -n 2 --machinefile hostfile 
hostname" on machine A prints the hostname of B correctly.

2. However, using Open MPI 5.0.0rc8, the result on machine A is
$ mpirun -n 2 --machinefile hostfile hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
Why is this happening?
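
A hedged first debugging step (an assumption about the v5 CLI: in v4 the spelling was --display-allocation, while v5 takes it as a directive to --display): ask mpirun to show the allocation it actually computed, to check whether 192.168.60.203 and its two slots are being read from the hostfile at all:

mpirun -n 2 --machinefile hostfile --display allocation hostname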