Re: [OMPI users] [EXTERNAL] Confusions on building and running OpenMPI over Slingshot 10 on Cray EX HPC
Hi Jerry,

Cray EX HPC with Slingshot 10 (not 11!) is basically a Mellanox IB cluster using RoCE rather than IB. For this sort of interconnect, don't use OFI; use UCX. UCX 1.12.0 is getting a bit old, though. I'd recommend 1.14.0 or newer, especially if your system has nodes with GPUs.

CXI is the name of the vendor libfabric provider and doesn't function on this system, unless parts of the cluster are wired up with Slingshot 11 NICs. For the node where you ran lspci this doesn't seem to be the case. You'd see something like this if you had Slingshot 11:

27:00.0 Ethernet controller: Cray Inc Device 0501 (rev 02)
a8:00.0 Ethernet controller: Cray Inc Device 0501 (rev 02)

For your first question, double check the final output from a configure run and make sure the summary says UCX support is enabled. Please see https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/ib-and-roce.html for answers to some of your other questions below. Note there are some RoCE-specific items in that doc page you may want to check.

The PMIx Slingshot config option is getting you confused. Just ignore it for this network. I'd suggest tweaking your configure options to the following:

--enable-mpi-fortran \
--enable-shared \
--with-pic \
--with-ofi=no \
--with-ucx=/project/app/ucx/1.12.1 \
--with-pmix=internal \
--with-pbs \
--with-tm=/opt/pbs \
--with-singularity=/project/app/singularity/3.10.3 \
--with-lustre=/usr \
CC=icc \
FC=ifort \
CXX=icpc

This will end up with a build of Open MPI that uses UCX, which is what you want. You are getting the error message from the btl framework because the OFI BTL can't find a suitable/workable OFI provider. If you really need to build with OFI support, add --with-ofi, but set the following MCA parameters (here shown using environment variables) when running applications built using this Open MPI installation:

export OMPI_MCA_pml=ucx
export OMPI_MCA_osc=ucx
export OMPI_MCA_btl=^ofi
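The "double check the final output from a configure run" step above can be scripted. The exact summary wording ("Open UCX: yes") is an assumption that can vary between Open MPI versions, so the sketch below simulates the captured line; on a real build tree you would grep the saved configure output (e.g. `./configure ... | tee configure.out`) instead.

```shell
# Sketch: confirm the configure summary reported UCX support.
summary="Open UCX: yes"   # simulated summary line; wording is an assumption
if echo "$summary" | grep -q 'Open UCX: yes'; then
  echo "UCX support enabled"
else
  echo "UCX support missing - check the --with-ucx path"
fi
```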
Hope this helps,

Howard

From: users on behalf of Jianyu Liu via users
Reply-To: Open MPI Users
Date: Wednesday, May 8, 2024 at 7:41 PM
To: "users@lists.open-mpi.org"
Cc: Jianyu Liu
Subject: [EXTERNAL] [OMPI users] Confusions on building and running OpenMPI over Slingshot 10 on Cray EX HPC

Hi,

I'm trying to build an OpenMPI 5.0.3 environment on the Cray EX HPC with Slingshot 10 support. Generally speaking, there were no error messages while building OpenMPI, and make check also didn't report any failure.

When I tested the OpenMPI environment with a simple 'hello world' MPI Fortran code, it threw out these error messages and caught signal 11 in libucs if '-mca btl ofi' was specified:

No components were able to be opened in the btl framework. This typically means that either no components of this type were installed, or none of the installed components can be loaded. Sometimes this means that shared libraries required by these components are unable to be found/loaded.

Host: x3001c027b4n0
Framework: btl

Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
/project/app/ucx/1.12.1/lib/libucs.so.0 (ucs_handle_error+0x134)

This made me unsure whether I got OpenMPI built with full Slingshot 10 support and running over Slingshot 10 properly. Here is the build environment:
on Cray EX HPC with SLES 15 SP3:
OpenMPI 5.0.3 + Intel 2022.0.2 + UCX 1.12.1 + libfabric 1.11.0.4.125-SSHOT2.0.0 + mlnx-ofed 5.5.1

Here are my configure options:

--enable-mpi-fortran \
--enable-shared \
--with-pic \
--with-ofi=/opt/cray/libfabric/1.11.0.4.125 \
--with-ofi-libdir=/opt/cray/libfabric/1.11.0.4.125/lib64 \
--with-ucx=/project/app/ucx/1.12.1 \
--with-pmix=internal \
--with-slingshot \
--with-pbs \
--with-tm=/opt/pbs \
--with-singularity=/project/app/singularity/3.10.3 \
--with-lustre=/usr \
CC=icc \
FC=ifort \
CXX=icpc

Here is the output of lspci on the compute nodes:

03:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
24:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

Here is what confuses me:

1. After the configuration completed, the PMIx summary didn't say Slingshot support is turned on for the transports.
2. config.log didn't show any checking info against Slingshot while conducting the MCA checks; it just showed that --with-slingshot was passed as an argument.
3. Looking further into the configure script, the only script which checks Slingshot support is 3rd-party/openpmix/src/mca/pnet/sshot/configure.m4, but it looks like it's never called, as config.log didn't show any checks against the appropriate dependencies, such as CXI and JANSSON, and I believe the CXI library is not installed on the machine.

Here are my questions:

1. How it could tell OpenMPI was built with full Slingshot
Re: [OMPI users] [EXTERNAL] Helping interpreting error output
Hi Jeffrey,

I would suggest trying to debug what may be going wrong with UCX on your DGX box. There are several things to try from the UCX FAQ: https://openucx.readthedocs.io/en/master/faq.html

I'd suggest setting the UCX_LOG_LEVEL environment variable to info or debug and seeing whether UCX says something about what's going wrong. Also add --mca plm_base_verbose 10 to the mpirun command line. Have you used DGX boxes with only a single NIC successfully?

Howard

From: users on behalf of Jeffrey Layton via users
Reply-To: Open MPI Users
Date: Tuesday, April 16, 2024 at 12:30 PM
To: Open MPI Users
Cc: Jeffrey Layton
Subject: [EXTERNAL] [OMPI users] Helping interpreting error output

Good afternoon MPI fans of all ages,

Yet again, I'm getting an error that I'm having trouble interpreting. This time, I'm trying to run ior. I've done it a thousand times, but not on an NVIDIA DGX A100 with multiple NICs. The ultimate command is the following:

/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' -np 4 -map-by ppr:4:node --allow-run-as-root --mca btl_openib_warn_default_gid_prefix 0 --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 --mca plm_base_verbose 0 --mca plm rsh /home/bcm/bin/bin/ior -w -r -z -e -C -t 1m -b 1g -s 1000 -o /mnt/test

It was suggested to me to use these MPI options. The error I get is the following:

--
A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.

Host: dgx-02
Framework: pml
Component: ucx
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

mca_pml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--
[dgx-02:2399932] *** An error occurred in MPI_Init
[dgx-02:2399932] *** reported by process [2099773441,3]
[dgx-02:2399932] *** on a NULL communicator
[dgx-02:2399932] *** Unknown error
[dgx-02:2399932] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dgx-02:2399932] *** and potentially your MPI job)

My first inclination was that it couldn't find UCX, so I loaded that module and re-ran it. I get the exact same error message. I'm still checking whether the ucx module gets loaded when I run via Slurm; mdtest ran without issue, but I'm verifying that.

Any thoughts? Thanks!

Jeff
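Howard's debugging suggestion above can be sketched as a script. The mpirun arguments are taken from the original post; the command is only echoed here so the sketch runs without an MPI installation on hand.

```shell
# Turn up UCX and Open MPI launcher verbosity for a single debug run.
export UCX_LOG_LEVEL=debug   # "info" is a quieter first step
mpirun_cmd="mpirun --mca plm_base_verbose 10 --mca btl ^openib -np 4 -map-by ppr:4:node ior -w -r -o /mnt/test"
echo "would run: UCX_LOG_LEVEL=$UCX_LOG_LEVEL $mpirun_cmd"
```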
Re: [OMPI users] [EXTERNAL] Help deciphering error message
Hello Jeffrey,

A couple of things to try first. Try running without UCX: add --mca pml ^ucx to the mpirun command line. If the app functions without UCX, then the next thing is to see what may be going wrong with UCX and the Open MPI components that use it.

You may want to set the UCX_LOG_LEVEL environment variable to see if Open MPI's UCX PML component is actually able to initialize UCX and start trying to use it. See https://openucx.readthedocs.io/en/master/faq.html for an example of doing this using mpirun and the type of output you should be getting.

Another simple thing to try is

mpirun -np 1 ucx_info -v

and see if you get something like this back on stdout:

# Library version: 1.14.0
# Library path: /usr/lib64/libucs.so.0
# API headers version: 1.14.0
# Git branch '', revision f8877c5
# Configured with: --build=aarch64-redhat-linux-gnu --host=aarch64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.7

Are you running the mpirun command on dgx-14? If that's a different host, a likely problem is that for some reason the information from your ucx/1.10.1 module is not getting picked up on dgx-14.

One other thing: if the UCX module name is indicating the version of UCX, it's rather old. I'd suggest, if possible, updating to a newer version, like 1.14.1 or newer.
There are many enhancements in more recent versions of UCX for GPU support, and I would bet you'd want that for your DGX boxes.

Howard

From: users on behalf of Jeffrey Layton via users
Reply-To: Open MPI Users
Date: Thursday, March 7, 2024 at 11:53 AM
To: Open MPI Users
Cc: Jeffrey Layton
Subject: [EXTERNAL] [OMPI users] Help deciphering error message

Good afternoon,

I'm getting an error message I'm not sure how to use to debug an issue. I'll try to give you all of the pertinent details about the setup, but I didn't build the system nor install the software. It's an NVIDIA SuperPod system with Base Command Manager 10.0. I'm building IOR, but I'm really interested in mdtest.

"module list" says I'm using the following modules:

gcc/64/4.1.5a1
ucx/1.10.1
openmpi4/gcc/4.1.5

There are no problems building the code. I'm using Slurm to run mdtest using a script. The output from the script and Slurm is the following (the command to run it is included):

/cm/shared/apps/openmpi4/gcc/4.1.5/bin/mpirun --mca btl '^openib' -np 1 -map-by ppr:1:node --allow-run-as-root --mca btl_openib_warn_default_gid_prefix 0 --mca btl_openib_if_exclude mlx5_0,mlx5_5,mlx5_6 --mca plm_base_verbose 0 --mca plm rsh /home/bcm/bin/bin/mdtest -i 3 -I 4 -z 3 -b 8 -u -u -d /raid/bcm/mdtest

--
A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.
Host: dgx-14
Framework: pml
Component: ucx
--
[dgx-14:4055623] [[42340,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[dgx-14:4055632] *** An error occurred in MPI_Init
[dgx-14:4055632] *** reported by process [2774794241,0]
[dgx-14:4055632] *** on a NULL communicator
[dgx-14:4055632] *** Unknown error
[dgx-14:4055632] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[dgx-14:4055632] *** and potentially your MPI job)

Any pointers/help is greatly appreciated. Thanks!

Jeff
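A "component ucx ... not found" failure like the one above is often a library path problem rather than a missing component. A minimal sketch for checking whether UCX's core library is resolvable on the failing node (libucs is the standard UCX library name; paths and module setups vary per system):

```shell
# Check whether the dynamic linker can resolve UCX's core library.
if ldconfig -p 2>/dev/null | grep -q libucs; then
  echo "libucs found in ldconfig cache"
else
  echo "libucs not in ldconfig cache - check the ucx module / LD_LIBRARY_PATH"
fi
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
```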
Re: [OMPI users] [EXTERNAL] Re: MPI_Init_thread error
Hi Aziz,

Oh, I see you referenced the FAQ. That section of the FAQ is discussing how to make the Open MPI 4 series (and older) job launcher "know" about the batch scheduler you are using. The relevant section for launching with srun is covered by this FAQ: https://www-lb.open-mpi.org/faq/?category=slurm

Howard

From: "Pritchard Jr., Howard"
Date: Tuesday, July 25, 2023 at 8:26 AM
To: Open MPI Users
Cc: Aziz Ogutlu
Subject: Re: [EXTERNAL] Re: [OMPI users] MPI_Init_thread error

Hi Aziz,

Did you include --with-pmi2 on your Open MPI configure line?

Howard

From: users on behalf of Aziz Ogutlu via users
Organization: Eduline Bilisim
Reply-To: Open MPI Users
Date: Tuesday, July 25, 2023 at 8:18 AM
To: Open MPI Users
Cc: Aziz Ogutlu
Subject: [EXTERNAL] Re: [OMPI users] MPI_Init_thread error

Hi Gilles,

Thank you for your response. When I run srun --mpi=list, I get only pmi2. When I run the command with the --mpi=pmi2 parameter, I get the same error. OpenMPI automatically supports Slurm after the 4.x version: https://www.open-mpi.org/faq/?category=building#build-rte

On 7/25/23 12:55, Gilles Gouaillardet via users wrote:

Aziz,

When using direct run (e.g. srun), OpenMPI has to interact with SLURM. This is typically achieved via PMI2 or PMIx. You can run srun --mpi=list to list the available options on your system. If PMIx is available, you can srun --mpi=pmix ...; if only PMI2 is available, you need to make sure Open MPI was built with SLURM support (e.g. configure --with-slurm ...) and then srun --mpi=pmi2 ...

Cheers,

Gilles

On Tue, Jul 25, 2023 at 5:07 PM Aziz Ogutlu via users <users@lists.open-mpi.org> wrote:

Hi there all,

We're using Slurm 21.08 on a RedHat 7.9 HPC cluster with OpenMPI 4.0.3 + gcc 8.5.0.
When we run the command below to call SU2, we get an error message:

$ srun -p defq --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
$ module load su2/7.5.1
$ SU2_CFD config.cfg

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[cnode003.hpc:17534] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

--
Best regards,
Aziz Öğütlü
Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.
www.eduline.com.tr
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza Kat:6 Ofis No:118
Kağıthane - İstanbul - Türkiye 34406
Tel : +90 212 324 60 61  Cep: +90 541 350 40 72
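Gilles's advice above (prefer --mpi=pmix if the scheduler offers it, fall back to --mpi=pmi2 otherwise) can be sketched as a small shell helper. The srun output is simulated here so the sketch runs off-cluster; on a real system the assignment would read `available=$(srun --mpi=list 2>&1)`.

```shell
# Choose the srun --mpi flag from what the scheduler reports.
available="srun: pmi2"   # simulated `srun --mpi=list` output
case "$available" in
  *pmix*) flag="--mpi=pmix" ;;
  *pmi2*) flag="--mpi=pmi2" ;;   # needs an Open MPI built with SLURM/PMI2 support
  *)      flag="" ;;
esac
echo "launch with: srun $flag ./SU2_CFD config.cfg"
```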
Re: [OMPI users] [EXTERNAL] Re: MPI_Init_thread error
Hi Aziz,

Did you include --with-pmi2 on your Open MPI configure line?

Howard

From: users on behalf of Aziz Ogutlu via users
Organization: Eduline Bilisim
Reply-To: Open MPI Users
Date: Tuesday, July 25, 2023 at 8:18 AM
To: Open MPI Users
Cc: Aziz Ogutlu
Subject: [EXTERNAL] Re: [OMPI users] MPI_Init_thread error

Hi Gilles,

Thank you for your response. When I run srun --mpi=list, I get only pmi2. When I run the command with the --mpi=pmi2 parameter, I get the same error. OpenMPI automatically supports Slurm after the 4.x version: https://www.open-mpi.org/faq/?category=building#build-rte

On 7/25/23 12:55, Gilles Gouaillardet via users wrote:

Aziz,

When using direct run (e.g. srun), OpenMPI has to interact with SLURM. This is typically achieved via PMI2 or PMIx. You can run srun --mpi=list to list the available options on your system. If PMIx is available, you can srun --mpi=pmix ...; if only PMI2 is available, you need to make sure Open MPI was built with SLURM support (e.g. configure --with-slurm ...) and then srun --mpi=pmi2 ...

Cheers,

Gilles

On Tue, Jul 25, 2023 at 5:07 PM Aziz Ogutlu via users <users@lists.open-mpi.org> wrote:

Hi there all,

We're using Slurm 21.08 on a RedHat 7.9 HPC cluster with OpenMPI 4.0.3 + gcc 8.5.0. When we run the command below to call SU2, we get an error message:

$ srun -p defq --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
$ module load su2/7.5.1
$ SU2_CFD config.cfg

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[cnode003.hpc:17534] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--
Best regards,
Aziz Öğütlü
Eduline Bilişim Sanayi ve Ticaret Ltd. Şti.
www.eduline.com.tr
Merkez Mah. Ayazma Cad. No:37 Papirus Plaza Kat:6 Ofis No:118
Kağıthane - İstanbul - Türkiye 34406
Tel : +90 212 324 60 61  Cep: +90 541 350 40 72
Re: [OMPI users] [EXTERNAL] Re: How to use hugetlbfs with openmpi and ucx
Hi Arun,

Interesting. For problem b) I would suggest one of two things:

- If you want to dig deeper yourself, and it's possible on your system, I'd look at the output of dmesg -H -w on the node where the job is hitting this failure (you'll need to rerun the job).
- Ping the UCX group mail list (see https://elist.ornl.gov/mailman/listinfo/ucx-group).

As for your more general question, I would suggest keeping it simple and letting the applications use large pages via the usual libhugetlbfs mechanism (LD_PRELOAD libhugetlbfs and set the libhugetlbfs env variables specifying what type of process memory to try to map to large pages). But I'm no expert in the ways UCX may be able to take advantage of internally allocated large pages, nor the extent to which such use of large pages has led to demonstrable application speedups.

Howard

On 7/21/23, 8:37 AM, "Chandran, Arun" <arun.chand...@amd.com> wrote:

Hi Howard,

Thank you very much for the reply. UCX is trying to set up the FIFO for shared memory communication using both sysv and posix. By default, these allocations are failing when tried with hugetlbfs.

a) Failure log from strace (pasting only for rank 0):

[pid 3541286] shmget(IPC_PRIVATE, 6291456, IPC_CREAT|IPC_EXCL|SHM_HUGETLB|0660) = -1 EPERM (Operation not permitted)
[pid 3541286] mmap(NULL, 6291456, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_HUGETLB, 29, 0) = -1 EINVAL (Invalid argument)

b) I was able to overcome the shmget allocation failure with hugetlbfs by adding my gid to /proc/sys/vm/hugetlb_shm_group:

[pid 3541465] shmget(IPC_PRIVATE, 6291456, IPC_CREAT|IPC_EXCL|SHM_HUGETLB|0660) = 2916410  --> success
[pid 3541465] mmap(NULL, 6291456, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_HUGETLB, 29, 0) = -1 EINVAL (Invalid argument)  --> still fails

But mmap with MAP_SHARED|MAP_HUGETLB is still failing. Any clues?
I am aware of the advantages of huge page tables. I am asking from the OpenMPI library perspective: should I use it for OpenMPI internal buffers and data structures, or leave it for the applications to use? What are the community recommendations in this regard?

--Arun

-----Original Message-----
From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Thursday, July 20, 2023 9:36 PM
To: Open MPI Users <users@lists.open-mpi.org>; Florent GERMAIN <florent.germ...@eviden.com>
Cc: Chandran, Arun <arun.chand...@amd.com>
Subject: Re: [EXTERNAL] Re: [OMPI users] How to use hugetlbfs with openmpi and ucx

Hi Arun,

It's going to be chatty, but you may want to see if strace helps in diagnosing:

mpirun -np 2 (all your favorite mpi args) strace -f send_recv 1000 1

Huge pages often help reduce pressure on a NIC's I/O MMU and speed up resolving virtual to physical memory addresses.

Good luck,

Howard

On 7/19/23, 9:24 PM, "users on behalf of Chandran, Arun via users" <users@lists.open-mpi.org> wrote:

Hi,

I am trying to use static huge pages, not transparent huge pages. UCX is allowed to allocate via hugetlbfs:

$ ./bin/ucx_info -c | grep -i huge
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_SYSV_HUGETLB_MODE=try    ---> it is trying this and failing
UCX_SYSV_FIFO_HUGETLB=n
UCX_POSIX_HUGETLB_MODE=try   ---> it is trying this and failing
UCX_POSIX_FIFO_HUGETLB=n
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_CMA_ALLOC=huge,thp,mmap,heap

It is failing even though I have static hugepages available on my system.
$ cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 20

THP is also enabled:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

--Arun

-----Original Message-----
From: Florent GERMAIN <florent.germ...@eviden.com>
Sent: Wednesday, July 19, 2023 7:51 PM
To: Open MPI Users <users@lists.open-mpi.org>; Chandran, Arun <arun.chand...@amd.com>
Subject: RE: How to use hugetlbfs with openmpi and ucx

Hi,

You can check if there are dedicated huge pages on your system or if transparent huge pages are allowed.

Transparent huge pages on rhel systems:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

-> this means that transparent huge pages are selected through mmap + madvise
-> always = always try to aggregate pages on thp (for large enough allocations with good alignment)
-> never = never try to aggregate pages on thp
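The individual checks discussed in this thread (dedicated huge pages, the SysV shm group gate, THP mode) can be gathered into one short script. The procfs/sysfs paths are standard Linux; the values will of course differ per system.

```shell
# Summarize huge-page availability relevant to UCX's hugetlb allocations.
# Dedicated (static) huge pages reserved on this node:
grep '^HugePages_Total' /proc/meminfo || echo "HugePages_Total: not exposed"
# gid allowed to create SysV SHM_HUGETLB segments (0 means root group only):
cat /proc/sys/vm/hugetlb_shm_group 2>/dev/null || echo "hugetlb_shm_group: unavailable"
# Transparent huge page mode, if the kernel exposes it:
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null || echo "THP: unavailable"
```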
Re: [OMPI users] [EXTERNAL] Re: How to use hugetlbfs with openmpi and ucx
Hi Arun,

It's going to be chatty, but you may want to see if strace helps in diagnosing:

mpirun -np 2 (all your favorite mpi args) strace -f send_recv 1000 1

Huge pages often help reduce pressure on a NIC's I/O MMU and speed up resolving virtual to physical memory addresses.

Good luck,

Howard

On 7/19/23, 9:24 PM, "users on behalf of Chandran, Arun via users" <users@lists.open-mpi.org> wrote:

Hi,

I am trying to use static huge pages, not transparent huge pages. UCX is allowed to allocate via hugetlbfs:

$ ./bin/ucx_info -c | grep -i huge
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_SYSV_HUGETLB_MODE=try    ---> it is trying this and failing
UCX_SYSV_FIFO_HUGETLB=n
UCX_POSIX_HUGETLB_MODE=try   ---> it is trying this and failing
UCX_POSIX_FIFO_HUGETLB=n
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_CMA_ALLOC=huge,thp,mmap,heap

It is failing even though I have static hugepages available on my system.

$ cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 20

THP is also enabled:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

--Arun

-----Original Message-----
From: Florent GERMAIN <florent.germ...@eviden.com>
Sent: Wednesday, July 19, 2023 7:51 PM
To: Open MPI Users <users@lists.open-mpi.org>; Chandran, Arun <arun.chand...@amd.com>
Subject: RE: How to use hugetlbfs with openmpi and ucx

Hi,

You can check if there are dedicated huge pages on your system or if transparent huge pages are allowed.
Transparent huge pages on rhel systems:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

-> this means that transparent huge pages are selected through mmap + madvise
-> always = always try to aggregate pages on thp (for large enough allocations with good alignment)
-> never = never try to aggregate pages on thp

Dedicated huge pages on rhel systems:

$ cat /proc/meminfo | grep HugePages_Total
HugePages_Total: 0
-> no dedicated huge pages here

It seems that ucx tries to use dedicated huge pages (mmap(addr=(nil), length=6291456, flags=HUGETLB, fd=29)). If there are no dedicated huge pages available, mmap fails.

Huge pages can accelerate virtual address to physical address translation and reduce TLB consumption. They may be useful for large and frequently used buffers.

Regards,
Florent

-----Original Message-----
From: users <users-boun...@lists.open-mpi.org> on behalf of Chandran, Arun via users
Sent: Wednesday, July 19, 2023 15:44
To: users@lists.open-mpi.org
Cc: Chandran, Arun <arun.chand...@amd.com>
Subject: [OMPI users] How to use hugetlbfs with openmpi and ucx

Hi All,

I am trying to see whether hugetlbfs improves the latency of communication with a small send/receive program:

mpirun -np 2 --map-by core --bind-to core --mca pml ucx --mca opal_common_ucx_tls any --mca opal_common_ucx_devices any -mca pml_base_verbose 10 --mca mtl_base_verbose 10 -x OMPI_MCA_pml_ucx_verbose=10 -x UCX_LOG_LEVEL=debug -x UCX_PROTO_INFO=y send_recv 1000 1

But the internal buffer allocation in ucx is unable to select the hugetlbfs.
[1688297246.205092] [lib-ssp-04:4022755:0] ucp_context.c:1979 UCX DEBUG allocation method[2] is 'huge'
[1688297246.208660] [lib-ssp-04:4022755:0] mm_sysv.c:97 UCX DEBUG mm failed to allocate 8447 bytes with hugetlb  -> I checked the code; this is a valid failure, as the size is small compared to the 2 MB huge page size
[1688297246.208704] [lib-ssp-04:4022755:0] mm_sysv.c:97 UCX DEBUG mm failed to allocate 4292720 bytes with hugetlb
[1688297246.210048] [lib-ssp-04:4022755:0] mm_posix.c:332 UCX DEBUG shared memory mmap(addr=(nil), length=6291456, flags= HUGETLB, fd=29) failed: Invalid argument
[1688297246.211451] [lib-ssp-04:4022754:0] ucp_context.c:1979 UCX DEBUG allocation method[2] is 'huge'
[1688297246.214849] [lib-ssp-04:4022754:0] mm_sysv.c:97 UCX DEBUG mm failed to allocate 8447 bytes with hugetlb
[1688297246.214888] [lib-ssp-04:4022754:0] mm_sysv.c:97 UCX DEBUG mm failed to allocate 4292720 bytes with hugetlb
[1688297246.216235] [lib-ssp-04:4022754:0] mm_posix.c:332 UCX DEBUG shared memory mmap(addr=(nil), length=6291456, flags= HUGETLB, fd=29) failed: Invalid argument

Can someone suggest what steps need to be done to enable hugetlbfs (I cannot run my application as root)? Is using hugetlbfs for the internal buffers recommended?

--Arun
Re: [OMPI users] [EXTERNAL] Requesting information about MPI_T events
Hi Kingshuk,

Looks like the MPI_T Events feature is parked in this PR at the moment: https://github.com/open-mpi/ompi/pull/8057

Howard

From: users on behalf of Kingshuk Haldar via users
Reply-To: Open MPI Users
Date: Wednesday, March 15, 2023 at 4:00 AM
To: OpenMPI-lists-users
Cc: Kingshuk Haldar
Subject: [EXTERNAL] [OMPI users] Requesting information about MPI_T events

Hi all,

Is there any public branch of OpenMPI with which one can test the MPI_T Events interface? Alternatively, any information about its potential availability in upcoming releases would be good to know.

Best,
--
Kingshuk Haldar   email: kingshuk.hal...@hlrs.de
Re: [OMPI users] [EXTERNAL] OFI, destroy_vni_context(1137).......: OFI domain close failed (ofi_init.c:1137:destroy_vni_context:Device or resource busy)
Hi,

You are using MPICH or a vendor derivative of MPICH. You probably want to resend this email to the MPICH users/help mail list.

Howard

From: users on behalf of mrlong via users
Reply-To: Open MPI Users
Date: Tuesday, November 1, 2022 at 11:26 AM
To: "de...@lists.open-mpi.org", "users@lists.open-mpi.org"
Cc: mrlong
Subject: [EXTERNAL] [OMPI users] OFI, destroy_vni_context(1137)...: OFI domain close failed (ofi_init.c:1137:destroy_vni_context:Device or resource busy)

Hi, teachers

Code:

import mpi4py
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
print("rank", rank)

if __name__ == '__main__':
    if rank == 0:
        mem = np.array([0], dtype='i')
        win = MPI.Win.Create(mem, comm=comm)
    else:
        win = MPI.Win.Create(None, comm=comm)
    print(rank, "end")

(py3.6.8) ➜ ~ mpirun -n 2 python -u test.py
rank 0
rank 1
0 end
1 end
Abort(806449679): Fatal error in internal_Finalize: Other MPI error, error stack:
internal_Finalize(50)...: MPI_Finalize failed
MPII_Finalize(345)..:
MPID_Finalize(511)..:
MPIDI_OFI_mpi_finalize_hook(895):
destroy_vni_context(1137)...: OFI domain close failed (ofi_init.c:1137:destroy_vni_context:Device or resource busy)

Why is this happening? How to debug? This error is not reported on the other machine.
Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI Installation - pmi.h Error
Hi Jeff,

I think you are now at the "send the system admin an email to install RPMs" stage; in particular, ask that the numa and udev devel RPMs be installed. They will need to install these RPMs on the compute node image(s) as well.

Howard

From: "Jeffrey D. (JD) Tamucci"
Date: Wednesday, October 5, 2022 at 9:20 AM
To: "Pritchard Jr., Howard"
Cc: "bbarr...@amazon.com", Open MPI Users
Subject: Re: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

Gladly. I tried it that way and it worked, in that it was able to find pmi.h. Unfortunately there's a new error about finding -lnuma and -ludev:

make[2]: Entering directory '/shared/maylab/src/openmpi-4.1.4/opal'
  CCLD     libopen-pal.la
/usr/bin/ld: cannot find -lnuma
/usr/bin/ld: cannot find -ludev
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:2249: libopen-pal.la] Error 1
make[2]: Leaving directory '/shared/maylab/src/openmpi-4.1.4/opal'
make[1]: *** [Makefile:2394: install-recursive] Error 1
make[1]: Leaving directory '/shared/maylab/src/openmpi-4.1.4/opal'
make: *** [Makefile:1912: install-recursive] Error 1

Here is a dropbox link to the full output: https://www.dropbox.com/s/4rv8n2yp320ix08/ompi-output_Oct4_2022.tar.bz2?dl=0

Thank you for your help!
JD

Jeffrey D. (JD) Tamucci
University of Connecticut
Molecular & Cell Biology
RA in Lab of Eric R.
May
PhD / MPH Candidate
he/him

On Tue, Oct 4, 2022 at 1:51 PM Pritchard Jr., Howard <howa...@lanl.gov> wrote:

*Message sent from a system outside of UConn.*

Could you change the --with-pmi to be --with-pmi=/cm/shared/apps/slurm21.08.8 ?

From: "Jeffrey D. (JD) Tamucci" <jeffrey.tamu...@uconn.edu>
Date: Tuesday, October 4, 2022 at 10:40 AM
To: "Pritchard Jr., Howard" <howa...@lanl.gov>, "bbarr...@amazon.com"
Cc: Open MPI Users
Subject: Re: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

Hi Howard and Brian,

Of course. Here's a dropbox link to the full folder: https://www.dropbox.com/s/raqlcnpgk9wz78b/ompi-output_Sep30_2022.tar.bz2?dl=0

These were the configure and make commands:

./configure \
--prefix=/shared/maylab/mayapps/mpi/openmpi/4.1.4 \
--with-slurm \
--with-lsf=no \
--with-pmi=/cm/shared/apps/slurm/21.08.8/include/slurm \
--with-pmi-libdir=/cm/shared/apps/slurm/21.08.8/lib64 \
--with-hwloc=/cm/shared/apps/hwloc/1.11.11 \
--with-cuda=/gpfs/sharedfs1/admin/hpc2.0/apps/cuda/11.6 \
--enable-shared \
--enable-static \
&& make -j 32 && make -j 32 check
make install

The output of the make command is in the install_open-mpi_4.1.4_hpc2.log file.

Jeffrey D. (JD) Tamucci
University of Connecticut
Molecular & Cell Biology
RA in Lab of Eric R. May
PhD / MPH Candidate
he/him

On Tue, Oct 4, 2022 at 12:33 PM Pritchard Jr., Howard <howa...@lanl.gov> wrote:

*Message sent from a system outside of UConn.*

Hi JD,

Could you post the configure options your script uses to build Open MPI?

Howard

From: users on behalf of "Jeffrey D.
(JD) Tamucci via users" mailto:users@lists.open-mpi.org>> Reply-To: Open MPI Users mailto:users@lists.open-mpi.org>> Date: Tuesday, October 4, 2022 at 10:07 AM To: "users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>" mailto:users@lists.open-mpi.org>> Cc: "Jeffrey D. (JD) Tamucci" mailto:jeffrey.tamu...@uconn.edu>> Subject: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error Hi, I have been trying to install OpenMPI v4.1.4 on a university HPC cluster. We use the Bright cluster manager and have SLURM v21.08.8 and RHEL 8.6. I used a script to install OpenMPI that a former co-worker had used to successfully install OpenMPI v3.0.0 previously. I updated it to include new versions of the dependencies and new paths to those installs. Each t
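The missing -lnuma and -ludev libraries come from distro development packages, which is why Howard points at the admins. A minimal pre-check sketch you could run before (or after) mailing them; the RHEL 8 package names in the comment are an assumption, and the packages must also land on the compute node images:

```shell
# On RHEL 8 the relevant packages are likely numactl-devel and systemd-devel
# (assumed names; adjust for your distro):
#   dnf install numactl-devel systemd-devel
# Check whether the dynamic linker can already see the libraries:
missing=""
for lib in numa udev; do
  if ldconfig -p 2>/dev/null | grep -q "lib${lib}\.so"; then
    :  # library visible to the linker cache
  else
    missing="$missing lib${lib}"
  fi
done
echo "missing dev libs:${missing:- none}"
```

If anything is reported missing, re-run configure and make only after the devel packages are installed, since configure caches what it found.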
Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI Installation - pmi.h Error
Could you change the --with-pmi to be --with-pmi=/cm/shared/apps/slurm/21.08.8 ?

From: "Jeffrey D. (JD) Tamucci"
Date: Tuesday, October 4, 2022 at 10:40 AM
To: "Pritchard Jr., Howard", "bbarr...@amazon.com"
Cc: Open MPI Users
Subject: Re: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

Hi Howard and Brian, Of course. Here's a Dropbox link to the full folder: https://www.dropbox.com/s/raqlcnpgk9wz78b/ompi-output_Sep30_2022.tar.bz2?dl=0

These were the configure and make commands:

./configure \
  --prefix=/shared/maylab/mayapps/mpi/openmpi/4.1.4 \
  --with-slurm \
  --with-lsf=no \
  --with-pmi=/cm/shared/apps/slurm/21.08.8/include/slurm \
  --with-pmi-libdir=/cm/shared/apps/slurm/21.08.8/lib64 \
  --with-hwloc=/cm/shared/apps/hwloc/1.11.11 \
  --with-cuda=/gpfs/sharedfs1/admin/hpc2.0/apps/cuda/11.6 \
  --enable-shared \
  --enable-static \
&& make -j 32 && make -j 32 check
make install

The output of the make command is in the install_open-mpi_4.1.4_hpc2.log file.

Jeffrey D. (JD) Tamucci
University of Connecticut, Molecular & Cell Biology
RA in Lab of Eric R. May PhD / MPH Candidate, he/him

On Tue, Oct 4, 2022 at 12:33 PM Pritchard Jr., Howard wrote:
*Message sent from a system outside of UConn.*
Hi JD, Could you post the configure options your script uses to build Open MPI? Howard

From: users on behalf of "Jeffrey D. (JD) Tamucci via users"
Reply-To: Open MPI Users
Date: Tuesday, October 4, 2022 at 10:07 AM
To: "users@lists.open-mpi.org"
Cc: "Jeffrey D. (JD) Tamucci"
Subject: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error

Hi, I have been trying to install OpenMPI v4.1.4 on a university HPC cluster. We use the Bright cluster manager and have SLURM v21.08.8 and RHEL 8.6. I used a script to install OpenMPI that a former co-worker had used to successfully install OpenMPI v3.0.0 previously. I updated it to include new versions of the dependencies and new paths to those installs. Each time, it fails in the make install step. There is a fatal error about finding pmi.h. It specifically says:

make[2]: Entering directory '/shared/maylab/src/openmpi-4.1.4/opal/mca/pmix/s1'
  CC       libmca_pmix_s1_la-pmix_s1_component.lo
  CC       libmca_pmix_s1_la-pmix_s1.lo
pmix_s1.c:29:10: fatal error: pmi.h: No such file or directory
   29 | #include <pmi.h>

I've looked through the archives and seen others face similar errors in years past, but I couldn't understand the solutions. One person suggested that SLURM may be missing PMI libraries. I think I've verified that SLURM has PMI. I include paths to those files, and it seems to find them earlier in the process. I'm not sure what the next step is in troubleshooting this. I have included a bz2 file containing my install script, a log file containing the script output (from build, make, make install), the config.log, and the opal_config.h file. If anyone could provide any guidance, I'd sincerely appreciate it. Best, JD
Re: [OMPI users] [EXTERNAL] Beginner Troubleshooting OpenMPI Installation - pmi.h Error
HI JD, Could you post the configure options your script uses to build Open MPI? Howard From: users on behalf of "Jeffrey D. (JD) Tamucci via users" Reply-To: Open MPI Users Date: Tuesday, October 4, 2022 at 10:07 AM To: "users@lists.open-mpi.org" Cc: "Jeffrey D. (JD) Tamucci" Subject: [EXTERNAL] [OMPI users] Beginner Troubleshooting OpenMPI Installation - pmi.h Error Hi, I have been trying to install OpenMPI v4.1.4 on a university HPC cluster. We use the Bright cluster manager and have SLURM v21.08.8 and RHEL 8.6. I used a script to install OpenMPI that a former co-worker had used to successfully install OpenMPI v3.0.0 previously. I updated it to include new versions of the dependencies and new paths to those installs. Each time, it fails in the make install step. There is a fatal error about finding pmi.h. It specifically says: make[2]: Entering directory '/shared/maylab/src/openmpi-4.1.4/opal/mca/pmix/s1' CC libmca_pmix_s1_la-pmix_s1_component.lo CC libmca_pmix_s1_la-pmix_s1.lo pmix_s1.c:29:10: fatal error: pmi.h: No such file or directory 29 | #include I've looked through the archives and seen others face similar errors in years past but I couldn't understand the solutions. One person suggested that SLURM may be missing PMI libraries. I think I've verified that SLURM has PMI. I include paths to those files and it seems to find them earlier in the process. I'm not sure what the next step is in troubleshooting this. I have included a bz2 file containing my install script, a log file containing the script output (from build, make, make install), the config.log, and the opal_config.h file. If anyone could provide any guidance, I'd sincerely appreciate it. Best, JD
Re: [OMPI users] [EXTERNAL] Problem with Mellanox ConnectX3 (FDR) and openmpi 4
Hi Boyrie, The warning message is coming from the older ibverbs component of the Open MPI 4.0/4.1 releases. You can get rid of this message in several ways. One, at configure time, is to add --disable-verbs to the configure options. At runtime you can set

export OMPI_MCA_btl=^openib

The ucx messages are just being chatty about which ucx transport type is being selected. The VASP hang may be something else. Howard

From: users on behalf of Boyrie Fabrice via users
Reply-To: Open MPI Users
Date: Friday, August 19, 2022 at 9:51 AM
To: "users@lists.open-mpi.org"
Cc: Boyrie Fabrice
Subject: [EXTERNAL] [OMPI users] Problem with Mellanox ConnectX3 (FDR) and openmpi 4

Hi, I had to reinstall a cluster under AlmaLinux 8.6, and I am unable to make openmpi 4 work with InfiniBand. I get the following message in a trivial pingpong test:

mpirun --hostfile hostfile -np 2 pingpong
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
  Local host:   node2
  Local device: mlx4_0
--------------------------------------------------------------------------
[node2:12431] common_ucx.c:107 using OPAL memory hooks as external events
[node2:12431] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[node1:13188] common_ucx.c:174 using OPAL memory hooks as external events
[node1:13188] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[node2:12431] pml_ucx.c:289 mca_pml_ucx_init
[node1:13188] common_ucx.c:333 posix/memory: did not match transport list
[node1:13188] common_ucx.c:333 sysv/memory: did not match transport list
[node1:13188] common_ucx.c:333 self/memory0: did not match transport list
[node1:13188] common_ucx.c:333 tcp/lo: did not match transport list
[node1:13188] common_ucx.c:333 tcp/eno1: did not match transport list
[node1:13188] common_ucx.c:333 tcp/ib0: did not match transport list
[node1:13188] common_ucx.c:228 driver '../../../../bus/pci/drivers/mlx4_core' matched by 'mlx*'
[node1:13188] common_ucx.c:324 rc_verbs/mlx4_0:1: matched both transport and device list
[node1:13188] common_ucx.c:337 support level is transports and devices
[node1:13188] pml_ucx.c:289 mca_pml_ucx_init
[node2:12431] pml_ucx.c:114 Pack remote worker address, size 155
[node2:12431] pml_ucx.c:114 Pack local worker address, size 291
[node2:12431] pml_ucx.c:351 created ucp context 0xf832a0, worker 0x109fc50
[node1:13188] pml_ucx.c:114 Pack remote worker address, size 155
[node1:13188] pml_ucx.c:114 Pack local worker address, size 291
[node1:13188] pml_ucx.c:351 created ucp context 0x1696320, worker 0x16c9ce0
[node1:13188] pml_ucx_component.c:147 returning priority 51
[node2:12431] pml_ucx.c:182 Got proc 0 address, size 291
[node2:12431] pml_ucx.c:411 connecting to proc. 0
[node1:13188] pml_ucx.c:182 Got proc 1 address, size 291
[node1:13188] pml_ucx.c:411 connecting to proc. 1

 length   time/message (usec)   transfer rate (Gbyte/sec)

[node2:12431] pml_ucx.c:182 Got proc 1 address, size 155
[node2:12431] pml_ucx.c:411 connecting to proc. 1
[node1:13188] pml_ucx.c:182 Got proc 0 address, size 155
[node1:13188] pml_ucx.c:411 connecting to proc. 0

      1   45.683729    0.88
   1001    4.286029    0.934198
   2001    5.755391    1.390696
   3001    6.902443    1.739095
   4001    8.485305    1.886084
   5001    9.596994    2.084403
   6001   11.055146    2.171297
   7001   11.977093    2.338130
   8001   13.324408    2.401908
   9001   14.471116    2.487991
  10001   15.806676    2.530829

[node2:12431] common_ucx.c:240 disconnecting from rank 0
[node2:12431] common_ucx.c:240 disconnecting from rank 1
[node2:12431] common_ucx.c:204 waiting for 1 disconnect requests
[node2:12431] common_ucx.c:204 waiting for 0 disconnect requests
[node1:13188] common_ucx.c:466 disconnecting from rank 0
[node1:13188] common_ucx.c:430 waiting for 1 disconnect requests
[node1:13188] common_ucx.c:466 disconnecting from rank 1
[node1:13188] common_ucx.c:430 waiting for 0 disconnect requests
[node2:12431] pml_ucx.c:367 mca_pml_ucx_cleanup
[node1:13188] pml_ucx.c:367 mca_pml_ucx_cleanup
[node2:12431] pml_ucx.c:268 mca_pml_ucx_close
[node1:13188] pml_ucx.c:268 mca_pml_ucx_close

cat hostfile
node1 slots=1
node2 slots=1

And with a real program (VASP) it stops.
InfiniBand seems to be working: I can ssh over InfiniBand, and qperf works in rdma mode:

qperf -t 10 ibnode1 ud_lat ud_bw
ud_lat: latency = 18.2 us
ud_bw:  send_bw = 2.81 GB/sec
        recv_bw = 2.81 GB/sec

I use the standard AlmaLinux module for InfiniBand:

82:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

I can not install MLNX_OFED_LINUX-5.6-2.0.9.0-rhel8.6-x86_64 because it does not support ConnectX-3, and I can not install MLNX_OFED_LINUX-4.9-5.1.0.0-rhel8.6-x86_64 because the module compilation fails.
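The runtime workaround from Howard's reply, expressed as a standard Open MPI MCA environment setting (the command-line form in the comment is equivalent):

```shell
# Disable the openib BTL at run time; UCX then carries the IB traffic
# and the OpenFabrics warning goes away:
export OMPI_MCA_btl=^openib
# Equivalent per-run form:
#   mpirun --mca btl ^openib --hostfile hostfile -np 2 pingpong
echo "btl selection: $OMPI_MCA_btl"
```

The `^` prefix means "everything except the listed components", so shared-memory and self BTLs stay available for on-node traffic.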
Re: [OMPI users] [EXTERNAL] Java Segmentation Fault
Hi Janek, A few questions. First, which version of Open MPI are you using? Did you compile your code with the Open MPI mpijavac wrapper? Howard

From: users on behalf of "Laudan, Janek via users"
Reply-To: "Laudan, Janek", Open MPI Users
Date: Thursday, March 17, 2022 at 9:52 AM
To: "users@lists.open-mpi.org"
Cc: "Laudan, Janek"
Subject: [EXTERNAL] [OMPI users] Java Segmentation Fault

Hi, I am trying to extend an existing Java project to be run with open-mpi. I have managed to successfully set up open-mpi and my project on my local machine to conduct some test runs. However, when I tried to set up things on our cluster I ran into some problems. I was able to run some trivial examples such as "HelloWorld" and "Ring", which I found in the ompi GitHub repo. Unfortunately, when I try to run our app wrapped between MPI.Init(args) and MPI.Finalize() I get the following segmentation fault:

$ mpirun -np 1 java -cp matsim-p-1.0-SNAPSHOT.jar org.matsim.parallel.RunMinimalMPIExample
Java-Version: 11.0.2
before getTestScenario
before load config
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
[cluster-i:1272 :0:1274] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc)
backtrace (tid: 1274)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x14a85752fdf4, pid=1272, tid=1274
#
# JRE version: Java(TM) SE Runtime Environment (11.0.2+9) (build 11.0.2+9-LTS)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (11.0.2+9-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 612 c2 java.lang.StringBuilder.append(Ljava/lang/String;)Ljava/lang/StringBuilder; java.base@11.0.2 (8 bytes) @ 0x14a85752fdf4 [0x14a85752fdc0+0x0034]
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /net/ils/laudan/mpi-test/matsim-p/hs_err_pid1272.log

Compiled method (c2) 1052 612 4 java.lang.StringBuilder::append (8 bytes)
 total in heap  [0x14a85752fc10,0x14a8575306a8] = 2712
 relocation     [0x14a85752fd88,0x14a85752fdb8] = 48
 main code      [0x14a85752fdc0,0x14a857530360] = 1440
 stub code      [0x14a857530360,0x14a857530378] = 24
 metadata       [0x14a857530378,0x14a8575303c0] = 72
 scopes data    [0x14a8575303c0,0x14a857530578] = 440
 scopes pcs     [0x14a857530578,0x14a857530658] = 224
 dependencies   [0x14a857530658,0x14a857530660] = 8
 handler table  [0x14a857530660,0x14a857530678] = 24
 nul chk table  [0x14a857530678,0x14a8575306a8] = 48
Compiled method (c1) 1053 263 3 java.lang.StringBuilder::<init> (7 bytes)
 total in heap  [0x14a850102790,0x14a850102b30] = 928
 relocation     [0x14a850102908,0x14a850102940] = 56
 main code      [0x14a850102940,0x14a850102a20] = 224
 stub code      [0x14a850102a20,0x14a850102ac8] = 168
 metadata       [0x14a850102ac8,0x14a850102ad0] = 8
 scopes data    [0x14a850102ad0,0x14a850102ae8] = 24
 scopes pcs     [0x14a850102ae8,0x14a850102b28] = 64
 dependencies   [0x14a850102b28,0x14a850102b30] = 8
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
[cluster-i:01272] *** Process received signal ***
[cluster-i:01272] Signal: Aborted (6)
[cluster-i:01272] Signal code:  (-6)
[cluster-i:01272] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x14a86e477630]
[cluster-i:01272] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x14a86dcbb387]
[cluster-i:01272] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x14a86dcbca78]
[cluster-i:01272] [ 3] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xc00be9)[0x14a86d3f8be9]
[cluster-i:01272] [ 4] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29619)[0x14a86d621619]
[cluster-i:01272] [ 5] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29e9b)[0x14a86d621e9b]
[cluster-i:01272] [ 6] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xe29ece)[0x14a86d621ece]
[cluster-i:01272] [ 7] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(JVM_handle_linux_signal+0x1c0)[0x14a86d403a00]
[cluster-i:01272] [ 8] /afs/math.tu-berlin.de/software/java/jdk-11.0.2/lib/server/libjvm.so(+0xbff5e8)[0x14a86d3f75e8]
[cluster-i:01272] [ 9] /usr/lib64/libpthread.so.0(+0xf630)[0x14a86e477630]
[cluster-i:01272] [10] [0x14a85752fdf4]
[cluster-i:01272] *** End of error message ***
Re: [OMPI users] [EXTERNAL] OpenMPI, Slurm and MPI_Comm_spawn
Hi Kurt, This documentation is rather Slurm-centric. If you build the Open MPI 4.1.x series the default way, it will build its internal PMIx package and use that when launching your app with mpirun. In that case, you can use MPI_Comm_spawn within a Slurm allocation as long as there are sufficient slots in the allocation to hold both the spawner processes and the spawnee processes. Note the Slurm PMIx implementation doesn't support spawn, at least currently, so the documentation is accurate if you are building Open MPI against the Slurm PMIx library. In any case, you can't use MPI_Comm_spawn if you use srun to launch the application. Hope this helps, Howard

From: users on behalf of "Mccall, Kurt E. (MSFC-EV41) via users"
Reply-To: Open MPI Users
Date: Tuesday, March 8, 2022 at 7:49 AM
To: "OpenMpi User List (users@lists.open-mpi.org)"
Cc: "Mccall, Kurt E. (MSFC-EV41)"
Subject: [EXTERNAL] [OMPI users] OpenMPI, Slurm and MPI_Comm_spawn

The Slurm MPI User's Guide at https://slurm.schedmd.com/mpi_guide.html#open_mpi has a note that states:

NOTE: OpenMPI has a limitation that does not support calls to MPI_Comm_spawn() from within a Slurm allocation. If you need to use the MPI_Comm_spawn() function you will need to use another MPI implementation combined with PMI-2 since PMIx doesn't support it either.

Is this still true in OpenMPI 4.1? Thanks, Kurt
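The slot requirement in Howard's answer (spawners plus spawnees must fit in the allocation) can be sanity-checked before launching. A toy sketch, with hypothetical numbers:

```shell
# e.g. salloc -N 2 -n 8 gives the allocation 8 slots in total:
TOTAL_SLOTS=8
SPAWNERS=4      # mpirun -np 4 ./spawner
SPAWNEES=4      # MPI_Comm_spawn with maxprocs=4
if [ $((SPAWNERS + SPAWNEES)) -le "$TOTAL_SLOTS" ]; then
  echo "spawn fits in allocation"
else
  echo "not enough slots for MPI_Comm_spawn"
fi
```

The key operational point from the thread survives the arithmetic: launch with mpirun (which brings Open MPI's internal PMIx) rather than srun, and leave enough free slots for the processes MPI_Comm_spawn will create.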
Re: [OMPI users] [EXTERNAL] openib BTL disabled when using MPI_Init_thread
Hi Jose, I bet this device has not been tested with UCX. You may want to join the UCX users mailing list at https://elist.ornl.gov/mailman/listinfo/ucx-group and ask whether this Marvell device has been tested, and about workarounds for disabling features that the device doesn't support. Again, though, you really may want to first see if the TCP BTL will be good enough for your cluster. Howard

On 2/4/22, 8:03 AM, "Jose E. Roman" wrote:

Howard, I don't have much time now to try with --enable-debug. The RoCE device we have is a FastLinQ QL41000 Series 10/25/40/50GbE Controller. The output of ibv_devinfo is:

hca_id: qedr0
    transport:        InfiniBand (0)
    fw_ver:           8.20.0.0
    node_guid:        2267:7cff:fe11:4a50
    sys_image_guid:   2267:7cff:fe11:4a50
    vendor_id:        0x1077
    vendor_part_id:   32880
    hw_ver:           0x0
    phys_port_cnt:    1
    port: 1
        state:        PORT_ACTIVE (4)
        max_mtu:      4096 (5)
        active_mtu:   1024 (3)
        sm_lid:       0
        port_lid:     0
        port_lmc:     0x00
        link_layer:   Ethernet

hca_id: qedr1
    transport:        InfiniBand (0)
    fw_ver:           8.20.0.0
    node_guid:        2267:7cff:fe11:4a51
    sys_image_guid:   2267:7cff:fe11:4a51
    vendor_id:        0x1077
    vendor_part_id:   32880
    hw_ver:           0x0
    phys_port_cnt:    1
    port: 1
        state:        PORT_DOWN (1)
        max_mtu:      4096 (5)
        active_mtu:   1024 (3)
        sm_lid:       0
        port_lid:     0
        port_lmc:     0x00
        link_layer:   Ethernet

Regarding UCX, we have tried with the latest version.
Compilation goes through, but the ucx_info command gives an error:

# Memory domain: qedr0
#     Component: ib
#         register: unlimited, cost: 180 nsec
#         remote key: 8 bytes
#         local memory handle is required for zcopy
#
#   Transport: rc_verbs
#      Device: qedr0:1
#        Type: network
# System device: qedr0 (0)
[1643982133.674556] [kahan01:8217 :0] rc_iface.c:505 UCX ERROR ibv_create_srq() failed: Function not implemented
#   < failed to open interface >
#
#   Transport: ud_verbs
#      Device: qedr0:1
#        Type: network
# System device: qedr0 (0)
[qelr_create_qp:545]create qp: failed on ibv_cmd_create_qp with 22
[1643982133.681169] [kahan01:8217 :0] ib_iface.c:994 UCX ERROR iface=0x56074944bf10: failed to create UD QP TX wr:256 sge:6 inl:64 resp:0 RX wr:4096 sge:1 resp:0: Invalid argument
#   < failed to open interface >
#
# Memory domain: qedr1
#     Component: ib
#         register: unlimited, cost: 180 nsec
#         remote key: 8 bytes
#         local memory handle is required for zcopy
#   < no supported devices found >

Any idea what the error in ibv_create_srq() means? Thanks for your help. Jose

> On 3 Feb 2022, at 17:52, Pritchard Jr., Howard wrote:
>
> Hi Jose,
>
> A number of things.
>
> First, for recent versions of Open MPI including the 4.1.x release stream, MPI_THREAD_MULTIPLE is supported by default. However, some transport options available when using MPI_Init may not be available when requesting MPI_THREAD_MULTIPLE. You may want to let Open MPI trundle along with tcp used for inter-node messaging and see if your application performs well enough. For a small system tcp may well suffice.
>
> Second, if you want to pursue this further, you want to rebuild Open MPI with --enable-debug. The debug output will be considerably more verbose and provides more info. I think you will get a message saying the rdmacm CPC is excluded owing to the requested thread support level. There may be info about why udcm is not selected as well.
>
> Third, what sort of RoCE devices are available on your system?
The output from ibv_devinfo may be useful.
>
> As for UCX, if it's the version that came with your Ubuntu 18.04 release, it may be pretty old. It's likely that UCX has not been tested on the RoCE devices on your system.
Re: [OMPI users] [EXTERNAL] openib BTL disabled when using MPI_Init_thread
Hi Jose, A number of things.

First, for recent versions of Open MPI, including the 4.1.x release stream, MPI_THREAD_MULTIPLE is supported by default. However, some transport options available when using MPI_Init may not be available when requesting MPI_THREAD_MULTIPLE. You may want to let Open MPI trundle along with tcp used for inter-node messaging and see if your application performs well enough; for a small system tcp may well suffice.

Second, if you want to pursue this further, you want to rebuild Open MPI with --enable-debug. The debug output will be considerably more verbose and provides more info. I think you will get a message saying the rdmacm CPC is excluded owing to the requested thread support level. There may be info about why udcm is not selected as well.

Third, what sort of RoCE devices are available on your system? The output from ibv_devinfo may be useful.

As for UCX, if it's the version that came with your Ubuntu 18.04 release, it may be pretty old. It's likely that UCX has not been tested on the RoCE devices on your system. You can run ucx_info -v to check the version number of the UCX that you are picking up. You can download the latest release of UCX at https://github.com/openucx/ucx/releases/tag/v1.12.0 and instructions for how to build are in the README.md at https://github.com/openucx/ucx. You will want to configure with

contrib/configure-release-mt --enable-gtest

You want to add --enable-gtest to the configure options so that you can run the UCX sanity checks. Note this takes quite a while to run but is pretty thorough at validating your UCX build. You'll want to run this test on one of the nodes with a RoCE device:

ucx_info -d

This will show which UCX transports/devices are available. See the "Running internal unit tests" section of the README.md. Hope this helps, Howard

On 2/3/22, 8:46 AM, "Jose E. Roman" wrote:

Thanks.
The verbose output is:

[kahan01.upvnet.upv.es:29732] mca: base: components_register: registering framework btl components
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component self
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component self register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component sm
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component openib
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component openib register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component vader
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component vader register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_register: found loaded component tcp
[kahan01.upvnet.upv.es:29732] mca: base: components_register: component tcp register function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: opening btl components
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component self
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component self open function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component openib
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component openib open function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component vader
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component vader open function successful
[kahan01.upvnet.upv.es:29732] mca: base: components_open: found loaded component tcp
[kahan01.upvnet.upv.es:29732] mca: base: components_open: component tcp open function successful
[kahan01.upvnet.upv.es:29732] select: initializing btl component self
[kahan01.upvnet.upv.es:29732] select: init of component self returned success
[kahan01.upvnet.upv.es:29732] select: initializing btl component openib
[kahan01.upvnet.upv.es:29732] Checking distance from this process to device=qedr0
[kahan01.upvnet.upv.es:29732] hwloc_distances->nbobjs=4
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[0]=10
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[1]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[2]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[3]=16
[kahan01.upvnet.upv.es:29732] ibv_obj->type set to NULL
[kahan01.upvnet.upv.es:29732] Process is bound: distance to device is 0.00
[kahan01.upvnet.upv.es:29732] Checking distance from this process to device=qedr1
[kahan01.upvnet.upv.es:29732] hwloc_distances->nbobjs=4
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[0]=10
[kahan01.upvnet.upv.es:29732] hwloc_distances->values[1]=16
[kahan01.upvnet.upv.es:29732] hwloc_distances->value
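The ucx_info checks suggested in this thread can be wrapped so they degrade gracefully on machines where UCX is not on the PATH. A sketch; the output is system-dependent, and the -d check should be run on a node that actually has the RoCE device:

```shell
# Report the UCX version on PATH, then the available transports/devices:
if command -v ucx_info >/dev/null 2>&1; then
  ucx_version=$(ucx_info -v 2>/dev/null | head -n 1)
  ucx_info -d 2>/dev/null | grep -E 'Transport|Device' || true
else
  ucx_version="ucx_info not found on PATH"
fi
echo "UCX check: $ucx_version"
```

A stale distro-provided ucx_info shadowing a freshly built one is a common way to end up testing the wrong UCX, so checking the version first is worthwhile.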
Re: [OMPI users] [EXTERNAL] openib BTL disabled when using MPI_Init_thread
Hello Jose, I suspect the issue here is that the openib BTL isn't finding a connection module when you are requesting MPI_THREAD_MULTIPLE. The rdmacm connection is deselected if the MPI_THREAD_MULTIPLE thread support level is being requested. If you run the test in a shell with

export OMPI_MCA_btl_base_verbose=100

there may be some more info to help diagnose what's going on. Another option would be to build Open MPI with UCX support. That's the better way to use Open MPI over IB/RoCE. Howard

On 2/2/22, 10:52 AM, "users on behalf of Jose E. Roman via users" wrote:

Hi. I am using Open MPI 4.1.1 with the openib BTL on a 4-node cluster with Ethernet 10/25Gb (RoCE). It is using libibverbs from Ubuntu 18.04 (kernel 4.15.0-166-generic). With this hello world example:

#include <stdio.h>
#include <mpi.h>
int main (int argc, char *argv[])
{
  int rank, size, provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("Hello world from process %d of %d, provided=%d\n", rank, size, provided);
  MPI_Finalize();
  return 0;
}

I get the following output when run on one node:

$ ./hellow
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port.

  Local host:     kahan01
  Local device:   qedr0
  Local port:     1
  CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
Hello world from process 0 of 1, provided=1

The message does not appear if I run on the front-end (which does not have the RoCE network), or if I run it on the node either using MPI_Init() instead of MPI_Init_thread() or using MPI_THREAD_SINGLE instead of MPI_THREAD_FUNNELED. Is there any reason why MPI_Init_thread() is behaving differently from MPI_Init()? Note that I am not using threads, and just one MPI process. The question has a second part: is there a way to determine (without running an MPI program) that MPI_Init_thread() won't work but MPI_Init() will work?
I am asking this because PETSc programs default to use MPI_Init_thread() when PETSc's configure script finds the MPI_Init_thread() symbol in the MPI library. But in situations like the one reported here, it would be better to revert to MPI_Init() since MPI_Init_thread() will not work as expected. [The configure script cannot run an MPI program due to batch systems.] Thanks for your help. Jose
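The two concrete suggestions in Howard's reply translate into environment settings like these (standard Open MPI MCA variable spellings; the ucx value only makes sense once Open MPI is built with UCX support):

```shell
# Verbose BTL selection output, for diagnosing why a CPC is rejected:
export OMPI_MCA_btl_base_verbose=100
# Preferred path on IB/RoCE: select the UCX PML explicitly:
export OMPI_MCA_pml=ucx
echo "verbose=$OMPI_MCA_btl_base_verbose pml=$OMPI_MCA_pml"
```

With the verbose flag set, re-running the hello world under mpirun should print the per-component selection decisions, including whether rdmacm or udcm was rejected for the requested thread level.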
[OMPI users] Open MPI v4.0.7rc2 available for testing
A second release candidate for Open MPI v4.0.7 is now available for testing: https://www.open-mpi.org/software/ompi/v4.0/

New fixes with this release candidate:

- Fix an issue with MPI_IALLREDUCE_SCATTER when using large count arguments.
- Fixed an issue with POST/START/COMPLETE/WAIT when using subsets of processes. Thanks to Thomas Gilles for reporting.

Your Open MPI release team.

Howard Pritchard
Research Scientist, HPC-ENV
Los Alamos National Laboratory
howa...@lanl.gov
[OMPI users] Open MPI v4.0.7rc1 available for testing
The first release candidate for Open MPI v4.0.7 is now available for testing: https://www.open-mpi.org/software/ompi/v4.0/

Some fixes include:

- Numerous fixes from vendor partners.
- Fix a problem with a couple of MPI_IALLREDUCE algorithms. Thanks to John Donners for reporting.
- Fix an edge case where MPI_Reduce is invoked with zero count and NULL source and destination buffers.
- Use the mfence instruction in opal_atomic_rmb on x86_64 cpus. Thanks to George Katevenis for proposing a fix.
- Fix an issue with the Open MPI build system using the SLURM-provided PMIx when not requested by the user. Thanks to Alexander Grund for reporting.
- Fix a problem compiling Open MPI with clang on case-insensitive file systems. Thanks to @srpgilles for reporting.
- Fix some OFI usNIC/OFI MTL interaction problems. Thanks to @roguephysicist for reporting this issue.
- Fix a problem with the Posix fbtl component failing to load. Thanks to Honggang Li for reporting.

Your Open MPI release team.

Howard Pritchard
Research Scientist, HPC-ENV
Los Alamos National Laboratory
howa...@lanl.gov
Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"
Hi Greg, I believe so, concerning your TCP question. I think the patch probably isn't actually being used; otherwise you would have noticed the curious print statement. Sorry about that. I'm out of ideas on what may be happening. Howard

From: "Fischer, Greg A."
Date: Friday, October 15, 2021 at 9:17 AM
To: "Pritchard Jr., Howard", Open MPI Users
Cc: "Fischer, Greg A."
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

I tried the patch, but I get the same result:

error obtaining device attributes for mlx4_0 errno says Success

I'm getting (what I think are) good transfer rates using "--mca btl self,tcp" on the osu_bw test (~7000 MB/s). It seems to me that the only way that could be happening is if the InfiniBand interfaces are being used over TCP, correct? Would such an arrangement preclude the ability to do RDMA or openib? Perhaps the network is set up in such a way that the IB hardware is not discoverable by openib? (I'm not a network admin, and I wasn't involved in the setup of the network. Unfortunately, the person who knows the most has recently left the organization.) Greg

From: Pritchard Jr., Howard
Sent: Thursday, October 14, 2021 5:45 PM
To: Fischer, Greg A.; Open MPI Users
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

[External Email] Hi Greg, Oh yes, that's not good about rdmacm, and yes, the OFED looks pretty old. Did you by any chance apply that patch? I generated that for a sysadmin here who was in the situation where they needed to maintain Open MPI 3.1.6 but had to also upgrade to some newer RHEL release, and Open MPI wasn't compiling after the RHEL upgrade. Howard

From: "Fischer, Greg A."
Date: Thursday, October 14, 2021 at 1:47 PM
To: "Pritchard Jr., Howard", Open MPI Users
Cc: "Fischer, Greg A."
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

I added --enable-mt and re-installed UCX. Same result. (I didn't re-compile OpenMPI.) A conspicuous warning I see in my UCX configure output is:

checking for rdma_establish in -lrdmacm... no
configure: WARNING: RDMACM requested but librdmacm is not found or does not provide rdma_establish() API

The version of librdmacm we have comes from librdmacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64, which seems to date from mid-2017. I wonder if that's too old? Greg

From: Pritchard Jr., Howard
Sent: Thursday, October 14, 2021 3:31 PM
To: Fischer, Greg A.; Open MPI Users
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

[External Email] Hi Greg, I think the UCX PML may be discomfited by the lack of thread safety. Could you try using the contrib/configure-release-mt in your ucx folder? You want to add --enable-mt.
That's what stands out in your configure output compared to the one I usually get when building on a MLNX ConnectX-5 cluster with MLNX_OFED_LINUX-4.5-1.0.1.0. Here's the output from one of my UCX configs:

configure: =
configure: UCX build configuration:
configure:         Build prefix: /ucx_testing/ucx/test_install
configure:    Configuration dir: ${prefix}/etc/ucx
configure:   Preprocessor flags: -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:           C compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
configure:         C++ compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure:         Multi-thread: enabled
configure:         NUMA support: disabled
configure:            MPI tests: disabled
configure:          VFS support: no
configure:        Devel headers: no
configure: io_demo CUDA support: no
configure:             Bindings: < >
configure:          UCS modules: < &
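The rebuild being suggested above can be sketched as follows. This is a minimal outline, not a definitive recipe: the source directories and install prefixes are placeholders, and Open MPI must be reconfigured against the resulting UCX for the change to take effect.

```shell
# Reconfigure UCX with multi-threading enabled (the --enable-mt flag from
# the thread). Paths and version numbers are illustrative placeholders.
cd ucx-1.11.2
./contrib/configure-release --prefix=$HOME/ucx-mt --enable-mt
make -j8 install

# Check the configure summary reports "Multi-thread: enabled", then
# rebuild Open MPI pointing --with-ucx at the new install:
cd ../openmpi-4.1.1
./configure --prefix=$HOME/ompi-ucx --with-ucx=$HOME/ucx-mt
make -j8 install
```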
Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"
Hi Greg, Oh yes, that's not good about rdmacm. Yes, the OFED looks pretty old. Did you by any chance apply that patch? I generated it for a sysadmin here who needed to maintain Open MPI 3.1.6 but also had to upgrade to a newer RHEL release, after which Open MPI no longer compiled. Howard

From: "Fischer, Greg A." Date: Thursday, October 14, 2021 at 1:47 PM To: "Pritchard Jr., Howard" , Open MPI Users Cc: "Fischer, Greg A." Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

I added --enable-mt and re-installed UCX. Same result. (I didn't re-compile OpenMPI.) A conspicuous warning I see in my UCX configure output is:

checking for rdma_establish in -lrdmacm... no
configure: WARNING: RDMACM requested but librdmacm is not found or does not provide rdma_establish() API

The version of librdmacm we have comes from librdmacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64, which seems to date from mid-2017. I wonder if that's too old? Greg

From: Pritchard Jr., Howard Sent: Thursday, October 14, 2021 3:31 PM To: Fischer, Greg A. ; Open MPI Users Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success" [External Email]

Hi Greg, I think the UCX PML may be discomfited by the lack of thread safety. Could you try using the contrib/configure-release-mt in your ucx folder? You want to add --enable-mt.
That's what stands out in your configure output compared to the one I usually get when building on a MLNX ConnectX-5 cluster with MLNX_OFED_LINUX-4.5-1.0.1.0. Here's the output from one of my UCX configs:

configure: =
configure: UCX build configuration:
configure:         Build prefix: /ucx_testing/ucx/test_install
configure:    Configuration dir: ${prefix}/etc/ucx
configure:   Preprocessor flags: -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:           C compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
configure:         C++ compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure:         Multi-thread: enabled
configure:         NUMA support: disabled
configure:            MPI tests: disabled
configure:          VFS support: no
configure:        Devel headers: no
configure: io_demo CUDA support: no
configure:             Bindings: < >
configure:          UCS modules: < >
configure:          UCT modules: < ib cma knem >
configure:         CUDA modules: < >
configure:         ROCM modules: < >
configure:           IB modules: < >
configure:          UCM modules: < >
configure:         Perf modules: < >
configure: =

Howard

From: "Fischer, Greg A." Date: Thursday, October 14, 2021 at 12:46 PM To: "Pritchard Jr., Howard" , Open MPI Users Cc: "Fischer, Greg A."
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

Thanks, Howard. I downloaded a current version of UCX (1.11.2) and installed it with OpenMPI 4.1.1. When I try to specify "-mca pml ucx" for a simple, 2-process benchmark problem, I get:

--
No components were able to be opened in the pml framework.

This typically means that either no components of this type were installed, or none of the installed components can be loaded. Sometimes this means that shared libraries required by these components are unable to be found/loaded.

Host: bl1311
Framework: pml
--
[bl1311:20168] PML ucx cannot be selected
[bl1311:20169] PML ucx cannot be selected

I've attached my ucx_info -d output, as well as the ucx configuration information. I'm n
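When the UCX PML refuses to be selected like this, raising the framework's verbosity usually shows which check failed (no usable transports, a library that can't be loaded, and so on). A diagnostic sketch; the executable name is a placeholder:

```shell
# Ask the pml framework to explain its selection decisions.
# "./a.out" stands in for the benchmark executable.
mpirun -np 2 --mca pml ucx --mca pml_base_verbose 10 ./a.out

# Independently, list the transports and devices UCX itself can open:
ucx_info -d
```

If `ucx_info -d` shows only tcp/shared-memory transports and no `ib` devices, the problem is in the UCX build or the OFED stack rather than in Open MPI.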
Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"
Hi Greg, I think the UCX PML may be discomfited by the lack of thread safety. Could you try using the contrib/configure-release-mt in your ucx folder? You want to add --enable-mt. That's what stands out in your configure output compared to the one I usually get when building on a MLNX ConnectX-5 cluster with MLNX_OFED_LINUX-4.5-1.0.1.0. Here's the output from one of my UCX configs:

configure: =
configure: UCX build configuration:
configure:         Build prefix: /ucx_testing/ucx/test_install
configure:    Configuration dir: ${prefix}/etc/ucx
configure:   Preprocessor flags: -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:           C compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
configure:         C++ compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure:         Multi-thread: enabled
configure:         NUMA support: disabled
configure:            MPI tests: disabled
configure:          VFS support: no
configure:        Devel headers: no
configure: io_demo CUDA support: no
configure:             Bindings: < >
configure:          UCS modules: < >
configure:          UCT modules: < ib cma knem >
configure:         CUDA modules: < >
configure:         ROCM modules: < >
configure:           IB modules: < >
configure:          UCM modules: < >
configure:         Perf modules: < >
configure: =====

Howard

From: "Fischer, Greg A."
Date: Thursday, October 14, 2021 at 12:46 PM To: "Pritchard Jr., Howard" , Open MPI Users Cc: "Fischer, Greg A." Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

Thanks, Howard. I downloaded a current version of UCX (1.11.2) and installed it with OpenMPI 4.1.1. When I try to specify "-mca pml ucx" for a simple, 2-process benchmark problem, I get:

--
No components were able to be opened in the pml framework.

This typically means that either no components of this type were installed, or none of the installed components can be loaded. Sometimes this means that shared libraries required by these components are unable to be found/loaded.

Host: bl1311
Framework: pml
--
[bl1311:20168] PML ucx cannot be selected
[bl1311:20169] PML ucx cannot be selected

I've attached my ucx_info -d output, as well as the ucx configuration information. I'm not sure I follow everything on the UCX FAQ page, but it seems like everything is being routed over TCP, which is probably not what I want. Any thoughts as to what I might be doing wrong? Thanks, Greg

From: Pritchard Jr., Howard Sent: Wednesday, October 13, 2021 12:28 PM To: Open MPI Users Cc: Fischer, Greg A. Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success" [External Email]

Hi Greg, It's the aging of the openib btl. You may be able to apply the attached patch. Note the 3.1.x release stream is no longer supported. You may want to try using the 4.1.1 release, in which case you'll want to use UCX. Howard

From: users on behalf of "Fischer, Greg A. via users" Reply-To: Open MPI Users Date: Wednesday, October 13, 2021 at 10:06 AM To: "users@lists.open-mpi.org" Cc: "Fischer, Greg A."
Subject: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

Hello, I have compiled OpenMPI 3.1.6 from source on SLES12-SP3, and I am seeing the following errors when I try to use the openib btl:

WARNING: There was an error initializing an OpenFabrics device. Local ho
Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"
Hi Greg, It's the aging of the openib btl. You may be able to apply the attached patch. Note the 3.1.x release stream is no longer supported. You may want to try using the 4.1.1 release, in which case you'll want to use UCX. Howard

From: users on behalf of "Fischer, Greg A. via users" Reply-To: Open MPI Users Date: Wednesday, October 13, 2021 at 10:06 AM To: "users@lists.open-mpi.org" Cc: "Fischer, Greg A." Subject: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

Hello, I have compiled OpenMPI 3.1.6 from source on SLES12-SP3, and I am seeing the following errors when I try to use the openib btl:

WARNING: There was an error initializing an OpenFabrics device.

Local host: bl1308
Local device: mlx4_0
--
[bl1308][[44866,1],5][../../../../../openmpi-3.1.6/opal/mca/btl/openib/btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success

I have disabled UCX ("--without-ucx") because the UCX installation we have seems to be too out-of-date. ofed_info says "MLNX_OFED_LINUX-4.1-1.0.2.0". I've attached the detailed output of ofed_info and ompi_info. This issue seems similar to Issue #7461 (https://github.com/open-mpi/ompi/issues/7461), which I don't see a resolution for. Does anyone know what the likely explanation is? Is the version of OFED on the system badly out-of-sync with contemporary OpenMPI? Thanks, Greg
0001-patch-ibv_exp_dev_query-function-call.patch Description: 0001-patch-ibv_exp_dev_query-function-call.patch
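As a quick way to confirm whether the openib BTL is the only thing failing, it can be excluded so the run falls back to TCP and shared memory. This is a diagnostic sketch, not a performance configuration; the benchmark name is a placeholder:

```shell
# Exclude the failing openib BTL; traffic falls back to TCP/shared memory.
mpirun -np 2 --mca btl ^openib ./osu_bw

# Equivalently, name the fallback BTLs explicitly:
mpirun -np 2 --mca btl self,vader,tcp ./osu_bw
```

If these runs succeed while the default run fails, the problem is confined to openib's device initialization rather than the fabric itself.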
Re: [OMPI users] [EXTERNAL] Error Signal code: Address not mapped (1)
Hello Arturo, Would you mind filing an issue against Open MPI, using the template to provide info we could use to help triage this problem? https://github.com/open-mpi/ompi/issues/new Thanks, Howard

From: users on behalf of Arturo Fernandez via users Reply-To: Open MPI Users Date: Monday, June 21, 2021 at 3:33 PM To: Open MPI Users Cc: Arturo Fernandez Subject: [EXTERNAL] [OMPI users] Error Signal code: Address not mapped (1)

Hello, I'm getting the error message (with either v4.1.0 or v4.1.1):

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: (nil)
*** End of error message ***
Segmentation fault (core dumped)

The AWS system is running CentOS8, but I don't think that is the problem. After some troubleshooting, the error seems to appear and disappear depending on the libfabric version. When the system uses libfabric-aws-1.10.2g everything sails smoothly; the problems appear when libfabric-aws is upgraded to 1.11.2. I've tried to understand the differences between these versions, but it's beyond my expertise. Thanks, Arturo
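When triaging a crash that tracks the libfabric version like this, it helps to record exactly which libfabric build and providers are being resolved at run time. A sketch; the package names follow the AWS naming in the report, and the executable is a placeholder:

```shell
# Report the libfabric version and the providers it exposes:
fi_info --version
fi_info -l

# On an RPM-based system, confirm which libfabric-aws package is installed:
rpm -qa | grep -i libfabric

# While testing, the provider can be pinned explicitly on the mpirun line, e.g.:
mpirun -np 2 --mca mtl_ofi_provider_include efa ./a.out
```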
Re: [OMPI users] [EXTERNAL] Linker errors in Fedora 34 Docker container
Hi John, Good to know. For the record, were you using a Docker container unmodified from Docker Hub? Howard

From: John Haiducek Date: Wednesday, May 26, 2021 at 9:35 AM To: "Pritchard Jr., Howard" Cc: "users@lists.open-mpi.org" Subject: Re: [EXTERNAL] [OMPI users] Linker errors in Fedora 34 Docker container

That was it, thank you! After installing findutils it builds successfully. John

On May 26, 2021, at 10:49 AM, Pritchard Jr., Howard wrote:

Hi John, I don't like this in the make output:

../../libtool: line 5705: find: command not found

Maybe you need to install findutils or the relevant Fedora RPM in your container? Howard

From: John Haiducek Date: Wednesday, May 26, 2021 at 7:29 AM To: "Pritchard Jr., Howard" , "users@lists.open-mpi.org" Subject: Re: [EXTERNAL] [OMPI users] Linker errors in Fedora 34 Docker container

On May 25, 2021, at 6:53 PM, Pritchard Jr., Howard wrote:

In your build area, do you see any .lo files in opal/util/coeval?

That directory doesn't exist in my build area. In opal/util/keyval I have keyval_lex.lo.

Which compiler are you using?

gcc 11.1.1

Also, are you building from the tarballs at https://www.open-mpi.org/software/ompi/v4.1/ ?

Yes; specifically I'm using the tarball from https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.bz2

John
Re: [OMPI users] [EXTERNAL] Linker errors in Fedora 34 Docker container
Hi John, I don't like this in the make output:

../../libtool: line 5705: find: command not found

Maybe you need to install findutils or the relevant Fedora RPM in your container? Howard

From: John Haiducek Date: Wednesday, May 26, 2021 at 7:29 AM To: "Pritchard Jr., Howard" , "users@lists.open-mpi.org" Subject: Re: [EXTERNAL] [OMPI users] Linker errors in Fedora 34 Docker container

On May 25, 2021, at 6:53 PM, Pritchard Jr., Howard wrote:

In your build area, do you see any .lo files in opal/util/coeval?

That directory doesn't exist in my build area. In opal/util/keyval I have keyval_lex.lo.

Which compiler are you using?

gcc 11.1.1

Also, are you building from the tarballs at https://www.open-mpi.org/software/ompi/v4.1/ ?

Yes; specifically I'm using the tarball from https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.bz2

John
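In a Dockerfile, the fix amounts to installing build prerequisites before running configure, since minimal Fedora base images omit `find` (which libtool shells out to). A minimal sketch for a Fedora 34 image; the exact package list is an assumption beyond findutils:

```shell
# Inside the Fedora 34 container (or as Dockerfile RUN steps):
dnf -y install findutils gcc gcc-c++ make

# Rebuild from a clean tree so libtool's earlier silent failures don't linger:
cd openmpi-4.1.1
make distclean || true
./configure --prefix=/usr/local/openmpi
make -j8 install
```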
Re: [OMPI users] [EXTERNAL] Linker errors in Fedora 34 Docker container
Hi John, I don't think an external dependency is going to fix this. In your build area, do you see any .lo files in opal/util/keyval? Which compiler are you using? Also, are you building from the tarballs at https://www.open-mpi.org/software/ompi/v4.1/ ? Howard

From: users on behalf of John Haiducek via users Reply-To: Open MPI Users Date: Tuesday, May 25, 2021 at 3:49 PM To: "users@lists.open-mpi.org" Cc: John Haiducek Subject: [EXTERNAL] [OMPI users] Linker errors in Fedora 34 Docker container

Hi, When attempting to build OpenMPI in a Fedora 34 Docker image I get the following linker errors:

#22 77.36 make[2]: Entering directory '/build/openmpi-4.1.1/opal/tools/wrappers'
#22 77.37 CC opal_wrapper.o
#22 77.67 CCLD opal_wrapper
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_yytext'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_yyin'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_yylineno'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_yynewlines'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_yylex'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_parse_done'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_yylex_destroy'
#22 77.81 /usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_util_keyval_init_buffer'
#22 77.81 collect2: error: ld returned 1 exit status

My configure command is just ./configure --prefix=/usr/local/openmpi.
I also tried ./configure --prefix=/usr/local/openmpi --disable-silent-rules --enable-builtin-atomics --with-hwloc=/usr --with-libevent=external --with-pmix=external --with-valgrind (similar to what is in the Fedora spec file for OpenMPI) but that produces the same errors. Is there a third-party library I need to install or an additional configure option I can set that will fix these? John
Re: [OMPI users] [EXTERNAL] Re: Newbie With Issues
Hi Ben, You're heading down the right path. On our HPC systems, we use modules to handle things like setting LD_LIBRARY_PATH etc. when using Intel 21.x.y and other Intel compilers. For example, for Intel/21.1.1 the following were added (edited to avoid posting explicit paths on our systems):

prepend-path LD_LIBRARY_PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/lib:/path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/compiler/lib/intel64_lin
prepend-path PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/bin
prepend-path LD_LIBRARY_PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/lib/emu
prepend-path LD_LIBRARY_PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/lib/x64
prepend-path LD_LIBRARY_PATH /path_to_compiler_install/x86_64/oneapi/2021.1.0.2684/compiler/2021.1.1/linux/lib

You should check which Intel compiler libraries you installed and make sure you're prepending the relevant folders to LD_LIBRARY_PATH. We have tested building Open MPI with the Intel oneAPI compilers and, except for ifx, things went okay. Howard

On 3/30/21, 11:12 AM, "users on behalf of bend linux4ms.net via users" wrote:

I think I have found one of the issues. I took the check C program from OpenMPI and tried to compile it, and got the following:

[root@jean-r8-sch24 benchmarks]# icc dummy.c
ld: cannot find -lstdc++
[root@jean-r8-sch24 benchmarks]# cat dummy.c
int main () { ; return 0; }
[root@jean-r8-sch24 benchmarks]#

Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212 "Never attribute to malice, that which can be adequately explained by stupidity" - Hanlon's Razor

From: users on behalf of bend linux4ms.net via users Sent: Tuesday, March 30, 2021 12:00 PM To: Open MPI Users Cc: bend linux4ms.net Subject: Re: [OMPI users] Newbie With Issues

Thanks Mr. Heinz for responding.
It may be the case with clang, but after sourcing the Intel setvars.sh, issuing the following compile gives me this message:

[root@jean-r8-sch24 openmpi-4.1.0]# icc
icc: command line error: no files specified; for help type "icc -help"
[root@jean-r8-sch24 openmpi-4.1.0]# icc -v
icc version 2021.1 (gcc version 8.3.1 compatibility)
[root@jean-r8-sch24 openmpi-4.1.0]#

That would lead me to believe that icc is still available to use. This is a government contract and they want the latest and greatest.

Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212 "Never attribute to malice, that which can be adequately explained by stupidity" - Hanlon's Razor

From: Heinz, Michael William Sent: Tuesday, March 30, 2021 11:52 AM To: Open MPI Users Cc: bend linux4ms.net Subject: RE: Newbie With Issues

It looks like you're trying to build Open MPI with the Intel C compiler. TBH - I think that icc isn't included with the latest release of oneAPI; I think they've switched to including clang instead. I had a similar issue to yours, but I resolved it by installing a 2020 version of the Intel HPC software. Unfortunately, those versions require purchasing a license.

-----Original Message-----
From: users On Behalf Of bend linux4ms.net via users Sent: Tuesday, March 30, 2021 12:42 PM To: Open MPI Open MPI Cc: bend linux4ms.net Subject: [OMPI users] Newbie With Issues

Hello group, My name is Ben Duncan. I have been tasked with installing OpenMPI and the Intel compiler on an HPC system. I am new to the whole HPC and MPI environment, so be patient with me. I have successfully gotten the Intel compiler (oneAPI version, from l_HPCKit_p_2021.1.0.2684_offline.sh) installed without any errors. I am trying to install and configure OpenMPI version 4.1.0; however, trying to run configuration for openmpi gives me the following error:

== Configuring Open MPI

*** Startup tests
checking build system type... x86_64-unknown-linux-gnu
checking host system type...
x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for gcc... icc
checking whether the C compiler works... no
configure: error: in `/p/app/openmpi-4.1.0':
configure: error: C compiler cannot create executables
See `config.log' for more details

With the error in config.log being:

configure:6499: $? = 0
configure:6488: icc -qversion >&5
icc: command line warning #100
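The `ld: cannot find -lstdc++` failure earlier in this thread usually means the GNU C++ standard library development package is missing; icc links against the system libstdc++. A sketch of the check and fix on an RHEL 8 class system (package names are an assumption):

```shell
# Install the GNU C++ runtime/development bits that icc links against:
dnf -y install gcc-c++ libstdc++-devel

# Re-run the minimal compile test from the thread:
cat > dummy.c <<'EOF'
int main () { ; return 0; }
EOF
icc dummy.c && echo "icc link OK"
```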
Re: [OMPI users] [EXTERNAL] building openshem on opa
Hi Michael, You may want to try https://github.com/Sandia-OpenSHMEM/SOS if you want to use OpenSHMEM over OPA. If you have lots of cycles for development work, you could write an OFI SPML for the OSHMEM component of Open MPI. Howard

On 3/22/21, 8:56 AM, "users on behalf of Michael Di Domenico via users" wrote:

I can build and run OpenMPI on an OPA network just fine, but it turns out building OpenSHMEM fails. The message is "(no spml) found". Looking at the config log, it looks like it tries to build the spml ikrit and ucx components, which fail. I turn UCX off because it doesn't support OPA and isn't needed. So this message is really just a confirmation that OpenSHMEM and OPA are not capable of being built together (or did I do something wrong?), plus a curiosity about what kind of effort would be involved in getting it to work.
Re: [OMPI users] [EXTERNAL] Re: OpenMPI 4.0.5 error with Omni-path
Hi Folks, I'm also having problems reproducing this on one of our OPA clusters:

libpsm2-11.2.78-1.el7.x86_64
libpsm2-devel-11.2.78-1.el7.x86_64

The cluster runs RHEL 7.8.

hca_id: hfi1_0
    transport:      InfiniBand (0)
    fw_ver:         1.27.0
    node_guid:      0011:7501:0179:e2d7
    sys_image_guid: 0011:7501:0179:e2d7
    vendor_id:      0x1175
    vendor_part_id: 9456
    hw_ver:         0x11
    board_id:       Intel Omni-Path Host Fabric Interface Adapter 100 Series
    phys_port_cnt:  1
    port: 1
        state:      PORT_ACTIVE (4)
        max_mtu:    4096 (5)
        active_mtu: 4096 (5)
        sm_lid:     1
        port_lid:   99
        port_lmc:   0x00
        link_layer: InfiniBand

Using gcc/gfortran 9.3.0. Built Open MPI 4.0.5 without any special configure options. Howard

On 1/27/21, 9:47 AM, "users on behalf of Michael Di Domenico via users" wrote:

For whatever it's worth, running the test program on my OPA cluster seems to work. Well, it keeps spitting out [INFO MEMORY] lines; I'm not sure if it's supposed to stop at some point. I'm running rhel7, gcc 10.1, openmpi 4.0.5rc2, with-ofi, without-{psm,ucx,verbs}

On Tue, Jan 26, 2021 at 3:44 PM Patrick Begou via users wrote:
>
> Hi Michael
>
> Indeed I'm a little bit lost with all these parameters in OpenMPI, mainly because for years it has worked just fine out of the box in all my deployments on various architectures, interconnects and Linux flavors. Some weeks ago I deployed OpenMPI 4.0.5 on CentOS 8 with gcc10, Slurm and UCX on an AMD Epyc2 cluster with ConnectX-6, and it just worked fine. It is the first time I've had such trouble deploying this library.
>
> If you have my mail posted the 25/01/2021 in this discussion at 18h54 (maybe Paris TZ), there is a small test case attached that shows the problem. Did you get it, or did the list strip the attachments? I can provide it again.
>
> Many thanks
>
> Patrick
>
> Le 26/01/2021 à 19:25, Heinz, Michael William a écrit :
>
> Patrick, how are you using the original PSM if you're using Omni-Path hardware? The original PSM was written for QLogic DDR and QDR InfiniBand adapters.
> >
> > As far as needing openib - the issue is that the PSM2 MTL doesn't support a subset of MPI operations that we previously used the pt2pt BTL for. For recent versions of OMPI, the preferred BTL to use with PSM2 is OFI.
> >
> > Is there any chance you can give us a sample MPI app that reproduces the problem? I can't think of another way I can give you more help without being able to see what's going on. It's always possible there's a bug in the PSM2 MTL, but it would be surprising at this point.
> >
> > Sent from my iPad
> >
> > On Jan 26, 2021, at 1:13 PM, Patrick Begou via users wrote:
> >
> > Hi all,
> >
> > I ran many tests today. I saw that an older 4.0.2 version of OpenMPI packaged with Nix was running using openib. So I added the --with-verbs option to set up this module.
> >
> > What I can see now is that, with:
> >
> > mpirun -hostfile $OAR_NODEFILE --mca mtl psm -mca btl_openib_allow_ib true
> >
> > - the testcase test_layout_array is running without error
> >
> > - the bandwidth measured with osu_bw is half of what it should be:
> >
> > # OSU MPI Bandwidth Test v5.7
> > # Size      Bandwidth (MB/s)
> > 1           0.54
> > 2           1.13
> > 4           2.26
> > 8           4.51
> > 16          9.06
> > 32          17.93
> > 64          33.87
> > 128         69.29
> > 256         161.24
> > 512         333.82
> > 1024        682.66
> > 2048        1188.63
> > 4096        1760.14
> > 8192        2166.08
> > 16384       2036.95
> > 32768       3466.63
> > 65536       6296.73
> > 131072      7509.43
> > 262144      9104.78
> > 524288      6908.55
> > 1048576     5530.37
> > 2097152     4489.16
> > 4194304     3498.14
> >
> > mpirun -hostfile $OAR_NODEFILE --mca mtl psm2 -mca btl_openib_allow_ib true ...
> >
> > - the testcase test_layout_array is not giving correct results
> >
Re: [OMPI users] [EXTERNAL] OpenMPI 4.0.5 error with Omni-path
Hi Patrick, Also it might not hurt to disable the openib BTL by setting

export OMPI_MCA_btl=^openib

in your shell prior to invoking mpirun. Howard

From: users on behalf of "Heinz, Michael William via users" Reply-To: Open MPI Users Date: Monday, January 25, 2021 at 8:47 AM To: "users@lists.open-mpi.org" Cc: "Heinz, Michael William" Subject: [EXTERNAL] [OMPI users] OpenMPI 4.0.5 error with Omni-path

Patrick, You really have to provide us some detailed information if you want assistance. At a minimum, we need to know whether you're using the PSM2 MTL or the OFI MTL, and what the actual error is. Please provide the actual command line you are having problems with, along with any errors. In addition, I recommend adding the following to your command line:

-mca mtl_base_verbose 99

If you have a way to reproduce the problem quickly, you might also want to add:

-x PSM2_TRACEMASK=11

But that will add very detailed debug output to your command, and you haven't mentioned that PSM2 is failing, so it may not be useful.
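Combining the suggestions from this thread, a single diagnostic invocation might look like the following. This is a sketch; the benchmark name is a placeholder, and the MTL choice (psm2 vs. ofi) should match what the verbose output reports as selectable:

```shell
# Disable the openib BTL and make the MTL selection visible:
export OMPI_MCA_btl=^openib
mpirun -np 2 --mca mtl psm2 --mca mtl_base_verbose 99 ./osu_bw

# If PSM2 itself is suspect, add its (very chatty) provider tracing:
mpirun -np 2 --mca mtl psm2 -x PSM2_TRACEMASK=11 ./osu_bw
```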
Re: [OMPI users] [EXTERNAL] RMA breakage
Hello Dave, There's an issue opened about this - https://github.com/open-mpi/ompi/issues/8252 However, I'm not observing failures with IMB RMA on an IB/aarch64 system with UCX 1.9.0 using OMPI 4.0.x at 6ea9d98. This cluster is running RHEL 7.6 and MLNX_OFED_LINUX-4.5-1.0.1.0. Howard

On 12/7/20, 7:21 AM, "users on behalf of Dave Love via users" wrote:

After seeing several failures with RMA with the change needed to get 4.0.5 through IMB, I looked for simple tests. So, I built the mpich 3.4b1 tests -- or the ones that would build, and I haven't checked why some fail -- and ran the rma set. Three out of 180 passed. Many (most?) aborted in ucx, like I saw with production code, with a backtrace like below; others at least reported an MPI error. This was on two nodes of a ppc64le RHEL7 IB system with 4.0.5, ucx 1.9, and MCA parameters from the ucx FAQ (though I got the same result without those parameters). I haven't tried to reproduce it on x86_64, but it seems unlikely to be CPU-specific. Is there anything we can do to run RMA without just moving to mpich? Do releases actually get tested on run-of-the-mill IB+Lustre systems?
+ mpirun -n 2 winname
[gpu005:50906:0:50906] ucp_worker.c:183 Fatal: failed to set active message handler id 1: Invalid parameter
backtrace (tid: 50906)
 0 0x0005453c ucs_debug_print_backtrace() .../src/ucs/debug/debug.c:656
 1 0x00028218 ucp_worker_set_am_handlers() .../src/ucp/core/ucp_worker.c:182
 2 0x00029ae0 ucp_worker_iface_deactivate() .../src/ucp/core/ucp_worker.c:816
 3 0x00029ae0 ucp_worker_iface_check_events() .../src/ucp/core/ucp_worker.c:766
 4 0x00029ae0 ucp_worker_iface_deactivate() .../src/ucp/core/ucp_worker.c:819
 5 0x00029ae0 ucp_worker_iface_unprogress_ep() .../src/ucp/core/ucp_worker.c:841
 6 0x000582a8 ucp_wireup_ep_t_cleanup() .../src/ucp/wireup/wireup_ep.c:381
 7 0x00068124 ucs_class_call_cleanup_chain() .../src/ucs/type/class.c:56
 8 0x00057420 ucp_wireup_ep_t_delete() .../src/ucp/wireup/wireup_ep.c:28
 9 0x00013de8 uct_ep_destroy() .../src/uct/base/uct_iface.c:546
10 0x000252f4 ucp_proxy_ep_replace() .../src/ucp/core/ucp_proxy_ep.c:236
11 0x00057b88 ucp_wireup_ep_progress() .../src/ucp/wireup/wireup_ep.c:89
12 0x00049820 ucs_callbackq_slow_proxy() .../src/ucs/datastruct/callbackq.c:400
13 0x0002ca04 ucs_callbackq_dispatch() .../src/ucs/datastruct/callbackq.h:211
14 0x0002ca04 uct_worker_progress() .../src/uct/api/uct.h:2346
15 0x0002ca04 ucp_worker_progress() .../src/ucp/core/ucp_worker.c:2040
16 0xc144 progress_callback() osc_ucx_component.c:0
17 0x000374ac opal_progress() ???:0
18 0x0006cc74 ompi_request_default_wait() ???:0
19 0x000e6fcc ompi_coll_base_sendrecv_actual() ???:0
20 0x000e5530 ompi_coll_base_allgather_intra_two_procs() ???:0
21 0x6c44 ompi_coll_tuned_allgather_intra_dec_fixed() ???:0
22 0xdc20 component_select() osc_ucx_component.c:0
23 0x00115b90 ompi_osc_base_select() ???:0
24 0x00075264 ompi_win_create() ???:0
25 0x000cb4e8 PMPI_Win_create() ???:0
26 0x10006ecc MTestGetWin() .../mpich-3.4b1/test/mpi/util/mtest.c:1173
27 0x10002e40 main() .../mpich-3.4b1/test/mpi/rma/winname.c:25
28 0x00025200 generic_start_main.isra.0() libc-start.c:0
29 0x000253f4 __libc_start_main() ???:0

followed by the abort backtrace
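Since the backtrace dies inside the UCX one-sided (osc) component during window creation, one workaround to try while the linked issue is open is steering osc away from UCX. This is a hedged sketch; whether a non-UCX osc fallback (e.g. rdma or pt2pt) is available depends on how the installation was built:

```shell
# Exclude the UCX one-sided component so MPI_Win_create falls back to
# another osc implementation, if one was built:
export OMPI_MCA_osc=^ucx
mpirun -n 2 ./winname
```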
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi All, I opened a new issue to track the coll_perf failure, in case it's not related to the HDF5 problem reported earlier: https://github.com/open-mpi/ompi/issues/8246 Howard

On Mon., 23 Nov. 2020 at 12:14, Dave Love via users <users@lists.open-mpi.org> wrote:

> Mark Dixon via users writes:
>
> > Surely I cannot be the only one who cares about using a recent openmpi
> > with hdf5 on lustre?
>
> I generally have similar concerns. I dug out the romio tests, assuming
> something more basic is useful. I ran them with ompi 4.0.5+ucx on
> Mark's lustre system (similar to a few nodes of Summit, apart from the
> filesystem, but with quad-rail IB which doesn't give the bandwidth I
> expected).
>
> The perf test says romio performs a bit better. Also -- from overall
> time -- it's faster on IMB-IO (which I haven't looked at in detail, and
> ran with suboptimal striping).
>
> Test: perf
> romio321
> Access size per process = 4194304 bytes, ntimes = 5
> Write bandwidth without file sync = 19317.372354 Mbytes/sec
> Read bandwidth without prior file sync = 35033.325451 Mbytes/sec
> Write bandwidth including file sync = 1081.096713 Mbytes/sec
> Read bandwidth after file sync = 47135.349155 Mbytes/sec
> ompio
> Access size per process = 4194304 bytes, ntimes = 5
> Write bandwidth without file sync = 18442.698536 Mbytes/sec
> Read bandwidth without prior file sync = 31958.198676 Mbytes/sec
> Write bandwidth including file sync = 1081.058583 Mbytes/sec
> Read bandwidth after file sync = 31506.854710 Mbytes/sec
>
> However, romio coll_perf fails as follows, and ompio runs. Isn't there
> mpi-io regression testing?
> [gpu025:89063:0:89063] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1fffbc10)
> backtrace (tid:  89063)
>  0 0x0005453c ucs_debug_print_backtrace()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucs/debug/debug.c:656
>  1 0x00041b04 ucp_rndv_pack_data()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1335
>  2 0x0001c814 uct_self_ep_am_bcopy()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:278
>  3 0x0003f7ac uct_ep_am_bcopy()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2561
>  4 0x0003f7ac ucp_do_am_bcopy_multi()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.inl:79
>  5 0x0003f7ac ucp_rndv_progress_am_bcopy()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1352
>  6 0x00041cb8 ucp_request_try_send()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
>  7 0x00041cb8 ucp_request_send()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
>  8 0x00041cb8 ucp_rndv_rtr_handler()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1754
>  9 0x0001c984 uct_iface_invoke_am()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/base/uct_iface.h:635
> 10 0x0001c984 uct_self_iface_sendrecv_am()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:149
> 11 0x0001c984 uct_self_ep_am_short()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:262
> 12 0x0002ee30 uct_ep_am_short()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2549
> 13 0x0002ee30 ucp_do_am_single()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.c:68
> 14 0x00042908 ucp_proto_progress_rndv_rtr()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:172
> 15 0x0003f4c4 ucp_request_try_send()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
> 16 0x0003f4c4 ucp_request_send()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
> 17 0x0003f4c4 ucp_rndv_req_send_rtr()  /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/ta
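[Editor's note: for anyone wanting to reproduce Dave's ROMIO/OMPIO comparison — in the Open MPI 4.0.x series the MPI-IO implementation can be selected at run time through the io MCA framework. A sketch (the component name romio321 matches the 4.0.x series; verify with ompi_info on your build, and the test binary name coll_perf is taken from the thread):

    # See which MPI-IO components this build provides
    ompi_info | grep "MCA io"

    # Run the test with ROMIO...
    mpirun -np 16 --mca io romio321 ./coll_perf

    # ...and again with OMPIO
    mpirun -np 16 --mca io ompio ./coll_perf

The same selection can be made persistent via OMPI_MCA_io in the environment or an entry in openmpi-mca-params.conf.]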
Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.
Hi Martin,

Thanks, this is helpful. Are you getting this timeout when you're running the spawner process as a singleton?

Howard

On Fri., Aug. 14, 2020 at 17:44, Martín Morales < martineduardomora...@hotmail.com> wrote:

> Howard,
>
> I pasted below the error message after a while of the hang I referred to.
>
> Regards,
>
> Martín
>
> --------------------------------------------------------------------------
> A request has timed out and will therefore fail:
>
>   Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345
>
> Your job may terminate as a result of this problem. You may want to
> adjust the MCA parameter pmix_server_max_wait and try again. If this
> occurred during a connect/accept operation, you can adjust that time
> using the pmix_base_exchange_timeout parameter.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_dpm_dyn_init() failed
>   --> Returned "Timeout" (-15) instead of "Success" (0)
> --------------------------------------------------------------------------
> [nos-GF7050VT-M:03767] *** An error occurred in MPI_Init
> [nos-GF7050VT-M:03767] *** reported by process [2337734658,0]
> [nos-GF7050VT-M:03767] *** on a NULL communicator
> [nos-GF7050VT-M:03767] *** Unknown error
> [nos-GF7050VT-M:03767] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [nos-GF7050VT-M:03767] ***    and potentially your MPI job)
> [osboxes:02457] *** An error occurred in MPI_Comm_spawn
> [osboxes:02457] *** reported by process [2337734657,0]
> [osboxes:02457] *** on communicator MPI_COMM_WORLD
> [osboxes:02457] *** MPI_ERR_UNKNOWN: unknown error
> [osboxes:02457] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [osboxes:02457] ***    and potentially your MPI job)
> [osboxes:02458] 1 more process has sent help message help-orted.txt / timedout
> [osboxes:02458] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> From: Martín Morales via users
> Sent: Friday, August 14, 2020 19:40
> To: Howard Pritchard
> Cc: Martín Morales; Open MPI Users
> Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.
>
> Hi Howard.
>
> Thanks for opening the GitHub issue to track this. I have run with mpirun without "master" in
> the hostfile and it runs OK. The hang occurs when I run as a singleton
> (no mpirun), which is the way I need to run. If I run top on both
> machines, the processes are correctly mapped but hung. It seems the
> MPI_Init() function doesn't return. Thanks for your help.
> > Best regards, > > > > Martín > > > > > > > > > > > > > > *From: *Howard Pritchard > *Sent: *viernes, 14 de agosto de 2020 15:18 > *To: *Martín Morales > *Cc: *Open MPI Users > *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with > dynamically processes allocation. OMPI 4.0.1 don't. > > > > Hi Martin, > > > > I opened an issue on Open MPI's github to track this > https://github.com/open-mpi/ompi/issues/8005 > > > > You may be seeing another problem if you removed master from the host > file. > > Could you add the --debug-daemons option to the mpirun and post the output? > > > > Howard > > > > > > Am Di., 11. Aug. 2020 um 17:35 Uhr schrieb Martín Morales < > martineduardomora...@hotmail.com>: > > Hi Howard. > > > > Great!, that works for the crashing problem with OMPI 4.0.4. However It > stills hanging if I remove “master” (host which launches spawning > processes) from my hostfile. > > I need spawn only in “worker”. Is there a way or workaround for doing this > without mpirun? > > Thanks a lot for your assistance. > > > > Martín > > > > > > > > > > *From: *Howard Pritchard > *Sent: *lunes, 10 de agosto de 2020 19:13 > *To: *Martín
Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.
Hi Martin, I opened an issue on Open MPI's github to track this https://github.com/open-mpi/ompi/issues/8005 You may be seeing another problem if you removed master from the host file. Could you add the --debug-daemons option to the mpirun and post the output? Howard Am Di., 11. Aug. 2020 um 17:35 Uhr schrieb Martín Morales < martineduardomora...@hotmail.com>: > Hi Howard. > > > > Great!, that works for the crashing problem with OMPI 4.0.4. However It > stills hanging if I remove “master” (host which launches spawning > processes) from my hostfile. > > I need spawn only in “worker”. Is there a way or workaround for doing this > without mpirun? > > Thanks a lot for your assistance. > > > > Martín > > > > > > > > > > *From: *Howard Pritchard > *Sent: *lunes, 10 de agosto de 2020 19:13 > *To: *Martín Morales > *Cc: *Open MPI Users > *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with > dynamically processes allocation. OMPI 4.0.1 don't. > > > > Hi Martin, > > > > I was able to reproduce this with 4.0.x branch. I'll open an issue. > > > > If you really want to use 4.0.4, then what you'll need to do is build an > external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and > then build Open MPI using the --with-pmix=where your pmix is installed > > You will also need to build both Open MPI and PMIx against the same > libevent. There's a configure option with both packages to use an > external libevent installation. > > > > Howard > > > > > > Am Mo., 10. Aug. 2020 um 13:52 Uhr schrieb Martín Morales < > martineduardomora...@hotmail.com>: > > Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have > to post this on the bug section? Thanks and regards. > > > > Martín > > > > *From: *Howard Pritchard > *Sent: *lunes, 10 de agosto de 2020 14:44 > *To: *Open MPI Users > *Cc: *Martín Morales > *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with > dynamically processes allocation. OMPI 4.0.1 don't. 
> > > > Hello Martin, > > > > Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx > version that introduced a problem with spawn for the 4.0.2-4.0.4 versions. > > This is supposed to be fixed in the 4.0.5 release. Could you try the > 4.0.5rc1 tarball and see if that addresses the problem you're seeing? > > > > https://www.open-mpi.org/software/ompi/v4.0/ > > > > Howard > > > > > > > > Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users < > users@lists.open-mpi.org>: > > > > Hello people! > > I'm using OMPI 4.0.4 in a very simple scenario. Just 2 machines, one > "master", one "worker" on a Ethernet LAN. Both with Ubuntu 18.04.I builded > OMPI just like this: > > > > ./configure --prefix=/usr/local/openmpi-4.0.4/bin/ > > > > My hostfile is this: > > > > master slots=2 > worker slots=2 > > > > I'm trying to dynamically allocate the processes with MPI_Comm_Spawn(). > > If I launch the processes only on the "master" machine It's ok. But if I > use the hostfile crashes with this: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *-- > At least one pair of MPI processes are unable to reach each other for MPI > communications. This means that no Open MPI device has indicated that it > can be used to communicate between these processes. This is an error; Open > MPI requires that all MPI processes be able to reach each other. This > error can sometimes be the result of forgetting to specify the "self" BTL. > Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M Process 2 > ([[35155,1],0]) is on host: unknown! BTLs attempted: tcp self Your MPI > job is now going to abort; sorry. > -- > [nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file > dpm/dpm.c at line 493 > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can fail > during MPI_INIT; some of which are due to configuration or environment > problems. 
This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): ompi_dpm_dyn_init() failed
Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.
Hi Ralph, I've not yet determined whether this is actually a PMIx issue or the way the dpm stuff in OMPI is handling PMIx namespaces. Howard Am Di., 11. Aug. 2020 um 19:34 Uhr schrieb Ralph Castain via users < users@lists.open-mpi.org>: > Howard - if there is a problem in PMIx that is causing this problem, then > we really could use a report on it ASAP as we are getting ready to release > v3.1.6 and I doubt we have addressed anything relevant to what is being > discussed here. > > > > On Aug 11, 2020, at 4:35 PM, Martín Morales via users < > users@lists.open-mpi.org> wrote: > > Hi Howard. > > Great!, that works for the crashing problem with OMPI 4.0.4. However It > stills hanging if I remove “master” (host which launches spawning > processes) from my hostfile. > I need spawn only in “worker”. Is there a way or workaround for doing this > without mpirun? > Thanks a lot for your assistance. > > Martín > > > > > *From: *Howard Pritchard > *Sent: *lunes, 10 de agosto de 2020 19:13 > *To: *Martín Morales > *Cc: *Open MPI Users > *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with > dynamically processes allocation. OMPI 4.0.1 don't. > > Hi Martin, > > I was able to reproduce this with 4.0.x branch. I'll open an issue. > > If you really want to use 4.0.4, then what you'll need to do is build an > external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and > then build Open MPI using the --with-pmix=where your pmix is installed > You will also need to build both Open MPI and PMIx against the same > libevent. There's a configure option with both packages to use an > external libevent installation. > > Howard > > > Am Mo., 10. Aug. 2020 um 13:52 Uhr schrieb Martín Morales < > martineduardomora...@hotmail.com>: > > Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have > to post this on the bug section? Thanks and regards. 
> > > Martín > > > *From: *Howard Pritchard > *Sent: *lunes, 10 de agosto de 2020 14:44 > *To: *Open MPI Users > *Cc: *Martín Morales > *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with > dynamically processes allocation. OMPI 4.0.1 don't. > > > Hello Martin, > > > Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx > version that introduced a problem with spawn for the 4.0.2-4.0.4 versions. > This is supposed to be fixed in the 4.0.5 release. Could you try the > 4.0.5rc1 tarball and see if that addresses the problem you're seeing? > > > https://www.open-mpi.org/software/ompi/v4.0/ > > > Howard > > > > > > > Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users < > users@lists.open-mpi.org>: > > > Hello people! > I'm using OMPI 4.0.4 in a very simple scenario. Just 2 machines, one > "master", one "worker" on a Ethernet LAN. Both with Ubuntu 18.04.I builded > OMPI just like this: > > > ./configure --prefix=/usr/local/openmpi-4.0.4/bin/ > > > My hostfile is this: > > > master slots=2 > worker slots=2 > > > I'm trying to dynamically allocate the processes with MPI_Comm_Spawn(). > If I launch the processes only on the "master" machine It's ok. But if I > use the hostfile crashes with this: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *--At > least one pair of MPI processes are unable to reach each other forMPI > communications. This means that no Open MPI device has indicatedthat it > can be used to communicate between these processes. This isan error; Open > MPI requires that all MPI processes be able to reacheach other. This error > can sometimes be the result of forgetting tospecify the "self" BTL. > Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M Process 2 > ([[35155,1],0]) is on host: unknown! 
BTLs attempted: tcp selfYour MPI job > is now going to abort; > sorry.--[nos-GF7050VT-M:22526] > [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line > 493--It > looks like MPI_INIT failed for some reason; your parallel process islikely > to abort. There are many reasons that a parallel process canfail during > MPI_INIT; some of which are due to configuration or environmentproblems. > This failure appears to be an internal failure; here's someadditional > information (which may only be relevant to an Open MPIdeveloper): >
Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.
Hi Martin, I was able to reproduce this with 4.0.x branch. I'll open an issue. If you really want to use 4.0.4, then what you'll need to do is build an external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and then build Open MPI using the --with-pmix=where your pmix is installed You will also need to build both Open MPI and PMIx against the same libevent. There's a configure option with both packages to use an external libevent installation. Howard Am Mo., 10. Aug. 2020 um 13:52 Uhr schrieb Martín Morales < martineduardomora...@hotmail.com>: > Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have > to post this on the bug section? Thanks and regards. > > > > Martín > > > > *From: *Howard Pritchard > *Sent: *lunes, 10 de agosto de 2020 14:44 > *To: *Open MPI Users > *Cc: *Martín Morales > *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with > dynamically processes allocation. OMPI 4.0.1 don't. > > > > Hello Martin, > > > > Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx > version that introduced a problem with spawn for the 4.0.2-4.0.4 versions. > > This is supposed to be fixed in the 4.0.5 release. Could you try the > 4.0.5rc1 tarball and see if that addresses the problem you're seeing? > > > > https://www.open-mpi.org/software/ompi/v4.0/ > > > > Howard > > > > > > > > Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users < > users@lists.open-mpi.org>: > > > > Hello people! > > I'm using OMPI 4.0.4 in a very simple scenario. Just 2 machines, one > "master", one "worker" on a Ethernet LAN. Both with Ubuntu 18.04.I builded > OMPI just like this: > > > > ./configure --prefix=/usr/local/openmpi-4.0.4/bin/ > > > > My hostfile is this: > > > > master slots=2 > worker slots=2 > > > > I'm trying to dynamically allocate the processes with MPI_Comm_Spawn(). > > If I launch the processes only on the "master" machine It's ok. 
But if I > use the hostfile crashes with this: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *-- > At least one pair of MPI processes are unable to reach each other for MPI > communications. This means that no Open MPI device has indicated that it > can be used to communicate between these processes. This is an error; Open > MPI requires that all MPI processes be able to reach each other. This > error can sometimes be the result of forgetting to specify the "self" BTL. > Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M Process 2 > ([[35155,1],0]) is on host: unknown! BTLs attempted: tcp self Your MPI > job is now going to abort; sorry. > -- > [nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file > dpm/dpm.c at line 493 > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can fail > during MPI_INIT; some of which are due to configuration or environment > problems. This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): ompi_dpm_dyn_init() failed --> Returned "Unreachable" (-12) > instead of "Success" (0) > -- > [nos-GF7050VT-M:22526] *** An error occurred in MPI_Init > [nos-GF7050VT-M:22526] *** reported by process [2303918082,1] > [nos-GF7050VT-M:22526] *** on a NULL communicator [nos-GF7050VT-M:22526] > *** Unknown error [nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL > (processes in this communicator will now abort, [nos-GF7050VT-M:22526] *** >and potentially your MPI job)* > > > > Note: host "nos-GF7050VT-M" is "worker" > > > > But If I run without "master" in hostfile, the processes are launched but > It hangs: MPI_Init() doesn't returns. 
> > I launched the script (pasted below) in this 2 ways with the same result: > > > > $ ./simple_spawn 2 > > $ mpirun -np 1 ./simple_spawn 2 > > > > The "simple_spawn" script: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > &
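[Editor's note: Howard's workaround above — an external PMIx 3.1.2 plus a libevent shared by both packages — amounts to roughly the following. All install prefixes and source directories here are illustrative, not prescriptive:

    # libevent first, so PMIx and Open MPI can link the same copy
    cd libevent-2.1.8 && ./configure --prefix=/opt/libevent && make install

    # PMIx 3.1.2 -- the version that was embedded in Open MPI 4.0.1
    cd pmix-3.1.2 && ./configure --prefix=/opt/pmix-3.1.2 \
        --with-libevent=/opt/libevent && make install

    # Open MPI 4.0.4 against the external PMIx and the same libevent
    cd openmpi-4.0.4 && ./configure --prefix=/opt/openmpi-4.0.4 \
        --with-pmix=/opt/pmix-3.1.2 \
        --with-libevent=/opt/libevent && make install

Mismatched libevent copies between PMIx and Open MPI are a common source of subtle runtime failures, which is why both --with-libevent flags point at the same prefix.]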
Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.
Hello Martin, Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx version that introduced a problem with spawn for the 4.0.2-4.0.4 versions. This is supposed to be fixed in the 4.0.5 release. Could you try the 4.0.5rc1 tarball and see if that addresses the problem you're seeing? https://www.open-mpi.org/software/ompi/v4.0/ Howard Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users < users@lists.open-mpi.org>: > > > Hello people! > > I'm using OMPI 4.0.4 in a very simple scenario. Just 2 machines, one > "master", one "worker" on a Ethernet LAN. Both with Ubuntu 18.04.I builded > OMPI just like this: > > > > ./configure --prefix=/usr/local/openmpi-4.0.4/bin/ > > > > My hostfile is this: > > > > master slots=2 > worker slots=2 > > > > I'm trying to dynamically allocate the processes with MPI_Comm_Spawn(). > > If I launch the processes only on the "master" machine It's ok. But if I > use the hostfile crashes with this: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > *-- > At least one pair of MPI processes are unable to reach each other for MPI > communications. This means that no Open MPI device has indicated that it > can be used to communicate between these processes. This is an error; Open > MPI requires that all MPI processes be able to reach each other. This > error can sometimes be the result of forgetting to specify the "self" BTL. > Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M Process 2 > ([[35155,1],0]) is on host: unknown! BTLs attempted: tcp self Your MPI > job is now going to abort; sorry. > -- > [nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file > dpm/dpm.c at line 493 > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can fail > during MPI_INIT; some of which are due to configuration or environment > problems. 
> This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_dpm_dyn_init() failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> [nos-GF7050VT-M:22526] *** An error occurred in MPI_Init
> [nos-GF7050VT-M:22526] *** reported by process [2303918082,1]
> [nos-GF7050VT-M:22526] *** on a NULL communicator
> [nos-GF7050VT-M:22526] *** Unknown error
> [nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [nos-GF7050VT-M:22526] ***    and potentially your MPI job)
>
> Note: host "nos-GF7050VT-M" is "worker"
>
> But if I run without "master" in the hostfile, the processes are launched but
> it hangs: MPI_Init() doesn't return.
>
> I launched the script (pasted below) in these 2 ways with the same result:
>
> $ ./simple_spawn 2
> $ mpirun -np 1 ./simple_spawn 2
>
> The "simple_spawn" script:
>
> #include "mpi.h"
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char ** argv){
>     int processesToRun;
>     MPI_Comm parentcomm, intercomm;
>     MPI_Info info;
>     int rank, size, hostName_len;
>     char hostName[200];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_get_parent(&parentcomm);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Get_processor_name(hostName, &hostName_len);
>
>     if (parentcomm == MPI_COMM_NULL) {
>         if (argc < 2) {
>             printf("Processes number needed!");
>             return 0;
>         }
>         processesToRun = atoi(argv[1]);
>         MPI_Info_create(&info);
>         MPI_Info_set(info, "hostfile", "./hostfile");
>         MPI_Info_set(info, "map_by", "node");
>         MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, processesToRun, info, 0,
>                        MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
>         printf("I'm the parent.\n");
>     } else {
>         printf("I'm the spawned h: %s r/s: %i/%i.\n", hostName, rank, size);
>     }
>     fflush(stdout);
>     MPI_Finalize();
>     return 0;
> }
>
> I came from OMPI 4.0.1. In that version it's working... with some
> inconsistencies, I'm afraid. That's why I decided to upgrade to OMPI 4.0.4.
> I tried several versions with no luck. Is there maybe an intrinsic problem
> with the OMPI dynamic allocation functionality?
>
> Any help will be very appreciated. Best regards.
>
> Martín
Re: [OMPI users] Differences 4.0.3 -> 4.0.4 (Regression?)
Hello Michael,

Not sure what could be causing this in terms of the delta between v4.0.3 and v4.0.4. Two things to try:

- add --debug-daemons and --mca pmix_base_verbose 100 to the mpirun line and compare output from the v4.0.3 and v4.0.4 installs
- perhaps try using the --enable-mpirun-prefix-by-default configure option and reinstall v4.0.4

Howard

On Thu., Aug. 6, 2020 at 04:48, Michael Fuckner via users < users@lists.open-mpi.org> wrote:

> Hi,
>
> I have a small setup with one headnode and two compute nodes connected
> via IB-QDR running CentOS 8.2 and Mellanox OFED 4.9 LTS. I installed
> openmpi 3.0.6, 3.1.6, 4.0.3 and 4.0.4 with identical configuration
> (configure, compile, nothing configured in openmpi-mca-params.conf), the
> output from ompi-info and orte-info looks identical.
>
> There is a small benchmark basically just doing MPI_Send() and
> MPI_Recv(). I can invoke it directly like this (as 4.0.3 and 4.0.4):
>
> /opt/openmpi/4.0.3/gcc/bin/mpirun -np 16 -hostfile HOSTFILE_2x8 -nolocal
> ./OWnetbench.openmpi-4.0.3
>
> When running this job from slurm, it works with 4.0.3, but there is an
> error with 4.0.4. Any hint what to check?
>
> ### running ./OWnetbench/OWnetbench.openmpi-4.0.4 with
> /opt/openmpi/4.0.4/gcc/bin/mpirun ###
> [node002.cluster:04960] MCW rank 0 bound to socket 0[core 7[hwt 0-1]]:
> [../../../../../../../BB]
> [node002.cluster:04963] PMIX ERROR: OUT-OF-RESOURCE in file
> client/pmix_client.c at line 231
> [node002.cluster:04963] OPAL ERROR: Error in file pmix3x_client.c at
> line 112
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node002.cluster:04963] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able
> to guarantee that all other processes were killed!
> -- > Primary job terminated normally, but 1 process returned > a non-zero exit code. Per user-direction, the job has been aborted. > -- > -- > mpirun detected that one or more processes exited with non-zero status, > thus causing > the job to be terminated. The first process to do so was: > >Process name: [[15424,1],0] >Exit code:1 > -- > > Any hint why 4.0.4 behaves not like the other versions? > > -- > DELTA Computer Products GmbH > Röntgenstr. 4 > D-21465 Reinbek bei Hamburg > T: +49 40 300672-30 > F: +49 40 300672-11 > E: michael.fuck...@delta.de > > Internet: https://www.delta.de > Handelsregister Lübeck HRB 3678-RE, Ust.-IdNr.: DE135110550 > Geschäftsführer: Hans-Peter Hellmann >
Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Collin, A couple of things to try. First, could you just configure without using the mellanox platform file and see if you can run the app with 100 or more processes? Another thing to try is to keep using the mellanox platform file, but run the app with mpirun --mca pml ob1 -np 100 bin/xhpcg and see if the app runs successfully. Howard Am Mo., 27. Jan. 2020 um 09:29 Uhr schrieb Collin Strassburger < cstrassbur...@bihrle.com>: > Hello Howard, > > > > To remove potential interactions, I have found that the issue persists > without ucx and hcoll support. > > > > Run command: mpirun -np 128 bin/xhpcg > > Output: > > -- > > mpirun was unable to start the specified application as it encountered an > > error: > > > > Error code: 63 > > Error name: (null) > > Node: Gen2Node4 > > > > when attempting to start process rank 0. > > -- > > 128 total processes failed to start > > > > It returns this error for any process I initialize with >100 processes per > node. I get the same error message for multiple different codes, so the > error code is mpi related rather than being program specific. > > > > Collin > > > > *From:* Howard Pritchard > *Sent:* Monday, January 27, 2020 11:20 AM > *To:* Open MPI Users > *Cc:* Collin Strassburger > *Subject:* Re: [OMPI users] OMPI returns error 63 on AMD 7742 when > utilizing 100+ processors per node > > > > Hello Collen, > > > > Could you provide more information about the error. Is there any output > from either Open MPI or, maybe, UCX, that could provide more information > about the problem you are hitting? > > > > Howard > > > > > > Am Mo., 27. Jan. 2020 um 08:38 Uhr schrieb Collin Strassburger via users < > users@lists.open-mpi.org>: > > Hello, > > > > I am having difficulty with OpenMPI versions 4.0.2 and 3.1.5. Both of > these versions cause the same error (error code 63) when utilizing more > than 100 cores on a single node. The processors I am utilizing are AMD > Epyc “Rome” 7742s. The OS is CentOS 8.1. 
I have tried compiling with both > the default gcc 8 and locally compiled gcc 9. I have already tried > modifying the maximum name field values with no success. > > > > My compile options are: > > ./configure > > --prefix=${HPCX_HOME}/ompi > > --with-platform=contrib/platform/mellanox/optimized > > > > Any assistance would be appreciated, > > Collin > > > > Collin Strassburger > > Bihrle Applied Research Inc. > > > >
Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
Hello Collin,

Could you provide more information about the error? Is there any output from either Open MPI or, maybe, UCX, that could provide more information about the problem you are hitting?

Howard

On Mon., Jan. 27, 2020 at 08:38, Collin Strassburger via users < users@lists.open-mpi.org> wrote:

> Hello,
>
> I am having difficulty with OpenMPI versions 4.0.2 and 3.1.5. Both of
> these versions cause the same error (error code 63) when utilizing more
> than 100 cores on a single node. The processors I am utilizing are AMD
> Epyc “Rome” 7742s. The OS is CentOS 8.1. I have tried compiling with both
> the default gcc 8 and locally compiled gcc 9. I have already tried
> modifying the maximum name field values with no success.
>
> My compile options are:
>
> ./configure
> --prefix=${HPCX_HOME}/ompi
> --with-platform=contrib/platform/mellanox/optimized
>
> Any assistance would be appreciated,
>
> Collin
>
> Collin Strassburger
> Bihrle Applied Research Inc.
Re: [OMPI users] Do idle MPI threads consume clock cycles?
Hello Mark,

You may want to check out this package: https://github.com/lanl/libquo

Another option would be to use an MPI_Ibarrier in the application, with all the MPI processes but rank 0 going into a loop that waits for completion of the barrier and sleeps. Once rank 0 has completed the OpenMP work, it would then enter the barrier and wait for completion.

This type of problem may be helped by a future MPI that supports the notion of MPI Sessions. With this approach, you would initialize one MPI session for normal messaging behavior, using polling for fast processing of messages. Your MPI library would use this for its existing messaging. You could initialize a second MPI session to use blocking methods for message receipt. You would use a communicator derived from the second session to do what's described above for the loop with sleep on an Ibarrier.

Good luck,

Howard

On Thu., Feb. 21, 2019 at 11:25, Mark McClure < mark.w.m...@gmail.com> wrote:

> I have the following, rather unusual, scenario...
>
> I have a program running with OpenMP on a multicore computer. At one point
> in the program, I want to use an external package that is written to
> exploit MPI, not OpenMP, parallelism. So a (rather awkward) solution could
> be to launch the program in MPI, but most of the time, everything is being
> done in a single MPI process, which is using OpenMP (ie, run my current
> program in a single MPI process). Then, when I get to the part where I need
> to use the external package, distribute out the information to all the MPI
> processes, run it across all, and then pull them back to the master
> process. This is awkward, but probably better than my current approach,
> which is running the external package on a single processor (ie, not
> exploiting parallelism in this time-consuming part of the code).
>
> If I use this strategy, I fear that the idle MPI processes may be
> consuming clock cycles while I am running the rest of the program on the
> master process with OpenMP. Thus, they may compete with the OpenMP threads.
> OpenMP does not close threads between every pragma, but OMP_WAIT_POLICY can
> be set to sleep idle threads (actually, this is the default behavior). I
> have not been able to find any equivalent documentation regarding the
> behavior of idle threads in MPI.
>
> Best regards,
> Mark
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)
Hello Adam, This helps some. Could you post the first 20 lines of your config.log? This will help in trying to reproduce. The content of your host file (you can use generic names for the nodes if that's an issue to publicize) would also help, as the number of nodes and number of MPI processes/node impacts the way the reduce scatter operation works. One thing to note about the openib BTL - it is on life support. That's why you needed to set btl_openib_allow_ib 1 on the mpirun command line. You may get much better success by installing UCX <https://github.com/openucx/ucx/releases> and rebuilding Open MPI to use UCX. You may actually already have UCX installed on your system if a recent version of MOFED is installed. You can check this by running /usr/bin/ofed_rpm_info. It will show which ucx version has been installed. If UCX is installed, you can add --with-ucx to the Open MPI configuration line and it should build in UCX support. If Open MPI is built with UCX support, it will by default use UCX for message transport rather than the OpenIB BTL. Thanks, Howard Am Mi., 20. Feb.
2019 um 12:49 Uhr schrieb Adam LeBlanc < alebl...@iol.unh.edu>: > On tcp side it doesn't seg fault anymore but will timeout on some tests > but on the openib side it will still seg fault, here is the output: > > [pandora:19256] *** Process received signal *** > [pandora:19256] Signal: Segmentation fault (11) > [pandora:19256] Signal code: Address not mapped (1) > [pandora:19256] Failing at address: 0x7f911c69fff0 > [pandora:19255] *** Process received signal *** > [pandora:19255] Signal: Segmentation fault (11) > [pandora:19255] Signal code: Address not mapped (1) > [pandora:19255] Failing at address: 0x7ff09cd3fff0 > [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680] > [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0] > [pandora:19256] [ 2] > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55] > [pandora:19256] [ 3] > /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b] > [pandora:19256] [ 4] [pandora:19255] [ 0] > /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680] > [pandora:19255] [ 1] > /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7] > [pandora:19256] [ 5] IMB-MPI1[0x40b83b] > [pandora:19256] [ 6] IMB-MPI1[0x407155] > [pandora:19256] [ 7] IMB-MPI1[0x4022ea] > [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0] > [pandora:19255] [ 2] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5] > [pandora:19256] [ 9] IMB-MPI1[0x401d49] > [pandora:19256] *** End of error message *** > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55] > [pandora:19255] [ 3] > /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b] > [pandora:19255] [ 4] > /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7] > [pandora:19255] [ 5] IMB-MPI1[0x40b83b] > [pandora:19255] [ 6] IMB-MPI1[0x407155] > [pandora:19255] [ 7] IMB-MPI1[0x4022ea] > 
[pandora:19255] [ 8] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5] > [pandora:19255] [ 9] IMB-MPI1[0x401d49] > [pandora:19255] *** End of error message *** > [phoebe:12418] *** Process received signal *** > [phoebe:12418] Signal: Segmentation fault (11) > [phoebe:12418] Signal code: Address not mapped (1) > [phoebe:12418] Failing at address: 0x7f5ce27dfff0 > [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680] > [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0] > [phoebe:12418] [ 2] > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55] > [phoebe:12418] [ 3] > /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b] > [phoebe:12418] [ 4] > /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7] > [phoebe:12418] [ 5] IMB-MPI1[0x40b83b] > [phoebe:12418] [ 6] IMB-MPI1[0x407155] > [phoebe:12418] [ 7] IMB-MPI1[0x4022ea] > [phoebe:12418] [ 8] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5] > [phoebe:12418] [ 9] IMB-MPI1[0x401d49] > [phoebe:12418] *** End of error message *** > -- > Primary job terminated normally, but 1 process returned > a non-zero exit code. Per user-direction, the job has been aborted. > -- > -- > mpirun noticed that process rank 0 with PID 0 on node pandora exited on > signal 11 (Segmentation fault). > --
Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)
Hi Adam, As a sanity check, if you try to use --mca btl self,vader,tcp do you still see the segmentation fault? Howard On Wed., Feb. 20, 2019 at 08:50 Adam LeBlanc < alebl...@iol.unh.edu> wrote: > Hello, > > When I do a run with OpenMPI v4.0.0 on Infiniband with this command: > mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca > orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca > btl_openib_allow_ib 1 -np 6 > -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1 > > I get this error: > > # > # Benchmarking Reduce_scatter > # #processes = 4 > # ( 2 additional processes waiting in MPI_Barrier) > # > #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] > 0 1000 0.14 0.15 0.14 > 4 1000 5.00 7.58 6.28 > 8 1000 5.13 7.68 6.41 > 16 1000 5.05 7.74 6.39 > 32 1000 5.43 7.96 6.75 > 64 1000 6.78 8.56 7.69 > 128 1000 7.77 9.55 8.59 > 256 1000 8.28 10.96 9.66 > 512 1000 9.19 12.49 10.85 > 1024 1000 11.78 15.01 13.38 > 2048 1000 17.41 19.51 18.52 > 4096 1000 25.73 28.22 26.89 > 8192 1000 47.75 49.44 48.79 > 16384 1000 81.10 90.15 84.75 > 32768 1000 163.01 178.58 173.19 > 65536 640 315.63 340.51 333.18 > 131072 320 475.48 528.82 510.85 > 262144 160 979.70 1063.81 1035.61 > 524288 80 2070.51 2242.58 2150.15 > 1048576 40 4177.36 4527.25 4431.65 > 2097152 20 8738.08 9340.50 9147.89 > [pandora:04500] *** Process received signal *** > [pandora:04500] Signal: Segmentation fault (11) > [pandora:04500] Signal code: Address not mapped (1) > [pandora:04500] Failing at address: 0x7f310eb0 > [pandora:04499] *** Process received signal *** > [pandora:04499] Signal: Segmentation fault (11) > [pandora:04499] Signal code: Address not mapped (1) > [pandora:04499] Failing at address: 0x7f28b110 > [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680] > [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0] > [pandora:04500] [ 2] > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55] > [pandora:04500] [ 3] [pandora:04499] [ 0] > 
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b] > [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680] > [pandora:04499] [ 1] > /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7] > [pandora:04500] [ 5] IMB-MPI1[0x40b83b] > [pandora:04500] [ 6] IMB-MPI1[0x407155] > [pandora:04500] [ 7] IMB-MPI1[0x4022ea] > [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0] > [pandora:04499] [ 2] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5] > [pandora:04500] [ 9] IMB-MPI1[0x401d49] > [pandora:04500] *** End of error message *** > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55] > [pandora:04499] [ 3] > /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b] > [pandora:04499] [ 4] > /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7] > [pandora:04499] [ 5] IMB-MPI1[0x40b83b] > [pandora:04499] [ 6] IMB-MPI1[0x407155] > [pandora:04499] [ 7] IMB-MPI1[0x4022ea] > [pandora:04499] [ 8] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5] > [pandora:04499] [ 9] IMB-MPI1[0x401d49] > [pandora:04499] *** End of error message *** > [phoebe:03779] *** Process received signal *** > [phoebe:03779] Signal: Segmentation fault (11) > [phoebe:03779] Signal code: Address not mapped (1) > [phoebe:03779] Failing at address: 0x7f483d60 > [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680] > [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0] > [phoebe:03779] [ 2] > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55] > [phoebe:03779] [ 3] > /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_b
Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX
Hi Matt, Definitely do not include the ucx option for an Omni-Path cluster. Actually, if you accidentally installed UCX in its default location on the system, switch to this config option: --with-ucx=no Otherwise you will hit https://github.com/openucx/ucx/issues/750 Howard Gilles Gouaillardet wrote on Sat., Jan. 19, 2019 at 18:41: > Matt, > > There are two ways of using PMIx > > - if you use mpirun, then the MPI app (e.g. the PMIx client) will talk > to mpirun and orted daemons (e.g. the PMIx server) > - if you use SLURM srun, then the MPI app will directly talk to the > PMIx server provided by SLURM. (note you might have to srun > --mpi=pmix_v2 or something) > > In the former case, it does not matter whether you use the embedded or > external PMIx. > In the latter case, Open MPI and SLURM have to use compatible PMIx > libraries, and you can either check the cross-version compatibility > matrix, > or build Open MPI with the same PMIx used by SLURM to be on the safe > side (not a bad idea IMHO). > > > Regarding the hang, I suggest you try different things > - use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun > runs on a compute node rather than on a frontend node) > - try something even simpler such as mpirun hostname (both with sbatch > and salloc) > - explicitly specify the network to be used for the wire-up. you can > for example mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is > the network subnet by which all the nodes (e.g. compute nodes and > frontend node if you use salloc) communicate. > > > Cheers, > > Gilles > > On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson wrote: > > > > On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users < > users@lists.open-mpi.org> wrote: > >> > >> On Jan 18, 2019, at 12:43 PM, Matt Thompson wrote: > >> > > >> > With some help, I managed to build an Open MPI 4.0.0 with: > >> > >> We can discuss each of these params to let you know what they are. 
> >> > >> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath > >> > >> Did you have a reason for disabling these? They're generally good > things. What they do is add linker flags to the wrapper compilers (i.e., > mpicc and friends) that basically put a default path to find libraries at > run time (that can/will in most cases override LD_LIBRARY_PATH -- but you > can override these linked-in-default-paths if you want/need to). > > > > > > I've had these in my Open MPI builds for a while now. The reason was one > of the libraries I need for the climate model I work on went nuts if both > of them weren't there. It was originally the rpath one but then eventually > (Open MPI 3?) I had to add the runpath one. But I have been updating the > libraries more aggressively recently (due to OS upgrades) so it's possible > this is no longer needed. > > > >> > >> > >> > --with-psm2 > >> > >> Ensure that Open MPI can include support for the PSM2 library, and > abort configure if it cannot. > >> > >> > --with-slurm > >> > >> Ensure that Open MPI can include support for SLURM, and abort configure > if it cannot. > >> > >> > --enable-mpi1-compatibility > >> > >> Add support for MPI_Address and other MPI-1 functions that have since > been deleted from the MPI 3.x specification. > >> > >> > --with-ucx > >> > >> Ensure that Open MPI can include support for UCX, and abort configure > if it cannot. > >> > >> > --with-pmix=/usr/nlocal/pmix/2.1 > >> > >> Tells Open MPI to use the PMIx that is installed at > /usr/nlocal/pmix/2.1 (instead of using the PMIx that is bundled internally > to Open MPI's source code tree/expanded tarball). > >> > >> Unless you have a reason to use the external PMIx, the internal/bundled > PMIx is usually sufficient. > > > > > > Ah. I did not know that. I figured if our SLURM was built linked to a > specific PMIx v2 that I should build Open MPI with the same PMIx. I'll > build an Open MPI 4 without specifying this. 
> > > >> > >> > >> > --with-libevent=/usr > >> > >> Same as previous; change "pmix" to "libevent" (i.e., use the external > libevent instead of the bundled libevent). > >> > >> > CC=icc CXX=icpc FC=ifort > >> > >> Specify the exact compilers to use. > >> > >> > The MPI 1 is because I need to build HDF5 eventually and I added psm2 > because it's an Omnipath cluster. The libevent was prob
Re: [OMPI users] Segmentation fault using openmpi-master-201901030305-ee26ed9
Hi Sigmar, I observed this problem yesterday myself and should have a fix in to master later today. Howard Am Fr., 4. Jan. 2019 um 05:30 Uhr schrieb Siegmar Gross < siegmar.gr...@informatik.hs-fulda.de>: > Hi, > > I've installed (tried to install) openmpi-master-201901030305-ee26ed9 on > my "SUSE Linux Enterprise Server 12.3 (x86_64)" with gcc-7.3.0, > icc-19.0.1.144 > pgcc-18.4-0, and Sun C 5.15 (Oracle Developer Studio 12.6). Unfortunately, > I > still cannot build it with Sun C and I get a segmentation fault for one of > my small programs for the other compilers. > > I get the following error for Sun C that I reported some time ago. > https://www.mail-archive.com/users@lists.open-mpi.org/msg32816.html > > > The program runs as expected if I only use my local machine "loki" and it > breaks if I add a remote machine (even if I only use the remote machine > without "loki"). > > loki hello_1 114 ompi_info | grep -e "Open MPI repo revision" -e"Configure > command line" >Open MPI repo revision: v2.x-dev-6601-gee26ed9 >Configure command line: '--prefix=/usr/local/openmpi-master_64_gcc' > '--libdir=/usr/local/openmpi-master_64_gcc/lib64' > '--with-jdk-bindir=/usr/local/jdk-11/bin' > '--with-jdk-headers=/usr/local/jdk-11/include' > 'JAVA_HOME=/usr/local/jdk-11' > 'LDFLAGS=-m64 -L/usr/local/cuda/lib64' 'CC=gcc' 'CXX=g++' 'FC=gfortran' > 'CFLAGS=-m64 -I/usr/local/cuda/include' 'CXXFLAGS=-m64 > -I/usr/local/cuda/include' 'FCFLAGS=-m64' 'CPP=cpp > -I/usr/local/cuda/include' > 'CXXCPP=cpp -I/usr/local/cuda/include' '--enable-mpi-cxx' > '--enable-cxx-exceptions' '--enable-mpi-java' > '--with-cuda=/usr/local/cuda' > '--with-valgrind=/usr/local/valgrind' '--with-hwloc=internal' > '--without-verbs' > '--with-wrapper-cflags=-std=c11 -m64' '--with-wrapper-cxxflags=-m64' > '--with-wrapper-fcflags=-m64' '--enable-debug' > > > loki hello_1 115 mpiexec -np 4 --host loki:2,nfs2:2 hello_1_mpi > Process 0 of 4 running on loki > Process 1 of 4 running on loki > Process 2 of 4 running on 
nfs2 > Process 3 of 4 running on nfs2 > > Now 3 slave tasks are sending greetings. > > Greetings from task 1: >message type:3 >msg length: 132 characters > ... (complete output of my program) > > [nfs2:01336] *** Process received signal *** > [nfs2:01336] Signal: Segmentation fault (11) > [nfs2:01336] Signal code: Address not mapped (1) > [nfs2:01336] Failing at address: 0x7feea4849268 > [nfs2:01336] [ 0] /lib64/libpthread.so.0(+0x10c10)[0x7feeacbbec10] > [nfs2:01336] [ 1] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(+0x7cd34)[0x7feeadd94d34] > [nfs2:01336] [ 2] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(+0x78673)[0x7feeadd90673] > [nfs2:01336] [ 3] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(+0x7ac2c)[0x7feeadd92c2c] > [nfs2:01336] [ 4] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(opal_finalize_cleanup_domain+0x3e)[0x7feeadd56507] > [nfs2:01336] [ 5] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(opal_finalize_util+0x56)[0x7feeadd56667] > [nfs2:01336] [ 6] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(opal_finalize+0xd3)[0x7feeadd567de] > [nfs2:01336] [ 7] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-rte.so.0(orte_finalize+0x1ba)[0x7feeae09d7ea] > [nfs2:01336] [ 8] > > /usr/local/openmpi-master_64_gcc/lib64/libopen-rte.so.0(orte_daemon+0x3ddd)[0x7feeae0cf55d] > [nfs2:01336] [ 9] orted[0x40086d] > [nfs2:01336] [10] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7feeac829725] > [nfs2:01336] [11] orted[0x400739] > [nfs2:01336] *** End of error message *** > Segmentation fault (core dumped) > loki hello_1 116 > > > I would be grateful, if somebody can fix the problem. Do you need anything > else? Thank you very much for any help in advance. > > > Kind regards > > Siegmar > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Unable to build Open MPI with external PMIx library support
Hi Eduardo, The config.log looked nominal. Could you try the following additional options to the build with the internal PMIx: --enable-orterun-prefix-by-default --disable-dlopen ? Also, for the mpirun built using the internal PMIx, could you check the output of ldd? And just in case, check if the PMIX_INSTALL_PREFIX is somehow being set? Howard On Mon., Dec. 17, 2018 at 03:29 Eduardo Rothe < eduardo.ro...@yahoo.co.uk> wrote: > Hi Howard, > > Thank you for you reply. I have just re-executed the whole process and > here is the config.log (in attachment to this message)! > > Just for restating, when I use internal PMIx I get the following error > while running mpirun (using Open MPI 4.0.0): > > -- > We were unable to find any usable plugins for the BFROPS framework. This > PMIx > framework requires at least one plugin in order to operate. This can be > caused > by any of the following: > > * we were unable to build any of the plugins due to some combination > of configure directives and available system support > > * no plugin was selected due to some combination of MCA parameter > directives versus built plugins (i.e., you excluded all the plugins > that were built and/or could execute) > > * the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter > "mca_base_component_path", is set and doesn't point to any location > that includes at least one usable plugin for this framework. > > Please check your installation and environment. > ------ > > Regards, > Eduardo > > > On Saturday, 15 December 2018, 18:35:44 CET, Howard Pritchard < > hpprit...@gmail.com> wrote: > > > Hi Eduardo > > Could you post the config.log for the build with internal PMIx so we can > figure that out first. > > Howard > > Eduardo Rothe via users schrieb am Fr. 14. > Dez.
2018 um 09:41: > > Open MPI: 4.0.0 > PMIx: 3.0.2 > OS: Debian 9 > > I'm building a debian package for Open MPI and either I get the following > error messages while configuring: > > undefined reference to symbol 'dlopen@@GLIBC_2.2.5' > undefined reference to symbol 'lt_dlopen' > > when using the configure option: > > ./configure --with-pmix=/usr/lib/x86_64-linux-gnu/pmix > > or otherwise, if I use the following configure options: > > ./configure --with-pmix=external > --with-pmix-libdir=/usr/lib/x86_64-linux-gnu/pmix > > I have a successfull compile, but when running mpirun I get the following > message: > > -- > We were unable to find any usable plugins for the BFROPS framework. This > PMIx > framework requires at least one plugin in order to operate. This can be > caused > by any of the following: > > * we were unable to build any of the plugins due to some combination > of configure directives and available system support > > * no plugin was selected due to some combination of MCA parameter > directives versus built plugins (i.e., you excluded all the plugins > that were built and/or could execute) > > * the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter > "mca_base_component_path", is set and doesn't point to any location > that includes at least one usable plugin for this framework. > > Please check your installation and environment. > -- > > What I find most strange is that I get the same error message (unable to > find > any usable plugins for the BFROPS framework) even if I don't configure > external PMIx support! > > Can someone please hint me about what's going on? > > Cheers! > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Unable to build Open MPI with external PMIx library support
Hi Eduardo Could you post the config.log for the build with internal PMIx so we can figure that out first. Howard Eduardo Rothe via users schrieb am Fr. 14. Dez. 2018 um 09:41: > Open MPI: 4.0.0 > PMIx: 3.0.2 > OS: Debian 9 > > I'm building a debian package for Open MPI and either I get the following > error messages while configuring: > > undefined reference to symbol 'dlopen@@GLIBC_2.2.5' > undefined reference to symbol 'lt_dlopen' > > when using the configure option: > > ./configure --with-pmix=/usr/lib/x86_64-linux-gnu/pmix > > or otherwise, if I use the following configure options: > > ./configure --with-pmix=external > --with-pmix-libdir=/usr/lib/x86_64-linux-gnu/pmix > > I have a successfull compile, but when running mpirun I get the following > message: > > -- > We were unable to find any usable plugins for the BFROPS framework. This > PMIx > framework requires at least one plugin in order to operate. This can be > caused > by any of the following: > > * we were unable to build any of the plugins due to some combination > of configure directives and available system support > > * no plugin was selected due to some combination of MCA parameter > directives versus built plugins (i.e., you excluded all the plugins > that were built and/or could execute) > > * the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter > "mca_base_component_path", is set and doesn't point to any location > that includes at least one usable plugin for this framework. > > Please check your installation and environment. > -- > > What I find most strange is that I get the same error message (unable to > find > any usable plugins for the BFROPS framework) even if I don't configure > external PMIx support! > > Can someone please hint me about what's going on? > > Cheers! > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] [Open MPI Announce] Open MPI 4.0.0 Released
Hi Bert, If you'd prefer to return to the land of convenience and don't need to mix MPI and OpenSHMEM, then you may want to try the path I outlined in the email archived at the following link https://www.mail-archive.com/users@lists.open-mpi.org/msg32274.html Howard Am Di., 13. Nov. 2018 um 23:10 Uhr schrieb Bert Wesarg via users < users@lists.open-mpi.org>: > Dear Takahiro, > On Wed, Nov 14, 2018 at 5:38 AM Kawashima, Takahiro > wrote: > > > > XPMEM moved to GitLab. > > > > https://gitlab.com/hjelmn/xpmem > > the first words from the README aren't very pleasant to read: > > This is an experimental version of XPMEM based on a version provided by > Cray and uploaded to https://code.google.com/p/xpmem. This version > supports > any kernel 3.12 and newer. *Keep in mind there may be bugs and this version > may cause kernel panics, code crashes, eat your cat, etc.* > > Installing this on my laptop where I just want developing with SHMEM > it would be a pitty to lose work just because of that. > > Best, > Bert > > > > > Thanks, > > Takahiro Kawashima, > > Fujitsu > > > > > Hello Bert, > > > > > > What OS are you running on your notebook? > > > > > > If you are running Linux, and you have root access to your system, > then > > > you should be able to resolve the Open SHMEM support issue by > installing > > > the XPMEM device driver on your system, and rebuilding UCX so it picks > > > up XPMEM support. > > > > > > The source code is on GitHub: > > > > > > https://github.com/hjelmn/xpmem > > > > > > Some instructions on how to build the xpmem device driver are at > > > > > > https://github.com/hjelmn/xpmem/wiki/Installing-XPMEM > > > > > > You will need to install the kernel source and symbols rpms on your > > > system before building the xpmem device driver. > > > > > > Hope this helps, > > > > > > Howard > > > > > > > > > Am Di., 13. Nov. 
2018 um 15:00 Uhr schrieb Bert Wesarg via users < > > > users@lists.open-mpi.org>: > > > > > > > Hi, > > > > > > > > On Mon, Nov 12, 2018 at 10:49 PM Pritchard Jr., Howard via announce > > > > wrote: > > > > > > > > > > The Open MPI Team, representing a consortium of research, > academic, and > > > > > industry partners, is pleased to announce the release of Open MPI > version > > > > > 4.0.0. > > > > > > > > > > v4.0.0 is the start of a new release series for Open MPI. > Starting with > > > > > this release, the OpenIB BTL supports only iWarp and RoCE by > default. > > > > > Starting with this release, UCX is the preferred transport > protocol > > > > > for Infiniband interconnects. The embedded PMIx runtime has been > updated > > > > > to 3.0.2. The embedded Romio has been updated to 3.2.1. This > > > > > release is ABI compatible with the 3.x release streams. There have > been > > > > numerous > > > > > other bug fixes and performance improvements. > > > > > > > > > > Note that starting with Open MPI v4.0.0, prototypes for several > > > > > MPI-1 symbols that were deleted in the MPI-3.0 specification > > > > > (which was published in 2012) are no longer available by default in > > > > > mpi.h. See the README for further details. > > > > > > > > > > Version 4.0.0 can be downloaded from the main Open MPI web site: > > > > > > > > > > https://www.open-mpi.org/software/ompi/v4.0/ > > > > > > > > > > > > > > > 4.0.0 -- September, 2018 > > > > > > > > > > > > > > > - OSHMEM updated to the OpenSHMEM 1.4 API. > > > > > - Do not build OpenSHMEM layer when there are no SPMLs available. > > > > > Currently, this means the OpenSHMEM layer will only build if > > > > > a MXM or UCX library is found. > > > > > > > > so what is the most convenience way to get SHMEM working on a single > > > > shared memory node (aka. notebook)? I just realized that I don't have > > > > a SHMEM since Open MPI 3.0. But building with UCX does not help > > > > either. 
I tried with UCX 1.4 but Open MPI SHMEM > > > > still does not work: > > > > > > > > $ oshcc -o shmem_hello_world-
Re: [OMPI users] [Open MPI Announce] Open MPI 4.0.0 Released
Hello Bert, What OS are you running on your notebook? If you are running Linux, and you have root access to your system, then you should be able to resolve the Open SHMEM support issue by installing the XPMEM device driver on your system, and rebuilding UCX so it picks up XPMEM support. The source code is on GitHub: https://github.com/hjelmn/xpmem Some instructions on how to build the xpmem device driver are at https://github.com/hjelmn/xpmem/wiki/Installing-XPMEM You will need to install the kernel source and symbols rpms on your system before building the xpmem device driver. Hope this helps, Howard Am Di., 13. Nov. 2018 um 15:00 Uhr schrieb Bert Wesarg via users < users@lists.open-mpi.org>: > Hi, > > On Mon, Nov 12, 2018 at 10:49 PM Pritchard Jr., Howard via announce > wrote: > > > > The Open MPI Team, representing a consortium of research, academic, and > > industry partners, is pleased to announce the release of Open MPI version > > 4.0.0. > > > > v4.0.0 is the start of a new release series for Open MPI. Starting with > > this release, the OpenIB BTL supports only iWarp and RoCE by default. > > Starting with this release, UCX is the preferred transport protocol > > for Infiniband interconnects. The embedded PMIx runtime has been updated > > to 3.0.2. The embedded Romio has been updated to 3.2.1. This > > release is ABI compatible with the 3.x release streams. There have been > numerous > > other bug fixes and performance improvements. > > > > Note that starting with Open MPI v4.0.0, prototypes for several > > MPI-1 symbols that were deleted in the MPI-3.0 specification > > (which was published in 2012) are no longer available by default in > > mpi.h. See the README for further details. > > > > Version 4.0.0 can be downloaded from the main Open MPI web site: > > > > https://www.open-mpi.org/software/ompi/v4.0/ > > > > > > 4.0.0 -- September, 2018 > > > > > > - OSHMEM updated to the OpenSHMEM 1.4 API. 
> > - Do not build OpenSHMEM layer when there are no SPMLs available. > > Currently, this means the OpenSHMEM layer will only build if > > a MXM or UCX library is found. > > so what is the most convenience way to get SHMEM working on a single > shared memory node (aka. notebook)? I just realized that I don't have > a SHMEM since Open MPI 3.0. But building with UCX does not help > either. I tried with UCX 1.4 but Open MPI SHMEM > still does not work: > > $ oshcc -o shmem_hello_world-4.0.0 openmpi-4.0.0/examples/hello_oshmem_c.c > $ oshrun -np 2 ./shmem_hello_world-4.0.0 > [1542109710.217344] [tudtug:27715:0] select.c:406 UCX ERROR > no remote registered memory access transport to tudtug:27716: > self/self - Destination is unreachable, tcp/enp0s31f6 - no put short, > tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable, > mm/posix - Destination is unreachable, cma/cma - no put short > [1542109710.217344] [tudtug:27716:0] select.c:406 UCX ERROR > no remote registered memory access transport to tudtug:27715: > self/self - Destination is unreachable, tcp/enp0s31f6 - no put short, > tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable, > mm/posix - Destination is unreachable, cma/cma - no put short > [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266 > Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable > [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305 > Error: add procs FAILED rc=-2 > [tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266 > Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable > [tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305 > Error: add procs FAILED rc=-2 > -- > It looks like SHMEM_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during SHMEM_INIT; some of which are due to configuration or > environment > problems. 
This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open SHMEM > developer): > > SPML add procs failed > --> Returned "Out of resource" (-2) instead of "Success" (0) > -- > [tudtug:27715] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to > initialize - aborting > [tudtug:27716] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to > initialize - aborting > -- > SHMEM_ABORT was invo
Re: [OMPI users] [EXTERNAL] Re: OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2
Hi Si, Could you add --disable-builtin-atomics to the configure options and see if the hang goes away? Howard 2018-07-02 8:48 GMT-06:00 Jeff Squyres (jsquyres) via users < users@lists.open-mpi.org>: > Simon -- > > You don't currently have another Open MPI installation in your PATH / > LD_LIBRARY_PATH, do you? > > I have seen dependency library loads cause "make check" to get confused, > and instead of loading the libraries from the build tree, actually load > some -- but not all -- of the required OMPI/ORTE/OPAL/etc. libraries from > an installation tree. Hilarity ensues (to include symptoms such as running > forever). > > Can you double check that you have no Open MPI libraries in your > LD_LIBRARY_PATH before running "make check" on the build tree? > > > > > On Jun 30, 2018, at 3:18 PM, Hammond, Simon David via users < > users@lists.open-mpi.org> wrote: > > > > Nathan, > > > > Same issue with OpenMPI 3.1.1 on POWER9 with GCC 7.2.0 and CUDA9.2. > > > > S. > > > > -- > > Si Hammond > > Scalable Computer Architectures > > Sandia National Laboratories, NM, USA > > [Sent from remote connection, excuse typos] > > > > > > On 6/16/18, 10:10 PM, "Nathan Hjelm" wrote: > > > >Try the latest nightly tarball for v3.1.x. Should be fixed. > > > >> On Jun 16, 2018, at 5:48 PM, Hammond, Simon David via users < > users@lists.open-mpi.org> wrote: > >> > >> The output from the test in question is: > >> > >> Single thread test. Time: 0 s 10182 us 10 nsec/poppush > >> Atomics thread finished. Time: 0 s 169028 us 169 nsec/poppush > >> > >> > >> S. > >> > >> -- > >> Si Hammond > >> Scalable Computer Architectures > >> Sandia National Laboratories, NM, USA > >> [Sent from remote connection, excuse typos] > >> > >> > >> On 6/16/18, 5:45 PM, "Hammond, Simon David" wrote: > >> > >> Hi OpenMPI Team, > >> > >> We have recently updated an install of OpenMPI on POWER9 system > (configuration details below). We migrated from OpenMPI 2.1 to OpenMPI 3.1. 
> We seem to have a symptom where code than ran before is now locking up and > making no progress, getting stuck in wait-all operations. While I think > it's prudent for us to root cause this a little more, I have gone back and > rebuilt MPI and re-run the "make check" tests. The opal_fifo test appears > to hang forever. I am not sure if this is the cause of our issue but wanted > to report that we are seeing this on our system. > >> > >> OpenMPI 3.1.0 Configuration: > >> > >> ./configure --prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3. > 1.0-nomxm/gcc/7.2.0/cuda/9.2.88 --with-cuda=$CUDA_ROOT --enable-mpi-java > --enable-java --with-lsf=/opt/lsf/10.1 --with-lsf-libdir=/opt/lsf/10. > 1/linux3.10-glibc2.17-ppc64le/lib --with-verbs > >> > >> GCC versions are 7.2.0, built by our team. CUDA is 9.2.88 from NVIDIA > for POWER9 (standard download from their website). We enable IBM's JDK > 8.0.0. > >> RedHat: Red Hat Enterprise Linux Server release 7.5 (Maipo) > >> > >> Output: > >> > >> make[3]: Entering directory `/home/sdhammo/openmpi/ > openmpi-3.1.0/test/class' > >> make[4]: Entering directory `/home/sdhammo/openmpi/ > openmpi-3.1.0/test/class' > >> PASS: ompi_rb_tree > >> PASS: opal_bitmap > >> PASS: opal_hash_table > >> PASS: opal_proc_table > >> PASS: opal_tree > >> PASS: opal_list > >> PASS: opal_value_array > >> PASS: opal_pointer_array > >> PASS: opal_lifo > >> > >> > >> Output from Top: > >> > >> 20 0 73280 4224 2560 S 800.0 0.0 17:22.94 lt-opal_fifo > >> > >> -- > >> Si Hammond > >> Scalable Computer Architectures > >> Sandia National Laboratories, NM, USA > >> [Sent from remote connection, excuse typos] > >> > >> > >> > >> > >> ___ > >> users mailing list > >> users@lists.open-mpi.org > >> https://lists.open-mpi.org/mailman/listinfo/users > > > > > > ___ > > users mailing list > > users@lists.open-mpi.org > > https://lists.open-mpi.org/mailman/listinfo/users > > > -- > Jeff Squyres > jsquy...@cisco.com > > ___ > users mailing list > 
users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
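Howard's suggestion and Jeff's LD_LIBRARY_PATH check can be combined into a short checklist. This is a sketch only; the prefix and CUDA path are placeholders, not the poster's actual paths:

```shell
# 1. Make sure no previously installed Open MPI libraries can shadow
#    the build tree while "make check" runs (should print nothing).
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i openmpi

# 2. Reconfigure without the compiler's built-in atomics and re-run
#    the test that hung (paths are placeholders).
./configure --prefix=$HOME/opt/openmpi-3.1.0 \
    --disable-builtin-atomics --with-cuda=$CUDA_ROOT
make -j all
cd test/class && make check   # watch whether opal_fifo still spins
```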
Re: [OMPI users] A couple of general questions
Hello Charles You are heading in the right direction. First you might want to run the libfabric fi_info command to see what capabilities you picked up from the libfabric RPMs. Next you may well not actually be using the OFI mtl. Could you run your app with export OMPI_MCA_mtl_base_verbose=100 and post the output? It would also help if you described the system you are using : OS interconnect cpu type etc. Howard Charles A Taylor schrieb am Do. 14. Juni 2018 um 06:36: > Because of the issues we are having with OpenMPI and the openib BTL > (questions previously asked), I’ve been looking into what other transports > are available. I was particularly interested in OFI/libfabric support but > cannot find any information on it more recent than a reference to the usNIC > BTL from 2015 (Jeff Squyres, Cisco). Unfortunately, the openmpi-org > website FAQ’s covering OpenFabrics support don’t mention anything beyond > OpenMPI 1.8. Given that 3.1 is the current stable version, that seems odd. > > That being the case, I thought I’d ask here. After laying down the > libfabric-devel RPM and building (3.1.0) with —with-libfabric=/usr, I end > up with an “ofi” MTL but nothing else. I can run with OMPI_MCA_mtl=ofi > and OMPI_MCA_btl=“self,vader,openib” but it eventually crashes in > libopen-pal.so. (mpi_waitall() higher up the stack). > > GIZMO:9185 terminated with signal 11 at PC=2b4d4b68a91d SP=7ffcfbde9ff0. > Backtrace: > > /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(+0x9391d)[0x2b4d4b68a91d] > > /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(opal_progress+0x24)[0x2b4d4b632754] > > /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(ompi_request_default_wait_all+0x11f)[0x2b4d47be2a6f] > > /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(PMPI_Waitall+0xbd)[0x2b4d47c2ce4d] > > Questions: Am I using the OFI MTL as intended? Should there be an “ofi” > BTL? Does anyone use this? 
> > Thanks, > > Charlie Taylor > UF Research Computing > > PS - If you could use some help updating the FAQs, I’d be willing to put > in some time. I’d probably learn a lot. > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
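The two diagnostics Howard asks for can be run as follows. A sketch; the application name is a placeholder:

```shell
# List the libfabric providers picked up from the libfabric RPMs.
fi_info -l

# Re-run the application with the OFI MTL selected and verbose
# MTL component output, then post the result to the list.
export OMPI_MCA_mtl=ofi
export OMPI_MCA_mtl_base_verbose=100
mpirun -np 2 ./your_app   # placeholder application
```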
Re: [OMPI users] Problem running with UCX/oshmem on single node?
Hi Craig, You are experiencing problems because you don't have a transport installed that UCX can use for oshmem. You either need to go and buy a ConnectX-4/5 HCA from Mellanox (and maybe a switch) and install that on your system, or else install xpmem (https://github.com/hjelmn/xpmem). Note there is a bug right now in UCX that you may hit if you try to go the xpmem-only route: https://github.com/open-mpi/ompi/issues/5083 and https://github.com/openucx/ucx/issues/2588 If you are just running on a single node and want to experiment with the OpenSHMEM programming model, and do not have Mellanox mlx5 equipment installed on the node, you are much better off trying to use SOS over OFI libfabric: https://github.com/Sandia-OpenSHMEM/SOS https://github.com/ofiwg/libfabric/releases For SOS you will need to install the hydra launcher as well: http://www.mpich.org/downloads/ I really wish Google would do a better job of surfacing my responses about this type of problem. I seem to respond every couple of months to this exact problem on this mail list. 
Howard 2018-05-09 13:11 GMT-06:00 Craig Reese <cfre...@super.org>: > > I'm trying to play with oshmem on a single node (just to have a way to do > some simple > experimentation and playing around) and having spectacular problems: > > CentOS 6.9 (gcc 4.4.7) > built and installed ucx 1.3.0 > built and installed openmpi-3.1.0 > > [cfreese]$ cat oshmem.c > > #include > int > main() { > shmem_init(); > } > > [cfreese]$ mpicc oshmem.c -loshmem > > [cfreese]$ shmemrun -np 2 ./a.out > > [ucs1l:30118] mca: base: components_register: registering framework spml > components > [ucs1l:30118] mca: base: components_register: found loaded component ucx > [ucs1l:30119] mca: base: components_register: registering framework spml > components > [ucs1l:30119] mca: base: components_register: found loaded component ucx > [ucs1l:30119] mca: base: components_register: component ucx register > function successful > [ucs1l:30118] mca: base: components_register: component ucx register > function successful > [ucs1l:30119] mca: base: components_open: opening spml components > [ucs1l:30119] mca: base: components_open: found loaded component ucx > [ucs1l:30118] mca: base: components_open: opening spml components > [ucs1l:30118] mca: base: components_open: found loaded component ucx > [ucs1l:30119] mca: base: components_open: component ucx open function > successful > [ucs1l:30118] mca: base: components_open: component ucx open function > successful > [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 - > mca_spml_base_select() select: initializing spml component ucx > [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173 > - mca_spml_ucx_component_init() in ucx, my priority is 21 > [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 - > mca_spml_base_select() select: initializing spml component ucx > [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173 > - mca_spml_ucx_component_init() in ucx, my priority is 21 > 
[ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184 > - mca_spml_ucx_component_init() *** ucx initialized > [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 - > mca_spml_base_select() select: init returned priority 21 > [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 - > mca_spml_base_select() selected ucx best priority 21 > [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 - > mca_spml_base_select() select: component ucx selected > [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 - > mca_spml_ucx_enable() *** ucx ENABLED > [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184 > - mca_spml_ucx_component_init() *** ucx initialized > [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 - > mca_spml_base_select() select: init returned priority 21 > [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 - > mca_spml_base_select() selected ucx best priority 21 > [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 - > mca_spml_base_select() select: component ucx selected > [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 - > mca_spml_ucx_enable() *** ucx ENABLED > > here's where I think the real issue is > > [1525891910.424102] [ucs1l:30119:0] select.c:316 UCX ERROR no > remote registered memory access transport to : mm/posix - > Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0 > - no put short, self/self - Destination is unreachable > [1525891910.424104] [ucs1l:30118:0] select.c:316 UCX ERROR no > remote registered memory ac
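Before buying hardware or installing xpmem, you can check which transports the installed UCX build actually found; if only mm/posix, mm/sysv, tcp, and self show up, the spml/ucx component will fail exactly as in the log above. A sketch, assuming `ucx_info` is on PATH and the xpmem install prefix is a placeholder:

```shell
# Show the devices and transports UCX detected on this node.
ucx_info -d | grep -i transport

# After installing the xpmem driver and loading xpmem.ko, rebuild
# UCX against it so the xpmem transport becomes available.
./configure --prefix=$HOME/opt/ucx --with-xpmem=/opt/xpmem
make -j install
```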
Re: [OMPI users] Debug build of v3.0.1 tarball
Hi Adam, I think you'll have better luck setting the CFLAGS on the configure line. Try ./configure CFLAGS="-g -O0" <your other configure options>. Howard 2018-05-04 12:09 GMT-06:00 Moody, Adam T. <mood...@llnl.gov>: > Hi Howard, > > I do have a make clean after the configure. To be extra safe, I’m now > also deleting the source directory and untarring for each build to make > sure I have a clean starting point. > > > > I do get a successful build if I add --enable-debug to configure and then > do a simple make that has no CFLAGS or LDFLAGS: > > > > make -j VERBOSE=1 > > > > So that’s good. However, looking at the compile lines that were used, I > see a -g but no -O0. I’m trying to force the -g -O0, because our debuggers > show the best info at that optimization level. > > > > If I then also add a CFLAGS=”-g -O0” to my make command, I see the “-g > -O0” in the compile lines, but then the pthread link error shows up: > > > > make -j CFLAGS=”-g -O0” VERBOSE=1 > > > > CC opal_wrapper.o > > GENERATE opal_wrapper.1 > > CCLD opal_wrapper > > ../../../opal/.libs/libopen-pal.so: undefined reference to > `pthread_atfork' > > collect2: error: ld returned 1 exit status > > make[2]: *** [opal_wrapper] Error 1 > > > > Also setting LDFLAGS fixes that up. Just wondering whether I’m going > about it the right way in trying to get -g -O0 in the build. > > > > Thanks for your help, > > -Adam > > > > *From: *users <users-boun...@lists.open-mpi.org> on behalf of Howard > Pritchard <hpprit...@gmail.com> > *Reply-To: *Open MPI Users <users@lists.open-mpi.org> > *Date: *Friday, May 4, 2018 at 7:46 AM > *To: *Open MPI Users <users@lists.open-mpi.org> > *Subject: *Re: [OMPI users] Debug build of v3.0.1 tarball > > > > HI Adam, > > > > Sorry didn't notice you did try the --enable-debug flag. That should not > have > > led to the link error building the opal dso. Did you do a make clean after > > rerunning configure? 
> > > > Howard > > > > > > 2018-05-04 8:22 GMT-06:00 Howard Pritchard <hpprit...@gmail.com>: > > Hi Adam, > > > > Did you try using the --enable-debug configure option along with your > CFLAGS options? > > You may want to see if that simplifies your build. > > > > In any case, we'll fix the problems you found. > > > > Howard > > > > > > 2018-05-03 15:00 GMT-06:00 Moody, Adam T. <mood...@llnl.gov>: > > Hello Open MPI team, > > I'm looking for the recommended way to produce a debug build of Open MPI > v3.0.1 that compiles with “-g -O0” so that I get accurate debug info under > a debugger. > > So far, I've gone through the following sequence. I started with > CFLAGS="-g -O0" on make: > > shell$ ./configure --prefix=$installdir --disable-silent-rules \ > > --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi > > shell$ make -j CFLAGS="-g -O0" VERBOSE=1 > > That led to the following error: > > In file included from ../../../../opal/util/arch.h:26:0, > > from btl_openib.h:43, > > from btl_openib_component.c:79: > > btl_openib_component.c: In function 'progress_pending_frags_wqe': > > btl_openib_component.c:3351:29: error: 'opal_list_item_t' has no member named > 'opal_list_item_refcount' > > assert(0 == frag->opal_list_item_refcount); > > ^ > > make[2]: *** [btl_openib_component.lo] Error 1 > > make[2]: *** Waiting for unfinished jobs > > make[2]: Leaving directory `.../openmpi-3.0.1/opal/mca/btl/openib' > > So it seems the assert is referring to a field structure that is protected > by a debug flag. 
I then added --enable-debug to configure, which led to: > > make[2]: Entering directory `.../openmpi-3.0.1/opal/tools/wrappers' > > CC opal_wrapper.o > > GENERATE opal_wrapper.1 > > CCLD opal_wrapper > > ../../../opal/.libs/libopen-pal.so: undefined reference to `pthread_atfork' > > collect2: error: ld returned 1 exit status > > make[2]: *** [opal_wrapper] Error 1 > > make[2]: Leaving directory `.../openmpi-3.0.1/opal/tools/wrappers' > > Finally, if I also add LDFLAGS="-lpthread" to make, I get a build: > > shell$ ./configure --prefix=$installdir --enable-debug --disable-silent-rules > \ > > --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi > > shell$ make -j CFLAGS="-g -O0" LDFLAGS="-lpthread" VERBOSE=1 > > Am I doing this correct
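Putting the advice above together: the debug flags go on the configure line rather than the make line, so libtool records them consistently and the pthread link line is generated correctly. A sketch using the configure options from the thread; $installdir is a placeholder:

```shell
# Pass the debug flags to configure, not to make, so every
# sub-library (including libopen-pal) is built the same way.
./configure --prefix=$installdir --enable-debug CFLAGS="-g -O0" \
    --disable-silent-rules --disable-new-dtags \
    --enable-mpi-cxx --enable-cxx-exceptions --with-pmi
make -j VERBOSE=1   # no CFLAGS/LDFLAGS override needed here
make install
```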
Re: [OMPI users] Debug build of v3.0.1 tarball
HI Adam, Sorry didn't notice you did try the --enable-debug flag. That should not have led to the link error building the opal dso. Did you do a make clean after rerunning configure? Howard 2018-05-04 8:22 GMT-06:00 Howard Pritchard <hpprit...@gmail.com>: > Hi Adam, > > Did you try using the --enable-debug configure option along with your > CFLAGS options? > You may want to see if that simplifies your build. > > In any case, we'll fix the problems you found. > > Howard > > > 2018-05-03 15:00 GMT-06:00 Moody, Adam T. <mood...@llnl.gov>: > >> Hello Open MPI team, >> >> I'm looking for the recommended way to produce a debug build of Open MPI >> v3.0.1 that compiles with “-g -O0” so that I get accurate debug info under >> a debugger. >> >> So far, I've gone through the following sequence. I started with >> CFLAGS="-g -O0" on make: >> >> shell$ ./configure --prefix=$installdir --disable-silent-rules \ >> >> --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi >> >> shell$ make -j CFLAGS="-g -O0" VERBOSE=1 >> >> That led to the following error: >> >> In file included from ../../../../opal/util/arch.h:26:0, >> >> from btl_openib.h:43, >> >> from btl_openib_component.c:79: >> >> btl_openib_component.c: In function 'progress_pending_frags_wqe': >> >> btl_openib_component.c:3351:29: error: 'opal_list_item_t' has no member >> named 'opal_list_item_refcount' >> >> assert(0 == frag->opal_list_item_refcount); >> >> ^ >> >> make[2]: *** [btl_openib_component.lo] Error 1 >> >> make[2]: *** Waiting for unfinished jobs >> >> make[2]: Leaving directory `.../openmpi-3.0.1/opal/mca/btl/openib' >> >> So it seems the assert is referring to a field structure that is >> protected by a debug flag. 
I then added --enable-debug to configure, which >> led to: >> >> make[2]: Entering directory `.../openmpi-3.0.1/opal/tools/wrappers' >> >> CC opal_wrapper.o >> >> GENERATE opal_wrapper.1 >> >> CCLD opal_wrapper >> >> ../../../opal/.libs/libopen-pal.so: undefined reference to `pthread_atfork' >> >> collect2: error: ld returned 1 exit status >> >> make[2]: *** [opal_wrapper] Error 1 >> >> make[2]: Leaving directory `.../openmpi-3.0.1/opal/tools/wrappers' >> >> Finally, if I also add LDFLAGS="-lpthread" to make, I get a build: >> >> shell$ ./configure --prefix=$installdir --enable-debug >> --disable-silent-rules \ >> >> --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi >> >> shell$ make -j CFLAGS="-g -O0" LDFLAGS="-lpthread" VERBOSE=1 >> >> Am I doing this correctly? >> >> Is there a pointer to the configure/make flags for this? >> >> I did find this page that describes the developer build from a git clone, >> but that seemed a bit overkill since I am looking for a debug build from >> the distribution tarball instead of the git clone (avoid the autotools >> nightmare): >> >> https://www.open-mpi.org/source/building.php >> >> Thanks. >> >> -Adam >> >> ___ >> users mailing list >> users@lists.open-mpi.org >> https://lists.open-mpi.org/mailman/listinfo/users >> > > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Debug build of v3.0.1 tarball
Hi Adam, Did you try using the --enable-debug configure option along with your CFLAGS options? You may want to see if that simplifies your build. In any case, we'll fix the problems you found. Howard 2018-05-03 15:00 GMT-06:00 Moody, Adam T. <mood...@llnl.gov>: > Hello Open MPI team, > > I'm looking for the recommended way to produce a debug build of Open MPI > v3.0.1 that compiles with “-g -O0” so that I get accurate debug info under > a debugger. > > So far, I've gone through the following sequence. I started with > CFLAGS="-g -O0" on make: > > shell$ ./configure --prefix=$installdir --disable-silent-rules \ > > --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi > > shell$ make -j CFLAGS="-g -O0" VERBOSE=1 > > That led to the following error: > > In file included from ../../../../opal/util/arch.h:26:0, > > from btl_openib.h:43, > > from btl_openib_component.c:79: > > btl_openib_component.c: In function 'progress_pending_frags_wqe': > > btl_openib_component.c:3351:29: error: 'opal_list_item_t' has no member named > 'opal_list_item_refcount' > > assert(0 == frag->opal_list_item_refcount); > > ^ > > make[2]: *** [btl_openib_component.lo] Error 1 > > make[2]: *** Waiting for unfinished jobs > > make[2]: Leaving directory `.../openmpi-3.0.1/opal/mca/btl/openib' > > So it seems the assert is referring to a field structure that is protected > by a debug flag. 
I then added --enable-debug to configure, which led to: > > make[2]: Entering directory `.../openmpi-3.0.1/opal/tools/wrappers' > > CC opal_wrapper.o > > GENERATE opal_wrapper.1 > > CCLD opal_wrapper > > ../../../opal/.libs/libopen-pal.so: undefined reference to `pthread_atfork' > > collect2: error: ld returned 1 exit status > > make[2]: *** [opal_wrapper] Error 1 > > make[2]: Leaving directory `.../openmpi-3.0.1/opal/tools/wrappers' > > Finally, if I also add LDFLAGS="-lpthread" to make, I get a build: > > shell$ ./configure --prefix=$installdir --enable-debug --disable-silent-rules > \ > > --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi > > shell$ make -j CFLAGS="-g -O0" LDFLAGS="-lpthread" VERBOSE=1 > > Am I doing this correctly? > > Is there a pointer to the configure/make flags for this? > > I did find this page that describes the developer build from a git clone, > but that seemed a bit overkill since I am looking for a debug build from > the distribution tarball instead of the git clone (avoid the autotools > nightmare): > > https://www.open-mpi.org/source/building.php > > Thanks. > > -Adam > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0
Hello Ben, Thanks for the info. You would probably be better off installing UCX on your cluster and rebuilding your Open MPI with the --with-ucx configure option. Here's what I'm seeing with Open MPI 3.0.1 on a ConnectX-5 based cluster using the ob1/openib BTL:

mpirun -map-by ppr:1:node -np 2 ./osu_bibw
# OSU MPI Bi-Directional Bandwidth Test v5.1
# Size      Bandwidth (MB/s)
1                       0.00
2                       0.00
4                       0.01
8                       0.02
16                      0.04
32                      0.07
64                      0.13
128                   273.64
256                   485.04
512                   869.51
1024                 1434.99
2048                 2208.12
4096                 3055.67
8192                 3896.93
16384                  89.29
32768                 252.59
65536                 614.42
131072              22878.74
262144              23846.93
524288              24256.23
1048576             24498.27
2097152             24615.64
4194304             24632.58

export OMPI_MCA_pml=ucx
# OSU MPI Bi-Directional Bandwidth Test v5.1
# Size      Bandwidth (MB/s)
1                       4.57
2                       8.95
4                      17.67
8                      35.99
16                     71.99
32                    141.56
64                    208.86
128                   410.32
256                   495.56
512                  1455.98
1024                 2414.78
2048                 3008.19
4096                 5351.62
8192                 5563.66
16384                5945.16
32768                6061.33
65536               21376.89
131072              23462.99
262144              24064.56
524288              24366.84
1048576             24550.75
2097152             24649.03
4194304             24693.77

You can get UCX off of GitHub: https://github.com/openucx/ucx/releases There is also a pre-release version of UCX (1.3.0RCX?) packaged as an RPM available in MOFED 4.3. See http://www.mellanox.com/page/products_dyn?product_family=26=linux_sw_drivers I was using UCX 1.2.2 for the results above. Good luck, Howard 2018-04-05 1:12 GMT-06:00 Ben Menadue <ben.mena...@nci.org.au>: > Hi, > > Another interesting point. I noticed that the last two message sizes > tested (2MB and 4MB) are lower than expected for both osu_bw and osu_bibw. 
> Increasing the minimum size to use the RDMA pipeline to above these sizes > brings those two data-points up to scratch for both benchmarks: > > *3.0.0, osu_bw, no rdma for large messages* > > > mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -map-by > ppr:1:node -np 2 -H r6,r7 ./osu_bw -m 2097152:4194304 > # OSU MPI Bi-Directional Bandwidth Test v5.4.0 > # Size Bandwidth (MB/s) > 2097152 6133.22 > 4194304 6054.06 > > *3.0.0, osu_bibw, eager rdma disabled, no rdma for large messages* > > > mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -mca > btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw -m > 2097152:4194304 > # OSU MPI Bi-Directional Bandwidth Test v5.4.0 > # Size Bandwidth (MB/s) > 2097152 11397.85 > 4194304 11389.64 > > This makes me think something odd is going on in the RDMA pipeline. > > Cheers, > Ben > > > > On 5 Apr 2018, at 5:03 pm, Ben Menadue <ben.mena...@nci.org.au> wrote: > > Hi, > > We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and noticed > that *osu_bibw* gives nowhere near the bandwidth I’d expect (this is on > FDR IB). However, *osu_bw* is fine. > > If I disable eager RDMA, then *osu_bibw* gives the expected > numbers. Similarly, if I increase the number of eager RDMA buffers, it > gives the expected results. > > OpenMPI 1.10.7 gives consistent, reasonable numbers with default settings, > but they’re not as good as 3.0.0 (when tuned) for large buffers. The same > option changes produce no different in the performance for 1.10.7. > > I was wondering if anyone else has noticed anything similar, and if this > is unexpected, if anyone has a suggestion on how to investigate further? 
> > Thanks, > Ben > > > Here’s are the numbers: > > *3.0.0, osu_bw, default settings* > > > mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bw > # OSU MPI Bandwidth Test v5.4.0 > # Size Bandwidth (MB/s) > 1 1.13 > 2 2.29 > 4 4.63 > 8 9.21 > 16 18.18 > 32 36.46 > 64 69.95 > 128 128.55 > 256 250.74 > 512 451.54 > 1024 829.44 > 2048 1475.87 > 4096
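Until UCX is installed, the two workarounds identified in this thread can also be applied globally through MCA environment variables rather than per-mpirun options. A sketch using the parameters shown above:

```shell
# Work around the eager-RDMA osu_bibw slowdown seen with ob1/openib.
export OMPI_MCA_btl_openib_use_eager_rdma=0
# Avoid the RDMA pipeline for the two largest message sizes tested.
export OMPI_MCA_btl_openib_min_rdma_pipeline_size=4194304

mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw
```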
Re: [OMPI users] OpenMPI with Portals4 transport
Hi Brian, Thanks for the info. I'm not sure I quite get the response though. Is the race condition in the way the Open MPI Portals4 MTL is using Portals, or is it a problem in the Portals implementation itself? Howard 2018-02-08 9:20 GMT-07:00 D. Brian Larkins <brianlark...@gmail.com>: > Howard, > > Looks like ob1 is working fine. When I looked into the problems with ob1, > it looked like the progress thread was polling the Portals event queue > before it had been initialized. > > b. > > $ mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib osu_latency > WARNING: Ummunotify not found: Not using ummunotify can result in > incorrect results download and install ummunotify from: > http://support.systemfabricworks.com/downloads/ummunotify/ > ummunotify-v2.tar.bz2 > WARNING: Ummunotify not found: Not using ummunotify can result in > incorrect results download and install ummunotify from: > http://support.systemfabricworks.com/downloads/ummunotify/ > ummunotify-v2.tar.bz2 > # OSU MPI Latency Test > # Size  Latency (us) > 0 1.87 > 1 1.93 > 2 1.90 > 4 1.94 > 8 1.94 > 16 1.96 > 32 1.97 > 64 1.99 > 128 2.43 > 256 2.50 > 512 2.71 > 1024 3.01 > 2048 3.45 > 4096 4.56 > 8192 6.39 > 16384 8.79 > 32768 11.50 > 65536 16.59 > 131072 27.10 > 262144 46.97 > 524288 87.55 > 1048576 168.89 > 2097152 331.40 > 4194304 654.08 > > > On Feb 7, 2018, at 9:04 PM, Howard Pritchard <hpprit...@gmail.com> wrote: > > HI Brian, > > As a sanity check, can you see if the ob1 pml works okay, i.e. > > mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib ./osu_latency > > Howard > > > 2018-02-07 11:03 GMT-07:00 brian larkins <brianlark...@gmail.com>: > >> Hello, >> >> I’m doing some work with Portals4 and am trying to run some MPI programs >> using the Portals 4 as the transport layer. I’m running into problems and >> am hoping that someone can help me figure out how to get things working. 
>> I’m using OpenMPI 3.0.0 with the following configuration: >> >> ./configure CFLAGS=-pipe —prefix=path/to/install --enable-picky >> --enable-debug --enable-mpi-fortran --with-portals4=path/to/portals4 >> --disable-oshmem --disable-vt --disable-java --disable-mpi-io >> --disable-io-romio --disable-libompitrace --disable-btl-portals4-flow-control >> --disable-mtl-portals4-flow-control >> >> I have also tried the head from the git repo and 2.1.2 with the same >> results. A simpler configure line (w —prefix and —with-portals4=) also gets >> same results. >> >> Portals4 configuration is from github master and configured thus: >> >> ./configure —prefix=path/to/portals4 --with-ev=path/to/libev >> --enable-transport-ib --enable-fast --enable-zero-mrs --enable-me-triggered >> >> If I specify the cm pml on the command-line, I can get examples/hello_c >> to run correctly. Trying to get some latency numbers using the OSU >> benchmarks is where my trouble begins: >> >> $ mpirun -n 2 --mca mtl portals4 --mca pml cm env >> PTL_DISABLE_MEM_REG_CACHE=1 ./osu_latency >> NOTE: Ummunotify and IB registered mem cache disabled, set >> PTL_DISABLE_MEM_REG_CACHE=0 to re-enable. >> NOTE: Ummunotify and IB registered mem cache disabled, set >> PTL_DISABLE_MEM_REG_CACHE=0 to re-enable. >> # OSU MPI Latency Test >> # SizeLatency (us) >> 025.96 >> [node41:19740] *** An error occurred in MPI_Barrier >> [node41:19740] *** reported by process [139815819542529,4294967297] >> [node41:19740] *** on communicator MPI_COMM_WORLD >> [node41:19740] *** MPI_ERR_OTHER: known error not in list >> [node41:19740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator >> will now abort, >> [node41:19740] ***and potentially your MPI job) >> >> Not specifying CM gets an earlier segfault (defaults to ob1) and looks to >> be a progress thread initialization problem. 
>> Using PTL_IGNORE_UMMUNOTIFY=1 gets here: >> >> $ mpirun --mca pml cm -n 2 env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency >> # OSU MPI Latency Test >> # SizeLatency (us) >> 0
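When it is unclear which PML/MTL pairing is actually being selected (ob1 vs. cm over portals4), the selection logic itself can be made verbose. A sketch along the lines of the runs above:

```shell
# Print the PML and MTL selection decisions during startup so the
# chosen components (and their priorities) show up in the output.
mpirun -np 2 --mca pml cm --mca mtl portals4 \
    --mca pml_base_verbose 100 --mca mtl_base_verbose 100 \
    ./osu_latency
```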
Re: [OMPI users] Using OpenSHMEM with Shared Memory
Hi Ben, I'm afraid this is bad news for using UCX. The problem is that when UCX was configured/built, it did not find a transport for doing one-sided put/get transfers. If you're feeling lucky, you may want to install xpmem (https://github.com/hjelmn/xpmem) and rebuild UCX. This requires building a device driver against your kernel source and taking steps to get the xpmem.ko loaded into the kernel, etc. There's an alternative, however, which works just fine on a laptop running Linux or macOS. Check out https://github.com/Sandia-OpenSHMEM/SOS/releases and get the 1.4.0 release. For build/install, follow the directions at https://github.com/Sandia-OpenSHMEM/SOS/wiki/OFI-Build-Instructions Note you will need to install the MPICH hydra launcher as well. Sandia OpenSHMEM over OFI libfabric uses TCP sockets as the fallback if nothing else is available. I use this version of OpenSHMEM if I'm doing SHMEM stuff on my Mac (no VMs). Howard 2018-02-07 12:49 GMT-07:00 Benjamin Brock <br...@cs.berkeley.edu>: > > Here's what I get with those environment variables: > > https://hastebin.com/ibimipuden.sql > > I'm running Arch Linux (but with OpenMPI/UCX installed from source as > described in my earlier message). > > Ben > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
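A condensed sketch of the SOS-over-libfabric route described above. The prefixes are placeholders and the exact steps (and any additional configure options) are in the wiki page linked in the reply, so treat this as an outline rather than a recipe:

```shell
# Build libfabric first (run in the libfabric source tree).
./configure --prefix=$HOME/opt/libfabric
make -j install

# Then build Sandia OpenSHMEM against it (run in the SOS source tree).
./configure --prefix=$HOME/opt/sos --with-ofi=$HOME/opt/libfabric
make -j install

# Finally install the MPICH hydra launcher from mpich.org, which
# provides the oshrun/mpiexec used to launch SOS programs.
```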
Re: [OMPI users] OpenMPI with Portals4 transport
HI Brian, As a sanity check, can you see if the ob1 pml works okay, i.e. mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib ./osu_latency Howard 2018-02-07 11:03 GMT-07:00 brian larkins <brianlark...@gmail.com>: > Hello, > > I’m doing some work with Portals4 and am trying to run some MPI programs > using the Portals 4 as the transport layer. I’m running into problems and > am hoping that someone can help me figure out how to get things working. > I’m using OpenMPI 3.0.0 with the following configuration: > > ./configure CFLAGS=-pipe —prefix=path/to/install --enable-picky > --enable-debug --enable-mpi-fortran --with-portals4=path/to/portals4 > --disable-oshmem --disable-vt --disable-java --disable-mpi-io > --disable-io-romio --disable-libompitrace --disable-btl-portals4-flow-control > --disable-mtl-portals4-flow-control > > I have also tried the head from the git repo and 2.1.2 with the same > results. A simpler configure line (w —prefix and —with-portals4=) also gets > same results. > > Portals4 configuration is from github master and configured thus: > > ./configure —prefix=path/to/portals4 --with-ev=path/to/libev > --enable-transport-ib --enable-fast --enable-zero-mrs --enable-me-triggered > > If I specify the cm pml on the command-line, I can get examples/hello_c to > run correctly. Trying to get some latency numbers using the OSU benchmarks > is where my trouble begins: > > $ mpirun -n 2 --mca mtl portals4 --mca pml cm env > PTL_DISABLE_MEM_REG_CACHE=1 ./osu_latency > NOTE: Ummunotify and IB registered mem cache disabled, set > PTL_DISABLE_MEM_REG_CACHE=0 to re-enable. > NOTE: Ummunotify and IB registered mem cache disabled, set > PTL_DISABLE_MEM_REG_CACHE=0 to re-enable. 
> # OSU MPI Latency Test > # SizeLatency (us) > 025.96 > [node41:19740] *** An error occurred in MPI_Barrier > [node41:19740] *** reported by process [139815819542529,4294967297] > [node41:19740] *** on communicator MPI_COMM_WORLD > [node41:19740] *** MPI_ERR_OTHER: known error not in list > [node41:19740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator > will now abort, > [node41:19740] ***and potentially your MPI job) > > Not specifying CM gets an earlier segfault (defaults to ob1) and looks to > be a progress thread initialization problem. > Using PTL_IGNORE_UMMUNOTIFY=1 gets here: > > $ mpirun --mca pml cm -n 2 env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency > # OSU MPI Latency Test > # SizeLatency (us) > 024.14 > 126.24 > [node41:19993] *** Process received signal *** > [node41:19993] Signal: Segmentation fault (11) > [node41:19993] Signal code: Address not mapped (1) > [node41:19993] Failing at address: 0x141 > [node41:19993] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fa6ac73b710] > [node41:19993] [ 1] /ascldap/users/dblarki/opt/portals4.master/lib/ > libportals.so.4(+0xcd65)[0x7fa69b770d65] > [node41:19993] [ 2] /ascldap/users/dblarki/opt/portals4.master/lib/ > libportals.so.4(PtlPut+0x143)[0x7fa69b773fb3] > [node41:19993] [ 3] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_ > portals4.so(+0xa961)[0x7fa698cf5961] > [node41:19993] [ 4] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_ > portals4.so(+0xb0e5)[0x7fa698cf60e5] > [node41:19993] [ 5] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_ > portals4.so(ompi_mtl_portals4_send+0x90)[0x7fa698cf61d1] > [node41:19993] [ 6] /ascldap/users/dblarki/opt/ > ompi/lib/openmpi/mca_pml_cm.so(+0x5430)[0x7fa69a794430] > [node41:19993] [ 7] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(PMPI_ > Send+0x2b4)[0x7fa6ac9ff018] > [node41:19993] [ 8] ./osu_latency[0x40106f] > [node41:19993] [ 9] /lib64/libc.so.6(__libc_start_ > main+0xfd)[0x7fa6ac3b6d5d] > [node41:19993] [10] ./osu_latency[0x400c59] > > This 
cluster is running RHEL 6.5 without ummunotify modules, but I get the > same results on a local (small) cluster running ubuntu 16.04 with > ummunotify loaded. > > Any help would be much appreciated. > thanks, > > brian. > > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Using OpenSHMEM with Shared Memory
HI Ben, Could you set these environment variables and post the output ? export OMPI_MCA_spml=ucx export OMPI_MCA_spml_base_verbose=100 then run your test? Also, what OS are you using? Howard 2018-02-06 20:10 GMT-07:00 Jeff Hammond <jeff.scie...@gmail.com>: > > On Tue, Feb 6, 2018 at 3:58 PM Benjamin Brock <br...@cs.berkeley.edu> > wrote: > >> How can I run an OpenSHMEM program just using shared memory? I'd like to >> use OpenMPI to run SHMEM programs locally on my laptop. >> > > It’s not Open-MPI itself but OSHMPI sits on top of any MPI-3 library and > has a mode to bypass MPI for one-sided if only used within a shared-memory > domain. > > > See https://github.com/jeffhammond/oshmpi and use --enable-smp-optimizations. > While I don’t actively maintain it and it doesn’t support the latest spec, > I’ll fix bugs and implement features on demand if users file GitHub issues. > > Sorry for the shameless self-promotion but I know a few folks who use > OSHMPI specifically because of the SMP feature. > > Sandia OpenSHMEM with OFI definitely works on shared-memory as well. I use > it for all of my Travis CI testing of SHMEM code on both Mac and Linux. > > Jeff > > >> I understand that the old SHMEM component (Yoda?) was taken out, and that >> UCX is now required. I have a build of OpenMPI with UCX as per the >> directions on this random GitHub Page >> <https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX> >> . >> >> When I try to just `shmemrun`, I get a complaint about not haivng any >> splm components available. >> >> [xiii@shini kmer_hash]$ shmemrun -np 2 ./kmer_generic_hash >> >> -- >> No available spml components were found! >> >> This means that there are no components of this type installed on your >> system or all the components reported that they could not be used. >> >> This is a fatal error; your SHMEM process is likely to abort. 
Check the >> output of the "ompi_info" command and ensure that components of this >> type are available on your system. You may also wish to check the >> value of the "component_path" MCA parameter and ensure that it has at >> least one directory that contains valid MCA components. >> >> -- >> [shini:16341] SPML ikrit cannot be selected >> [shini:16342] SPML ikrit cannot be selected >> [shini:16336] 1 more process has sent help message >> help-oshmem-memheap.txt / find-available:none-found >> [shini:16336] Set MCA parameter "orte_base_help_aggregate" to 0 to see >> all help / error messages >> >> >> I tried fiddling with the MCA command-line settings, but didn't have any >> luck. Is it possible to do this? Can anyone point me to some >> documentation? >> >> Thanks, >> >> Ben >> ___ >> users mailing list >> users@lists.open-mpi.org >> https://lists.open-mpi.org/mailman/listinfo/users > > -- > Jeff Hammond > jeff.scie...@gmail.com > http://jeffhammond.github.io/ > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
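Howard's diagnostic request above can be collected into a single shell fragment. The MCA environment variable names are the ones quoted in the thread; the program name is just the poster's example binary:

```shell
# Force the UCX SPML and turn on verbose component selection so the
# reason behind "No available spml components were found!" is printed.
export OMPI_MCA_spml=ucx
export OMPI_MCA_spml_base_verbose=100

# Re-run the failing OpenSHMEM program with the verbose output enabled.
shmemrun -np 2 ./kmer_generic_hash

# ompi_info shows which SPML components this build actually contains.
ompi_info | grep spml
```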
Re: [OMPI users] About my GPU performance using Openmpi-2.0.4
Hi Phanikumar It’s unlikely the warning message you are seeing is related to GPU performance. Have you tried adding —with-verbs=no to your config line? That should quash openib complaint. Howard Phanikumar Pentyala <phani12.c...@gmail.com> schrieb am Mo. 11. Dez. 2017 um 22:43: > Dear users and developers, > > Currently I am using two Tesla K40m cards for my computational work on > quantum espresso (QE) suit http://www.quantum-espresso.org/. My GPU > enabled QE code running very slower than normal version. When I am > submitting my job on gpu it was showing some error that "A high-performance > Open MPI point-to-point messaging module was unable to find any relevant > network interfaces: > > Module: OpenFabrics (openib) > Host: qmel > > Another transport will be used instead, although this may result in > lower performance. > > Is this the reason for diminishing GPU performance ?? > > I done installation by > > 1. ./configure --prefix=/home//software/openmpi-2.0.4 > --disable-openib-dynamic-sl --disable-openib-udcm --disable-openib-rdmacm" > because we don't have any Infiband adapter HCA in server. > > 2. make all > > 3. make install > > Please correct me If I done any mistake in my installation or I have to > use Infiband adaptor for using Openmpi?? > > I read lot of posts in openmpi forum to remove above error while > submitting job, I added tag of "--mca btl ^openib" , still no use error > vanished but performance was same. > > Current details of server are: > > Server: FUJITSU PRIMERGY RX2540 M2 > CUDA version: 9.0 > openmpi version: 2.0.4 with intel mkl libraries > QE-gpu version (my application): 5.4.0 > > P.S: Extra information attached > > Thanks in advance > > Regards > Phanikumar > Research scholar > IIT Kharagpur > Kharagpur, westbengal > India > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
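Howard's suggestion amounts to two alternatives, sketched below. Only `--with-verbs=no` and `--mca btl ^openib` come from the thread; the prefix path and the application name (`./your_app`) are illustrative placeholders:

```shell
# Build-time fix: configure Open MPI without openib (verbs) support,
# since this server has no InfiniBand HCA.
./configure --prefix=$HOME/software/openmpi-2.0.4 --with-verbs=no
make all && make install

# Run-time alternative: keep the existing build but exclude the openib
# BTL so the "unable to find any relevant network interfaces" warning
# is not emitted.
mpirun --mca btl ^openib -np 2 ./your_app
```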
Re: [OMPI users] [EXTERNAL] Re: Using shmem_int_fadd() in OpenMPI's SHMEM
Hi Ben, Actually I did some checking about the brew install for OFi libfabric. It looks like if your brew is up to date, it will pick up libfabric 1.5.2. Howard 2017-11-22 15:21 GMT-07:00 Howard Pritchard <hpprit...@gmail.com>: > HI Ben, > > Even on one box, the yoda component doesn't work any more. > > If you want to do OpenSHMEM programming on you Macbook pro (like I do) > and you don't want to set up a VM to use UCX, then you can use > Sandia OpenSHMEM implementation. > > https://github.com/Sandia-OpenSHMEM/SOS > > You will need to install the MPICH hydra launcher > > http://www.mpich.org/downloads/versions/ > > as the SOS needs that for its oshrun launcher. > > I use hydra-3.2 on my mac with SOS. > > You will also need to install OFI libfabric: > > https://github.com/ofiwg/libfabric > > I'd suggest installing the OFI 1.5.1 tarball. OFI is also available via > brew > but its so old that I doubt it will work with recent versions of SOS. > > If you'd like to use UCX, you'll need to install it and Open MPI on a VM > running a linux distro. > > Howard > > > 2017-11-21 12:47 GMT-07:00 Benjamin Brock <br...@cs.berkeley.edu>: > >> > What version of Open MPI are you trying to use? >> >> Open MPI 2.1.1-2 as distributed by Arch Linux. >> >> > Also, could you describe something about your system. >> >> This is all in shared memory on a MacBook Pro; no networking involved. 
>> >> The seg fault with the code example above looks like this: >> >> [xiii@shini kmer_hash]$ g++ minimal.cpp -o minimal `shmemcc >> --showme:link` >> [xiii@shini kmer_hash]$ !shm >> shmemrun -n 2 ./minimal >> [shini:08284] *** Process received signal *** >> [shini:08284] Signal: Segmentation fault (11) >> [shini:08284] Signal code: Address not mapped (1) >> [shini:08284] Failing at address: 0x18 >> [shini:08284] [ 0] /usr/lib/libpthread.so.0(+0x11da0)[0x7f06fb763da0] >> [shini:08284] [ 1] /usr/lib/openmpi/openmpi/mca_s >> pml_yoda.so(mca_spml_yoda_get+0x7da)[0x7f06e0eef0aa] >> [shini:08284] [ 2] /usr/lib/openmpi/openmpi/mca_a >> tomic_basic.so(atomic_basic_lock+0xb2)[0x7f06e08d90d2] >> [shini:08284] [ 3] /usr/lib/openmpi/openmpi/mca_a >> tomic_basic.so(mca_atomic_basic_fadd+0x4a)[0x7f06e08d949a] >> [shini:08284] [ 4] /usr/lib/openmpi/liboshmem.so. >> 20(shmem_int_fadd+0x90)[0x7f06fc5a7660] >> [shini:08284] [ 5] ./minimal(+0x94f)[0x55a5cde7e94f] >> [shini:08284] [ 6] /usr/lib/libc.so.6(__libc_star >> t_main+0xea)[0x7f06fb3baf6a] >> [shini:08284] [ 7] ./minimal(+0x80a)[0x55a5cde7e80a] >> [shini:08284] *** End of error message *** >> >> -- >> shmemrun noticed that process rank 1 with PID 0 on node shini exited on >> signal 11 (Segmentation fault). >> >> -- >> >> Cheers, >> >> Ben >> >> ___ >> users mailing list >> users@lists.open-mpi.org >> https://lists.open-mpi.org/mailman/listinfo/users >> > > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] [EXTERNAL] Re: Using shmem_int_fadd() in OpenMPI's SHMEM
HI Ben, Even on one box, the yoda component doesn't work any more. If you want to do OpenSHMEM programming on you Macbook pro (like I do) and you don't want to set up a VM to use UCX, then you can use Sandia OpenSHMEM implementation. https://github.com/Sandia-OpenSHMEM/SOS You will need to install the MPICH hydra launcher http://www.mpich.org/downloads/versions/ as the SOS needs that for its oshrun launcher. I use hydra-3.2 on my mac with SOS. You will also need to install OFI libfabric: https://github.com/ofiwg/libfabric I'd suggest installing the OFI 1.5.1 tarball. OFI is also available via brew but its so old that I doubt it will work with recent versions of SOS. If you'd like to use UCX, you'll need to install it and Open MPI on a VM running a linux distro. Howard 2017-11-21 12:47 GMT-07:00 Benjamin Brock <br...@cs.berkeley.edu>: > > What version of Open MPI are you trying to use? > > Open MPI 2.1.1-2 as distributed by Arch Linux. > > > Also, could you describe something about your system. > > This is all in shared memory on a MacBook Pro; no networking involved. > > The seg fault with the code example above looks like this: > > [xiii@shini kmer_hash]$ g++ minimal.cpp -o minimal `shmemcc --showme:link` > [xiii@shini kmer_hash]$ !shm > shmemrun -n 2 ./minimal > [shini:08284] *** Process received signal *** > [shini:08284] Signal: Segmentation fault (11) > [shini:08284] Signal code: Address not mapped (1) > [shini:08284] Failing at address: 0x18 > [shini:08284] [ 0] /usr/lib/libpthread.so.0(+0x11da0)[0x7f06fb763da0] > [shini:08284] [ 1] /usr/lib/openmpi/openmpi/mca_s > pml_yoda.so(mca_spml_yoda_get+0x7da)[0x7f06e0eef0aa] > [shini:08284] [ 2] /usr/lib/openmpi/openmpi/mca_a > tomic_basic.so(atomic_basic_lock+0xb2)[0x7f06e08d90d2] > [shini:08284] [ 3] /usr/lib/openmpi/openmpi/mca_a > tomic_basic.so(mca_atomic_basic_fadd+0x4a)[0x7f06e08d949a] > [shini:08284] [ 4] /usr/lib/openmpi/liboshmem.so. 
> 20(shmem_int_fadd+0x90)[0x7f06fc5a7660] > [shini:08284] [ 5] ./minimal(+0x94f)[0x55a5cde7e94f] > [shini:08284] [ 6] /usr/lib/libc.so.6(__libc_star > t_main+0xea)[0x7f06fb3baf6a] > [shini:08284] [ 7] ./minimal(+0x80a)[0x55a5cde7e80a] > [shini:08284] *** End of error message *** > -- > shmemrun noticed that process rank 1 with PID 0 on node shini exited on > signal 11 (Segmentation fault). > -- > > Cheers, > > Ben > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] [EXTERNAL] Re: Using shmem_int_fadd() in OpenMPI's SHMEM
HI Folks, For the Open MPI 2.1.1 release, the only OSHMEM SPML's that work are the ikrit and ucx. yoda doesn't work. Ikrit only works on systems with Mellanox iinterconnects and requires MXM to be installed. This is recommended for systems with connectx3 or older HCAs. For systems with connectx4 or connectx5 you should be using UCX. You'll need to add --with-ucx + arguments as required to the configure command line when you build Open MPI/OSHMEM to pick up the ucx stuff. A gotcha is that by default, the ucx spml is not selected, so either on the oshrun command line add --mca spml ucx or via env. variable export OMPI_MCA_spml=ucx I verified that a 2.1.1 release + UCX 1.2.0 builds your test (after fixing the unusual include files) and passes on my mellanox connectx5 cluster. Howard 2017-11-21 8:24 GMT-07:00 Hammond, Simon David <sdha...@sandia.gov>: > Hi Howard/OpenMPI Users, > > > > I have had a similar seg-fault this week using OpenMPI 2.1.1 with GCC > 4.9.3 so I tried to compile the example code in the email below. I see > similar behavior to a small benchmark we have in house (but using inc not > finc). > > > > When I run on a single node (both PE’s on the same node) I get the error > below. But, if I run on multiple nodes (say 2 nodes with one PE per node) > then the code runs fine. Same thing for my benchmark which uses > shmem_longlong_inc. For reference, we are using InfiniBand on our cluster > and dual-socket Haswell processors. > > > > Hope that helps, > > > > S. > > > > $ shmemrun -n 2 ./testfinc > > -- > > WARNING: There is at least non-excluded one OpenFabrics device found, > > but there are no active ports detected (or Open MPI was unable to use > > them). This is most certainly not what you wanted. Check your > > cables, subnet manager configuration, etc. The openib BTL will be > > ignored for this job. 
> > > > Local host: shepard-lsm1 > > -- > > [shepard-lsm1:49505] *** Process received signal *** > > [shepard-lsm1:49505] Signal: Segmentation fault (11) > > [shepard-lsm1:49505] Signal code: Address not mapped (1) > > [shepard-lsm1:49505] Failing at address: 0x18 > > [shepard-lsm1:49505] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7ffc4cd9e710] > > [shepard-lsm1:49505] [ 1] /home/projects/x86-64-haswell/ > openmpi/2.1.1/gcc/4.9.3/lib/openmpi/mca_spml_yoda.so(mca_ > spml_yoda_get+0x86d)[0x7ffc337cf37d] > > [shepard-lsm1:49505] [ 2] /home/projects/x86-64-haswell/ > openmpi/2.1.1/gcc/4.9.3/lib/openmpi/mca_atomic_basic.so( > atomic_basic_lock+0x9a)[0x7ffc32f190aa] > > [shepard-lsm1:49505] [ 3] /home/projects/x86-64-haswell/ > openmpi/2.1.1/gcc/4.9.3/lib/openmpi/mca_atomic_basic.so( > mca_atomic_basic_fadd+0x39)[0x7ffc32f19409] > > [shepard-lsm1:49505] [ 4] /home/projects/x86-64-haswell/ > openmpi/2.1.1/gcc/4.9.3/lib/liboshmem.so.20(shmem_int_ > fadd+0x80)[0x7ffc4d2fc110] > > [shepard-lsm1:49505] [ 5] ./testfinc[0x400888] > > [shepard-lsm1:49505] [ 6] /lib64/libc.so.6(__libc_start_ > main+0xfd)[0x7ffc4ca19d5d] > > [shepard-lsm1:49505] [ 7] ./testfinc[0x400739] > > [shepard-lsm1:49505] *** End of error message *** > > -- > > shmemrun noticed that process rank 1 with PID 0 on node shepard-lsm1 > exited on signal 11 (Segmentation fault). 
> > -- > > [shepard-lsm1:49499] 1 more process has sent help message > help-mpi-btl-openib.txt / no active ports found > > [shepard-lsm1:49499] Set MCA parameter "orte_base_help_aggregate" to 0 to > see all help / error messages > > > > -- > > Si Hammond > > Scalable Computer Architectures > > Sandia National Laboratories, NM, USA > > > > > > *From: *users <users-boun...@lists.open-mpi.org> on behalf of Howard > Pritchard <hpprit...@gmail.com> > *Reply-To: *Open MPI Users <users@lists.open-mpi.org> > *Date: *Monday, November 20, 2017 at 4:11 PM > *To: *Open MPI Users <users@lists.open-mpi.org> > *Subject: *[EXTERNAL] Re: [OMPI users] Using shmem_int_fadd() in > OpenMPI's SHMEM > > > > HI Ben, > > > > What version of Open MPI are you trying to use? > > > > Also, could you describe something about your system. If its a cluster > > what sort of interconnect is being used. > > > > Howard > > > > > > 2017-11-20 14:13 GMT-07:00 Benjamin Brock <br...@cs.berkeley.edu>: > > What's the proper way to use
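The selection gotcha Howard describes (the ucx SPML is not picked by default in 2.1.1) can be summarized as a short fragment. The UCX install path is an example; the flags and variable names are the ones given in the thread:

```shell
# Build Open MPI/OSHMEM with UCX support (path is an example).
./configure --with-ucx=/opt/ucx ...

# At run time, select the ucx SPML explicitly, either per-command:
oshrun --mca spml ucx -n 2 ./testfinc

# or once via the environment:
export OMPI_MCA_spml=ucx
```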
Re: [OMPI users] Using shmem_int_fadd() in OpenMPI's SHMEM
HI Ben, What version of Open MPI are you trying to use? Also, could you describe something about your system. If its a cluster what sort of interconnect is being used. Howard 2017-11-20 14:13 GMT-07:00 Benjamin Brock <br...@cs.berkeley.edu>: > What's the proper way to use shmem_int_fadd() in OpenMPI's SHMEM? > > A minimal example seems to seg fault: > > #include > #include > > #include > > int main(int argc, char **argv) { > shmem_init(); > const size_t shared_segment_size = 1024; > void *shared_segment = shmem_malloc(shared_segment_size); > > int *arr = (int *) shared_segment; > int *local_arr = (int *) malloc(sizeof(int) * 10); > > if (shmem_my_pe() == 1) { > shmem_int_fadd((int *) shared_segment, 1, 0); > } > shmem_barrier_all(); > > return 0; > } > > > Where am I going wrong here? This sort of thing works in Cray SHMEM. > > Ben Bock > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Problems building OpenMPI 2.1.1 on Intel KNL
Hello Ake, Would you mind opening an issue on Github so we can track this? https://github.com/open-mpi/ompi/issues There's a template to show what info we need to fix this. Thanks very much for reporting this, Howard 2017-11-20 3:26 GMT-07:00 Åke Sandgren <ake.sandg...@hpc2n.umu.se>: > Hi! > > When the xppsl-libmemkind-dev package version 1.5.3 is installed > building OpenMPI fails. > > opal/mca/mpool/memkind uses the macro MEMKIND_NUM_BASE_KIND which has > been moved to memkind/internal/memkind_private.h > > Current master is also using that so I think that will also fail. > > Are there anyone working on this? > > -- > Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden > Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14 > Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] OMPI 2.1.2 and SLURM compatibility
Hello Bennet, What you are trying to do using srun as the job launcher should work. Could you post the contents of /etc/slurm/slurm.conf for your system? Could you also post the output of the following command: ompi_info --all | grep pmix to the mail list. the config.log from your build would also be useful. Howard 2017-11-16 9:30 GMT-07:00 r...@open-mpi.org <r...@open-mpi.org>: > What Charles said was true but not quite complete. We still support the > older PMI libraries but you likely have to point us to wherever slurm put > them. > > However,we definitely recommend using PMIx as you will get a faster launch > > Sent from my iPad > > > On Nov 16, 2017, at 9:11 AM, Bennet Fauber <ben...@umich.edu> wrote: > > > > Charlie, > > > > Thanks a ton! Yes, we are missing two of the three steps. > > > > Will report back after we get pmix installed and after we rebuild > > Slurm. We do have a new enough version of it, at least, so we might > > have missed the target, but we did at least hit the barn. ;-) > > > > > > > >> On Thu, Nov 16, 2017 at 10:54 AM, Charles A Taylor <chas...@ufl.edu> > wrote: > >> Hi Bennet, > >> > >> Three things... > >> > >> 1. OpenMPI 2.x requires PMIx in lieu of pmi1/pmi2. > >> > >> 2. You will need slurm 16.05 or greater built with —with-pmix > >> > >> 2a. You will need pmix 1.1.5 which you can get from github. > >> (https://github.com/pmix/tarballs). > >> > >> 3. then, to launch your mpi tasks on the allocated resources, > >> > >> srun —mpi=pmix ./hello-mpi > >> > >> I’m replying to the list because, > >> > >> a) this information is harder to find than you might think. > >> b) someone/anyone can correct me if I’’m giving a bum steer. > >> > >> Hope this helps, > >> > >> Charlie Taylor > >> University of Florida > >> > >> On Nov 16, 2017, at 10:34 AM, Bennet Fauber <ben...@umich.edu> wrote: > >> > >> I think that OpenMPI is supposed to support SLURM integration such that > >> > >> srun ./hello-mpi > >> > >> should work? 
I built OMPI 2.1.2 with > >> > >> export CONFIGURE_FLAGS='--disable-dlopen --enable-shared' > >> export COMPILERS='CC=gcc CXX=g++ FC=gfortran F77=gfortran' > >> > >> CMD="./configure \ > >> --prefix=${PREFIX} \ > >> --mandir=${PREFIX}/share/man \ > >> --with-slurm \ > >> --with-pmi \ > >> --with-lustre \ > >> --with-verbs \ > >> $CONFIGURE_FLAGS \ > >> $COMPILERS > >> > >> I have a simple hello-mpi.c (source included below), which compiles > >> and runs with mpirun, both on the login node and in a job. However, > >> when I try to use srun in place of mpirun, I get instead a hung job, > >> which upon cancellation produces this output. > >> > >> [bn2.stage.arc-ts.umich.edu:116377] PMI_Init [pmix_s1.c:162:s1_init]: > >> PMI is not initialized > >> [bn1.stage.arc-ts.umich.edu:36866] PMI_Init [pmix_s1.c:162:s1_init]: > >> PMI is not initialized > >> [warn] opal_libevent2022_event_active: event has no event_base set. > >> [warn] opal_libevent2022_event_active: event has no event_base set. > >> slurmstepd: error: *** STEP 86.0 ON bn1 CANCELLED AT > 2017-11-16T10:03:24 *** > >> srun: Job step aborted: Waiting up to 32 seconds for job step to finish. > >> slurmstepd: error: *** JOB 86 ON bn1 CANCELLED AT 2017-11-16T10:03:24 > *** > >> > >> The SLURM web page suggests that OMPI 2.x and later support PMIx, and > >> to use `srun --mpi=pimx`, however that no longer seems to be an > >> option, and using the `openmpi` type isn't working (neither is pmi2). > >> > >> [bennet@beta-build hello]$ srun --mpi=list > >> srun: MPI types are... > >> srun: mpi/pmi2 > >> srun: mpi/lam > >> srun: mpi/openmpi > >> srun: mpi/mpich1_shmem > >> srun: mpi/none > >> srun: mpi/mvapich > >> srun: mpi/mpich1_p4 > >> srun: mpi/mpichgm > >> srun: mpi/mpichmx > >> > >> To get the Intel PMI to work with srun, I have to set > >> > >> I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so > >> > >> Is there a comparable environment variable that must be set to enable > >> `srun` to work? 
> >> > >> Am I missing a build option or misspecifying one? > >> > >> -- bennet > >>
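Charles's three-step recipe from this thread, condensed into commands. Version numbers and the PMIx prefix are the ones he cites; treat them as a sketch for that era of Slurm/Open MPI rather than current guidance:

```shell
# 1. Build PMIx 1.1.5 (tarballs at https://github.com/pmix/tarballs).
# 2. Build Slurm 16.05 or newer against it:
#      ./configure --with-pmix=/path/to/pmix ...
# 3. Build Open MPI 2.x with matching PMIx support, then launch with:
srun --mpi=pmix ./hello-mpi

# Verify which MPI plugin types this Slurm installation offers;
# "pmix" must appear in the list for the launch line above to work.
srun --mpi=list
```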
Re: [OMPI users] [OMPI devel] Open MPI 2.0.4rc2 available for testing
HI Siegmar, Could you check if you also see a similar problem with OMPI master when you build with the Sun compiler? I opened issue 4436 to track this issue. Not sure we'll have time to fix it for 2.0.4 though. Howard 2017-11-02 3:49 GMT-06:00 Siegmar Gross < siegmar.gr...@informatik.hs-fulda.de>: > Hi, > > thank you very much for the fix. Unfortunately, I still get an error > with Sun C 5.15. > > > loki openmpi-2.0.4rc2-Linux.x86_64.64_cc 125 tail -30 > log.make.Linux.x86_64.64_cc > CC src/client/pmix_client.lo > "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h", > line 161: warning: parameter in inline asm statement unused: %3 > "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h", > line 207: warning: parameter in inline asm statement unused: %2 > "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h", > line 228: warning: parameter in inline asm statement unused: %2 > "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h", > line 249: warning: parameter in inline asm statement unused: %2 > "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h", > line 270: warning: parameter in inline asm statement unused: %2 > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 235: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Get_version > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 240: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Init > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 408: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Initialized > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > 
line 416: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Finalize > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 488: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Abort > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 616: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Put > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 703: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Commit > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 789: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Resolve_peers > "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c", > line 852: redeclaration must have the same or more restrictive linker > scoping: OPAL_PMIX_PMIX112_PMIx_Resolve_nodes > cc: acomp failed for ../../../../../../openmpi-2.0. 
> 4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c > Makefile:1242: recipe for target 'src/client/pmix_client.lo' failed > make[4]: *** [src/client/pmix_client.lo] Error 1 > make[4]: Leaving directory '/export2/src/openmpi-2.0.4/op > enmpi-2.0.4rc2-Linux.x86_64.64_cc/opal/mca/pmix/pmix112/pmix' > Makefile:1486: recipe for target 'all-recursive' failed > make[3]: *** [all-recursive] Error 1 > make[3]: Leaving directory '/export2/src/openmpi-2.0.4/op > enmpi-2.0.4rc2-Linux.x86_64.64_cc/opal/mca/pmix/pmix112/pmix' > Makefile:1935: recipe for target 'all-recursive' failed > make[2]: *** [all-recursive] Error 1 > make[2]: Leaving directory '/export2/src/openmpi-2.0.4/op > enmpi-2.0.4rc2-Linux.x86_64.64_cc/opal/mca/pmix/pmix112' > Makefile:2301: recipe for target 'all-recursive' failed > make[1]: *** [all-recursive] Error 1 > make[1]: Leaving directory '/export2/src/openmpi-2.0.4/op > enmpi-2.0.4rc2-Linux.x86_64.64_cc/opal' > Makefile:1800: recipe for target 'all-recursive' failed > make: *** [all-recursive] Error 1 > loki openmpi-2.0.4rc2-Linux.x86_64.64_cc 125 > > > > I would be grateful, if somebody can fix these problems as well. > Thank you very much for any help in advance. > > > Kind regards > > Siegmar > > > > On 11/01/17 23:18, Howard Pritchard wrote: > >> HI Folks, >> >> We decided to roll an rc2 to pick up a PMIx fix: >> >&g
Re: [OMPI users] Strange benchmarks at large message sizes
Hello Cooper Could you rerun your test with the following env. variable set export OMPI_MCA_coll=self,basic,libnbc and see if that helps? Also, what type of interconnect are you using - ethernet, IB, ...? Howard 2017-09-19 8:56 GMT-06:00 Cooper Burns <cooper.bu...@convergecfd.com>: > Hello, > > I have been running some simple benchmarks and saw some strange behaviour: > All tests are done on 4 nodes with 24 cores each (total of 96 mpi > processes) > > When I run MPI_Allreduce() I see the run time spike up (about 10x) when I > go from reducing a total of 4096KB to 8192KB for example, when count is > 2^21 (8192 kb of 4 byte ints): > > MPI_Allreduce(send_buf, recv_buf, count, MPI_SUM, MPI_COMM_WORLD) > > is slower than: > > MPI_Allreduce(send_buf, recv_buf, count*/2*, MPI_INT, MPI_SUM, > MPI_COMM_WORLD) > MPI_Allreduce(send_buf* + count/2*, recv_buf *+ count/2*, count*/2*,MPI_INT, > MPI_SUM, MPI_COMM_WORLD) > > Just wondering if anyone knows what the cause of this behaviour is. > > Thanks! > Cooper > > > Cooper Burns > Senior Research Engineer > <https://www.linkedin.com/company/convergent-science-inc> > <https://www.facebook.com/ConvergentScience> > <https://twitter.com/convergecfd> > <https://www.youtube.com/user/convergecfd> > <https://vimeo.com/convergecfd> > (608) 230-1551 > convergecfd.com > <https://convergecfd.com/?utm_source=Email_medium=signature_campaign=CSIEmailSignature> > > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
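Howard's experiment above, as a runnable fragment. The coll component list is the one he gives; the benchmark binary name is a hypothetical stand-in for the poster's Allreduce test:

```shell
# Replace the tuned collective component with basic/libnbc to check
# whether a large-message algorithm switch causes the ~10x latency
# spike between 4096 KB and 8192 KB reductions.
export OMPI_MCA_coll=self,basic,libnbc
mpirun -np 96 ./allreduce_bench
```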
Re: [OMPI users] openmpi-2.1.2rc2: warnings from "make" and "make check"
Hi Siegmar, Opened issue 4151 to track this. Thanks, Howard 2017-08-21 7:13 GMT-06:00 Siegmar Gross < siegmar.gr...@informatik.hs-fulda.de>: > Hi, > > I've installed openmpi-2.1.2rc2 on my "SUSE Linux Enterprise Server 12.2 > (x86_64)" with Sun C 5.15 (Oracle Developer Studio 12.6) and gcc-7.1.0. > Perhaps somebody wants to eliminate the following warnings. > > > openmpi-2.1.2rc2-Linux.x86_64.64_gcc/log.make.Linux.x86_64.6 > 4_gcc:openmpi-2.1.2rc2/ompi/mca/io/romio314/romio/adio/common/utils.c:97:3: > warning: passing argument 3 of 'PMPI_Type_hindexed' discards 'const' > qualifier from pointer target type [-Wdiscarded-qualifiers] > openmpi-2.1.2rc2-Linux.x86_64.64_gcc/log.make.Linux.x86_64.6 > 4_gcc:openmpi-2.1.2rc2/ompi/mpiext/cuda/c/mpiext_cuda_c.h:16:0: warning: > "MPIX_CUDA_AWARE_SUPPORT" redefined > > > openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64 > _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-custom.c", > line 88: warning: initializer will be sign-extended: -1 > openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64 > _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-linux.c", > line 2640: warning: initializer will be sign-extended: -1 > openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64 > _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-synthetic.c", > line 851: warning: initializer will be sign-extended: -1 > openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64 > _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-x86.c", > line 113: warning: initializer will be sign-extended: -1 > openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64 > _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-xml.c", > line 1667: warning: initializer will be sign-extended: -1 > openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64 > _cc:"openmpi-2.1.2rc2/ompi/mca/io/romio314/romio/adio/common/ad_fstype.c", > line 428: warning: statement not 
reached
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/ompi/mca/io/romio314/romio/adio/common/ad_threaded_io.c", line 31: warning: statement not reached
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/ompi/mca/io/romio314/romio/adio/common/utils.c", line 97: warning: argument #3 is incompatible with prototype:
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 161: warning: parameter in inline asm statement unused: %3
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 207: warning: parameter in inline asm statement unused: %2
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 228: warning: parameter in inline asm statement unused: %2
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 249: warning: parameter in inline asm statement unused: %2
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 270: warning: parameter in inline asm statement unused: %2
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/mca/pmix/pmix112/pmix/src/client/pmi1.c", line 708: warning: null dimension: argvp
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c", line 266: warning: initializer will be sign-extended: -1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c", line 267: warning: initializer will be sign-extended: -1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/ompi/mpiext/cuda/c/mpiext_cuda_c.h", line 16: warning: macro redefined: MPIX_CUDA_AWARE_SUPPORT
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/timer.h", line 49: warning: initializer does not fit or is out of range: 0x8007
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-2.1.2rc2/opal/mca/pmix/pmix112/pmix1_client.c", line 408: warning: enum type mismatch: arg #1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linu
Re: [OMPI users] openmpi-master-201708190239-9d3f451: warnings from "make" and "make check"
Hi Siegmar, I opened issue 4151 to track this. It is relevant to a project to get Open MPI to build with -Werror. Thanks very much, Howard 2017-08-21 7:27 GMT-06:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
> Hi,
>
> I've installed openmpi-master-201708190239-9d3f451 on my "SUSE Linux Enterprise Server 12.2 (x86_64)" with Sun C 5.15 (Oracle Developer Studio 12.6) and gcc-7.1.0. Perhaps somebody wants to eliminate the following warnings.
>
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.make.Linux.x86_64.64_gcc:../../../../../../../../../openmpi-master-201708190239-9d3f451/opal/mca/pmix/pmix2x/pmix/src/mca/bfrops/base/bfrop_base_copy.c:414:22: warning: statement will never be executed [-Wswitch-unreachable]
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.make.Linux.x86_64.64_gcc:../../../../../openmpi-master-201708190239-9d3f451/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:136:34: warning: passing argument 1 of '__xpg_basename' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.make.Linux.x86_64.64_gcc:../../../../../openmpi-master-201708190239-9d3f451/ompi/mpiext/cuda/c/mpiext_cuda_c.h:16:0: warning: "MPIX_CUDA_AWARE_SUPPORT" redefined
>
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.make-check.Linux.x86_64.64_gcc:../../../openmpi-master-201708190239-9d3f451/test/class/opal_fifo.c:109:26: warning: assignment discards 'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.make-check.Linux.x86_64.64_gcc:../../../openmpi-master-201708190239-9d3f451/test/class/opal_lifo.c:72:26: warning: assignment discards 'volatile' qualifier from pointer target type [-Wdiscarded-qualifiers]
>
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/pmix/pmix2x/pmix/src/mca/base/pmix_mca_base_component_repository.c", line 266: warning: statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/pmix/pmix2x/pmix/src/mca/bfrops/base/bfrop_base_copy.c", line 414: warning: statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology-linux.c", line 2797: warning: initializer will be sign-extended: -1
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology-synthetic.c", line 946: warning: initializer will be sign-extended: -1
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology-x86.c", line 238: warning: initializer will be sign-extended: -1
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology-xml.c", line 2404: warning: initializer will be sign-extended: -1
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/opal/mca/pmix/pmix2x/pmix/src/client/pmi1.c", line 711: warning: null dimension: argvp
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/ompi/mca/io/romio314/romio/adio/common/ad_fstype.c", line 428: warning: statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/ompi/mca/io/romio314/romio/adio/common/ad_threaded_io.c", line 31: warning: statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/ompi/mca/coll/monitoring/coll_monitoring_component.c", line 160: warning: improper pointer/integer combination: op "="
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c", line 136: warning: argument #1 is incompatible with prototype:
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.make.Linux.x86_64.64_cc:"openmp
Re: [OMPI users] pmix, lxc, hpcx
Hi John, In the 2.1.x release stream a shared memory capability was introduced into the PMIx component. I know nothing about LXC containers, but it looks to me like there's some issue when PMIx tries to create these shared memory segments. I'd check to see if there's something about your container configuration that is preventing the creation of shared memory segments. Howard 2017-05-26 15:18 GMT-06:00 John Marshall <john.marsh...@ssc-spc.gc.ca>: > Hi, > > I have built openmpi 2.1.1 with hpcx-1.8 and tried to run some mpi code > under > ubuntu 14.04 and LXC (1.x) but I get the following: > > [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file > src/dstore/pmix_esh.c at line 1651 > [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file > src/dstore/pmix_esh.c at line 1751 > [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file > src/dstore/pmix_esh.c at line 1114 > [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file > src/common/pmix_jobdata.c at line 93 > [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file > src/common/pmix_jobdata.c at line 333 > [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file > src/server/pmix_server.c at line 606 > > I do not get the same outside of the LXC container and my code runs fine. > > I've looked for more info on these messages but could not find anything > helpful. Are these messages indicative of something missing in, or some > incompatibility with, the container? > > When I build using 2.0.2, I do not have a problem running inside or > outside of > the container. > > Thanks, > John > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes
Forgot you probably need an equal sign after btl arg Howard Pritchard <hpprit...@gmail.com> schrieb am Mi. 22. März 2017 um 18:11: > Hi Goetz > > Thanks for trying these other versions. Looks like a bug. Could you post > the config.log output from your build of the 2.1.0 to the list? > > Also could you try running the job using this extra command line arg to > see if the problem goes away? > > mpirun --mca btl ^vader (rest of your args) > > Howard > > Götz Waschk <goetz.was...@gmail.com> schrieb am Mi. 22. März 2017 um > 13:09: > > On Wed, Mar 22, 2017 at 7:46 PM, Howard Pritchard <hpprit...@gmail.com> > wrote: > > Hi Goetz, > > > > Would you mind testing against the 2.1.0 release or the latest from the > > 1.10.x series (1.10.6)? > > Hi Howard, > > after sending my mail I have tested both 1.10.6 and 2.1.0 and I have > received the same error. I have also tested outside of slurm using > ssh, same problem. > > Here's the message from 2.1.0: > [pax11-10:21920] *** Process received signal *** > [pax11-10:21920] Signal: Bus error (7) > [pax11-10:21920] Signal code: Non-existant physical address (2) > [pax11-10:21920] Failing at address: 0x2b5d5b752290 > [pax11-10:21920] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b5d446e9370] > [pax11-10:21920] [ 1] > > /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b5d531645e0] > [pax11-10:21920] [ 2] > > /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b5d44f607c1] > [pax11-10:21920] [ 3] > > /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x2b51)[0x2b5d53162b51] > [pax11-10:21920] [ 4] > > /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x3f)[0x2b5d5bb0a17f] > [pax11-10:21920] [ 5] > > /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xa7a)[0x2b5d5bafe0aa] > [pax11-10:21920] [ 6] > > 
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x399)[0x2b5d44480429] > [pax11-10:21920] [ 7] > > /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b5d86ab] > [pax11-10:21920] [ 8] IMB-MPI1[0x40b2ff] > [pax11-10:21920] [ 9] IMB-MPI1[0x402646] > [pax11-10:21920] [10] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b5d44917b35] > [pax11-10:21920] [11] IMB-MPI1[0x401f79] > [pax11-10:21920] *** End of error message *** > -- > mpirun noticed that process rank 320 with PID 21920 on node pax11-10 > exited on signal 7 (Bus error). > -- > > > Regards, Götz Waschk > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes
Hi Goetz Thanks for trying these other versions. Looks like a bug. Could you post the config.log output from your build of the 2.1.0 to the list? Also could you try running the job using this extra command line arg to see if the problem goes away? mpirun --mca btl ^vader (rest of your args) Howard Götz Waschk <goetz.was...@gmail.com> schrieb am Mi. 22. März 2017 um 13:09: On Wed, Mar 22, 2017 at 7:46 PM, Howard Pritchard <hpprit...@gmail.com> wrote: > Hi Goetz, > > Would you mind testing against the 2.1.0 release or the latest from the > 1.10.x series (1.10.6)? Hi Howard, after sending my mail I have tested both 1.10.6 and 2.1.0 and I have received the same error. I have also tested outside of slurm using ssh, same problem. Here's the message from 2.1.0: [pax11-10:21920] *** Process received signal *** [pax11-10:21920] Signal: Bus error (7) [pax11-10:21920] Signal code: Non-existant physical address (2) [pax11-10:21920] Failing at address: 0x2b5d5b752290 [pax11-10:21920] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b5d446e9370] [pax11-10:21920] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b5d531645e0] [pax11-10:21920] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b5d44f607c1] [pax11-10:21920] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x2b51)[0x2b5d53162b51] [pax11-10:21920] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x3f)[0x2b5d5bb0a17f] [pax11-10:21920] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xa7a)[0x2b5d5bafe0aa] [pax11-10:21920] [ 6] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x399)[0x2b5d44480429] [pax11-10:21920] [ 7] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b5d86ab] [pax11-10:21920] [ 8] IMB-MPI1[0x40b2ff] [pax11-10:21920] [ 9] 
IMB-MPI1[0x402646] [pax11-10:21920] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b5d44917b35] [pax11-10:21920] [11] IMB-MPI1[0x401f79] [pax11-10:21920] *** End of error message *** -- mpirun noticed that process rank 320 with PID 21920 on node pax11-10 exited on signal 7 (Bus error). -- Regards, Götz Waschk ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
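Putting Howard's two messages together, the workaround can be spelled either way below. This is a sketch; the process count and the IMB binary are taken from the report. The caret means "exclude this component", and quoting keeps the shell from touching it:

```shell
# Command-line form (two arguments: parameter name, then value):
#   mpirun --mca btl '^vader' -np 1024 ./IMB-MPI1
# Environment-variable form -- this is where the equal sign is required:
export OMPI_MCA_btl='^vader'
```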
Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes
Hi Goetz, Would you mind testing against the 2.1.0 release or the latest from the 1.10.x series (1.10.6)? Thanks, Howard 2017-03-22 6:25 GMT-06:00 Götz Waschk <goetz.was...@gmail.com>: > Hi everyone, > > I'm testing a new machine with 32 nodes of 32 cores each using the IMB > benchmark. It is working fine with 512 processes, but it crashes with > 1024 processes after a running for a minute: > > [pax11-17:16978] *** Process received signal *** > [pax11-17:16978] Signal: Bus error (7) > [pax11-17:16978] Signal code: Non-existant physical address (2) > [pax11-17:16978] Failing at address: 0x2b147b785450 > [pax11-17:16978] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b1473b13370] > [pax11-17:16978] [ 1] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_ > vader.so(mca_btl_vader_frag_init+0x8e)[0x2b14794a413e] > [pax11-17:16978] [ 2] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(ompi_ > free_list_grow+0x199)[0x2b147384f309] > [pax11-17:16978] [ 3] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_ > vader.so(+0x270d)[0x2b14794a270d] > [pax11-17:16978] [ 4] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ > ob1.so(mca_pml_ob1_send_request_start_prepare+0x43)[0x2b1479ae3a13] > [pax11-17:16978] [ 5] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ > ob1.so(mca_pml_ob1_send+0x89a)[0x2b1479ad90ca] > [pax11-17:16978] [ 6] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_coll_ > tuned.so(ompi_coll_tuned_allreduce_intra_ring+0x3f1)[0x2b147ad6ec41] > [pax11-17:16978] [ 7] > /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(MPI_ > Allreduce+0x17b)[0x2b147387d6bb] > [pax11-17:16978] [ 8] IMB-MPI1[0x40b316] > [pax11-17:16978] [ 9] IMB-MPI1[0x407284] > [pax11-17:16978] [10] IMB-MPI1[0x40250e] > [pax11-17:16978] [11] > /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1473d41b35] > [pax11-17:16978] [12] IMB-MPI1[0x401f79] > [pax11-17:16978] *** End of error message *** > -- > mpirun noticed that process rank 552 with PID 0 on 
node pax11-17 > exited on signal 7 (Bus error). > -- > > The program is started from the slurm batch system using mpirun. The > same application is working fine when using mvapich2 instead. > > Regards, Götz Waschk > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Shared Windows and MPI_Accumulate
Hello Joseph, I'm still unable to reproduce this issue on my SLES12 x86_64 node. Are you building with CFLAGS=-O3? If so, could you build without CFLAGS set and see if you still see the failure? Howard 2017-03-02 2:34 GMT-07:00 Joseph Schuchart <schuch...@hlrs.de>: > Hi Howard, > > Thanks for trying to reproduce this. It seems that on master the issue > occurs less frequently but is still there. I used the following bash > one-liner on my laptop and on our Linux Cluster (single node, 4 processes): > > ``` > $ for i in $(seq 1 100) ; do echo $i && mpirun -n 4 > ./mpi_shared_accumulate | grep \! && break ; done > 1 > 2 > [0] baseptr[0]: 1004 (expected 1010) [!!!] > [0] baseptr[1]: 1005 (expected 1011) [!!!] > [0] baseptr[2]: 1006 (expected 1012) [!!!] > [0] baseptr[3]: 1007 (expected 1013) [!!!] > [0] baseptr[4]: 1008 (expected 1014) [!!!] > ``` > > Sometimes the error occurs after one or two iterations (like above), > sometimes only at iteration 20 or later. However, I can reproduce it within > the 100 runs every time I run the statement above. I am attaching the > config.log and output of ompi_info of master on my laptop. Please let me > know if I can help with anything else. > > Thanks, > Joseph > > On 03/01/2017 11:24 PM, Howard Pritchard wrote: > > Hi Joseph, > > I built this test with craypich (Cray MPI) and it passed. I also tried > with Open MPI master and the test passed. I also tried with 2.0.2 > and can't seem to reproduce on my system. > > Could you post the output of config.log? > > Also, how intermittent is the problem? > > > Thanks, > > Howard > > > > > 2017-03-01 8:03 GMT-07:00 Joseph Schuchart <schuch...@hlrs.de>: > >> Hi all, >> >> We are seeing issues in one of our applications, in which processes in a >> shared communicator allocate a shared MPI window and execute MPI_Accumulate >> simultaneously on it to iteratively update each process' values. The test >> boils down to the sample code attached. 
Sample output is as follows: >> >> ``` >> $ mpirun -n 4 ./mpi_shared_accumulate >> [1] baseptr[0]: 1010 (expected 1010) >> [1] baseptr[1]: 1011 (expected 1011) >> [1] baseptr[2]: 1012 (expected 1012) >> [1] baseptr[3]: 1013 (expected 1013) >> [1] baseptr[4]: 1014 (expected 1014) >> [2] baseptr[0]: 1005 (expected 1010) [!!!] >> [2] baseptr[1]: 1006 (expected 1011) [!!!] >> [2] baseptr[2]: 1007 (expected 1012) [!!!] >> [2] baseptr[3]: 1008 (expected 1013) [!!!] >> [2] baseptr[4]: 1009 (expected 1014) [!!!] >> [3] baseptr[0]: 1010 (expected 1010) >> [0] baseptr[0]: 1010 (expected 1010) >> [0] baseptr[1]: 1011 (expected 1011) >> [0] baseptr[2]: 1012 (expected 1012) >> [0] baseptr[3]: 1013 (expected 1013) >> [0] baseptr[4]: 1014 (expected 1014) >> [3] baseptr[1]: 1011 (expected 1011) >> [3] baseptr[2]: 1012 (expected 1012) >> [3] baseptr[3]: 1013 (expected 1013) >> [3] baseptr[4]: 1014 (expected 1014) >> ``` >> >> Each process should hold the same values but sometimes (not on all >> executions) random processes diverge (marked through [!!!]). >> >> I made the following observations: >> >> 1) The issue occurs with both OpenMPI 1.10.6 and 2.0.2 but not with MPICH >> 3.2. >> 2) The issue occurs only if the window is allocated through >> MPI_Win_allocate_shared, using MPI_Win_allocate works fine. >> 3) The code assumes that MPI_Accumulate atomically updates individual >> elements (please correct me if that is not covered by the MPI standard). >> >> Both OpenMPI and the example code were compiled using GCC 5.4.1 and run >> on a Linux system (single node). OpenMPI was configure with >> --enable-mpi-thread-multiple and --with-threads but the application is not >> multi-threaded. Please let me know if you need any other information. >> >> Cheers >> Joseph >> >> -- >> Dipl.-Inf. Joseph Schuchart >> High Performance Computing Center Stuttgart (HLRS) >> Nobelstr. 
19 >> D-70569 Stuttgart >> >> Tel.: +49(0)711-68565890 >> Fax: +49(0)711-6856832 >> E-Mail: schuch...@hlrs.de >> >> >> ___ >> users mailing list >> users@lists.open-mpi.org >> https://rfd.newmexicoconsortium.org/mailman/listinfo/users >> > > > > ___ > users mailing > listus...@lists.open-mpi.orghttps://rfd.newmexicoconsortium.org/mailman/listinfo/users > > > -- > Dipl.-Inf. Joseph Schuchart > High Performance Computing Center Stuttgart (HLRS) > Nobelstr. 19 > D-70569 Stuttgart > > Tel.: +49(0)711-68565890 <+49%20711%2068565890> > Fax: +49(0)711-6856832 <+49%20711%206856832> > E-Mail: schuch...@hlrs.de > > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] sharedfp/lockedfile collision between multiple program instances
Hi Edgar Please open an issue too so we can track the fix. Howard Edgar Gabriel <egabr...@central.uh.edu> schrieb am Fr. 3. März 2017 um 07:45: > Nicolas, > > thank you for the bug report, I can confirm the behavior. I will work on > a patch and will try to get that into the next release, should hopefully > not be too complicated. > > Thanks > > Edgar > > > On 3/3/2017 7:36 AM, Nicolas Joly wrote: > > Hi, > > > > We just got hit by a problem with sharedfp/lockedfile component under > > v2.0.1 (should be identical with v2.0.2). We had 2 instances of an MPI > > program running conccurrently on the same input file and using > > MPI_File_read_shared() function ... > > > > If the shared file pointer is maintained with the lockedfile > > component, a "XXX.lockedfile" is created near to the data > > file. Unfortunately, this fixed name will collide with multiple tools > > instances ;) > > > > Running 2 instances of the following command line (source code > > attached) on the same machine will show the problematic behaviour. > > > > mpirun -n 1 --mca sharedfp lockedfile ./shrread -v input.dat > > > > Confirmed with lsof(8) output : > > > > njoly@tars [~]> lsof input.dat.lockedfile > > COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME > > shrread 5876 njoly 21w REG 0,308 13510798885996031 > input.dat.lockedfile > > shrread 5884 njoly 21w REG 0,308 13510798885996031 > input.dat.lockedfile > > > > Thanks in advance. > > > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
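The collision Nicolas describes is easy to demonstrate outside MPI. The sketch below uses the util-linux flock(1) utility rather than Open MPI's actual sharedfp/lockedfile code, and the file names are illustrative: a lock file whose name is derived only from the data file is contested by a second tool instance, while a name that also encodes the instance is not.

```shell
workdir=$(mktemp -d)

# Fixed lock-file name derived only from the data file -- this is what
# makes two independent tool instances collide:
lock="$workdir/input.dat.lockedfile"

flock -x "$lock" sleep 2 &      # "instance 1" acquires and holds the lock
holder=$!
sleep 0.5                       # give instance 1 time to acquire it

# "Instance 2": a non-blocking attempt on the same fixed name fails.
if flock -xn "$lock" true; then second=acquired; else second=collision; fi

# A name that also encodes the instance (here: the PID) cannot collide.
unique="$workdir/input.dat.$$.lockedfile"
if flock -xn "$unique" true; then third=acquired; else third=collision; fi

echo "second=$second third=$third"   # second=collision third=acquired
wait "$holder"
rm -rf "$workdir"
```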
Re: [OMPI users] Shared Windows and MPI_Accumulate
Hi Joseph, I built this test with craypich (Cray MPI) and it passed. I also tried with Open MPI master and the test passed. I also tried with 2.0.2 and can't seem to reproduce on my system. Could you post the output of config.log? Also, how intermittent is the problem? Thanks, Howard 2017-03-01 8:03 GMT-07:00 Joseph Schuchart <schuch...@hlrs.de>: > Hi all, > > We are seeing issues in one of our applications, in which processes in a > shared communicator allocate a shared MPI window and execute MPI_Accumulate > simultaneously on it to iteratively update each process' values. The test > boils down to the sample code attached. Sample output is as follows: > > ``` > $ mpirun -n 4 ./mpi_shared_accumulate > [1] baseptr[0]: 1010 (expected 1010) > [1] baseptr[1]: 1011 (expected 1011) > [1] baseptr[2]: 1012 (expected 1012) > [1] baseptr[3]: 1013 (expected 1013) > [1] baseptr[4]: 1014 (expected 1014) > [2] baseptr[0]: 1005 (expected 1010) [!!!] > [2] baseptr[1]: 1006 (expected 1011) [!!!] > [2] baseptr[2]: 1007 (expected 1012) [!!!] > [2] baseptr[3]: 1008 (expected 1013) [!!!] > [2] baseptr[4]: 1009 (expected 1014) [!!!] > [3] baseptr[0]: 1010 (expected 1010) > [0] baseptr[0]: 1010 (expected 1010) > [0] baseptr[1]: 1011 (expected 1011) > [0] baseptr[2]: 1012 (expected 1012) > [0] baseptr[3]: 1013 (expected 1013) > [0] baseptr[4]: 1014 (expected 1014) > [3] baseptr[1]: 1011 (expected 1011) > [3] baseptr[2]: 1012 (expected 1012) > [3] baseptr[3]: 1013 (expected 1013) > [3] baseptr[4]: 1014 (expected 1014) > ``` > > Each process should hold the same values but sometimes (not on all > executions) random processes diverge (marked through [!!!]). > > I made the following observations: > > 1) The issue occurs with both OpenMPI 1.10.6 and 2.0.2 but not with MPICH > 3.2. > 2) The issue occurs only if the window is allocated through > MPI_Win_allocate_shared, using MPI_Win_allocate works fine. 
> 3) The code assumes that MPI_Accumulate atomically updates individual > elements (please correct me if that is not covered by the MPI standard). > > Both OpenMPI and the example code were compiled using GCC 5.4.1 and run on > a Linux system (single node). OpenMPI was configure with > --enable-mpi-thread-multiple and --with-threads but the application is not > multi-threaded. Please let me know if you need any other information. > > Cheers > Joseph > > -- > Dipl.-Inf. Joseph Schuchart > High Performance Computing Center Stuttgart (HLRS) > Nobelstr. 19 > D-70569 Stuttgart > > Tel.: +49(0)711-68565890 > Fax: +49(0)711-6856832 > E-Mail: schuch...@hlrs.de > > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Issues with different IB adapters and openmpi 2.0.2
Hi Orion, Does the problem occur if you only use font2 and font3? Do you have MXM installed on the font1 node? The 2.x series uses PMIx, and it could be that this is impacting the PML sanity check. Howard Orion Poplawski <or...@cora.nwra.com> schrieb am Mo. 27. Feb. 2017 um 14:50: > We have a couple nodes with different IB adapters in them: > > font1/var/log/lspci:03:00.0 InfiniBand [0c06]: Mellanox Technologies > MT25204 > [InfiniHost III Lx HCA] [15b3:6274] (rev 20) > font2/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220 > InfiniBand > HCA [1077:7220] (rev 02) > font3/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220 > InfiniBand > HCA [1077:7220] (rev 02) > > With 1.10.3 we saw the following errors with mpirun: > > [font2.cora.nwra.com:13982] [[23220,1],10] selected pml cm, but peer > [[23220,1],0] on font1 selected pml ob1 > > which crashed MPI_Init. > > We worked around this by passing "--mca pml ob1". I notice now with > openmpi > 2.0.2 without that option I no longer see errors, but the mpi program will > hang shortly after startup. Re-adding the option makes it work, so I'm > assuming the underlying problem is still the same, but openmpi appears to > have > stopped alerting me to the issue. > > Thoughts? > > -- > Orion Poplawski > Technical Manager 720-772-5637 > NWRA, Boulder/CoRA Office FAX: 303-415-9702 > 3380 Mitchell Lane or...@nwra.com > Boulder, CO 80301 http://www.nwra.com > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] MPI_THREAD_MULTIPLE: Fatal error on MPI_Win_create
Hi Joseph, What OS are you using when running the test? Could you try running with export OMPI_MCA_osc=^pt2pt and export OMPI_MCA_osc_base_verbose=10 This error message was put into this OMPI release because this part of the code has known problems when used multi-threaded. Joseph Schuchart schrieb am Sa. 18. Feb. 2017 um 04:02: > All, > > I am seeing a fatal error with OpenMPI 2.0.2 if requesting support for > MPI_THREAD_MULTIPLE and afterwards creating a window using > MPI_Win_create. I am attaching a small reproducer. The output I get is > the following: > > ``` > MPI_THREAD_MULTIPLE supported: yes > MPI_THREAD_MULTIPLE supported: yes > MPI_THREAD_MULTIPLE supported: yes > MPI_THREAD_MULTIPLE supported: yes > -- > The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this > release. > Workarounds are to run on a single node, or to use a system with an RDMA > capable network such as Infiniband. > -- > [beryl:10705] *** An error occurred in MPI_Win_create > [beryl:10705] *** reported by process [2149974017,2] > [beryl:10705] *** on communicator MPI_COMM_WORLD > [beryl:10705] *** MPI_ERR_WIN: invalid window > [beryl:10705] *** MPI_ERRORS_ARE_FATAL (processes in this communicator > will now abort, > [beryl:10705] ***and potentially your MPI job) > [beryl:10698] 3 more processes have sent help message help-osc-pt2pt.txt > / mpi-thread-multiple-not-supported > [beryl:10698] Set MCA parameter "orte_base_help_aggregate" to 0 to see > all help / error messages > [beryl:10698] 3 more processes have sent help message > help-mpi-errors.txt / mpi_errors_are_fatal > ``` > > I am running on a single node (my laptop). Both OpenMPI and the > application were compiled using GCC 5.3.0. Naturally, there is no > support for Infiniband available. Should I signal OpenMPI that I am > indeed running on a single node? If so, how can I do that? Can't this be > detected by OpenMPI automatically? The test succeeds if I only request > MPI_THREAD_SINGLE. 
> > OpenMPI 2.0.2 has been configured using only > --enable-mpi-thread-multiple and --prefix configure parameters. I am > attaching the output of ompi_info. > > Please let me know if you need any additional information. > > Cheers, > Joseph > > -- > Dipl.-Inf. Joseph Schuchart > High Performance Computing Center Stuttgart (HLRS) > Nobelstr. 19 > D-70569 Stuttgart > > Tel.: +49(0)711-68565890 > Fax: +49(0)711-6856832 > E-Mail: schuch...@hlrs.de > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Problem with MPI_Comm_spawn using openmpi 2.0.x + sbatch
Hi Anastasia, Definitely check which mpirun is being used in the batch environment, but you may also want to upgrade to Open MPI 2.0.2. Howard r...@open-mpi.org <r...@open-mpi.org> schrieb am Mi. 15. Feb. 2017 um 07:49: > Nothing immediate comes to mind - all sbatch does is create an allocation > and then run your script in it. Perhaps your script is using a different > “mpirun” command than when you type it interactively? > > On Feb 14, 2017, at 5:11 AM, Anastasia Kruchinina < > nastja.kruchin...@gmail.com> wrote: > > Hi, > > I am trying to use MPI_Comm_spawn function in my code. I am having trouble > with openmpi 2.0.x + sbatch (batch system Slurm). > My test program is located here: > http://user.it.uu.se/~anakr367/files/MPI_test/ > > When I am running my code I am getting an error: > > OPAL ERROR: Timeout in file > ../../../../openmpi-2.0.1/opal/mca/pmix/base/pmix_base_fns.c at line 193 > *** An error occurred in MPI_Init_thread > *** on a NULL communicator > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, > ***and potentially your MPI job) > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during MPI_INIT; some of which are due to configuration or > environment > problems. This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): > >ompi_dpm_dyn_init() failed >--> Returned "Timeout" (-15) instead of "Success" (0) > -- > > The interesting thing is that there is no error when I am firstly > allocating nodes with salloc and then run my program. So, I noticed that > the program works fine using openmpi 1.x+sbach/salloc or openmpi > 2.0.x+salloc but not openmpi 2.0.x+sbatch. > > The error was reproduced on three different computer clusters. 
> > Best regards, > Anastasia > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
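To act on the suggestion above, it helps to have the batch script report which mpirun it resolves before launching anything. A sketch of such a script follows; the SBATCH directives and the program name are placeholders, not taken from Anastasia's actual job:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# Verify the batch environment picks up the intended Open MPI 2.0.x
# install; a PATH set only in the interactive shell can make sbatch
# resolve a different mpirun than the one used when typing commands.
type mpirun
mpirun --version

mpirun -np 1 ./spawn_test    # hypothetical program calling MPI_Comm_spawn
```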
Re: [OMPI users] OpenMPI not running any job on Mac OS X 10.12
Hi Michel, Could you try running the app with export TMPDIR=/tmp set in the shell you are using? Howard 2017-02-02 13:46 GMT-07:00 Michel Lesoinne <mlesoi...@cmsoftinc.com>: Howard, First, thanks to you and Jeff for looking into this with me. I tried ../configure --disable-shared --enable-static --prefix ~/.local The result is the same as without --disable-shared. i.e. I get the following error: [Michels-MacBook-Pro.local:92780] [[46617,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../orte/orted/pmix/pmix_server.c at line 262 [Michels-MacBook-Pro.local:92780] [[46617,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 666 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): pmix server init failed --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS -- On Thu, Feb 2, 2017 at 12:29 PM, Howard Pritchard <hpprit...@gmail.com> wrote: Hi Michel Try adding --enable-static to the configure. That fixed the problem for me. Howard Michel Lesoinne <mlesoi...@cmsoftinc.com> schrieb am Mi. 1. Feb. 2017 um 19:07: I have compiled OpenMPI 2.0.2 on a new Macbook running OS X 10.12 and have been trying to run simple program. I configured openmpi with ../configure --disable-shared --prefix ~/.local make all install Then I have a simple code only containing a call to MPI_Init. 
I compile it with mpirun -np 2 ./mpitest The output is: [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: unable to open mca_patcher_overwrite: File not found (ignored) [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: unable to open mca_shmem_mmap: File not found (ignored) [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: unable to open mca_shmem_posix: File not found (ignored) [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: unable to open mca_shmem_sysv: File not found (ignored) -- It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): opal_shmem_base_select failed --> Returned value -1 instead of OPAL_SUCCESS -- Without the --disable-shared in the configuration, then I get: [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../orte/orted/pmix/pmix_server.c at line 264 [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 666 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): pmix server init failed --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS -- Has anyone seen this? What am I missing? 
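Howard's TMPDIR workaround from the top of this thread, as a minimal shell sketch (the /tmp path and the relaunch command are assumptions based on the report above, not a verified fix):

```shell
# macOS often sets $TMPDIR to a long per-user path under /var/folders;
# Open MPI's session-directory setup can trip over such paths, so point
# it at a short, writable location before launching.
export TMPDIR=/tmp
# then relaunch the failing job as before, e.g.:
#   mpirun -np 2 ./mpitest
```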
Re: [OMPI users] Open MPI over RoCE using breakout cable and switch
Hello Brendan, Sorry for the delay in responding. I've been on travel the past two weeks. I traced through the debug output you sent. It provided enough information to show that for some reason, when using the breakout cable, Open MPI is unable to complete the initialization it needs to use the openib BTL. It correctly detects that the first port is not available, but for port 1, it still fails to initialize. To debug this further, I'd need to provide you with a custom Open MPI to try that would have more debug output in the suspect area. If you'd like to go this route, let me know and I'll build a one-off library to try to debug this problem. One thing to do just as a sanity check is to try TCP: mpirun --mca btl tcp,self,sm with the breakout cable. If that doesn't work, then I think there may be some network setup problem that needs to be resolved first before trying custom Open MPI tarballs. Thanks, Howard 2017-02-01 15:08 GMT-07:00 Brendan Myers <brendan.my...@soft-forge.com>: > Hello Howard, > > I was wondering if you have been able to look at this issue at all, or if > anyone has any ideas on what to try next. > > > > Thank you, > > Brendan > > > > *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Brendan > Myers > *Sent:* Tuesday, January 24, 2017 11:11 AM > > *To:* 'Open MPI Users' <users@lists.open-mpi.org> > *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and > switch > > > > Hello Howard, > > Here is the error output after building with debug enabled. These CX4 > Mellanox cards view each port as a separate device and I am using port 1 on > the card which is device mlx5_0. 
> > > > Thank you, > > Brendan > > > > *From:* users [mailto:users-boun...@lists.open-mpi.org > <users-boun...@lists.open-mpi.org>] *On Behalf Of *Howard Pritchard > *Sent:* Tuesday, January 24, 2017 8:21 AM > *To:* Open MPI Users <users@lists.open-mpi.org> > *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and > switch > > > > Hello Brendan, > > > > This helps some, but looks like we need more debug output. > > > > Could you build a debug version of Open MPI by adding --enable-debug > > to the config options and rerun the test with the breakout cable setup > > and keeping the --mca btl_base_verbose 100 command line option? > > > > Thanks > > > > Howard > > > > > > 2017-01-23 8:23 GMT-07:00 Brendan Myers <brendan.my...@soft-forge.com>: > > Hello Howard, > > Thank you for looking into this. Attached is the output you requested. > Also, I am using Open MPI 2.0.1. > > > > Thank you, > > Brendan > > > > *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Howard > Pritchard > *Sent:* Friday, January 20, 2017 6:35 PM > *To:* Open MPI Users <users@lists.open-mpi.org> > *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and > switch > > > > Hi Brendan > > > > I doubt this kind of config has gotten any testing with OMPI. Could you > rerun with > > > > --mca btl_base_verbose 100 > > > > added to the command line and post the output to the list? > > > > Howard > > > > > > Brendan Myers <brendan.my...@soft-forge.com> schrieb am Fr. 20. Jan. 
2017 > um 15:04: > > Hello, > > I am attempting to get Open MPI to run over 2 nodes using a switch and a > single breakout cable with this design: > > (100GbE)QSFP <-> 2x (50GbE)QSFP > > > > Hardware Layout: > > Breakout cable module A connects to switch (100GbE) > > Breakout cable module B1 connects to node 1 RoCE NIC (50GbE) > > Breakout cable module B2 connects to node 2 RoCE NIC (50GbE) > > Switch is Mellanox SN 2700 100GbE RoCE switch > > > > · I am able to pass RDMA traffic between the nodes with perftest > (ib_write_bw) when using the breakout cable as the IC from both nodes to > the switch. > > · When attempting to run a job using the breakout cable as the IC > Open MPI aborts with failure to initialize open fabrics device errors. > > · If I replace the breakout cable with 2 standard QSFP cables the > Open MPI job will complete correctly. > > > > > > This is the command I use, it works unless I attempt a run with the > breakout cable used as IC: > > *mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues > P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm -hostfile > mpi-hosts-ce /usr/local/bin/IMB-MPI1* > > > > If anyone has any idea as to why using a breakout cable is causing my jobs > to fail please let me
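The TCP sanity check Howard suggests, written out in full (the hostfile and benchmark paths are taken from Brendan's original command; this is a sketch of the suggested invocation, to be run on the cluster itself):

```shell
# Run the same benchmark over TCP only, bypassing the openib BTL,
# to separate Open MPI initialization problems from network setup problems.
mpirun --mca btl tcp,self,sm -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1
```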
Re: [OMPI users] OpenMPI not running any job on Mac OS X 10.12
Hi Michel Try adding --enable-static to the configure. That fixed the problem for me. Howard Michel Lesoinne <mlesoi...@cmsoftinc.com> schrieb am Mi. 1. Feb. 2017 um 19:07: > I have compiled OpenMPI 2.0.2 on a new Macbook running OS X 10.12 and have > been trying to run simple program. > I configured openmpi with > ../configure --disable-shared --prefix ~/.local > make all install > > Then I have a simple code only containing a call to MPI_Init. > I compile it with > mpirun -np 2 ./mpitest > > The output is: > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_patcher_overwrite: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_mmap: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_posix: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_sysv: File not found (ignored) > > -- > > It looks like opal_init failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during opal_init; some of which are due to configuration or > > environment problems. 
This failure appears to be an internal failure; > > here's some additional information (which may only be relevant to an > > Open MPI developer): > > > opal_shmem_base_select failed > > --> Returned value -1 instead of OPAL_SUCCESS > > -- > > Without the --disable-shared in the configuration, then I get: > > > [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad > parameter in file ../../orte/orted/pmix/pmix_server.c at line 264 > > [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad > parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line > 666 > > -- > > It looks like orte_init failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during orte_init; some of which are due to configuration or > > environment problems. This failure appears to be an internal failure; > > here's some additional information (which may only be relevant to an > > Open MPI developer): > > > pmix server init failed > > --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS > > -- > > > > > Has anyone seen this? What am I missing? > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] OpenMPI not running any job on Mac OS X 10.12
Hi Michel, I reproduced this problem on my Mac too: pn1249323:~/ompi/examples (v2.0.x *)$ mpirun -np 2 ./ring_c [pn1249323.lanl.gov:94283] mca_base_component_repository_open: unable to open mca_patcher_overwrite: File not found (ignored) [pn1249323.lanl.gov:94283] mca_base_component_repository_open: unable to open mca_shmem_mmap: File not found (ignored) [pn1249323.lanl.gov:94283] mca_base_component_repository_open: unable to open mca_shmem_posix: File not found (ignored) [pn1249323.lanl.gov:94283] mca_base_component_repository_open: unable to open mca_shmem_sysv: File not found (ignored) -- It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): opal_shmem_base_select failed --> Returned value -1 instead of OPAL_SUCCESS Is there a reason why you are using the --disable-shared option? Can you use --disable-dlopen instead? I'll do some more investigating and open an issue. Howard 2017-02-01 19:05 GMT-07:00 Michel Lesoinne <mlesoi...@cmsoftinc.com>: > I have compiled OpenMPI 2.0.2 on a new Macbook running OS X 10.12 and have > been trying to run simple program. > I configured openmpi with > ../configure --disable-shared --prefix ~/.local > make all install > > Then I have a simple code only containing a call to MPI_Init. 
> I compile it with > mpirun -np 2 ./mpitest > > The output is: > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_patcher_overwrite: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_mmap: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_posix: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_sysv: File not found (ignored) > > -- > > It looks like opal_init failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during opal_init; some of which are due to configuration or > > environment problems. This failure appears to be an internal failure; > > here's some additional information (which may only be relevant to an > > Open MPI developer): > > > opal_shmem_base_select failed > > --> Returned value -1 instead of OPAL_SUCCESS > > -- > > Without the --disable-shared in the configuration, then I get: > > > [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad > parameter in file ../../orte/orted/pmix/pmix_server.c at line 264 > > [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad > parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at > line 666 > > -- > > It looks like orte_init failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during orte_init; some of which are due to configuration or > > environment problems. This failure appears to be an internal failure; > > here's some additional information (which may only be relevant to an > > Open MPI developer): > > > pmix server init failed > > --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS > > -- > > > > > Has anyone seen this? 
What am I missing? > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
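Howard's --disable-dlopen suggestion, sketched as a configure recipe (the prefix mirrors the one in Michel's report; treat this as an untested outline, not a verified build):

```shell
# Build Open MPI without dlopen support: MCA components are linked
# directly into the libraries instead of being opened at runtime,
# which sidesteps the mca_base_component_repository_open failures above.
../configure --disable-dlopen --prefix=$HOME/.local
make all install
```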
Re: [OMPI users] OpenMPI not running any job on Mac OS X 10.12
Hi Michel It's somewhat unusual to use the disable-shared configure option. That may be causing this. Could you try to build without using this option and see if you still see the problem? Thanks, Howard Michel Lesoinne <mlesoi...@cmsoftinc.com> schrieb am Mi. 1. Feb. 2017 um 21:07: > I have compiled OpenMPI 2.0.2 on a new Macbook running OS X 10.12 and have > been trying to run simple program. > I configured openmpi with > ../configure --disable-shared --prefix ~/.local > make all install > > Then I have a simple code only containing a call to MPI_Init. > I compile it with > mpirun -np 2 ./mpitest > > The output is: > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_patcher_overwrite: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_mmap: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_posix: File not found (ignored) > > [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open: > unable to open mca_shmem_sysv: File not found (ignored) > > -- > > It looks like opal_init failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during opal_init; some of which are due to configuration or > > environment problems. 
This failure appears to be an internal failure; > > here's some additional information (which may only be relevant to an > > Open MPI developer): > > > opal_shmem_base_select failed > > --> Returned value -1 instead of OPAL_SUCCESS > > -- > > Without the --disable-shared in the configuration, then I get: > > > [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad > parameter in file ../../orte/orted/pmix/pmix_server.c at line 264 > > [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad > parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line > 666 > > -- > > It looks like orte_init failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during orte_init; some of which are due to configuration or > > environment problems. This failure appears to be an internal failure; > > here's some additional information (which may only be relevant to an > > Open MPI developer): > > > pmix server init failed > > --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS > > -- > > > > > Has anyone seen this? What am I missing? > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Error using hpcc benchmark
Hi Wodel, The RandomAccess part of HPCC is probably causing this. Perhaps set the PSM env. variable PSM_MQ_RECVREQS_MAX to a larger value. Alternatively, launch the job using mpirun --mca pml ob1 to avoid use of PSM. Performance will probably suffer with this option, however. Howard wodel youchi <wodel.you...@gmail.com> schrieb am Di. 31. Jan. 2017 um 08:27: > Hi, > > I am a newbie in the HPC world > > I am trying to execute the hpcc benchmark on our cluster, but every time I > start the job, I get this error, then the job exits > > > > > > > > > > > > > > > *compute017.22840Exhausted 1048576 MQ irecv request descriptors, which > usually indicates a user program error or insufficient request descriptors > (PSM_MQ_RECVREQS_MAX=1048576)compute024.22840Exhausted 1048576 MQ irecv > request descriptors, which usually indicates a user program error or > insufficient request descriptors > (PSM_MQ_RECVREQS_MAX=1048576)compute019.22847Exhausted 1048576 MQ irecv > request descriptors, which usually indicates a user program error or > insufficient request descriptors > (PSM_MQ_RECVREQS_MAX=1048576)---Primary > job terminated normally, but 1 process returneda non-zero exit code.. Per > user-direction, the job has been > aborted.-mpirun > detected that one or more processes exited with non-zero status, thus > causingthe job to be terminated. The first process to do so was: Process > name: [[19601,1],272] Exit code: > 255--* > > Platform : IBM PHPC > OS : RHEL 6.5 > one management node > 32 compute node : 16 cores, 32GB RAM, intel qlogic QLE7340 one port QRD > infiniband 40Gb/s > > I compiled hpcc against : IBM MPI, Openmpi 2.0.1 (compiled with gcc 4.4.7) > and Openmpi 1.8.1 (compiled with gcc 4.4.7) > > I get the errors, but each time on different compute nodes. 
> > This is the command I used to start the job > > *mpirun -np 512 --mca mtl psm --hostfile hosts32 > /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt* > > Any help will be appreciated, and if you need more details, let me know. > Thanks in advance. > > > Regards. > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
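The two workarounds above, sketched as shell settings (the descriptor limit below is an illustrative guess above the 1048576 default shown in the error message, not a tuned value):

```shell
# Option 1: raise the PSM matched-queue receive-request limit before
# launching the job with the original mpirun command.
export PSM_MQ_RECVREQS_MAX=4194304

# Option 2 (likely slower): avoid PSM entirely by forcing the ob1 PML, e.g.:
#   mpirun -np 512 --mca pml ob1 --hostfile hosts32 ./hpcc hpccinf.txt
```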
Re: [OMPI users] Open MPI over RoCE using breakout cable and switch
Hello Brendan, This helps some, but looks like we need more debug output. Could you build a debug version of Open MPI by adding --enable-debug to the config options and rerun the test with the breakout cable setup and keeping the --mca btl_base_verbose 100 command line option? Thanks, Howard 2017-01-23 8:23 GMT-07:00 Brendan Myers <brendan.my...@soft-forge.com>: > Hello Howard, > > Thank you for looking into this. Attached is the output you requested. > Also, I am using Open MPI 2.0.1. > > > > Thank you, > > Brendan > > > > *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Howard > Pritchard > *Sent:* Friday, January 20, 2017 6:35 PM > *To:* Open MPI Users <users@lists.open-mpi.org> > *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and > switch > > > > Hi Brendan > > > > I doubt this kind of config has gotten any testing with OMPI. Could you > rerun with > > > > --mca btl_base_verbose 100 > > > > added to the command line and post the output to the list? > > > > Howard > > > > > > Brendan Myers <brendan.my...@soft-forge.com> schrieb am Fr. 20. Jan. 2017 > um 15:04: > > Hello, > > I am attempting to get Open MPI to run over 2 nodes using a switch and a > single breakout cable with this design: > > (100GbE)QSFP <-> 2x (50GbE)QSFP > > > > Hardware Layout: > > Breakout cable module A connects to switch (100GbE) > > Breakout cable module B1 connects to node 1 RoCE NIC (50GbE) > > Breakout cable module B2 connects to node 2 RoCE NIC (50GbE) > > Switch is Mellanox SN 2700 100GbE RoCE switch > > > > · I am able to pass RDMA traffic between the nodes with perftest > (ib_write_bw) when using the breakout cable as the IC from both nodes to > the switch. > > · When attempting to run a job using the breakout cable as the IC > Open MPI aborts with failure to initialize open fabrics device errors. > > · If I replace the breakout cable with 2 standard QSFP cables the > Open MPI job will complete correctly. 
> > > > > > This is the command I use, it works unless I attempt a run with the > breakout cable used as IC: > > *mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues > P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm -hostfile > mpi-hosts-ce /usr/local/bin/IMB-MPI1* > > > > If anyone has any idea as to why using a breakout cable is causing my jobs > to fail please let me know. > > > > Thank you, > > > > Brendan T. W. Myers > > brendan.my...@soft-forge.com > > Software Forge Inc > > > > ___ > > users mailing list > > users@lists.open-mpi.org > > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > > > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Open MPI over RoCE using breakout cable and switch
Hi Brendan I doubt this kind of config has gotten any testing with OMPI. Could you rerun with --mca btl_base_verbose 100 added to the command line and post the output to the list? Howard Brendan Myers <brendan.my...@soft-forge.com> schrieb am Fr. 20. Jan. 2017 um 15:04: > Hello, > > I am attempting to get Open MPI to run over 2 nodes using a switch and a > single breakout cable with this design: > > (100GbE)QSFP <-> 2x (50GbE)QSFP > > > > Hardware Layout: > > Breakout cable module A connects to switch (100GbE) > > Breakout cable module B1 connects to node 1 RoCE NIC (50GbE) > > Breakout cable module B2 connects to node 2 RoCE NIC (50GbE) > > Switch is Mellanox SN 2700 100GbE RoCE switch > > > > · I am able to pass RDMA traffic between the nodes with perftest > (ib_write_bw) when using the breakout cable as the IC from both nodes to > the switch. > > · When attempting to run a job using the breakout cable as the IC > Open MPI aborts with failure to initialize open fabrics device errors. > > · If I replace the breakout cable with 2 standard QSFP cables the > Open MPI job will complete correctly. > > > > > > This is the command I use, it works unless I attempt a run with the > breakout cable used as IC: > > *mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues > P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm -hostfile > mpi-hosts-ce /usr/local/bin/IMB-MPI1* > > > > If anyone has any idea as to why using a breakout cable is causing my jobs > to fail please let me know. > > > > Thank you, > > > > Brendan T. W. Myers > > brendan.my...@soft-forge.com > > Software Forge Inc > > > ___ > > users mailing list > > users@lists.open-mpi.org > > https://rfd.newmexicoconsortium.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
HI Siegmar, You have some config parameters I wasn't trying that may have some impact. I'll give a try with these parameters. This should be enough info for now, Thanks, Howard 2017-01-09 0:59 GMT-07:00 Siegmar Gross < siegmar.gr...@informatik.hs-fulda.de>: > Hi Howard, > > I use the following commands to build and install the package. > ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my > Linux machine. > > mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc > cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc > > ../openmpi-2.0.2rc3/configure \ > --prefix=/usr/local/openmpi-2.0.2_64_cc \ > --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \ > --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \ > --with-jdk-headers=/usr/local/jdk1.8.0_66/include \ > JAVA_HOME=/usr/local/jdk1.8.0_66 \ > LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \ > CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \ > CPP="cpp" CXXCPP="cpp" \ > --enable-mpi-cxx \ > --enable-mpi-cxx-bindings \ > --enable-cxx-exceptions \ > --enable-mpi-java \ > --enable-heterogeneous \ > --enable-mpi-thread-multiple \ > --with-hwloc=internal \ > --without-verbs \ > --with-wrapper-cflags="-m64 -mt" \ > --with-wrapper-cxxflags="-m64" \ > --with-wrapper-fcflags="-m64" \ > --with-wrapper-ldflags="-mt" \ > --enable-debug \ > |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc > > make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc > rm -r /usr/local/openmpi-2.0.2_64_cc.old > mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old > make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc > make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc > > > I get a different error if I run the program with gdb. > > loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec > GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1 > Copyright (C) 2016 Free Software Foundation, Inc. 
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.h > tml> > This is free software: you are free to change and redistribute it. > There is NO WARRANTY, to the extent permitted by law. Type "show copying" > and "show warranty" for details. > This GDB was configured as "x86_64-suse-linux". > Type "show configuration" for configuration details. > For bug reporting instructions, please see: > <http://bugs.opensuse.org/>. > Find the GDB manual and other documentation resources online at: > <http://www.gnu.org/software/gdb/documentation/>. > For help, type "help". > Type "apropos word" to search for commands related to "word"... > Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done. > (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master > Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host > loki --slot-list 0:0-5,1:0-5 spawn_master > Missing separate debuginfos, use: zypper install > glibc-debuginfo-2.24-2.3.x86_64 > [Thread debugging using libthread_db enabled] > Using host libthread_db library "/lib64/libthread_db.so.1". > [New Thread 0x73b97700 (LWP 13582)] > [New Thread 0x718a4700 (LWP 13583)] > [New Thread 0x710a3700 (LWP 13584)] > [New Thread 0x7fffebbba700 (LWP 13585)] > Detaching after fork from child process 13586. > > Parent process 0 running on loki > I create 4 slave processes > > Detaching after fork from child process 13589. > Detaching after fork from child process 13590. > Detaching after fork from child process 13591. 
> [loki:13586] OPAL ERROR: Timeout in file ../../../../openmpi-2.0.2rc3/o > pal/mca/pmix/base/pmix_base_fns.c at line 193 > [loki:13586] *** An error occurred in MPI_Comm_spawn > [loki:13586] *** reported by process [2873294849,0] > [loki:13586] *** on communicator MPI_COMM_WORLD > [loki:13586] *** MPI_ERR_UNKNOWN: unknown error > [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will > now abort, > [loki:13586] ***and potentially your MPI job) > [Thread 0x7fffebbba700 (LWP 13585) exited] > [Thread 0x710a3700 (LWP 13584) exited] > [Thread 0x718a4700 (LWP 13583) exited] > [Thread 0x73b97700 (LWP 13582) exited] > [Inferior 1 (process 13567) exited with code 016] > Missing separate debuginfos, use: zypper install > libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3 > .x86_64 > (gdb) bt > No stack. > (gdb) > > Do you need anything else? > > &g
Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
Hi Siegmar, Could you post the configure options you used when building 2.0.2rc3? Maybe that will help in trying to reproduce the segfault you are observing. Howard 2017-01-07 2:30 GMT-07:00 Siegmar Gross < siegmar.gr...@informatik.hs-fulda.de>: > Hi, > > I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise > Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately, > I still get the same error that I reported for rc2. > > I would be grateful, if somebody can fix the problem before > releasing the final version. Thank you very much for any help > in advance. > > > Kind regards > > Siegmar > ___ > users mailing list > users@lists.open-mpi.org > https://rfd.newmexicoconsortium.org/mailman/listinfo/users > ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] segmentation fault with openmpi-2.0.2rc2 on Linux
Hi Siegmar,

Could you please rerun the spawn_slave program with 4 processes? Your original traceback indicates a failure in the barrier in the slave program. I'm interested in seeing whether the barrier failure is also observed when you run the slave program standalone with 4 processes.

Thanks,

Howard

2017-01-03 0:32 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
> Hi Howard,
>
> thank you very much for trying to solve my problem. I haven't
> changed the programs since 2013, so you are using the correct
> version. The program works as expected with the master trunk, as
> you can see at the bottom of this email from my last mail. The
> slave program works when I launch it directly.
>
> loki spawn 122 mpicc --showme
> cc -I/usr/local/openmpi-2.0.2_64_cc/include -m64 -mt -mt -Wl,-rpath -Wl,/usr/local/openmpi-2.0.2_64_cc/lib64 -Wl,--enable-new-dtags -L/usr/local/openmpi-2.0.2_64_cc/lib64 -lmpi
> loki spawn 123 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>   Open MPI: 2.0.2rc2
>   C compiler absolute: /opt/solstudio12.5b/bin/cc
> loki spawn 124 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 --mca btl_base_verbose 10 spawn_slave
> [loki:05572] mca: base: components_register: registering framework btl components
> [loki:05572] mca: base: components_register: found loaded component self
> [loki:05572] mca: base: components_register: component self register function successful
> [loki:05572] mca: base: components_register: found loaded component sm
> [loki:05572] mca: base: components_register: component sm register function successful
> [loki:05572] mca: base: components_register: found loaded component tcp
> [loki:05572] mca: base: components_register: component tcp register function successful
> [loki:05572] mca: base: components_register: found loaded component vader
> [loki:05572] mca: base: components_register: component vader register function successful
> [loki:05572] mca: base: components_open: opening btl components
> [loki:05572] mca: base: components_open: found loaded component self
> [loki:05572] mca: base: components_open: component self open function successful
> [loki:05572] mca: base: components_open: found loaded component sm
> [loki:05572] mca: base: components_open: component sm open function successful
> [loki:05572] mca: base: components_open: found loaded component tcp
> [loki:05572] mca: base: components_open: component tcp open function successful
> [loki:05572] mca: base: components_open: found loaded component vader
> [loki:05572] mca: base: components_open: component vader open function successful
> [loki:05572] select: initializing btl component self
> [loki:05572] select: init of component self returned success
> [loki:05572] select: initializing btl component sm
> [loki:05572] select: init of component sm returned failure
> [loki:05572] mca: base: close: component sm closed
> [loki:05572] mca: base: close: unloading component sm
> [loki:05572] select: initializing btl component tcp
> [loki:05572] select: init of component tcp returned success
> [loki:05572] select: initializing btl component vader
> [loki][[35331,1],0][../../../../../openmpi-2.0.2rc2/opal/mca/btl/vader/btl_vader_component.c:454:mca_btl_vader_component_init] No peers to communicate with. Disabling vader.
> [loki:05572] select: init of component vader returned failure
> [loki:05572] mca: base: close: component vader closed
> [loki:05572] mca: base: close: unloading component vader
> [loki:05572] mca: bml: Using self btl for send to [[35331,1],0] on node loki
> Slave process 0 of 1 running on loki
> spawn_slave 0: argv[0]: spawn_slave
> [loki:05572] mca: base: close: component self closed
> [loki:05572] mca: base: close: unloading component self
> [loki:05572] mca: base: close: component tcp closed
> [loki:05572] mca: base: close: unloading component tcp
> loki spawn 125
>
> Kind regards, and thank you very much once more
>
> Siegmar
>
> On 03.01.2017 at 00:17, Howard Pritchard wrote:
>> Hi Siegmar,
>>
>> I've attempted to reproduce this using gnu compilers and
>> the version of this test program(s) you posted earlier in 2016
>> but am unable to reproduce the problem.
>>
>> Could you double check that the slave program can be
>> successfully run when launched directly by mpirun/mpiexec?
>> It might also help to use --mca btl_base_verbose 10 when
>> running the slave program standalone.
>>
>> Thanks,
>>
>> Howard
>>
>> 2016-12-28 7:06 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de <mailto:siegmar.gr...@informatik.hs-fulda.de>>:
>
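[Editor's note] The `--mca btl_base_verbose 10` output above is dominated by component bookkeeping; when comparing the master and slave runs, it can help to reduce each log to a per-component verdict. A small helper along those lines (an illustration only — the line format is taken from the log above, and the name `btl_verdicts` is hypothetical):

```python
import re

# Matches verbose-BTL lines like:
#   [loki:05572] select: init of component tcp returned success
PATTERN = re.compile(r"select: init of component (\w+) returned (success|failure)")

def btl_verdicts(log_lines):
    """Map each BTL component name to its init verdict ('success'/'failure')."""
    verdicts = {}
    for line in log_lines:
        m = PATTERN.search(line)
        if m:
            verdicts[m.group(1)] = m.group(2)
    return verdicts

sample = [
    "[loki:05572] select: init of component self returned success",
    "[loki:05572] select: init of component sm returned failure",
    "[loki:05572] select: init of component tcp returned success",
    "[loki:05572] select: init of component vader returned failure",
]
print(btl_verdicts(sample))
```

Diffing the two resulting dictionaries quickly shows whether the spawned run and the standalone run ended up selecting different BTLs.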
Re: [OMPI users] segmentation fault with openmpi-2.0.2rc2 on Linux
Hi Siegmar,

I've attempted to reproduce this using gnu compilers and the version of the test program(s) you posted earlier in 2016, but am unable to reproduce the problem.

Could you double check that the slave program can be successfully run when launched directly by mpirun/mpiexec? It might also help to use --mca btl_base_verbose 10 when running the slave program standalone.

Thanks,

Howard

2016-12-28 7:06 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>:
> Hi,
>
> I have installed openmpi-2.0.2rc2 on my "SUSE Linux Enterprise
> Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.2.0. Unfortunately,
> I get an error when I run one of my programs. Everything works as
> expected with openmpi-master-201612232109-67a08e8. The program
> gets a timeout with openmpi-v2.x-201612232156-5ce66b0.
>
> loki spawn 144 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
>   Open MPI: 2.0.2rc2
>   C compiler absolute: /opt/solstudio12.5b/bin/cc
>
> loki spawn 145 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> --------------------------------------------------------------------------
> A system call failed during shared memory initialization that should
> not have. It is likely that your MPI job will now either abort or
> experience performance degradation.
>
>   Local host:  loki
>   System call: open(2)
>   Error:       No such file or directory (errno 2)
> --------------------------------------------------------------------------
> [loki:17855] *** Process received signal ***
> [loki:17855] Signal: Segmentation fault (11)
> [loki:17855] Signal code: Address not mapped (1)
> [loki:17855] Failing at address: 0x8
> [loki:17855] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f053d0e9870]
> [loki:17855] [ 1] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(+0x990ae)[0x7f05325060ae]
> [loki:17855] [ 2] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_req_start+0x196)[0x7f053250cb16]
> [loki:17855] [ 3] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_irecv+0x2f8)[0x7f05324bd3d8]
> [loki:17855] [ 4] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_generic+0x34c)[0x7f053e52300c]
> [loki:17855] [ 5] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_coll_base_bcast_intra_binomial+0x1ed)[0x7f053e523eed]
> [loki:17855] [ 6] /usr/local/openmpi-2.0.2_64_cc/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x1a3)[0x7f0531ea7c03]
> [loki:17855] [ 7] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_connect_accept+0xab8)[0x7f053d484f38]
> [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
> [loki:17855] [ 8] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_dyn_init+0xcd)[0x7f053d48aeed]
> [loki:17855] [ 9] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_mpi_init+0xf93)[0x7f053d53d5f3]
> [loki:17855] [10] /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(PMPI_Init+0x8d)[0x7f053db209cd]
> [loki:17855] [11] spawn_slave[0x4009cf]
> [loki:17855] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f053cd53b25]
> [loki:17855] [13] spawn_slave[0x400892]
> [loki:17855] *** End of error message ***
> [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[55817,2],0]) is on host: loki
>   Process 2 ([[55817,2],1]) is on host: unknown!
>   BTLs attempted: self sm tcp vader
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MP
Re: [OMPI users] Segmentation Fault (Core Dumped) on mpif90 -v
Hi Paul,

Thanks very much for the Christmas present. The Open MPI README has been updated to include a note about issues with the Intel 16.0.3-4 compiler suites.

Enjoy the holidays,

Howard

2016-12-23 3:41 GMT-07:00 Paul Kapinos <kapi...@itc.rwth-aachen.de>:
> Hi all,
>
> we discussed this issue with Intel compiler support, and it looks like they
> now know what the issue is and how to guard against it. It is a known issue
> resulting from a backwards incompatibility in an OS/glibc update, cf.
> https://sourceware.org/bugzilla/show_bug.cgi?id=20019
>
> Affected versions of the Intel compilers: 16.0.3, 16.0.4
> Not affected versions: 16.0.2, 17.0
>
> So, simply do not use the affected versions (and hope for a bugfix update in
> the 16.x series if, like us, you cannot immediately upgrade to 17.x, even
> though upgrading is the option Intel favours).
>
> Have a nice Christmas time!
>
> Paul Kapinos
>
> On 12/14/16 13:29, Paul Kapinos wrote:
>> Hello all,
>> we seem to run into the same issue: 'mpif90' sigsegvs immediately for Open MPI
>> 1.10.4 compiled using Intel compilers 16.0.4.258 and 16.0.3.210, while it works
>> fine when compiled with 16.0.2.181.
>>
>> It seems to be a compiler issue (more exactly: a library issue in the libs
>> delivered with the 16.0.4.258 and 16.0.3.210 versions). Changing the loaded
>> compiler version back to 16.0.2.181 (=> a change of the dynamically loaded
>> libs) lets the previously-failing binary (compiled with the newer compilers)
>> work properly.
>>
>> Compiling with -O0 does not help. As the issue is likely in the Intel libs
>> (as said, swapping these out solves/raises the issue) we will fall back to
>> the 16.0.2.181 compiler version. We will try to open a case with Intel -
>> let's see...
>>
>> Have a nice day,
>>
>> Paul Kapinos
>>
>> On 05/06/16 14:10, Jeff Squyres (jsquyres) wrote:
>>> Ok, good.
>>>
>>> I asked that question because typically when we see errors like this, it is
>>> usually either a busted compiler installation or inadvertently mixing the
>>> run-times of multiple different compilers in some kind of incompatible way.
>>> Specifically, the mpifort (aka mpif90) application is a fairly simple program
>>> -- there's no reason it should segv, especially with a stack trace like the
>>> one you sent, which implies that it's dying early in startup, potentially
>>> even before it has hit any Open MPI code (i.e., it could even be pre-main).
>>>
>>> BTW, you might be able to get a more complete stack trace from the debugger
>>> that comes with the Intel compiler (idb? I don't remember offhand).
>>>
>>> Since you are able to run simple programs compiled by this compiler, it
>>> sounds like the compiler is working fine. Good!
>>>
>>> The next thing to check is to see if somehow the compiler and/or run-time
>>> environments are getting mixed up. E.g., the apps were compiled for one
>>> compiler/run-time but are being used with another. Also ensure that any
>>> compiler/linker flags that you are passing to Open MPI's configure script
>>> are native and correct for the platform for which you're compiling (e.g.,
>>> don't pass in flags that optimize for a different platform; that may result
>>> in generating machine code instructions that are invalid for your platform).
>>>
>>> Try recompiling/re-installing Open MPI from scratch, and if it still doesn't
>>> work, then send all the information listed here:
>>>
>>>     https://www.open-mpi.org/community/help/
>>>
>>> On May 6, 2016, at 3:45 AM, Giacomo Rossi <giacom...@gmail.com> wrote:
>>>>
>>>> Yes, I've tried three simple "Hello world" programs in Fortran, C and C++,
>>>> and they compile and run with Intel 16.0.3. The problem is with the
>>>> Open MPI compiled from source.
>>>>
>>>> Giacomo Rossi Ph.D., Space Engineer
>>>> Research Fellow at Dept. of Mechanical and Aerospace Engineering,
>>>> "Sapienza" University of Rome
>>>> p: (+39) 0692927207 | m: (+39) 3408816643 | e: giacom...@gmail.com
>>>> Member of Fortran-FOSS-programmers
>>>>
>>>> 2016-05-05 11:15 GMT+02:00 Giacomo Rossi <giacom...@gmail.com>:
>>>> gdb /opt/openmpi/
Re: [OMPI users] device failed to appear .. Connection timed out
Hi Daniele,

I bet this psm2 got installed as part of MPSS 3.7. I see something in the readme for that about an MPSS install with OFED support.

I think if you want to go the route of using the RHEL Open MPI RPMs, you could use the mca-params.conf file approach to disabling the use of psm2. This file, and a lot of other stuff about MCA parameters, is described here:

https://www.open-mpi.org/faq/?category=tuning

Alternatively, you could try to build/install Open MPI yourself from the download page:

https://www.open-mpi.org/software/ompi/v1.10/

The simplest solution - but you need to be confident that nothing is using the PSM2 software - would be to just use yum to deinstall the psm2 rpm.

Good luck,

Howard

2016-12-08 14:17 GMT-07:00 Daniele Tartarini <d.tartar...@sheffield.ac.uk>:
> Hi,
> many thanks for your reply.
>
> I have a S2600IP Intel motherboard. It is a stand-alone server, and I
> cannot see any Omni-Path device, so no such modules.
> opainfo is not available on my system.
>
> Am I missing anything?
> cheers
> Daniele
>
> On 8 December 2016 at 17:55, Cabral, Matias A <matias.a.cab...@intel.com> wrote:
>> > Anyway, /dev/hfi1_0 doesn't exist.
>>
>> Make sure you have the hfi1 module/driver loaded.
>> In addition, please confirm the links are in active state on all the
>> nodes (`opainfo`).
>>
>> _MAC
>>
>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Howard Pritchard
>> Sent: Thursday, December 08, 2016 9:23 AM
>> To: Open MPI Users <users@lists.open-mpi.org>
>> Subject: Re: [OMPI users] device failed to appear .. Connection timed out
>>
>> hello Daniele,
>>
>> Could you post the output from the ompi_info command? I'm noticing on the
>> RPMs that came with the rhel7.2 distro on one of our systems that it was
>> built to support psm2/hfi-1.
>>
>> Two things: could you try running applications with
>>
>> mpirun --mca pml ob1 (all the rest of your args)
>>
>> and see if that works?
>>
>> Second, what sort of system are you using? Is this a cluster? If it is,
>> you may want to check whether you have a situation where it's an Omni-Path
>> interconnect and you have the psm2/hfi1 packages installed, but for some
>> reason the Omni-Path HCAs themselves are not active.
>>
>> On one of our Omni-Path systems the following hfi1-related rpms are installed:
>>
>> hfidiags-0.8-13.x86_64
>> hfi1-psm-devel-0.7-244.x86_64
>> libhfi1verbs-0.5-16.el7.x86_64
>> hfi1-psm-0.7-244.x86_64
>> hfi1-firmware-0.9-36.noarch
>> hfi1-psm-compat-0.7-244.x86_64
>> libhfi1verbs-devel-0.5-16.el7.x86_64
>> hfi1-0.11.3.10.0_327.el7.x86_64-245.x86_64
>> hfi1-firmware_debug-0.9-36.noarch
>> hfi1-diagtools-sw-0.8-13.x86_64
>>
>> Howard
>>
>> 2016-12-08 8:45 GMT-07:00 r...@open-mpi.org <r...@open-mpi.org>:
>> Sounds like something didn't quite get configured right, or maybe you
>> have a library installed that isn't quite set up correctly, or...
>>
>> Regardless, we generally advise building from source to avoid such
>> problems. Is there some reason not to just do so?
>>
>> On Dec 8, 2016, at 6:16 AM, Daniele Tartarini <d.tartar...@sheffield.ac.uk> wrote:
>>
>> Hi,
>>
>> I've installed on a Red Hat 7.2 machine the Open MPI distributed via yum:
>>
>> openmpi-devel.x86_64 1.10.3-3.el7
>>
>> Any code I try to run (including the mpitests-*) gives the following
>> message, with slight variants:
>>
>> my_machine.171619 hfi_wait_for_device: The /dev/hfi1_0 device
>> failed to appear after 15.0 seconds: Connection timed out
>>
>> Is anyone able to help me in identifying the source of the problem?
>> Anyway, /dev/hfi1_0 doesn't exist.
>>
>> If I use an Open MPI version compiled from source I have no issue (gcc 4.8.5).
>>
>> many thanks in advance.
>>
>> cheers
>> Daniele
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
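[Editor's note] The mca-params.conf approach Howard describes above could look something like the following. This is a sketch, not a tested configuration: the file is typically placed at $HOME/.openmpi/mca-params.conf (per-user) or <prefix>/etc/openmpi-mca-params.conf (system-wide), and the parameter values shown are assumptions to adapt to your installation.

```
# $HOME/.openmpi/mca-params.conf
# Steer Open MPI away from the PSM2 MTL so the missing /dev/hfi1_0
# device is never probed; use the ob1 PML instead.
pml = ob1
mtl = ^psm2
```

Command-line `--mca` options and `OMPI_MCA_*` environment variables override anything set in this file, so it is a safe default that can still be bypassed per-run.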
Re: [OMPI users] device failed to appear .. Connection timed out
hello Daniele,

Could you post the output from the ompi_info command? I'm noticing on the RPMs that came with the rhel7.2 distro on one of our systems that it was built to support psm2/hfi-1.

Two things: could you try running applications with

mpirun --mca pml ob1 (all the rest of your args)

and see if that works?

Second, what sort of system are you using? Is this a cluster? If it is, you may want to check whether you have a situation where it's an Omni-Path interconnect and you have the psm2/hfi1 packages installed, but for some reason the Omni-Path HCAs themselves are not active.

On one of our Omni-Path systems the following hfi1-related rpms are installed:

hfidiags-0.8-13.x86_64
hfi1-psm-devel-0.7-244.x86_64
libhfi1verbs-0.5-16.el7.x86_64
hfi1-psm-0.7-244.x86_64
hfi1-firmware-0.9-36.noarch
hfi1-psm-compat-0.7-244.x86_64
libhfi1verbs-devel-0.5-16.el7.x86_64
hfi1-0.11.3.10.0_327.el7.x86_64-245.x86_64
hfi1-firmware_debug-0.9-36.noarch
hfi1-diagtools-sw-0.8-13.x86_64

Howard

2016-12-08 8:45 GMT-07:00 r...@open-mpi.org <r...@open-mpi.org>:
> Sounds like something didn't quite get configured right, or maybe you have
> a library installed that isn't quite set up correctly, or...
>
> Regardless, we generally advise building from source to avoid such
> problems. Is there some reason not to just do so?
>
> On Dec 8, 2016, at 6:16 AM, Daniele Tartarini <d.tartar...@sheffield.ac.uk> wrote:
>
> Hi,
>
> I've installed on a Red Hat 7.2 machine the Open MPI distributed via yum:
>
> openmpi-devel.x86_64 1.10.3-3.el7
>
> Any code I try to run (including the mpitests-*) gives the following
> message, with slight variants:
>
> my_machine.171619 hfi_wait_for_device: The /dev/hfi1_0 device
> failed to appear after 15.0 seconds: Connection timed out
>
> Is anyone able to help me in identifying the source of the problem?
> Anyway, /dev/hfi1_0 doesn't exist.
>
> If I use an Open MPI version compiled from source I have no issue (gcc 4.8.5).
>
> many thanks in advance.
> cheers
> Daniele
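[Editor's note] The same steering can also be done per-job with environment variables instead of a config file: Open MPI reads any `OMPI_MCA_<param>` variable present in the launching environment. A sketch, where the specific values mirror the workaround discussed above and are assumptions to adapt:

```shell
# Select the ob1 PML and exclude the psm2 MTL for any mpirun
# invocations subsequently launched from this shell.
export OMPI_MCA_pml=ob1
export OMPI_MCA_mtl=^psm2
echo "pml=$OMPI_MCA_pml mtl=$OMPI_MCA_mtl"
```

When running under a batch scheduler, make sure these variables are exported into the environment that actually invokes mpirun.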
Re: [OMPI users] Follow-up to Open MPI SC'16 BOF
Hi Jeff,

I don't think it was the use of memkind itself, but rather a need to refactor the way Open MPI uses info objects, that was the issue. I don't recall the details.

Howard

2016-11-22 16:27 GMT-07:00 Jeff Hammond <jeff.scie...@gmail.com>:
>>
>> 1. MPI_ALLOC_MEM integration with memkind
>
> It would make sense to prototype this as a standalone project that is
> integrated with any MPI library via PMPI. It's probably a day or two of
> work to get that going.
>
> Jeff
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
[OMPI users] Follow-up to Open MPI SC'16 BOF
Hello Folks,

This is a follow-up to the question posed at the SC'16 Open MPI BOF: would the community prefer a v2.2.x limited-feature but backwards-compatible release sometime in 2017, or a v3.x (not backwards compatible, but with potentially more features) sometime in late 2017 to early 2018?

BOF attendees expressed an interest in having a list of features that might make it into v2.2.x, and of ones that the Open MPI developers think would be too hard to back-port from the development branch (master) to a v2.2.x release stream. Here are the requested lists:

Features that we anticipate we could port to a v2.2.x release:
1. Improved collective performance (a new "tuned" module)
2. Enable Linux CMA shared memory support by default
3. PMIx 3.0 (if new functionality were to be used in this release of Open MPI)

Features that we anticipate would be too difficult to port to a v2.2.x release:
1. Revamped CUDA support
2. MPI_ALLOC_MEM integration with memkind
3. OpenMP affinity/placement integration
4. THREAD_MULTIPLE improvements to MTLs (not so clear on the level of difficulty for this one)

You can register your opinion on whether to go with a v2.2.x release next year, or to go from v2.1.x to v3.x in late 2017 or early 2018, at the link below:

https://www.open-mpi.org/sc16/

Thanks very much,

Howard

--
Howard Pritchard
HPC-DES
Los Alamos National Laboratory