For the mailing list: the issue is that some M1 Macs apparently default to an 
unlimited maximum number of open files per process.  If you use "ulimit -n" to 
set a reasonable limit before launching, Open MPI should behave better:

ulimit -n 1024
mpirun ...

We don't (yet?) know why this seems to be necessary on some M1s (e.g., Scott's) 
but not others (e.g., George's).

We'll add a guard against the "unlimited" case in future releases.
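
A minimal sketch of what such a guard could look like, in C. This is not the 
actual Open MPI fix: the names clamp_fd_limit and bounded_fd_max and the cap 
of 1024 (FD_SCAN_CAP) are hypothetical, chosen only for illustration. The idea 
is to treat an unlimited or implausibly large descriptor limit as a small 
finite one before any code loops over every possible descriptor.

```c
#include <limits.h>
#include <sys/resource.h>

/* Hypothetical cap, chosen only for illustration. */
#define FD_SCAN_CAP 1024L

/* Clamp a raw descriptor limit to a finite range.
 * raw <= 0 covers a failed or indeterminate query (e.g. -1). */
long clamp_fd_limit(long raw, long cap)
{
    if (raw <= 0 || raw > cap) {
        return cap;
    }
    return raw;
}

/* Return a bounded upper limit for loops that visit every possible fd. */
long bounded_fd_max(void)
{
    struct rlimit rl;

    /* Use the soft limit only if it is finite and fits in a long. */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0
        && rl.rlim_cur != RLIM_INFINITY
        && rl.rlim_cur <= (rlim_t)LONG_MAX) {
        return clamp_fd_limit((long)rl.rlim_cur, FD_SCAN_CAP);
    }
    return FD_SCAN_CAP; /* unlimited or unknown: fall back to the cap */
}
```

Code that walks every possible descriptor would then iterate up to 
bounded_fd_max() rather than the raw limit.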

See https://github.com/open-mpi/ompi/issues/10358 for more details, but I 
figured I'd put the workaround out here on the mailing list.

--
Jeff Squyres
jsquy...@cisco.com

________________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Jeff Squyres 
(jsquyres) via users <users@lists.open-mpi.org>
Sent: Thursday, May 5, 2022 3:31 PM
To: George Bosilca; Open MPI Users
Cc: Jeff Squyres (jsquyres)
Subject: Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

Scott and I conversed a bit off list, and I got more data.  I posted everything 
in https://github.com/open-mpi/ompi/issues/10358 -- let's follow up on this 
issue there.

--
Jeff Squyres
jsquy...@cisco.com

________________________________________
From: George Bosilca <bosi...@icl.utk.edu>
Sent: Thursday, May 5, 2022 3:19 PM
To: Open MPI Users
Cc: Jeff Squyres (jsquyres); Scott Sayres
Subject: Re: [OMPI users] mpirun hangs on m1 mac w openmpi-4.1.3

That is weird, but maybe it is not a deadlock, just very slow progress. In the 
child, can you print fdmax and i in the do_child frame?
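
To illustrate why a very large limit could look like a hang, here is a small C 
sketch. It assumes that fdmax is taken from the per-process open-file limit as 
reported by sysconf(_SC_OPEN_MAX) and that the child closes descriptors one by 
one starting at 3; both are assumptions, since only the variable names and the 
close frame appear in this thread. The helper names are hypothetical.

```c
#include <stdio.h>
#include <unistd.h>

/* Number of close() attempts a loop over [first_fd, fdmax) would make. */
long close_attempts(long first_fd, long fdmax)
{
    return (fdmax > first_fd) ? fdmax - first_fd : 0;
}

/* Print the limit a forked child would see and the implied loop length.
 * Returns the raw sysconf() value (-1 if indeterminate). */
long report_fd_limit(FILE *out)
{
    long fdmax = sysconf(_SC_OPEN_MAX);

    fprintf(out, "sysconf(_SC_OPEN_MAX) = %ld, close attempts from fd 3 = %ld\n",
            fdmax, close_attempts(3, fdmax));
    return fdmax;
}
```

Calling report_fd_limit(stderr) from the same shell that launches mpirun shows 
how many iterations such a loop would need; a count in the billions would be 
consistent with slow progress rather than a true deadlock.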

George.

On Thu, May 5, 2022 at 11:50 AM Scott Sayres via users 
<users@lists.open-mpi.org> wrote:
Jeff, thanks.
from 1:

(lldb) process attach --pid 95083
Process 95083 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00000001bde25628 libsystem_kernel.dylib`close + 8
libsystem_kernel.dylib`close:
->  0x1bde25628 <+8>:  b.lo   0x1bde25648               ; <+40>
    0x1bde2562c <+12>: pacibsp
    0x1bde25630 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x1bde25634 <+20>: mov    x29, sp
Target 0: (orterun) stopped.
Executable module set to "/usr/local/bin/orterun".
Architecture set to: arm64e-apple-macosx-.

(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001bde25628 libsystem_kernel.dylib`close + 8
    frame #1: 0x0000000101563074 mca_odls_default.so`do_child(cd=0x0000600001e28000, write_fd=40) at odls_default_module.c:410:17
    frame #2: 0x0000000101562d7c mca_odls_default.so`odls_default_fork_local_proc(cdptr=0x0000600001e28000) at odls_default_module.c:646:9
    frame #3: 0x0000000100e2c6f8 libopen-rte.40.dylib`orte_odls_base_spawn_proc(fd=-1, sd=4, cbdata=0x0000600001e28000) at odls_base_default_fns.c:1046:31
    frame #4: 0x00000001011827a0 libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined] event_process_active_single_queue(base=0x000000010df069d0) at event.c:1370:4 [opt]
    frame #5: 0x0000000101182628 libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined] event_process_active(base=0x000000010df069d0) at event.c:1440:8 [opt]
    frame #6: 0x00000001011825ec libopen-pal.40.dylib`opal_libevent2022_event_base_loop(base=0x000000010df069d0, flags=<unavailable>) at event.c:1644:12 [opt]
    frame #7: 0x0000000100bbfb04 orterun`orterun(argc=4, argv=0x000000016f2432f8) at orterun.c:179:9
    frame #8: 0x0000000100bbf904 orterun`main(argc=4, argv=0x000000016f2432f8) at main.c:13:12
    frame #9: 0x0000000100f19088 dyld`start + 516

from 2:

scottsayres@scotts-mbp ~ % lldb -p 95082
(lldb) process attach --pid 95082
Process 95082 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00000001bde25654 libsystem_kernel.dylib`read + 8
libsystem_kernel.dylib`read:
->  0x1bde25654 <+8>:  b.lo   0x1bde25674               ; <+40>
    0x1bde25658 <+12>: pacibsp
    0x1bde2565c <+16>: stp    x29, x30, [sp, #-0x10]!
    0x1bde25660 <+20>: mov    x29, sp
Target 0: (orterun) stopped.
Executable module set to "/usr/local/bin/orterun".
Architecture set to: arm64e-apple-macosx-.

(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001bde25654 libsystem_kernel.dylib`read + 8
    frame #1: 0x000000010116969c libopen-pal.40.dylib`opal_fd_read(fd=22, len=20, buffer=0x000000016f24299c) at fd.c:51:14
    frame #2: 0x0000000101563388 mca_odls_default.so`do_parent(cd=0x0000600001e28200, read_fd=22) at odls_default_module.c:495:14
    frame #3: 0x0000000101562d90 mca_odls_default.so`odls_default_fork_local_proc(cdptr=0x0000600001e28200) at odls_default_module.c:651:12
    frame #4: 0x0000000100e2c6f8 libopen-rte.40.dylib`orte_odls_base_spawn_proc(fd=-1, sd=4, cbdata=0x0000600001e28200) at odls_base_default_fns.c:1046:31
    frame #5: 0x00000001011827a0 libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined] event_process_active_single_queue(base=0x000000010df069d0) at event.c:1370:4 [opt]
    frame #6: 0x0000000101182628 libopen-pal.40.dylib`opal_libevent2022_event_base_loop [inlined] event_process_active(base=0x000000010df069d0) at event.c:1440:8 [opt]
    frame #7: 0x00000001011825ec libopen-pal.40.dylib`opal_libevent2022_event_base_loop(base=0x000000010df069d0, flags=<unavailable>) at event.c:1644:12 [opt]
    frame #8: 0x0000000100bbfb04 orterun`orterun(argc=4, argv=0x000000016f2432f8) at orterun.c:179:9
    frame #9: 0x0000000100bbf904 orterun`main(argc=4, argv=0x000000016f2432f8) at main.c:13:12
    frame #10: 0x0000000100f19088 dyld`start + 516
