Greg,

If Open MPI was built with UCX, your jobs will likely use UCX (and the shared memory provider) even if running on a single node. If you want to avoid using UCX, you can run

    mpirun --mca pml ob1 --mca btl self,sm ...
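
A minimal sketch of how one might check this from a shell, assuming ompi_info and mpirun from the same Open MPI install are on PATH. The ./my_test binary is a hypothetical placeholder, and the pml_base_verbose setting is only there so that Open MPI reports which PML it selected. The command line is printed rather than executed, so it can be compared against what "make test" actually runs.

```shell
#!/bin/sh
# Sketch only: ./my_test is a hypothetical placeholder for one of the test binaries.

# Report whether this Open MPI build lists any UCX components.
# Returns 2 when the Open MPI tools are not on PATH.
has_ucx() {
    command -v ompi_info >/dev/null 2>&1 || return 2
    ompi_info 2>/dev/null | grep -qi ucx
}

# Print (do not run) an mpirun command line that forces the ob1 PML with
# the BTLs named in the message above, plus verbose PML selection output.
no_ucx_cmd() {
    echo "mpirun --mca pml ob1 --mca btl self,sm --mca pml_base_verbose 10 $*"
}

if has_ucx; then
    echo "UCX components found; UCX may be selected even on a single node"
fi
no_ucx_cmd -np 4 ./my_test
```

If the verbose output of a real run still mentions UCX, the MCA options are probably not reaching mpirun through the test framework.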

What is a typical mpirun command line used under the hood by your "make test"?

Though the warning might be ignored, SIGILL is definitely an issue. I encourage you to have your app dump a core in order to figure out where this is coming from.

Cheers,

Gilles

On Tue, Apr 16, 2024 at 5:20 AM Greg Samonds via users <users@lists.open-mpi.org> wrote:

> Hello,
>
> We’re running into issues with jobs failing in a non-deterministic way
> when running multiple jobs concurrently within a “make test” framework.
>
> Make test is launched from within a shell script running inside a Podman
> container, and we’re typically running with “-j 20” and “-np 4” (20 jobs
> concurrently with 4 procs each). We’ve also tried reducing the number of
> jobs to no avail. Each time the battery of test cases is run, about 2 to 4
> different jobs out of around 200 fail with the following errors:
>
> [podman-ci-rocky-8.8:03528] MCW rank 1 is not bound (or bound to all available processors)
> [podman-ci-rocky-8.8:03540] MCW rank 3 is not bound (or bound to all available processors)
> [podman-ci-rocky-8.8:03519] MCW rank 0 is not bound (or bound to all available processors)
> [podman-ci-rocky-8.8:03533] MCW rank 2 is not bound (or bound to all available processors)
>
> Program received signal SIGILL: Illegal instruction.
>
> Some info about our setup:
>
>    - Ampere Altra 80 core ARM machine
>    - Open MPI 4.1.7a1 from HPC-X v2.18
>    - Rocky Linux 8.6 host, Rocky Linux 8.8 container
>    - Podman 4.4.1
>    - This machine has a Mellanox ConnectX-6 Lx NIC, however we’re
>    avoiding the Mellanox software stack by running in a container, and these
>    are single node jobs only
>
> We tried passing “--bind-to none” to the running jobs, and while this
> seemed to reduce the number of failing jobs on average, it didn’t eliminate
> the issue.
>
> We also encounter the following warning:
>
> [1712927028.412063] [podman-ci-rocky-8:3519 :0] sock.c:514 UCX WARN unable to read somaxconn value from /proc/sys/net/core/somaxconn file
>
> …however as far as I can tell this is probably unrelated and occurs
> because the associated file isn’t accessible inside the container, and
> after checking the UCX source I can see that SOMAXCONN is picked up from
> the system headers anyway.
>
> If anyone has hints about how to work around this issue we’d greatly
> appreciate it!
>
> Thanks,
> Greg
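
For the core dump suggested in the reply above, a minimal shell sketch follows. It assumes the commands run in the same shell (inside the container) that launches the tests; ./my_test and the core file name are hypothetical placeholders. Note that kernel.core_pattern is typically the host's setting even inside a container, so the core may land somewhere other than the working directory.

```shell
#!/bin/sh
# Sketch only: ./my_test and <core-file> are hypothetical placeholders.

# Try to lift the core file size limit for this shell and its children.
enable_cores() {
    ulimit -c unlimited 2>/dev/null || echo "could not raise core limit" >&2
    echo "core size limit: $(ulimit -c)"
}

# Show where the kernel writes cores (or pipes them to a handler).
show_core_pattern() {
    if [ -r /proc/sys/kernel/core_pattern ]; then
        echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
    else
        echo "core_pattern: unreadable"
    fi
}

enable_cores
show_core_pattern

# Once a failing run has left a core file, load it with the matching binary.
# In gdb, 'bt' prints the backtrace and 'x/i $pc' shows the faulting instruction.
echo "next step: gdb ./my_test <core-file>"
```

Since the failure is a SIGILL, the instruction at the faulting address is usually the most informative piece of the core.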