Greg,

If Open MPI was built with UCX, your jobs will likely use UCX (and the
shared memory provider) even if running on a single node.
You can run
  mpirun --mca pml ob1 --mca btl self,sm ...
if you want to avoid using UCX.
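
If tweaking the mpirun command line buried inside "make test" is not convenient, the same MCA parameters can likely be set via environment variables exported before running the tests, for example (assuming a bash shell; the make invocation is just illustrative):

  export OMPI_MCA_pml=ob1        # force the ob1 PML instead of UCX
  export OMPI_MCA_btl=self,sm    # use only the self and shared memory BTLs
  make test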

What is a typical mpirun command line used under the hood by your "make
test"?
Though the warning can likely be ignored, the SIGILL is definitely an issue.
I encourage you to have your app dump a core in order to figure out where
this is coming from.
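
A minimal sketch of how to do that, assuming a bash shell inside the container and gdb available (the binary and core file paths below are placeholders):

  ulimit -c unlimited                   # allow core files in the shell that launches "make test"
  cat /proc/sys/kernel/core_pattern     # check where/how cores get written (it may pipe to systemd-coredump)
  # ... reproduce the failure, then:
  gdb /path/to/test_binary /path/to/core
  (gdb) bt                              # the backtrace shows where the illegal instruction was hit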


Cheers,

Gilles

On Tue, Apr 16, 2024 at 5:20 AM Greg Samonds via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
>
>
> We’re running into issues with jobs failing in a non-deterministic way
> when running multiple jobs concurrently within a “make test” framework.
>
>
>
> Make test is launched from within a shell script running inside a Podman
> container, and we’re typically running with “-j 20” and “-np 4” (20 jobs
> concurrently with 4 procs each).  We’ve also tried reducing the number of
> jobs to no avail.  Each time the battery of test cases is run, about 2 to 4
> different jobs out of around 200 fail with the following errors:
>
>
>
>
> [podman-ci-rocky-8.8:03528] MCW rank 1 is not bound (or bound to all available processors)
> [podman-ci-rocky-8.8:03540] MCW rank 3 is not bound (or bound to all available processors)
> [podman-ci-rocky-8.8:03519] MCW rank 0 is not bound (or bound to all available processors)
> [podman-ci-rocky-8.8:03533] MCW rank 2 is not bound (or bound to all available processors)
>
> Program received signal SIGILL: Illegal instruction.
>
> Some info about our setup:
>
>    - Ampere Altra 80 core ARM machine
>    - Open MPI 4.1.7a1 from HPC-X v2.18
>    - Rocky Linux 8.6 host, Rocky Linux 8.8 container
>    - Podman 4.4.1
>    - This machine has a Mellanox ConnectX-6 Lx NIC, however we’re
>    avoiding the Mellanox software stack by running in a container, and these
>    are single-node jobs only
>
>
>
> We tried passing “--bind-to none” to the running jobs, and while this
> seemed to reduce the number of failing jobs on average, it didn’t eliminate
> the issue.
>
>
>
> We also encounter the following warning:
>
>
>
> [1712927028.412063] [podman-ci-rocky-8:3519 :0]   sock.c:514  UCX  WARN  unable to read somaxconn value from /proc/sys/net/core/somaxconn file
>
>
>
> …however as far as I can tell this is probably unrelated and occurs
> because the associated file isn’t accessible inside the container, and
> after checking the UCX source I can see that SOMAXCONN is picked up from
> the system headers anyway.
>
>
>
> If anyone has hints about how to work around this issue we’d greatly
> appreciate it!
>
>
>
> Thanks,
>
> Greg
>
