If that works, then it might be possible to include the namespace ID in the job-info provided by PMIx at startup. I would have to investigate, so please confirm that the modex option works first.
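For reference, a minimal standalone sketch of the check being discussed. It is not Open MPI code; it simply reads the user-namespace ID (the inode of /proc/<pid>/ns/user) for the calling process and for a peer PID passed on the command line. This is the value each task could publish with OPAL_MODEX_SEND() and compare after OPAL_MODEX_RECV(), or that could be carried in the PMIx job-info:

    /* Standalone sketch, not Open MPI code: read the user-namespace ID of a
     * process from /proc/<pid>/ns/user.  On Linux the namespace is identified
     * by the inode number of that file; two processes can only use CMA if the
     * IDs match (in addition to the usual ptrace ownership checks). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Return the inode of /proc/<pid>/ns/user, or 0 on error. */
    static unsigned long user_ns_id(pid_t pid)
    {
        char path[64];
        struct stat st;

        snprintf(path, sizeof(path), "/proc/%d/ns/user", (int) pid);
        if (stat(path, &st) != 0)
            return 0;
        return (unsigned long) st.st_ino;
    }

    int main(int argc, char **argv)
    {
        pid_t peer = (argc > 1) ? (pid_t) atoi(argv[1]) : getpid();
        unsigned long mine = user_ns_id(getpid());
        unsigned long theirs = user_ns_id(peer);

        printf("own  user namespace: %lu\n", mine);
        printf("peer user namespace: %lu\n", theirs);
        printf("CMA %s\n", (mine != 0 && mine == theirs) ?
               "should be usable (same user namespace)" :
               "must be disabled (different user namespaces)");
        return 0;
    }

Two processes started from different rootless Podman containers report different IDs here, which is exactly the situation in which the kernel refuses process_vm_readv()/process_vm_writev(), as described below.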
> On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users <users@lists.open-mpi.org> wrote:
>
> Adrian,
>
> An option is to involve the modex: each task would OPAL_MODEX_SEND() its own namespace ID, and then OPAL_MODEX_RECV() the one from its peers and decide whether CMA support can be enabled.
>
> Cheers,
>
> Gilles
>
> On 7/22/2019 4:53 PM, Adrian Reber via users wrote:
>> I had a look at it and I am not sure it really makes sense.
>>
>> In btl_vader_{put,get}.c it would be easy to check the user namespace ID of the other process, but the function would then just return OPAL_ERROR a bit earlier instead of as a result of process_vm_{read,write}v(). Nothing would really change.
>>
>> A better place for the check would be mca_btl_vader_check_single_copy(), but I do not know if at this point the PID of the other processes is already known, so I am not sure I can check the user namespace ID of the other processes there.
>>
>> Any recommendations on how to do this?
>>
>> Adrian
>>
>> On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote:
>>> Patches are always welcome. What would be great is a nice big warning that CMA support is disabled because the processes are in different namespaces. Ideally all MPI processes should be in the same namespace to ensure the best performance.
>>>
>>> -Nathan
>>>
>>>> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users <users@lists.open-mpi.org> wrote:
>>>>
>>>> For completeness I am mentioning my results also here.
>>>>
>>>> Mounting file systems in the container only works if user namespaces are used, and even if the user IDs are all the same (in each container and on the host), the kernel also checks, before allowing ptrace, whether the processes are in the same user namespace (in addition to being owned by the same user). This check - same user namespace - fails, and so process_vm_readv() and process_vm_writev() fail as well.
>>>>
>>>> So Open MPI's checks are currently not enough to detect whether 'cma' can be used. Checking for the same user namespace would also be necessary.
>>>>
>>>> Is this a use case important enough to accept a patch for it?
>>>>
>>>> Adrian
>>>>
>>>>> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote:
>>>>> Gilles,
>>>>>
>>>>> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' indeed helps.
>>>>>
>>>>> The default seems to be 'cma', which uses process_vm_readv() and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but telling Podman to give the process CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE' does not seem to be enough. Not sure yet if this is related to the fact that Podman is running rootless. I will continue to investigate, but now I know where to look. Thanks!
>>>>>
>>>>> Adrian
>>>>>
>>>>>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users wrote:
>>>>>> Adrian,
>>>>>>
>>>>>> Can you try
>>>>>> mpirun --mca btl_vader_copy_mechanism none ...
>>>>>>
>>>>>> Please double check the MCA parameter name, I am AFK.
>>>>>>
>>>>>> IIRC, the default copy mechanism used by vader directly accesses the remote process address space, and this requires some permission (ptrace?) that might be dropped by podman.
>>>>>>
>>>>>> Note Open MPI might not detect that both MPI tasks run on the same node because of podman.
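The direct access to the remote process address space that Gilles mentions boils down to the process_vm_readv()/process_vm_writev() system calls. A small standalone demo (not Open MPI code; the buffer contents are made up) that exercises the same kernel permission check:

    /* Standalone demo, not Open MPI code: a child publishes the address of a
     * buffer over a pipe and the parent reads that buffer directly out of the
     * child's address space with process_vm_readv().  With both processes in
     * the same user namespace this prints the message; across different user
     * namespaces the very same call fails with EPERM, even for the same UID. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int pipefd[2];
        if (pipe(pipefd) != 0) { perror("pipe"); return 1; }

        pid_t child = fork();
        if (child == 0) {                      /* child: export buffer address */
            static char msg[32] = "hello from the child";
            void *addr = msg;
            write(pipefd[1], &addr, sizeof(addr));
            sleep(2);                          /* keep the mapping alive */
            _exit(0);
        }

        void *remote_addr;
        read(pipefd[0], &remote_addr, sizeof(remote_addr));

        char buf[32] = { 0 };
        struct iovec local  = { .iov_base = buf,         .iov_len = sizeof(buf) - 1 };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = sizeof(buf) - 1 };

        ssize_t n = process_vm_readv(child, &local, 1, &remote, 1, 0);
        if (n < 0)
            perror("process_vm_readv");        /* EPERM across user namespaces */
        else
            printf("read %zd bytes: %s\n", n, buf);

        waitpid(child, NULL, 0);
        return 0;
    }

The failure mode is consistent with Adrian's observation above that '--cap-add=SYS_PTRACE' alone does not help once the processes live in different user namespaces.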
>>>>>> If you use UCX, then btl/vader is not used at all (pml/ucx is used instead).
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> Sent from my iPod
>>>>>>
>>>>>>> On Jul 12, 2019, at 18:33, Adrian Reber via users <users@lists.open-mpi.org> wrote:
>>>>>>>
>>>>>>> So upstream Podman was really fast and merged a PR which makes my wrapper unnecessary:
>>>>>>>
>>>>>>> Add support for --env-host: https://github.com/containers/libpod/pull/3557
>>>>>>>
>>>>>>> As commented in the PR, I can now start mpirun with Podman without a wrapper:
>>>>>>>
>>>>>>> $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host --security-opt label=disable -v /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test /home/mpi/ring
>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>> Rank 0 has completed ring
>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>> Rank 1 has completed ring
>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>>
>>>>>>> This example was using TCP; on an InfiniBand based system I have to map the InfiniBand devices into the container:
>>>>>>>
>>>>>>> $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host -v /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable --userns=keep-id --device /dev/infiniband/uverbs0 --device /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test /home/mpi/ring
>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>> Rank 0 has completed ring
>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>> Rank 1 has completed ring
>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>>
>>>>>>> This is all running without root, using only Podman's rootless support.
>>>>>>>
>>>>>>> Running multiple processes on one system, however, still gives me an error. If I disable vader, I guess Open MPI uses TCP for localhost communication, and that works. But with vader it fails.
>>>>>>>
>>>>>>> The first error message I get is a segfault:
>>>>>>>
>>>>>>> [test1:00001] *** Process received signal ***
>>>>>>> [test1:00001] Signal: Segmentation fault (11)
>>>>>>> [test1:00001] Signal code: Address not mapped (1)
>>>>>>> [test1:00001] Failing at address: 0x7fb7b1552010
>>>>>>> [test1:00001] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
>>>>>>> [test1:00001] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
>>>>>>> [test1:00001] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
>>>>>>> [test1:00001] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
>>>>>>> [test1:00001] [ 4] /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
>>>>>>> [test1:00001] [ 5] /home/mpi/ring[0x400b76]
>>>>>>> [test1:00001] [ 6] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
>>>>>>> [test1:00001] [ 7] /home/mpi/ring[0x4008be]
>>>>>>> [test1:00001] *** End of error message ***
>>>>>>>
>>>>>>> Guessing that vader uses shared memory, this is expected to fail with all the namespace isolations in place - maybe not with a segfault, but each container has its own shared memory.
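To illustrate that last point, a generic POSIX shared-memory example, not Open MPI's actual vader setup code (the segment name is made up): one process creates and fills a segment that lives in /dev/shm, a second one attaches to it by name. If every container mounts its own private /dev/shm, the second process simply never finds the segment:

    /* Generic POSIX shared-memory sketch; build with -lrt on older glibc.
     * Run "./shmdemo create" in one process/container and "./shmdemo" in a
     * second one.  With a shared /dev/shm the reader prints the message;
     * with a private per-container /dev/shm its shm_open() fails (ENOENT). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *name = "/vader-demo";
        int creator = (argc > 1 && strcmp(argv[1], "create") == 0);

        int fd = shm_open(name, creator ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (creator && ftruncate(fd, 4096) != 0) { perror("ftruncate"); return 1; }

        char *seg = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (seg == MAP_FAILED) { perror("mmap"); return 1; }

        if (creator) {
            strcpy(seg, "hello via /dev/shm");
            pause();                           /* keep the segment alive */
        } else {
            printf("peer wrote: %s\n", seg);
            shm_unlink(name);
        }
        return 0;
    }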
>>>>>>> So the next step was to use the host's ipc and pid namespaces and to mount /dev/shm:
>>>>>>>
>>>>>>> '-v /dev/shm:/dev/shm --ipc=host --pid=host'
>>>>>>>
>>>>>>> This does not segfault, but still does not look correct:
>>>>>>>
>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>> Rank 2 has cleared MPI_Init
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> Rank 0 has completed ring
>>>>>>> Rank 2 has completed ring
>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>> Rank 1 has completed ring
>>>>>>> Rank 2 has completed MPI_Barrier
>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>>
>>>>>>> This is using the Open MPI ring.c example with SIZE increased from 20 to 20000.
>>>>>>>
>>>>>>> Any recommendations on what vader needs to communicate correctly?
>>>>>>>
>>>>>>> Adrian
>>>>>>>
>>>>>>>> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote:
>>>>>>>> Gilles,
>>>>>>>>
>>>>>>>> thanks for pointing out the environment variables. I quickly created a wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables (grep "\(PMIX\|OMPI\)"). Now it works:
>>>>>>>>
>>>>>>>> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id --net=host mpi-test /home/mpi/hello
>>>>>>>>
>>>>>>>> Hello, world (2 procs total)
>>>>>>>> --> Process # 0 of 2 is alive. ->test1
>>>>>>>> --> Process # 1 of 2 is alive. ->test2
>>>>>>>>
>>>>>>>> I need to tell Podman to mount /tmp from the host into the container; as I am running rootless, I also need to tell Podman to use the same user ID in the container as outside (so that the Open MPI files in /tmp can be shared), and I am also running without a network namespace.
>>>>>>>>
>>>>>>>> So this is now with the full Podman-provided isolation except the network namespace. Thanks for your help!
>>>>>>>>
>>>>>>>> Adrian
>>>>>>>>
>>>>>>>>> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users wrote:
>>>>>>>>> Adrian,
>>>>>>>>>
>>>>>>>>> the MPI application relies on some environment variables (they typically start with OMPI_ and PMIX_).
>>>>>>>>>
>>>>>>>>> The MPI application internally uses a PMIx client that must be able to contact a PMIx server (included in mpirun and in the orted daemon(s) spawned on the remote hosts) located on the same host.
>>>>>>>>>
>>>>>>>>> If podman provides some isolation between the app inside the container (e.g. /home/mpi/hello) and the outside world (e.g. mpirun/orted), that won't be an easy ride.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Gilles
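One quick way to verify that the container actually sees the variables Gilles describes is to run a tiny diagnostic inside it in place of the MPI binary (a hypothetical helper, not part of Open MPI) that prints every OMPI_* and PMIX_* variable the process inherited:

    /* Hypothetical diagnostic: print every OMPI_* and PMIX_* variable the
     * process inherited.  Run it inside the container instead of the MPI
     * program to see what podman actually passed through. */
    #include <stdio.h>
    #include <string.h>

    extern char **environ;

    int main(void)
    {
        for (char **e = environ; *e != NULL; e++)
            if (strncmp(*e, "OMPI_", 5) == 0 || strncmp(*e, "PMIX_", 5) == 0)
                puts(*e);
        return 0;
    }

If this prints nothing when launched through podman run, the container did not inherit the launch environment, which matches the "every process is rank 0 of 1" symptom in the original report below.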
>>>>>>>>>> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
>>>>>>>>>> I did a quick test to see if I can use Podman in combination with Open MPI:
>>>>>>>>>>
>>>>>>>>>> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run quay.io/adrianreber/mpi-test /home/mpi/hello
>>>>>>>>>>
>>>>>>>>>> Hello, world (1 procs total)
>>>>>>>>>> --> Process # 0 of 1 is alive. ->789b8fb622ef
>>>>>>>>>>
>>>>>>>>>> Hello, world (1 procs total)
>>>>>>>>>> --> Process # 0 of 1 is alive. ->749eb4e1c01a
>>>>>>>>>>
>>>>>>>>>> The test program (hello) is taken from https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
>>>>>>>>>>
>>>>>>>>>> The problem with this is that each process thinks it is process 0 of 1 instead of
>>>>>>>>>>
>>>>>>>>>> Hello, world (2 procs total)
>>>>>>>>>> --> Process # 1 of 2 is alive. ->test1
>>>>>>>>>> --> Process # 0 of 2 is alive. ->test2
>>>>>>>>>>
>>>>>>>>>> My question is: how is the rank determined? What resources do I need to have in my container to correctly determine the rank?
>>>>>>>>>>
>>>>>>>>>> This is Podman 1.4.2 and Open MPI 4.0.1.
>>>>>>>>>>
>>>>>>>>>> Adrian
>>
>> Adrian
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users