I had a look at it and I am not sure it really makes sense. In btl_vader_{put,get}.c it would be easy to check the user namespace ID of the other process, but the function would then just return OPAL_ERROR a bit earlier instead of as a result of process_vm_{read,write}v(). Nothing would really change.

A better place for the check would be mca_btl_vader_check_single_copy(), but I do not know whether the PID of the other processes is already known at that point, so I am not sure I can check their user namespace IDs there. Any recommendations on how to do this?
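For reference, this is roughly the check I have in mind, assuming the peer PID is already available at that point. This is an untested sketch and the helper name is made up: two processes share a user namespace exactly when /proc/<pid>/ns/user refers to the same inode on the same device.

    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Hypothetical helper: true if the calling process and the peer with the
     * given PID are in the same user namespace.  Same namespace means
     * /proc/<pid>/ns/user is the same inode on the same device. */
    static bool same_user_namespace(pid_t pid)
    {
        struct stat self, peer;
        char path[64];

        if (stat("/proc/self/ns/user", &self) < 0) {
            return false;
        }

        snprintf(path, sizeof(path), "/proc/%d/ns/user", (int) pid);
        if (stat(path, &peer) < 0) {
            /* Peer not visible (e.g. separate PID namespace): be conservative. */
            return false;
        }

        return self.st_dev == peer.st_dev && self.st_ino == peer.st_ino;
    }

If such a check returns false, vader could disable CMA up front and print the big warning Nathan suggested, rather than letting process_vm_{read,write}v() fail later.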
Adrian

On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote:
> Patches are always welcome. What would be great is a nice big warning that CMA support is disabled because the processes are on different namespaces. Ideally all MPI processes should be on the same namespace to ensure the best performance.
>
> -Nathan
>
> > On Jul 21, 2019, at 2:53 PM, Adrian Reber via users <users@lists.open-mpi.org> wrote:
> >
> > For completeness I am mentioning my results also here.
> >
> > Mounting file systems in the container only works if user namespaces are used, and even if the user IDs are all the same (in each container and on the host), for ptrace the kernel also checks whether the processes are in the same user namespace (in addition to being owned by the same user). This check - same user namespace - fails, and so process_vm_readv() and process_vm_writev() also fail.
> >
> > So Open MPI's checks are currently not enough to detect whether 'cma' can be used. Checking for the same user namespace would also be necessary.
> >
> > Is this a use case important enough to accept a patch for it?
> >
> > Adrian
> >
> >> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote:
> >> Gilles,
> >>
> >> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' indeed helps.
> >>
> >> The default seems to be 'cma', which seems to use process_vm_readv() and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but telling Podman to give the process CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE' does not seem to be enough. I am not sure yet whether this is related to the fact that Podman is running rootless. I will continue to investigate, but now I know where to look. Thanks!
> >>
> >> Adrian
> >>
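(As an aside: a quick way to check on a given setup whether process_vm_readv() is permitted against a particular peer is a standalone probe along the following lines. This is an untested sketch, not part of Open MPI; it needs the peer PID to be visible from where it runs, and my reading is that a permission/namespace failure shows up as EPERM or EACCES before the dummy remote address is ever touched, while EFAULT would mean the call itself was allowed.)

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Usage: ./cma_probe <pid> */
    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        pid_t pid = (pid_t) atoi(argv[1]);
        char buf[1];
        struct iovec local  = { .iov_base = buf, .iov_len = sizeof(buf) };
        /* Deliberately bogus remote address: we only care about the errno. */
        struct iovec remote = { .iov_base = (void *) 0x1000, .iov_len = sizeof(buf) };

        if (process_vm_readv(pid, &local, 1, &remote, 1, 0) < 0) {
            perror("process_vm_readv"); /* EPERM/EACCES: ptrace/namespace check failed */
            return 1;
        }
        printf("process_vm_readv succeeded\n");
        return 0;
    }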
> >>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users wrote:
> >>> Adrian,
> >>>
> >>> Can you try
> >>> mpirun --mca btl_vader_copy_mechanism none ...
> >>>
> >>> Please double check the MCA parameter name, I am AFK.
> >>>
> >>> IIRC, the default copy mechanism used by vader directly accesses the remote process address space, and this requires some permission (ptrace?) that might be dropped by podman.
> >>>
> >>> Note Open MPI might not detect that both MPI tasks run on the same node because of podman.
> >>> If you use UCX, then btl/vader is not used at all (pml/ucx is used instead).
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>> Sent from my iPod
> >>>
> >>>> On Jul 12, 2019, at 18:33, Adrian Reber via users <users@lists.open-mpi.org> wrote:
> >>>>
> >>>> So upstream Podman was really fast and merged a PR which makes my wrapper unnecessary:
> >>>>
> >>>> Add support for --env-host: https://github.com/containers/libpod/pull/3557
> >>>>
> >>>> As commented in the PR, I can now start mpirun with Podman without a wrapper:
> >>>>
> >>>> $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host --security-opt label=disable -v /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test /home/mpi/ring
> >>>> Rank 0 has cleared MPI_Init
> >>>> Rank 1 has cleared MPI_Init
> >>>> Rank 0 has completed ring
> >>>> Rank 0 has completed MPI_Barrier
> >>>> Rank 1 has completed ring
> >>>> Rank 1 has completed MPI_Barrier
> >>>>
> >>>> This example was using TCP; on an InfiniBand based system I have to map the InfiniBand devices into the container:
> >>>>
> >>>> $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host -v /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable --userns=keep-id --device /dev/infiniband/uverbs0 --device /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test /home/mpi/ring
> >>>> Rank 0 has cleared MPI_Init
> >>>> Rank 1 has cleared MPI_Init
> >>>> Rank 0 has completed ring
> >>>> Rank 0 has completed MPI_Barrier
> >>>> Rank 1 has completed ring
> >>>> Rank 1 has completed MPI_Barrier
> >>>>
> >>>> This is all running without root and only using Podman's rootless support.
> >>>>
> >>>> Running multiple processes on one system, however, still gives me an error. If I disable vader, I guess Open MPI uses TCP for localhost communication, and that works. But with vader it fails.
> >>>>
> >>>> The first error message I get is a segfault:
> >>>>
> >>>> [test1:00001] *** Process received signal ***
> >>>> [test1:00001] Signal: Segmentation fault (11)
> >>>> [test1:00001] Signal code: Address not mapped (1)
> >>>> [test1:00001] Failing at address: 0x7fb7b1552010
> >>>> [test1:00001] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
> >>>> [test1:00001] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
> >>>> [test1:00001] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
> >>>> [test1:00001] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
> >>>> [test1:00001] [ 4] /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
> >>>> [test1:00001] [ 5] /home/mpi/ring[0x400b76]
> >>>> [test1:00001] [ 6] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
> >>>> [test1:00001] [ 7] /home/mpi/ring[0x4008be]
> >>>> [test1:00001] *** End of error message ***
> >>>>
> >>>> Guessing that vader uses shared memory, this is expected to fail with all the namespace isolations in place - maybe not with a segfault, but each container has its own shared memory.
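(Side note: whether two containers actually see the same /dev/shm can be tested with a tiny probe like the one below. This is an untested sketch unrelated to how vader names its backing files; the segment name and the create/open protocol are invented for the test. Link with -lrt on older glibc.)

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Run "./shm_probe create" in one container, then "./shm_probe open"
     * in the other.  If the open succeeds, /dev/shm is shared. */
    int main(int argc, char **argv)
    {
        const char *name = "/shm_probe_test";

        if (argc > 1 && strcmp(argv[1], "create") == 0) {
            int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
            if (fd < 0) { perror("shm_open(create)"); return 1; }
            if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }
            printf("created %s, now run './shm_probe open' in the other container\n", name);
            pause(); /* keep the creating process around */
        } else {
            int fd = shm_open(name, O_RDONLY, 0);
            if (fd < 0) { perror("shm_open(open)"); return 1; }
            printf("peer segment is visible, /dev/shm is shared\n");
        }
        return 0;
    }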
> >>>> So the next step was to use the host's ipc and pid namespaces and mount /dev/shm:
> >>>>
> >>>> '-v /dev/shm:/dev/shm --ipc=host --pid=host'
> >>>>
> >>>> This does not segfault, but still does not look correct:
> >>>>
> >>>> Rank 0 has cleared MPI_Init
> >>>> Rank 1 has cleared MPI_Init
> >>>> Rank 2 has cleared MPI_Init
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> Rank 0 has completed ring
> >>>> Rank 2 has completed ring
> >>>> Rank 0 has completed MPI_Barrier
> >>>> Rank 1 has completed ring
> >>>> Rank 2 has completed MPI_Barrier
> >>>> Rank 1 has completed MPI_Barrier
> >>>>
> >>>> This is using the Open MPI ring.c example with SIZE increased from 20 to 20000.
> >>>>
> >>>> Any recommendations on what vader needs to communicate correctly?
> >>>>
> >>>> Adrian
> >>>>
> >>>>> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote:
> >>>>> Gilles,
> >>>>>
> >>>>> thanks for pointing out the environment variables. I quickly created a wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables (grep "\(PMIX\|OMPI\)"). Now it works:
> >>>>>
> >>>>> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id --net=host mpi-test /home/mpi/hello
> >>>>>
> >>>>> Hello, world (2 procs total)
> >>>>> --> Process # 0 of 2 is alive. ->test1
> >>>>> --> Process # 1 of 2 is alive. ->test2
> >>>>>
> >>>>> I need to tell Podman to mount /tmp from the host into the container; as I am running rootless, I also need to tell Podman to use the same user ID in the container as outside (so that the Open MPI files in /tmp can be shared), and I am also running without a network namespace.
> >>>>>
> >>>>> So this is now with the full Podman-provided isolation except the network namespace. Thanks for your help!
> >>>>>
> >>>>> Adrian
> >>>>>
> >>>>>> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users wrote:
> >>>>>> Adrian,
> >>>>>>
> >>>>>> the MPI application relies on some environment variables (they typically start with OMPI_ and PMIX_).
> >>>>>>
> >>>>>> The MPI application internally uses a PMIx client that must be able to contact a PMIx server (that is included in mpirun and the orted daemon(s) spawned on the remote hosts) located on the same host.
> >>>>>>
> >>>>>> If podman provides some isolation between the app inside the container (e.g. /home/mpi/hello) and the outside world (e.g. mpirun/orted), that won't be an easy ride.
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Gilles
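(For the record, a quick way to see whether the OMPI_/PMIX_ environment actually reaches the process inside the container is something like the probe below. This is an untested sketch; OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE are the per-rank variables Open MPI 4.x sets, so treat the names as an assumption for other versions.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    extern char **environ;

    int main(void)
    {
        const char *rank = getenv("OMPI_COMM_WORLD_RANK");
        const char *size = getenv("OMPI_COMM_WORLD_SIZE");
        printf("OMPI_COMM_WORLD_RANK=%s OMPI_COMM_WORLD_SIZE=%s\n",
               rank ? rank : "(unset)", size ? size : "(unset)");

        /* Dump everything a wrapper would be expected to forward. */
        for (char **e = environ; *e != NULL; e++) {
            if (strncmp(*e, "OMPI_", 5) == 0 || strncmp(*e, "PMIX_", 5) == 0) {
                puts(*e);
            }
        }
        return 0;
    }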
> >>>>>>> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
> >>>>>>> I did a quick test to see if I can use Podman in combination with Open MPI:
> >>>>>>>
> >>>>>>> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run quay.io/adrianreber/mpi-test /home/mpi/hello
> >>>>>>>
> >>>>>>> Hello, world (1 procs total)
> >>>>>>> --> Process # 0 of 1 is alive. ->789b8fb622ef
> >>>>>>>
> >>>>>>> Hello, world (1 procs total)
> >>>>>>> --> Process # 0 of 1 is alive. ->749eb4e1c01a
> >>>>>>>
> >>>>>>> The test program (hello) is taken from https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
> >>>>>>>
> >>>>>>> The problem with this is that each process thinks it is process 0 of 1 instead of
> >>>>>>>
> >>>>>>> Hello, world (2 procs total)
> >>>>>>> --> Process # 1 of 2 is alive. ->test1
> >>>>>>> --> Process # 0 of 2 is alive. ->test2
> >>>>>>>
> >>>>>>> My question is: how is the rank determined? What resources do I need to have in my container to correctly determine the rank?
> >>>>>>>
> >>>>>>> This is Podman 1.4.2 and Open MPI 4.0.1.
> >>>>>>>
> >>>>>>> Adrian

Adrian

--
Adrian Reber <adr...@lisas.de>
http://lisas.de/~adrian/
I retain the right to change my mind, as always. Le Linus e mobile. - Linus Torvalds

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users