I had a look at it and I am not sure if it really makes sense.

In btl_vader_{put,get}.c it would be easy to check the user namespace ID
of the other process, but the function would then just return OPAL_ERROR
a bit earlier instead of returning it as the result of a failed
process_vm_{read,write}v(). Nothing would really change.

A better place for the check would be mca_btl_vader_check_single_copy(),
but I do not know if the PIDs of the other processes are already known
at that point, so I am not sure if I can check the user namespace IDs of
the other processes there.

Any recommendations on how to do this?

                Adrian

On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote:
> Patches are always welcome. What would be great is a nice big warning that 
> CMA support is disabled because the processes are in different namespaces.
> Ideally all MPI processes should be in the same namespace to ensure the best
> performance. 
> 
> -Nathan
> 
> > On Jul 21, 2019, at 2:53 PM, Adrian Reber via users 
> > <users@lists.open-mpi.org> wrote:
> > 
> > For completeness I am mentioning my results also here.
> > 
> > Mounting file systems in the container only works if user namespaces
> > are used. And even if the user IDs are all the same (in each container
> > and on the host), the kernel's ptrace access check also verifies that
> > the processes are in the same user namespace (in addition to being
> > owned by the same user). This check - same user namespace - fails, and
> > so process_vm_readv() and process_vm_writev() fail as well.
> > 
> > So Open MPI's checks are currently not enough to detect if 'cma' can be
> > used. Checking for the same user namespace would also be necessary.
> > 
> > Is this a use case important enough to accept a patch for it?
> > 
> >        Adrian
> > 
> >> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote:
> >> Gilles,
> >> 
> >> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps
> >> indeed.
> >> 
> >> The default seems to be 'cma' and that seems to use process_vm_readv()
> >> and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but
> >> telling Podman to give the process CAP_SYS_PTRACE with 
> >> '--cap-add=SYS_PTRACE'
> >> does not seem to be enough. Not sure yet if this is related to the fact
> >> that Podman is running rootless. I will continue to investigate, but now
> >> I know where to look. Thanks!
> >> 
> >>        Adrian
> >> 
> >>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users 
> >>> wrote:
> >>> Adrian,
> >>> 
> >>> Can you try
> >>> mpirun --mca btl_vader_copy_mechanism none ...
> >>> 
> >>> Please double check the MCA parameter name, I am AFK
> >>> 
> >>> IIRC, the default copy mechanism used by vader directly accesses the 
> >>> remote process address space, and this requires some permission (ptrace?) 
> >>> that might be dropped by podman.
> >>> 
> >>> Note Open MPI might not detect that both MPI tasks run on the same node
> >>> because of podman.
> >>> If you use UCX, then btl/vader is not used at all (pml/ucx is used 
> >>> instead)
> >>> 
> >>> 
> >>> Cheers,
> >>> 
> >>> Gilles
> >>> 
> >>> Sent from my iPod
> >>> 
> >>>> On Jul 12, 2019, at 18:33, Adrian Reber via users 
> >>>> <users@lists.open-mpi.org> wrote:
> >>>> 
> >>>> So upstream Podman was really fast and merged a PR which makes my
> >>>> wrapper unnecessary:
> >>>> 
> >>>> Add support for --env-host : 
> >>>> https://github.com/containers/libpod/pull/3557
> >>>> 
> >>>> As commented in the PR I can now start mpirun with Podman without a
> >>>> wrapper:
> >>>> 
> >>>> $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun 
> >>>> podman run --env-host --security-opt label=disable -v 
> >>>> /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host 
> >>>> mpi-test /home/mpi/ring
> >>>> Rank 0 has cleared MPI_Init
> >>>> Rank 1 has cleared MPI_Init
> >>>> Rank 0 has completed ring
> >>>> Rank 0 has completed MPI_Barrier
> >>>> Rank 1 has completed ring
> >>>> Rank 1 has completed MPI_Barrier
> >>>> 
> >>>> This example was using TCP; on an InfiniBand-based system I have to
> >>>> map the InfiniBand devices into the container.
> >>>> 
> >>>> $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base 
> >>>> /tmp/podman-mpirun podman run --env-host -v 
> >>>> /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable 
> >>>> --userns=keep-id --device /dev/infiniband/uverbs0 --device 
> >>>> /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host 
> >>>> mpi-test /home/mpi/ring
> >>>> Rank 0 has cleared MPI_Init
> >>>> Rank 1 has cleared MPI_Init
> >>>> Rank 0 has completed ring
> >>>> Rank 0 has completed MPI_Barrier
> >>>> Rank 1 has completed ring
> >>>> Rank 1 has completed MPI_Barrier
> >>>> 
> >>>> This is all running without root and only using Podman's rootless
> >>>> support.
> >>>> 
> >>>> Running multiple processes on one system, however, still gives me an
> >>>> error. If I disable vader I guess that Open MPI is using TCP for
> >>>> localhost communication and that works. But with vader it fails.
> >>>> 
> >>>> The first error message I get is a segfault:
> >>>> 
> >>>> [test1:00001] *** Process received signal ***
> >>>> [test1:00001] Signal: Segmentation fault (11)
> >>>> [test1:00001] Signal code: Address not mapped (1)
> >>>> [test1:00001] Failing at address: 0x7fb7b1552010
> >>>> [test1:00001] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
> >>>> [test1:00001] [ 1] 
> >>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
> >>>> [test1:00001] [ 2] 
> >>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
> >>>> [test1:00001] [ 3] 
> >>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
> >>>> [test1:00001] [ 4] 
> >>>> /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
> >>>> [test1:00001] [ 5] /home/mpi/ring[0x400b76]
> >>>> [test1:00001] [ 6] 
> >>>> /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
> >>>> [test1:00001] [ 7] /home/mpi/ring[0x4008be]
> >>>> [test1:00001] *** End of error message ***
> >>>> 
> >>>> Guessing that vader uses shared memory, this is expected to fail with
> >>>> all the namespace isolations in place - maybe not with a segfault, but
> >>>> each container has its own shared memory. So the next step was to use
> >>>> the host's IPC and PID namespaces and mount /dev/shm:
> >>>> 
> >>>> '-v /dev/shm:/dev/shm --ipc=host --pid=host'
> >>>> 
> >>>> Which does not segfault, but still does not look correct:
> >>>> 
> >>>> Rank 0 has cleared MPI_Init
> >>>> Rank 1 has cleared MPI_Init
> >>>> Rank 2 has cleared MPI_Init
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> [test1:17722] Read -1, expected 80000, errno = 1
> >>>> Rank 0 has completed ring
> >>>> Rank 2 has completed ring
> >>>> Rank 0 has completed MPI_Barrier
> >>>> Rank 1 has completed ring
> >>>> Rank 2 has completed MPI_Barrier
> >>>> Rank 1 has completed MPI_Barrier
> >>>> 
> >>>> This is using the Open MPI ring.c example with SIZE increased from 20 to 
> >>>> 20000.
> >>>> 
> >>>> Any recommendations on what vader needs to communicate correctly?
> >>>> 
> >>>>       Adrian
> >>>> 
> >>>>> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote:
> >>>>> Gilles,
> >>>>> 
> >>>>> thanks for pointing out the environment variables. I quickly created a
> >>>>> wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables
> >>>>> (grep "\(PMIX\|OMPI\)"). Now it works:
> >>>>> 
> >>>>> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id 
> >>>>> --net=host mpi-test /home/mpi/hello
> >>>>> 
> >>>>> Hello, world (2 procs total)
> >>>>>   --> Process #   0 of   2 is alive. ->test1
> >>>>>   --> Process #   1 of   2 is alive. ->test2
> >>>>> 
> >>>>> I need to tell Podman to mount /tmp from the host into the container.
> >>>>> As I am running rootless, I also need to tell Podman to use the same
> >>>>> user ID in the container as outside (so that the Open MPI files in
> >>>>> /tmp can be shared), and I am also running without a network namespace.
> >>>>> 
> >>>>> So this is now with the full Podman-provided isolation except for the
> >>>>> network namespace. Thanks for your help!
> >>>>> 
> >>>>>       Adrian
> >>>>> 
> >>>>>> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via 
> >>>>>> users wrote:
> >>>>>> Adrian,
> >>>>>> 
> >>>>>> 
> >>>>>> the MPI application relies on some environment variables (they
> >>>>>> typically start with OMPI_ and PMIX_).
> >>>>>> 
> >>>>>> The MPI application internally uses a PMIx client that must be able
> >>>>>> to contact a PMIx server located on the same host (the server is
> >>>>>> included in mpirun and the orted daemon(s) spawned on the remote
> >>>>>> hosts).
> >>>>>> 
> >>>>>> 
> >>>>>> If podman provides some isolation between the app inside the
> >>>>>> container (e.g. /home/mpi/hello) and the outside world (e.g.
> >>>>>> mpirun/orted), that won't be an easy ride.
> >>>>>> 
> >>>>>> 
> >>>>>> Cheers,
> >>>>>> 
> >>>>>> 
> >>>>>> Gilles
> >>>>>> 
> >>>>>> 
> >>>>>>> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
> >>>>>>> I did a quick test to see if I can use Podman in combination with Open
> >>>>>>> MPI:
> >>>>>>> 
> >>>>>>> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run 
> >>>>>>> quay.io/adrianreber/mpi-test /home/mpi/hello
> >>>>>>> 
> >>>>>>> Hello, world (1 procs total)
> >>>>>>>    --> Process #   0 of   1 is alive. ->789b8fb622ef
> >>>>>>> 
> >>>>>>> Hello, world (1 procs total)
> >>>>>>>    --> Process #   0 of   1 is alive. ->749eb4e1c01a
> >>>>>>> 
> >>>>>>> The test program (hello) is taken from 
> >>>>>>> https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
> >>>>>>> 
> >>>>>>> 
> >>>>>>> The problem with this is that each process thinks it is process 0 of 1
> >>>>>>> instead of
> >>>>>>> 
> >>>>>>> Hello, world (2 procs total)
> >>>>>>>    --> Process #   1 of   2 is alive.  ->test1
> >>>>>>>    --> Process #   0 of   2 is alive.  ->test2
> >>>>>>> 
> >>>>>>> My question is: how is the rank determined? What resources do I need
> >>>>>>> to have in my container to correctly determine the rank?
> >>>>>>> 
> >>>>>>> This is Podman 1.4.2 and Open MPI 4.0.1.
> >>>>>>> 
> >>>>>>>       Adrian


-- 
Adrian Reber <adr...@lisas.de>            http://lisas.de/~adrian/
I retain the right to change my mind, as always. Le Linus e mobile.

        - Linus Torvalds
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
