If that works, then it might be possible to include the namespace ID in the 
job-info provided by PMIx at startup - would have to investigate, so please 
confirm that the modex option works first.

> On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users 
> <users@lists.open-mpi.org> wrote:
> 
> Adrian,
> 
> 
> An option is to involve the modex.
> 
> Each task would OPAL_MODEX_SEND() its own namespace ID, and then
> OPAL_MODEX_RECV() the one from each of its peers and decide whether
> CMA support can be enabled.
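> 
> For illustration only, here is a rough sketch (plain C, not actual Open MPI
> code) of how each task could obtain a value worth publishing: the inode of
> /proc/self/ns/user identifies the user namespace, so two tasks with equal
> inodes can safely use CMA. The helper name below is made up.
> 
> #include <stdint.h>
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <unistd.h>
> 
> /* Return an identifier for this process' user namespace:
>  * the inode of /proc/self/ns/user (0 on error). */
> static uint64_t my_user_ns_id(void)
> {
>     struct stat st;
> 
>     if (stat("/proc/self/ns/user", &st) != 0) {
>         return 0;
>     }
>     return (uint64_t) st.st_ino;
> }
> 
> /* Each rank would OPAL_MODEX_SEND() this value; after OPAL_MODEX_RECV()
>  * of a peer's value, CMA would only be enabled if the two IDs are equal. */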
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> On 7/22/2019 4:53 PM, Adrian Reber via users wrote:
>> I had a look at it and I am not sure if it really makes sense.
>> 
>> In btl_vader_{put,get}.c it would be easy to check for the user
>> namespace ID of the other process, but the function would then just
>> return OPAL_ERROR a bit earlier instead of as a result of
>> process_vm_{read,write}v(). Nothing would really change.
>> 
>> A better place for the check would be mca_btl_vader_check_single_copy(),
>> but I do not know whether the PID of the other processes is already known
>> at that point. I am also not sure whether I can check the user namespace
>> ID of the other processes there.
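>> 
>> For illustration, roughly what I have in mind (a plain C sketch, not vader
>> code; it assumes the peer's PID is known and its /proc entry is visible
>> from our mount namespace, which may not hold with Podman):
>> 
>> #include <stdbool.h>
>> #include <stdio.h>
>> #include <string.h>
>> #include <sys/types.h>
>> #include <unistd.h>
>> 
>> /* Hypothetical check: does 'pid' live in the same user namespace as we
>>  * do? Compares the targets of the /proc/<pid>/ns/user symlinks. */
>> static bool same_user_ns(pid_t pid)
>> {
>>     char path[64], self_ns[64], peer_ns[64];
>>     ssize_t n;
>> 
>>     n = readlink("/proc/self/ns/user", self_ns, sizeof(self_ns) - 1);
>>     if (n < 0) return false;
>>     self_ns[n] = '\0';
>> 
>>     snprintf(path, sizeof(path), "/proc/%d/ns/user", (int) pid);
>>     n = readlink(path, peer_ns, sizeof(peer_ns) - 1);
>>     if (n < 0) return false;
>>     peer_ns[n] = '\0';
>> 
>>     /* the links look like "user:[4026531837]" */
>>     return strcmp(self_ns, peer_ns) == 0;
>> }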
>> 
>> Any recommendations on how to do this?
>> 
>>              Adrian
>> 
>> On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote:
>>> Patches are always welcome. What would be great is a nice big warning that
>>> CMA support is disabled because the processes are in different namespaces.
>>> Ideally all MPI processes should be in the same namespace to ensure the
>>> best performance.
>>> 
>>> -Nathan
>>> 
>>>> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users 
>>>> <users@lists.open-mpi.org> wrote:
>>>> 
>>>> For completeness I am mentioning my results also here.
>>>> 
>>>> Mounting file systems in the container only works if user namespaces are
>>>> used. And even if the user IDs are all the same (in each container and on
>>>> the host), the kernel's ptrace permission check also requires the
>>>> processes to be in the same user namespace (in addition to being owned by
>>>> the same user). This check - same user namespace - fails, and so
>>>> process_vm_readv() and process_vm_writev() also fail.
>>>> 
>>>> So Open MPI's checks are currently not enough to detect if 'cma' can be
>>>> used. Checking for the same user namespace would also be necessary.
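>>>> 
>>>> For anyone who wants to reproduce this outside of Open MPI, a minimal
>>>> standalone test shows the failure mode: given the PID of a process in
>>>> another container, process_vm_readv() returns -1 with errno EPERM even
>>>> though both processes run under the same UID. The remote address below
>>>> is arbitrary; the permission check happens before the address is used.
>>>> 
>>>> #define _GNU_SOURCE
>>>> #include <errno.h>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <string.h>
>>>> #include <sys/types.h>
>>>> #include <sys/uio.h>
>>>> 
>>>> int main(int argc, char **argv)
>>>> {
>>>>     if (argc < 2) {
>>>>         fprintf(stderr, "usage: %s <pid>\n", argv[0]);
>>>>         return 1;
>>>>     }
>>>>     pid_t target = (pid_t) atoi(argv[1]);  /* PID of a process in another container */
>>>>     char buf[16];
>>>>     struct iovec local  = { .iov_base = buf, .iov_len = sizeof(buf) };
>>>>     struct iovec remote = { .iov_base = (void *) 0x400000, .iov_len = sizeof(buf) };
>>>> 
>>>>     if (process_vm_readv(target, &local, 1, &remote, 1, 0) == -1) {
>>>>         /* across user namespaces this prints "Operation not permitted" */
>>>>         printf("process_vm_readv: %s\n", strerror(errno));
>>>>     }
>>>>     return 0;
>>>> }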
>>>> 
>>>> Is this a use case important enough to accept a patch for it?
>>>> 
>>>>        Adrian
>>>> 
>>>>> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote:
>>>>> Gilles,
>>>>> 
>>>>> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps
>>>>> indeed.
>>>>> 
>>>>> The default seems to be 'cma', which uses process_vm_readv() and
>>>>> process_vm_writev() and therefore seems to require CAP_SYS_PTRACE. But
>>>>> telling Podman to give the process CAP_SYS_PTRACE with
>>>>> '--cap-add=SYS_PTRACE' does not seem to be enough. Not sure yet whether
>>>>> this is related to the fact that Podman is running rootless. I will
>>>>> continue to investigate, but now I know where to look. Thanks!
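>>>>> 
>>>>> One sanity check I can do from inside the container is to look at the
>>>>> CapEff bitmap in /proc/self/status (CAP_SYS_PTRACE is capability bit 19).
>>>>> A quick sketch:
>>>>> 
>>>>> #include <stdio.h>
>>>>> 
>>>>> /* Print whether CAP_SYS_PTRACE (bit 19) is in the effective capability
>>>>>  * set of the current process, read from /proc/self/status. */
>>>>> int main(void)
>>>>> {
>>>>>     char line[256];
>>>>>     FILE *f = fopen("/proc/self/status", "r");
>>>>>     if (f == NULL) return 1;
>>>>> 
>>>>>     while (fgets(line, sizeof(line), f) != NULL) {
>>>>>         unsigned long long cap_eff;
>>>>>         if (sscanf(line, "CapEff: %llx", &cap_eff) == 1) {
>>>>>             printf("CAP_SYS_PTRACE %s\n",
>>>>>                    (cap_eff >> 19) & 1ULL ? "present" : "missing");
>>>>>             break;
>>>>>         }
>>>>>     }
>>>>>     fclose(f);
>>>>>     return 0;
>>>>> }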
>>>>> 
>>>>>        Adrian
>>>>> 
>>>>>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users 
>>>>>> wrote:
>>>>>> Adrian,
>>>>>> 
>>>>>> Can you try
>>>>>> mpirun --mca btl_vader_copy_mechanism none ...
>>>>>> 
>>>>>> Please double check the MCA parameter name, I am AFK
>>>>>> 
>>>>>> IIRC, the default copy mechanism used by vader directly accesses the 
>>>>>> remote process address space, and this requires some permission 
>>>>>> (ptrace?) that might be dropped by podman.
>>>>>> 
>>>>>> Note that Open MPI might not detect that both MPI tasks run on the same
>>>>>> node because of podman.
>>>>>> If you use UCX, then btl/vader is not used at all (pml/ucx is used 
>>>>>> instead)
>>>>>> 
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Gilles
>>>>>> 
>>>>>> Sent from my iPod
>>>>>> 
>>>>>>> On Jul 12, 2019, at 18:33, Adrian Reber via users 
>>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>>> 
>>>>>>> So upstream Podman was really fast and merged a PR which makes my
>>>>>>> wrapper unnecessary:
>>>>>>> 
>>>>>>> Add support for --env-host : 
>>>>>>> https://github.com/containers/libpod/pull/3557
>>>>>>> 
>>>>>>> As commented in the PR I can now start mpirun with Podman without a
>>>>>>> wrapper:
>>>>>>> 
>>>>>>> $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun 
>>>>>>> podman run --env-host --security-opt label=disable -v 
>>>>>>> /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host 
>>>>>>> mpi-test /home/mpi/ring
>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>> Rank 0 has completed ring
>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>> Rank 1 has completed ring
>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>> 
>>>>>>> This example was using TCP; on an InfiniBand-based system I have to map
>>>>>>> the InfiniBand devices into the container.
>>>>>>> 
>>>>>>> $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base 
>>>>>>> /tmp/podman-mpirun podman run --env-host -v 
>>>>>>> /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable 
>>>>>>> --userns=keep-id --device /dev/infiniband/uverbs0 --device 
>>>>>>> /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host 
>>>>>>> mpi-test /home/mpi/ring
>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>> Rank 0 has completed ring
>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>> Rank 1 has completed ring
>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>> 
>>>>>>> This is all running without root and only using Podman's rootless
>>>>>>> support.
>>>>>>> 
>>>>>>> Running multiple processes on one system, however, still gives me an
>>>>>>> error. If I disable vader, I guess Open MPI uses TCP for localhost
>>>>>>> communication, and that works. But with vader it fails.
>>>>>>> 
>>>>>>> The first error message I get is a segfault:
>>>>>>> 
>>>>>>> [test1:00001] *** Process received signal ***
>>>>>>> [test1:00001] Signal: Segmentation fault (11)
>>>>>>> [test1:00001] Signal code: Address not mapped (1)
>>>>>>> [test1:00001] Failing at address: 0x7fb7b1552010
>>>>>>> [test1:00001] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
>>>>>>> [test1:00001] [ 1] 
>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
>>>>>>> [test1:00001] [ 2] 
>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
>>>>>>> [test1:00001] [ 3] 
>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
>>>>>>> [test1:00001] [ 4] 
>>>>>>> /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
>>>>>>> [test1:00001] [ 5] /home/mpi/ring[0x400b76]
>>>>>>> [test1:00001] [ 6] 
>>>>>>> /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
>>>>>>> [test1:00001] [ 7] /home/mpi/ring[0x4008be]
>>>>>>> [test1:00001] *** End of error message ***
>>>>>>> 
>>>>>>> Guessing that vader uses shared memory, this is expected to fail with
>>>>>>> all the namespace isolation in place - maybe not with a segfault, but
>>>>>>> each container has its own shared memory. So the next step was to use
>>>>>>> the host's IPC and PID namespaces and mount /dev/shm:
>>>>>>> 
>>>>>>> '-v /dev/shm:/dev/shm --ipc=host --pid=host'
>>>>>>> 
>>>>>>> This does not segfault, but it still does not look correct:
>>>>>>> 
>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>> Rank 2 has cleared MPI_Init
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>> Rank 0 has completed ring
>>>>>>> Rank 2 has completed ring
>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>> Rank 1 has completed ring
>>>>>>> Rank 2 has completed MPI_Barrier
>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>> 
>>>>>>> This is using the Open MPI ring.c example with SIZE increased from 20 
>>>>>>> to 20000.
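>>>>>>> 
>>>>>>> If the payload is an array of SIZE ints, 20000 * 4 bytes matches the
>>>>>>> 80000 bytes in the failed reads above. For reference, a stripped-down
>>>>>>> sketch of such a ring (not the actual ring.c) looks like this:
>>>>>>> 
>>>>>>> #include <mpi.h>
>>>>>>> #include <stdio.h>
>>>>>>> 
>>>>>>> #define SIZE 20000   /* bumped from 20: each hop now moves 80000 bytes */
>>>>>>> 
>>>>>>> int main(int argc, char **argv)
>>>>>>> {
>>>>>>>     int rank, nprocs;
>>>>>>>     static int buf[SIZE];              /* message payload */
>>>>>>> 
>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>>>>>> 
>>>>>>>     int next = (rank + 1) % nprocs;
>>>>>>>     int prev = (rank + nprocs - 1) % nprocs;
>>>>>>> 
>>>>>>>     /* pass the buffer once around the ring (assumes >= 2 ranks) */
>>>>>>>     if (rank == 0) {
>>>>>>>         MPI_Send(buf, SIZE, MPI_INT, next, 0, MPI_COMM_WORLD);
>>>>>>>         MPI_Recv(buf, SIZE, MPI_INT, prev, 0, MPI_COMM_WORLD,
>>>>>>>                  MPI_STATUS_IGNORE);
>>>>>>>     } else {
>>>>>>>         MPI_Recv(buf, SIZE, MPI_INT, prev, 0, MPI_COMM_WORLD,
>>>>>>>                  MPI_STATUS_IGNORE);
>>>>>>>         MPI_Send(buf, SIZE, MPI_INT, next, 0, MPI_COMM_WORLD);
>>>>>>>     }
>>>>>>>     printf("Rank %d has completed ring\n", rank);
>>>>>>> 
>>>>>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>     printf("Rank %d has completed MPI_Barrier\n", rank);
>>>>>>>     MPI_Finalize();
>>>>>>>     return 0;
>>>>>>> }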
>>>>>>> 
>>>>>>> Any recommendations on what vader needs in order to communicate correctly?
>>>>>>> 
>>>>>>>       Adrian
>>>>>>> 
>>>>>>>> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote:
>>>>>>>> Gilles,
>>>>>>>> 
>>>>>>>> thanks for pointing out the environment variables. I quickly created a
>>>>>>>> wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables
>>>>>>>> (grep "\(PMIX\|OMPI\)"). Now it works:
>>>>>>>> 
>>>>>>>> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id 
>>>>>>>> --net=host mpi-test /home/mpi/hello
>>>>>>>> 
>>>>>>>> Hello, world (2 procs total)
>>>>>>>>   --> Process #   0 of   2 is alive. ->test1
>>>>>>>>   --> Process #   1 of   2 is alive. ->test2
>>>>>>>> 
>>>>>>>> I need to tell Podman to mount /tmp from the host into the container.
>>>>>>>> As I am running rootless, I also need to tell Podman to use the same
>>>>>>>> user ID in the container as outside (so that the Open MPI files in
>>>>>>>> /tmp can be shared), and I am also running without a network namespace.
>>>>>>>> 
>>>>>>>> So this now runs with the full Podman-provided isolation except for
>>>>>>>> the network namespace. Thanks for your help!
>>>>>>>> 
>>>>>>>>       Adrian
>>>>>>>> 
>>>>>>>>> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via 
>>>>>>>>> users wrote:
>>>>>>>>> Adrian,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The MPI application relies on some environment variables (they
>>>>>>>>> typically start with OMPI_ and PMIX_).
>>>>>>>>> 
>>>>>>>>> The MPI application internally uses a PMIx client that must be able
>>>>>>>>> to contact a PMIx server located on the same host (the server is
>>>>>>>>> included in mpirun and in the orted daemon(s) spawned on the remote
>>>>>>>>> hosts).
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> If podman provides some isolation between the app inside the container
>>>>>>>>> (e.g. /home/mpi/hello) and the outside world (e.g. mpirun/orted), that
>>>>>>>>> won't be an easy ride.
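>>>>>>>>> 
>>>>>>>>> As a quick way to see from inside the container whether that launch
>>>>>>>>> environment made it through, one can print a few of the per-rank
>>>>>>>>> variables Open MPI exports (illustrative only; the rank itself is
>>>>>>>>> negotiated via the PMIx client, not read from these):
>>>>>>>>> 
>>>>>>>>> #include <stdio.h>
>>>>>>>>> #include <stdlib.h>
>>>>>>>>> 
>>>>>>>>> /* Print a few variables mpirun/orted export for each rank. If they
>>>>>>>>>  * are unset inside the container, the launch environment (including
>>>>>>>>>  * the PMIx contact information) was not passed through. */
>>>>>>>>> int main(void)
>>>>>>>>> {
>>>>>>>>>     const char *vars[] = { "OMPI_COMM_WORLD_RANK",
>>>>>>>>>                            "OMPI_COMM_WORLD_SIZE",
>>>>>>>>>                            "OMPI_COMM_WORLD_LOCAL_RANK" };
>>>>>>>>>     /* ...plus whatever PMIX_* variables the launcher exported */
>>>>>>>>> 
>>>>>>>>>     for (size_t i = 0; i < sizeof(vars) / sizeof(vars[0]); i++) {
>>>>>>>>>         const char *v = getenv(vars[i]);
>>>>>>>>>         printf("%s=%s\n", vars[i], v ? v : "(unset)");
>>>>>>>>>     }
>>>>>>>>>     return 0;
>>>>>>>>> }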
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Gilles
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
>>>>>>>>>> I did a quick test to see if I can use Podman in combination with 
>>>>>>>>>> Open
>>>>>>>>>> MPI:
>>>>>>>>>> 
>>>>>>>>>> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run 
>>>>>>>>>> quay.io/adrianreber/mpi-test /home/mpi/hello
>>>>>>>>>> 
>>>>>>>>>> Hello, world (1 procs total)
>>>>>>>>>>    --> Process #   0 of   1 is alive. ->789b8fb622ef
>>>>>>>>>> 
>>>>>>>>>> Hello, world (1 procs total)
>>>>>>>>>>    --> Process #   0 of   1 is alive. ->749eb4e1c01a
>>>>>>>>>> 
>>>>>>>>>> The test program (hello) is taken from 
>>>>>>>>>> https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> The problem with this is that each process thinks it is process 0 of 1
>>>>>>>>>> instead of
>>>>>>>>>> 
>>>>>>>>>> Hello, world (2 procs total)
>>>>>>>>>>    --> Process #   1 of   2 is alive.  ->test1
>>>>>>>>>>    --> Process #   0 of   2 is alive.  ->test2
>>>>>>>>>> 
>>>>>>>>>> My question is: how is the rank determined? What resources do I need
>>>>>>>>>> to have in my container to correctly determine the rank?
>>>>>>>>>> 
>>>>>>>>>> This is Podman 1.4.2 and Open MPI 4.0.1.
>>>>>>>>>> 
>>>>>>>>>>       Adrian

