On Jul 24, 2019, at 5:16 PM, Ralph Castain via users <users@lists.open-mpi.org> wrote:
>
> It doesn't work that way, as you discovered. You need to add this information
> at the same place where vader currently calls modex send, and then retrieve
> it at the same place vader currently calls modex recv. Those macros don't do
> an immediate send/recv like you are thinking - the send simply adds the value
> to an aggregated payload, then the "fence" call distributes that payload to
> everyone, and then the read extracts the requested piece from that payload.
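In code, the pattern Ralph describes looks roughly like this (a sketch only, not
vader's actual code: peer_proc stands for the opal_process_name_t of the peer
whose key you want, my_user_ns_id is a placeholder for the locally determined
value, and the "user_ns_id" key is the one Adrian uses further down in this
thread):

    int rc;
    int my_user_ns_id = 0;   /* placeholder: the locally determined value */

    /* 1. Contribute the value to this process's modex blob.  Nothing is
     *    transmitted here; the macro only stages the data. */
    OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id",
                          &my_user_ns_id, OPAL_INT);

    /* 2. The "fence" deep inside ompi_mpi_init() exchanges all staged
     *    blobs between the processes. */

    /* 3. Afterwards, read the peer's value out of the payload that was
     *    already delivered by the fence. */
    int *peer_ns_id = NULL;
    OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", &peer_proc,
                          &peer_ns_id, OPAL_INT);
    if (OPAL_SUCCESS == rc && NULL != peer_ns_id &&
        *peer_ns_id != my_user_ns_id) {
        /* the peer lives in a different user namespace: disable CMA */
    }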
Just to expand on what Ralph said, think of it like this:

1. Each component/module does a modex "send", which just memcopies the data blob.
2. The "fence()" is deep within ompi_mpi_init(), and it does the actual data
   exchange of all the module blobs in an efficient manner.
3. Each component/module can then later do a modex "receive", which just
   memcopies the relevant blob from the module blobs that were actually
   received in step #2.

(BTW, "modex" = "module exchange")

>
>> On Jul 24, 2019, at 5:23 AM, Adrian Reber <adr...@lisas.de> wrote:
>>
>> On Mon, Jul 22, 2019 at 04:30:50PM +0000, Ralph Castain wrote:
>>>> On Jul 22, 2019, at 9:20 AM, Adrian Reber <adr...@lisas.de> wrote:
>>>>
>>>> I have most of the code ready, but I still have trouble doing
>>>> OPAL_MODEX_RECV. I am using the following lines, based on the code from
>>>> orte/test/mpi/pmix.c:
>>>>
>>>> OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", &value, OPAL_INT);
>>>>
>>>> This sets rc to 0. For receiving:
>>>>
>>>> OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", &wildcard_rank, &ptr, OPAL_INT);
>>>
>>> You need to replace "wildcard_rank" with the process name of the proc that
>>> published the "user_ns_id" key. If/when we have mpirun provide the value,
>>> then you can retrieve it from the wildcard rank, as it will be coming from
>>> the system and not an application proc.
>>
>> So I can get the user namespace ID from all involved processes back to
>> the main process (MCA_BTL_VADER_LOCAL_RANK == 0). But now only this
>> process knows that the user namespace IDs are different, and I have
>> trouble using MODEX to send the information (do not use CMA) back to the
>> other involved processes. It seems I am not able to use MODEX_{SEND,RECV}
>> at the same time. One process sends and then waits on a receive from the
>> other processes. Something like this works:
>>
>> PROC 0    PROC 1
>> recv()    send()
>>
>> But this does not work:
>>
>> PROC 0    PROC 1
>> recv()    send()
>> send()    recv()
>>
>> If I start the recv() immediately after the send() on PROC 1, no messages
>> are delivered anymore and everything hangs, even if different MODEX keys
>> are used. It seems like MODEX cannot fetch messages in a different order
>> than they were sent. Is that so?
>>
>> Not sure how to tell the other processes to not use CMA while some
>> processes are still transmitting their user namespace ID to PROC 0.
>>
>> Adrian
>>
>>>> and rc is always set to -13. Is this how it is supposed to work, or do I
>>>> have to do it differently?
>>>>
>>>> Adrian
>>>>
>>>> On Mon, Jul 22, 2019 at 02:03:20PM +0000, Ralph Castain via users wrote:
>>>>> If that works, then it might be possible to include the namespace ID in
>>>>> the job-info provided by PMIx at startup - would have to investigate, so
>>>>> please confirm that the modex option works first.
>>>>>
>>>>>> On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users
>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>>
>>>>>> Adrian,
>>>>>>
>>>>>> An option is to involve the modex.
>>>>>>
>>>>>> Each task would OPAL_MODEX_SEND() its own namespace ID, then
>>>>>> OPAL_MODEX_RECV() the one from its peers and decide whether CMA
>>>>>> support can be enabled.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On 7/22/2019 4:53 PM, Adrian Reber via users wrote:
>>>>>>> I had a look at it and I am not sure if it really makes sense.
>>>>>>>
>>>>>>> In btl_vader_{put,get}.c it would be easy to check for the user
>>>>>>> namespace ID of the other process, but the function would then just
>>>>>>> return OPAL_ERROR a bit earlier instead of as a result of
>>>>>>> process_vm_{read,write}v(). Nothing would really change.
>>>>>>>
>>>>>>> A better place for the check would be mca_btl_vader_check_single_copy(),
>>>>>>> but I do not know if at this point the PIDs of the other processes are
>>>>>>> already known. Not sure if I can check for the user namespace ID of the
>>>>>>> other processes.
>>>>>>>
>>>>>>> Any recommendations on how to do this?
>>>>>>>
>>>>>>> Adrian
>>>>>>>
>>>>>>> On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote:
>>>>>>>> Patches are always welcome. What would be great is a nice big warning
>>>>>>>> that CMA support is disabled because the processes are in different
>>>>>>>> namespaces. Ideally all MPI processes should be in the same namespace
>>>>>>>> to ensure the best performance.
>>>>>>>>
>>>>>>>> -Nathan
>>>>>>>>
>>>>>>>>> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users
>>>>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>> For completeness I am mentioning my results also here.
>>>>>>>>>
>>>>>>>>> Mounting file systems in the container can only work if user
>>>>>>>>> namespaces are used, and even if the user IDs are all the same (in
>>>>>>>>> each container and on the host), to allow ptrace the kernel also
>>>>>>>>> checks whether the processes are in the same user namespace (in
>>>>>>>>> addition to being owned by the same user). This check - same user
>>>>>>>>> namespace - fails, and so process_vm_readv() and process_vm_writev()
>>>>>>>>> will also fail.
>>>>>>>>>
>>>>>>>>> So Open MPI's checks are currently not enough to detect whether 'cma'
>>>>>>>>> can be used. Checking for the same user namespace would also be
>>>>>>>>> necessary.
>>>>>>>>>
>>>>>>>>> Is this a use case important enough to accept a patch for it?
>>>>>>>>>
>>>>>>>>> Adrian
>>>>>>>>>
>>>>>>>>>> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote:
>>>>>>>>>> Gilles,
>>>>>>>>>>
>>>>>>>>>> thanks again. Adding '--mca btl_vader_single_copy_mechanism none'
>>>>>>>>>> indeed helps.
>>>>>>>>>>
>>>>>>>>>> The default seems to be 'cma', and that seems to use
>>>>>>>>>> process_vm_readv() and process_vm_writev(). That seems to require
>>>>>>>>>> CAP_SYS_PTRACE, but telling Podman to give the process
>>>>>>>>>> CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE' does not seem to be
>>>>>>>>>> enough. Not sure yet if this is related to the fact that Podman is
>>>>>>>>>> running rootless. I will continue to investigate, but now I know
>>>>>>>>>> where to look. Thanks!
>>>>>>>>>>
>>>>>>>>>> Adrian
>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users wrote:
>>>>>>>>>>> Adrian,
>>>>>>>>>>>
>>>>>>>>>>> Can you try
>>>>>>>>>>> mpirun --mca btl_vader_copy_mechanism none ...
>>>>>>>>>>>
>>>>>>>>>>> Please double check the MCA parameter name, I am AFK.
>>>>>>>>>>>
>>>>>>>>>>> IIRC, the default copy mechanism used by vader directly accesses
>>>>>>>>>>> the remote process address space, and this requires some permission
>>>>>>>>>>> (ptrace?) that might be dropped by podman.
>>>>>>>>>>>
>>>>>>>>>>> Note Open MPI might not detect that both MPI tasks run on the same
>>>>>>>>>>> node because of podman.
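For reference, the "user namespace ID" being compared in the exchange above can
be read on Linux as the inode of /proc/<pid>/ns/user; two processes are in the
same user namespace when that inode (and the underlying device) match. A
minimal, self-contained sketch of such a check follows (an illustration of the
idea only, not the check vader actually performs):

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Return the user namespace identifier of a process, or 0 on error
     * (e.g. no permission to inspect that process). */
    static unsigned long user_ns_id(pid_t pid)
    {
        char path[64];
        struct stat st;

        snprintf(path, sizeof(path), "/proc/%d/ns/user", (int) pid);
        if (stat(path, &st) != 0) {
            return 0;
        }
        return (unsigned long) st.st_ino;
    }

    /* CMA (process_vm_readv()/process_vm_writev()) can only be expected to
     * work when user_ns_id(getpid()) == user_ns_id(peer_pid) - and the
     * usual same-user/ptrace checks still apply on top of that. */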
>>>>>>>>>>> If you use UCX, then btl/vader is not used at all (pml/ucx is used
>>>>>>>>>>> instead).
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Gilles
>>>>>>>>>>>
>>>>>>>>>>> Sent from my iPod
>>>>>>>>>>>
>>>>>>>>>>>> On Jul 12, 2019, at 18:33, Adrian Reber via users
>>>>>>>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> So upstream Podman was really fast and merged a PR which makes my
>>>>>>>>>>>> wrapper unnecessary:
>>>>>>>>>>>>
>>>>>>>>>>>> Add support for --env-host:
>>>>>>>>>>>> https://github.com/containers/libpod/pull/3557
>>>>>>>>>>>>
>>>>>>>>>>>> As commented in the PR, I can now start mpirun with Podman without
>>>>>>>>>>>> a wrapper:
>>>>>>>>>>>>
>>>>>>>>>>>> $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host --security-opt label=disable -v /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test /home/mpi/ring
>>>>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>>>>> Rank 0 has completed ring
>>>>>>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>>>>>>> Rank 1 has completed ring
>>>>>>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>>>>>>>
>>>>>>>>>>>> This example was using TCP; on an InfiniBand based system I have
>>>>>>>>>>>> to map the InfiniBand devices into the container:
>>>>>>>>>>>>
>>>>>>>>>>>> $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host -v /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable --userns=keep-id --device /dev/infiniband/uverbs0 --device /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test /home/mpi/ring
>>>>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>>>>> Rank 0 has completed ring
>>>>>>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>>>>>>> Rank 1 has completed ring
>>>>>>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>>>>>>>
>>>>>>>>>>>> This is all running without root and only using Podman's rootless
>>>>>>>>>>>> support.
>>>>>>>>>>>>
>>>>>>>>>>>> Running multiple processes on one system, however, still gives me
>>>>>>>>>>>> an error. If I disable vader I guess that Open MPI is using TCP for
>>>>>>>>>>>> localhost communication, and that works. But with vader it fails.
>>>>>>>>>>>>
>>>>>>>>>>>> The first error message I get is a segfault:
>>>>>>>>>>>>
>>>>>>>>>>>> [test1:00001] *** Process received signal ***
>>>>>>>>>>>> [test1:00001] Signal: Segmentation fault (11)
>>>>>>>>>>>> [test1:00001] Signal code: Address not mapped (1)
>>>>>>>>>>>> [test1:00001] Failing at address: 0x7fb7b1552010
>>>>>>>>>>>> [test1:00001] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
>>>>>>>>>>>> [test1:00001] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
>>>>>>>>>>>> [test1:00001] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
>>>>>>>>>>>> [test1:00001] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
>>>>>>>>>>>> [test1:00001] [ 4] /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
>>>>>>>>>>>> [test1:00001] [ 5] /home/mpi/ring[0x400b76]
>>>>>>>>>>>> [test1:00001] [ 6] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
>>>>>>>>>>>> [test1:00001] [ 7] /home/mpi/ring[0x4008be]
>>>>>>>>>>>> [test1:00001] *** End of error message ***
>>>>>>>>>>>>
>>>>>>>>>>>> Guessing that vader uses shared memory, this is expected to fail
>>>>>>>>>>>> with all the namespace isolations in place. Maybe not with a
>>>>>>>>>>>> segfault, but each container has its own shared memory. So the next
>>>>>>>>>>>> step was to use the host's IPC and PID namespaces and mount
>>>>>>>>>>>> /dev/shm:
>>>>>>>>>>>>
>>>>>>>>>>>> '-v /dev/shm:/dev/shm --ipc=host --pid=host'
>>>>>>>>>>>>
>>>>>>>>>>>> This does not segfault, but it still does not look correct:
>>>>>>>>>>>>
>>>>>>>>>>>> Rank 0 has cleared MPI_Init
>>>>>>>>>>>> Rank 1 has cleared MPI_Init
>>>>>>>>>>>> Rank 2 has cleared MPI_Init
>>>>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>>>>>>>>>> Rank 0 has completed ring
>>>>>>>>>>>> Rank 2 has completed ring
>>>>>>>>>>>> Rank 0 has completed MPI_Barrier
>>>>>>>>>>>> Rank 1 has completed ring
>>>>>>>>>>>> Rank 2 has completed MPI_Barrier
>>>>>>>>>>>> Rank 1 has completed MPI_Barrier
>>>>>>>>>>>>
>>>>>>>>>>>> This is using the Open MPI ring.c example with SIZE increased from
>>>>>>>>>>>> 20 to 20000.
>>>>>>>>>>>>
>>>>>>>>>>>> Any recommendations on what vader needs to communicate correctly?
>>>>>>>>>>>>
>>>>>>>>>>>> Adrian
>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote:
>>>>>>>>>>>>> Gilles,
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks for pointing out the environment variables. I quickly
>>>>>>>>>>>>> created a wrapper which tells Podman to re-export all OMPI_ and
>>>>>>>>>>>>> PMIX_ variables (grep "\(PMIX\|OMPI\)").
>>>>>>>>>>>>> Now it works:
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id --net=host mpi-test /home/mpi/hello
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello, world (2 procs total)
>>>>>>>>>>>>> --> Process # 0 of 2 is alive. ->test1
>>>>>>>>>>>>> --> Process # 1 of 2 is alive. ->test2
>>>>>>>>>>>>>
>>>>>>>>>>>>> I need to tell Podman to mount /tmp from the host into the
>>>>>>>>>>>>> container. As I am running rootless, I also need to tell Podman to
>>>>>>>>>>>>> use the same user ID in the container as outside (so that the Open
>>>>>>>>>>>>> MPI files in /tmp can be shared), and I am also running without a
>>>>>>>>>>>>> network namespace.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So this is now with the full Podman provided isolation except the
>>>>>>>>>>>>> network namespace. Thanks for your help!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Adrian
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users wrote:
>>>>>>>>>>>>>> Adrian,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The MPI application relies on some environment variables (they
>>>>>>>>>>>>>> typically start with OMPI_ and PMIX_).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The MPI application internally uses a PMIx client that must be
>>>>>>>>>>>>>> able to contact a PMIx server located on the same host (the
>>>>>>>>>>>>>> server is included in mpirun and the orted daemon(s) spawned on
>>>>>>>>>>>>>> the remote hosts).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If podman provides some isolation between the app inside the
>>>>>>>>>>>>>> container (e.g. /home/mpi/hello) and the outside world (e.g.
>>>>>>>>>>>>>> mpirun/orted), that won't be an easy ride.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Gilles
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
>>>>>>>>>>>>>>> I did a quick test to see if I can use Podman in combination
>>>>>>>>>>>>>>> with Open MPI:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run quay.io/adrianreber/mpi-test /home/mpi/hello
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello, world (1 procs total)
>>>>>>>>>>>>>>> --> Process # 0 of 1 is alive. ->789b8fb622ef
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello, world (1 procs total)
>>>>>>>>>>>>>>> --> Process # 0 of 1 is alive. ->749eb4e1c01a
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The test program (hello) is taken from
>>>>>>>>>>>>>>> https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The problem with this is that each process thinks it is process
>>>>>>>>>>>>>>> 0 of 1, instead of:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello, world (2 procs total)
>>>>>>>>>>>>>>> --> Process # 1 of 2 is alive. ->test1
>>>>>>>>>>>>>>> --> Process # 0 of 2 is alive. ->test2
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My question is: how is the rank determined? What resources do I
>>>>>>>>>>>>>>> need to have in my container to correctly determine the rank?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is Podman 1.4.2 and Open MPI 4.0.1.
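As Gilles explains above, the rank only comes out right if the OMPI_*/PMIX_*
environment and the PMIx server connection survive the container launch;
otherwise each process starts as a singleton (rank 0 of 1), which is what the
output above shows. A small, hypothetical diagnostic (not part of this thread)
that can be run inside the container to see whether the launcher's variables
made it through:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Two of the variables Open MPI's launcher exports; the various
         * PMIX_* variables must also be present so the PMIx client can
         * find its server. */
        const char *vars[] = { "OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE" };

        for (unsigned i = 0; i < sizeof(vars) / sizeof(vars[0]); i++) {
            const char *val = getenv(vars[i]);
            printf("%s=%s\n", vars[i], val ? val : "(unset)");
        }
        return 0;
    }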
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Adrian
>>>>>>>
>>>>>>> Adrian

--
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users