Re: [OMPI users] How is the rank determined (Open MPI and Podman)
On Wed, Jul 24, 2019 at 09:46:13PM +, Jeff Squyres (jsquyres) wrote: > On Jul 24, 2019, at 5:16 PM, Ralph Castain via users > wrote: > > > > It doesn't work that way, as you discovered. You need to add this > > information at the same place where vader currently calls modex send, and > > then retrieve it at the same place vader currently calls modex recv. Those > > macros don't do an immediate send/recv like you are thinking - the send > > simply adds the value to an aggregated payload, then the "fence" call > > distributes that payload to everyone, and then the read extracts the > > requested piece from that payload. > > Just to expand on what Ralph said, think of it like this: > > 1. each component/module does a modex "send", which just memcopies the data > blob > > 2. the "fence()" is deep within ompi_mpi_init(), which does the actual data > exchange of all the module blobs in an efficient manner > > 3. each component/module can then later do a modex "receive", which just > memcopies the relevant blob from the module blobs that were actually received > in step #2 Thanks for the clarifications how this works. I was expecting something completely different. > (BTW, "modex" = "module exchange") Ah, also good to know. I was not really sure what it means. I opened a pull request completely without modex : https://github.com/open-mpi/ompi/pull/6844 I would have preferred to have the user namespace detection earlier than the actual {put,get}(), but I was not able to make it work. I hope I included all relevant information about my changes in the PR. Let's continue the discussion there. Thanks everyone for the support. Adrian > >> On Jul 24, 2019, at 5:23 AM, Adrian Reber wrote: > >> > >> On Mon, Jul 22, 2019 at 04:30:50PM +, Ralph Castain wrote: > On Jul 22, 2019, at 9:20 AM, Adrian Reber wrote: > > I have most of the code ready, but I still have troubles doing > OPAL_MODEX_RECV. I am using the following lines, based on the code from > orte/test/mpi/pmix.c: > > OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", , > OPAL_INT); > > This sets rc to 0. For receiving: > > OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", _rank, , OPAL_INT); > >>> > >>> You need to replace "wildcard_rank" with the process name of the proc who > >>> published the "user_ns_id" key. If/when we have mpirun provide the value, > >>> then you can retrieve it from the wildcard rank as it will be coming from > >>> the system and not an application proc > >> > >> So I can get the user namespace ID from all involved processes back to > >> the main process (MCA_BTL_VADER_LOCAL_RANK == 0). But now only this > >> process knows that the user namespace IDs are different and I have > >> trouble using MODEX to send the information (do not use cma) back to the > >> other involved processes. It seems am not able to used MODEX_{SEND,RECV} > >> at the same time. One process sent and waits then on receive from the > >> other processes. Something like this works > >> > >>PROC 0 PROC 1 > >>recv() sent() > >> > >> > >> Bit this does not work: > >> > >>PROC 0 PROC 1 > >>recv() sent() > >>sent() recv() > >> > >> If I start the recv() immediately after the send() on PROC 1 no messages > >> are delivered anymore and everything hangs, even if different MODEX keys > >> are used. It seems like MODEX can not fetch messages in another order > >> than it was sent. Is that so? > >> > >> Not sure how to tell the other processes to not use CMA, while some > >> processes are still transmitting their user namespace ID to PROC 0. 
> >> > >>Adrian > >> > and rc is always set to -13. Is this how it is supposed to work, or do I > have to do it differently? > > Adrian > > On Mon, Jul 22, 2019 at 02:03:20PM +, Ralph Castain via users wrote: > > If that works, then it might be possible to include the namespace ID in > > the job-info provided by PMIx at startup - would have to investigate, > > so please confirm that the modex option works first. > > > >> On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users > >> wrote: > >> > >> Adrian, > >> > >> > >> An option is to involve the modex. > >> > >> each task would OPAL_MODEX_SEND() its own namespace ID, and then > >> OPAL_MODEX_RECV() > >> > >> the one from its peers and decide whether CMA support can be enabled. > >> > >> > >> Cheers, > >> > >> > >> Gilles > >> > >> On 7/22/2019 4:53 PM, Adrian Reber via users wrote: > >>> I had a look at it and not sure if it really makes sense. > >>> > >>> In btl_vader_{put,get}.c it would be easy to check for the user > >>> namespace ID of the other process, but the
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
On Jul 24, 2019, at 5:16 PM, Ralph Castain via users wrote: > > It doesn't work that way, as you discovered. You need to add this information > at the same place where vader currently calls modex send, and then retrieve > it at the same place vader currently calls modex recv. Those macros don't do > an immediate send/recv like you are thinking - the send simply adds the value > to an aggregated payload, then the "fence" call distributes that payload to > everyone, and then the read extracts the requested piece from that payload. Just to expand on what Ralph said, think of it like this: 1. each component/module does a modex "send", which just memcopies the data blob 2. the "fence()" is deep within ompi_mpi_init(), which does the actual data exchange of all the module blobs in an efficient manner 3. each component/module can then later do a modex "receive", which just memcopies the relevant blob from the module blobs that were actually received in step #2 (BTW, "modex" = "module exchange") > >> On Jul 24, 2019, at 5:23 AM, Adrian Reber wrote: >> >> On Mon, Jul 22, 2019 at 04:30:50PM +, Ralph Castain wrote: On Jul 22, 2019, at 9:20 AM, Adrian Reber wrote: I have most of the code ready, but I still have troubles doing OPAL_MODEX_RECV. I am using the following lines, based on the code from orte/test/mpi/pmix.c: OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", , OPAL_INT); This sets rc to 0. For receiving: OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", _rank, , OPAL_INT); >>> >>> You need to replace "wildcard_rank" with the process name of the proc who >>> published the "user_ns_id" key. If/when we have mpirun provide the value, >>> then you can retrieve it from the wildcard rank as it will be coming from >>> the system and not an application proc >> >> So I can get the user namespace ID from all involved processes back to >> the main process (MCA_BTL_VADER_LOCAL_RANK == 0). But now only this >> process knows that the user namespace IDs are different and I have >> trouble using MODEX to send the information (do not use cma) back to the >> other involved processes. It seems am not able to used MODEX_{SEND,RECV} >> at the same time. One process sent and waits then on receive from the >> other processes. Something like this works >> >>PROC 0 PROC 1 >>recv() sent() >> >> >> Bit this does not work: >> >>PROC 0 PROC 1 >>recv() sent() >>sent() recv() >> >> If I start the recv() immediately after the send() on PROC 1 no messages >> are delivered anymore and everything hangs, even if different MODEX keys >> are used. It seems like MODEX can not fetch messages in another order >> than it was sent. Is that so? >> >> Not sure how to tell the other processes to not use CMA, while some >> processes are still transmitting their user namespace ID to PROC 0. >> >> Adrian >> and rc is always set to -13. Is this how it is supposed to work, or do I have to do it differently? Adrian On Mon, Jul 22, 2019 at 02:03:20PM +, Ralph Castain via users wrote: > If that works, then it might be possible to include the namespace ID in > the job-info provided by PMIx at startup - would have to investigate, so > please confirm that the modex option works first. > >> On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users >> wrote: >> >> Adrian, >> >> >> An option is to involve the modex. >> >> each task would OPAL_MODEX_SEND() its own namespace ID, and then >> OPAL_MODEX_RECV() >> >> the one from its peers and decide whether CMA support can be enabled. 
>> >> >> Cheers, >> >> >> Gilles >> >> On 7/22/2019 4:53 PM, Adrian Reber via users wrote: >>> I had a look at it and not sure if it really makes sense. >>> >>> In btl_vader_{put,get}.c it would be easy to check for the user >>> namespace ID of the other process, but the function would then just >>> return OPAL_ERROR a bit earlier instead of as a result of >>> process_vm_{read,write}v(). Nothing would really change. >>> >>> A better place for the check would be mca_btl_vader_check_single_copy() >>> but I do not know if at this point the PID of the other processes is >>> already known. Not sure if I can check for the user namespace ID of the >>> other processes. >>> >>> Any recommendations how to do this? >>> >>> Adrian >>> >>> On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote: Patches are always welcome. What would be great is a nice big warning that CMA support is disabled because the processes are on different namespaces. Ideally all MPI processes should be on the same namespace
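The OPAL_MODEX_* macros used in this thread are thin wrappers over the PMIx client API, so the three steps Jeff describes can be illustrated with a standalone PMIx program launched under mpirun. This is only a sketch of the put -> commit -> fence -> get mechanism, not the btl/vader code; the key name "user_ns_id" and the example value are simply reused from the discussion, and exact macro/attribute availability depends on the PMIx version.

/* Standalone PMIx client sketch of the modex pattern described above:
 * "send" (put/commit), "fence", "receive" (get). Illustration only,
 * not the btl/vader code. */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, peer;
    pmix_value_t val, *result = NULL;
    pmix_info_t info;
    bool collect = true;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }

    /* Step 1: "modex send" - stage the value locally, no communication yet */
    val.type = PMIX_UINT64;
    val.data.uint64 = 4026531837;             /* e.g. a user-namespace inode */
    PMIx_Put(PMIX_LOCAL, "user_ns_id", &val);
    PMIx_Commit();

    /* Step 2: the "fence" - the collective exchange of all staged blobs
     * (inside Open MPI this happens deep in ompi_mpi_init()) */
    PMIX_INFO_CONSTRUCT(&info);
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);
    PMIx_Fence(NULL, 0, &info, 1);

    /* Step 3: "modex recv" - extract the value published by a specific peer
     * (rank 0 here); note this names a concrete proc, not the wildcard rank */
    PMIX_PROC_CONSTRUCT(&peer);
    (void)strncpy(peer.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    peer.rank = 0;
    if (PMIX_SUCCESS == PMIx_Get(&peer, "user_ns_id", NULL, 0, &result)) {
        printf("rank %u sees user_ns_id of rank 0: %llu\n",
               myproc.rank, (unsigned long long)result->data.uint64);
        PMIX_VALUE_RELEASE(result);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}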
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
It doesn't work that way, as you discovered. You need to add this information at the same place where vader currently calls modex send, and then retrieve it at the same place vader currently calls modex recv. Those macros don't do an immediate send/recv like you are thinking - the send simply adds the value to an aggregated payload, then the "fence" call distributes that payload to everyone, and then the read extracts the requested piece from that payload. > On Jul 24, 2019, at 5:23 AM, Adrian Reber wrote: > > On Mon, Jul 22, 2019 at 04:30:50PM +, Ralph Castain wrote: >>> On Jul 22, 2019, at 9:20 AM, Adrian Reber wrote: >>> >>> I have most of the code ready, but I still have troubles doing >>> OPAL_MODEX_RECV. I am using the following lines, based on the code from >>> orte/test/mpi/pmix.c: >>> >>> OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", , OPAL_INT); >>> >>> This sets rc to 0. For receiving: >>> >>> OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", _rank, , OPAL_INT); >> >> You need to replace "wildcard_rank" with the process name of the proc who >> published the "user_ns_id" key. If/when we have mpirun provide the value, >> then you can retrieve it from the wildcard rank as it will be coming from >> the system and not an application proc > > So I can get the user namespace ID from all involved processes back to > the main process (MCA_BTL_VADER_LOCAL_RANK == 0). But now only this > process knows that the user namespace IDs are different and I have > trouble using MODEX to send the information (do not use cma) back to the > other involved processes. It seems am not able to used MODEX_{SEND,RECV} > at the same time. One process sent and waits then on receive from the > other processes. Something like this works > > PROC 0 PROC 1 > recv() sent() > > > Bit this does not work: > > PROC 0 PROC 1 > recv() sent() > sent() recv() > > If I start the recv() immediately after the send() on PROC 1 no messages > are delivered anymore and everything hangs, even if different MODEX keys > are used. It seems like MODEX can not fetch messages in another order > than it was sent. Is that so? > > Not sure how to tell the other processes to not use CMA, while some > processes are still transmitting their user namespace ID to PROC 0. > > Adrian > >>> and rc is always set to -13. Is this how it is supposed to work, or do I >>> have to do it differently? >>> >>> Adrian >>> >>> On Mon, Jul 22, 2019 at 02:03:20PM +, Ralph Castain via users wrote: If that works, then it might be possible to include the namespace ID in the job-info provided by PMIx at startup - would have to investigate, so please confirm that the modex option works first. > On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users > wrote: > > Adrian, > > > An option is to involve the modex. > > each task would OPAL_MODEX_SEND() its own namespace ID, and then > OPAL_MODEX_RECV() > > the one from its peers and decide whether CMA support can be enabled. > > > Cheers, > > > Gilles > > On 7/22/2019 4:53 PM, Adrian Reber via users wrote: >> I had a look at it and not sure if it really makes sense. >> >> In btl_vader_{put,get}.c it would be easy to check for the user >> namespace ID of the other process, but the function would then just >> return OPAL_ERROR a bit earlier instead of as a result of >> process_vm_{read,write}v(). Nothing would really change. >> >> A better place for the check would be mca_btl_vader_check_single_copy() >> but I do not know if at this point the PID of the other processes is >> already known. 
Not sure if I can check for the user namespace ID of the >> other processes. >> >> Any recommendations how to do this? >> >> Adrian >> >> On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote: >>> Patches are always welcome. What would be great is a nice big warning >>> that CMA support is disabled because the processes are on different >>> namespaces. Ideally all MPI processes should be on the same namespace >>> to ensure the best performance. >>> >>> -Nathan >>> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users wrote: For completeness I am mentioning my results also here. To be able to mount file systems in the container it can only work if user namespaces are used and even if the user IDs are all the same (in each container and on the host), to be able to ptrace the kernel also checks if the processes are in the same user namespace (in addition to being owned by the same user). This check - same user namespace - fails and so
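For reference, the "user_ns_id" value being exchanged here can be obtained on Linux from the nsfs inode behind /proc/self/ns/user; two processes are in the same user namespace exactly when those inodes (and devices) match. A minimal, hypothetical helper, not the code from the eventual patch:

/* Minimal sketch: obtain an identifier for the calling process's user
 * namespace. The inode number of /proc/self/ns/user identifies the
 * namespace; this is the kind of small value one could add to vader's
 * existing modex payload as discussed above. */
#include <stdio.h>
#include <stdint.h>
#include <sys/stat.h>

static int64_t my_user_ns_id(void)
{
    struct stat st;

    if (stat("/proc/self/ns/user", &st) != 0) {
        return -1;   /* no user-namespace support, or /proc not mounted */
    }
    return (int64_t)st.st_ino;
}

int main(void)
{
    printf("user namespace id: %lld\n", (long long)my_user_ns_id());
    return 0;
}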
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
Just add it to the existing modex. -Nathan > On Jul 22, 2019, at 12:20 PM, Adrian Reber via users > wrote: > > I have most of the code ready, but I still have troubles doing > OPAL_MODEX_RECV. I am using the following lines, based on the code from > orte/test/mpi/pmix.c: > > OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", , OPAL_INT); > > This sets rc to 0. For receiving: > > OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", _rank, , OPAL_INT); > > and rc is always set to -13. Is this how it is supposed to work, or do I > have to do it differently? > >Adrian > >> On Mon, Jul 22, 2019 at 02:03:20PM +, Ralph Castain via users wrote: >> If that works, then it might be possible to include the namespace ID in the >> job-info provided by PMIx at startup - would have to investigate, so please >> confirm that the modex option works first. >> >>> On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users >>> wrote: >>> >>> Adrian, >>> >>> >>> An option is to involve the modex. >>> >>> each task would OPAL_MODEX_SEND() its own namespace ID, and then >>> OPAL_MODEX_RECV() >>> >>> the one from its peers and decide whether CMA support can be enabled. >>> >>> >>> Cheers, >>> >>> >>> Gilles >>> On 7/22/2019 4:53 PM, Adrian Reber via users wrote: I had a look at it and not sure if it really makes sense. In btl_vader_{put,get}.c it would be easy to check for the user namespace ID of the other process, but the function would then just return OPAL_ERROR a bit earlier instead of as a result of process_vm_{read,write}v(). Nothing would really change. A better place for the check would be mca_btl_vader_check_single_copy() but I do not know if at this point the PID of the other processes is already known. Not sure if I can check for the user namespace ID of the other processes. Any recommendations how to do this? Adrian > On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote: > Patches are always welcome. What would be great is a nice big warning > that CMA support is disabled because the processes are on different > namespaces. Ideally all MPI processes should be on the same namespace to > ensure the best performance. > > -Nathan > >> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users >> wrote: >> >> For completeness I am mentioning my results also here. >> >> To be able to mount file systems in the container it can only work if >> user namespaces are used and even if the user IDs are all the same (in >> each container and on the host), to be able to ptrace the kernel also >> checks if the processes are in the same user namespace (in addition to >> being owned by the same user). This check - same user namespace - fails >> and so process_vm_readv() and process_vm_writev() will also fail. >> >> So Open MPI's checks are currently not enough to detect if 'cma' can be >> used. Checking for the same user namespace would also be necessary. >> >> Is this a use case important enough to accept a patch for it? >> >> Adrian >> >>> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote: >>> Gilles, >>> >>> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps >>> indeed. >>> >>> The default seems to be 'cma' and that seems to use process_vm_readv() >>> and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but >>> telling Podman to give the process CAP_SYS_PTRACE with >>> '--cap-add=SYS_PTRACE' >>> does not seem to be enough. Not sure yet if this related to the fact >>> that Podman is running rootless. I will continue to investigate, but now >>> I know where to look. Thanks! 
>>> >>> Adrian >>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users wrote: Adrian, Can you try mpirun --mca btl_vader_copy_mechanism none ... Please double check the MCA parameter name, I am AFK IIRC, the default copy mechanism used by vader directly accesses the remote process address space, and this requires some permission (ptrace?) that might be dropped by podman. Note Open MPI might not detect both MPI tasks run on the same node because of podman. If you use UCX, then btl/vader is not used at all (pml/ucx is used instead) Cheers, Gilles Sent from my iPod > On Jul 12, 2019, at 18:33, Adrian Reber via users > wrote: > > So upstream Podman was really fast and merged a PR which makes my > wrapper unnecessary: > > Add support for --env-host : >
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
I have most of the code ready, but I still have troubles doing OPAL_MODEX_RECV. I am using the following lines, based on the code from orte/test/mpi/pmix.c: OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", , OPAL_INT); This sets rc to 0. For receiving: OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", _rank, , OPAL_INT); and rc is always set to -13. Is this how it is supposed to work, or do I have to do it differently? Adrian On Mon, Jul 22, 2019 at 02:03:20PM +, Ralph Castain via users wrote: > If that works, then it might be possible to include the namespace ID in the > job-info provided by PMIx at startup - would have to investigate, so please > confirm that the modex option works first. > > > On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users > > wrote: > > > > Adrian, > > > > > > An option is to involve the modex. > > > > each task would OPAL_MODEX_SEND() its own namespace ID, and then > > OPAL_MODEX_RECV() > > > > the one from its peers and decide whether CMA support can be enabled. > > > > > > Cheers, > > > > > > Gilles > > > > On 7/22/2019 4:53 PM, Adrian Reber via users wrote: > >> I had a look at it and not sure if it really makes sense. > >> > >> In btl_vader_{put,get}.c it would be easy to check for the user > >> namespace ID of the other process, but the function would then just > >> return OPAL_ERROR a bit earlier instead of as a result of > >> process_vm_{read,write}v(). Nothing would really change. > >> > >> A better place for the check would be mca_btl_vader_check_single_copy() > >> but I do not know if at this point the PID of the other processes is > >> already known. Not sure if I can check for the user namespace ID of the > >> other processes. > >> > >> Any recommendations how to do this? > >> > >>Adrian > >> > >> On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote: > >>> Patches are always welcome. What would be great is a nice big warning > >>> that CMA support is disabled because the processes are on different > >>> namespaces. Ideally all MPI processes should be on the same namespace to > >>> ensure the best performance. > >>> > >>> -Nathan > >>> > On Jul 21, 2019, at 2:53 PM, Adrian Reber via users > wrote: > > For completeness I am mentioning my results also here. > > To be able to mount file systems in the container it can only work if > user namespaces are used and even if the user IDs are all the same (in > each container and on the host), to be able to ptrace the kernel also > checks if the processes are in the same user namespace (in addition to > being owned by the same user). This check - same user namespace - fails > and so process_vm_readv() and process_vm_writev() will also fail. > > So Open MPI's checks are currently not enough to detect if 'cma' can be > used. Checking for the same user namespace would also be necessary. > > Is this a use case important enough to accept a patch for it? > > Adrian > > > On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote: > > Gilles, > > > > thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps > > indeed. > > > > The default seems to be 'cma' and that seems to use process_vm_readv() > > and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but > > telling Podman to give the process CAP_SYS_PTRACE with > > '--cap-add=SYS_PTRACE' > > does not seem to be enough. Not sure yet if this related to the fact > > that Podman is running rootless. I will continue to investigate, but now > > I know where to look. Thanks! 
> > > >Adrian > > > >> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via > >> users wrote: > >> Adrian, > >> > >> Can you try > >> mpirun --mca btl_vader_copy_mechanism none ... > >> > >> Please double check the MCA parameter name, I am AFK > >> > >> IIRC, the default copy mechanism used by vader directly accesses the > >> remote process address space, and this requires some permission > >> (ptrace?) that might be dropped by podman. > >> > >> Note Open MPI might not detect both MPI tasks run on the same node > >> because of podman. > >> If you use UCX, then btl/vader is not used at all (pml/ucx is used > >> instead) > >> > >> > >> Cheers, > >> > >> Gilles > >> > >> Sent from my iPod > >> > >>> On Jul 12, 2019, at 18:33, Adrian Reber via users > >>> wrote: > >>> > >>> So upstream Podman was really fast and merged a PR which makes my > >>> wrapper unnecessary: > >>> > >>> Add support for --env-host : > >>> https://github.com/containers/libpod/pull/3557 > >>> > >>> As commented in the PR I can now start mpirun with Podman without a > >>> wrapper: >
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
If that works, then it might be possible to include the namespace ID in the job-info provided by PMIx at startup - would have to investigate, so please confirm that the modex option works first. > On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users > wrote: > > Adrian, > > > An option is to involve the modex. > > each task would OPAL_MODEX_SEND() its own namespace ID, and then > OPAL_MODEX_RECV() > > the one from its peers and decide whether CMA support can be enabled. > > > Cheers, > > > Gilles > > On 7/22/2019 4:53 PM, Adrian Reber via users wrote: >> I had a look at it and not sure if it really makes sense. >> >> In btl_vader_{put,get}.c it would be easy to check for the user >> namespace ID of the other process, but the function would then just >> return OPAL_ERROR a bit earlier instead of as a result of >> process_vm_{read,write}v(). Nothing would really change. >> >> A better place for the check would be mca_btl_vader_check_single_copy() >> but I do not know if at this point the PID of the other processes is >> already known. Not sure if I can check for the user namespace ID of the >> other processes. >> >> Any recommendations how to do this? >> >> Adrian >> >> On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote: >>> Patches are always welcome. What would be great is a nice big warning that >>> CMA support is disabled because the processes are on different namespaces. >>> Ideally all MPI processes should be on the same namespace to ensure the >>> best performance. >>> >>> -Nathan >>> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users wrote: For completeness I am mentioning my results also here. To be able to mount file systems in the container it can only work if user namespaces are used and even if the user IDs are all the same (in each container and on the host), to be able to ptrace the kernel also checks if the processes are in the same user namespace (in addition to being owned by the same user). This check - same user namespace - fails and so process_vm_readv() and process_vm_writev() will also fail. So Open MPI's checks are currently not enough to detect if 'cma' can be used. Checking for the same user namespace would also be necessary. Is this a use case important enough to accept a patch for it? Adrian > On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote: > Gilles, > > thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps > indeed. > > The default seems to be 'cma' and that seems to use process_vm_readv() > and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but > telling Podman to give the process CAP_SYS_PTRACE with > '--cap-add=SYS_PTRACE' > does not seem to be enough. Not sure yet if this related to the fact > that Podman is running rootless. I will continue to investigate, but now > I know where to look. Thanks! > >Adrian > >> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users >> wrote: >> Adrian, >> >> Can you try >> mpirun --mca btl_vader_copy_mechanism none ... >> >> Please double check the MCA parameter name, I am AFK >> >> IIRC, the default copy mechanism used by vader directly accesses the >> remote process address space, and this requires some permission >> (ptrace?) that might be dropped by podman. >> >> Note Open MPI might not detect both MPI tasks run on the same node >> because of podman. 
>> If you use UCX, then btl/vader is not used at all (pml/ucx is used >> instead) >> >> >> Cheers, >> >> Gilles >> >> Sent from my iPod >> >>> On Jul 12, 2019, at 18:33, Adrian Reber via users >>> wrote: >>> >>> So upstream Podman was really fast and merged a PR which makes my >>> wrapper unnecessary: >>> >>> Add support for --env-host : >>> https://github.com/containers/libpod/pull/3557 >>> >>> As commented in the PR I can now start mpirun with Podman without a >>> wrapper: >>> >>> $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun >>> podman run --env-host --security-opt label=disable -v >>> /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host >>> mpi-test /home/mpi/ring >>> Rank 0 has cleared MPI_Init >>> Rank 1 has cleared MPI_Init >>> Rank 0 has completed ring >>> Rank 0 has completed MPI_Barrier >>> Rank 1 has completed ring >>> Rank 1 has completed MPI_Barrier >>> >>> This is example was using TCP and on an InfiniBand based system I have >>> to map the InfiniBand devices into the container. >>> >>> $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base >>> /tmp/podman-mpirun podman run --env-host
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
Adrian, An option is to involve the modex. each task would OPAL_MODEX_SEND() its own namespace ID, and then OPAL_MODEX_RECV() the one from its peers and decide whether CMA support can be enabled. Cheers, Gilles On 7/22/2019 4:53 PM, Adrian Reber via users wrote: I had a look at it and not sure if it really makes sense. In btl_vader_{put,get}.c it would be easy to check for the user namespace ID of the other process, but the function would then just return OPAL_ERROR a bit earlier instead of as a result of process_vm_{read,write}v(). Nothing would really change. A better place for the check would be mca_btl_vader_check_single_copy() but I do not know if at this point the PID of the other processes is already known. Not sure if I can check for the user namespace ID of the other processes. Any recommendations how to do this? Adrian On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote: Patches are always welcome. What would be great is a nice big warning that CMA support is disabled because the processes are on different namespaces. Ideally all MPI processes should be on the same namespace to ensure the best performance. -Nathan On Jul 21, 2019, at 2:53 PM, Adrian Reber via users wrote: For completeness I am mentioning my results also here. To be able to mount file systems in the container it can only work if user namespaces are used and even if the user IDs are all the same (in each container and on the host), to be able to ptrace the kernel also checks if the processes are in the same user namespace (in addition to being owned by the same user). This check - same user namespace - fails and so process_vm_readv() and process_vm_writev() will also fail. So Open MPI's checks are currently not enough to detect if 'cma' can be used. Checking for the same user namespace would also be necessary. Is this a use case important enough to accept a patch for it? Adrian On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote: Gilles, thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps indeed. The default seems to be 'cma' and that seems to use process_vm_readv() and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but telling Podman to give the process CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE' does not seem to be enough. Not sure yet if this related to the fact that Podman is running rootless. I will continue to investigate, but now I know where to look. Thanks! Adrian On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users wrote: Adrian, Can you try mpirun --mca btl_vader_copy_mechanism none ... Please double check the MCA parameter name, I am AFK IIRC, the default copy mechanism used by vader directly accesses the remote process address space, and this requires some permission (ptrace?) that might be dropped by podman. Note Open MPI might not detect both MPI tasks run on the same node because of podman. 
If you use UCX, then btl/vader is not used at all (pml/ucx is used instead) Cheers, Gilles Sent from my iPod On Jul 12, 2019, at 18:33, Adrian Reber via users wrote: So upstream Podman was really fast and merged a PR which makes my wrapper unnecessary: Add support for --env-host : https://github.com/containers/libpod/pull/3557 As commented in the PR I can now start mpirun with Podman without a wrapper: $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host --security-opt label=disable -v /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test /home/mpi/ring Rank 0 has cleared MPI_Init Rank 1 has cleared MPI_Init Rank 0 has completed ring Rank 0 has completed MPI_Barrier Rank 1 has completed ring Rank 1 has completed MPI_Barrier This is example was using TCP and on an InfiniBand based system I have to map the InfiniBand devices into the container. $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host -v /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable --userns=keep-id --device /dev/infiniband/uverbs0 --device /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test /home/mpi/ring Rank 0 has cleared MPI_Init Rank 1 has cleared MPI_Init Rank 0 has completed ring Rank 0 has completed MPI_Barrier Rank 1 has completed ring Rank 1 has completed MPI_Barrier This is all running without root and only using Podman's rootless support. Running multiple processes on one system, however, still gives me an error. If I disable vader I guess that Open MPI is using TCP for localhost communication and that works. But with vader it fails. The first error message I get is a segfault: [test1:1] *** Process received signal *** [test1:1] Signal: Segmentation fault (11) [test1:1] Signal code: Address not mapped (1) [test1:1] Failing at address: 0x7fb7b1552010 [test1:1] [ 0]
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
I had a look at it and not sure if it really makes sense. In btl_vader_{put,get}.c it would be easy to check for the user namespace ID of the other process, but the function would then just return OPAL_ERROR a bit earlier instead of as a result of process_vm_{read,write}v(). Nothing would really change. A better place for the check would be mca_btl_vader_check_single_copy() but I do not know if at this point the PID of the other processes is already known. Not sure if I can check for the user namespace ID of the other processes. Any recommendations how to do this? Adrian On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote: > Patches are always welcome. What would be great is a nice big warning that > CMA support is disabled because the processes are on different namespaces. > Ideally all MPI processes should be on the same namespace to ensure the best > performance. > > -Nathan > > > On Jul 21, 2019, at 2:53 PM, Adrian Reber via users > > wrote: > > > > For completeness I am mentioning my results also here. > > > > To be able to mount file systems in the container it can only work if > > user namespaces are used and even if the user IDs are all the same (in > > each container and on the host), to be able to ptrace the kernel also > > checks if the processes are in the same user namespace (in addition to > > being owned by the same user). This check - same user namespace - fails > > and so process_vm_readv() and process_vm_writev() will also fail. > > > > So Open MPI's checks are currently not enough to detect if 'cma' can be > > used. Checking for the same user namespace would also be necessary. > > > > Is this a use case important enough to accept a patch for it? > > > >Adrian > > > >> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote: > >> Gilles, > >> > >> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps > >> indeed. > >> > >> The default seems to be 'cma' and that seems to use process_vm_readv() > >> and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but > >> telling Podman to give the process CAP_SYS_PTRACE with > >> '--cap-add=SYS_PTRACE' > >> does not seem to be enough. Not sure yet if this related to the fact > >> that Podman is running rootless. I will continue to investigate, but now > >> I know where to look. Thanks! > >> > >>Adrian > >> > >>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users > >>> wrote: > >>> Adrian, > >>> > >>> Can you try > >>> mpirun --mca btl_vader_copy_mechanism none ... > >>> > >>> Please double check the MCA parameter name, I am AFK > >>> > >>> IIRC, the default copy mechanism used by vader directly accesses the > >>> remote process address space, and this requires some permission (ptrace?) > >>> that might be dropped by podman. > >>> > >>> Note Open MPI might not detect both MPI tasks run on the same node > >>> because of podman. 
> >>> If you use UCX, then btl/vader is not used at all (pml/ucx is used > >>> instead) > >>> > >>> > >>> Cheers, > >>> > >>> Gilles > >>> > >>> Sent from my iPod > >>> > On Jul 12, 2019, at 18:33, Adrian Reber via users > wrote: > > So upstream Podman was really fast and merged a PR which makes my > wrapper unnecessary: > > Add support for --env-host : > https://github.com/containers/libpod/pull/3557 > > As commented in the PR I can now start mpirun with Podman without a > wrapper: > > $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun > podman run --env-host --security-opt label=disable -v > /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host > mpi-test /home/mpi/ring > Rank 0 has cleared MPI_Init > Rank 1 has cleared MPI_Init > Rank 0 has completed ring > Rank 0 has completed MPI_Barrier > Rank 1 has completed ring > Rank 1 has completed MPI_Barrier > > This is example was using TCP and on an InfiniBand based system I have > to map the InfiniBand devices into the container. > > $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base > /tmp/podman-mpirun podman run --env-host -v > /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable > --userns=keep-id --device /dev/infiniband/uverbs0 --device > /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host > mpi-test /home/mpi/ring > Rank 0 has cleared MPI_Init > Rank 1 has cleared MPI_Init > Rank 0 has completed ring > Rank 0 has completed MPI_Barrier > Rank 1 has completed ring > Rank 1 has completed MPI_Barrier > > This is all running without root and only using Podman's rootless > support. > > Running multiple processes on one system, however, still gives me an > error. If I disable vader I guess that Open MPI is using TCP for >
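If the peer's PID is known (Adrian's open question above), the comparison can also be done directly against /proc/&lt;pid&gt;/ns/user - with the caveat that /proc of a process in another container may not be visible at all, which is one reason the modex-based exchange was suggested instead. A hypothetical helper:

/* Hypothetical check: does the calling process share a user namespace with
 * 'pid'? Compares the nsfs inode (and device) behind /proc/self/ns/user and
 * /proc/<pid>/ns/user. Returns 1 = same, 0 = different, -1 = cannot tell
 * (e.g. the peer's /proc entry is not visible from this container). */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

static int same_user_ns(pid_t pid)
{
    char path[64];
    struct stat mine, theirs;

    if (stat("/proc/self/ns/user", &mine) != 0) {
        return -1;
    }
    snprintf(path, sizeof(path), "/proc/%ld/ns/user", (long)pid);
    if (stat(path, &theirs) != 0) {
        return -1;
    }
    return (mine.st_ino == theirs.st_ino && mine.st_dev == theirs.st_dev) ? 1 : 0;
}

int main(int argc, char **argv)
{
    if (argc > 1) {
        printf("same user namespace as pid %s: %d\n",
               argv[1], same_user_ns((pid_t)atol(argv[1])));
    }
    return 0;
}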
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
Patches are always welcome. What would be great is a nice big warning that CMA support is disabled because the processes are on different namespaces. Ideally all MPI processes should be on the same namespace to ensure the best performance. -Nathan > On Jul 21, 2019, at 2:53 PM, Adrian Reber via users > wrote: > > For completeness I am mentioning my results also here. > > To be able to mount file systems in the container it can only work if > user namespaces are used and even if the user IDs are all the same (in > each container and on the host), to be able to ptrace the kernel also > checks if the processes are in the same user namespace (in addition to > being owned by the same user). This check - same user namespace - fails > and so process_vm_readv() and process_vm_writev() will also fail. > > So Open MPI's checks are currently not enough to detect if 'cma' can be > used. Checking for the same user namespace would also be necessary. > > Is this a use case important enough to accept a patch for it? > >Adrian > >> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote: >> Gilles, >> >> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps >> indeed. >> >> The default seems to be 'cma' and that seems to use process_vm_readv() >> and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but >> telling Podman to give the process CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE' >> does not seem to be enough. Not sure yet if this related to the fact >> that Podman is running rootless. I will continue to investigate, but now >> I know where to look. Thanks! >> >>Adrian >> >>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users >>> wrote: >>> Adrian, >>> >>> Can you try >>> mpirun --mca btl_vader_copy_mechanism none ... >>> >>> Please double check the MCA parameter name, I am AFK >>> >>> IIRC, the default copy mechanism used by vader directly accesses the remote >>> process address space, and this requires some permission (ptrace?) that >>> might be dropped by podman. >>> >>> Note Open MPI might not detect both MPI tasks run on the same node because >>> of podman. >>> If you use UCX, then btl/vader is not used at all (pml/ucx is used instead) >>> >>> >>> Cheers, >>> >>> Gilles >>> >>> Sent from my iPod >>> On Jul 12, 2019, at 18:33, Adrian Reber via users wrote: So upstream Podman was really fast and merged a PR which makes my wrapper unnecessary: Add support for --env-host : https://github.com/containers/libpod/pull/3557 As commented in the PR I can now start mpirun with Podman without a wrapper: $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host --security-opt label=disable -v /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test /home/mpi/ring Rank 0 has cleared MPI_Init Rank 1 has cleared MPI_Init Rank 0 has completed ring Rank 0 has completed MPI_Barrier Rank 1 has completed ring Rank 1 has completed MPI_Barrier This is example was using TCP and on an InfiniBand based system I have to map the InfiniBand devices into the container. 
$ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host -v /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable --userns=keep-id --device /dev/infiniband/uverbs0 --device /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test /home/mpi/ring Rank 0 has cleared MPI_Init Rank 1 has cleared MPI_Init Rank 0 has completed ring Rank 0 has completed MPI_Barrier Rank 1 has completed ring Rank 1 has completed MPI_Barrier This is all running without root and only using Podman's rootless support. Running multiple processes on one system, however, still gives me an error. If I disable vader I guess that Open MPI is using TCP for localhost communication and that works. But with vader it fails. The first error message I get is a segfault: [test1:1] *** Process received signal *** [test1:1] Signal: Segmentation fault (11) [test1:1] Signal code: Address not mapped (1) [test1:1] Failing at address: 0x7fb7b1552010 [test1:1] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80] [test1:1] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b] [test1:1] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb] [test1:1] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086] [test1:1] [ 4] /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d] [test1:1] [
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
For completeness I am also mentioning my results here. Mounting file systems in the container can only work if user namespaces are used. And even if the user IDs are all the same (in each container and on the host), to allow ptrace the kernel also checks whether the processes are in the same user namespace (in addition to being owned by the same user). This check - same user namespace - fails, and so process_vm_readv() and process_vm_writev() will also fail. So Open MPI's checks are currently not enough to detect if 'cma' can be used. Checking for the same user namespace would also be necessary. Is this a use case important enough to accept a patch for it? Adrian On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote: > Gilles, > > thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps > indeed. > > The default seems to be 'cma' and that seems to use process_vm_readv() > and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but > telling Podman to give the process CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE' > does not seem to be enough. Not sure yet if this is related to the fact > that Podman is running rootless. I will continue to investigate, but now > I know where to look. Thanks! > > Adrian > > On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users wrote: > > Adrian, > > > > Can you try > > mpirun --mca btl_vader_copy_mechanism none ... > > > > Please double check the MCA parameter name, I am AFK > > > > IIRC, the default copy mechanism used by vader directly accesses the remote > > process address space, and this requires some permission (ptrace?) that > > might be dropped by podman. > > > > Note Open MPI might not detect both MPI tasks run on the same node because > > of podman. > > If you use UCX, then btl/vader is not used at all (pml/ucx is used instead) > > > > > > Cheers, > > > > Gilles > > > > Sent from my iPod > > > > > On Jul 12, 2019, at 18:33, Adrian Reber via users > > > wrote: > > > > > > So upstream Podman was really fast and merged a PR which makes my > > > wrapper unnecessary: > > > > > > Add support for --env-host : > > > https://github.com/containers/libpod/pull/3557 > > > > > > As commented in the PR I can now start mpirun with Podman without a > > > wrapper: > > > > > > $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun > > > podman run --env-host --security-opt label=disable -v > > > /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host > > > mpi-test /home/mpi/ring > > > Rank 0 has cleared MPI_Init > > > Rank 1 has cleared MPI_Init > > > Rank 0 has completed ring > > > Rank 0 has completed MPI_Barrier > > > Rank 1 has completed ring > > > Rank 1 has completed MPI_Barrier > > > > > > This example was using TCP and on an InfiniBand based system I have > > > to map the InfiniBand devices into the container. 
> > > > > > $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base > > > /tmp/podman-mpirun podman run --env-host -v > > > /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable > > > --userns=keep-id --device /dev/infiniband/uverbs0 --device > > > /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host > > > mpi-test /home/mpi/ring > > > Rank 0 has cleared MPI_Init > > > Rank 1 has cleared MPI_Init > > > Rank 0 has completed ring > > > Rank 0 has completed MPI_Barrier > > > Rank 1 has completed ring > > > Rank 1 has completed MPI_Barrier > > > > > > This is all running without root and only using Podman's rootless > > > support. > > > > > > Running multiple processes on one system, however, still gives me an > > > error. If I disable vader I guess that Open MPI is using TCP for > > > localhost communication and that works. But with vader it fails. > > > > > > The first error message I get is a segfault: > > > > > > [test1:1] *** Process received signal *** > > > [test1:1] Signal: Segmentation fault (11) > > > [test1:1] Signal code: Address not mapped (1) > > > [test1:1] Failing at address: 0x7fb7b1552010 > > > [test1:1] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80] > > > [test1:1] [ 1] > > > /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b] > > > [test1:1] [ 2] > > > /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb] > > > [test1:1] [ 3] > > > /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086] > > > [test1:1] [ 4] > > > /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d] > > > [test1:1] [ 5] /home/mpi/ring[0x400b76] > > > [test1:1] [ 6] > > > /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813] > > > [test1:1] [ 7] /home/mpi/ring[0x4008be] > > > [test1:1] *** End of error message *** > > > > > > Guessing that vader uses shared memory this is expected to fail,
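The failing call itself is easy to reproduce outside Open MPI. In the "Read -1, expected 8, errno = 1" messages elsewhere in the thread, errno 1 is EPERM, which is what process_vm_readv() returns when the kernel's ptrace access check (including the same-user-namespace test described above) refuses the caller. A minimal sketch with a dummy remote address:

/* Minimal sketch of the CMA call used by vader's 'cma' single-copy
 * mechanism. EPERM means the ptrace access check failed (e.g. different
 * user namespaces, as described above); ESRCH or EFAULT would point to a
 * PID or address problem instead. The remote address below is a dummy
 * value for illustration only. */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

int main(int argc, char **argv)
{
    pid_t pid = (argc > 1) ? (pid_t)atol(argv[1]) : 1;
    char buf[8];
    struct iovec local  = { .iov_base = buf, .iov_len = sizeof(buf) };
    struct iovec remote = { .iov_base = (void *)0x400000, .iov_len = sizeof(buf) };

    ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
    if (n < 0) {
        printf("Read %zd, expected %zu, errno = %d (%s)\n",
               n, sizeof(buf), errno, strerror(errno));
    } else {
        printf("read %zd bytes from pid %ld\n", n, (long)pid);
    }
    return 0;
}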
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
Gilles, thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps indeed. The default seems to be 'cma' and that seems to use process_vm_readv() and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but telling Podman to give the process CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE' does not seem to be enough. Not sure yet if this related to the fact that Podman is running rootless. I will continue to investigate, but now I know where to look. Thanks! Adrian On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users wrote: > Adrian, > > Can you try > mpirun --mca btl_vader_copy_mechanism none ... > > Please double check the MCA parameter name, I am AFK > > IIRC, the default copy mechanism used by vader directly accesses the remote > process address space, and this requires some permission (ptrace?) that might > be dropped by podman. > > Note Open MPI might not detect both MPI tasks run on the same node because of > podman. > If you use UCX, then btl/vader is not used at all (pml/ucx is used instead) > > > Cheers, > > Gilles > > Sent from my iPod > > > On Jul 12, 2019, at 18:33, Adrian Reber via users > > wrote: > > > > So upstream Podman was really fast and merged a PR which makes my > > wrapper unnecessary: > > > > Add support for --env-host : https://github.com/containers/libpod/pull/3557 > > > > As commented in the PR I can now start mpirun with Podman without a > > wrapper: > > > > $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun > > podman run --env-host --security-opt label=disable -v > > /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test > > /home/mpi/ring > > Rank 0 has cleared MPI_Init > > Rank 1 has cleared MPI_Init > > Rank 0 has completed ring > > Rank 0 has completed MPI_Barrier > > Rank 1 has completed ring > > Rank 1 has completed MPI_Barrier > > > > This is example was using TCP and on an InfiniBand based system I have > > to map the InfiniBand devices into the container. > > > > $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base > > /tmp/podman-mpirun podman run --env-host -v > > /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable > > --userns=keep-id --device /dev/infiniband/uverbs0 --device > > /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test > > /home/mpi/ring > > Rank 0 has cleared MPI_Init > > Rank 1 has cleared MPI_Init > > Rank 0 has completed ring > > Rank 0 has completed MPI_Barrier > > Rank 1 has completed ring > > Rank 1 has completed MPI_Barrier > > > > This is all running without root and only using Podman's rootless > > support. > > > > Running multiple processes on one system, however, still gives me an > > error. If I disable vader I guess that Open MPI is using TCP for > > localhost communication and that works. But with vader it fails. 
> > > > The first error message I get is a segfault: > > > > [test1:1] *** Process received signal *** > > [test1:1] Signal: Segmentation fault (11) > > [test1:1] Signal code: Address not mapped (1) > > [test1:1] Failing at address: 0x7fb7b1552010 > > [test1:1] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80] > > [test1:1] [ 1] > > /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b] > > [test1:1] [ 2] > > /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb] > > [test1:1] [ 3] > > /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086] > > [test1:1] [ 4] > > /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d] > > [test1:1] [ 5] /home/mpi/ring[0x400b76] > > [test1:1] [ 6] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813] > > [test1:1] [ 7] /home/mpi/ring[0x4008be] > > [test1:1] *** End of error message *** > > > > Guessing that vader uses shared memory this is expected to fail, with > > all the namespace isolations in place. Maybe not with a segfault, but > > each container has its own shared memory. So next step was to use the > > host's ipc and pid namespace and mount /dev/shm: > > > > '-v /dev/shm:/dev/shm --ipc=host --pid=host' > > > > Which does not segfault, but still does not look correct: > > > > Rank 0 has cleared MPI_Init > > Rank 1 has cleared MPI_Init > > Rank 2 has cleared MPI_Init > > [test1:17722] Read -1, expected 8, errno = 1 > > [test1:17722] Read -1, expected 8, errno = 1 > > [test1:17722] Read -1, expected 8, errno = 1 > > [test1:17722] Read -1, expected 8, errno = 1 > > [test1:17722] Read -1, expected 8, errno = 1 > > [test1:17722] Read -1, expected 8, errno = 1 > > [test1:17722] Read -1, expected 8, errno = 1 > > [test1:17722] Read -1, expected 8, errno = 1 > > [test1:17722] Read -1, expected 8, errno = 1 > > [test1:17722] Read -1, expected 8, errno = 1 > > [test1:17722] Read -1, expected
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
Adrian, Can you try mpirun --mca btl_vader_copy_mechanism none ... Please double check the MCA parameter name, I am AFK IIRC, the default copy mechanism used by vader directly accesses the remote process address space, and this requires some permission (ptrace?) that might be dropped by podman. Note Open MPI might not detect both MPI tasks run on the same node because of podman. If you use UCX, then btl/vader is not used at all (pml/ucx is used instead) Cheers, Gilles Sent from my iPod > On Jul 12, 2019, at 18:33, Adrian Reber via users > wrote: > > So upstream Podman was really fast and merged a PR which makes my > wrapper unnecessary: > > Add support for --env-host : https://github.com/containers/libpod/pull/3557 > > As commented in the PR I can now start mpirun with Podman without a > wrapper: > > $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman > run --env-host --security-opt label=disable -v > /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test > /home/mpi/ring > Rank 0 has cleared MPI_Init > Rank 1 has cleared MPI_Init > Rank 0 has completed ring > Rank 0 has completed MPI_Barrier > Rank 1 has completed ring > Rank 1 has completed MPI_Barrier > > This is example was using TCP and on an InfiniBand based system I have > to map the InfiniBand devices into the container. > > $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base > /tmp/podman-mpirun podman run --env-host -v > /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable > --userns=keep-id --device /dev/infiniband/uverbs0 --device > /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test > /home/mpi/ring > Rank 0 has cleared MPI_Init > Rank 1 has cleared MPI_Init > Rank 0 has completed ring > Rank 0 has completed MPI_Barrier > Rank 1 has completed ring > Rank 1 has completed MPI_Barrier > > This is all running without root and only using Podman's rootless > support. > > Running multiple processes on one system, however, still gives me an > error. If I disable vader I guess that Open MPI is using TCP for > localhost communication and that works. But with vader it fails. > > The first error message I get is a segfault: > > [test1:1] *** Process received signal *** > [test1:1] Signal: Segmentation fault (11) > [test1:1] Signal code: Address not mapped (1) > [test1:1] Failing at address: 0x7fb7b1552010 > [test1:1] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80] > [test1:1] [ 1] > /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b] > [test1:1] [ 2] > /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb] > [test1:1] [ 3] > /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086] > [test1:1] [ 4] > /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d] > [test1:1] [ 5] /home/mpi/ring[0x400b76] > [test1:1] [ 6] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813] > [test1:1] [ 7] /home/mpi/ring[0x4008be] > [test1:1] *** End of error message *** > > Guessing that vader uses shared memory this is expected to fail, with > all the namespace isolations in place. Maybe not with a segfault, but > each container has its own shared memory. 
So next step was to use the > host's ipc and pid namespace and mount /dev/shm: > > '-v /dev/shm:/dev/shm --ipc=host --pid=host' > > Which does not segfault, but still does not look correct: > > Rank 0 has cleared MPI_Init > Rank 1 has cleared MPI_Init > Rank 2 has cleared MPI_Init > [test1:17722] Read -1, expected 8, errno = 1 > [test1:17722] Read -1, expected 8, errno = 1 > [test1:17722] Read -1, expected 8, errno = 1 > [test1:17722] Read -1, expected 8, errno = 1 > [test1:17722] Read -1, expected 8, errno = 1 > [test1:17722] Read -1, expected 8, errno = 1 > [test1:17722] Read -1, expected 8, errno = 1 > [test1:17722] Read -1, expected 8, errno = 1 > [test1:17722] Read -1, expected 8, errno = 1 > [test1:17722] Read -1, expected 8, errno = 1 > [test1:17722] Read -1, expected 8, errno = 1 > Rank 0 has completed ring > Rank 2 has completed ring > Rank 0 has completed MPI_Barrier > Rank 1 has completed ring > Rank 2 has completed MPI_Barrier > Rank 1 has completed MPI_Barrier > > This is using the Open MPI ring.c example with SIZE increased from 20 to > 2. > > Any recommendations what vader needs to communicate correctly? > >Adrian > >> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote: >> Gilles, >> >> thanks for pointing out the environment variables. I quickly created a >> wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables >> (grep "\(PMIX\|OMPI\)"). Now it works: >> >> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id >> --net=host mpi-test /home/mpi/hello >> >> Hello, world (2
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
So upstream Podman was really fast and merged a PR which makes my wrapper unnecessary: Add support for --env-host : https://github.com/containers/libpod/pull/3557 As commented in the PR I can now start mpirun with Podman without a wrapper: $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host --security-opt label=disable -v /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test /home/mpi/ring Rank 0 has cleared MPI_Init Rank 1 has cleared MPI_Init Rank 0 has completed ring Rank 0 has completed MPI_Barrier Rank 1 has completed ring Rank 1 has completed MPI_Barrier This example was using TCP and on an InfiniBand based system I have to map the InfiniBand devices into the container. $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun podman run --env-host -v /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable --userns=keep-id --device /dev/infiniband/uverbs0 --device /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test /home/mpi/ring Rank 0 has cleared MPI_Init Rank 1 has cleared MPI_Init Rank 0 has completed ring Rank 0 has completed MPI_Barrier Rank 1 has completed ring Rank 1 has completed MPI_Barrier This is all running without root and only using Podman's rootless support. Running multiple processes on one system, however, still gives me an error. If I disable vader I guess that Open MPI is using TCP for localhost communication and that works. But with vader it fails. The first error message I get is a segfault: [test1:1] *** Process received signal *** [test1:1] Signal: Segmentation fault (11) [test1:1] Signal code: Address not mapped (1) [test1:1] Failing at address: 0x7fb7b1552010 [test1:1] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80] [test1:1] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b] [test1:1] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb] [test1:1] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086] [test1:1] [ 4] /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d] [test1:1] [ 5] /home/mpi/ring[0x400b76] [test1:1] [ 6] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813] [test1:1] [ 7] /home/mpi/ring[0x4008be] [test1:1] *** End of error message *** Guessing that vader uses shared memory, this is expected to fail with all the namespace isolations in place. Maybe not with a segfault, but each container has its own shared memory. 
So next step was to use the host's ipc and pid namespace and mount /dev/shm:

'-v /dev/shm:/dev/shm --ipc=host --pid=host'

Which does not segfault, but still does not look correct:

Rank 0 has cleared MPI_Init
Rank 1 has cleared MPI_Init
Rank 2 has cleared MPI_Init
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
[test1:17722] Read -1, expected 8, errno = 1
Rank 0 has completed ring
Rank 2 has completed ring
Rank 0 has completed MPI_Barrier
Rank 1 has completed ring
Rank 2 has completed MPI_Barrier
Rank 1 has completed MPI_Barrier

This is using the Open MPI ring.c example with SIZE increased from 20 to 2.

Any recommendations what vader needs to communicate correctly?

Adrian

On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote:
> Gilles,
>
> thanks for pointing out the environment variables. I quickly created a
> wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables
> (grep "\(PMIX\|OMPI\)"). Now it works:
>
> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id
> --net=host mpi-test /home/mpi/hello
>
> Hello, world (2 procs total)
> --> Process # 0 of 2 is alive. ->test1
> --> Process # 1 of 2 is alive. ->test2
>
> I need to tell Podman to mount /tmp from the host into the container; as
> I am running rootless I also need to tell Podman to use the same user ID
> in the container as outside (so that the Open MPI files in /tmp can be
> shared), and I am also running without a network namespace.
>
> So this is now with the full Podman provided isolation except the
> network namespace. Thanks for your help!
>
> Adrian
>
> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users wrote:
> > Adrian,
> >
> > the MPI application relies on some environment variables (they typically
> > start with OMPI_ and PMIX_).
> >
> > The MPI
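"errno = 1" is EPERM, and the reads vader is complaining about point at its
cross-memory-attach (CMA, process_vm_readv) single-copy path being refused
between the containers, since each container runs in its own user namespace.
A minimal sketch of how one could take CMA out of the picture, assuming the
btl_vader_single_copy_mechanism MCA parameter is available in this Open MPI
build (the podman arguments are the ones from the message above, with the
/dev/shm and ipc/pid options added):

    # Sketch: disable vader's single-copy (CMA) path and rerun the same
    # command; if the "Read -1, expected 8, errno = 1" lines disappear,
    # the failing piece is process_vm_readv between the containers.
    # (btl_vader_single_copy_mechanism is assumed to exist in this build.)
    $ mpirun --mca btl_vader_single_copy_mechanism none --hostfile ~/hosts \
          --mca orte_tmpdir_base /tmp/podman-mpirun \
          podman run --env-host --security-opt label=disable \
          -v /tmp/podman-mpirun:/tmp/podman-mpirun -v /dev/shm:/dev/shm \
          --ipc=host --pid=host --userns=keep-id --net=host \
          mpi-test /home/mpi/ring

    # Alternative: exclude the vader BTL altogether with '--mca btl ^vader'
    # (same ^-exclusion syntax as the ^openib example earlier), which forces
    # TCP even between ranks on the same host.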
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
Not really a relevant reply; however, Nomad has task drivers for Docker and
Singularity:

https://www.hashicorp.com/blog/singularity-and-hashicorp-nomad-a-perfect-fit

I'm not sure if it would be easier to set up an MPI environment with Nomad,
though.

On Thu, 11 Jul 2019 at 11:08, Adrian Reber via users <users@lists.open-mpi.org> wrote:
> Gilles,
>
> thanks for pointing out the environment variables. I quickly created a
> wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables
> (grep "\(PMIX\|OMPI\)"). Now it works:
>
> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id
> --net=host mpi-test /home/mpi/hello
>
> Hello, world (2 procs total)
> --> Process # 0 of 2 is alive. ->test1
> --> Process # 1 of 2 is alive. ->test2
>
> I need to tell Podman to mount /tmp from the host into the container; as
> I am running rootless I also need to tell Podman to use the same user ID
> in the container as outside (so that the Open MPI files in /tmp can be
> shared), and I am also running without a network namespace.
>
> So this is now with the full Podman provided isolation except the
> network namespace. Thanks for your help!
>
> Adrian
>
> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users wrote:
> > Adrian,
> >
> > the MPI application relies on some environment variables (they typically
> > start with OMPI_ and PMIX_).
> >
> > The MPI application internally uses a PMIx client that must be able to
> > contact a PMIx server (that is included in mpirun and the orted daemon(s)
> > spawned on the remote hosts) located on the same host.
> >
> > If podman provides some isolation between the app inside the container
> > (e.g. /home/mpi/hello) and the outside world (e.g. mpirun/orted), that
> > won't be an easy ride.
> >
> > Cheers,
> >
> > Gilles
> >
> > On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
> > > I did a quick test to see if I can use Podman in combination with Open
> > > MPI:
> > >
> > > [test@test1 ~]$ mpirun --hostfile ~/hosts podman run quay.io/adrianreber/mpi-test /home/mpi/hello
> > >
> > > Hello, world (1 procs total)
> > > --> Process # 0 of 1 is alive. ->789b8fb622ef
> > >
> > > Hello, world (1 procs total)
> > > --> Process # 0 of 1 is alive. ->749eb4e1c01a
> > >
> > > The test program (hello) is taken from
> > > https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
> > >
> > > The problem with this is that each process thinks it is process 0 of 1
> > > instead of
> > >
> > > Hello, world (2 procs total)
> > > --> Process # 1 of 2 is alive. ->test1
> > > --> Process # 0 of 2 is alive. ->test2
> > >
> > > My question is: how is the rank determined? What resources do I need to
> > > have in my container to correctly determine the rank?
> > >
> > > This is Podman 1.4.2 and Open MPI 4.0.1.
> > >
> > > Adrian
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
Gilles,

thanks for pointing out the environment variables. I quickly created a
wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables
(grep "\(PMIX\|OMPI\)"). Now it works:

$ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id --net=host mpi-test /home/mpi/hello

Hello, world (2 procs total)
--> Process # 0 of 2 is alive. ->test1
--> Process # 1 of 2 is alive. ->test2

I need to tell Podman to mount /tmp from the host into the container; as I
am running rootless I also need to tell Podman to use the same user ID in
the container as outside (so that the Open MPI files in /tmp can be shared),
and I am also running without a network namespace.

So this is now with the full Podman provided isolation except the network
namespace. Thanks for your help!

Adrian

On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users wrote:
> Adrian,
>
> the MPI application relies on some environment variables (they typically
> start with OMPI_ and PMIX_).
>
> The MPI application internally uses a PMIx client that must be able to
> contact a PMIx server (that is included in mpirun and the orted daemon(s)
> spawned on the remote hosts) located on the same host.
>
> If podman provides some isolation between the app inside the container
> (e.g. /home/mpi/hello) and the outside world (e.g. mpirun/orted), that
> won't be an easy ride.
>
> Cheers,
>
> Gilles
>
> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
> > I did a quick test to see if I can use Podman in combination with Open
> > MPI:
> >
> > [test@test1 ~]$ mpirun --hostfile ~/hosts podman run quay.io/adrianreber/mpi-test /home/mpi/hello
> >
> > Hello, world (1 procs total)
> > --> Process # 0 of 1 is alive. ->789b8fb622ef
> >
> > Hello, world (1 procs total)
> > --> Process # 0 of 1 is alive. ->749eb4e1c01a
> >
> > The test program (hello) is taken from
> > https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
> >
> > The problem with this is that each process thinks it is process 0 of 1
> > instead of
> >
> > Hello, world (2 procs total)
> > --> Process # 1 of 2 is alive. ->test1
> > --> Process # 0 of 2 is alive. ->test2
> >
> > My question is: how is the rank determined? What resources do I need to
> > have in my container to correctly determine the rank?
> >
> > This is Podman 1.4.2 and Open MPI 4.0.1.
> >
> > Adrian
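The wrapper script itself was not posted to the thread. Below is a minimal
sketch of what such a re-exporting wrapper could look like; only the grep
pattern and the idea of forwarding OMPI_/PMIX_ variables come from the mail
above, the rest (script structure, option handling) is an assumption, and
multi-line variable values are not handled:

    #!/bin/bash
    # Hypothetical wrapper sketch: forward every OMPI_*/PMIX_* variable that
    # mpirun placed in this process' environment into the container via
    # podman's -e option, then run podman with the arguments mpirun passed
    # to the wrapper (volume mounts, image name, program to execute).
    env_args=()
    while IFS= read -r entry; do
        env_args+=("-e" "$entry")
    done < <(env | grep "\(PMIX\|OMPI\)")
    exec podman run "${env_args[@]}" "$@"

With something like this saved as ./wrapper, the invocation shown above
(mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id
--net=host mpi-test /home/mpi/hello) passes its remaining arguments straight
through to podman run.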
Re: [OMPI users] How is the rank determined (Open MPI and Podman)
Adrian,

the MPI application relies on some environment variables (they typically
start with OMPI_ and PMIX_).

The MPI application internally uses a PMIx client that must be able to
contact a PMIx server (that is included in mpirun and the orted daemon(s)
spawned on the remote hosts) located on the same host.

If podman provides some isolation between the app inside the container
(e.g. /home/mpi/hello) and the outside world (e.g. mpirun/orted), that
won't be an easy ride.

Cheers,

Gilles

On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
> I did a quick test to see if I can use Podman in combination with Open
> MPI:
>
> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run quay.io/adrianreber/mpi-test /home/mpi/hello
>
> Hello, world (1 procs total)
> --> Process # 0 of 1 is alive. ->789b8fb622ef
>
> Hello, world (1 procs total)
> --> Process # 0 of 1 is alive. ->749eb4e1c01a
>
> The test program (hello) is taken from
> https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
>
> The problem with this is that each process thinks it is process 0 of 1
> instead of
>
> Hello, world (2 procs total)
> --> Process # 1 of 2 is alive. ->test1
> --> Process # 0 of 2 is alive. ->test2
>
> My question is: how is the rank determined? What resources do I need to
> have in my container to correctly determine the rank?
>
> This is Podman 1.4.2 and Open MPI 4.0.1.
>
> Adrian
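A rough way to see this in practice (not from the thread; the exact variable
names vary by Open MPI/PMIx version, and this assumes the mpi-test image
contains a shell and grep) is to compare what the launcher provides on the
host with what survives inside the container:

    # On the host, mpirun hands each launched process its identity through
    # the environment (and a PMIx server contact it can reach), e.g.
    # OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_SIZE, PMIX_RANK and friends:
    $ mpirun -np 2 sh -c 'env | grep "\(PMIX\|OMPI\)"'

    # Launched through podman without forwarding anything, the same grep
    # comes back empty inside the container, which is why the containerized
    # hello reports itself as process 0 of 1:
    $ mpirun -np 2 podman run quay.io/adrianreber/mpi-test sh -c 'env | grep "\(PMIX\|OMPI\)"'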
[OMPI users] How is the rank determined (Open MPI and Podman)
I did a quick test to see if I can use Podman in combination with Open MPI:

[test@test1 ~]$ mpirun --hostfile ~/hosts podman run quay.io/adrianreber/mpi-test /home/mpi/hello

Hello, world (1 procs total)
--> Process # 0 of 1 is alive. ->789b8fb622ef

Hello, world (1 procs total)
--> Process # 0 of 1 is alive. ->749eb4e1c01a

The test program (hello) is taken from
https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c

The problem with this is that each process thinks it is process 0 of 1
instead of

Hello, world (2 procs total)
--> Process # 1 of 2 is alive. ->test1
--> Process # 0 of 2 is alive. ->test2

My question is: how is the rank determined? What resources do I need to have
in my container to correctly determine the rank?

This is Podman 1.4.2 and Open MPI 4.0.1.

Adrian
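As a quick diagnostic sketch (not something posted in the thread): the rank
and size do not come from the program itself but from the environment and
the PMIx connection that mpirun/orted set up for each process. Strip that
away and Open MPI initializes as a singleton, which matches the "0 of 1"
output above:

    # Hypothetical check: run the hello binary through podman without mpirun.
    # With no launcher-provided environment and no PMIx server reachable,
    # MPI_Init falls back to singleton mode, so the program reports a world
    # of 1 process with rank 0, the same behaviour seen when podman discards
    # the environment that mpirun provided.
    $ podman run quay.io/adrianreber/mpi-test /home/mpi/hello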