Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
Hi! Thanks for the reply. Actually, there was some problem with my downloaded version of Open MPI, but when I downloaded everything again and redid all the configure and make steps, it worked fine. Thanks a lot. And next time I will make sure that I give all the details.

Thanks
Pooja

> This is unfortunately not enough information to provide any help --
> the (lots of output) parts are pretty important. Can you provide all
> the information cited here:
>
> http://www.open-mpi.org/community/help/
> [...]
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
This is unfortunately not enough information to provide any help -- the (lots of output) parts are pretty important. Can you provide all the information cited here:

http://www.open-mpi.org/community/help/

On Apr 14, 2007, at 11:36 PM, po...@cc.gatech.edu wrote:

> Right now I am just trying to install the normal openmpi (without
> using all development header files). But it is still giving me some
> error.
> [...]
> make[2]: *** [libopen-pal.la] Error 1
> [...]
> Also the dev_openmpi folder is empty.
> [...]
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
Hi! Thanks for the help!

Right now I am just trying to install the normal Open MPI (without using all the development header files), but it is still giving me some errors. I have downloaded the developer version from the openmpi.org site. Then I ran:

./configure --prefix=/net/hc293/pooja/dev_openmpi
(lots of output)
make all install
(lots of output)

and got an error: ld returned 1 exit status

make[2]: *** [libopen-pal.la] Error 1
make[2]: Leaving directory `/net/hc293/pooja/openmpi-1.2.1a0r14362-dev/opal'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/net/hc293/pooja/openmpi-1.2.1a0r14362-dev/opal'
make: *** [all-recursive] Error 1

Also, the dev_openmpi folder is empty, so I am not able to compile the normal ring_c.c example either.

Please help.

Thanks and Regards
Pooja

> Configure with the --with-devel-headers switch. This will install
> all the developer headers.
>
> If you care, check out "./configure --help" -- that shows all the
> options available to the configure script (including
> --with-devel-headers).
> [...]
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
Configure with the --with-devel-headers switch. This will install all the developer headers.

If you care, check out "./configure --help" -- that shows all the options available to the configure script (including --with-devel-headers).

On Apr 13, 2007, at 7:36 PM, po...@cc.gatech.edu wrote:

> I have downloaded the developer version of source code by downloading
> a nightly Subversion snapshot tarball, and have installed the openmpi
> using
>
> ./configure --prefix=/usr/local
> make all install
>
> But I want to install with all the development headers, so that I can
> write an application that can use Ompi internal headers.
> [...]

-- 
Jeff Squyres
Cisco Systems
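[For reference, the full sequence Jeff is describing would look something like this -- the prefix path is illustrative; --with-devel-headers is the only addition to a normal build:

./configure --prefix=$HOME/ompi-devel --with-devel-headers
make all install

With that install in place, an application built against this tree can include OMPI-internal headers such as ompi/mca/btl/btl.h.]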
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
Hi,

I have downloaded the developer version of the source code by downloading a nightly Subversion snapshot tarball, and have installed Open MPI using:

./configure --prefix=/usr/local
make all install

But I want to install with all the development headers, so that I can write an application that can use OMPI internal headers.

Thanks and Regards
Pooja

> Yes, you can "sorta" call BTL's directly from application programs
> (are you trying to use MPI alongside other communication libraries,
> and using the BTL components as a sample?), but there are issues
> involved with this.
>
> First, you need to install Open MPI with all the development headers.
> [...]
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
On Apr 2, 2007, at 10:23 AM, Jeff Squyres wrote:

> 2. Go through the BML instead -- the BTL Management Layer. This is
> essentially a multiplexor for all the BTLs that have been
> instantiated. [...]
>
> There is also currently no mechanism to get the BML and BTL pointers
> that were instantiated by the PML. However, if you're just doing
> proof-of-concept code, you can extract these directly from the MPI
> layer's global variables to see how this stuff works.

Jumping in late to the conversation, and on an unimportant point for what Pooja really wants to do, but... The BTL really can't be used directly at this point -- you have to use the BML interface to get data pointers and the like. There's never any need to grab anything from the PML or global structures. The BML information is contained on a pointer on the ompi_proc_t structure associated with each peer. The list of peers can be accessed with the ompi_proc_world() call.

Hope this helps,

Brian
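[A rough proof-of-concept sketch of what Brian describes might look like the following. This is a guess at the 1.2-era internals, not a stable API: the header paths and especially the name of the BML endpoint field on ompi_proc_t (assumed here to be proc_bml) should be checked against ompi/proc/proc.h and ompi/mca/bml/bml.h.

/* Hypothetical sketch: walk the peer list and pick up each peer's
 * BML endpoint. Requires an install with --with-devel-headers.
 * 'proc_bml' is an assumption -- verify against ompi/proc/proc.h. */
#include "ompi/proc/proc.h"
#include "ompi/mca/bml/bml.h"

static void walk_peers(void)
{
    size_t nprocs = 0;
    ompi_proc_t **procs = ompi_proc_world(&nprocs);  /* list of peers */

    for (size_t i = 0; i < nprocs; ++i) {
        /* The BML endpoint hangs off each ompi_proc_t; from it the
         * bml.h interface reaches the per-BTL data pointers. */
        struct mca_bml_base_endpoint_t *ep =
            (struct mca_bml_base_endpoint_t *) procs[i]->proc_bml;
        (void) ep;  /* ... use the bml.h calls on 'ep' here ... */
    }
}]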
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
On Apr 3, 2007, at 4:57 PM, po...@cc.gatech.edu wrote:

> I need to find when the underlying network is free. Means I dont need
> to go into the details of how MPi_send is implemented.

Ah, ok. That explains a lot.

> What I want to know is when the MPI_Send is started. Or rather when
> MPi does not use the underlying network. I need to find timing for
> 1) When the application issue send command

This (and #5) can be implemented with a PMPI-based intercept library (I assume that by "command", you mean "API function call").

> 2) When Mpi actually issues send command
> 3) When does BTl perform atual transfer(send)

What are you looking to distinguish here? I.e., what is the difference between 1 and 2 vs. 3?

Open MPI has an MPI_Send() function in C that does some error checking and then invokes an underlying "send" function (via function pointer) to a plugin that starts doing the setup for the MPI semantics for the send. Eventually, another function pointer is used to invoke the "send" function in the BTL to actually send the message. More setup is performed down in the BTL (usually dealing with setting up data structures to invoke the underlying network/OS/driver "send" function that starts the network send), and then we invoke some underlying OS/kernel-bypass function to start the network transfer. Note that all we can guarantee is that the transfer starts sometime after that -- there's no way to know *exactly* when it starts because the underlying kernel driver may choose to defer it for a while based on flow control, available resources, etc.

Specifically, similar to one of my prior e-mails, the calling structure is something like this:

MPI_Send()
--> PML plugin (usually the "ob1" plugin)
--> BTL plugin (one of the components in the ompi/mca/btl/ directory)
--> underlying OS/kernel-bypass function

> 4) When doe send complete

By "complete", what exactly are you looking for? There are several possible definitions here:

- when any of the "send" functions listed above returns
- when the underlying network driver tells us that it is complete (a.k.a. "local completion" -- it *DOES NOT* imply that the receiver has even started to receive the message, nor that the message has even left the host yet)
- when the receiver ACKs receiving the message
- when MPI_Send() returns

FWIW, we usually measure local completion time because that's all that we can know (the underlying network driver makes its own decisions about when messages are put out on the network, etc., and we [i.e., any user-level networking software] don't have visibility of that information).

> 5) Who was thr receiver. etc.
> this was an example of MPi_send. like this I need to know MPI_Isend,
> broadcast etc. I guess this can be done using PMPI.

Some of this can, yes.

> But PMPI can do it during profile stages while I want all this data
> during runtime.

I don't quite understand this statement -- PMPI is a run-time profiling system. All it does is insert your shim PMPI layer between the user's application and the "real" MPI layer.

> So that I can improve the performance of the system while using that
> ideal time.

What I'm piecing together from your e-mails is that you want to use MPI in conjunction with using the network directly, either through the BTLs or some other communication library (i.e., both MPI and your other communication library will share the same network resources), and you're trying to find out when MPI is not using a particular network resource so that you can use it with your other communication library in order to maximize utilization and minimize contention / congestion. Is that correct?

> Well I/o used is Lustre (its ROMIO).

Note that ROMIO uses a fair bit of MPI sending and receiving before using the underlying file system. So you'll have at least 2 layers of software to explore to find out when the network is free/busy.

> What I mean by I/O node is nodes that does input and ouput processing
> i.e they write to lustre and compute node just transfer data to i/o
> node to write it in Lustre. Compute node does not have memory at all.
> So when ever they have something to write it gets transfered to I/o
> node and then I/o node does read and write.

Ok. I'm guessing/assuming that this is multiplexing that is either done in ROMIO or in Lustre itself.

> So when MPi_send is not issued the the network (Infiniband
> interconnect) can be used for some other transfer.

Makes sense.

> Can anyone help me wih how to go abt tracing this at run time?

The BTL plugin that you will be concerned with is the "openib" BTL (in the Open MPI source tree: ompi/mca/btl/openib/*) -- assuming that you are using an OpenFabrics/OFED-based network driver on your nodes (if you're using an older mvapi-based network driver, you'll use the mvapi BTL: ompi/mca/btl/mvapi/* -- but I would not recommend this because all c
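[To make the PMPI approach concrete, here is a minimal sketch of the kind of intercept library being discussed -- the logging format is made up for the example, and the same pattern extends to MPI_Isend, the collectives, etc.:

/* pmpi_log.c -- link ahead of the MPI library. This MPI_Send
 * intercepts the application's call, records wall-clock timing and
 * the receiver's rank, then forwards to the real implementation via
 * PMPI_Send. Note that the interval measured here ends when
 * MPI_Send returns -- i.e., local completion at the MPI level, per
 * the caveats above. */
#include <mpi.h>
#include <stdio.h>

int MPI_Send(void *buf, int count, MPI_Datatype dtype,
             int dest, int tag, MPI_Comm comm)
{
    double start = MPI_Wtime();
    int rc = PMPI_Send(buf, count, dtype, dest, tag, comm);
    fprintf(stderr, "MPI_Send to rank %d took %f s\n",
            dest, MPI_Wtime() - start);
    return rc;
}

Compile it together with the application (e.g., "mpicc app.c pmpi_log.c") and every MPI_Send is logged at run time, with no change to the application source.]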
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
On Apr 3, 2007, at 3:07 PM, Li-Ta Lo wrote:

>> Well, that's a good question. At the moment, the only environments
>> where we encounter multiple cores treat each core as a separate
>> "slot" when they assign resources. We don't currently provide an
>> option that says "map by two", so the only way to do what you
>> describe would be to manually specify the mapping, slot by slot.
>
> I also don't understand how Paffinity works for this case. When orted
> launches N processes on a node, does it have control over how those
> processes are started and mapped to the cores/processors? Or is it
> the case that the O.S. puts the process on whatever cores it picks
> and the paffinity module will try to "pin" the process on the core
> (picked by the O.S.)?

Check out these 3 FAQ entries:

http://www.open-mpi.org/faq/?category=tuning#paffinity-defs
http://www.open-mpi.org/faq/?category=tuning#maffinity-defs
http://www.open-mpi.org/faq/?category=tuning#using-paffinity

We *only* have 1 lame way of doing paffinity right now -- we start pinning processes to processors starting with processor ID 0.

>> If someone cares to suggest some alternative notation/option for
>> requesting that kind of mapping flexibility, I'm certainly willing
>> to implement it (it would be rather trivial to do "map by N", but
>> might be more complicated if you want other things).
>
> What is the current syntax of the config file/command line? Can we do
> something like array indexing in those script languages, e.g.
> [0:N:2]?

There is no syntax for the command line -- this is a discussion that we developers have gotten into deadlock over several times. It's a problem that we'd like to solve, but every time we talk about it, we deadlock and then move on to other higher-priority items. :-\

I take it to mean that "[0:N:2]" (ditching the [] would probably be good, because those would need to be escaped on the command line -- probably "--paffinity 0:N:2" or something would be sufficient) would be "start with core 0, end with core N, and step by 2 cores". Right?

This is fine, and similar things have been suggested before. The problem with it is when you want to specify by socket, and not by core. Additionally, there can be an ambiguity in Linux -- core 0 is always the first core on the first socket. But where is core 1? It could be the 2nd core on the 1st socket, or it could be the 1st core on the 2nd socket -- it depends on BIOS settings (IIRC). Additionally, Solaris processor ID numbering does not necessarily start with 0, nor is it necessarily contiguous.

So we probably need an OMPI-specific syntax that specifically calls out cores and sockets and doesn't rely on making assumptions about the underlying numbering/labeling (analogous to LAM's C/N notation). But then the problem gets even harder, because we need to also mix this in with slots and nodes. I.e., what do --byslot and --bynode mean in conjunction with this syntax? Should they be illegal? How can you specify a sequence of specific cores where you want processes to go if they're in an irregular pattern? What does it mean to oversubscribe in these scenarios?

...these are some of the questions that we would debate about. We haven't really found a good syntax that answers all of them. Galen Shipman had a promising syntax at one point, but I've lost the specs of it... If you wander down to his office, he might be able to dig it up for you...?

-- 
Jeff Squyres
Cisco Systems
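[For intuition, what a "0:N:2" request would boil down to on Linux can be sketched with the affinity API that a paffinity component ultimately wraps. This is an illustrative, Linux-specific sketch, not OMPI code -- and whether "every other logical ID" really means "one core per socket" is exactly the BIOS-numbering caveat Jeff raises above:

/* pin_alternate.c -- pin the calling process to cores 0, 2, 4, ... */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t mask;
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);

    CPU_ZERO(&mask);
    for (long c = 0; c < ncores; c += 2) {
        CPU_SET(c, &mask);              /* every other logical core */
    }
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");    /* pid 0 = calling process */
        return 1;
    }
    /* ... the real work now runs only on the selected cores ... */
    return 0;
}]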
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
Hi,

I need to find when the underlying network is free; I don't need to go into the details of how MPI_Send is implemented. What I want to know is when the MPI_Send is started, or rather when MPI does not use the underlying network. I need to find timing for:

1) when the application issues the send command
2) when MPI actually issues the send command
3) when the BTL performs the actual transfer (send)
4) when the send completes
5) who the receiver was

etc. This was an example for MPI_Send; likewise I need to know about MPI_Isend, broadcast, etc. I guess this can be done using PMPI, but PMPI can do it during profile stages while I want all this data during runtime, so that I can improve the performance of the system by using that idle time.

The I/O used is Lustre (it's ROMIO). What I mean by an I/O node is a node that does input and output processing, i.e., it writes to Lustre; a compute node just transfers data to an I/O node to write it in Lustre. Compute nodes do not have memory at all, so whenever they have something to write, it gets transferred to an I/O node, and then the I/O node does the read and write.

So when MPI_Send is not issued, the network (InfiniBand interconnect) can be used for some other transfer.

Can anyone help me with how to go about tracing this at run time?

Please help,
Pooja

> By "I/O", do you mean stdin/stdout/stderr, or other file I/O?
> [...]
> So before we go any further -- can you, as precisely as possible,
> describe exactly what information you're looking for?
> [...]
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
On Tue, 2007-04-03 at 12:33 -0600, Ralph H Castain wrote:

> Well, that's a good question. At the moment, the only environments
> where we encounter multiple cores treat each core as a separate
> "slot" when they assign resources. We don't currently provide an
> option that says "map by two", so the only way to do what you
> describe would be to manually specify the mapping, slot by slot.

I also don't understand how Paffinity works for this case. When orted launches N processes on a node, does it have control over how those processes are started and mapped to the cores/processors? Or is it the case that the O.S. puts the process on whatever cores it picks and the paffinity module will try to "pin" the process on the core (picked by the O.S.)?

> If someone cares to suggest some alternative notation/option for
> requesting that kind of mapping flexibility, I'm certainly willing
> to implement it (it would be rather trivial to do "map by N", but
> might be more complicated if you want other things).

What is the current syntax of the config file/command line? Can we do something like array indexing in those script languages, e.g. [0:N:2]?

Ollie
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
On 4/3/07 9:32 AM, "Li-Ta Lo" wrote:

> How does this work in a multi-core environment? The optimal way may
> be putting processes on every other "slot" on a two-core system?

Well, that's a good question. At the moment, the only environments where we encounter multiple cores treat each core as a separate "slot" when they assign resources. We don't currently provide an option that says "map by two", so the only way to do what you describe would be to manually specify the mapping, slot by slot.

Not very pretty.

If someone cares to suggest some alternative notation/option for requesting that kind of mapping flexibility, I'm certainly willing to implement it (it would be rather trivial to do "map by N", but might be more complicated if you want other things).

Ralph
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
On Sun, 2007-04-01 at 13:12 -0600, Ralph Castain wrote:

> 2. [...] You will find that done in the orte/mca/rmaps framework. We
> currently only have one component in that framework - the round-robin
> implementation - that maps either by slot or by node, as indicated by
> the user.

How does this work in a multi-core environment? The optimal way may be putting processes on every other "slot" on a two-core system?

Ollie
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
On Apr 3, 2007, at 9:07 AM, po...@cc.gatech.edu wrote:

> Actually I am working on the course project in which I am running a
> huge computational intensive code. I am running this code on cluster.
> Now my work is to find out when does the process send control
> messages (e.g. compute process to I/O process indicating I/O data is
> ready)

By "I/O", do you mean stdin/stdout/stderr, or other file I/O?

If you mean stdin/stdout/stderr, this is handled by the IOF (I/O Forwarding) framework/components in Open MPI. It's somewhat complicated, system-level code involving logically multiplexing data sent across pipes to sockets (i.e., local process(es) to remote process(es)).

If you mean MPI-2 file I/O, you want to look at the ROMIO package; it handles all the MPI-2 API for I/O.

Or do you mean "I/O" such as normal MPI messages (such as those generated by MPI_SEND and MPI_RECV)? FWIW, we normally refer to these as MPI messages, not really "I/O" (we typically reserve the term "I/O" for file I/O and/or stdin/stdout/stderr).

Which do you mean?

> and when does they send actual data (e.g I/O nodes fetching actual
> data that is to be transfered.)

This seems to imply that you're talking about parallel/network filesystems. I have to admit that I'm now quite confused about what you're asking for. :-)

> And I have to log the timing and duration in other file.

If you need to log the timing and duration of MPI calls, this is pretty easy to do with the PMPI interface -- you can intercept all MPI calls, log whatever information you want to log, invoke the underlying MPI function to do the real work, and then log the duration.

> For this I need to know the States of Open MPi (Control messges)
> So that I can simply put print statements in Open MPi code and find
> out how it works.

I would [strongly] advise using a debugger. Printf statements will only take you so far, and can be quite confusing in a parallel scenario -- especially when they can alter the timing of the system (i.e., Heisenberg kinds of effects).

> For this reason I was asking to know the state changes or atleast the
> way to find it out.

I'm still not clear on what state changes you're asking about.

From this e-mail and your prior e-mails, it *seems* like you're asking about how data gets from MPI_SEND in one process to MPI_RECV in another process. Is that right?

If so, I would not characterize the code that does this as a state machine in the traditional sense. Sure, as a computer program, it technically *is* a state machine that changes states according to assembly instructions, registers, etc., but we did not use generic state machine abstractions throughout the code base. In many places, there's simply a linear sequence of events -- not a re-entrant state machine.

So if you're asking how a user message gets from MPI_SEND in one process to MPI_RECV in another, we can describe that (it's a very complicated answer that depends on many factors, actually -- it is *not* a straightforward answer, not only because OMPI deals with many device/network types, but also because there can be many variables decided at run time that determine how a message is sent from a process to a peer).

So before we go any further -- can you, as precisely as possible, describe exactly what information you're looking for?

> Also my proff asked me to look into BTl transport layer to be used
> with MPi Api.

I described that in a prior e-mail.

-- 
Jeff Squyres
Cisco Systems
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
Hi,

Actually I am working on a course project in which I am running a huge, computationally intensive code on a cluster. My work is to find out when a process sends control messages (e.g., a compute process telling an I/O process that I/O data is ready) and when it sends actual data (e.g., I/O nodes fetching the actual data that is to be transferred). And I have to log the timing and duration in another file.

For this I need to know the states of Open MPI (control messages), so that I can simply put print statements in the Open MPI code and find out how it works. For this reason I was asking to know the state changes, or at least the way to find them out.

Also, my prof asked me to look into the BTL transport layer to be used with the MPI API.

I hope you will help.

Thanks and Regards
Pooja

> Yes, you can "sorta" call BTL's directly from application programs
> (are you trying to use MPI alongside other communication libraries,
> and using the BTL components as a sample?), but there are issues
> involved with this.
> [...]
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
On Apr 1, 2007, at 3:12 PM, Ralph Castain wrote:

> I can't help you with the BTL question. On the others:

Yes, you can "sorta" call BTL's directly from application programs (are you trying to use MPI alongside other communication libraries, and using the BTL components as a sample?), but there are issues involved with this.

First, you need to install Open MPI with all the development headers. Open MPI normally only installs "mpi.h" and a small number of other headers; installing *all* the headers will allow you to write applications that use OMPI's internal headers (such as btl.h) while developing outside of the Open MPI source tree.

Second, you probably won't want to access the BTL's directly. To make this make sense, here's how the code is organized (even if the specific call sequence is not exactly this layered for performance/optimization reasons):

MPI layer (e.g., MPI_SEND)
-> PML
-> BML
-> BTL

You have two choices:

1. Go through the PML instead (this is what we do in the MPI collectives, for example) -- but this imposes MPI semantics on sending and receiving, which assumedly you are trying to avoid. Check out ompi/mca/pml/pml.h.

2. Go through the BML instead -- the BTL Management Layer. This is essentially a multiplexor for all the BTLs that have been instantiated. I'm guessing that this is what you want to do (remember that OMPI has true multi-device support; using the BML and multiple BTLs is one of the ways that we do this). Have a look at ompi/mca/bml/bml.h for the interface.

There is also currently no mechanism to get the BML and BTL pointers that were instantiated by the PML. However, if you're just doing proof-of-concept code, you can extract these directly from the MPI layer's global variables to see how this stuff works.

To have full interoperability of the underlying BTLs between multiple upper-layer communication libraries (e.g., between OMPI and something else) is something that we have talked about a little, but have not done much work on.

To see the BTL interface (just for completeness), see ompi/mca/btl/btl.h.

You can probably see the pattern here... In all of Open MPI's frameworks, the public interface is in <project>/mca/<framework>/<framework>.h, where <project> is one of opal, orte, or ompi, and <framework> is the name of the framework.

> 1. states are reported via the orte/mca/smr framework. You will see
> the states listed in orte/mca/smr/smr_types.h. We track both process
> and job states. Hopefully, the state names will be somewhat self-
> explanatory and indicative of the order in which they are traversed.
> The job states are set when *all* of the processes in the job reach
> the corresponding state.

Note that these are very coarse-grained process-level states (e.g., is a given process running or not?). It's not clear what kind of states you were asking about -- the Open MPI code base has many internal state machines for various message passing and other mechanisms.

What information are you looking for, specifically?

> 2. I'm not sure what you mean by mapping MPI processes to "physical"
> processes, but I assume you mean how do we assign MPI ranks to
> processes on specific nodes. You will find that done in the
> orte/mca/rmaps framework. We currently only have one component in
> that framework - the round-robin implementation - that maps either by
> slot or by node, as indicated by the user. That code is fairly
> heavily commented, so you hopefully can understand what it is doing.
>
> Hope that helps!
> Ralph
>
> On 4/1/07 1:32 PM, "po...@cc.gatech.edu" wrote:
> [...]

-- 
Jeff Squyres
Cisco Systems
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
Hi Pooja

What did you do to make your prof dislike you so much??? :-)

These are, to say the least, major tasks you have been describing. I've seen developers on our team spend months trying to really understand even one or two of the issues raised in your various emails, let alone make any kind of changes...

I can't help you with the BTL question. On the others:

1. states are reported via the orte/mca/smr framework. You will see the states listed in orte/mca/smr/smr_types.h. We track both process and job states. Hopefully, the state names will be somewhat self-explanatory and indicative of the order in which they are traversed. The job states are set when *all* of the processes in the job reach the corresponding state.

2. I'm not sure what you mean by mapping MPI processes to "physical" processes, but I assume you mean how do we assign MPI ranks to processes on specific nodes. You will find that done in the orte/mca/rmaps framework. We currently only have one component in that framework - the round-robin implementation - that maps either by slot or by node, as indicated by the user. That code is fairly heavily commented, so you hopefully can understand what it is doing.

Hope that helps!
Ralph

On 4/1/07 1:32 PM, "po...@cc.gatech.edu" wrote:

> Hi
> I am Pooja and I am working on a course project which requires me
> -> to track the internal state changes of MPI and need me to figure
> out how does ORTE maps MPi Process to actual physical processes
> -> Also I need to find way to get BTL transports work directly with
> MPI level calls.
> I just want to know is this posible and if yes what procedure I
> should follow or I should look into which files (for change).
>
> Please Help
>
> Thanks and Regards
> Pooja
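[To make the round-robin mapping Ralph describes concrete, here is a hypothetical two-node layout. Hostnames and slot counts are illustrative, and the flag spellings are from the 1.2-era mpirun options:

# hostfile "myhosts": two nodes, two slots (cores) each
node0 slots=2
node1 slots=2

mpirun -np 4 --hostfile myhosts --byslot ./a.out
    # ranks 0,1 -> node0; ranks 2,3 -> node1
mpirun -np 4 --hostfile myhosts --bynode ./a.out
    # ranks 0,2 -> node0; ranks 1,3 -> node1

--byslot fills each node's slots before moving on; --bynode deals ranks out one per node, round-robin.]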