Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-15 Thread pooja
Hi!!
Thanks for the reply. Actually there was some problem with my downloaded
version of Open MPI. But when I downloaded everything again and re-ran all
the configure and make steps, it worked fine.

Thanks a lot.
And next time I will make sure that I give all details.

Thanks
Pooja




> This is unfortunately not enough information to provide any help --
> the (lots of output) parts are pretty important.  Can you provide all
> the information cited here:
>
>  http://www.open-mpi.org/community/help/
>
>
> On Apr 14, 2007, at 11:36 PM, po...@cc.gatech.edu wrote:
>
>> Hi!!!
>> Thanks for help!!!
>>
>> Right now I am just trying to install the normal Open MPI (without
>> using all
>> the development header files).
>> But it is still giving me an error.
>> I have downloaded the developer version from the openmpi.org site.
>> Then I gave
>> ./configure --prefix=/net/hc293/pooja/dev_openmpi
>> (lots of output)
>> make all install
>> (lots of output )
>> and error :ld returned 1 exit status
>> make[2]: *** [libopen-pal.la] Error 1
>> make[2]: Leaving directory `/net/hc293/pooja/openmpi-1.2.1a0r14362-
>> dev/opal'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/net/hc293/pooja/openmpi-1.2.1a0r14362-
>> dev/opal'
>> make: *** [all-recursive] Error 1
>>
>>
>>
>> Also the dev_openmpi folder is empty.
>>
>> So I am not able to compile the normal ring_c.c example either.
>>
>> Please help
>>
>> Thanks and Regards
>> Pooja
>>
>>
>>
>>
>>
>>
>>> Configure with the --with-devel-headers switch.  This will install
>>> all the developer headers.
>>>
>>> If you care, check out "./configure --help" -- that shows all the
>>> options available to the configure script (including --with-devel-
>>> headers).
>>>
>>>
>>> On Apr 13, 2007, at 7:36 PM, po...@cc.gatech.edu wrote:
>>>
 Hi

 I have downloaded the developer version of the source code by
 downloading a
 nightly Subversion snapshot tarball, and have installed Open MPI.
 Using

 ./configure --prefix=/usr/local
 make all install.

 But I want to install with all the development headers, so that I
 can write
 an application that can use OMPI internal headers.


 Thanks and Regards
 Pooja





> On Apr 1, 2007, at 3:12 PM, Ralph Castain wrote:
>
>> I can't help you with the BTL question. On the others:
>
> Yes, you can "sorta" call BTL's directly from application programs
> (are you trying to use MPI alongside other communication libraries,
> and using the BTL components as a sample?), but there are issues
> involved with this.
>
> First, you need to install Open MPI with all the development
> headers.  Open MPI normally only installs "mpi.h" and a small
> number
> of other headers; installing *all* the headers will allow you to
> write
> applications that use OMPI's internal headers (such as btl.h) while
> developing outside of the Open MPI source tree.
>
> Second, you probably won't want to access the BTL's directly.  To
> make this make sense, here's how the code is organized (even if the
> specific call sequence is not exactly this layered for performance/
> optimization reasons):
>
> MPI layer (e.g., MPI_SEND)
>   -> PML
>     -> BML
>       -> BTL
>
> You have two choices:
>
> 1. Go through the PML instead (this is what we do in the MPI
> collectives, for example) -- but this imposes MPI semantics on
> sending and receiving, which assumedly you are trying to avoid.
> Check out ompi/mca/pml/pml.h.
>
> 2. Go through the BML instead -- the BTL Management Layer.  This is
> essentially a multiplexor for all the BTLs that have been
> instantiated.  I'm guessing that this is what you want to do
> (remember that OMPI has true multi-device support; using the BML
> and
> multiple BTLs is one of the ways that we do this).  Have a look at
> ompi/mca/bml/bml.h for the interface.
>
> There is also currently no mechanism to get the BML and BTL
> pointers
> that were instantiated by the PML.  However, if you're just doing
> proof-of-concept code, you can extract these directly from the MPI
> layer's global variables to see how this stuff works.
>
> To have full interoperability of the underlying BTLs and between
> multiple upper-layer communication libraries (e.g., between OMPI
> and
> something else) is something that we have talked about a little,
> but
> have not done much work on.
>
> To see the BTL interface (just for completeness), see ompi/mca/btl/
> btl.h.
>
> You can probably see the pattern here...  In all of Open MPI's
> frameworks, the public interface is in <project>/mca/<framework>/
> <framework>.h, where <project> is one of opal, orte, or ompi, and
> <framework> is the name of the framework.
>
>> 1. states are reported via the orte/mca/smr framework. You will
>> see

Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-13 Thread pooja
Hi

I have downloaded the developer version of the source code by downloading a
nightly Subversion snapshot tarball, and have installed Open MPI.
Using

./configure --prefix=/usr/local
make all install.

But I want to install with all the development headers, so that I can write
an application that can use OMPI internal headers.


Thanks and Regards
Pooja





> On Apr 1, 2007, at 3:12 PM, Ralph Castain wrote:
>
>> I can't help you with the BTL question. On the others:
>
> Yes, you can "sorta" call BTL's directly from application programs
> (are you trying to use MPI alongside other communication libraries,
> and using the BTL components as a sample?), but there are issues
> involved with this.
>
> First, you need to install Open MPI with all the development
> headers.  Open MPI normally only installs "mpi.h" and a small number
> of other headers; installing *all* the headers will allow you to write
> applications that use OMPI's internal headers (such as btl.h) while
> developing outside of the Open MPI source tree.
>
> Second, you probably won't want to access the BTL's directly.  To
> make this make sense, here's how the code is organized (even if the
> specific call sequence is not exactly this layered for performance/
> optimization reasons):
>
> MPI layer (e.g., MPI_SEND)
>   -> PML
>     -> BML
>       -> BTL
>
> You have two choices:
>
> 1. Go through the PML instead (this is what we do in the MPI
> collectives, for example) -- but this imposes MPI semantics on
> sending and receiving, which assumedly you are trying to avoid.
> Check out ompi/mca/pml/pml.h.
>
> 2. Go through the BML instead -- the BTL Management Layer.  This is
> essentially a multiplexor for all the BTLs that have been
> instantiated.  I'm guessing that this is what you want to do
> (remember that OMPI has true multi-device support; using the BML and
> multiple BTLs is one of the ways that we do this).  Have a look at
> ompi/mca/bml/bml.h for the interface.
>
> There is also currently no mechanism to get the BML and BTL pointers
> that were instantiated by the PML.  However, if you're just doing
> proof-of-concept code, you can extract these directly from the MPI
> layer's global variables to see how this stuff works.
>
> To have full interoperability of the underlying BTLs and between
> multiple upper-layer communication libraries (e.g., between OMPI and
> something else) is something that we have talked about a little, but
> have not done much work on.
>
> To see the BTL interface (just for completeness), see ompi/mca/btl/
> btl.h.
>
> You can probably see the pattern here...  In all of Open MPI's
> frameworks, the public interface is in <project>/mca/<framework>/
> <framework>.h, where <project> is one of opal, orte, or ompi, and
> <framework> is the name of the framework.
>
>> 1. states are reported via the orte/mca/smr framework. You will see
>> the
>> states listed in orte/mca/smr/smr_types.h. We track both process
>> and job
>> states. Hopefully, the state names will be somewhat self-
>> explanatory and
>> indicative of the order in which they are traversed. The job states
>> are set
>> when *all* of the processes in the job reach the corresponding state.
>
> Note that these are very coarse-grained process-level states (e.g.,
> is a given process running or not?).  It's not clear what kind of
> states you were asking about -- the Open MPI code base has many
> internal state machines for various message passing and other
> mechanisms.
>
> What information are you looking for, specifically?
>
>> 2. I'm not sure what you mean by mapping MPI processes to "physical"
>> processes, but I assume you mean how do we assign MPI ranks to
>> processes on
>> specific nodes. You will find that done in the orte/mca/rmaps
>> framework. We
>> currently only have one component in that framework - the round-robin
>> implementation - that maps either by slot or by node, as indicated
>> by the
>> user. That code is fairly heavily commented, so you hopefully can
>> understand
>> what it is doing.
>>
>> Hope that helps!
>> Ralph
>>
>>
>> On 4/1/07 1:32 PM, "po...@cc.gatech.edu"  wrote:
>>
>>> Hi
>>> I am Pooja and I am working on a course project which requires me
>>> -> to track the internal state changes of MPI and need me to
>>> figure out
>>> how ORTE maps MPI processes to actual physical processes
>>> -> Also I need to find a way to get BTL transports to work directly with
>>> MPI
>>> level calls.
>>> I just want to know: is this possible, and if yes, what procedure I
>>> should
>>> follow, or which files should I look into (for change)?
>>>
>>>
>>> Please Help
>>>
>>> Thanks and Regards
>>> Pooja
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> 

Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-04 Thread Brian Barrett

On Apr 2, 2007, at 10:23 AM, Jeff Squyres wrote:


On Apr 1, 2007, at 3:12 PM, Ralph Castain wrote:


I can't help you with the BTL question. On the others:


2. Go through the BML instead -- the BTL Management Layer.  This is
essentially a multiplexor for all the BTLs that have been
instantiated.  I'm guessing that this is what you want to do
(remember that OMPI has true multi-device support; using the BML and
multiple BTLs is one of the ways that we do this).  Have a look at
ompi/mca/bml/bml.h for the interface.

There is also currently no mechanism to get the BML and BTL pointers
that were instantiated by the PML.  However, if you're just doing
proof-of-concept code, you can extract these directly from the MPI
layer's global variables to see how this stuff works.

To have full interoperability of the underlying BTLs and between
multiple upper-layer communication libraries (e.g., between OMPI and
something else) is something that we have talked about a little, but
have not done much work on.

To see the BTL interface (just for completeness), see ompi/mca/btl/
btl.h.


Jumping in late to the conversation, and on an unimportant point for  
what Pooja really wants to do, but...


The BTL really can't be used directly at this point -- you have to  
use the BML interface to get data pointers and the like.  There's  
never any need to grab anything from the PML or global structures.   
The BML information is contained on a pointer on the ompi_proc_t  
structure associated with each peer.  The list of peers can be  
accessed with the ompi_proc_world() call.
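
For illustration, here is a minimal sketch of that access path. It assumes
an Open MPI 1.2-era tree configured --with-devel-headers; the field name
proc_bml on ompi_proc_t is an assumption based on the description above,
so check ompi/proc/proc.h and ompi/mca/bml/bml.h in your checkout for the
exact names.

    #include <stdlib.h>
    #include "ompi/proc/proc.h"
    #include "ompi/mca/bml/bml.h"

    static void inspect_peers(void)
    {
        size_t nprocs, i;

        /* ompi_proc_world() returns a malloc'ed array of pointers to
         * the proc structures for every peer in MPI_COMM_WORLD. */
        ompi_proc_t **procs = ompi_proc_world(&nprocs);
        if (NULL == procs) {
            return;
        }

        for (i = 0; i < nprocs; ++i) {
            /* The BML endpoint hangs off each peer's proc structure;
             * it multiplexes the BTL modules that can reach that peer.
             * (proc_bml is the assumed field name here.) */
            struct mca_bml_base_endpoint_t *ep =
                (struct mca_bml_base_endpoint_t *) procs[i]->proc_bml;
            if (NULL != ep) {
                /* ... interrogate the endpoint via the bml.h interface ... */
            }
        }
        free(procs);
    }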


Hope this helps,

Brian



Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-04 Thread Jeff Squyres

On Apr 3, 2007, at 4:57 PM, po...@cc.gatech.edu wrote:

I need to find out when the underlying network is free. That means I don't
need to

go into the details of how MPI_Send is implemented.


Ah, ok.  That explains a lot.

What I want to know is when the MPI_Send is started. Or rather, when
MPI

does not use the underlying network.

I need to find timing for
1) When the application issues the send command


This (and #5) can be implemented with a PMPI-based intercept library  
(I assume that by "command", you mean "API function call").
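
A minimal sketch of such an intercept library (the PMPI profiling
interface is standard MPI; the logging here is just illustrative):
compile this into a library, link it ahead of the application, and it
times each MPI_Send and records the destination rank (items 1, 4, and
5 below).

    #include <stdio.h>
    #include <mpi.h>

    /* The application's MPI_Send lands here; we forward to the real
     * implementation through the PMPI_ entry point and log around it. */
    int MPI_Send(void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();   /* when the application issued the send */
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);
        double t1 = MPI_Wtime();   /* when MPI_Send returned */
        fprintf(stderr, "MPI_Send to rank %d took %g s\n", dest, t1 - t0);
        return rc;
    }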



2) When MPI actually issues the send command
3) When the BTL performs the actual transfer (send)


What are you looking to distinguish here?  I.e., what is the  
difference between 1 and 2 vs. 3?


Open MPI has an MPI_Send() function in C that does some error  
checking and then invokes an underlying "send" function (via function  
pointer) to a plugin that starts doing the setup for the MPI  
semantics for the send.  Eventually, another function pointer is used  
to invoke the "send" function in the BTL to actually send the  
message.  More setup is performed down in the BTL (usually dealing  
with setting up data structures to invoke the underlying network/OS/ 
driver "send" function that starts the network send), and then we  
invoke some underlying OS/kernel-bypass function to start the network  
transfer.  Note that all we can guarantee is that the transfer start  
sometime after that -- there's no way to know *exactly* when it  
starts because the underlying kernel driver may choose to defer it  
for a while based on flow control, available resources, etc.


Specifically, similar to one of my prior e-mails, the calling  
structure is something like this:


MPI_Send()
  --> PML plugin (usually the "ob1" plugin)
    --> BTL plugin (one of the components in the ompi/mca/btl/ directory)
      --> underlying OS/kernel-bypass function


4) When does the send complete


By "complete", what exactly are you looking for?  There's several  
definitions possible here:


- when any of the "send" functions listed above returns
- when the underlying network driver tells us that it is complete  
(a.k.a. "local completion" -- it *DOES NOT* imply that the receiver  
has even started to receive the message, nor that the message has  
even left the host yet)

- when the receiver ACKs receiving the message
- when MPI_Send() returns

FWIW, we usually measure local completion time because that's all  
that we can know (because the underlying network driver makes its own  
decisions about when messages are put out on the network, etc., and  
we [i.e., any user-level networking software] don't have visibility  
of that information).



5) Who the receiver was.
etc. This was an example for MPI_Send;
like this, I need to know MPI_Isend, broadcast, etc.

I guess this can be done using PMPI.


Some of this can, yes.

But PMPI can do it during profile stages while I want all this data  
during

runtime.


I don't quite understand this statement -- PMPI is a run-time  
profiling system.  All it does is insert your shim PMPI layer between  
the user's application and the "real" MPI layer.


So that I can improve the performance of the system while using
that idle

time.


What I'm piecing together from your e-mails is that you want to use  
MPI in conjunction with using the network directly, either through  
the BTLs or some other communication library (i.e., both MPI and your  
other communication library will share the same network resources)  
and you're trying to find out when MPI is not using a particular  
network resource so that you can use it with your other communication  
library in order to maximize utilization and minimize contention /  
congestion.  Is that correct?


Is that right?


Well, the I/O used is Lustre (it's ROMIO).


Note that ROMIO uses a fair bit of MPI sending and receiving before  
using the underlying file system.  So you'll have at least 2 layers  
of software to explore to find out when the network is free/busy.


What I mean by I/O node is nodes that do input and output
processing, i.e.
they write to Lustre, and compute nodes just transfer data to the I/O
node to
write it in Lustre. Compute nodes do not have memory at all. So
whenever

they have something to write, it gets transferred to the I/O node,
and then the I/O node does the read and write.


Ok.  I'm guessing/assuming that this is multiplexing that is either  
done in ROMIO or in Lustre itself.


So when MPI_Send is not issued, the network (InfiniBand
interconnect)

can be used for some other transfer.


Makes sense.


Can anyone help me with how to go about tracing this at run time?


The BTL plugin that you will be concerned with is the "openib" BTL  
(in the Open MPI source tree: ompi/mca/btl/openib/*) -- assuming that  
you are using an OpenFabrics/OFED-based network driver on your nodes  
(if you're using an older mvapi-based network driver, you'll use the  
mvapi BTL: ompi/mca/btl/mvapi/* -- but I would not recommend this  
because all 

Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-04 Thread Jeff Squyres

On Apr 3, 2007, at 3:07 PM, Li-Ta Lo wrote:

Well, that's a good question. At the moment, the only environments  
where we
encounter multiple cores treat each core as a separate "slot" when  
they
assign resources. We don't currently provide an option that says  
"map by
two", so the only way to do what you describe would be to manually  
specify

the mapping, slot by slot.


I also don't understand how Paffinity work for this case. When orted
launch N processes on a node, does it have control on how those
processes are started and mapped to the core/processor? Or is it
the case that O.S. puts the process on whatever cores it picks and
the paffinity module will try to "pin" the process on the core (picked
by O.S.)?


Check out these 3 FAQ entries:

http://www.open-mpi.org/faq/?category=tuning#paffinity-defs
http://www.open-mpi.org/faq/?category=tuning#maffinity-defs
http://www.open-mpi.org/faq/?category=tuning#using-paffinity

We *only* have 1 lame way of doing paffinity right now -- we start  
pinning processes to processors starting with processor ID 0.
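
If you just want to turn that mode on, the knob (as covered in those
FAQ entries) is the mpi_paffinity_alone MCA parameter -- a usage sketch
from the 1.2 series; verify the parameter name with ompi_info on your
build:

    # Ask Open MPI to pin each local process, starting at processor ID 0:
    mpirun --mca mpi_paffinity_alone 1 -np 4 ./a.out

    # Confirm the parameter exists on your build:
    ompi_info --param mpi all | grep paffinity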


If someone cares to suggest some alternative notation/option for  
requesting
that kind of mapping flexibility, I'm certainly willing to  
implement it (it
would be rather trivial to do "map by N", but might be more  
complicated if

you want other things).


What is the current syntax of the config file/command line? Can we do
something like array index in those script languages e.g. [0:N:2]?


There is no syntax for the command line -- this is a discussion that  
we developers have gotten into deadlock over several times.  It's a  
problem that we'd like to solve, but every time we talk about it, we  
deadlock and then move on to other higher-priority items.  :-\


I take it to mean that "[0:N:2]" (ditching the [] would probably be  
good, because those would need to be escaped on the command line --  
probably "--paffinity 0:N:2" or something would be sufficient) would  
be "start with core 0, end with core N, and step by 2 cores".  Right?


This is fine, and similar things have been suggested before.  The  
problem with it is when you want to specify by socket, and not by  
core.  Additionally, there can be an ambiguity in Linux -- core 0 is  
always the first core on the first socket.  But where is core 1?  It  
could be the 2nd core on the 1st socket, or it could be the 1st core  
on the 2nd socket -- it depends on BIOS settings (IIRC).   
Additionally, Solaris processor ID number does not necessarily start  
with 0, nor is it necessarily contiguous.


So we probably need an OMPI-specific syntax that specifically calls  
out cores and sockets and doesn't rely on making assumptions about  
the underlying numbering/labeling (analogous to LAM's C/N notation).


But then the problem gets even harder, because we need to also mix  
this in with slots and nodes.  I.e., what does --byslot and --bynode  
mean in conjunction with this syntax?  Should they be illegal?


How can you specify a sequence of specific cores where you want  
processes to go if they're in an irregular pattern?


What does it mean to oversubscribe in these scenarios?

...these are some of the questions that we would debate about.  We  
haven't really found a good syntax that answers all of them.  Galen  
Shipman had a promising syntax at one point, but I've lost the specs  
of it...  If you wander down to his office, he might be able to dig  
it up for you...?


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-03 Thread pooja
Hi
I need to find out when the underlying network is free. That means I don't need to
go into the details of how MPI_Send is implemented.

What I want to know is when the MPI_Send is started. Or rather, when MPI
does not use the underlying network.

I need to find timing for
1) When the application issues the send command
2) When MPI actually issues the send command
3) When the BTL performs the actual transfer (send)
4) When does the send complete
5) Who the receiver was.
etc. This was an example for MPI_Send;
like this, I need to know MPI_Isend, broadcast, etc.

I guess this can be done using PMPI.
But PMPI can do it during profile stages while I want all this data during
runtime.
So that I can improve the performance of the system while using that idle
time.

Well, the I/O used is Lustre (it's ROMIO).
What I mean by I/O node is nodes that do input and output processing, i.e.
they write to Lustre, and compute nodes just transfer data to the I/O node to
write it in Lustre. Compute nodes do not have memory at all. So whenever
they have something to write, it gets transferred to the I/O node,
and then the I/O node does the read and write.


So when MPI_Send is not issued, the network (InfiniBand interconnect)
can be used for some other transfer.

Can anyone help me with how to go about tracing this at run time?

Please help
Pooja









> On Apr 3, 2007, at 9:07 AM, po...@cc.gatech.edu wrote:
>
>> Actually I am working on the course project in which I am running a
>> huge
>> computationally intensive code.
>> I am running this code on a cluster.
>> Now my work is to find out when the process sends control messages
>> (e.g. compute process to I/O process indicating I/O data is ready)
>
> By "I/O", do you mean stdin/stdout/stderr, or other file I/O?
>
> If you mean stdin/stdout/stderr, this is handled by the IOF (I/O
> Forwarding) framework/components in Open MPI.  It's somewhat
> complicated, system-level code involving logically multiplexing data
> sent across pipes to sockets (i.e., local process(es) to remote
> process(es)).
>
> If you mean MPI-2 file I/O, you want to look at the ROMIO package; it
> handles all the MPI-2 API for I/O.
>
> Or do you mean "I/O" such as normal MPI messages (such as those
> generated by MPI_SEND and MPI_RECV)?  FWIW, we normally refer to
> these as MPI messages, not really "I/O" (we typically reserve the
> term "I/O" for file IO and/or stdin/stdout/stderr).
>
> Which do you mean?
>
>> and when they send actual data (e.g. I/O nodes fetching the actual
>> data
>> that is to be transferred.)
>
> This seems to imply that you're talking about parallel/network
> filesystems.  I have to admit that I'm now quite confused about what
> you're asking for.  :-)
>
>> And I have to log the timing and duration in another file.
>
> If you need to log the timing and duration of MPI calls, this is
> pretty easy to do with the PMPI interface -- you can intercept all
> MPI calls, log whatever information you want to log, invoke the
> underlying MPI function to do the real work, and then log the duration.
>
>> For this I need to know the states of Open MPI (control messages)
>> So that I can simply put print statements in Open MPI code and find
>> out
>> how it works.
>
> I would [strongly] advise using a debugger.  Printf statements will
> only take you so far, and can be quite confusing in a parallel
> scenario -- especially when they can alter the timing of the system
> (i.e., Heisenberg kinds of effects).
>
>> For this reason I was asking to know the state changes or at least
>> the way
>> to find it out.
>
> I'm still not clear on what state changes you're asking about.
>
>  From this e-mail and your prior e-mails, it *seems* like you're
> asking about how data gets from MPI_SEND in one process to MPI_RECV
> in another process.  Is that right?
>
> If so, I would not characterize the code that does this as a state
> machine in the traditional sense.  Sure, as a computer program, it
> technically *is* a state machine that changes states according to
> assembly instructions, registers, etc., but we did not use generic
> state machine abstractions throughout the code base.  In many places,
> there's simply a linear sequence of events -- not a re-entrant state
> machine.
>
> So if you're asking how a user message gets from MPI_SEND in one
> process to MPI_RECV in another, we can describe that (it's a very
> complicated answer that depends on many factors, actually -- it is
> *not* a straightforward answer, not only because OMPI deals with many
> device/network types, but also because there can be many variables
> decided at run time that determine how a message is sent from a
> process to a peer).
>
> So before we go any further -- can you, as precisely as possible,
> describe exactly what information you're looking for?
>
>> Also my prof asked me to look into the BTL transport layer to be used
>> with
>> MPI API.
>
> I described that in a prior e-mail.
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> devel mailing list
> 

Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-03 Thread Li-Ta Lo
On Tue, 2007-04-03 at 12:33 -0600, Ralph H Castain wrote:
> 
> 
> On 4/3/07 9:32 AM, "Li-Ta Lo"  wrote:
> 
> > On Sun, 2007-04-01 at 13:12 -0600, Ralph Castain wrote:
> > 
> >> 
> >> 2. I'm not sure what you mean by mapping MPI processes to "physical"
> >> processes, but I assume you mean how do we assign MPI ranks to processes on
> >> specific nodes. You will find that done in the orte/mca/rmaps framework. We
> >> currently only have one component in that framework - the round-robin
> >> implementation - that maps either by slot or by node, as indicated by the
> >> user. That code is fairly heavily commented, so you hopefully can 
> >> understand
> >> what it is doing.
> >> 
> > 
> > How does this work in a multi-core environment? the optimal way may be
> > putting processes on every other "slot" on a two-core system?
> 
> Well, that's a good question. At the moment, the only environments where we
> encounter multiple cores treat each core as a separate "slot" when they
> assign resources. We don't currently provide an option that says "map by
> two", so the only way to do what you describe would be to manually specify
> the mapping, slot by slot.
> 

I also don't understand how Paffinity work for this case. When orted
launch N processes on a node, does it have control on how those 
processes are started and mapped to the core/processor? Or is it
the case that O.S. puts the process on whatever cores it picks and
the paffinity module will try to "pin" the process on the core (picked
by O.S.)?

> Not very pretty.
> 
> If someone cares to suggest some alternative notation/option for requesting
> that kind of mapping flexibility, I'm certainly willing to implement it (it
> would be rather trivial to do "map by N", but might be more complicated if
> you want other things).
> 

What is the current syntax of the config file/command line? Can we do 
something like array index in those script languages e.g. [0:N:2]?

Ollie




Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-03 Thread Ralph H Castain



On 4/3/07 9:32 AM, "Li-Ta Lo"  wrote:

> On Sun, 2007-04-01 at 13:12 -0600, Ralph Castain wrote:
> 
>> 
>> 2. I'm not sure what you mean by mapping MPI processes to "physical"
>> processes, but I assume you mean how do we assign MPI ranks to processes on
>> specific nodes. You will find that done in the orte/mca/rmaps framework. We
>> currently only have one component in that framework - the round-robin
>> implementation - that maps either by slot or by node, as indicated by the
>> user. That code is fairly heavily commented, so you hopefully can understand
>> what it is doing.
>> 
> 
> How does this work in a multi-core environment? the optimal way may be
> putting processes on every other "slot" on a two-core system?

Well, that's a good question. At the moment, the only environments where we
encounter multiple cores treat each core as a separate "slot" when they
assign resources. We don't currently provide an option that says "map by
two", so the only way to do what you describe would be to manually specify
the mapping, slot by slot.

Not very pretty.

If someone cares to suggest some alternative notation/option for requesting
that kind of mapping flexibility, I'm certainly willing to implement it (it
would be rather trivial to do "map by N", but might be more complicated if
you want other things).

Ralph

> 
> Ollie
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-03 Thread Li-Ta Lo
On Sun, 2007-04-01 at 13:12 -0600, Ralph Castain wrote:

> 
> 2. I'm not sure what you mean by mapping MPI processes to "physical"
> processes, but I assume you mean how do we assign MPI ranks to processes on
> specific nodes. You will find that done in the orte/mca/rmaps framework. We
> currently only have one component in that framework - the round-robin
> implementation - that maps either by slot or by node, as indicated by the
> user. That code is fairly heavily commented, so you hopefully can understand
> what it is doing.
> 

How does this work in a multi-core environment? the optimal way may be
putting processes on every other "slot" on a two-core system?

Ollie




Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-03 Thread Jeff Squyres

On Apr 3, 2007, at 9:07 AM, po...@cc.gatech.edu wrote:

Actually I am working on the course project in which I am running a  
huge

computationally intensive code.
I am running this code on a cluster.
Now my work is to find out when the process sends control messages
(e.g. compute process to I/O process indicating I/O data is ready)


By "I/O", do you mean stdin/stdout/stderr, or other file I/O?

If you mean stdin/stdout/stderr, this is handled by the IOF (I/O  
Forwarding) framework/components in Open MPI.  It's somewhat  
complicated, system-level code involving logically multiplexing data  
sent across pipes to sockets (i.e., local process(es) to remote  
process(es)).


If you mean MPI-2 file I/O, you want to look at the ROMIO package; it  
handles all the MPI-2 API for I/O.


Or do you mean "I/O" such as normal MPI messages (such as those  
generated by MPI_SEND and MPI_RECV)?  FWIW, we normally refer to  
these as MPI messages, not really "I/O" (we typically reserve the  
term "I/O" for file IO and/or stdin/stdout/stderr).


Which do you mean?

and when they send actual data (e.g. I/O nodes fetching the actual
data

that is to be transferred.)


This seems to imply that you're talking about parallel/network  
filesystems.  I have to admit that I'm now quite confused about what  
you're asking for.  :-)



And I have to log the timing and duration in another file.


If you need to log the timing and duration of MPI calls, this is  
pretty easy to do with the PMPI interface -- you can intercept all  
MPI calls, log whatever information you want to log, invoke the  
underlying MPI function to do the real work, and then log the duration.



For this I need to know the states of Open MPI (control messages)
So that I can simply put print statements in Open MPI code and find
out

how it works.


I would [strongly] advise using a debugger.  Printf statements will  
only take you so far, and can be quite confusing in a parallel  
scenario -- especially when they can alter the timing of the system  
(i.e., Heisenberg kinds of effects).


For this reason I was asking to know the state changes or at least
the way

to find it out.


I'm still not clear on what state changes you're asking about.

From this e-mail and your prior e-mails, it *seems* like you're  
asking about how data gets from MPI_SEND in one process to MPI_RECV  
in another process.  Is that right?


If so, I would not characterize the code that does this as a state  
machine in the traditional sense.  Sure, as a computer program, it  
technically *is* a state machine that changes states according to  
assembly instructions, registers, etc., but we did not use generic  
state machine abstractions throughout the code base.  In many places,  
there's simply a linear sequence of events -- not a re-entrant state  
machine.


So if you're asking how a user message gets from MPI_SEND in one  
process to MPI_RECV in another, we can describe that (it's a very  
complicated answer that depends on many factors, actually -- it is  
*not* a straightforward answer, not only because OMPI deals with many  
device/network types, but also because there can be many variables  
decided at run time that determine how a message is sent from a  
process to a peer).


So before we go any further -- can you, as precisely as possible,  
describe exactly what information you're looking for?


Also my prof asked me to look into the BTL transport layer to be used
with

MPI API.


I described that in a prior e-mail.

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-03 Thread pooja
Hi,

Actually I am working on the course project in which I am running a huge
computationally intensive code.
I am running this code on a cluster.
Now my work is to find out when the process sends control messages
(e.g. compute process to I/O process indicating I/O data is ready)
and when they send actual data (e.g. I/O nodes fetching the actual data
that is to be transferred.)
And I have to log the timing and duration in another file.

For this I need to know the states of Open MPI (control messages)
So that I can simply put print statements in Open MPI code and find out
how it works.
For this reason I was asking to know the state changes or at least the way
to find it out.
Also my prof asked me to look into the BTL transport layer to be used with
MPI API.


I hope you will help.


Thanks and Regards
Pooja


> On Apr 1, 2007, at 3:12 PM, Ralph Castain wrote:
>
>> I can't help you with the BTL question. On the others:
>
> Yes, you can "sorta" call BTL's directly from application programs
> (are you trying to use MPI alongside other communication libraries,
> and using the BTL components as a sample?), but there are issues
> involved with this.
>
> First, you need to install Open MPI with all the development
> headers.  Open MPI normally only installs "mpi.h" and a small number
> of other headers; installing *all* the headers will allow you to write
> applications that use OMPI's internal headers (such as btl.h) while
> developing outside of the Open MPI source tree.
>
> Second, you probably won't want to access the BTL's directly.  To
> make this make sense, here's how the code is organized (even if the
> specific call sequence is not exactly this layered for performance/
> optimization reasons):
>
> MPI layer (e.g., MPI_SEND)
>   -> PML
>     -> BML
>       -> BTL
>
> You have two choices:
>
> 1. Go through the PML instead (this is what we do in the MPI
> collectives, for example) -- but this imposes MPI semantics on
> sending and receiving, which assumedly you are trying to avoid.
> Check out ompi/mca/pml/pml.h.
>
> 2. Go through the BML instead -- the BTL Management Layer.  This is
> essentially a multiplexor for all the BTLs that have been
> instantiated.  I'm guessing that this is what you want to do
> (remember that OMPI has true multi-device support; using the BML and
> multiple BTLs is one of the ways that we do this).  Have a look at
> ompi/mca/bml/bml.h for the interface.
>
> There is also currently no mechanism to get the BML and BTL pointers
> that were instantiated by the PML.  However, if you're just doing
> proof-of-concept code, you can extract these directly from the MPI
> layer's global variables to see how this stuff works.
>
> To have full interoperability of the underlying BTLs and between
> multiple upper-layer communication libraries (e.g., between OMPI and
> something else) is something that we have talked about a little, but
> have not done much work on.
>
> To see the BTL interface (just for completeness), see ompi/mca/btl/
> btl.h.
>
> You can probably see the pattern here...  In all of Open MPI's
> frameworks, the public interface is in <project>/mca/<framework>/
> <framework>.h, where <project> is one of opal, orte, or ompi, and
> <framework> is the name of the framework.
>
>> 1. states are reported via the orte/mca/smr framework. You will see
>> the
>> states listed in orte/mca/smr/smr_types.h. We track both process
>> and job
>> states. Hopefully, the state names will be somewhat self-
>> explanatory and
>> indicative of the order in which they are traversed. The job states
>> are set
>> when *all* of the processes in the job reach the corresponding state.
>
> Note that these are very coarse-grained process-level states (e.g.,
> is a given process running or not?).  It's not clear what kind of
> states you were asking about -- the Open MPI code base has many
> internal state machines for various message passing and other
> mechanisms.
>
> What information are you looking for, specifically?
>
>> 2. I'm not sure what you mean by mapping MPI processes to "physical"
>> processes, but I assume you mean how do we assign MPI ranks to
>> processes on
>> specific nodes. You will find that done in the orte/mca/rmaps
>> framework. We
>> currently only have one component in that framework - the round-robin
>> implementation - that maps either by slot or by node, as indicated
>> by the
>> user. That code is fairly heavily commented, so you hopefully can
>> understand
>> what it is doing.
>>
>> Hope that helps!
>> Ralph
>>
>>
>> On 4/1/07 1:32 PM, "po...@cc.gatech.edu"  wrote:
>>
>>> Hi
>>> I am Pooja and I am working on a course project which requires me
>>> -> to track the internal state changes of MPI and need me to
>>> figure out
>>> how ORTE maps MPI processes to actual physical processes
>>> -> Also I need to find a way to get BTL transports to work directly with
>>> MPI
>>> level calls.
>>> I just want to know: is this possible, and if yes, what procedure I
>>> should
>>> follow, or which files should I look into (for change)?

Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-02 Thread Jeff Squyres

On Apr 1, 2007, at 3:12 PM, Ralph Castain wrote:


I can't help you with the BTL question. On the others:


Yes, you can "sorta" call BTL's directly from application programs  
(are you trying to use MPI alongside other communication libraries,  
and using the BTL components as a sample?), but there are issues  
involved with this.


First, you need to install Open MPI with all the development  
headers.  Open MPI normally only installs "mpi.h" and a small number  
of other headers; installing *all* the headers will allow you to write
applications that use OMPI's internal headers (such as btl.h) while  
developing outside of the Open MPI source tree.


Second, you probably won't want to access the BTL's directly.  To  
make this make sense, here's how the code is organized (even if the  
specific call sequence is not exactly this layered for performance/ 
optimization reasons):


MPI layer (e.g., MPI_SEND)
  -> PML
    -> BML
      -> BTL

You have two choices:

1. Go through the PML instead (this is what we do in the MPI  
collectives, for example) -- but this imposes MPI semantics on  
sending and receiving, which assumedly you are trying to avoid.   
Check out ompi/mca/pml/pml.h (and see the sketch after this list).


2. Go through the BML instead -- the BTL Management Layer.  This is  
essentially a multiplexor for all the BTLs that have been  
instantiated.  I'm guessing that this is what you want to do  
(remember that OMPI has true multi-device support; using the BML and  
multiple BTLs is one of the ways that we do this).  Have a look at  
ompi/mca/bml/bml.h for the interface.
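
As a rough illustration of choice #1, here is a minimal sketch of calling
through the PML the way the MPI layer itself does. It assumes a 1.2-era
tree built --with-devel-headers; the MCA_PML_CALL() dispatch macro and
the send signature are from ompi/mca/pml/pml.h, so verify them in your
tree before relying on this:

    #include <stddef.h>
    #include "mpi.h"
    #include "ompi/communicator/communicator.h"
    #include "ompi/mca/pml/pml.h"

    /* Send 'count' ints to 'dest' roughly the way MPI_Send does it:
     * through the PML module's function-pointer table, with standard
     * send mode (i.e., full MPI point-to-point semantics). */
    static int pml_send_ints(int *buf, size_t count, int dest, int tag,
                             ompi_communicator_t *comm)
    {
        return MCA_PML_CALL(send(buf, count, MPI_INT, dest, tag,
                                 MCA_PML_BASE_SEND_STANDARD, comm));
    }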


There is also currently no mechanism to get the BML and BTL pointers  
that were instantiated by the PML.  However, if you're just doing  
proof-of-concept code, you can extract these directly from the MPI  
layer's global variables to see how this stuff works.


To have full interoperability of the underlying BTLs and between  
multiple upper-layer communication libraries (e.g., between OMPI and  
something else) is something that we have talked about a little, but  
have not done much work on.


To see the BTL interface (just for completeness), see ompi/mca/btl/ 
btl.h.


You can probably see the pattern here...  In all of Open MPI's  
frameworks, the public interface is in <project>/mca/<framework>/
<framework>.h, where <project> is one of opal, orte, or ompi, and
<framework> is the name of the framework.


1. states are reported via the orte/mca/smr framework. You will see  
the
states listed in orte/mca/smr/smr_types.h. We track both process  
and job
states. Hopefully, the state names will be somewhat self- 
explanatory and
indicative of the order in which they are traversed. The job states  
are set

when *all* of the processes in the job reach the corresponding state.


Note that these are very coarse-grained process-level states (e.g.,  
is a given process running or not?).  It's not clear what kind of  
states you were asking about -- the Open MPI code base has many  
internal state machines for various message passing and other  
mechanisms.


What information are you looking for, specifically?


2. I'm not sure what you mean by mapping MPI processes to "physical"
processes, but I assume you mean how do we assign MPI ranks to  
processes on
specific nodes. You will find that done in the orte/mca/rmaps  
framework. We

currently only have one component in that framework - the round-robin
implementation - that maps either by slot or by node, as indicated  
by the
user. That code is fairly heavily commented, so you hopefully can  
understand

what it is doing.

Hope that helps!
Ralph


On 4/1/07 1:32 PM, "po...@cc.gatech.edu"  wrote:


Hi
I am Pooja and I am working on a course project which requires me
-> to track the internal state changes of MPI and need me to  
figure out

how ORTE maps MPI processes to actual physical processes
-> Also I need to find a way to get BTL transports to work directly with
MPI

level calls.
I just want to know: is this possible, and if yes, what procedure I
should

follow, or which files should I look into (for change)?


Please Help

Thanks and Regards
Pooja

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level

2007-04-01 Thread Ralph Castain
Hi Pooja

What did you do to make your prof dislike you so much??? :-) These are, to
say the least, major tasks you have been describing. I've seen developers on
our team spend months trying to really understand even one or two of the
issues raised in your various emails, let alone make any kind of changes...

I can't help you with the BTL question. On the others:

1. states are reported via the orte/mca/smr framework. You will see the
states listed in orte/mca/smr/smr_types.h. We track both process and job
states. Hopefully, the state names will be somewhat self-explanatory and
indicative of the order in which they are traversed. The job states are set
when *all* of the processes in the job reach the corresponding state.

2. I'm not sure what you mean by mapping MPI processes to "physical"
processes, but I assume you mean how do we assign MPI ranks to processes on
specific nodes. You will find that done in the orte/mca/rmaps framework. We
currently only have one component in that framework - the round-robin
implementation - that maps either by slot or by node, as indicated by the
user. That code is fairly heavily commented, so you hopefully can understand
what it is doing.
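
For reference, the two policies that the round-robin component implements
can be chosen from the mpirun command line -- a usage sketch, with flags
as in the 1.2 series:

    mpirun --byslot -np 8 ./a.out   # fill each node's slots first (the default)
    mpirun --bynode -np 8 ./a.out   # round-robin ranks across nodes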

Hope that helps!
Ralph


On 4/1/07 1:32 PM, "po...@cc.gatech.edu"  wrote:

> Hi
> I am Pooja and I am working on a course project which requires me
> -> to track the internal state changes of MPI and need me to figure out
> how ORTE maps MPI processes to actual physical processes
> -> Also I need to find a way to get BTL transports to work directly with MPI
> level calls.
> I just want to know: is this possible, and if yes, what procedure I should
> follow, or which files should I look into (for change)?
> 
> 
> Please Help
> 
> Thanks and Regards
> Pooja
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel