[OMPI devel] Multi-environment builds

2007-07-09 Thread Ralph Castain
Yo all

I have been working on adding/clarifying support for several environments
and have encountered a problem that appears to be fairly common out there.
Namely, machines that have - over the course of history or for specific
reasons - installed libraries to support multiple environments. For example,
I can readily find machines that are running TM, but also have LSF and SLURM
libraries installed (although those environments are not "active" - the
libraries in some cases are old and stale, usually present either because
someone wanted to look at them or because they are left over from an old
installation).

The problem is that our Open MPI build system automatically detects the
presence of those libraries, builds the corresponding components, and then
links those libraries into our system. Unfortunately, this causes two
side-effects:

1. we wind up building and loading a bunch of components that we cannot use
- which impacts memory footprint; and

2. not every component in every framework runs some library function to
determine if that environment is actually active. Hence, our selection logic
can sometimes get confused due to conflicting priorities, resulting in the
selection of components that cause the system to crash.

A couple of solutions come immediately to mind:

1. The most obvious one (to me, at least) is to require that people provide
"--with-xx" when they build the system. Instead of automatically detecting
an include file and library, and then deciding that the existence of those
files dictates that we build support for that environment, we would only
build support for those environments that the builder specifies, and error
out of the build process if multiple conflicting environments are specified.
This raises the issue of what to do with rsh, but I think we can handle that
one by simply building it wherever possible.

2. We could laboriously go through all the components and ensure that they
check in their selection logic to see if that environment is active. This
still causes libraries to be loaded for nothing, but keeps the automatic
nature of the build system. We would have to deal with those environments
that may not have a "safe" function we can call to see if they are "alive",
or have old/stale libraries that may have differing behavior in their APIs,
but perhaps those are few enough to not be a big problem.
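
To make the two options concrete, here is a minimal sketch in Python (not the actual build system or MCA selection code). Every name in it is hypothetical, and treating TM, LSF and SLURM as mutually exclusive is an assumption made only for illustration.

```python
# Hypothetical sketch; none of these names come from Open MPI itself.

# Assumption: these resource managers conflict with one another.
RESOURCE_MANAGERS = {"tm", "lsf", "slurm"}
# rsh is simply built wherever possible, as suggested in option 1.
ALWAYS_BUILT = {"rsh"}


def environments_to_build(requested, detected):
    """Option 1: build only what the builder asked for via "--with-xx".

    requested: environments named explicitly on the configure line
    detected:  environments whose headers/libraries exist on the machine
    """
    requested = set(requested)
    conflicting = requested & RESOURCE_MANAGERS
    if len(conflicting) > 1:
        raise ValueError("conflicting environments: " + ", ".join(sorted(conflicting)))
    missing = requested - set(detected)
    if missing:
        raise ValueError("requested but not found: " + ", ".join(sorted(missing)))
    # Libraries that were detected but not requested (e.g. stale installs)
    # are ignored rather than built.
    return requested | ALWAYS_BUILT


def select_component(components):
    """Option 2: pick the highest-priority component whose environment is active.

    components: iterable of (name, priority, is_active), is_active a callable.
    """
    active = [(priority, name) for name, priority, is_active in components if is_active()]
    if not active:
        return None
    return max(active)[1]
```

With option 1, a machine that has TM, LSF and SLURM libraries but is configured with only "tm" builds TM plus rsh. With option 2, everything is still built, but an inactive environment can no longer win selection on priority alone.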

Any thoughts on this? It seems like we should solve this as it is becoming
more prevalent (at least on the machines I test on).

Ralph




Re: [OMPI devel] One-sided operations with Portals

2007-07-09 Thread Glendenning, Lisa
Hi Jeff,

Questions regarding HP's contract with SNL can be directed to Debra
Leitka, who is the Sandia Contract Representative (SCR). Debra's contact
info is:

Debra Leitka
Phone: 284-8818
Email: dlei...@sandia.gov

The work that I will be doing falls under this contract.  

Thanks,
Lisa Glendenning


-Original Message-
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
Behalf Of Jeff Squyres
Sent: Monday, July 09, 2007 7:51 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] One-sided operations with Portals

It is probably worth clarifying to find out for sure (i.e., have the
appropriate legal representatives investigate to find out who owns the
IP).  It is an explicit goal of the Open MPI project to have a traceable
code pedigree that is properly licensed.

Thanks.


On Jul 9, 2007, at 9:42 AM, Glendenning, Lisa wrote:

> This work would be done under a contract with Sandia National 
> Laboratories.  I believe that makes it SNL's IP.
>
>
> -Original Message-
> From: devel-boun...@open-mpi.org [mailto:devel-bounces@open- mpi.org] 
> On Behalf Of Jeff Squyres
> Sent: Friday, July 06, 2007 12:03 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] One-sided operations with Portals
>
> On Jul 5, 2007, at 11:16 PM, Glendenning, Lisa wrote:
>
>> Ron Brightwell at SNL has asked me to look into optimizing Open MPI's
>> one-sided operations over Portals.  Does anyone have any guidance or
>> thoughts for this?
>
> Does this mean that HP is considering joining the Open MPI project?
> In order to contribute code, a signed copy of the Open MPI 3rd Party
> Contribution agreement must be submitted (see
> http://www.open-mpi.org/community/contribute/).
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Jeff Squyres
Cisco Systems




[OMPI devel] fake rdma flag again?

2007-07-09 Thread Brian Barrett

Hi all -

I've finally committed a version of the rdma one-sided component that  
1) works and 2) in certain situations actually does rdma.  I'll make  
it the default when the BTLs are used as soon as one last bug is  
fixed in the DDT engine.


However, there is still one outstanding issue.  Some BTLs (like  
Portals or MX) advertise the ability to do a put but place  
restrictions on the put that only work for OB1.  For example, both  
can only do an RDMA that starts where the prepare_dst() / prepare_src()
call said the target buffer was.  This isn't a problem for OB1,
but kind of defeats the purpose of one-sided ;). There's also a  
reference count (I believe) in the Portals put/get code that would  
make life interesting if a descriptor was doing multiple RDMA ops at  
once.


I was thinking that the easy way to solve this was to add a flag  
(FAKE_RDMA was the current running favorite, since we've used it  
before for different meaning :) ) to the components that have  
behaviors that work for OB1, but not a generalized rdma interface.  I  
was wondering what people thought of this idea and if they had any  
preference for naming the flag.
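
As a rough illustration of how such a flag might be consumed, here is a small Python sketch. Only the name FAKE_RDMA comes from this thread; the flag values, the class and the helper function are hypothetical and are not the BTL interface.

```python
from enum import IntFlag


class BtlFlags(IntFlag):
    # Hypothetical values; only the FAKE_RDMA name is taken from the thread.
    PUT = 0x1
    GET = 0x2
    FAKE_RDMA = 0x4  # put/get exist, but only with OB1-style restrictions


def usable_for_one_sided(flags):
    """True if the BTL advertises put/get without the OB1-only restrictions."""
    has_rdma = bool(flags & (BtlFlags.PUT | BtlFlags.GET))
    return has_rdma and not (flags & BtlFlags.FAKE_RDMA)
```

Under this sketch OB1 would keep looking only at PUT/GET, while a one-sided component would skip any BTL that also sets FAKE_RDMA.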


Brian


Re: [OMPI devel] "New" IB vendor and MTU question

2007-07-09 Thread Jeff Squyres

On Jul 9, 2007, at 3:17 PM, Peter Kjellstrom wrote:

> Our new HP cluster has 25208 HCAs (Mellanox Arbel) but a new vendor-id... We
> have 0x1708 (presumably HP, hardware wise Cisco (Mellanox)) to add to the
> existing list in share/openmpi/mca-btl-openib-hca-params.ini that currently
> contains:
>  # Mellanox  0x2c9
>  # Cisco 0x5ad
>  # Silverstorm   0x66a
>  # Voltaire  0x8f1

Added in r15316; thanks for pointing it out.

> Somewhat related question 1: Is there a blessed way to map these ids back to
> strings?


Not via C API, no.  But the IEEE OUI web page can be used to look up  
these values:


http://standards.ieee.org/regauth/oui/
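
If a quick local lookup is enough, a small table built from the IDs quoted in this thread might do. This is only a sketch in Python: the function name is hypothetical, and the "HP" label for 0x1708 is the poster's presumption rather than something verified here.

```python
# Vendor IDs exactly as listed in this thread.
VENDOR_NAMES = {
    0x2C9: "Mellanox",
    0x5AD: "Cisco",
    0x66A: "Silverstorm",
    0x8F1: "Voltaire",
    0x1708: "HP (presumed)",  # presumption from the thread, not confirmed
}


def vendor_name(vendor_id):
    """Map a vendor ID to a display string, falling back to the hex value."""
    return VENDOR_NAMES.get(vendor_id, "unknown (0x%x)" % vendor_id)
```

Anything not in the table still has to be looked up by hand on the OUI page.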

> question 2: Is 1024 really the best MTU for DDR Arbel? I seem to remember this
> being 2048...


I *believe* that that value came from Mellanox, but I don't remember  
offhand.  But it could also be a "doesn't really matter either way"  
issue.  You might want to try both with your apps and see if there's  
a performance difference.  Let us know what happens.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Ob1 segfault

2007-07-09 Thread Gleb Natapov
On Mon, Jul 09, 2007 at 10:41:52AM -0400, Tim Prins wrote:
> Gleb Natapov wrote:
> > On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
> >> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> >>> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> >>>> While looking into another problem I ran into an issue which made ob1
> >>>> segfault on me. Using gm, and running the test test_dan1 in the onesided
> >>>> test suite, if I limit the gm freelist by too much, I get a segfault.
> >>>> That is,
> >>>>
> >>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> >>>>
> >>>> works fine, but
> >>>>
> >>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
> >>> I cannot, unfortunately, reproduce this with openib BTL.
> >>>
> >>>> segfaults. Here is the relevant output from gdb:
> >>>>
> >>>> Program received signal SIGSEGV, Segmentation fault.
> >>>> [Switching to Thread 1077541088 (LWP 15600)]
> >>>> 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> >>>> hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> >>>> 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> >>>> sizeof(mca_pml_ob1_fin_hdr_t));
> >>> can you send me what's inside bml_btl?
> >> It turns out that the order of arguments to mca_pml_ob1_send_fin was wrong. I
> >> fixed this in r15304. But now we hang instead of segfault, and have both
> >> processes just looping through opal_progress. I really don't know what to look
> >> for. Any hints?
> >>
> > Can you look in gdb at mca_pml_ob1.rdma_pending?
> Yeah, rank 0 has nothing on the list, and rank 1 has 48 things.
Are you running both ranks on the same node? Can you try running them on
different nodes?

--
Gleb.



Re: [OMPI devel] Ob1 segfault

2007-07-09 Thread Tim Prins

Gleb Natapov wrote:
> On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
>> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
>>> On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
>>>> While looking into another problem I ran into an issue which made ob1
>>>> segfault on me. Using gm, and running the test test_dan1 in the onesided
>>>> test suite, if I limit the gm freelist by too much, I get a segfault.
>>>> That is,
>>>>
>>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
>>>>
>>>> works fine, but
>>>>
>>>> mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
>>>
>>> I cannot, unfortunately, reproduce this with openib BTL.
>>>
>>>> segfaults. Here is the relevant output from gdb:
>>>>
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> [Switching to Thread 1077541088 (LWP 15600)]
>>>> 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
>>>> hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
>>>> 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
>>>> sizeof(mca_pml_ob1_fin_hdr_t));
>>>
>>> can you send me what's inside bml_btl?
>>
>> It turns out that the order of arguments to mca_pml_ob1_send_fin was wrong. I
>> fixed this in r15304. But now we hang instead of segfault, and have both
>> processes just looping through opal_progress. I really don't know what to look
>> for. Any hints?
>>
> Can you look in gdb at mca_pml_ob1.rdma_pending?

Yeah, rank 0 has nothing on the list, and rank 1 has 48 things.

Here is the first item on the list:
$7 = {
  super = {
    super = {
      super = {
        obj_magic_id = 16046253926196952813,
        obj_class = 0x404f5980,
        obj_reference_count = 1,
        cls_init_file_name = 0x404f30f9 "pml_ob1_sendreq.c",
        cls_init_lineno = 1134
      },
      opal_list_next = 0x8f5d680,
      opal_list_prev = 0x404f57c8,
      opal_list_item_refcount = 1,
      opal_list_item_belong_to = 0x404f57b0
    },
    registration = 0x0,
    ptr = 0x0
  },
  rdma_bml = 0x8729098,
  rdma_hdr = {
    hdr_common = {
      hdr_type = 8 '\b',
      hdr_flags = 4 '\004'
    },
    hdr_match = {
      hdr_common = {
        hdr_type = 8 '\b',
        hdr_flags = 4 '\004'
      },
      hdr_ctx = 5,
      hdr_src = 1,
      hdr_tag = 142418176,
      hdr_seq = 0,
      hdr_padding = "\000"
    },
    hdr_rndv = {
      hdr_match = {
        hdr_common = {
          hdr_type = 8 '\b',
          hdr_flags = 4 '\004'
        },
        hdr_ctx = 5,
        hdr_src = 1,
        hdr_tag = 142418176,
        hdr_seq = 0,
        hdr_padding = "\000"
      },
      hdr_msg_length = 236982400,
      hdr_src_req = {
        lval = 0,
        ival = 0,
        pval = 0x0,
        sval = {
          uval = 0,
          lval = 0
        }
      }
    },
    hdr_rget = {
      hdr_rndv = {
        hdr_match = {
          hdr_common = {
            hdr_type = 8 '\b',
            hdr_flags = 4 '\004'
          },
          hdr_ctx = 5,
          hdr_src = 1,
          hdr_tag = 142418176,
          hdr_seq = 0,
          hdr_padding = "\000"
        },
        hdr_msg_length = 236982400,
        hdr_src_req = {
          lval = 0,
          ival = 0,
          pval = 0x0,
          sval = {
            uval = 0,
            lval = 0
          }
        }
      },
      hdr_seg_cnt = 1106481152,
      hdr_padding = "\000\000\000",
      hdr_des = {
        lval = 32768,
        ival = 32768,
        pval = 0x8000,
        sval = {
          uval = 32768,
          lval = 0
        }
      },
      hdr_segs = {{
          seg_addr = {
            lval = 0,
            ival = 0,
            pval = 0x0,
            sval = {
              uval = 0,
              lval = 0
            }
          },
          seg_len = 0,
          seg_padding = "\000\000\000",
          seg_key = {
            key32 = {0, 0},
            key64 = 0,
            key8 = "\000\000\000\000\000\000\000"
          }
        }}
    },
    hdr_frag = {
      hdr_common = {
        hdr_type = 8 '\b',
        hdr_flags = 4 '\004'
      },
      hdr_padding = "\005\000\001\000\000",
      hdr_frag_offset = 142418176,
      hdr_src_req = {
        lval = 236982400,
        ival = 236982400,
        pval = 0xe201080,
        sval = {
          uval = 236982400,
          lval = 0
        }
      },
      hdr_dst_req = {
        lval = 0,
        ival = 0,
        pval = 0x0,
        sval = {
          uval = 0,
          lval = 0
        }
      }
    },
    hdr_ack = {
      hdr_common = {
        hdr_type = 8 '\b',
        hdr_flags = 4 '\004'
      },
      hdr_padding = "\005\000\001\000\000",
      hdr_src_req = {
        lval = 142418176,
        ival = 142418176,
        pval = 0x87d2100,
        sval = {
          uval = 142418176,
          lval = 0
        }
      },
      hdr_dst_req = {
        lval = 236982400,
        ival = 236982400,
        pval = 0xe201080,
        sval = {
          uval = 236982400,
          lval = 0
        }
      },
      hdr_send_offset = 0
    },
    hdr_rdma = {
      hdr_common = {
        hdr_type = 8 

Re: [OMPI devel] opal_output_verbose usage guidelines

2007-07-09 Thread Don Kerr
Yes, I use opal_show_help in other places, but that is an all-or-nothing
proposition. I think the ability to be verbose or quiet can be very
useful to end users, and that is what I need at the moment.


-DON

Jeff Squyres wrote:
> On Jul 9, 2007, at 9:58 AM, Don Kerr wrote:
>
>>> You want a warning to show when:
>>>
>>> 1. the udapl btl is used
>>> 2. --enable-debug was not configured
>>> 3. the user specifies btl_*_verbose (or btl_*_debug) >= some_value
>>>
>>> Is that right?  If so, is the intent to warn that some checks are
>>> not being performed that one would otherwise assume are being
>>> performed (because of #3)?
>>
>> #1 and #2 are just to convey the environment I expect the user to be
>> running in, not the error case. The interpretation of #3 is a little askew.
>>
>> uDAPL gets its HCA information from /etc/dat.conf. This file has an
>> entry for each HCA, even those that are potentially not "UP". Also, it
>> appears the OFED stack includes by default an entry for "OpenIB-bond",
>> which I have not figured out yet.  In any case, uDAPL has
>> trouble distinguishing whether an HCA is down intentionally or is down
>> because something is wrong. So the uDAPL BTL attempts to open all of the
>> entries in this file.
>
> You might want to ping the OFA general mailing list or the DAT
> mailing lists with these kinds of questions...?
>
>> And the issue becomes how much information to
>> toss back to the user. If a node has two IB interfaces but only one is
>> up, do they want to see a warning message about one of the interfaces being
>> down when they already know this by looking at "ifconfig"?  I think not.
>> But this could be valuable information if there is a real problem.
>
> True.  FWIW, in the openib btl, we only use HCA ports that are active
> (i.e., have a link signal and have been recognized/allowed on the
> network by the SM); we silently ignore those that are not active.  We
> do not currently have a diagnostic that shows which ports are ignored
> because they are not active, IIRC.
>
>> Since it's just one message at this point I think I will go with the base
>> output_id and if I need more I will look to create a component specific
>> id.  Thanks Jeff.
>
> FWIW, we always treat the opal_output_verbose output as optional
> output.  If there's something that you definitely want to toss back
> to the user, use opal_show_help.
>
>> I expect to pursue this in order to find a better way to distinguish
>> between an interface that is up or down but I don't have a solution at
>> the moment.
>>
>> -DON
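
The behavior discussed in this message (use only active ports, stay quiet about inactive ones unless the user asked for verbosity) could be sketched as follows. This is illustrative Python, not the openib or uDAPL BTL code; the function name, the port tuples and the threshold of 1 are all assumptions.

```python
def usable_ports(ports, verbose_level=0, log=print):
    """Return the names of active ports.

    ports: iterable of (name, is_active) pairs (hypothetical representation).
    Inactive ports are skipped silently unless verbose_level >= 1; that
    threshold is an arbitrary choice for this sketch.
    """
    active = []
    for name, is_active in ports:
        if is_active:
            active.append(name)
        elif verbose_level >= 1:
            log("skipping inactive port %s" % name)
    return active
```

A user who already knows one interface is down sees nothing by default, while someone chasing a real problem can raise the verbosity and get the diagnostic.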




Re: [OMPI devel] Ob1 segfault

2007-07-09 Thread Gleb Natapov
On Sun, Jul 08, 2007 at 12:41:58PM -0400, Tim Prins wrote:
> On Sunday 08 July 2007 08:32:27 am Gleb Natapov wrote:
> > On Fri, Jul 06, 2007 at 06:36:13PM -0400, Tim Prins wrote:
> > > While looking into another problem I ran into an issue which made ob1
> > > segfault on me. Using gm, and running the test test_dan1 in the onesided
> > > test suite, if I limit the gm freelist by too much, I get a segfault.
> > > That is,
> > >
> > > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 1024 test_dan1
> > >
> > > works fine, but
> > >
> > > mpirun -np 2 -mca btl gm,self -mca btl_gm_free_list_max 512 test_dan1
> >
> > I cannot, unfortunately, reproduce this with openib BTL.
> >
> > > segfaults. Here is the relevant output from gdb:
> > >
> > > Program received signal SIGSEGV, Segmentation fault.
> > > [Switching to Thread 1077541088 (LWP 15600)]
> > > 0x404d81c1 in mca_pml_ob1_send_fin (proc=0x9bd9490, bml_btl=0xd323580,
> > > hdr_des=0x9e54e78, order=255 '\377', status=1) at pml_ob1.c:267
> > > 267 MCA_PML_OB1_DES_ALLOC(bml_btl, fin, order,
> > > sizeof(mca_pml_ob1_fin_hdr_t));
> >
> > can you send me what's inside bml_btl?
> 
> It turns out that the order of arguments to mca_pml_ob1_send_fin was wrong. I 
> fixed this in r15304. But now we hang instead of segfault, and have both 
> processes just looping through opal_progress. I really don't know what to look 
> for. Any hints?
> 
Can you look in gdb at mca_pml_ob1.rdma_pending?

--
Gleb.



Re: [OMPI devel] One-sided operations with Portals

2007-07-09 Thread Glendenning, Lisa
This work would be done under a contract with Sandia National
Laboratories.  I believe that makes it SNL's IP.


-Original Message-
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
Behalf Of Jeff Squyres
Sent: Friday, July 06, 2007 12:03 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] One-sided operations with Portals

On Jul 5, 2007, at 11:16 PM, Glendenning, Lisa wrote:

> Ron Brightwell at SNL has asked me to look into optimizing Open MPI's 
> one-sided operations over Portals.  Does anyone have any guidance or 
> thoughts for this?

Does this mean that HP is considering joining the Open MPI project?   
In order to contribute code, a signed copy of the Open MPI 3rd Party
Contribution agreement must be submitted (see
http://www.open-mpi.org/community/contribute/).

--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] opal_output_verbose usage guidelines

2007-07-09 Thread Jeff Squyres

On Jul 6, 2007, at 5:20 PM, Don Kerr wrote:


> Are there any guidelines about the use of opal_output_verbose?


Not so much.


> - Are there hidden meanings for a given verbose level? e.g. 0
> reserved for PML, or 50-100 for BTL and so on


Nope.  The output was designed to use the values with >= kinds of
checking; i.e., the higher the verbose value the user gives, the more
output they see.  The values are not used in a "bit flag" sense
(where each bit would enable or disable a specific set of output).
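
A minimal sketch of that >= check, written in Python rather than Open MPI's C. The function names and the sink argument are hypothetical and are not the opal_output API.

```python
def make_verbose_output(user_level, sink):
    """Build an output function bound to the user's chosen verbosity.

    user_level: the verbosity the user asked for (higher means more output)
    sink:       callable that receives each emitted message
    """
    def output_verbose(level, message):
        # Threshold check, not a bit test: emit when user_level >= level.
        if user_level >= level:
            sink(message)
    return output_verbose
```

With a user level of 10, messages tagged 1 and 10 are emitted and a message tagged 50 is dropped, so no particular value carries a hidden meaning beyond its ordering.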



> - Maybe the base component output_id is ok to use in situation
> XYZ but a component specific output_id should be used in situation ABC?
> Or should never be used for component specific output?


I've typically used the base component output_id whenever possible.   
I usually started off having an output ID for a specific component,  
but usually that was for debugging (and therefore having oodles and  
oodles of output).  By the time I was done, I usually had only a few  
output statements and therefore used the base ID.


I guess my suggestion would be: if you're going to have a LOT of  
output, then make it a component-specific ID.  If it's a "reasonable"  
amount, then just use the base ID.  Definitions of those terms are  
subjective and intentionally fuzzy.  :-)


> Why I ask.  I want to report a warning to the user when "--enable-debug"
> is not configured. I also do not want the error to show up all the time,
> only when for example --mca btl_base_debug is set to some value. I am
> thinking I will just use opal_output_verbose but wanted to see if there
> were any guidelines about its use? Or if I should be thinking about some
> other option altogether.


You want a warning to show when:

1. the udapl btl is used
2. --enable-debug was not configured
3. the user specifies btl_*_verbose (or btl_*_debug) >= some_value

Is that right?  If so, is the intent to warn that some checks are
not being performed that one would otherwise assume are being
performed (because of #3)?


--
Jeff Squyres
Cisco Systems