Re: [OMPI users] 1.4 OpenMPI build not working well with TotalView on Darwin

2010-01-22 Thread Ashley Pittman
On Wed, 2010-01-20 at 21:18 -0500, Peter Thompson wrote:
> Hi Jeff,
> 
> Sorry, speaking in shorthand again.
> 
> Jeff Squyres wrote:
> > On Jan 8, 2010, at 5:03 PM, Peter Thompson wrote:
> > 
> >> I've tried a few builds of 1.4 on Snow Leopard, and trying to start up
> >> TotalView gets some of the more 'standard' problems.
> > 
> > I don't quite know what you mean by "standard" problems...?
> 
> That's more or less the 'standard problems' that I hear described when someone
> tries to build an MPI (not just OpenMPI) and things don't work on the first
> try.  I don't know if you've worked on the interface directly, but you are
> probably aware that TotalView has an API where we set up a structure,
> MPIR_PROCTABLE, based on a typedef MPIR_PROCDESC, which gets filled in with
> which processes are started up on which nodes.  This allows the debugger to
> attach to things automatically.  If the build is done so that the files that
> hold these structures are optimized, sometimes the typedef is optimized away.
> Or, in the case of other builds, the file may have the correct optimization
> (none) but the symbol info is stripped in the link phase.  So it's a typical,
> or 'standard', issue I face, but hopefully not for you.

I've seen several OpenMPI installs in the wild like this, where the type
information for MPIR_PROCTABLE is missing.  The missing type information
doesn't affect the code or the contents of memory at all, however; it just
means the memory isn't described by debug information.  As there is a
standard (sort of) describing MPIR_PROCTABLE, what I choose to do in padb
is to use the standard to calculate the struct size and offsets rather than
the debug info.  This allows padb to work even when the debug information
is missing.
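
As a rough illustration of the fixed-layout idea (not padb's actual code, which is written in Perl), the sketch below declares MPIR_PROCDESC with Python's ctypes, using the field order from the MPIR process-acquisition interface (host_name, executable_name, pid), and decodes a raw copy of the table without any debug info. It assumes the tool runs with the same pointer size and alignment as the target process; `decode_proctable` is a hypothetical helper name, and how the raw bytes are read out of the target is left out.

```python
import ctypes


class MPIR_PROCDESC(ctypes.Structure):
    # Field order follows the MPIR interface; the pointers are kept as
    # plain addresses because they refer to the target's memory, not ours.
    _fields_ = [
        ("host_name", ctypes.c_void_p),        # char * in the target
        ("executable_name", ctypes.c_void_p),  # char * in the target
        ("pid", ctypes.c_int),
    ]


def decode_proctable(raw, count):
    """Decode `count` entries from a raw byte copy of the process table.

    Hypothetical helper: sizes and offsets come from the fixed layout
    above, not from the debug information of the MPI library.
    """
    entry_size = ctypes.sizeof(MPIR_PROCDESC)
    entries = []
    for i in range(count):
        desc = MPIR_PROCDESC.from_buffer_copy(raw, i * entry_size)
        entries.append((desc.host_name, desc.executable_name, desc.pid))
    return entries
```

The host_name and executable_name values returned here are still addresses in the target process; the strings they point to would have to be read separately.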

If the debug information is available, padb checks that it matches what I
expect it to be.

Don't use the debug info but rather use fixed sizes and offsets:
http://code.google.com/p/padb/source/detail?r=355

Verify the type information if present:
http://code.google.com/p/padb/source/detail?r=386
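
The verification step might look something like the following Python sketch. It is not taken from the padb revisions linked above: `expected_layout` and `layout_mismatches` are hypothetical names, the `reported` dictionary stands in for whatever a debugger returns from the debug info, and the expected layout again assumes the checker shares the target's pointer size.

```python
import ctypes


class MPIR_PROCDESC(ctypes.Structure):
    # Fixed layout taken from the MPIR interface description.
    _fields_ = [
        ("host_name", ctypes.c_void_p),
        ("executable_name", ctypes.c_void_p),
        ("pid", ctypes.c_int),
    ]


def expected_layout():
    """Size and field offsets implied by the fixed layout above."""
    offsets = {name: getattr(MPIR_PROCDESC, name).offset
               for name, _ in MPIR_PROCDESC._fields_}
    return {"size": ctypes.sizeof(MPIR_PROCDESC), "offsets": offsets}


def layout_mismatches(reported):
    """Compare a layout reported by debug info against the fixed one.

    `reported` is None when the debug info is missing, in which case
    there is nothing to verify and the fixed layout is used as-is.
    """
    if reported is None:
        return []
    expected = expected_layout()
    problems = []
    if reported.get("size") != expected["size"]:
        problems.append("size")
    for name, offset in expected["offsets"].items():
        if reported.get("offsets", {}).get(name) != offset:
            problems.append(name)
    return problems
```

An empty list means either that no debug info was available or that it agrees with the fixed layout; anything else names the fields that disagree.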

> However, some users prefer the classic launch with -tv, and this seems to be
> failing with the latest builds I've done on Darwin.

I've seen this 'problem' on Linux as well.  I'm unsure of the OpenMPI
version although I could ask the organisation concerned if required.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] 1.4 OpenMPI build not working well with TotalView on Darwin

2010-01-20 Thread Peter Thompson

Hi Jeff,

Sorry, speaking in shorthand again.

Jeff Squyres wrote:

> On Jan 8, 2010, at 5:03 PM, Peter Thompson wrote:
>
>> I've tried a few builds of 1.4 on Snow Leopard, and trying to start up TotalView
>> gets some of the more 'standard' problems.
>
> I don't quite know what you mean by "standard" problems...?


That's more or less the 'standard problems' that I hear described when someone tries 
to build an MPI (not just OpenMPI) and things don't work on the first try.  I don't 
know if you've worked on the interface directly, but you are probably aware that 
TotalView has an API where we set up a structure, MPIR_PROCTABLE, based on a 
typedef MPIR_PROCDESC, which gets filled in with which processes are started up 
on which nodes.  This allows the debugger to attach to things automatically. 
If the build is done so that the files that hold these structures are optimized, 
sometimes the typedef is optimized away.  Or, in the case of other builds, the 
file may have the correct optimization (none) but the symbol info is stripped in 
the link phase.  So it's a typical, or 'standard', issue I face, but hopefully 
not for you.



>> Either the typedef for MPIR_PROCDESC
>> can't be found, or MPIR_PROCTABLE is missing.  You can get things to work if you
>> start up TotalView first and then pick your program and go to the Parallel tab
>> and pick OpenMPI.  But it would be nice to get the classic launch working as
>> well.
>
> I'm unclear on how you could find these symbols if you start TV first, etc.,
> but it won't work automatically.


One of the solutions we came up with to work around this problem was to start up 
TotalView a different way, so that we need not rely on the symbol information at 
all.  If you start TotalView the 'classic' way, mpirun/mpiexec -tv -np 4 ./foo, 
it will look for MPIR_PROCTABLE and the others.  If you use the newer 'indirect' 
launch, we actually start up the debug servers with MPI, and then use some 
cached info to figure out the correct process to start up with the debug servers 
and how many processes to start.  With this method, the symbol information is not 
needed.  This method works with OpenMPI on just about all platforms.  However, 
some users prefer the classic launch with -tv, and this seems to be failing with 
the latest builds I've done on Darwin.  The debug info appears to be preserved 
in the .o files, but does not always seem complete.  It probably needs another 
look on my part, to make sure I'm doing it right.  The fact that Snow Leopard 
(and maybe some earlier releases) now includes OpenMPI also confuses the issue, 
as the version that comes with Darwin does NOT contain the symbol info, and it's 
easy enough to get the native OpenMPI, and not pick up the build you intended.
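
One quick way to rule out the "wrong OpenMPI picked up" case is to check which launcher the shell would actually run. The following Python sketch does that; `launcher_from_prefix` is a hypothetical helper, and the install prefix passed to it (for example /opt/openmpi-1.4) is only an assumed location for the build you intended to use.

```python
import os
import shutil


def launcher_from_prefix(prefix, name="mpirun", path=None):
    """Return (found, matches) for the first `name` on the search path.

    `found` is the launcher that would be run (None if there is none);
    `matches` is True only if it lives in <prefix>/bin.  Passing
    path=None makes shutil.which() use the PATH environment variable.
    """
    found = shutil.which(name, path=path)
    if found is None:
        return None, False
    expected_dir = os.path.realpath(os.path.join(prefix, "bin"))
    found_dir = os.path.realpath(os.path.dirname(found))
    return found, found_dir == expected_dir
```

If `matches` comes back False, the launcher on the path (for instance the system-supplied one) is not the build you meant to debug.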


Does that make any more sense?

I'll try playing around with 1.4.1 and see if it's me, or the compilers, or 
maybe OpenMPI.


PeterT



> Do you have deeper knowledge (given your email address) on exactly what is
> going wrong?