Re: [OMPI users] Pointers for understanding failure messages on NetBSD

2009-12-03 Thread Kevin . Buckley
>> I have actually already taken the IPv6 block and simply tried to
>> replace any IPv6 stuff with IPv4 "equivalents", eg:
>
> At the risk of showing a lot of ignorance, here's the block I coddled
> together based on the IPv6 block.
>
> I have tried to keep it looking as close to the original IPv6
> block as possible.

OK, I now have something that seems to work without generating
any error messages.

I'll post it here for reference and try and make a PkgSrc patch
once I get access to the WIP tree for the NetBSD port of OpenMPI,
which will make more sense to Aleksej.

The main difference between this and the IPv6 block is the
extra:

((struct sockaddr_in*) &intf.if_addr)->sin_len =
    cur_ifaddrs->ifa_addr->sa_len;

line just below the

/* fill values into the opal_if_t */

stanza.

The errors I reported seeing, relating to

opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)

were arising because the sin_len was appearing as 0 after the
interface had been placed into the opal_list.

Given that the getifaddrs code can handle both IPv4 and IPv6, there
may not be a need to have two loops, one for each protocol, but I am
not going to make such a major change at present; indeed, my code
probably needs tarting up.

But anyroad, here's the block as it stands:

#if defined(__NetBSD__)
/* || defined(__OpenBSD__) || defined(__FreeBSD__) || \
      defined(__386BSD__) || defined(__bsdi__) || defined(__APPLE__) */
/* || defined(__linux__) */

{
    OBJ_CONSTRUCT(&opal_if_list, opal_list_t);

    struct ifaddrs **ifadd_list;
    struct ifaddrs *cur_ifaddrs;
    struct sockaddr_in* sin_addr;

    /*
     * the manpage claims that getifaddrs() allocates the memory,
     * and freeifaddrs() is later used to release the allocated memory.
     * however, without this malloc the call to getifaddrs() segfaults
     */
    ifadd_list = (struct ifaddrs **) malloc(sizeof(struct ifaddrs*));

    /* create the linked list of ifaddrs structs */
    if(getifaddrs(ifadd_list) < 0) {
        opal_output(0, "opal_ifinit: getifaddrs() failed with error=%d\n",
                    errno);
        return OPAL_ERROR;
    }

    for(cur_ifaddrs = *ifadd_list; NULL != cur_ifaddrs;
        cur_ifaddrs = cur_ifaddrs->ifa_next) {

        opal_if_t intf;
        opal_if_t *intf_ptr;
        struct in_addr a4;

#if 0
        printf("interface %s.\n", cur_ifaddrs->ifa_name);
#endif
        /* skip non- af_inet interface addresses */
        if(AF_INET != cur_ifaddrs->ifa_addr->sa_family) {
#if 0
            printf("skipping non- af_inet interface %s, family %d.\n",
                   cur_ifaddrs->ifa_name, cur_ifaddrs->ifa_addr->sa_family);
#endif
            continue;
        }

        /* skip interface if it is down (IFF_UP not set) */
        if(0 == (cur_ifaddrs->ifa_flags & IFF_UP)) {
#if 0
            printf("skipping non-up interface %s.\n", cur_ifaddrs->ifa_name);
#endif
            continue;
        }

        /* skip interface if it is a loopback device (IFF_LOOPBACK set) */
        /* or if it is a point-to-point interface */
        /* TODO: do we really skip p2p? */
        if(0 != (cur_ifaddrs->ifa_flags & IFF_LOOPBACK)
           || 0 != (cur_ifaddrs->ifa_flags & IFF_POINTOPOINT)) {
#if 0
            printf("skipping loopback interface %s.\n", cur_ifaddrs->ifa_name);
#endif
            continue;
        }

#if 0
        printf("sa_len %d.\n", cur_ifaddrs->ifa_addr->sa_len);
#endif
        sin_addr = (struct sockaddr_in *) cur_ifaddrs->ifa_addr;

        /* There shouldn't be any IPv6 address starting with fe80: to skip */

        memset(&intf, 0, sizeof(intf));
        OBJ_CONSTRUCT(&intf, opal_list_item_t);
#if 0
        char *addr_name = (char *) malloc(48*sizeof(char));
        inet_ntop(AF_INET, &sin_addr->sin_addr, addr_name, 48*sizeof(char));
        opal_output(0, "inet capable interface %s discovered, address %s.\n",
                    cur_ifaddrs->ifa_name, addr_name);
        free(addr_name);
#endif

        /* fill values into the opal_if_t */
        memcpy(&a4, &(sin_addr->sin_addr), sizeof(struct in_addr));

        strncpy(intf.if_name, cur_ifaddrs->ifa_name, IF_NAMESIZE);
        intf.if_index = opal_list_get_size(&opal_if_list) + 1;
        ((struct sockaddr_in*) &intf.if_addr)->sin_addr = a4;
        ((struct sockaddr_in*) &intf.if_addr)->sin_family = AF_INET;
        ((struct sockaddr_in*) &intf.if_addr)->sin_len =
            cur_ifaddrs->ifa_addr->sa_len;

        /* since every scope != 0 is ignored, we just set the scope to 0 */
        /* There's no scope_id in the non-ipv6 stuff
        ((struct sockaddr_in6*) &intf.if_addr)->sin6_scope_id = 0;
        */

        /*
         * hardcoded netmask, adrian says that's ok
         */
/* 

Re: [OMPI users] Dynamic Symbol Relocation in Plugin Shared Library

2009-12-03 Thread Jeff Squyres
What version of Open MPI are you using?  We just made a 
minor-but-potentially-important change to how we handle our dlopen code in 
1.3.4.

Additionally, you might try configuring Open MPI with the --disable-dlopen 
configure switch.  This switch does two things:

1. Slurps all of Open MPI's plugins up into normal libraries (e.g., libmpi.so 
or libmpi.a)
2. Disables / compiles out all of Open MPI's dlopen (and related) code

If 1.3.4 doesn't fix your problem, then --disable-dlopen should.


On Dec 3, 2009, at 2:56 PM, Cupp, Matthew R wrote:

> Hi,
>  
> I’m having an issue with the MPI version of my application and the dynamic 
> relocation of symbols from plugin shared libraries. 
>  
> There are duplicate symbols in both the main executable (Engine) and a shared 
> library that opened at runtime using dlopen (Plugin).  The plugin is opened 
> with the command dlopen(pFilepath, RTLD_LAZY | RTLD_LOCAL).  When I run the 
> entry point function that I get using dlsym, there is a segmentation 
> violation that occurs during the execution of that function.  The mpirun 
> outputs the backtrace of the segfault, and in it I can see that execution 
> goes from the engine to the plugin and back to the engine.  The plugin is 
> statically linked to a class library that is also statically linked to the 
> engine (but a different version) and contains a couple of files found in the 
> engine (again a different version).  The plugin should be completely self 
> sufficient, meaning it has everything it needs to function independently of 
> the engine, and should never need to have symbols dynamically linked to the 
> engine. 
>  
> When I run the single (non-MPI) version of the application, it runs fine 
> (apparently without plugin symbol relocation).  When I run the MPI version, I 
> get the segfault.  The code that handles plugins is the same in both 
> versions, and doesn’t rely on any MPI functionality.
>  
> Is there some way to change how the MPI runtime uses the executable so it 
> doesn’t export the symbols?  Or any way to prevent the dynamic symbol 
> relocation when loading the shared library?  Or linker flags that I could use 
> with the plugin shared library so it does list its internal symbols?
>  
> I have a Stack Overflow question on this here:
> http://stackoverflow.com/questions/1821153/segfault-on-c-plugin-library-with-duplicate-symbols
>  
> Thanks!
> Matt
>  
> __
> Matt Cupp
> Battelle Memorial Institute
> Statistics and Information Analysis
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com




[OMPI users] Dynamic Symbol Relocation in Plugin Shared Library

2009-12-03 Thread Cupp, Matthew R
Hi,

I'm having an issue with the MPI version of my application and the dynamic 
relocation of symbols from plugin shared libraries.

There are duplicate symbols in both the main executable (Engine) and a shared 
library that opened at runtime using dlopen (Plugin).  The plugin is opened 
with the command dlopen(pFilepath, RTLD_LAZY | RTLD_LOCAL).  When I run the 
entry point function that I get using dlsym, there is a segmentation violation 
that occurs during the execution of that function.  The mpirun outputs the 
backtrace of the segfault, and in it I can see that execution goes from the 
engine to the plugin and back to the engine.  The plugin is statically linked 
to a class library that is also statically linked to the engine (but a 
different version) and contains a couple of files found in the engine (again a 
different version).  The plugin should be completely self sufficient, meaning 
it has everything it needs to function independently of the engine, and should 
never need to have symbols dynamically linked to the engine.

When I run the single (non-MPI) version of the application, it runs fine 
(apparently without plugin symbol relocation).  When I run the MPI version, I 
get the segfault.  The code that handles plugins is the same in both versions, 
and doesn't rely on any MPI functionality.

Is there some way to change how the MPI runtime uses the executable so it 
doesn't export the symbols?  Or any way to prevent the dynamic symbol 
relocation when loading the shared library?  Or linker flags that I could use 
with the plugin shared library so it does list its internal symbols?
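For reference, a minimal, self-contained sketch of the dlopen/dlsym loading
pattern described above (the plugin path and the entry-point name
"plugin_entry" are placeholders, not the actual names used here). The
RTLD_DEEPBIND flag shown is a glibc extension that makes a plugin bind its
internal references to its own definitions rather than to same-named symbols
exported by the executable; it is only one possible direction to try, not
something confirmed in this thread (linking the plugin with -Wl,-Bsymbolic is
another commonly suggested option):

#define _GNU_SOURCE           /* for RTLD_DEEPBIND on glibc */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*entry_fn)(int argc, char **argv);

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "./plugin.so";

    /* RTLD_LOCAL keeps the plugin's symbols out of the global scope;
       RTLD_DEEPBIND (glibc-only) additionally makes the plugin's own
       internal references resolve within the plugin first. */
    void *handle = dlopen(path, RTLD_LAZY | RTLD_LOCAL | RTLD_DEEPBIND);
    if (NULL == handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    entry_fn entry = (entry_fn) dlsym(handle, "plugin_entry");
    if (NULL == entry) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    int rc = entry(argc, argv);   /* run the plugin's entry point */
    dlclose(handle);
    return rc;
}

(Build with something like: cc plugin_load.c -o plugin_load -ldl.)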

I have a Stack Overflow question on this here:
http://stackoverflow.com/questions/1821153/segfault-on-c-plugin-library-with-duplicate-symbols

Thanks!
Matt

__
Matt Cupp
Battelle Memorial Institute
Statistics and Information Analysis


Re: [OMPI users] Program deadlocks, on simple send/recv loop

2009-12-03 Thread Eugene Loh




Jeff Squyres wrote:

> On Dec 3, 2009, at 10:56 AM, Brock Palen wrote:
>
> > The allocation statement is ok:
> > allocate(vec(vec_size,vec_per_proc*(size-1)))
> >
> > This allocates memory vec(32768, 2350)
>
> So this allocates 32768 rows, each with 2350 columns -- all stored
> contiguously in memory, in column-major order.  Does the language/compiler
> *guarantee* that the entire matrix is contiguous in memory?  Or does it
> only guarantee that the *columns* are contiguous in memory -- and there
> may be gaps between successive columns?

I think you're getting one big contiguous block of memory and the
portions that are passed are contiguous, nonoverlapping pieces.

> This means that in the first iteration, you're calling:
> call MPI_RECV(vec(1, 2301), 32768, ...)
>
> And in the last iteration, you're calling:
> call MPI_RECV(vec(1, 2350), 32768, ...)
>
> That doesn't seem right.  If I'm reading this right -- and I very well may
> not be -- it looks like successive receives will be partially overlaying
> the previous receive.

No.  In Fortran, leftmost index varies the fastest.  E.g.,

% cat y.f90
  integer a(2,2)
  a(1,1) = 11
  a(2,1) = 21
  a(1,2) = 12
  a(2,2) = 22
  call sub(a)
end

subroutine sub(a)
  integer a(4)
  write(6,*) a
end
% a.out
 11 21 12 22
% 

Here is how I think of Brock's code:

program sendbuf

  include 'mpif.h'

  integer, parameter :: n = 32 * 1024, m = 50

  complex*16 buf(n)

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)

  buf = 0

  if ( me == 0 ) then
     do i = 1, np-1
        do j = 1, m
           call MPI_RECV(buf, n, MPI_DOUBLE_COMPLEX, i, j, &
                MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
        end do
     end do
  else
     do j = 1, m
        call MPI_SEND(buf, n, MPI_DOUBLE_COMPLEX, 0, j, MPI_COMM_WORLD, ierr)
     end do
  end if

  call MPI_FINALIZE(ierr)

end


This version reuses send and receive buffers, but that's fine since
they're all blocking calls anyhow.




Re: [OMPI users] Program deadlocks, on simple send/recv loop

2009-12-03 Thread Jed Brown
On Thu, 3 Dec 2009 12:21:50 -0500, Jeff Squyres  wrote:
> On Dec 3, 2009, at 10:56 AM, Brock Palen wrote:
> 
> > The allocation statement is ok:
> > allocate(vec(vec_size,vec_per_proc*(size-1)))
> > 
> > This allocates memory vec(32768, 2350)

It's easier to translate to C rather than trying to read Fortran
directly.

  #define M 2350
  #define N 32768
  complex double vec[M*N];

> This means that in the first iteration, you're calling:
> 
> irank = 1
> ivec = 1
> vec_ind = (47 - 1) * 50 + 1 = 2301
> call MPI_RECV(vec(1, 2301), 32768, ...)

  MPI_Recv([2300*N],N,...);

> And in the last iteration, you're calling:
> 
> irank = 47
> ivec = 50
> vec_ind = (47 - 1) * 50 + 50 = 2350
> call MPI_RECV(vec(1, 2350), 32768, ...)

  MPI_Recv([2349*N],N,...);

> That doesn't seem right.

Should be one non-overlapping column (C row) at a time.  It will be
contiguous in memory, but this isn't using that property.

Jed


Re: [OMPI users] Program deadlocks, on simple send/recv loop

2009-12-03 Thread Jeff Squyres
On Dec 3, 2009, at 10:56 AM, Brock Palen wrote:

> The allocation statement is ok:
> allocate(vec(vec_size,vec_per_proc*(size-1)))
> 
> This allocates memory vec(32768, 2350)

So this allocates 32768 rows, each with 2350 columns -- all stored contiguously 
in memory, in column-major order.  Does the language/compiler *guarantee* that 
the entire matrix is contiguous in memory?  Or does it only guarantee that the 
*columns* are contiguous in memory -- and there may be gaps between successive 
columns?

2350 means you're running with 48 procs.

In the loop:

 do irank=1,size-1
    do ivec=1,vec_per_proc
       write (6,*) 'irank=',irank,'ivec=',ivec
       vec_ind=(irank-1)*vec_per_proc+ivec
       call MPI_RECV( vec(1,vec_ind), vec_size, MPI_DOUBLE_COMPLEX, irank, &
            vec_ind, MPI_COMM_WORLD, status, ierror)

This means that in the first iteration, you're calling:

irank = 1
ivec = 1
vec_ind = (47 - 1) * 50 + 1 = 2301
call MPI_RECV(vec(1, 2301), 32768, ...)

And in the last iteration, you're calling:

irank = 47
ivec = 50
vec_ind = (47 - 1) * 50 + 50 = 2350
call MPI_RECV(vec(1, 2350), 32768, ...)

That doesn't seem right.  If I'm reading this right -- and I very well may not 
be -- it looks like successive receives will be partially overlaying the 
previous receive.  Is that what you intended?  Is MPI supposed to overflow the 
columns properly?  I can see how a problem might occur here if the columns are 
not actually contiguous in memory...?

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] Program deadlocks, on simple send/recv loop

2009-12-03 Thread Eugene Loh

Ashley Pittman wrote:

> On Wed, 2009-12-02 at 13:11 -0500, Brock Palen wrote:
> > On Dec 1, 2009, at 11:15 AM, Ashley Pittman wrote:
> > > On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
> > > > The attached code, is an example where openmpi/1.3.2 will lock up, if
> > > > ran on 48 cores, of IB (4 cores per node),
> > > > The code loops over recv from all processors on rank 0 and sends from
> > > > all other ranks, as far as I know this should work, and I can't see
> > > > why not.
> > > > Note yes I know we can do the same thing with a gather, this is a
> > > > simple case to demonstrate the issue.
> > > > Note that if I increase the openib eager limit, the program runs,
> > > > which normally means improper MPI, but I can't on my own figure out
> > > > the problem with this code.
> > >
> > > What are you increasing the eager limit from and too?
> >
> > The same value as ethernet on our system,
> > mpirun --mca btl_openib_eager_limit 655360 --mca
> > btl_openib_max_send_size 655360 ./a.out
> >
> > Huge values compared to the defaults, but works,
>
> My understanding of the code is that each message will be 256k long

Yes.  Brock's Fortran code has each nonzero rank send 50 messages, each
256K, via standard send to rank 0.  Rank 0 uses standard receives on
them all, pulling in all 50 messages in order from rank 1, then from
rank 2, etc.

http://www.open-mpi.org/community/lists/users/2009/12/11311.php

John Cary sent out a C++ code on this same e-mail thread.  It sends 
256*8=2048-byte messages.  Each nonzero rank sends 1 message and rank 0 
pulls these in in rank order.  Then there is a barrier.  The program 
iterates on this pattern.

http://www.open-mpi.org/community/lists/users/2009/12/11348.php

I can imagine the two programs are illustrating two different problems.


Re: [OMPI users] Program deadlocks, on simple send/recv loop

2009-12-03 Thread vasilis gkanis
I had a similar problem with the Portland Fortran compiler. I knew that this 
was not caused by a network problem (I ran the code on a single node with 4 
CPUs). After testing pretty much everything, I decided to change the compiler.
I used the Intel Fortran compiler and everything is running fine. 
It could be a PGI compiler voodoo :)

Vasilis



On Thursday 03 December 2009 05:56:39 pm Brock Palen wrote:
> On Dec 1, 2009, at 8:09 PM, John R. Cary wrote:
> > Jeff Squyres wrote:
> >> (for the web archives)
> >>
> >> Brock and I talked about this .f90 code a bit off list -- he's
> >> going to investigate with the test author a bit more because both
> >> of us are a bit confused by the F90 array syntax used.
> 
> Jeff, I talked to the user this morning, that data is contiguous in
> memory, sans any PGI compiler voodoo,
> The allocation statement is ok:
> allocate(vec(vec_size,vec_per_proc*(size-1)))
> 
> This allocates memory vec(32768, 2350)
> 
> Note that Fortran is column major in memory, that explains (I knew
> this sorry forgot) why the indexes are switched,
> 
> Brock Palen
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


Re: [OMPI users] Program deadlocks, on simple send/recv loop

2009-12-03 Thread Brock Palen

On Dec 1, 2009, at 8:09 PM, John R. Cary wrote:

> Jeff Squyres wrote:
>> (for the web archives)
>>
>> Brock and I talked about this .f90 code a bit off list -- he's
>> going to investigate with the test author a bit more because both
>> of us are a bit confused by the F90 array syntax used.

Jeff, I talked to the user this morning, that data is contiguous in
memory, sans any PGI compiler voodoo,

The allocation statement is ok:
allocate(vec(vec_size,vec_per_proc*(size-1)))

This allocates memory vec(32768, 2350)

Note that Fortran is column major in memory, that explains (I knew
this sorry forgot) why the indexes are switched,


Brock Palen


Re: [OMPI users] Program deadlocks, on simple send/recv loop

2009-12-03 Thread Richard Treumann

MPI standard compliant management of eager send requires that this program
work. There is nothing that says "unless eager limit is set too high/low."
Honoring this requirement in an MPI implementation can be costly. There are
practical reasons to pass up this requirement because most applications do
not need it.

I would like to see the MPI Forum find a way to relax this requirement, and
I have made a proposal that would do so without invalidating any current MPI
program.

I would consider simply removing this requirement if the MPI Forum decides
that it is OK for some valid MPI 2.2 programs to be invalid MPI 3.0
programs, but I hope the Forum does not go in the direction of breaking
existing valid MPI programs.

Ashley says below:  "If the MPI_SEND isn't blocking then each rank will
send 50 messages to rank zero and you'll have 2000 messages "

What the standard says is MPI_SEND must block before there are more
messages at the destination than it can manage.

I do not think ignoring the fact that the standard requires this program to
work is a very good solution.

   Dick


Here is what the standard says:

Section 3.4 MPI 2.2 page 39:1..7

The send call described in Section 3.2.1 uses the standard communication
mode. In this mode, it is up to MPI to decide whether outgoing messages
will be buffered. MPI may buffer outgoing messages. In such a case, the
send call may complete before a matching receive is invoked. On the other
hand, buffer space may be unavailable, or MPI may choose not to buffer
outgoing messages, for performance reasons. In this case, the send call
will not complete until a matching receive has been posted, and the data
has been moved to the receiver.

Section 3.5 MPI 2.2 page 44:8..19

A buffered send operation that cannot complete because of a lack of buffer
space is erroneous. When such a situation is detected, an error is
signalled that may cause the program to terminate abnormally. On the other
hand, a standard send operation that cannot complete because of lack of
buffer space will merely block, waiting for buffer space to become
available or for a matching receive to be posted. This behavior is
preferable in many situations. Consider a situation where a producer
repeatedly produces new values and sends them to a consumer. Assume that
the producer produces new values faster than the consumer can consume them.
If buffered sends are used, then a buffer overflow will result. Additional
synchronization has to be added to the program so as to prevent this from
occurring. If standard sends are used, then the producer will be
automatically throttled, as its send operations will block when buffer
space is unavailable.

Note - in the paragraph above "buffered send" means MPI_BSEND, not eager
send.

Dick Treumann  -  MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363


users-boun...@open-mpi.org wrote on 12/03/2009 05:33:51 AM:

> [image removed]
>
> Re: [OMPI users] Program deadlocks, on simple send/recv loop
>
> Ashley Pittman
>
> to:
>
> Open MPI Users
>
> 12/03/2009 05:35 AM
>
> Sent by:
>
> users-boun...@open-mpi.org
>
> Please respond to Open MPI Users
>
> On Wed, 2009-12-02 at 13:11 -0500, Brock Palen wrote:
> > On Dec 1, 2009, at 11:15 AM, Ashley Pittman wrote:
> > > On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
> > >> The attached code, is an example where openmpi/1.3.2 will lock up,
if
> > >> ran on 48 cores, of IB (4 cores per node),
> > >> The code loops over recv from all processors on rank 0 and sends
from
> > >> all other ranks, as far as I know this should work, and I can't see
> > >> why not.
> > >> Note yes I know we can do the same thing with a gather, this is a
> > >> simple case to demonstrate the issue.
> > >> Note that if I increase the openib eager limit, the program runs,
> > >> which normally means improper MPI, but I can't on my own figure out
> > >> the problem with this code.
> > >
> > > What are you increasing the eager limit from and too?
> >
> > The same value as ethernet on our system,
> > mpirun --mca btl_openib_eager_limit 655360 --mca
> > btl_openib_max_send_size 655360 ./a.out
> >
> > Huge values compared to the defaults, but works,
>
> My understanding of the code is that each message will be 256k long and
> the code pretty much guarantees that at some point there will be 46
> messages in the queue in front of the one you are looking to receive
> which makes a total of 11.5Mb, slightly less if you take shared memory
> into account.
>
> If the MPI_SEND isn't blocking then each rank will send 50 messages to
> rank zero and you'll have 2000 messages and 500Mb of data being received
> with the message you want being somewhere towards the end of the queue.
>
> These numbers are far from huge but then compared to an eager limit of
> 64k they aren't small either.
>
> I suspect the eager limit is being reached on COMM_WORLD rank 0 and it's
> not 

Re: [OMPI users] Program deadlocks, on simple send/recv loop

2009-12-03 Thread Ashley Pittman
On Wed, 2009-12-02 at 13:11 -0500, Brock Palen wrote:
> On Dec 1, 2009, at 11:15 AM, Ashley Pittman wrote:
> > On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
> >> The attached code, is an example where openmpi/1.3.2 will lock up, if
> >> ran on 48 cores, of IB (4 cores per node),
> >> The code loops over recv from all processors on rank 0 and sends from
> >> all other ranks, as far as I know this should work, and I can't see
> >> why not.
> >> Note yes I know we can do the same thing with a gather, this is a
> >> simple case to demonstrate the issue.
> >> Note that if I increase the openib eager limit, the program runs,
> >> which normally means improper MPI, but I can't on my own figure out
> >> the problem with this code.
> >
> > What are you increasing the eager limit from and too?
> 
> The same value as ethernet on our system,
> mpirun --mca btl_openib_eager_limit 655360 --mca  
> btl_openib_max_send_size 655360 ./a.out
> 
> Huge values compared to the defaults, but works,

My understanding of the code is that each message will be 256k long and
the code pretty much guarantees that at some point there will be 46
messages in the queue in front of the one you are looking to receive
which makes a total of 11.5Mb, slightly less if you take shared memory
into account.

If the MPI_SEND isn't blocking then each rank will send 50 messages to
rank zero and you'll have 2000 messages and 500Mb of data being received
with the message you want being somewhere towards the end of the queue.

These numbers are far from huge but then compared to an eager limit of
64k they aren't small either.

I suspect the eager limit is being reached on COMM_WORLD rank 0 and it's
not pulling any more messages off the network pending some of the
existing ones being out of the queue but they never will be because the
message being waited for is one that's stuck on the network.  As I say
the message queue for rank 0 when it's deadlocked would be interesting
to look at.

In summary, this code makes heavy use of unexpected messages and network
buffering; it's not surprising to me that it only works with eager
limits set fairly high.

Ashley,
-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] exceedingly virtual memory consumption of MPI environment if higher-setting "ulimit -s"

2009-12-03 Thread Paul Kapinos

Hi Jeff, hi all,

> I can't think of what OMPI would be doing related to the predefined
> stack size -- I am not aware of anywhere in the code where we look up
> the predefined stack size and then do something with it.

I do not know the OMPI code at all - but what I see is virtual memory
consumption amounting to twice the stack size that is set up by a new login.

> That being said, I don't know what the OS and resource consumption
> effects are of setting 1GB+ stack size on *any* application...

We definitely have applications which *need* a stack size of 500+ MB.

Users who use such codes may tend to hard-code a *huge* stack size in
their profile (you do not want to lose a day or two of computing time
just by forgetting to set a ulimit, right?). (Currently, I see *one*
such user, but who knows how many there are...)

Nevertheless, even if the users do not use a huge stack size, the
default stack size is some 20 MB. That's not much, but is this
allocate-and-never-use of twice the stack size really needed?


Best wishes,
PK


> Have you tried non-MPI examples, potentially with applications as large
> as MPI applications but without the complexity of MPI?



On Nov 19, 2009, at 3:13 PM, David Singleton wrote:



Depending on the setup, threads often get allocated a thread local
stack with size equal to the stacksize rlimit.  Two threads maybe?

David

Terry Dontje wrote:
> A couple things to note.  First Sun MPI 8.2.1 is effectively OMPI
> 1.3.4.  I also reproduced the below issue using a C code so I think this
> is a general issue with OMPI and not Fortran based.
>
> I did a pmap of a process and there were two anon spaces equal to the
> stack space set by ulimit.
>
> In one case (setting 102400) the anon spaces were next to each other
> prior to all the loadable libraries.  In another case (setting 1024000)
> one anon space was locate in the same area as the first case but the
> second space was deep into some memory used by ompi.
>
> Is any of this possibly related to the predefined handles?  Though I am
> not sure why it would expand based on stack size?.
>
> --td
>> Date: Thu, 19 Nov 2009 19:21:46 +0100
>> From: Paul Kapinos 
>> Subject: [OMPI users] exceedingly virtual memory consumption of MPI
>> environment if higher-setting "ulimit -s"
>> To: Open MPI Users 
>> Message-ID: <4b058cba.3000...@rz.rwth-aachen.de>
>> Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
>>
>> Hi volks,
>>
>> we see an exceedingly high *virtual* memory consumption by MPI processes
>> if "ulimit -s" (stack size) in the profile configuration was set higher.
>>
>> Furthermore we believe every MPI process started wastes about double the
>> `ulimit -s` value which was set in a fresh console (that is, the value
>> configured in e.g. .zshenv, *not* the value actually set in the console
>> from which the mpiexec runs).
>>
>> Sun MPI 8.2.1, an empty mpi-HelloWorld program
>> ! either if running both processes on the same host..
>>
>> .zshenv: ulimit -s 10240   --> VmPeak:180072 kB
>> .zshenv: ulimit -s 102400  --> VmPeak:364392 kB
>> .zshenv: ulimit -s 1024000 --> VmPeak:2207592 kB
>> .zshenv: ulimit -s 2024000 --> VmPeak:4207592 kB
>> .zshenv: ulimit -s 2024 --> VmPeak:   39.7 GB
>> (see the attached files; the a.out binary is an MPI hello-world program
>> running a never-ending loop).
>>
>>
>>
>> Normally, the stack size ulimit is set to some 10 MB by us, but we see
>> a lot of codes which need *a lot* of stack space, e.g. Fortran codes,
>> OpenMP codes (and especially Fortran OpenMP codes). Users tend to
>> hard-code the higher value for the stack size ulimit.
>>
>> Normally, using a lot of virtual memory is no problem, because
>> there is a lot of this thing :-) But... if more than one person is
>> allowed to work on a computer, you have to divide the resources in
>> such a way that nobody can crash the box. We do not know how to limit
>> the real RAM used, so we need to divide the RAM by means of setting a
>> virtual memory ulimit (in our batch system, e.g.). That is, for us
>> "virtual memory consumption" = "real memory consumption".
>> And real memory is not as cheap as virtual memory.
>>
>>
>> So, why consume *twice* the stack size for each process?
>>
>> And why consume the virtual memory at all? We guess this virtual
>> memory is allocated for the stack (why else would it be related to the
>> stack size ulimit). But is such allocation really needed? Is there a
>> way to avoid the waste of virtual memory?
>>
>> best regards,
>> Paul Kapinos
>>
>>
>>
>>
>>
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users







--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
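
A small Linux/glibc-specific sketch (not code from this thread) that can be
used to check David Singleton's suggestion above, i.e. that a thread created
with default attributes gets a private stack sized by the stack rlimit, which
would account for VmPeak growing in step with "ulimit -s":

#define _GNU_SOURCE           /* for pthread_getattr_np (glibc) */
#include <pthread.h>
#include <stdio.h>
#include <sys/resource.h>

static void *worker(void *arg)
{
    pthread_attr_t attr;
    void *stack_addr = NULL;
    size_t stack_size = 0;

    (void) arg;
    /* Query the attributes of the running thread and report its stack size. */
    pthread_getattr_np(pthread_self(), &attr);
    pthread_attr_getstack(&attr, &stack_addr, &stack_size);
    pthread_attr_destroy(&attr);

    printf("thread stack size: %zu bytes\n", stack_size);
    return NULL;
}

int main(void)
{
    struct rlimit rl;
    pthread_t t;

    getrlimit(RLIMIT_STACK, &rl);
    printf("RLIMIT_STACK (soft): %llu bytes\n",
           (unsigned long long) rl.rlim_cur);

    pthread_create(&t, NULL, worker, NULL);   /* default attributes */
    pthread_join(t, NULL);
    return 0;
}

If the two numbers track each other (build with -pthread), then each extra
thread created inside an MPI process would add roughly one stack-ulimit worth
of address space to VmPeak, matching the observation above.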

[OMPI users] Mimicking timeout for MPI_Wait

2009-12-03 Thread Katz, Jacob
Hi,
I wonder if there is a BKM (efficient and portable) to mimic a timeout on a 
call to MPI_Wait, i.e. to interrupt it once a given time period has passed if 
it hasn't returned by then.
I'd appreciate it if anyone could send a pointer/idea.
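
One portable workaround (a sketch, not an official recipe) is to poll
MPI_Test against a wall-clock deadline instead of blocking in MPI_Wait; the
timeout value, the back-off interval, and what to do on expiry are
placeholders left to the application:

#include <mpi.h>
#include <unistd.h>    /* usleep */

/* Returns 1 if the request completed within 'timeout' seconds, 0 otherwise.
   On 0, the request is still active and must eventually be completed or
   cancelled by the caller. */
static int wait_with_timeout(MPI_Request *req, double timeout,
                             MPI_Status *status)
{
    double deadline = MPI_Wtime() + timeout;
    int done = 0;

    while (!done && MPI_Wtime() < deadline) {
        MPI_Test(req, &done, status);
        if (!done) {
            usleep(1000);   /* back off ~1 ms so we don't spin a full core */
        }
    }
    return done;
}

The polling interval trades latency against CPU usage; with no sleep at all
this is essentially a busy-wait.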

Thanks.

Jacob M. Katz | jacob.k...@intel.com | Work: 
+972-4-865-5726 | iNet: (8)-465-5726

-
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-03 Thread Nicolas Bock
That was quick. I will try the patch as soon as you release it.

nick


On Wed, Dec 2, 2009 at 21:06, Ralph Castain  wrote:

> Patch is built and under review...
>
> Thanks again
> Ralph
>
> On Dec 2, 2009, at 5:37 PM, Nicolas Bock wrote:
>
> Thanks
>
> On Wed, Dec 2, 2009 at 17:04, Ralph Castain  wrote:
>
>> Yeah, that's the one all right! Definitely missing from 1.3.x.
>>
>> Thanks - I'll build a patch for the next bug-fix release
>>
>>
>> On Dec 2, 2009, at 4:37 PM, Abhishek Kulkarni wrote:
>>
>> > On Wed, Dec 2, 2009 at 5:00 PM, Ralph Castain  wrote:
>> >> Indeed - that is very helpful! Thanks!
>> >> Looks like we aren't cleaning up high enough - missing the directory
>> level.
>> >> I seem to recall seeing that error go by and that someone fixed it on
>> our
>> >> devel trunk, so this is likely a repair that didn't get moved over to
>> the
>> >> release branch as it should have done.
>> >> I'll look into it and report back.
>> >
>> > You are probably referring to
>> > https://svn.open-mpi.org/trac/ompi/changeset/21498
>> >
>> > There was an issue about orte_session_dir_finalize() not
>> > cleaning up the session directories properly.
>> >
>> > Hope that helps.
>> >
>> > Abhishek
>> >
>> >> Thanks again
>> >> Ralph
>> >> On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:
>> >>
>> >>
>> >> On Wed, Dec 2, 2009 at 14:23, Ralph Castain  wrote:
>> >>>
>> >>> Hmm... if you are willing to keep trying, could you perhaps let it run
>> for
>> >>> a brief time, ctrl-z it, and then do an ls on a directory from a
>> process
>> >>> that has already terminated? The pids will be in order, so just look
>> for an
>> >>> early number (not mpirun or the parent, of course).
>> >>> It would help if you could give us the contents of a directory from a
>> >>> child process that has terminated - would tell us what subsystem is
>> failing
>> >>> to properly cleanup.
>> >>
>> >> Ok, so I Ctrl-Z the master. In
>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one
>> >> directory
>> >>
>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857
>> >>
>> >> I can't find that PID though. mpirun has PID 4230, orted does not
>> exist,
>> >> master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it
>> again,
>> >> slave has a different PID as expected. I Ctrl-Z'ed in iteration 68,
>> there
>> >> are 70 sequentially numbered directories starting at 0. Every directory
>> >> contains another directory called "0". There is nothing in any of those
>> >> directories. I see for instance:
>> >>
>> >> /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
>> >> total 4.0K
>> >> drwx-- 2 nbock users 4.0K Dec  2 14:41 0
>> >>
>> >> and
>> >>
>> >> nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls
>> -lh
>> >> 70/0/
>> >> total 0
>> >>
>> >> I hope this information helps. Did I understand your question
>> correctly?
>> >>
>> >> nick
>> >>
>> >> ___
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >>
>> >> ___
>> >> users mailing list
>> >> us...@open-mpi.org
>> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >>
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>