Re: [OMPI users] Upgrade from Open MPI 1.2 to 1.3

2009-03-10 Thread Jeff Squyres

On Mar 10, 2009, at 6:53 PM, Serge wrote:


Thank you, it's very good news. If the issue has been fixed, does
it mean that v1.3.2 will allow applications compiled with v1.2.9 to run?

Or is it that, starting with v1.3.2, subsequent releases will be backward
compatible with each other?



The latter -- binary compatibility will start with v1.3.2 and continue
onwards.


Binary compatibility was not actually guaranteed between any of the  
1.2.x releases, either -- it just usually works.
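
(As a practical aside for anyone who needs to check what a given binary was built against: Open MPI's mpi.h exposes compile-time version macros, so a small sketch like the one below can report the headers an application was compiled with. This assumes the OMPI_*_VERSION macros are present in your mpi.h; the #if guard keeps it safe if they are not.)

    /* Hedged sketch: print the Open MPI version this binary was compiled
     * against, assuming mpi.h defines the OMPI_*_VERSION macros. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
    #if defined(OMPI_MAJOR_VERSION)
        /* compile-time version of the Open MPI headers used for this build */
        printf("built against Open MPI %d.%d.%d\n",
               OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION);
    #else
        printf("OMPI_*_VERSION macros not available in this mpi.h\n");
    #endif
        MPI_Finalize();
        return 0;
    }

Comparing that output with what ompi_info reports on the target system shows whether the binary and the installed library are from different release series.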


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Upgrade from Open MPI 1.2 to 1.3

2009-03-10 Thread Serge
Thank you, it's very good news. If the issue has been fixed, does
it mean that v1.3.2 will allow applications compiled with v1.2.9 to run?
Or is it that, starting with v1.3.2, subsequent releases will be backward
compatible with each other?



Jeff Squyres wrote:
Unfortunately, binary compatibility between Open MPI release versions 
has never been guaranteed (even between subreleases).


That being said, we have fixed this issue and expect to support binary 
compatibility between Open MPI releases starting with v1.3.2 (v1.3.1 
should be released soon; we're aiming for v1.3.2 towards the beginning 
of next month).




On Mar 10, 2009, at 11:59 AM, Serge wrote:


Hello,

We have a number of applications built with Open MPI 1.2 in a shared
multi-user environment. The Open MPI library upgrade has always been
transparent and painless within the v1.2 branch. Now we would like to
switch to Open MPI 1.3 just as seamlessly. However, an application built with
ompi v1.2 will not run with the 1.3 library; the typical error messages
are given below. Apparently, the type ompi_communicator_t has changed.

Symbol `ompi_mpi_comm_null' has different size in shared object,
consider re-linking
Symbol `ompi_mpi_comm_world' has different size in shared object,
consider re-linking

Do I have to rebuild all the applications with Open MPI 1.3?

Is there a better way to do a smooth upgrade?

Thank you.

= Serge






Re: [OMPI users] Latest SVN failures

2009-03-10 Thread Mostyn Lewis

I queued up a job to try this - will let you know.
I do have the authority to ifdown those rogue eth0 interfaces, as they are only an
artifact of our install (no cables), and will do that afterwards.

Thanks.

On Tue, 10 Mar 2009, Ralph Castain wrote:

Ick. We don't have a way currently to allow you to ignore an interface on a 
node-by-node basis. If you do:


-mca oob_tcp_if_exclude eth0

we will exclude that private Ethernet. The catch is that we will exclude 
"eth0" on -every- node. On the two machines you note here, that will still 
let us work - but I don't know if we will catch an "eth0" on another node 
where we need it.


Can you give it a try and see if it works?
Ralph

On Mar 10, 2009, at 2:13 PM, Mostyn Lewis wrote:


Maybe I know why now but it's not pleasant, e.g. 2 machines in the same
cluster have their ethernets such as:

Machine s0157

eth2  Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A8
BROADCAST MULTICAST  MTU:1500  Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
Interrupt:233 Base address:0x6000

eth3  Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A9
inet addr:10.173.128.13  Bcast:10.173.255.255  Mask:255.255.0.0
inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5780065692 (5512.3 Mb)  TX bytes:59140357016 (56400.6 Mb)
Interrupt:50 Base address:0x8000

Machine s0158

eth0  Link encap:Ethernet  HWaddr 00:23:8B:42:10:A9
inet addr:7.8.82.158  Bcast:7.8.255.255  Mask:255.255.0.0
UP BROADCAST MULTICAST  MTU:1500  Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
Interrupt:233 Base address:0x6000

eth1  Link encap:Ethernet  HWaddr 00:23:8B:42:10:AA
inet addr:10.173.128.14  Bcast:10.173.255.255  Mask:255.255.0.0
inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5879861483 (5607.4 Mb)  TX bytes:2406041840 (2294.5 Mb)
Interrupt:50 Base address:0x8000

Apart from the eths having different names (which happens when installing SuSE
SLES 10 SP2)
on apparently similar machines, I notice there's a private ethernet on s0158
at IP
7.8.82.158 - I guess this was used. How do I exclude it when the eth names vary?

DM


On Tue, 10 Mar 2009, Ralph Castain wrote:

Not really. I've run much bigger jobs than this without problem, so I 
don't think there is a fundamental issue here.


It looks like the TCP fabric between the various nodes is breaking down. I 
note in the enclosed messages that the problems are all with comm between 
daemons 4 and 21. We keep trying to get through, but failing.


I can fix things so we don't endlessly loop when that happens (IIRC, I 
think we are already supposed to abort, but it appears that isn't 
working). But the real question is why the comm fails in the first place.



On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:


Latest status - 1.4a1r20757 (yesterday);
the job now starts with a little output but quickly runs into trouble with
a lot of
'oob-tcp: Communication retries exceeded.  Can not communicate with peer'
errors?
e.g.
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer

The nodes are O.K. ...
Any ideas folks?
DM
On Sat, 28 Feb 2009, Ralph Castain wrote:
I think I have this figured out - will fix on Monday. I'm not sure why 
Jeff's conditions are all required, especially the second one. However, 
the fundamental problem is that we pull information from two sources 
regarding the number of procs in the job when unpacking a buffer, and 
the two sources appear to be out-of-sync with each other in certain 
scenarios.
The details are beyond the user list. I'll respond here again once I get 
it fixed.

Ralph
On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
Unfortunately, I think I have reproduced the problem as well -- with 
SVN trunk HEAD (r20655):

[15:12] 

Re: [OMPI users] Latest SVN failures

2009-03-10 Thread Jeff Squyres
You *could* have a per-machine mca param config file that could be  
locally staged on each machine and setup with the exclude for whatever  
you need on *that* node.  Ugly, but it could work...?



On Mar 10, 2009, at 4:26 PM, Ralph Castain wrote:


Ick. We don't have a way currently to allow you to ignore an interface
on a node-by-node basis. If you do:

-mca oob_tcp_if_exclude eth0

we will exclude that private Ethernet. The catch is that we will
exclude "eth0" on -every- node. On the two machines you note here,
that will still let us work - but I don't know if we will catch an
"eth0" on another node where we need it.

Can you give it a try and see if it works?
Ralph

On Mar 10, 2009, at 2:13 PM, Mostyn Lewis wrote:

> Maybe I know why now but it's not pleasant, e.g. 2 machines in the
> same
> cluster have their ethernets such as:
>
> Machine s0157
>
> eth2  Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A8
>  BROADCAST MULTICAST  MTU:1500  Metric:1
>  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>  collisions:0 txqueuelen:1000
>  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>  Interrupt:233 Base address:0x6000
>
> eth3  Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A9
>  inet addr:10.173.128.13  Bcast:10.173.255.255  Mask:
> 255.255.0.0
>  inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
>  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>  RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
>  TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
>  collisions:0 txqueuelen:1000
>  RX bytes:5780065692 (5512.3 Mb)  TX bytes:59140357016
> (56400.6 Mb)
>  Interrupt:50 Base address:0x8000
>
> Machine s0158
>
> eth0  Link encap:Ethernet  HWaddr 00:23:8B:42:10:A9
>  inet addr:7.8.82.158  Bcast:7.8.255.255  Mask:255.255.0.0
>  UP BROADCAST MULTICAST  MTU:1500  Metric:1
>  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>  collisions:0 txqueuelen:1000
>  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>  Interrupt:233 Base address:0x6000
>
> eth1  Link encap:Ethernet  HWaddr 00:23:8B:42:10:AA
>  inet addr:10.173.128.14  Bcast:10.173.255.255  Mask:
> 255.255.0.0
>  inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
>  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>  RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
>  TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
>  collisions:0 txqueuelen:1000
>  RX bytes:5879861483 (5607.4 Mb)  TX bytes:2406041840
> (2294.5 Mb)
>  Interrupt:50 Base address:0x8000
>
> Apart from the eths having different names (which happens when
> installing SuSE SLES 10 SP2)
> on apparently similar machines, I notice there's a private ethernet
> on s0158 at IP
> 7.8.82.158 - I guess this was used. How do I exclude it when the eth
> names vary?
>
> DM
>
>
> On Tue, 10 Mar 2009, Ralph Castain wrote:
>
>> Not really. I've run much bigger jobs than this without problem, so
>> I don't think there is a fundamental issue here.
>>
>> It looks like the TCP fabric between the various nodes is breaking
>> down. I note in the enclosed messages that the problems are all
>> with comm between daemons 4 and 21. We keep trying to get through,
>> but failing.
>>
>> I can fix things so we don't endlessly loop when that happens
>> (IIRC, I think we are already supposed to abort, but it appears
>> that isn't working). But the real question is why the comm fails in
>> the first place.
>>
>>
>> On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:
>>
>>> Latest status - 1.4a1r20757 (yesterday);
>>> the job now starts with a little output but quickly runs into trouble with
>>> a lot of
>>> 'oob-tcp: Communication retries exceeded.  Can not communicate with peer'
>>> errors?
>>> e.g.
>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>> The nodes are O.K. ...
>>> Any ideas folks?
>>> DM
>>> On Sat, 28 Feb 2009, Ralph Castain wrote:
 I think I have this figured out - will fix on Monday. I'm not
 sure why Jeff's conditions are all required, especially the
 second one. However, the fundamental problem is that we pull
 information from two sources regarding the number of procs in the
 job when unpacking a buffer, and the two sources appear to be out-of-sync with each other in certain scenarios.

Re: [OMPI users] Latest SVN failures

2009-03-10 Thread Ralph Castain
Ick. We don't have a way currently to allow you to ignore an interface  
on a node-by-node basis. If you do:


-mca oob_tcp_if_exclude eth0

we will exclude that private Ethernet. The catch is that we will  
exclude "eth0" on -every- node. On the two machines you note here,  
that will still let us work - but I don't know if we will catch an  
"eth0" on another node where we need it.


Can you give it a try and see if it works?
Ralph

On Mar 10, 2009, at 2:13 PM, Mostyn Lewis wrote:

Maybe I know why now but it's not pleasant, e.g. 2 machines in the  
same

cluster have their ethernets such as:

Machine s0157

eth2  Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A8
 BROADCAST MULTICAST  MTU:1500  Metric:1
 RX packets:0 errors:0 dropped:0 overruns:0 frame:0
 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000
 RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
 Interrupt:233 Base address:0x6000

eth3  Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A9
 inet addr:10.173.128.13  Bcast:10.173.255.255  Mask: 
255.255.0.0

 inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
 UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
 RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
 TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000
 RX bytes:5780065692 (5512.3 Mb)  TX bytes:59140357016  
(56400.6 Mb)

 Interrupt:50 Base address:0x8000

Machine s0158

eth0  Link encap:Ethernet  HWaddr 00:23:8B:42:10:A9
 inet addr:7.8.82.158  Bcast:7.8.255.255  Mask:255.255.0.0
 UP BROADCAST MULTICAST  MTU:1500  Metric:1
 RX packets:0 errors:0 dropped:0 overruns:0 frame:0
 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000
 RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
 Interrupt:233 Base address:0x6000

eth1  Link encap:Ethernet  HWaddr 00:23:8B:42:10:AA
 inet addr:10.173.128.14  Bcast:10.173.255.255  Mask: 
255.255.0.0

 inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
 UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
 RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
 TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000
 RX bytes:5879861483 (5607.4 Mb)  TX bytes:2406041840  
(2294.5 Mb)

 Interrupt:50 Base address:0x8000

Apart from the eths having different names (which happens when
installing SuSE SLES 10 SP2)
on apparently similar machines, I notice there's a private ethernet
on s0158 at IP
7.8.82.158 - I guess this was used. How do I exclude it when the eth
names vary?


DM


On Tue, 10 Mar 2009, Ralph Castain wrote:

Not really. I've run much bigger jobs than this without problem, so  
I don't think there is a fundamental issue here.


It looks like the TCP fabric between the various nodes is breaking  
down. I note in the enclosed messages that the problems are all  
with comm between daemons 4 and 21. We keep trying to get through,  
but failing.


I can fix things so we don't endlessly loop when that happens  
(IIRC, I think we are already supposed to abort, but it appears  
that isn't working). But the real question is why the comm fails in  
the first place.



On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:


Latest status - 1.4a1r20757 (yesterday);
the job now starts with a little output but quickly runs into trouble with
a lot of
'oob-tcp: Communication retries exceeded.  Can not communicate with peer'
errors?
e.g.
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer

The nodes are O.K. ...
Any ideas folks?
DM
On Sat, 28 Feb 2009, Ralph Castain wrote:
I think I have this figured out - will fix on Monday. I'm not  
sure why Jeff's conditions are all required, especially the  
second one. However, the fundamental problem is that we pull  
information from two sources regarding the number of procs in the  
job when unpacking a buffer, and the two sources appear to be out- 
of-sync with each other in certain scenarios.
The details are beyond the user list. I'll respond here again  
once I get it fixed.

Ralph
On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
Unfortunately, I think I have reproduced the problem as well --  
with SVN trunk HEAD (r20655):
[15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2  
uptime
[svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data  
unpack failed in file 

Re: [OMPI users] Latest SVN failures

2009-03-10 Thread Mostyn Lewis

Maybe I know why now but it's not pleasant, e.g. 2 machines in the same
cluster have their ethernets such as:

Machine s0157

eth2  Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A8
  BROADCAST MULTICAST  MTU:1500  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
  Interrupt:233 Base address:0x6000

eth3  Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A9
  inet addr:10.173.128.13  Bcast:10.173.255.255  Mask:255.255.0.0
  inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
  TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:5780065692 (5512.3 Mb)  TX bytes:59140357016 (56400.6 Mb)
  Interrupt:50 Base address:0x8000

Machine s0158

eth0  Link encap:Ethernet  HWaddr 00:23:8B:42:10:A9
  inet addr:7.8.82.158  Bcast:7.8.255.255  Mask:255.255.0.0
  UP BROADCAST MULTICAST  MTU:1500  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
  Interrupt:233 Base address:0x6000

eth1  Link encap:Ethernet  HWaddr 00:23:8B:42:10:AA
  inet addr:10.173.128.14  Bcast:10.173.255.255  Mask:255.255.0.0
  inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
  TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:5879861483 (5607.4 Mb)  TX bytes:2406041840 (2294.5 Mb)
  Interrupt:50 Base address:0x8000

Apart from the eths having different names (which happens when installing SuSE SLES
10 SP2)
on apparently similar machines, I notice there's a private ethernet on s0158 at
IP
7.8.82.158 - I guess this was used. How do I exclude it when the eth names vary?

DM


On Tue, 10 Mar 2009, Ralph Castain wrote:

Not really. I've run much bigger jobs than this without problem, so I don't 
think there is a fundamental issue here.


It looks like the TCP fabric between the various nodes is breaking down. I 
note in the enclosed messages that the problems are all with comm between 
daemons 4 and 21. We keep trying to get through, but failing.


I can fix things so we don't endlessly loop when that happens (IIRC, I think 
we are already supposed to abort, but it appears that isn't working). But the 
real question is why the comm fails in the first place.



On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:


Latest status - 1.4a1r20757 (yesterday);
the job now starts with a little output but quickly runs into trouble with
a lot of
'oob-tcp: Communication retries exceeded.  Can not communicate with peer '
errors?

e.g.
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer


The nodes are O.K. ...

Any ideas folks?

DM

On Sat, 28 Feb 2009, Ralph Castain wrote:

I think I have this figured out - will fix on Monday. I'm not sure why 
Jeff's conditions are all required, especially the second one. However, 
the fundamental problem is that we pull information from two sources 
regarding the number of procs in the job when unpacking a buffer, and the 
two sources appear to be out-of-sync with each other in certain scenarios.


The details are beyond the user list. I'll respond here again once I get 
it fixed.


Ralph

On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:

Unfortunately, I think I have reproduced the problem as well -- with SVN 
trunk HEAD (r20655):

[15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
[svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack 
failed in file base/odls_base_default_fns.c at line 566

--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
Notice that I'm not trying to run an MPI app -- it's just "uptime".
The following things seem to be necessary to make this error occur for 
me:

1. --bynode
2. set some mca parameter (any mca parameter)
3. -np value less than the size of 

Re: [OMPI users] Upgrade from Open MPI 1.2 to 1.3

2009-03-10 Thread Jeff Squyres
Unfortunately, binary compatibility between Open MPI release versions  
has never been guaranteed (even between subreleases).


That being said, we have fixed this issue and expect to support binary  
compatibility between Open MPI releases starting with v1.3.2 (v1.3.1  
should be released soon; we're aiming for v1.3.2 towards the beginning  
of next month).




On Mar 10, 2009, at 11:59 AM, Serge wrote:


Hello,

We have a number of applications built with Open MPI 1.2 in a shared
multi-user environment. The Open MPI library upgrade has always been
transparent and painless within the v1.2 branch. Now we would like to
switch to Open MPI 1.3 just as seamlessly. However, an application built with
ompi v1.2 will not run with the 1.3 library; the typical error messages
are given below. Apparently, the type ompi_communicator_t has changed.

Symbol `ompi_mpi_comm_null' has different size in shared object,
consider re-linking
Symbol `ompi_mpi_comm_world' has different size in shared object,
consider re-linking

Do I have to rebuild all the applications with Open MPI 1.3?

Is there a better way to do a smooth upgrade?

Thank you.

= Serge




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Latest SVN failures

2009-03-10 Thread Ralph Castain
Not really. I've run much bigger jobs than this without problem, so I  
don't think there is a fundamental issue here.


It looks like the TCP fabric between the various nodes is breaking  
down. I note in the enclosed messages that the problems are all with  
comm between daemons 4 and 21. We keep trying to get through, but  
failing.


I can fix things so we don't endlessly loop when that happens (IIRC, I  
think we are already supposed to abort, but it appears that isn't  
working). But the real question is why the comm fails in the first  
place.



On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:


Latest status - 1.4a1r20757 (yesterday);
the job now starts with a little output but quickly runs into trouble with
a lot of
'oob-tcp: Communication retries exceeded.  Can not communicate with peer'

errors?

e.g.
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer


The nodes are O.K. ...

Any ideas folks?

DM

On Sat, 28 Feb 2009, Ralph Castain wrote:

I think I have this figured out - will fix on Monday. I'm not sure  
why Jeff's conditions are all required, especially the second one.  
However, the fundamental problem is that we pull information from  
two sources regarding the number of procs in the job when unpacking  
a buffer, and the two sources appear to be out-of-sync with each  
other in certain scenarios.


The details are beyond the user list. I'll respond here again once  
I get it fixed.


Ralph

On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:

Unfortunately, I think I have reproduced the problem as well --  
with SVN trunk HEAD (r20655):
[15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2  
uptime
[svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data  
unpack failed in file base/odls_base_default_fns.c at line 566

--
mpirun noticed that the job aborted, but has no info as to the  
process

that caused that situation.
--
Notice that I'm not trying to run an MPI app -- it's just "uptime".
The following things seem to be necessary to make this error occur  
for me:

1. --bynode
2. set some mca parameter (any mca parameter)
3. -np value less than the size of my slurm allocation
If I remove any of those, it seems to run fine.
On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:
With further investigation, I have reproduced this problem.  I  
think I was originally testing against a version that was not  
recent enough.  I do not see it with r20594 which is from  
February 19.  So, something must have happened over the last 8  
days.  I will try and narrow down the issue.

Rolf
On 02/27/09 09:34, Rolf Vandevaart wrote:
I just tried trunk-1.4a1r20458 and I did not see this error,  
although my configuration was rather different.  I ran across  
100 2-CPU sparc nodes, np=256, connected with TCP.

Hopefully George's comment helps out with this issue.
One other thought to see whether SGE has anything to do with  
this is create a hostfile and run it outside of SGE.

Rolf
On 02/26/09 22:10, Ralph Castain wrote:
FWIW: I tested the trunk tonight using both SLURM and rsh  
launchers, and everything checks out fine. However, this is  
running under SGE and thus using qrsh, so it is possible the  
SGE support is having a problem.

Perhaps one of the Sun OMPI developers can help here?
Ralph
On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
It looks like the system doesn't know what nodes the procs are  
to be placed upon. Can you run this with --display-devel-map?  
That will tell us where the system thinks it is placing things.

Thanks
Ralph
On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:

Maybe it's my pine mailer.
This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
Shanghai nodes running a standard benchmark called stmv.
The basic error message, which occurs 31 times, is like:
[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
The mpirun command has long paths in it, sorry. It's invoking a special binding
script which in turn launches the NAMD run. This works on an older SVN at
level 1.4a1r20123 (for 16, 32, 64, 128 and 512 procs) but not for this 256 proc run, where
the older SVN hangs indefinitely polling some completion (sm or openib). So, I was trying
later SVNs with this 256 proc run, hoping the error would go away.

Here's some of the invocation again. Hope you can read it:
EAGER_SIZE=32767
export 

Re: [OMPI users] valgrind complaint in openmpi 1.3 (mca_mpool_sm_alloc)

2009-03-10 Thread Åke Sandgren
On Tue, 2009-03-10 at 09:23 -0800, Eugene Loh wrote:
> Åke Sandgren wrote:
> 
> >Hi!
> >
> >Valgrind seems to think that there is a use of an uninitialized value in
> >mca_mpool_sm_alloc, i.e. the if(mpool_sm->mem_node >= 0) {
> >Backtracking that, I found that mem_node is not set during initialization
> >in mca_mpool_sm_init.
> >The resources parameter is never used and the mpool_module->mem_node is
> >never initialized.
> >
> >Bug or not?
> >  
> >
> Apparently George fixed this in the trunk in r19257
> https://svn.open-mpi.org/source/history/ompi-trunk/ompi/mca/mpool/sm/mpool_sm_module.c
>  
> .  So, the resources parameter is never used, but you call 
> mca_mpool_sm_module_init(), which has the decency to set mem_node to 
> -1.  Not a helpful value, but a legal one.

So why not set it in the calling function, which has access to the
precomputed resources value?

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] valgrind complaint in openmpi 1.3 (mca_mpool_sm_alloc)

2009-03-10 Thread Eugene Loh

Åke Sandgren wrote:


Hi!

Valgrind seems to think that there is a use of an uninitialized value in
mca_mpool_sm_alloc, i.e. the if(mpool_sm->mem_node >= 0) {
Backtracking that, I found that mem_node is not set during initialization
in mca_mpool_sm_init.
The resources parameter is never used and the mpool_module->mem_node is
never initialized.

Bug or not?
 


Apparently George fixed this in the trunk in r19257
https://svn.open-mpi.org/source/history/ompi-trunk/ompi/mca/mpool/sm/mpool_sm_module.c 
.  So, the resources parameter is never used, but you call 
mca_mpool_sm_module_init(), which has the decency to set mem_node to 
-1.  Not a helpful value, but a legal one.
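
(For readers unfamiliar with this class of valgrind warning, here is a generic, self-contained illustration -- not the actual Open MPI source, just the same shape of bug and fix: an allocator reads a struct field that the init path might never set, and the fix is to initialize it to the -1 "no preference" sentinel at module init.)

    /* Generic illustration of the pattern discussed above; names are
     * invented for the example and do not mirror the Open MPI code. */
    #include <stdio.h>
    #include <stdlib.h>

    struct mpool_module {
        int mem_node;              /* NUMA node hint; -1 means "no preference" */
    };

    static struct mpool_module *module_init(void)
    {
        struct mpool_module *m = malloc(sizeof(*m));
        m->mem_node = -1;          /* the fix: give the field a defined sentinel */
        return m;
    }

    static void alloc_from(struct mpool_module *m)
    {
        /* valgrind would flag this read if the init path forgot the
         * assignment above, which is what the report describes */
        if (m->mem_node >= 0) {
            printf("allocate on NUMA node %d\n", m->mem_node);
        } else {
            printf("allocate with no NUMA preference\n");
        }
    }

    int main(void)
    {
        struct mpool_module *m = module_init();
        alloc_from(m);
        free(m);
        return 0;
    }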






Re: [OMPI users] Latest SVN failures

2009-03-10 Thread Mostyn Lewis

Latest status - 1.4a1r20757 (yesterday);
the job now starts with a little output but quickly runs into trouble with
a lot of
'oob-tcp: Communication retries exceeded.  Can not communicate with peer '
errors?

e.g.
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer 
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer 
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer 
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer


The nodes are O.K. ...

Any ideas folks?

DM

On Sat, 28 Feb 2009, Ralph Castain wrote:

I think I have this figured out - will fix on Monday. I'm not sure why Jeff's 
conditions are all required, especially the second one. However, the 
fundamental problem is that we pull information from two sources regarding 
the number of procs in the job when unpacking a buffer, and the two sources 
appear to be out-of-sync with each other in certain scenarios.


The details are beyond the user list. I'll respond here again once I get it 
fixed.


Ralph

On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:

Unfortunately, I think I have reproduced the problem as well -- with SVN 
trunk HEAD (r20655):


[15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
[svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack failed 
in file base/odls_base_default_fns.c at line 566

--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

Notice that I'm not trying to run an MPI app -- it's just "uptime".

The following things seem to be necessary to make this error occur for me:

1. --bynode
2. set some mca parameter (any mca parameter)
3. -np value less than the size of my slurm allocation

If I remove any of those, it seems to run fine.


On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:

With further investigation, I have reproduced this problem.  I think I was 
originally testing against a version that was not recent enough.  I do not 
see it with r20594 which is from February 19.  So, something must have 
happened over the last 8 days.  I will try and narrow down the issue.


Rolf

On 02/27/09 09:34, Rolf Vandevaart wrote:
I just tried trunk-1.4a1r20458 and I did not see this error, although my 
configuration was rather different.  I ran across 100 2-CPU sparc nodes, 
np=256, connected with TCP.

Hopefully George's comment helps out with this issue.
One other thought to see whether SGE has anything to do with this is 
create a hostfile and run it outside of SGE.

Rolf
On 02/26/09 22:10, Ralph Castain wrote:
FWIW: I tested the trunk tonight using both SLURM and rsh launchers, and 
everything checks out fine. However, this is running under SGE and thus 
using qrsh, so it is possible the SGE support is having a problem.


Perhaps one of the Sun OMPI developers can help here?

Ralph

On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:

It looks like the system doesn't know what nodes the procs are to be 
placed upon. Can you run this with --display-devel-map? That will tell 
us where the system thinks it is placing things.


Thanks
Ralph

On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:


Maybe it's my pine mailer.

This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
Shanghai nodes running a standard benchmark called stmv.

The basic error message, which occurs 31 times, is like:

[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file 
../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595


The mpirun command has long paths in it, sorry. It's invoking a 
special binding
script which in turn launches the NAMD run. This works on an older SVN
at
level 1.4a1r20123 (for 16, 32, 64, 128 and 512 procs) but not for this 256 
proc run where
the older SVN hangs indefinitely polling some completion (sm or 
openib). So, I was trying

later SVNs with this 256 proc run, hoping the error would go away.

Here's some of the invocation again. Hope you can read it:

EAGER_SIZE=32767
export OMPI_MCA_btl_openib_use_eager_rdma=0
export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE

and, unexpanded

mpirun --prefix $PREFIX -np %PE% $MCA -x 
OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit 
-x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit 
-machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd


and, expanded

mpirun --prefix 
/tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron 
-np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma 
-x 

[OMPI users] Upgrade from Open MPI 1.2 to 1.3

2009-03-10 Thread Serge

Hello,

We have a number of applications built with Open MPI 1.2 in a shared 
multi-user environment. The Open MPI library upgrade has always been
transparent and painless within the v1.2 branch. Now we would like to
switch to Open MPI 1.3 just as seamlessly. However, an application built with
ompi v1.2 will not run with the 1.3 library; the typical error messages 
are given below. Apparently, the type ompi_communicator_t has changed.


Symbol `ompi_mpi_comm_null' has different size in shared object, 
consider re-linking
Symbol `ompi_mpi_comm_world' has different size in shared object, 
consider re-linking


Do I have to rebuild all the applications with Open MPI 1.3?

Is there a better way to do a smooth upgrade?

Thank you.

= Serge



Re: [OMPI users] Problem with MPI_Comm_spawn_multiple & MPI_Info_fre

2009-03-10 Thread Lenny Verkhovsky
Can you try Open MPI 1.3?

Lenny.

On 3/10/09, Tee Wen Kai  wrote:
>
> Hi,
>
> I am using version 1.2.8.
>
> Thank you.
>
> Regards,
> Wenkai
>
> --- On *Mon, 9/3/09, Ralph Castain * wrote:
>
>
> From: Ralph Castain 
> Subject: Re: [OMPI users] Problem with MPI_Comm_spawn_multiple &
> MPI_Info_free
> To: "Open MPI Users" 
> Date: Monday, 9 March, 2009, 7:42 PM
>
> Could you tell us what version of Open MPI you are using? It would help us
> to provide you with advice.
> Thanks
> Ralph
>
>  On Mar 9, 2009, at 2:18 AM, Tee Wen Kai wrote:
>
>  Hi,
>
> I have a program that allows the user to enter their choice of operation. For
> example, when the user enters '4', the program will enter a function which
> will spawn some other programs stored in the same directory. When the user
> enters '5', the program will enter another function to request all spawned
> processes to exit. Therefore, initially only one process, which acts as the
> controller, is spawned.
>
> My MPI_Info_create and MPI_Comm_spawn_multiple are called in a function.
> Everything is working fine except that when I try to free the MPI_Info in the
> function by calling MPI_Info_free, I get a segmentation fault. The
> second problem is that when I do the spawning once and exit the controller
> program with MPI_Finalize, the program exits normally. But when spawning is
> done more than once and the controller program exits with MPI_Finalize, there
> is a segmentation fault. I also notice that when the spawned processes exit,
> the 'orted' process is still running. Thus, when multiple
> MPI_Comm_spawn_multiple calls are made, there will be multiple 'orted'
> processes.
>
> Thank you and hope someone has a solution to my problem.
>
> Regards,
> Wenkai
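
(For context, a minimal self-contained sketch of the spawn pattern described above. The child program name, process count, and info key/value are hypothetical placeholders, not taken from the poster's code; the point is only to show where MPI_Info_create, MPI_Comm_spawn_multiple, and MPI_Info_free sit relative to each other.)

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm intercomm;
        MPI_Info info;
        char *cmds[1]   = { "worker" };     /* hypothetical child executable */
        int   nprocs[1] = { 2 };            /* hypothetical process count    */
        MPI_Info infos[1];
        int errcodes[2];

        MPI_Init(&argc, &argv);

        /* build the info object passed to the spawn call */
        MPI_Info_create(&info);
        MPI_Info_set(info, "wdir", "/tmp"); /* example key/value only */
        infos[0] = info;

        /* spawn one command, two copies, rooted at this process */
        MPI_Comm_spawn_multiple(1, cmds, MPI_ARGVS_NULL, nprocs, infos,
                                0, MPI_COMM_SELF, &intercomm, errcodes);

        /* the free that was reported to segfault under 1.2.8 */
        MPI_Info_free(&info);

        MPI_Finalize();
        return 0;
    }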