Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-30 Thread Michael Kluskens
I have tested for the MPI_ABORT problem I was seeing and it appears  
to be fixed in the trunk.


Michael

On Oct 28, 2006, at 8:45 AM, Jeff Squyres wrote:


Sorry for the delay on this -- is this still the case with the OMPI
trunk?

We think we finally have all the issues solved with MPI_ABORT on the
trunk.



On Oct 16, 2006, at 8:29 AM, Åke Sandgren wrote:


On Mon, 2006-10-16 at 10:13 +0200, Åke Sandgren wrote:

On Fri, 2006-10-06 at 00:04 -0400, Jeff Squyres wrote:

On 10/5/06 2:42 PM, "Michael Kluskens"  wrote:


System: BLACS 1.1p3 on Debian Linux 3.1r3 on dual-opteron, gcc
3.3.5,
Intel ifort 9.0.32 all tests with 4 processors (comments below)

OpenMPi 1.1.1 patched and OpenMPI 1.1.2 patched:
   C & F tests: no errors with default data set.  F test slowed
down
in the middle of the tests.


Good.  Can you expand on what you mean by "slowed down"?


Let's add some more data to this...
BLACS 1.1p3
Ubuntu Dapper 6.06
dual Opteron
gcc 4.0
gfortran 4.0 (for both f77 and f90)
standard tests with 4 tasks on one node (i.e. 2 tasks per CPU)

OpenMPI 1.1.2rc3
The tests come to a complete standstill at the integer bsbr tests.
They consume CPU all the time, but nothing happens.


Actually, if I'm not too impatient, it will progress, but VERY slowly.
A complete run of the BLACS tests takes over 30 minutes of CPU time...

From the bsbr tests onwards, everything takes "forever".







Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-28 Thread Ake Sandgren
On Sat, 2006-10-28 at 08:45 -0400, Jeff Squyres wrote:
> Sorry for the delay on this -- is this still the case with the OMPI  
> trunk?
> 
> We think we finally have all the issues solved with MPI_ABORT on the  
> trunk.
> 

Nah, it was a problem with overutilization, i.e. 4 tasks on 2 CPUs in one
node.
Turning on yield_when_idle made the problem go away.
Why it didn't figure this out by itself is another question.

Summary: 1.1.2 works correctly, no problems.
(Except when building with PGI; I have patches...)
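For anyone hitting the same oversubscription slowdown: yield-when-idle can be turned on without rebuilding, via an MCA parameter. A minimal sketch (the tester binary name below is just a placeholder, not from this thread):

```shell
# Enable Open MPI's yield-when-idle so busy-polling processes give up the
# CPU when oversubscribed (more MPI tasks than physical CPUs).
export OMPI_MCA_mpi_yield_when_idle=1
echo "OMPI_MCA_mpi_yield_when_idle=$OMPI_MCA_mpi_yield_when_idle"
# Equivalent one-shot form on the command line (shown, not executed here;
# "xFbtest_MPI-LINUX-0" is a hypothetical name for the BLACS F77 tester):
#   mpirun --mca mpi_yield_when_idle 1 -np 4 ./xFbtest_MPI-LINUX-0
```

The environment-variable form and the `--mca` flag are interchangeable; the variable is convenient inside batch scripts.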

> >> Lets add some more data to this...
> >> BLACS 1.1p3
> >> Ubuntu Dapper 6.06
> >> dual opteron
> >> gcc 4.0
> >> gfortran 4.0 (for both f77 and f90)
> >> standard tests with 4 tasks on one node (i.e. 2 tasks per cpu)
> >>
> >> OpenMPI 1.1.2rc3
> >> The tests comes to a complete standstill at the integer bsbr tests
> >> It consumes cpu all the time but nothing happens.
> >
> > Actually if i'm not too inpatient i will progress but VERY slowly.
> > A complete run of the blacstest takes +30min cpu-time...
> >> From the bsbr tests and onwards everything takes "forever".

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se




Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-16 Thread Åke Sandgren
On Mon, 2006-10-16 at 10:13 +0200, Åke Sandgren wrote:
> On Fri, 2006-10-06 at 00:04 -0400, Jeff Squyres wrote:
> > On 10/5/06 2:42 PM, "Michael Kluskens"  wrote:
> > 
> > > System: BLACS 1.1p3 on Debian Linux 3.1r3 on dual-opteron, gcc 3.3.5,
> > > Intel ifort 9.0.32 all tests with 4 processors (comments below)
> > > 
> > > OpenMPi 1.1.1 patched and OpenMPI 1.1.2 patched:
> > >C & F tests: no errors with default data set.  F test slowed down
> > > in the middle of the tests.
> > 
> > Good.  Can you expand on what you mean by "slowed down"?
> 
> Let's add some more data to this...
> BLACS 1.1p3
> Ubuntu Dapper 6.06
> dual Opteron
> gcc 4.0
> gfortran 4.0 (for both f77 and f90)
> standard tests with 4 tasks on one node (i.e. 2 tasks per CPU)
> 
> OpenMPI 1.1.2rc3
> The tests come to a complete standstill at the integer bsbr tests.
> They consume CPU all the time, but nothing happens.

Actually, if I'm not too impatient, it will progress, but VERY slowly.
A complete run of the BLACS tests takes over 30 minutes of CPU time...
From the bsbr tests onwards, everything takes "forever".



Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-07 Thread Ralph Castain



On 10/5/06 10:04 PM, "Jeff Squyres"  wrote:

> On 10/5/06 2:42 PM, "Michael Kluskens"  wrote:

> 
>> The final auxiliary test is for BLACS_ABORT.
>> Immediately after this message, all processes should be killed.
>> If processes survive the call, your BLACS_ABORT is incorrect.
>> {0,2}, pnum=2, Contxt=0, killed other procs, exiting with error #-1.
>> 
>> [cluster:32133] [0,0,0] ORTE_ERROR_LOG: Communication failure in file
>> base/errmgr_base_receive.c at line 143
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0x10030
>> [0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
>> (opal_backtrace_print+0x1f) [0x2a957e4c1f]
>> *** End of error message ***
>> Segmentation fault (core dumped)
>> 
> 

I believe this is now fixed on the trunk. Please take another crack at it
and let me know.

Thanks
Ralph




Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-06 Thread Jeff Squyres
On 10/5/06 2:42 PM, "Michael Kluskens"  wrote:

> System: BLACS 1.1p3 on Debian Linux 3.1r3 on dual-opteron, gcc 3.3.5,
> Intel ifort 9.0.32 all tests with 4 processors (comments below)
> 
> OpenMPi 1.1.1 patched and OpenMPI 1.1.2 patched:
>C & F tests: no errors with default data set.  F test slowed down
> in the middle of the tests.

Good.  Can you expand on what you mean by "slowed down"?

> OpenMPI 1.3a1r11962 patched: much better, completes all tests with
> default data set but the tester crashes on exit (different problem?)
> 

Quite possibly so.  1.3 is the active development trunk and is not always
stable; we're working on some ORTE issues right now, so it's possible that
mpirun may not be rock solid at the moment.  :-)

> The final auxiliary test is for BLACS_ABORT.
> Immediately after this message, all processes should be killed.
> If processes survive the call, your BLACS_ABORT is incorrect.
> {0,2}, pnum=2, Contxt=0, killed other procs, exiting with error #-1.
> 
> [cluster:32133] [0,0,0] ORTE_ERROR_LOG: Communication failure in file
> base/errmgr_base_receive.c at line 143
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x10030
> [0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
> (opal_backtrace_print+0x1f) [0x2a957e4c1f]
> *** End of error message ***
> Segmentation fault (core dumped)
> 

Ya; don't worry about this on the trunk at the moment.  :-)

> Results of testing the patch on my system:
> 1) Not certain which branches this patch can be applied to so I may
> have tried to do too much.
> 2) I don't have 11970 on my system so I tried to apply the patch to
> 1.1.1, 1.1.2rc1, 1.3a1r11962

Good.  I literally just posted 1.1.2rc3 with this DDT fix (among others).
It looks like we're getting darn close to releasing 1.1.2.

>   (no nightly tarball for 1.3a1r11970 this morning)

We had a failure in the trunk tarball creation last night.

>   (side note where is 1.2?, only via cvs?)

We haven't opened up nightly tarballs for v1.2 because we're not quite
happy with the level of stability there yet.  That is, we expect the 1.1
series nightly tarballs to be more-or-less stable, and we've never provided
guarantees about trunk stability ;-).  We'll probably open up the 1.2
nightly tarballs in the not-distant future.

> 3) patch complained about all three I tried to apply it to but seemed
> to apply the patch most of the time, I hand-checked all three patched
> routines in the three branches I tried and hand fixed anything that
> got missed because of differences in line numbers.
> 4) The patch applied best against 1.3a1r11962 and second best against
> 1.1.1 -- my lack of experience with patch likely confused the issue.

No worries - we definitely appreciate all your testing!
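As an aside, `patch` can rehearse an application before touching anything, which helps when a patch was cut against a different branch. A minimal sketch, with made-up file names:

```shell
# Build a throwaway "tree", generate a patch, rehearse it, then apply it.
set -e
workdir=$(mktemp -d)
cd "$workdir"
printf 'line one\nline two\n' > foo.c
printf 'line one\nline TWO\n' > foo.c.new
diff -u foo.c foo.c.new > fix.patch || true   # diff exits 1 when files differ
patch --dry-run foo.c < fix.patch             # rehearse: nothing is modified
patch foo.c < fix.patch                       # apply; rejects go to foo.c.rej
grep -q 'TWO' foo.c && echo "patch applied cleanly"
```

Hunks that fail during the real run land in `.rej` files, which can then be hand-applied, much as described above.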

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-05 Thread George Bosilca

Thanks Michael.

The seg-fault is related to some orterun problem. I noticed it
yesterday, and we are trying to find a fix. For the rest, I'm quite happy
that the BLACS problem was solved.


  Thanks for your help,
george.

On Oct 5, 2006, at 2:42 PM, Michael Kluskens wrote:



On Oct 4, 2006, at 7:51 PM, George Bosilca wrote:

This is the correct patch (same as previous minus the debugging
statements).

On Oct 4, 2006, at 7:42 PM, George Bosilca wrote:

The problem was found and fixed. Until the patch get applied to the
1.1 and 1.2 branches please use the attached patch.


System: BLACS 1.1p3 on Debian Linux 3.1r3 on dual-opteron, gcc 3.3.5,
Intel ifort 9.0.32 all tests with 4 processors (comments below)

OpenMPi 1.1.1 patched and OpenMPI 1.1.2 patched:
   C & F tests: no errors with default data set.  F test slowed down
in the middle of the tests.

OpenMPI 1.3a1r11962 patched: much better, completes all tests with
default data set but the tester crashes on exit (different problem?)

The final auxiliary test is for BLACS_ABORT.
Immediately after this message, all processes should be killed.
If processes survive the call, your BLACS_ABORT is incorrect.
{0,2}, pnum=2, Contxt=0, killed other procs, exiting with error #-1.

[cluster:32133] [0,0,0] ORTE_ERROR_LOG: Communication failure in file
base/errmgr_base_receive.c at line 143
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x10030
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
(opal_backtrace_print+0x1f) [0x2a957e4c1f]
*** End of error message ***
Segmentation fault (core dumped)


Results of testing the patch on my system:
1) Not certain which branches this patch can be applied to so I may
have tried to do too much.
2) I don't have 11970 on my system so I tried to apply the patch to
1.1.1, 1.1.2rc1, 1.3a1r11962
  (no nightly tarball for 1.3a1r11970 this morning)
  (side note where is 1.2?, only via cvs?)
3) patch complained about all three I tried to apply it to but seemed
to apply the patch most of the time, I hand-checked all three patched
routines in the three branches I tried and hand fixed anything that
got missed because of differences in line numbers.
4) The patch applied best against 1.3a1r11962 and second best against
1.1.1 -- my lack of experience with patch likely confused the issue.

Michael


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-05 Thread Michael Kluskens


On Oct 4, 2006, at 7:51 PM, George Bosilca wrote:
This is the correct patch (same as previous minus the debugging  
statements).

On Oct 4, 2006, at 7:42 PM, George Bosilca wrote:
The problem was found and fixed. Until the patch get applied to the  
1.1 and 1.2 branches please use the attached patch.


System: BLACS 1.1p3 on Debian Linux 3.1r3 on dual-opteron, gcc 3.3.5,  
Intel ifort 9.0.32 all tests with 4 processors (comments below)


OpenMPi 1.1.1 patched and OpenMPI 1.1.2 patched:
  C & F tests: no errors with default data set.  F test slowed down  
in the middle of the tests.


OpenMPI 1.3a1r11962 patched: much better, completes all tests with  
default data set but the tester crashes on exit (different problem?)


The final auxiliary test is for BLACS_ABORT.
Immediately after this message, all processes should be killed.
If processes survive the call, your BLACS_ABORT is incorrect.
{0,2}, pnum=2, Contxt=0, killed other procs, exiting with error #-1.

[cluster:32133] [0,0,0] ORTE_ERROR_LOG: Communication failure in file  
base/errmgr_base_receive.c at line 143

Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x10030
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0 
(opal_backtrace_print+0x1f) [0x2a957e4c1f]

*** End of error message ***
Segmentation fault (core dumped)


Results of testing the patch on my system:
1) Not certain which branches this patch can be applied to so I may  
have tried to do too much.
2) I don't have 11970 on my system so I tried to apply the patch to  
1.1.1, 1.1.2rc1, 1.3a1r11962

 (no nightly tarball for 1.3a1r11970 this morning)
 (side note where is 1.2?, only via cvs?)
3) patch complained about all three I tried to apply it to but seemed  
to apply the patch most of the time, I hand-checked all three patched  
routines in the three branches I tried and hand fixed anything that  
got missed because of differences in line numbers.
4) The patch applied best against 1.3a1r11962 and second best against  
1.1.1 -- my lack of experience with patch likely confused the issue.


Michael




Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-04 Thread George Bosilca
The problem was found and fixed. Until the patch gets applied to the
1.1 and 1.2 branches, please use the attached patch.


  Thanks for your help in discovering and fixing this bug,
george.



ddt.patch
Description: Binary data


On Oct 4, 2006, at 5:32 PM, George Bosilca wrote:


That's just amazing. We pass all the trapezoidal tests but we fail
the general ones (rectangular matrix) if the leading dimension of the
matrix on the destination processor is greater than the leading
dimension on the sender. At least now I narrow down the place where
the error occur ...

   george.

On Oct 4, 2006, at 4:41 PM, George Bosilca wrote:


OK, that was my 5 minutes hall of shame. Setting the verbosity level
in bt.dat to 6 give me enough information to know exactly the data-
type share. Now, I know how to fix things ...

   george.

On Oct 4, 2006, at 4:35 PM, George Bosilca wrote:


I'm working on this bug. As far as I see the patch from the bug 365
do not help us here. However, on my 64 bits machines (not Opteron  
but
G5) I don't get the segfault. Anyway, I get the bad data  
transmission

for test #1 and #51. So far my main problem is that I cannot
reproduce these errors with any other data-type tests [and  
believe me

we have a bunch of them]. The only one who fails is the BLACS. I
wonder what the data-type looks like for the failing tests. Someone
here knows how to extract the BLACS data-type (for test #1 and  
#51) ?

Or how to force BLACS to print the data-type information for each
test  (M, N and so on) ?

   Thanks,
 george.

On Oct 4, 2006, at 4:13 PM, Michael Kluskens wrote:


On Oct 4, 2006, at 8:22 AM, Harald Forbert wrote:


The TRANSCOMM setting that we are using here and that I think is
the
correct one is "-DUseMpi2" since OpenMPI implements the
corresponding
mpi2 calls. You need a recent version of BLACS for this setting
to be available (1.1 with patch 3 should be fine). Together with
the
patch to openmpi1.1.1 from ticket 356 we are passing the blacs
tester
for 4 processors. I didn't have to time to test with other numbers
though.


Unfortunately this did not solve the problems I'm seeing, could be
that my system is 64 bits (another person seeing problems on an
Opteron system).

New tests of BLACS 1.1p3 vs. OpenMPI (1.1.1, 1.1.2rc1, 1.3a1r11962)
with Intel ifort 9.0.32 and g95 (Sep 27 2006).

System: Debian Linux 3.1r3 on dual-opteron, gcc 3.3.5, all tests
with
4 processors

1) patched OpenMPI 1.1.1 and 1.1.2rc1 using the two lines from
Ticket
356.
2) set TRANSCOMM = -DUseMpi2

Intel ifort 9.0.32 tests (INTFACE=-DAdd):

OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
   In the xCbtest both generated errors until Integer Sum tests  
then

no more errors)

OpenMPI 1.3a1r11962: no errors until crash:

COMPLEX AMX TESTS: BEGIN.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xe62000
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
(opal_backtrace_print+0x1f) [0x2a95aa8c1f]
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xbc
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
(opal_backtrace_print+0x1f) [0x2a95aa8c1f]
*** End of error message ***

g95 (Sep 27 2006) tests (INTFACE=-Df77IsF2C):

OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
   In the xCbtest both generated errors until Integer Sum tests  
then

no more errors)

OpenMPI 1.3a1r11962:  no errors until crash:

COMPLEX SUM TESTS: BEGIN.
COMPLEX SUM TESTS: 1152 TESTS;  864 PASSED,  288 SKIPPED,0
FAILED.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xb6f000
[0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
+0x1f) [0x2a95aa7c1f]
*** End of error message ***

COMPLEX AMX TESTS: BEGIN.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xe27000
[0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
+0x1f) [0x2a95aa7c1f]
*** End of error message ***
3 additional processes aborted (not shown)


Michael





Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-04 Thread George Bosilca
That's just amazing. We pass all the trapezoidal tests, but we fail  
the general ones (rectangular matrix) if the leading dimension of the  
matrix on the destination processor is greater than the leading  
dimension on the sender. At least now I've narrowed down the place where  
the error occurs ...


  george.

On Oct 4, 2006, at 4:41 PM, George Bosilca wrote:


OK, that was my 5 minutes hall of shame. Setting the verbosity level
in bt.dat to 6 give me enough information to know exactly the data-
type share. Now, I know how to fix things ...

   george.

On Oct 4, 2006, at 4:35 PM, George Bosilca wrote:


I'm working on this bug. As far as I see the patch from the bug 365
do not help us here. However, on my 64 bits machines (not Opteron but
G5) I don't get the segfault. Anyway, I get the bad data transmission
for test #1 and #51. So far my main problem is that I cannot
reproduce these errors with any other data-type tests [and believe me
we have a bunch of them]. The only one who fails is the BLACS. I
wonder what the data-type looks like for the failing tests. Someone
here knows how to extract the BLACS data-type (for test #1 and #51) ?
Or how to force BLACS to print the data-type information for each
test  (M, N and so on) ?

   Thanks,
 george.

On Oct 4, 2006, at 4:13 PM, Michael Kluskens wrote:


On Oct 4, 2006, at 8:22 AM, Harald Forbert wrote:

The TRANSCOMM setting that we are using here and that I think is  
the

correct one is "-DUseMpi2" since OpenMPI implements the
corresponding
mpi2 calls. You need a recent version of BLACS for this setting
to be available (1.1 with patch 3 should be fine). Together with  
the

patch to openmpi1.1.1 from ticket 356 we are passing the blacs
tester
for 4 processors. I didn't have to time to test with other numbers
though.


Unfortunately this did not solve the problems I'm seeing, could be
that my system is 64 bits (another person seeing problems on an
Opteron system).

New tests of BLACS 1.1p3 vs. OpenMPI (1.1.1, 1.1.2rc1, 1.3a1r11962)
with Intel ifort 9.0.32 and g95 (Sep 27 2006).

System: Debian Linux 3.1r3 on dual-opteron, gcc 3.3.5, all tests  
with

4 processors

1) patched OpenMPI 1.1.1 and 1.1.2rc1 using the two lines from  
Ticket

356.
2) set TRANSCOMM = -DUseMpi2

Intel ifort 9.0.32 tests (INTFACE=-DAdd):

OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
   In the xCbtest both generated errors until Integer Sum tests then
no more errors)

OpenMPI 1.3a1r11962: no errors until crash:

COMPLEX AMX TESTS: BEGIN.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xe62000
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
(opal_backtrace_print+0x1f) [0x2a95aa8c1f]
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xbc
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
(opal_backtrace_print+0x1f) [0x2a95aa8c1f]
*** End of error message ***

g95 (Sep 27 2006) tests (INTFACE=-Df77IsF2C):

OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
   In the xCbtest both generated errors until Integer Sum tests then
no more errors)

OpenMPI 1.3a1r11962:  no errors until crash:

COMPLEX SUM TESTS: BEGIN.
COMPLEX SUM TESTS: 1152 TESTS;  864 PASSED,  288 SKIPPED,0
FAILED.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xb6f000
[0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
+0x1f) [0x2a95aa7c1f]
*** End of error message ***

COMPLEX AMX TESTS: BEGIN.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xe27000
[0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
+0x1f) [0x2a95aa7c1f]
*** End of error message ***
3 additional processes aborted (not shown)


Michael





Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-04 Thread George Bosilca
OK, that was my 5 minutes of hall of shame. Setting the verbosity level  
in bt.dat to 6 gives me enough information to know exactly the data-type  
shape. Now I know how to fix things ...


  george.

On Oct 4, 2006, at 4:35 PM, George Bosilca wrote:


I'm working on this bug. As far as I see the patch from the bug 365
do not help us here. However, on my 64 bits machines (not Opteron but
G5) I don't get the segfault. Anyway, I get the bad data transmission
for test #1 and #51. So far my main problem is that I cannot
reproduce these errors with any other data-type tests [and believe me
we have a bunch of them]. The only one who fails is the BLACS. I
wonder what the data-type looks like for the failing tests. Someone
here knows how to extract the BLACS data-type (for test #1 and #51) ?
Or how to force BLACS to print the data-type information for each
test  (M, N and so on) ?

   Thanks,
 george.

On Oct 4, 2006, at 4:13 PM, Michael Kluskens wrote:


On Oct 4, 2006, at 8:22 AM, Harald Forbert wrote:


The TRANSCOMM setting that we are using here and that I think is the
correct one is "-DUseMpi2" since OpenMPI implements the  
corresponding

mpi2 calls. You need a recent version of BLACS for this setting
to be available (1.1 with patch 3 should be fine). Together with the
patch to openmpi1.1.1 from ticket 356 we are passing the blacs  
tester

for 4 processors. I didn't have to time to test with other numbers
though.


Unfortunately this did not solve the problems I'm seeing, could be
that my system is 64 bits (another person seeing problems on an
Opteron system).

New tests of BLACS 1.1p3 vs. OpenMPI (1.1.1, 1.1.2rc1, 1.3a1r11962)
with Intel ifort 9.0.32 and g95 (Sep 27 2006).

System: Debian Linux 3.1r3 on dual-opteron, gcc 3.3.5, all tests with
4 processors

1) patched OpenMPI 1.1.1 and 1.1.2rc1 using the two lines from Ticket
356.
2) set TRANSCOMM = -DUseMpi2

Intel ifort 9.0.32 tests (INTFACE=-DAdd):

OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
   In the xCbtest both generated errors until Integer Sum tests then
no more errors)

OpenMPI 1.3a1r11962: no errors until crash:

COMPLEX AMX TESTS: BEGIN.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xe62000
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
(opal_backtrace_print+0x1f) [0x2a95aa8c1f]
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xbc
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
(opal_backtrace_print+0x1f) [0x2a95aa8c1f]
*** End of error message ***

g95 (Sep 27 2006) tests (INTFACE=-Df77IsF2C):

OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
   In the xCbtest both generated errors until Integer Sum tests then
no more errors)

OpenMPI 1.3a1r11962:  no errors until crash:

COMPLEX SUM TESTS: BEGIN.
COMPLEX SUM TESTS: 1152 TESTS;  864 PASSED,  288 SKIPPED,0  
FAILED.

Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xb6f000
[0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
+0x1f) [0x2a95aa7c1f]
*** End of error message ***

COMPLEX AMX TESTS: BEGIN.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xe27000
[0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
+0x1f) [0x2a95aa7c1f]
*** End of error message ***
3 additional processes aborted (not shown)


Michael





Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-04 Thread George Bosilca
I'm working on this bug. As far as I can see, the patch from bug 365  
does not help us here. However, on my 64-bit machine (not an Opteron but  
a G5) I don't get the segfault. Anyway, I do get the bad data transmission  
for tests #1 and #51. So far my main problem is that I cannot  
reproduce these errors with any other data-type tests [and believe me,  
we have a bunch of them]. The only one that fails is the BLACS. I  
wonder what the data-type looks like for the failing tests. Does someone  
here know how to extract the BLACS data-type (for tests #1 and #51)?  
Or how to force BLACS to print the data-type information for each  
test (M, N and so on)?


  Thanks,
george.

On Oct 4, 2006, at 4:13 PM, Michael Kluskens wrote:


On Oct 4, 2006, at 8:22 AM, Harald Forbert wrote:


The TRANSCOMM setting that we are using here and that I think is the
correct one is "-DUseMpi2" since OpenMPI implements the corresponding
mpi2 calls. You need a recent version of BLACS for this setting
to be available (1.1 with patch 3 should be fine). Together with the
patch to openmpi1.1.1 from ticket 356 we are passing the blacs tester
for 4 processors. I didn't have to time to test with other numbers
though.


Unfortunately this did not solve the problems I'm seeing, could be
that my system is 64 bits (another person seeing problems on an
Opteron system).

New tests of BLACS 1.1p3 vs. OpenMPI (1.1.1, 1.1.2rc1, 1.3a1r11962)
with Intel ifort 9.0.32 and g95 (Sep 27 2006).

System: Debian Linux 3.1r3 on dual-opteron, gcc 3.3.5, all tests with
4 processors

1) patched OpenMPI 1.1.1 and 1.1.2rc1 using the two lines from Ticket
356.
2) set TRANSCOMM = -DUseMpi2

Intel ifort 9.0.32 tests (INTFACE=-DAdd):

OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
   In the xCbtest both generated errors until Integer Sum tests then
no more errors)

OpenMPI 1.3a1r11962: no errors until crash:

COMPLEX AMX TESTS: BEGIN.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xe62000
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
(opal_backtrace_print+0x1f) [0x2a95aa8c1f]
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xbc
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
(opal_backtrace_print+0x1f) [0x2a95aa8c1f]
*** End of error message ***

g95 (Sep 27 2006) tests (INTFACE=-Df77IsF2C):

OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
   In the xCbtest both generated errors until Integer Sum tests then
no more errors)

OpenMPI 1.3a1r11962:  no errors until crash:

COMPLEX SUM TESTS: BEGIN.
COMPLEX SUM TESTS: 1152 TESTS;  864 PASSED,  288 SKIPPED,0 FAILED.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xb6f000
[0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
+0x1f) [0x2a95aa7c1f]
*** End of error message ***

COMPLEX AMX TESTS: BEGIN.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xe27000
[0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
+0x1f) [0x2a95aa7c1f]
*** End of error message ***
3 additional processes aborted (not shown)


Michael





Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-04 Thread Michael Kluskens

On Oct 4, 2006, at 8:22 AM, Harald Forbert wrote:


The TRANSCOMM setting that we are using here and that I think is the
correct one is "-DUseMpi2" since OpenMPI implements the corresponding
mpi2 calls. You need a recent version of BLACS for this setting
to be available (1.1 with patch 3 should be fine). Together with the
patch to openmpi1.1.1 from ticket 356 we are passing the blacs tester
for 4 processors. I didn't have to time to test with other numbers
though.


Unfortunately this did not solve the problems I'm seeing, could be  
that my system is 64 bits (another person seeing problems on an  
Opteron system).


New tests of BLACS 1.1p3 vs. OpenMPI (1.1.1, 1.1.2rc1, 1.3a1r11962)  
with Intel ifort 9.0.32 and g95 (Sep 27 2006).


System: Debian Linux 3.1r3 on dual-opteron, gcc 3.3.5, all tests with  
4 processors


1) patched OpenMPI 1.1.1 and 1.1.2rc1 using the two lines from Ticket  
356.

2) set TRANSCOMM = -DUseMpi2

Intel ifort 9.0.32 tests (INTFACE=-DAdd):

OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
  In the xCbtest both generated errors until Integer Sum tests then  
no more errors)


OpenMPI 1.3a1r11962: no errors until crash:

COMPLEX AMX TESTS: BEGIN.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xe62000
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0 
(opal_backtrace_print+0x1f) [0x2a95aa8c1f]

*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xbc
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0 
(opal_backtrace_print+0x1f) [0x2a95aa8c1f]

*** End of error message ***

g95 (Sep 27 2006) tests (INTFACE=-Df77IsF2C):

OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
  In the xCbtest both generated errors until Integer Sum tests then  
no more errors)


OpenMPI 1.3a1r11962:  no errors until crash:

COMPLEX SUM TESTS: BEGIN.
COMPLEX SUM TESTS: 1152 TESTS;  864 PASSED,  288 SKIPPED,0 FAILED.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xb6f000
[0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print 
+0x1f) [0x2a95aa7c1f]

*** End of error message ***

COMPLEX AMX TESTS: BEGIN.
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xe27000
[0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print 
+0x1f) [0x2a95aa7c1f]

*** End of error message ***
3 additional processes aborted (not shown)


Michael



Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-04 Thread Harald Forbert

> Additional note on the the BLACS vs. OpenMPI 1.1.1 & 1.3 problems:
> 
> The BLACS install program xtc_CsameF77  says to not use -DCsameF77  
> with OpenMPI; however, because of an oversight I used it in my first  
> tests -- for OpenMPI 1.1.1 the errors are the same with and without  
> this setting; however, without it the tester program is very slow  
> with OpenMPI 1.1.1 or hangs at "RUNNING REPEATABLE SUM TEST" near the  
> end.   OpenMPI 1.1.2rc1 behaved nearly identically.
> 
> With regards to OpenMPI 1.3, not using -DCsameF77 (that is setting  
> TRANSCOMM blank), prevents the crash I observed earlier; however,  
> massive errors begin at the "DOUBLE COMPLEX AMX" tests and then the  
> auxiliary tests at the end are very slow or hangs at "RUNNING  
> REPEATABLE SUM TEST".

The TRANSCOMM setting that we are using here, and that I think is the
correct one, is "-DUseMpi2", since OpenMPI implements the corresponding
MPI-2 calls. You need a recent version of BLACS for this setting
to be available (1.1 with patch 3 should be fine). Together with the
patch to OpenMPI 1.1.1 from ticket 356, we are passing the BLACS tester
with 4 processors. I didn't have time to test with other process counts,
though.
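For reference, the corresponding line in the BLACS Bmake.inc would then look something like this (a sketch; with -DUseMpi2, BLACS can rely on MPI-2's C/Fortran handle-conversion routines such as MPI_Comm_f2c rather than guessing the relationship between the two handle types):

```make
#  Let the BLACS use the MPI-2 C <-> Fortran communicator conversion
#  calls, which Open MPI implements (needs BLACS 1.1 patch 3 or later):
   TRANSCOMM = -DUseMpi2
```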

Harald Forbert


Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-03 Thread Michael Kluskens

Additional note on the the BLACS vs. OpenMPI 1.1.1 & 1.3 problems:

The BLACS install program xtc_CsameF77 says not to use -DCsameF77
with OpenMPI; however, because of an oversight I used it in my first
tests. For OpenMPI 1.1.1 the errors are the same with and without
this setting; without it, though, the tester program is very slow
with OpenMPI 1.1.1 or hangs at "RUNNING REPEATABLE SUM TEST" near the
end. OpenMPI 1.1.2rc1 behaved nearly identically.


With regard to OpenMPI 1.3: not using -DCsameF77 (that is, setting
TRANSCOMM blank) prevents the crash I observed earlier; however,
massive errors begin at the "DOUBLE COMPLEX AMX" tests, and the
auxiliary tests at the end are very slow or hang at "RUNNING
REPEATABLE SUM TEST".


I don't know enough about the internals of OpenMPI to understand the
following discussion, or whether the install program xtc_CsameF77
works correctly with OpenMPI:


#  If you know that your MPI uses the same handles for fortran and C
#  communicators, you can replace the empty macro definition below with
#  the macro definition on the following line.
  TRANSCOMM = -DCSameF77


The complete details are below:

#  If you know something about your system, you may make it easier for the
#  BLACS to translate between C and fortran communicators.  If the empty
#  macro definition is left alone, this translation will cause the C
#  BLACS to globally block for MPI_COMM_WORLD on calls to BLACS_GRIDINIT
#  and BLACS_GRIDMAP.  If you choose one of the options for translating
#  the context, neither the C nor fortran calls will globally block.
#  If you are using MPICH, or a derivative system, you can replace the
#  empty macro definition below with the following (note that if you let
#  MPICH do the translation between C and fortran, you must also indicate
#  here if your system has pointers that are longer than integers.  If so,
#  define -DPOINTER_64_BITS=1.)  For help on setting TRANSCOMM, you can
#  run BLACS/INSTALL/xtc_CsameF77 and BLACS/INSTALL/xtc_UseMpich as
#  explained in BLACS/INSTALL/README.
#   TRANSCOMM = -DUseMpich
#
#  If you know that your MPI uses the same handles for fortran and C
#  communicators, you can replace the empty macro definition below with
#  the macro definition on the following line.
   TRANSCOMM = -DCSameF77
#  ---
#  TRANSCOMM =

Michael

P.S. I have successfully tested MPICH2 1.0.4p1 with BLACS 1.1p3 on the
same machine with the same compilers.




Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-03 Thread Jeff Squyres
Thanks Michael -- I've updated ticket 356 with this info for v1.1, and
created ticket 464 for the trunk (v1.3) issue.

https://svn.open-mpi.org/trac/ompi/ticket/356
https://svn.open-mpi.org/trac/ompi/ticket/464

On 10/3/06 10:53 AM, "Michael Kluskens"  wrote:

> Summary:
> 
> OpenMPI 1.1.1 and 1.3a1r11943 have different bugs with regards to
> BLACS 1.1p3.
> 
> 1.3 fails where 1.1.1 passes and vice versa.
> 
> (1.1.1): Integer, real, double precision SDRV tests fail cases 1 &
> 51, then lots of errors until Integer SUM test then all tests pass.
> 
> (1.3): No errors until it crashes on the Complex AMX test (which is
> after the Integer Sum test).
> 
> System configuration: Debian 3.1r3 on dual opteron, gcc 3.3.5, Intel
> ifort 9.1.032.
> 
> On Oct 3, 2006, at 2:44 AM, Åke Sandgren wrote:
> 
>> On Mon, 2006-10-02 at 18:39 -0400, Michael Kluskens wrote:
>>> OpenMPI, BLACS, and blacstester built just fine.  Tester reports
>>> errors for integer and real cases #1 and #51 and more for the other
>>> types..
>>> 
>>>  is an open ticket
>>> related to this.
>> 
>> Finally someone else with the same problem!!!
>> 
>> I tried the suggested fix from ticket 356 but it didn't help.
>> I still get lots of errors in the blacstest.
>> 
>> I'm running on a dual-cpu opteron with Ubuntu dapper and gcc-4.0.
>> The tests also failed on our i386 Ubuntu breezy system with gcc-3.4
> 
> More details of my two tests:
> 
> OpenMPI 1.1.1
> ./configure --prefix=/opt/intel9.1/openmpi/1.1.1 F77=ifort FC=ifort --with-mpi-f90-size=medium
> 
> BLACS 1.1 patch 3, Bmake.inc based on Bmake.MPI-LINUX with following
> changes:
> 
> BTOPdir = /opt/intel9.1/openmpi/1.1.1/BLACS
> BLACSDBGLVL = 1
> MPIdir = /opt/intel9.1/openmpi/1.1.1
> MPILIB =
> INTFACE = -DAdd_
> F77= $(MPIdir)/bin/mpif77
> CC = $(MPIdir)/bin/mpicc
> CCFLAGS= -O3
> 
> 
> OpenMPI 1.3a1r11943
> ./configure --prefix=/opt/intel9.1/openmpi/1.3 F77=ifort FC=ifort --with-mpi-f90-size=medium
> 
> similar changes for Bmake.inc in BLACS.
> 
> test launched in BLACS/TESTING/EXE using:
> 
> mpirun --prefix /opt/intel9.1/openmpi/1.3 -np 4 xCbtest_MPI-LINUX-1
> 
> No errors at first; it works much better, but eventually fails with:
> 
> COMPLEX AMX TESTS: BEGIN.
> Signal:11 info.si_errno:0(Success) si_code:128()
> Failing at addr:(nil)
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0xb8
> [0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
> (opal_backtrace_print+0x1f) [0x2a95aa5c1f]
> *** End of error message ***
> 
> Michael
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-03 Thread Michael Kluskens

Summary:

OpenMPI 1.1.1 and 1.3a1r11943 have different bugs with regard to
BLACS 1.1p3.

1.3 fails where 1.1.1 passes and vice versa.

(1.1.1): Integer, real, and double precision SDRV tests fail cases 1 &
51, followed by many errors up to the Integer SUM test, after which all
tests pass.


(1.3): No errors until it crashes on the Complex AMX test (which is  
after the Integer Sum test).


System configuration: Debian 3.1r3 on dual opteron, gcc 3.3.5, Intel  
ifort 9.1.032.


On Oct 3, 2006, at 2:44 AM, Åke Sandgren wrote:


On Mon, 2006-10-02 at 18:39 -0400, Michael Kluskens wrote:

OpenMPI, BLACS, and blacstester built just fine.  Tester reports
errors for integer and real cases #1 and #51 and more for the other
types..

 is an open ticket
related to this.


Finally someone else with the same problem!!!

I tried the suggested fix from ticket 356 but it didn't help.
I still get lots of errors in the blacstest.

I'm running on a dual-cpu opteron with Ubuntu dapper and gcc-4.0.
The tests also failed on our i386 Ubuntu breezy system with gcc-3.4


More details of my two tests:

OpenMPI 1.1.1
./configure --prefix=/opt/intel9.1/openmpi/1.1.1 F77=ifort FC=ifort --with-mpi-f90-size=medium

BLACS 1.1 patch 3, Bmake.inc based on Bmake.MPI-LINUX with the
following changes:

BTOPdir = /opt/intel9.1/openmpi/1.1.1/BLACS
BLACSDBGLVL = 1
MPIdir = /opt/intel9.1/openmpi/1.1.1
MPILIB =
INTFACE = -DAdd_
F77= $(MPIdir)/bin/mpif77
CC = $(MPIdir)/bin/mpicc
CCFLAGS= -O3


OpenMPI 1.3a1r11943
./configure --prefix=/opt/intel9.1/openmpi/1.3 F77=ifort FC=ifort --with-mpi-f90-size=medium


similar changes for Bmake.inc in BLACS.
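
Spelled out, the 1.3 variant of the Bmake.inc changes would presumably
mirror the 1.1.1 settings with only the install prefix swapped (a sketch
based on the settings listed above; BTOPdir in particular is an assumption
about where the BLACS tree was unpacked):

```make
BTOPdir = /opt/intel9.1/openmpi/1.3/BLACS   # assumed analogue of the 1.1.1 path
BLACSDBGLVL = 1
MPIdir  = /opt/intel9.1/openmpi/1.3
MPILIB  =
INTFACE = -DAdd_
F77     = $(MPIdir)/bin/mpif77
CC      = $(MPIdir)/bin/mpicc
CCFLAGS = -O3
```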

test launched in BLACS/TESTING/EXE using:

mpirun --prefix /opt/intel9.1/openmpi/1.3 -np 4 xCbtest_MPI-LINUX-1

No errors at first; it works much better, but eventually fails with:

COMPLEX AMX TESTS: BEGIN.
Signal:11 info.si_errno:0(Success) si_code:128()
Failing at addr:(nil)
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0xb8
[0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print+0x1f) [0x2a95aa5c1f]

*** End of error message ***

Michael