Re: [OMPI users] possible bug exercised by mpi4py

2012-05-25 Thread TERRY DONTJE
BTW, the changes prior to r26496 failed some of the MTT test runs on 
several systems.  So if the current implementation is deemed not 
"correct", I suspect we will need to figure out whether the tests 
themselves need to change.


See http://www.open-mpi.org/mtt/index.php?do_redir=2066 for some of the 
failures that I think are due to the r26495 reduce_scatter changes.


--td

On 5/25/2012 12:27 AM, George Bosilca wrote:

On May 24, 2012, at 23:48 , Dave Goodell wrote:


On May 24, 2012, at 10:34 PM CDT, George Bosilca wrote:


On May 24, 2012, at 23:18, Dave Goodell  wrote:


So I take back my prior "right".  Upon further inspection of the text and the 
MPICH2 code I believe it to be true that the number of the elements in the recvcounts 
array must be equal to the size of the LOCAL group.

This is quite illogical, but it will not be the first time the standard is 
somewhat lacking. So, if I understand you correctly, in the case of an 
intercommunicator a process doesn't know how much data it has to reduce, at 
least not until it receives the array of recvcounts from the remote group. 
Weird!

No, it knows because of the restriction that $sum_i^n{recvcounts[i]}$ yields 
the same sum in each group.

I should have read the entire paragraph of the standard … including the 
rationale. Indeed, the rationale describes exactly what you mentioned.

Apparently figure 12 at the following [MPI Forum blessed] link is supposed 
to clarify any potential misunderstanding regarding the reduce_scatter. Count 
how many elements are on each side of the intercommunicator ;)

   george.


The way it's implemented in MPICH2, and the way that makes this make a lot more 
sense to me, is that you first do intercommunicator reductions to temporary 
buffers on rank 0 in each group.  Then rank 0 scatters within the local group.  
The way I had been thinking about it was to do a local reduction followed by an 
intercomm scatter, but that isn't what the standard is saying, AFAICS.




--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





Re: [OMPI users] possible bug exercised by mpi4py

2012-05-25 Thread George Bosilca

On May 24, 2012, at 23:48 , Dave Goodell wrote:

> On May 24, 2012, at 10:34 PM CDT, George Bosilca wrote:
> 
>> On May 24, 2012, at 23:18, Dave Goodell  wrote:
>> 
>>> So I take back my prior "right".  Upon further inspection of the text and 
>>> the MPICH2 code I believe it to be true that the number of the elements in 
>>> the recvcounts array must be equal to the size of the LOCAL group.
>> 
>> This is quite illogical, but it will not be the first time the standard is 
>> somewhat lacking. So, if I understand you correctly, in the case of an 
>> intercommunicator a process doesn't know how much data it has to reduce, at 
>> least not until it receives the array of recvcounts from the remote group. 
>> Weird!
> 
> No, it knows because of the restriction that $sum_i^n{recvcounts[i]}$ yields 
> the same sum in each group.

I should have read the entire paragraph of the standard … including the 
rationale. Indeed, the rationale describes exactly what you mentioned.

Apparently figure 12 at the following [MPI Forum blessed] link is supposed 
to clarify any potential misunderstanding regarding the reduce_scatter. Count 
how many elements are on each side of the intercommunicator ;)
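As a concrete illustration (numbers made up here, not taken from the standard's 
figure): if group A has 2 processes with recvcounts = {3, 3} and group B has 3 
processes with recvcounts = {2, 2, 2}, both sums are 6, so every process 
provides a 6-element send buffer; the reduction of A's vectors is scattered 
2+2+2 across B, and the reduction of B's vectors is scattered 3+3 across A.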

  george.

> The way it's implemented in MPICH2, and the way that makes this make a lot 
> more sense to me, is that you first do intercommunicator reductions to 
> temporary buffers on rank 0 in each group.  Then rank 0 scatters within the 
> local group.  The way I had been thinking about it was to do a local 
> reduction followed by an intercomm scatter, but that isn't what the standard 
> is saying, AFAICS.




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-25 Thread Dave Goodell
On May 24, 2012, at 10:34 PM CDT, George Bosilca wrote:

> On May 24, 2012, at 23:18, Dave Goodell  wrote:
> 
>> So I take back my prior "right".  Upon further inspection of the text and 
>> the MPICH2 code I believe it to be true that the number of the elements in 
>> the recvcounts array must be equal to the size of the LOCAL group.
> 
> This is quite illogical, but it will not be the first time the standard is 
> somewhat lacking. So, if I understand you correctly, in the case of an 
> intercommunicator a process doesn't know how much data it has to reduce, at 
> least not until it receives the array of recvcounts from the remote group. 
> Weird!

No, it knows because of the restriction that $sum_i^n{recvcounts[i]}$ yields 
the same sum in each group.

The way it's implemented in MPICH2, and the way that makes this make a lot more 
sense to me, is that you first do intercommunicator reductions to temporary 
buffers on rank 0 in each group.  Then rank 0 scatters within the local group.  
The way I had been thinking about it was to do a local reduction followed by an 
intercomm scatter, but that isn't what the standard is saying, AFAICS.
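A minimal user-level sketch of that scheme (this is not the MPICH2 source; the 
function name, the am_group_a flag, and the assumption that the local 
intracommunicator used to build the intercommunicator is still at hand are all 
illustrative):

#include <mpi.h>
#include <stdlib.h>

/* Sketch of reduce_scatter over an intercommunicator 'inter' that was built
 * with MPI_Intercomm_create() from the intracommunicator 'local'.
 * 'am_group_a' distinguishes the two groups so that both groups start the
 * two cross-group reductions in the same order. */
static void reduce_scatter_inter_sketch(int *sendbuf, int *recvbuf,
                                        int *recvcounts, MPI_Comm inter,
                                        MPI_Comm local, int am_group_a)
{
    int lrank, lsize, i, total = 0;
    MPI_Comm_rank(local, &lrank);
    MPI_Comm_size(local, &lsize);        /* recvcounts has lsize entries */
    for (i = 0; i < lsize; i++)
        total += recvcounts[i];          /* same total in both groups    */

    int *tmp = (lrank == 0) ? malloc(total * sizeof(int)) : NULL;
    int root_arg = (lrank == 0) ? MPI_ROOT : MPI_PROC_NULL;

    if (am_group_a) {
        /* operation 1: reduce group A's send buffers onto rank 0 of group B */
        MPI_Reduce(sendbuf, NULL, total, MPI_INT, MPI_SUM, 0, inter);
        /* operation 2: receive the reduction of group B's data on our rank 0 */
        MPI_Reduce(NULL, tmp, total, MPI_INT, MPI_SUM, root_arg, inter);
    } else {
        MPI_Reduce(NULL, tmp, total, MPI_INT, MPI_SUM, root_arg, inter);
        MPI_Reduce(sendbuf, NULL, total, MPI_INT, MPI_SUM, 0, inter);
    }

    /* rank 0 then scatters the reduced remote data within the local group,
     * recvcounts[i] elements going to local rank i */
    int *displs = malloc(lsize * sizeof(int));
    displs[0] = 0;
    for (i = 1; i < lsize; i++)
        displs[i] = displs[i - 1] + recvcounts[i - 1];
    MPI_Scatterv(tmp, recvcounts, displs, MPI_INT,
                 recvbuf, recvcounts[lrank], MPI_INT, 0, local);

    free(displs);
    free(tmp);
}

The if/else is needed because the two MPI_Reduce calls are two distinct 
collective operations on the same intercommunicator, so both groups have to 
issue them in the same order.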

-Dave




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-25 Thread George Bosilca
On May 24, 2012, at 23:18, Dave Goodell  wrote:

> On May 24, 2012, at 8:13 PM CDT, Jeff Squyres wrote:
> 
>> On May 24, 2012, at 11:57 AM, Lisandro Dalcin wrote:
>> 
>>> The standard says this:
>>> 
>>> "Within each group, all processes provide the same recvcounts
>>> argument, and provide input vectors of  sum_i^n recvcounts[i] elements
>>> stored in the send buffers, where n is the size of the group"
>>> 
>>> So, I read " Within each group, ... where n is the size of the group"
>>> as being the LOCAL group size.
>> 
>> Actually, that seems like a direct contradiction with the prior sentence: 
>> 
>> If comm is an intercommunicator, then the result of the reduction of the 
>> data provided by processes in one group (group A) is scattered among 
>> processes in the other group (group B), and vice versa.
>> 
>> It looks like the implementors of 2 implementations agree that recvcounts 
>> should be the size of the remote group.  Sounds like this needs to be 
>> brought up in front of the Forum...
> 
> So I take back my prior "right".  Upon further inspection of the text and the 
> MPICH2 code I believe it to be true that the number of the elements in the 
> recvcounts array must be equal to the size of the LOCAL group.

This is quite illogical, but it will not be the first time the standard is 
somewhat lacking. So, if I understand you correctly, in the case of an 
intercommunicator a process doesn't know how much data it has to reduce, at 
least not until it receives the array of recvcounts from the remote group. 
Weird!

It makes much more sense to read it the other way. That would remove the need 
for an extra communication, as every rank knows everything from the beginning: 
what it will have to scatter to the remote group, as well as [based on the 
remote recvcounts] what it has to reduce in the local group.

  George.

> The text certainly could use a bit of clarification.  I'll bring it up at the 
> meeting next week.
> 
> -Dave
> 
> 



Re: [OMPI users] possible bug exercised by mpi4py

2012-05-25 Thread Dave Goodell
On May 24, 2012, at 8:13 PM CDT, Jeff Squyres wrote:

> On May 24, 2012, at 11:57 AM, Lisandro Dalcin wrote:
> 
>> The standard says this:
>> 
>> "Within each group, all processes provide the same recvcounts
>> argument, and provide input vectors of  sum_i^n recvcounts[i] elements
>> stored in the send buffers, where n is the size of the group"
>> 
>> So, I read " Within each group, ... where n is the size of the group"
>> as being the LOCAL group size.
> 
> Actually, that seems like a direct contradiction with the prior sentence: 
> 
> If comm is an intercommunicator, then the result of the reduction of the data 
> provided by processes in one group (group A) is scattered among processes in 
> the other group (group B), and vice versa.
> 
> It looks like the implementors of 2 implementations agree that recvcounts 
> should be the size of the remote group.  Sounds like this needs to be brought 
> up in front of the Forum...

So I take back my prior "right".  Upon further inspection of the text and the 
MPICH2 code I believe it to be true that the number of the elements in the 
recvcounts array must be equal to the size of the LOCAL group.

The text certainly could use a bit of clarification.  I'll bring it up at the 
meeting next week.

-Dave




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-25 Thread Dave Goodell
On May 24, 2012, at 10:57 AM CDT, Lisandro Dalcin wrote:

> On 24 May 2012 12:40, George Bosilca  wrote:
> 
>> I don't see much difference with the other collective. The generic behavior 
>> is that you apply the operation on the local group but the result is moved 
>> into the remote group.
> 
> Well, for me this one really IS different (for example, SCATTER is
> unidirectional for intercommunicators, but REDUCE_SCATTER is
> bidirectional). The "recvbuff" is a local buffer, but you understand
> "recvcounts" as remote.
> 
> Mmm, the standard is really confusing on this point...

Don't think of it like an intercommunicator-scatter, think of it more like an 
intercommunicator-allreduce.  The allreduce is also bidirectional.  The only 
difference is that instead of an allreduce (logically reduce+bcast), you 
instead have a reduce_scatter (logically reduce+scatterv).
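On an intracommunicator the analogy Dave describes is easy to write down; a 
sketch (the function name and the MPI_INT/MPI_SUM choices are just for 
illustration):

#include <mpi.h>
#include <stdlib.h>

/* "reduce + scatterv" decomposition: this computes the same result as
 * MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, MPI_INT, MPI_SUM, comm)
 * on an intracommunicator. */
static void reduce_then_scatterv(int *sendbuf, int *recvbuf,
                                 int *recvcounts, MPI_Comm comm)
{
    int rank, size, i, total = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int *displs = malloc(size * sizeof(int));
    for (i = 0; i < size; i++) {
        displs[i] = total;               /* offset of rank i's piece */
        total += recvcounts[i];
    }
    int *tmp = (rank == 0) ? malloc(total * sizeof(int)) : NULL;

    MPI_Reduce(sendbuf, tmp, total, MPI_INT, MPI_SUM, 0, comm);
    MPI_Scatterv(tmp, recvcounts, displs, MPI_INT,
                 recvbuf, recvcounts[rank], MPI_INT, 0, comm);

    free(displs);
    free(tmp);
}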

-Dave




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Jeff Squyres
On May 24, 2012, at 11:57 AM, Lisandro Dalcin wrote:

> The standard says this:
> 
> "Within each group, all processes provide the same recvcounts
> argument, and provide input vectors of  sum_i^n recvcounts[i] elements
> stored in the send buffers, where n is the size of the group"
> 
> So, I read " Within each group, ... where n is the size of the group"
> as being the LOCAL group size.

Actually, that seems like a direct contradiction with the prior sentence: 

If comm is an intercommunicator, then the result of the reduction of the data 
provided by processes in one group (group A) is scattered among processes in 
the other group (group B), and vice versa.

It looks like the implementors of 2 implementations agree that recvcounts 
should be the size of the remote group.  Sounds like this needs to be brought 
up in front of the Forum...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Lisandro Dalcin
On 24 May 2012 12:40, George Bosilca  wrote:
> On May 24, 2012, at 11:22 , Jeff Squyres wrote:
>
>> On May 24, 2012, at 11:10 AM, Lisandro Dalcin wrote:
>>
 So I checked them all, and I found SCATTERV, GATHERV, and REDUCE_SCATTER 
 all had the issue.  Now fixed on the trunk, and will be in 1.6.1.
>>>
>>> Please be careful with REDUCE_SCATTER[_BLOCK]. My understanding of
>>> the MPI standard is that the length of the recvcounts array is the
>>> local group size
>>> (http://www.mpi-forum.org/docs/mpi22-report/node113.htm#Node113)
>>
>>
>> I read that this morning and it made my head hurt.
>>
>> I read it to be: reduce the data in the local group, scatter the results to 
>> the remote group.
>>
>> As such, the reduce COUNT is sum(recvcounts), and is used for the reduction 
>> in the local group.  Then use recvcounts to scatter it to the remote group.
>>
>> …right?
>
> Right, you reduce locally but you scatter remotely. As such, the size of the 
> recvcounts buffer is the remote size. Since in the local group you do a reduce 
> (where every process participates with the same amount of data), you only need 
> a total count, which in this case is the sum of all recvcounts. This 
> requirement is enforced by the fact that the input buffer is of size sum of 
> all recvcounts, which makes sense only if you know the remote group's 
> recvcounts.

The standard says this:

"Within each group, all processes provide the same recvcounts
argument, and provide input vectors of  sum_i^n recvcounts[i] elements
stored in the send buffers, where n is the size of the group"

So, I read " Within each group, ... where n is the size of the group"
as being the LOCAL group size.

>
> I don't see much difference with the other collective. The generic behavior 
> is that you apply the operation on the local group but the result is moved 
> into the remote group.
>

Well, for me this one really IS different (for example, SCATTER is
unidirectional for intercommunicators, but REDUCE_SCATTER is
bidirectional). The "recvbuff" is a local buffer, but you understand
"recvcounts" as remote.

Mmm, the standard is really confusing on this point...

-- 
Lisandro Dalcin
---
CIMEC (INTEC/CONICET-UNL)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1011)
Tel/Fax: +54-342-4511169



Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread George Bosilca
On May 24, 2012, at 11:22 , Jeff Squyres wrote:

> On May 24, 2012, at 11:10 AM, Lisandro Dalcin wrote:
> 
>>> So I checked them all, and I found SCATTERV, GATHERV, and REDUCE_SCATTER 
>>> all had the issue.  Now fixed on the trunk, and will be in 1.6.1.
>> 
>> Please be careful with REDUCE_SCATTER[_BLOCK]. My understanding of
>> the MPI standard is that the length of the recvcounts array is the
>> local group size
>> (http://www.mpi-forum.org/docs/mpi22-report/node113.htm#Node113)
> 
> 
> I read that this morning and it made my head hurt.
> 
> I read it to be: reduce the data in the local group, scatter the results to 
> the remote group.
> 
> As such, the reduce COUNT is sum(recvcounts), and is used for the reduction 
> in the local group.  Then use recvcounts to scatter it to the remote group.
> 
> …right?

Right, you reduce locally but you scatter remotely. As such, the size of the 
recvcounts buffer is the remote size. Since in the local group you do a reduce 
(where every process participates with the same amount of data), you only need 
a total count, which in this case is the sum of all recvcounts. This requirement 
is enforced by the fact that the input buffer is of size sum of all recvcounts, 
which makes sense only if you know the remote group's recvcounts.

I don't see much difference with the other collective. The generic behavior is 
that you apply the operation on the local group but the result is moved into 
the remote group.
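Both sizes being argued about are directly queryable, for what it's worth; a 
small sketch (the function name is just for illustration):

#include <mpi.h>
#include <stdio.h>

/* On an intercommunicator, MPI_Comm_size() returns the LOCAL group size and
 * MPI_Comm_remote_size() the REMOTE group size -- the two candidate lengths
 * for the recvcounts array in this discussion. */
void print_group_sizes(MPI_Comm comm)
{
    int is_inter, local_size, remote_size = 0;
    MPI_Comm_test_inter(comm, &is_inter);
    MPI_Comm_size(comm, &local_size);
    if (is_inter)
        MPI_Comm_remote_size(comm, &remote_size);
    printf("local group: %d, remote group: %d (inter=%d)\n",
           local_size, remote_size, is_inter);
}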

  george.



> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Dave Goodell
On May 24, 2012, at 10:22 AM CDT, Jeff Squyres wrote:

> I read it to be: reduce the data in the local group, scatter the results to 
> the remote group.
> 
> As such, the reduce COUNT is sum(recvcounts), and is used for the reduction 
> in the local group.  Then use recvcounts to scatter it to the remote group.
> 
> ...right?
> 

right.

-Dave




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Jeff Squyres
On May 24, 2012, at 11:10 AM, Lisandro Dalcin wrote:

>> So I checked them all, and I found SCATTERV, GATHERV, and REDUCE_SCATTER all 
>> had the issue.  Now fixed on the trunk, and will be in 1.6.1.
> 
> Please be careful with REDUCE_SCATTER[_BLOCK]. My understanding of
> the MPI standard is that the length of the recvcounts array is the
> local group size
> (http://www.mpi-forum.org/docs/mpi22-report/node113.htm#Node113)


I read that this morning and it made my head hurt.

I read it to be: reduce the data in the local group, scatter the results to the 
remote group.

As such, the reduce COUNT is sum(recvcounts), and is used for the reduction in 
the local group.  Then use recvcounts to scatter it to the remote group.

...right?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Lisandro Dalcin
On 24 May 2012 10:28, Jeff Squyres  wrote:
> On May 24, 2012, at 6:53 AM, Jonathan Dursi wrote:
>
>> It seems like this might also be an issue for gatherv and reduce_scatter as 
>> well.
>
>
> Gah.  I spot-checked a few of these before my first commit, but didn't see 
> these.
>
> So I checked them all, and I found SCATTERV, GATHERV, and REDUCE_SCATTER all 
> had the issue.  Now fixed on the trunk, and will be in 1.6.1.
>

Please be careful with REDUCE_SCATTER[_BLOCK]. My understanding of
the MPI standard is that the length of the recvcounts array is the
local group size
(http://www.mpi-forum.org/docs/mpi22-report/node113.htm#Node113)

-- 
Lisandro Dalcin
---
CIMEC (INTEC/CONICET-UNL)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1011)
Tel/Fax: +54-342-4511169



Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread George Bosilca
This bug could appear in all collectives supporting intercommunicators where 
we check the receive buffer(s) for consistency. In addition to what Jeff 
already fixed, I fixed it in ALLTOALLV, ALLTOALLW and GATHER.

  george.

On May 24, 2012, at 09:37 , Jeff Squyres wrote:

> On May 24, 2012, at 9:28 AM, Jeff Squyres wrote:
> 
>> So I checked them all, and I found SCATTERV, GATHERV, and REDUCE_SCATTER all 
>> had the issue.  Now fixed on the trunk, and will be in 1.6.1.
> 
> 
> I forgot to mention -- this issue exists waaay back in the Open MPI code 
> base.  I spot-checked Open MPI 1.2.0 and see it there, too.  
> 
> To be clear: this particular bug only shows itself when you invoke 
> ALLGATHERV, GATHERV, SCATTERV, or REDUCE_SCATTER on an intercommunicator 
> where the sizes of the two groups are unequal.  Whether the problem shows 
> itself or not is rather random (i.e., it depends on how "safe" the memory is 
> after the recvcounts array).  FWIW, you can work around this bug by setting 
> the MCA parameter "mpi_param_check" to 0, which disables all MPI function 
> parameter checking.  That may not be attractive in some cases, of course.
> 
> More specifically: since this problem has been in the OMPI code base for 
> *years* (possibly since 1.0 -- but I'm not going to bother to check), it 
> shows how little real-world applications actually use this specific 
> functionality.  Don't get me wrong -- I'm *very* thankful to the mpi4py 
> community for raising this issue, and I'm glad to get it fixed!  But it does 
> show that there are dark, dusty corners in MPI functionality where few bother 
> to tread.  :-)
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Jeff Squyres
On May 24, 2012, at 9:28 AM, Jeff Squyres wrote:

> So I checked them all, and I found SCATTERV, GATHERV, and REDUCE_SCATTER all 
> had the issue.  Now fixed on the trunk, and will be in 1.6.1.


I forgot to mention -- this issue exists waaay back in the Open MPI code base.  
I spot-checked Open MPI 1.2.0 and see it there, too.  

To be clear: this particular bug only shows itself when you invoke ALLGATHERV, 
GATHERV, SCATTERV, or REDUCE_SCATTER on an intercommunicator where the sizes of 
the two groups are unequal.  Whether the problem shows itself or not is rather 
random (i.e., it depends on how "safe" the memory is after the recvcounts 
array).  FWIW, you can work around this bug by setting the MCA parameter 
"mpi_param_check" to 0, which disables all MPI function parameter checking.  
That may not be attractive in some cases, of course.

More specifically: since this problem has been in the OMPI code base for 
*years* (possibly since 1.0 -- but I'm not going to bother to check), it shows 
how little real-world applications actually use this specific functionality.  
Don't get me wrong -- I'm *very* thankful to the mpi4py community for raising 
this issue, and I'm glad to get it fixed!  But it does show that there are 
dark, dusty corners in MPI functionality where few bother to tread.  :-)
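For example (adjust to your own setup), the workaround would look something 
like:

  $ mpirun --mca mpi_param_check 0 -np 5 python test/runtests.py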

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Jeff Squyres
On May 24, 2012, at 6:53 AM, Jonathan Dursi wrote:

> It seems like this might also be an issue for gatherv and reduce_scatter as 
> well.


Gah.  I spot-checked a few of these before my first commit, but didn't see 
these.

So I checked them all, and I found SCATTERV, GATHERV, and REDUCE_SCATTER all 
had the issue.  Now fixed on the trunk, and will be in 1.6.1.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Jonathan Dursi
It seems like this might also be an issue for gatherv and reduce_scatter 
as well.


- Jonathan
--
Jonathan Dursi | SciNet, Compute/Calcul Canada | www.SciNetHPC.ca


Re: [OMPI users] possible bug exercised by mpi4py

2012-05-24 Thread Jeff Squyres
Many thanks for translating the test to C; this was a major help in debugging the issue.

Thankfully, it turned out to be a simple bug.  OMPI's parameter checking for 
MPI_ALLGATHERV was using the *local* group size when checking the recvcounts 
parameter, where it really should have been using the *remote* group size.  So 
when the local group size > the remote group size, Bad Things could happen.

For this test, the bad case would only happen with odd numbers of processes.  
It probably only happens sometimes because the contents of memory after the 
recvcounts array are undefined -- sometimes they'll be ok, sometimes they won't.
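
For anyone who wants to poke at the pattern without mpi4py, here is a small 
standalone sketch in the same spirit (this is NOT Lisandro's allgather.c; it 
just builds an intercommunicator with unequal groups and calls MPI_Allgatherv 
on it -- run with an odd -np, e.g. 5):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Comm local, inter;
    int rank, size, color, i, rsize;
    int sendval, *recvbuf, *rcounts, *displs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* split the world into two unequal halves and connect them */
    color = (rank < size / 2);
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &local);
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD,
                         color ? size / 2 : 0, 0, &inter);

    /* each process contributes one int and receives one int from every
     * process in the REMOTE group, so rcounts has remote_size entries */
    MPI_Comm_remote_size(inter, &rsize);
    rcounts = malloc(rsize * sizeof(int));
    displs  = malloc(rsize * sizeof(int));
    for (i = 0; i < rsize; i++) { rcounts[i] = 1; displs[i] = i; }
    recvbuf = malloc(rsize * sizeof(int));
    sendval = rank;

    MPI_Allgatherv(&sendval, 1, MPI_INT,
                   recvbuf, rcounts, displs, MPI_INT, inter);

    printf("rank %d got %d values from the remote group\n", rank, rsize);

    free(rcounts); free(displs); free(recvbuf);
    MPI_Comm_free(&inter); MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}

On the side whose local group is the larger one, a parameter check that loops 
over the local size reads past the end of rcounts, which is exactly the 
undefined-memory behavior described above.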

I fixed the issue in https://svn.open-mpi.org/trac/ompi/changeset/26488 and 
filed a ticket to move it to 1.6.1 in https://svn.open-mpi.org/trac/ompi/ticket/3105.

Many thanks for reporting the issue!


On May 23, 2012, at 10:30 PM, Jonathan Dursi wrote:

> On 23 May 9:37PM, Jonathan Dursi wrote:
> 
>> On the other hand, it works everywhere if I pad the rcounts array with
>> an extra valid value (0 or 1, or for that matter 783), or replace the
>> allgatherv with an allgather.
> 
> .. and it fails with 7 even where it worked (but succeeds with 8) if I pad 
> rcounts with an extra invalid value which should never be read.
> 
> Should the recvcounts[] parameters test in allgatherv.c loop up to 
> size=ompi_comm_remote_size(comm), as is done in alltoallv.c, rather than 
> ompi_comm_size(comm) ?   That seems to avoid the problem.
> 
>   - Jonathan
> -- 
> Jonathan Dursi | SciNet, Compute/Calcul Canada | www.SciNetHPC.ca


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] possible bug exercised by mpi4py

2012-05-23 Thread Jonathan Dursi

On 23 May 9:37PM, Jonathan Dursi wrote:


On the other hand, it works everywhere if I pad the rcounts array with
an extra valid value (0 or 1, or for that matter 783), or replace the
allgatherv with an allgather.


.. and it fails with 7 even where it worked (but succeeds with 8) if I 
pad rcounts with an extra invalid value which should never be read.


Should the recvcounts[] parameters test in allgatherv.c loop up to 
size=ompi_comm_remote_size(comm), as is done in alltoallv.c, rather than 
ompi_comm_size(comm) ?   That seems to avoid the problem.
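
A user-level rendition of the check being described, using the public MPI API 
rather than the Open MPI internals (the function name here is made up):

#include <mpi.h>

/* On an intercommunicator the recvcounts array describes the remote group,
 * so the loop bound must be the remote size (what ompi_comm_remote_size()
 * returns internally), not the local size (ompi_comm_size()). */
static int check_recvcounts(const int *recvcounts, MPI_Comm comm)
{
    int is_inter, size, i;
    MPI_Comm_test_inter(comm, &is_inter);
    if (is_inter)
        MPI_Comm_remote_size(comm, &size);
    else
        MPI_Comm_size(comm, &size);
    for (i = 0; i < size; i++)
        if (recvcounts[i] < 0)
            return MPI_ERR_COUNT;
    return MPI_SUCCESS;
}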


   - Jonathan
--
Jonathan Dursi | SciNet, Compute/Calcul Canada | www.SciNetHPC.ca


Re: [OMPI users] possible bug exercised by mpi4py

2012-05-23 Thread Bennet Fauber
In case it is helpful to those who may not have the Intel compilers, these 
are the libraries against which the two executables of Lisandro's 
allgather.c get linked:



with Intel compilers:
=
$ ldd a.out
linux-vdso.so.1 =>  (0x7fffb7dfd000)
libmpi.so.0 => 
/home/software/rhel5/openmpi-1.4.3/intel-11.0/lib/libmpi.so.0 (0x2b460cec2000)
libopen-rte.so.0 => 
/home/software/rhel5/openmpi-1.4.3/intel-11.0/lib/libopen-rte.so.0 
(0x2b460d198000)
libopen-pal.so.0 => 
/home/software/rhel5/openmpi-1.4.3/intel-11.0/lib/libopen-pal.so.0 
(0x2b460d40)
libdl.so.2 => /lib64/libdl.so.2 (0x003f6a20)
libnsl.so.1 => /lib64/libnsl.so.1 (0x003f6fe0)
libutil.so.1 => /lib64/libutil.so.1 (0x003f7460)
libm.so.6 => /lib64/libm.so.6 (0x003f6a60)
libpthread.so.0 => /lib64/libpthread.so.0 (0x003f6ae0)
libc.so.6 => /lib64/libc.so.6 (0x003f69e0)
libimf.so => /usr/caen/intel-11.0/fc/11.0.074/lib/intel64/libimf.so 
(0x2b460d69f000)
libsvml.so => /usr/caen/intel-11.0/fc/11.0.074/lib/intel64/libsvml.so 
(0x2b460d9f6000)
libintlc.so.5 => 
/usr/caen/intel-11.0/fc/11.0.074/lib/intel64/libintlc.so.5 (0x2b460dbb3000)
libgcc_s.so.1 => /home/software/rhel5/gcc/4.6.2/lib64/libgcc_s.so.1 
(0x2b460dcf)
/lib64/ld-linux-x86-64.so.2 (0x003f69a0)
=


with GCC 4.6.2
=
$ ldd a.out
linux-vdso.so.1 =>  (0x7fff93dfd000)
libmpi.so.0 => 
/home/software/rhel5/openmpi-1.4.4/gcc-4.6.2/lib/libmpi.so.0 (0x2ab3ba523000)
libopen-rte.so.0 => 
/home/software/rhel5/openmpi-1.4.4/gcc-4.6.2/lib/libopen-rte.so.0 
(0x2ab3ba7cf000)
libopen-pal.so.0 => 
/home/software/rhel5/openmpi-1.4.4/gcc-4.6.2/lib/libopen-pal.so.0 
(0x2ab3baa1d000)
libdl.so.2 => /lib64/libdl.so.2 (0x003f6a20)
libnsl.so.1 => /lib64/libnsl.so.1 (0x003f6fe0)
libutil.so.1 => /lib64/libutil.so.1 (0x003f7460)
libm.so.6 => /lib64/libm.so.6 (0x003f6a60)
libpthread.so.0 => /lib64/libpthread.so.0 (0x003f6ae0)
libc.so.6 => /lib64/libc.so.6 (0x003f69e0)
/lib64/ld-linux-x86-64.so.2 (0x003f69a0)
=

-- bennet
--
East Hall Technical Services
Mathematics and Psychology Research Computing
University of Michigan
(734) 763-1182


Re: [OMPI users] possible bug exercised by mpi4py

2012-05-23 Thread Jonathan Dursi
Fails for me with 1.4.3 with gcc, but works with intel; works with 1.4.4 
with gcc or intel; fails with 1.5.5 with either.   Succeeds with intelmpi.


On the other hand, it works everywhere if I pad the rcounts array  with 
an extra valid value (0 or 1, or for that matter 783), or replace the 
allgatherv with an allgather.


  - Jonathan
--
Jonathan Dursi | SciNet, Compute/Calcul Canada | www.SciNetHPC.ca


Re: [OMPI users] possible bug exercised by mpi4py

2012-05-23 Thread Bennet Fauber

On Wed, 23 May 2012, Lisandro Dalcin wrote:


On 23 May 2012 19:04, Jeff Squyres  wrote:

Thanks for all the info!

But still, can we get a copy of the test in C?  That would make it 
significantly easier for us to tell if there is a problem with Open MPI -- 
mainly because we don't know anything about the internals of mpi4py.


FYI, this test ran fine with previous (but recent, say 1.5.4)
OpenMPI versions, but fails with 1.6. The test also runs fine with
MPICH2.


I compiled the C example Lisandro provided using openmpi/1.4.3 compiled 
against the Intel 11.0 compilers, and it ran the first time.  I then 
recompiled using gcc 4.6.2 and openmpi 1.4.4, and it provided the 
following errors:


$ mpirun -np 5 a.out
[hostname:6601] *** An error occurred in MPI_Allgatherv
[hostname:6601] *** on communicator
[hostname:6601] *** MPI_ERR_COUNT: invalid count argument
[hostname:6601] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--
mpirun has exited due to process rank 4 with PID 6601 on
node hostname exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--

I then recompiled using the Intel compilers, and it runs without error 10 
out of 10 times.


I then recompiled using the gcc 4.6.2/openmpi 1.4.4 combination, and it 
fails consistently.


On the second and subsequent tries, it provides the following additional 
errors:


$ mpirun -np 5 a.out
[hostname:7168] *** An error occurred in MPI_Allgatherv
[hostname:7168] *** on communicator
[hostname:7168] *** MPI_ERR_COUNT: invalid count argument
[hostname:7168] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--
mpirun has exited due to process rank 2 with PID 7168 on
node hostname exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
[hostname:07163] 1 more process has sent help message help-mpi-errors.txt / 
mpi_errors_are_fatal
[hostname:07163] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages

Not sure if that information is helpful or not.

I am still completely puzzled as to why the number 5 is magic.

-- bennet

Re: [OMPI users] possible bug exercised by mpi4py

2012-05-23 Thread Jeff Squyres
Thanks for all the info!

But still, can we get a copy of the test in C?  That would make it 
significantly easier for us to tell if there is a problem with Open MPI -- 
mainly because we don't know anything about the internals of mpi4py.


On May 23, 2012, at 5:43 PM, Bennet Fauber wrote:

> Thanks, Ralph,
> 
> On Wed, 23 May 2012, Ralph Castain wrote:
> 
>> I don't honestly think many of us have any knowledge of mpi4py. Does this 
>> test work with other MPIs?
> 
> The mpi4py developers have said they've never seen this using mpich2.  I have 
> not been able to test that myself.
> 
>> MPI_Allgather seems to be passing our tests, so I suspect it is something in 
>> the binding. If you can provide the actual test, I'm willing to take a look 
>> at it.
> 
> The actual test is included in the install bundle for mpi4py, along with the 
> C source code used to create the bindings.
> 
>   http://code.google.com/p/mpi4py/downloads/list
> 
> The install is straightforward and simple.  Unpack the tarball, make sure 
> that mpicc is in your path
> 
>   $ cd mpi4py-1.3
>   $ python setup.py build
>   $ python setup.py install --prefix=/your/install
>   $ export PYTHONPATH=/your/install/lib/pythonN.M/site-packages
>   $ mpirun -np 5 python test/runtests.py \
>--verbose --no-threads --include cco_obj_inter
> 
> where N.M are the major.minor numbers of your python distribution.
> 
> What I find most puzzling is that maybe 1 out of 10 times it will run to 
> completion with -np 5, while it always runs with every other number of 
> processes I've tested.
> 
>   -- bennet
> 
>> On May 23, 2012, at 2:52 PM, Bennet Fauber wrote:
>> 
>>> I've installed the latest mpi4py-1.3 on several systems, and there is a 
>>> repeated bug when running
>>> 
>>> $ mpirun -np 5 python test/runtests.py
>>> 
>>> where it throws an error on mpigather with openmpi-1.4.4 and hangs with 
>>> openmpi-1.3.
>>> 
>>> It runs to completion and passes all tests when run with -np of 2, 3, 4, 6, 
>>> 7, 8, 9, 10, 11, and 12.
>>> 
>>> There is a thread on this at
>>> 
>>> http://groups.google.com/group/mpi4py/browse_thread/thread/509ac46af6f79973
>>> 
>>> where others report being able to replicate, too.
>>> 
>>> The compiler used first was gcc-4.6.2, with openmpi-1.4.4.
>>> 
>>> These are all Red Hat machines, RHEL 5 or 6 and with multiple compilers and 
>>> versions of openmpi 1.3.0 and 1.4.4.
>>> 
>>> Lisandro who is the primary developer of mpi4py is able to replicate on 
>>> Fedora 16.
>>> 
>>> Someone else is able to reproduce with
>>> 
>>> [ quoting from the groups.google.com page... ]
>>> ===
>>> It also happens with the current hg version of mpi4py and
>>> $ rpm -qa openmpi gcc python
>>> python-2.7.3-6.fc17.x86_64
>>> gcc-4.7.0-5.fc17.x86_64
>>> openmpi-1.5.4-5.fc17.1.x86_64
>>> ===
>>> 
>>> So, I believe this is a bug to be reported.  Per the advice at
>>> 
>>> http://www.open-mpi.org/community/help/bugs.php
>>> 
>>> If you feel that you do have a definite bug to report but are
>>> unsure which list to post to, then post to the user's list.
>>> 
>>> Please let me know if there is additional information that you need to 
>>> replicate.
>>> 
>>> Some output is included below the signature in case it is useful.
>>> 
>>> -- bennet
>>> --
>>> East Hall Technical Services
>>> Mathematics and Psychology Research Computing
>>> University of Michigan
>>> (734) 763-1182
>>> 
>>> On RHEL 5, openmpi 1.3, gcc 4.1.2, python 2.7
>>> 
>>> $ mpirun -np 5 --mca btl ^sm python test/runtests.py --verbose --no-threads 
>>> --include cco_obj_inter
>>> [0...@sirocco.math.lsa.umich.edu] Python 2.7 
>>> (/home/bennet/epd7.2.2/bin/python)
>>> [0...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
>>> [0...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
>>> (build/lib.linux-x86_64-2.7/mpi4py)
>>> [1...@sirocco.math.lsa.umich.edu] Python 2.7 
>>> (/home/bennet/epd7.2.2/bin/python)
>>> [1...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
>>> [1...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
>>> (build/lib.linux-x86_64-2.7/mpi4py)
>>> [2...@sirocco.math.lsa.umich.edu] Python 2.7 
>>> (/home/bennet/epd7.2.2/bin/python)
>>> [2...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
>>> [2...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
>>> (build/lib.linux-x86_64-2.7/mpi4py)
>>> [3...@sirocco.math.lsa.umich.edu] Python 2.7 
>>> (/home/bennet/epd7.2.2/bin/python)
>>> [3...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
>>> [3...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
>>> (build/lib.linux-x86_64-2.7/mpi4py)
>>> [4...@sirocco.math.lsa.umich.edu] Python 2.7 
>>> (/home/bennet/epd7.2.2/bin/python)
>>> [4...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
>>> [4...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
>>> (build/lib.linux-x86_64-2.7/mpi4py)
>>> testAllgather 

Re: [OMPI users] possible bug exercised by mpi4py

2012-05-23 Thread Bennet Fauber

Jeff,

Well, not really, since the test is written in python   ;-)

The mpi4py source code is at

http://code.google.com/p/mpi4py/downloads/list

but I'm not sure what else I can provide, though.

I'm more the reporting middleman here.  I'd be happy to try to connect you 
and the developer of mpi4py.  It seems like Open MPI should work regardless 
of what value of -np is used, which is what puzzles both me and the mpi4py 
developers.


-- bennet

On Wed, 23 May 2012, Jeff Squyres wrote:


Can you provide us with a C version of the test?

On May 23, 2012, at 4:52 PM, Bennet Fauber wrote:


I've installed the latest mpi4py-1.3 on several systems, and there is a 
repeated bug when running

$ mpirun -np 5 python test/runtests.py

where it throws an error on mpigather with openmpi-1.4.4 and hangs with 
openmpi-1.3.

It runs to completion and passes all tests when run with -np of 2, 3, 4, 6, 7, 
8, 9, 10, 11, and 12.

There is a thread on this at

http://groups.google.com/group/mpi4py/browse_thread/thread/509ac46af6f79973

where others report being able to replicate, too.

The compiler used first was gcc-4.6.2, with openmpi-1.4.4.

These are all Red Hat machines, RHEL 5 or 6 and with multiple compilers and 
versions of openmpi 1.3.0 and 1.4.4.

Lisandro who is the primary developer of mpi4py is able to replicate on Fedora 
16.

Someone else is able to reproduce with

[ quoting from the groups.google.com page... ]
===
It also happens with the current hg version of mpi4py and
$ rpm -qa openmpi gcc python
python-2.7.3-6.fc17.x86_64
gcc-4.7.0-5.fc17.x86_64
openmpi-1.5.4-5.fc17.1.x86_64
===

So, I believe this is a bug to be reported.  Per the advice at

http://www.open-mpi.org/community/help/bugs.php

If you feel that you do have a definite bug to report but are
unsure which list to post to, then post to the user's list.

Please let me know if there is additional information that you need to 
replicate.

Some output is included below the signature in case it is useful.

-- bennet
--
East Hall Technical Services
Mathematics and Psychology Research Computing
University of Michigan
(734) 763-1182

On RHEL 5, openmpi 1.3, gcc 4.1.2, python 2.7

$ mpirun -np 5 --mca btl ^sm python test/runtests.py --verbose --no-threads 
--include cco_obj_inter
[0...@sirocco.math.lsa.umich.edu] Python 2.7 (/home/bennet/epd7.2.2/bin/python)
[0...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
[0...@sirocco.math.lsa.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
[1...@sirocco.math.lsa.umich.edu] Python 2.7 (/home/bennet/epd7.2.2/bin/python)
[1...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
[1...@sirocco.math.lsa.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
[2...@sirocco.math.lsa.umich.edu] Python 2.7 (/home/bennet/epd7.2.2/bin/python)
[2...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
[2...@sirocco.math.lsa.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
[3...@sirocco.math.lsa.umich.edu] Python 2.7 (/home/bennet/epd7.2.2/bin/python)
[3...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
[3...@sirocco.math.lsa.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
[4...@sirocco.math.lsa.umich.edu] Python 2.7 (/home/bennet/epd7.2.2/bin/python)
[4...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
[4...@sirocco.math.lsa.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
testAllgather (test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
(test_cco_obj_inter.TestCCOObjInter) ...
[ hangs ]

RHEL5
===
$ python
Python 2.6.6 (r266:84292, Sep 12 2011, 14:03:14)
[GCC 4.4.5 20110214 (Red Hat 4.4.5-6)] on linux2

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/software/rhel6/gcc/4.7.0/libexec/gcc/x86_64-
unknown-linux-gnu/4.7.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-4.7.0/configure --prefix=/home/software/rhel6/
gcc/4.7.0 --with-mpfr=/home/software/rhel6/gcc/mpfr-3.1.0/ --with-mpc=/
home/software/rhel6/gcc/mpc-0.9/ --with-gmp=/home/software/rhel6/gcc/
gmp-5.0.5/ --disable-multilib
Thread model: posix
gcc version 4.7.0 (GCC)

$ mpirun -np 5 python test/runtests.py --verbose --no-threads --include 
cco_obj_inter
[4...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
[4...@host-rh6.engin.umich.edu] MPI 2.1 (Open MPI 1.6.0)
[4...@host-rh6.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.6/mpi4py)
[2...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
[2...@host-rh6.engin.umich.edu] MPI 2.1 (Open MPI 1.6.0)
[2...@host-rh6.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.6/mpi4py)

Re: [OMPI users] possible bug exercised by mpi4py

2012-05-23 Thread Bennet Fauber

Thanks, Ralph,

On Wed, 23 May 2012, Ralph Castain wrote:

I don't honestly think many of us have any knowledge of mpi4py. Does 
this test work with other MPIs?


The mpi4py developers have said they've never seen this using mpich2.  I 
have not been able to test that myself.


MPI_Allgather seems to be passing our tests, so I suspect it is 
something in the binding. If you can provide the actual test, I'm 
willing to take a look at it.


The actual test is included in the install bundle for mpi4py, along with 
the C source code used to create the bindings.


http://code.google.com/p/mpi4py/downloads/list

The install is straightforward and simple.  Unpack the tarball, make sure 
that mpicc is in your path


$ cd mpi4py-1.3
$ python setup.py build
$ python setup.py install --prefix=/your/install
$ export PYTHONPATH=/your/install/lib/pythonN.M/site-packages
$ mpirun -np 5 python test/runtests.py \
 --verbose --no-threads --include cco_obj_inter

where N.M are the major.minor numbers of your python distribution.

What I find most puzzling is that maybe 1 out of 10 times it will run to 
completion with -np 5, while it always runs with every other number of 
processes I've tested.


-- bennet


On May 23, 2012, at 2:52 PM, Bennet Fauber wrote:


I've installed the latest mpi4py-1.3 on several systems, and there is a 
repeated bug when running

$ mpirun -np 5 python test/runtests.py

where it throws an error on mpigather with openmpi-1.4.4 and hangs with 
openmpi-1.3.

It runs to completion and passes all tests when run with -np of 2, 3, 4, 6, 7, 
8, 9, 10, 11, and 12.

There is a thread on this at

http://groups.google.com/group/mpi4py/browse_thread/thread/509ac46af6f79973

where others report being able to replicate, too.

The compiler used first was gcc-4.6.2, with openmpi-1.4.4.

These are all Red Hat machines, RHEL 5 or 6 and with multiple compilers and 
versions of openmpi 1.3.0 and 1.4.4.

Lisandro who is the primary developer of mpi4py is able to replicate on Fedora 
16.

Someone else is able to reproduce with

[ quoting from the groups.google.com page... ]
===
It also happens with the current hg version of mpi4py and
$ rpm -qa openmpi gcc python
python-2.7.3-6.fc17.x86_64
gcc-4.7.0-5.fc17.x86_64
openmpi-1.5.4-5.fc17.1.x86_64
===

So, I believe this is a bug to be reported.  Per the advice at

http://www.open-mpi.org/community/help/bugs.php

If you feel that you do have a definite bug to report but are
unsure which list to post to, then post to the user's list.

Please let me know if there is additional information that you need to 
replicate.

Some output is included below the signature in case it is useful.

-- bennet
--
East Hall Technical Services
Mathematics and Psychology Research Computing
University of Michigan
(734) 763-1182

On RHEL 5, openmpi 1.3, gcc 4.1.2, python 2.7

$ mpirun -np 5 --mca btl ^sm python test/runtests.py --verbose --no-threads 
--include cco_obj_inter
[0...@sirocco.math.lsa.umich.edu] Python 2.7 (/home/bennet/epd7.2.2/bin/python)
[0...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
[0...@sirocco.math.lsa.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
[1...@sirocco.math.lsa.umich.edu] Python 2.7 (/home/bennet/epd7.2.2/bin/python)
[1...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
[1...@sirocco.math.lsa.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
[2...@sirocco.math.lsa.umich.edu] Python 2.7 (/home/bennet/epd7.2.2/bin/python)
[2...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
[2...@sirocco.math.lsa.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
[3...@sirocco.math.lsa.umich.edu] Python 2.7 (/home/bennet/epd7.2.2/bin/python)
[3...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
[3...@sirocco.math.lsa.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
[4...@sirocco.math.lsa.umich.edu] Python 2.7 (/home/bennet/epd7.2.2/bin/python)
[4...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
[4...@sirocco.math.lsa.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.7/mpi4py)
testAllgather (test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
(test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
(test_cco_obj_inter.TestCCOObjInter) ...
[ hangs ]

RHEL5
===
$ python
Python 2.6.6 (r266:84292, Sep 12 2011, 14:03:14)
[GCC 4.4.5 20110214 (Red Hat 4.4.5-6)] on linux2

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/software/rhel6/gcc/4.7.0/libexec/gcc/x86_64-
unknown-linux-gnu/4.7.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-4.7.0/configure 

Re: [OMPI users] possible bug exercised by mpi4py

2012-05-23 Thread Jeff Squyres
Can you provide us with a C version of the test?

On May 23, 2012, at 4:52 PM, Bennet Fauber wrote:

> I've installed the latest mpi4py-1.3 on several systems, and there is a 
> repeated bug when running
> 
>   $ mpirun -np 5 python test/runtests.py
> 
> where it throws an error on mpigather with openmpi-1.4.4 and hangs with 
> openmpi-1.3.
> 
> It runs to completion and passes all tests when run with -np of 2, 3, 4, 6, 
> 7, 8, 9, 10, 11, and 12.
> 
> There is a thread on this at
> 
> http://groups.google.com/group/mpi4py/browse_thread/thread/509ac46af6f79973
> 
> where others report being able to replicate, too.
> 
> The compiler used first was gcc-4.6.2, with openmpi-1.4.4.
> 
> These are all Red Hat machines, RHEL 5 or 6 and with multiple compilers and 
> versions of openmpi 1.3.0 and 1.4.4.
> 
> Lisandro who is the primary developer of mpi4py is able to replicate on 
> Fedora 16.
> 
> Someone else is able to reproduce with
> 
> [ quoting from the groups.google.com page... ]
> ===
> It also happens with the current hg version of mpi4py and
> $ rpm -qa openmpi gcc python
> python-2.7.3-6.fc17.x86_64
> gcc-4.7.0-5.fc17.x86_64
> openmpi-1.5.4-5.fc17.1.x86_64
> ===
> 
> So, I believe this is a bug to be reported.  Per the advice at
> 
>   http://www.open-mpi.org/community/help/bugs.php
> 
>   If you feel that you do have a definite bug to report but are
>   unsure which list to post to, then post to the user's list.
> 
> Please let me know if there is additional information that you need to 
> replicate.
> 
> Some output is included below the signature in case it is useful.
> 
>   -- bennet
> --
> East Hall Technical Services
> Mathematics and Psychology Research Computing
> University of Michigan
> (734) 763-1182
> 
> On RHEL 5, openmpi 1.3, gcc 4.1.2, python 2.7
> 
> $ mpirun -np 5 --mca btl ^sm python test/runtests.py --verbose --no-threads 
> --include cco_obj_inter
> [0...@sirocco.math.lsa.umich.edu] Python 2.7 
> (/home/bennet/epd7.2.2/bin/python)
> [0...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
> [0...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
> (build/lib.linux-x86_64-2.7/mpi4py)
> [1...@sirocco.math.lsa.umich.edu] Python 2.7 
> (/home/bennet/epd7.2.2/bin/python)
> [1...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
> [1...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
> (build/lib.linux-x86_64-2.7/mpi4py)
> [2...@sirocco.math.lsa.umich.edu] Python 2.7 
> (/home/bennet/epd7.2.2/bin/python)
> [2...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
> [2...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
> (build/lib.linux-x86_64-2.7/mpi4py)
> [3...@sirocco.math.lsa.umich.edu] Python 2.7 
> (/home/bennet/epd7.2.2/bin/python)
> [3...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
> [3...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
> (build/lib.linux-x86_64-2.7/mpi4py)
> [4...@sirocco.math.lsa.umich.edu] Python 2.7 
> (/home/bennet/epd7.2.2/bin/python)
> [4...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
> [4...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
> (build/lib.linux-x86_64-2.7/mpi4py)
> testAllgather (test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
> (test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
> (test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
> (test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
> (test_cco_obj_inter.TestCCOObjInter) ...
> [ hangs ]
> 
> RHEL5
> ===
> $ python
> Python 2.6.6 (r266:84292, Sep 12 2011, 14:03:14)
> [GCC 4.4.5 20110214 (Red Hat 4.4.5-6)] on linux2
> 
> $ gcc -v
> Using built-in specs.
> COLLECT_GCC=gcc
> COLLECT_LTO_WRAPPER=/home/software/rhel6/gcc/4.7.0/libexec/gcc/x86_64-
> unknown-linux-gnu/4.7.0/lto-wrapper
> Target: x86_64-unknown-linux-gnu
> Configured with: ../gcc-4.7.0/configure --prefix=/home/software/rhel6/
> gcc/4.7.0 --with-mpfr=/home/software/rhel6/gcc/mpfr-3.1.0/ --with-mpc=/
> home/software/rhel6/gcc/mpc-0.9/ --with-gmp=/home/software/rhel6/gcc/
> gmp-5.0.5/ --disable-multilib
> Thread model: posix
> gcc version 4.7.0 (GCC)
> 
> $ mpirun -np 5 python test/runtests.py --verbose --no-threads --include 
> cco_obj_inter
> [4...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
> [4...@host-rh6.engin.umich.edu] MPI 2.1 (Open MPI 1.6.0)
> [4...@host-rh6.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.6/mpi4py)
> [2...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
> [2...@host-rh6.engin.umich.edu] MPI 2.1 (Open MPI 1.6.0)
> [2...@host-rh6.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.6/mpi4py)
> [1...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
> [1...@host-rh6.engin.umich.edu] MPI 2.1 (Open MPI 1.6.0)
> [1...@host-rh6.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.6/mpi4py)
> [0...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
> [0...@host-rh6.engin.umich.edu] 

Re: [OMPI users] possible bug exercised by mpi4py

2012-05-23 Thread Ralph Castain
I don't honestly think many of us have any knowledge of mpi4py. Does this test 
work with other MPIs?

MPI_Allgather seems to be passing our tests, so I suspect it is something in 
the binding. If you can provide the actual test, I'm willing to take a look at 
it.


On May 23, 2012, at 2:52 PM, Bennet Fauber wrote:

> I've installed the latest mpi4py-1.3 on several systems, and there is a 
> repeated bug when running
> 
>   $ mpirun -np 5 python test/runtests.py
> 
> where it throws an error on mpigather with openmpi-1.4.4 and hangs with 
> openmpi-1.3.
> 
> It runs to completion and passes all tests when run with -np of 2, 3, 4, 6, 
> 7, 8, 9, 10, 11, and 12.
> 
> There is a thread on this at
> 
> http://groups.google.com/group/mpi4py/browse_thread/thread/509ac46af6f79973
> 
> where others report being able to replicate, too.
> 
> The compiler used first was gcc-4.6.2, with openmpi-1.4.4.
> 
> These are all Red Hat machines, RHEL 5 or 6 and with multiple compilers and 
> versions of openmpi 1.3.0 and 1.4.4.
> 
> Lisandro who is the primary developer of mpi4py is able to replicate on 
> Fedora 16.
> 
> Someone else is able to reproduce with
> 
> [ quoting from the groups.google.com page... ]
> ===
> It also happens with the current hg version of mpi4py and
> $ rpm -qa openmpi gcc python
> python-2.7.3-6.fc17.x86_64
> gcc-4.7.0-5.fc17.x86_64
> openmpi-1.5.4-5.fc17.1.x86_64
> ===
> 
> So, I believe this is a bug to be reported.  Per the advice at
> 
>   http://www.open-mpi.org/community/help/bugs.php
> 
>   If you feel that you do have a definite bug to report but are
>   unsure which list to post to, then post to the user's list.
> 
> Please let me know if there is additional information that you need to 
> replicate.
> 
> Some output is included below the signature in case it is useful.
> 
>   -- bennet
> --
> East Hall Technical Services
> Mathematics and Psychology Research Computing
> University of Michigan
> (734) 763-1182
> 
> On RHEL 5, openmpi 1.3, gcc 4.1.2, python 2.7
> 
> $ mpirun -np 5 --mca btl ^sm python test/runtests.py --verbose --no-threads 
> --include cco_obj_inter
> [0...@sirocco.math.lsa.umich.edu] Python 2.7 
> (/home/bennet/epd7.2.2/bin/python)
> [0...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
> [0...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
> (build/lib.linux-x86_64-2.7/mpi4py)
> [1...@sirocco.math.lsa.umich.edu] Python 2.7 
> (/home/bennet/epd7.2.2/bin/python)
> [1...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
> [1...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
> (build/lib.linux-x86_64-2.7/mpi4py)
> [2...@sirocco.math.lsa.umich.edu] Python 2.7 
> (/home/bennet/epd7.2.2/bin/python)
> [2...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
> [2...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
> (build/lib.linux-x86_64-2.7/mpi4py)
> [3...@sirocco.math.lsa.umich.edu] Python 2.7 
> (/home/bennet/epd7.2.2/bin/python)
> [3...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
> [3...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
> (build/lib.linux-x86_64-2.7/mpi4py)
> [4...@sirocco.math.lsa.umich.edu] Python 2.7 
> (/home/bennet/epd7.2.2/bin/python)
> [4...@sirocco.math.lsa.umich.edu] MPI 2.0 (Open MPI 1.3.0)
> [4...@sirocco.math.lsa.umich.edu] mpi4py 1.3 
> (build/lib.linux-x86_64-2.7/mpi4py)
> testAllgather (test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
> (test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
> (test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
> (test_cco_obj_inter.TestCCOObjInter) ... testAllgather 
> (test_cco_obj_inter.TestCCOObjInter) ...
> [ hangs ]
> 
> RHEL5
> ===
> $ python
> Python 2.6.6 (r266:84292, Sep 12 2011, 14:03:14)
> [GCC 4.4.5 20110214 (Red Hat 4.4.5-6)] on linux2
> 
> $ gcc -v
> Using built-in specs.
> COLLECT_GCC=gcc
> COLLECT_LTO_WRAPPER=/home/software/rhel6/gcc/4.7.0/libexec/gcc/x86_64-
> unknown-linux-gnu/4.7.0/lto-wrapper
> Target: x86_64-unknown-linux-gnu
> Configured with: ../gcc-4.7.0/configure --prefix=/home/software/rhel6/
> gcc/4.7.0 --with-mpfr=/home/software/rhel6/gcc/mpfr-3.1.0/ --with-mpc=/
> home/software/rhel6/gcc/mpc-0.9/ --with-gmp=/home/software/rhel6/gcc/
> gmp-5.0.5/ --disable-multilib
> Thread model: posix
> gcc version 4.7.0 (GCC)
> 
> $ mpirun -np 5 python test/runtests.py --verbose --no-threads --include 
> cco_obj_inter
> [4...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
> [4...@host-rh6.engin.umich.edu] MPI 2.1 (Open MPI 1.6.0)
> [4...@host-rh6.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.6/mpi4py)
> [2...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
> [2...@host-rh6.engin.umich.edu] MPI 2.1 (Open MPI 1.6.0)
> [2...@host-rh6.engin.umich.edu] mpi4py 1.3 (build/lib.linux-x86_64-2.6/mpi4py)
> [1...@host-rh6.engin.umich.edu] Python 2.6 (/usr/bin/python)
>