Many thanks for transcoding to C; this was a major help in debugging the issue.

Thankfully, it turned out to be a simple bug.  OMPI's parameter checking for 
MPI_ALLGATHERV was using the *local* group size when checking the recvcounts 
parameter, when it really should have been using the *remote* group size.  So 
on an intercommunicator whose local group is larger than its remote group, the 
check read past the end of the recvcounts array, and Bad Things could happen.
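
In case a sketch helps: written against the public MPI API rather than OMPI's 
internal ompi_comm_* helpers (so this is illustrative only, not the actual 
allgatherv.c code), the check needs to look something like this:

    /* Hedged sketch, not the literal OMPI source: the sanity check on
     * recvcounts must loop over the *remote* group size on an
     * intercommunicator, because that is how many entries the caller is
     * required to supply.  On an intracommunicator the two sizes coincide. */
    #include <mpi.h>

    static int check_recvcounts(MPI_Comm comm, const int recvcounts[])
    {
        int size, is_inter;
        MPI_Comm_test_inter(comm, &is_inter);
        if (is_inter) {
            MPI_Comm_remote_size(comm, &size); /* correct bound for intercomms */
        } else {
            MPI_Comm_size(comm, &size);        /* intracomm: local == remote */
        }
        for (int i = 0; i < size; ++i) {
            if (recvcounts[i] < 0) {
                return MPI_ERR_COUNT;          /* invalid count value */
            }
        }
        return MPI_SUCCESS;
    }

The buggy version effectively used the local group size as the loop bound in 
all cases, which is one or more entries too many whenever the local group is 
the larger one.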

For this test, the bad case would only happen with odd numbers of processes, 
since that's when the two groups end up with different sizes.  And it only 
fails sometimes because the contents of memory just past the end of the 
recvcounts array are undefined -- sometimes they'll look like valid counts, 
sometimes they won't.
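
For anyone who wants to poke at this, here's a minimal, hypothetical 
reproducer sketch (not Jonathan's original test): split an odd number of 
world ranks into two unequal groups, build an intercommunicator, and call 
MPI_ALLGATHERV with recvcounts sized to the remote group, which is all the 
MPI standard requires.  Under the buggy check, ranks in the larger group then 
read past the end of that array.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* run with an odd count >= 3 */

        /* Split by parity: with an odd total, the two groups differ in size. */
        int color = rank % 2;
        MPI_Comm local_comm;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &local_comm);

        /* Each group's leader is its lowest world rank; the remote leader
           is the lowest world rank of the other group. */
        int remote_leader = (color == 0) ? 1 : 0;
        MPI_Comm intercomm;
        MPI_Intercomm_create(local_comm, 0, MPI_COMM_WORLD, remote_leader,
                             12345, &intercomm);

        int remote_size;
        MPI_Comm_remote_size(intercomm, &remote_size);

        /* recvcounts/displs have exactly remote_size entries, as the
           standard requires; the faulty check iterated over the local group
           size, reading past the end of this array on the larger group. */
        int *recvcounts = malloc(remote_size * sizeof(int));
        int *displs     = malloc(remote_size * sizeof(int));
        for (int i = 0; i < remote_size; ++i) {
            recvcounts[i] = 1;
            displs[i]     = i;
        }

        int sendval = rank;
        int *recvbuf = malloc(remote_size * sizeof(int));
        MPI_Allgatherv(&sendval, 1, MPI_INT,
                       recvbuf, recvcounts, displs, MPI_INT, intercomm);

        if (rank == 0) {
            printf("allgatherv across the intercommunicator completed\n");
        }

        free(recvbuf);
        free(displs);
        free(recvcounts);
        MPI_Comm_free(&intercomm);
        MPI_Comm_free(&local_comm);
        MPI_Finalize();
        return 0;
    }

Whether a build with the old check actually fails on this depends on whatever 
happens to sit in memory after recvcounts, which is exactly why the original 
report was intermittent.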

I fixed the issue in https://svn.open-mpi.org/trac/ompi/changeset/26488 and 
filed https://svn.open-mpi.org/trac/ompi/ticket/3105 to move the fix into the 
1.6.1 release.

Many thanks for reporting the issue!


On May 23, 2012, at 10:30 PM, Jonathan Dursi wrote:

> On 23 May 9:37PM, Jonathan Dursi wrote:
> 
>> On the other hand, it works everywhere if I pad the rcounts array with
>> an extra valid value (0 or 1, or for that matter 783), or replace the
>> allgatherv with an allgather.
> 
> .. and it fails with 7 even where it worked (but succeeds with 8) if I pad 
> rcounts with an extra invalid value which should never be read.
> 
> Should the recvcounts[] parameters test in allgatherv.c loop up to 
> size=ompi_comm_remote_size(comm), as is done in alltoallv.c, rather than 
> ompi_comm_size(comm) ?   That seems to avoid the problem.
> 
>   - Jonathan
> -- 
> Jonathan Dursi | SciNet, Compute/Calcul Canada | www.SciNetHPC.ca


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

