The issue has been identified deep in the tuned collective component. It was fixed in the trunk and in 1.5 a while back, but was never pushed into 1.4. I attached a patch to the ticket and will force its way into the next 1.4 release.
Thanks,
  george.

On Feb 14, 2011, at 13:11, Jeff Squyres wrote:

> Thanks Jeremiah; I filed the following ticket about this:
>
>     https://svn.open-mpi.org/trac/ompi/ticket/2723
>
>
> On Feb 10, 2011, at 3:24 PM, Jeremiah Willcock wrote:
>
>> I forgot to mention that this was tested with 3 or 4 ranks, connected via TCP.
>>
>> -- Jeremiah Willcock
>>
>> On Thu, 10 Feb 2011, Jeremiah Willcock wrote:
>>
>>> Here is a small test case that hits the bug on 1.4.1:
>>>
>>> #include <mpi.h>
>>>
>>> int arr[1142];
>>>
>>> int main(int argc, char** argv) {
>>>   int rank, my_size;
>>>   MPI_Init(&argc, &argv);
>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>   my_size = (rank == 1) ? 1142 : 1088;
>>>   MPI_Bcast(arr, my_size, MPI_INT, 0, MPI_COMM_WORLD);
>>>   MPI_Finalize();
>>>   return 0;
>>> }
>>>
>>> I tried it on 1.5.1, and I get MPI_ERR_TRUNCATE instead, so this might have already been fixed.
>>>
>>> -- Jeremiah Willcock
>>>
>>> On Thu, 10 Feb 2011, Jeremiah Willcock wrote:
>>>
>>>> FYI, I am having trouble finding a small test case that will trigger this on 1.5; I'm either getting deadlocks or MPI_ERR_TRUNCATE, so it could have been fixed. What are the triggering rules for the different broadcast algorithms? It could be that only certain sizes or only certain BTLs trigger it.
>>>>
>>>> -- Jeremiah Willcock
>>>>
>>>> On Thu, 10 Feb 2011, Jeff Squyres wrote:
>>>>
>>>>> Nifty! Yes, I agree that that's a poor error message. It's probably (unfortunately) being propagated up from the underlying point-to-point system, where an ERR_IN_STATUS would actually make sense.
>>>>>
>>>>> I'll file a ticket about this. Thanks for the heads up.
>>>>>
>>>>> On Feb 9, 2011, at 4:49 PM, Jeremiah Willcock wrote:
>>>>>
>>>>>> On Wed, 9 Feb 2011, Jeremiah Willcock wrote:
>>>>>>
>>>>>>> I get the following Open MPI error from 1.4.1:
>>>>>>>
>>>>>>> *** An error occurred in MPI_Bcast
>>>>>>> *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
>>>>>>> *** MPI_ERR_IN_STATUS: error code in status
>>>>>>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>>>>>>
>>>>>>> (hostname and port removed from each line). There is no MPI_Status returned by MPI_Bcast, so I don't know what the error is. Is this something that people have seen before?
>>>>>>
>>>>>> For the record, this appears to be caused by specifying inconsistent data sizes on the different ranks in the broadcast operation. The error message could still be improved, though.
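For anyone stuck on a release without the fix, a minimal application-side sketch is to verify that every rank passes the same count before the real broadcast. The checked_bcast helper below is hypothetical (not an Open MPI API): it broadcasts the root's count first, and each rank aborts with a clear message if its own count disagrees, at the cost of one extra small broadcast per call.

    /* Hypothetical debugging aid, not part of Open MPI: catch count
     * mismatches before they reach the collective implementation. */
    #include <mpi.h>
    #include <stdio.h>

    static int checked_bcast(void *buf, int count, MPI_Datatype type,
                             int root, MPI_Comm comm)
    {
        int rank, root_count = count;

        MPI_Comm_rank(comm, &rank);

        /* Every rank learns the count the root intends to use. */
        MPI_Bcast(&root_count, 1, MPI_INT, root, comm);

        if (root_count != count) {
            fprintf(stderr,
                    "rank %d: MPI_Bcast count mismatch (root has %d, I have %d)\n",
                    rank, root_count, count);
            MPI_Abort(comm, 1);
        }

        return MPI_Bcast(buf, count, type, root, comm);
    }

Replacing the MPI_Bcast call in the test case above with checked_bcast(arr, my_size, MPI_INT, 0, MPI_COMM_WORLD) would make rank 1 report the 1088-vs-1142 mismatch directly instead of surfacing MPI_ERR_IN_STATUS or MPI_ERR_TRUNCATE.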