Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?
The issue has been identified deep inside the tuned collective component. It was fixed on the trunk and in 1.5 a while back, but the fix was never pushed into 1.4. I attached a patch to the ticket and will force its way into the next 1.4 release.

Thanks,
george.

On Feb 14, 2011, at 13:11, Jeff Squyres wrote:

> Thanks Jeremiah; I filed the following ticket about this:
>
> https://svn.open-mpi.org/trac/ompi/ticket/2723
Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?
Thanks Jeremiah; I filed the following ticket about this:

https://svn.open-mpi.org/trac/ompi/ticket/2723

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?
I forgot to mention that this was tested with 3 or 4 ranks, connected via TCP.

-- Jeremiah Willcock
Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?
Here is a small test case that hits the bug on 1.4.1:

#include <mpi.h>

int arr[1142];

int main(int argc, char** argv) {
  int rank, my_size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  my_size = (rank == 1) ? 1142 : 1088;
  MPI_Bcast(arr, my_size, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}

I tried it on 1.5.1, and I get MPI_ERR_TRUNCATE instead, so this might have already been fixed.

-- Jeremiah Willcock
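For anyone trying to reproduce this, a typical build-and-run invocation would look something like the following (the file name and rank count here are just examples, not from the thread):

  mpicc bcast_test.c -o bcast_test
  mpirun -np 3 ./bcast_test

Under MPI_ERRORS_ARE_FATAL (the default error handler), the mismatched counts should abort the job with the message shown in the original report further down the thread.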
Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?
FYI, I am having trouble finding a small test case that will trigger this on 1.5; I'm either getting deadlocks or MPI_ERR_TRUNCATE, so it could have been fixed. What are the triggering rules for different broadcast algorithms? It could be that only certain sizes or only certain BTLs trigger it.

-- Jeremiah Willcock
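One way to take the algorithm-selection rules out of the picture (assuming the tuned component's MCA parameters behave the same in these releases) is to pin the broadcast algorithm explicitly when rerunning the reproducer:

  mpirun -np 3 --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 6 ./bcast_test

The algorithm number selects one of the tuned component's broadcast implementations (6 is binomial in the versions I have seen); sweeping that value over the reproducer should show which algorithms surface the error versus deadlocking or returning MPI_ERR_TRUNCATE.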
Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?
Nifty! Yes, I agree that that's a poor error message. It's probably (unfortunately) being propagated up from the underlying point-to-point system, where an ERR_IN_STATUS would actually make sense.

I'll file a ticket about this. Thanks for the heads up.

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?
On Wed, 9 Feb 2011, Jeremiah Willcock wrote:

> I get the following Open MPI error from 1.4.1:
>
> *** An error occurred in MPI_Bcast
> *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> *** MPI_ERR_IN_STATUS: error code in status
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>
> (hostname and port removed from each line). There is no MPI_Status returned by MPI_Bcast, so I don't know what the error refers to. Is this something that people have seen before?

For the record, this appears to be caused by specifying inconsistent data sizes on the different ranks in the broadcast operation. The error message could still be improved, though.

-- Jeremiah Willcock
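For readers landing here with the same symptom: the root cause was ranks disagreeing on the broadcast count. A minimal sketch of one way to avoid the mismatch (an illustrative pattern, not code from the thread; the count of 1142 is arbitrary) is to broadcast the count itself from the root first, so every rank passes the same count to the data broadcast:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
  int rank, count = 0;
  int* buf;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Only the root knows the real count up front. */
  if (rank == 0) count = 1142;

  /* First broadcast the count so all ranks agree on it... */
  MPI_Bcast(&count, 1, MPI_INT, 0, MPI_COMM_WORLD);

  /* ...then every rank passes that same count for the data. */
  buf = malloc(count * sizeof(int));
  if (rank == 0)
    for (int i = 0; i < count; i++) buf[i] = i;
  MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);

  if (rank != 0) printf("rank %d received %d ints\n", rank, count);

  free(buf);
  MPI_Finalize();
  return 0;
}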