Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?

2011-02-24 Thread George Bosilca
The issue has been identified deep in the tuned collective component. It was 
fixed in the trunk and in 1.5 a while back, but never pushed into the 1.4 series. 
I attached a patch to the ticket and will force its way into the next 1.4 release.

  Thanks,
george.

On Feb 14, 2011, at 13:11 , Jeff Squyres wrote:

> Thanks Jeremiah; I filed the following ticket about this:
> 
> https://svn.open-mpi.org/trac/ompi/ticket/2723
> 
> 
> On Feb 10, 2011, at 3:24 PM, Jeremiah Willcock wrote:
> 
>> I forgot to mention that this was tested with 3 or 4 ranks, connected via 
>> TCP.
>> 
>> -- Jeremiah Willcock
>> 
>> On Thu, 10 Feb 2011, Jeremiah Willcock wrote:
>> 
>>> Here is a small test case that hits the bug on 1.4.1:
>>> 
>>> #include <mpi.h>
>>> 
>>> int arr[1142];
>>> 
>>> int main(int argc, char** argv) {
>>> int rank, my_size;
>>> MPI_Init(&argc, &argv);
>>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>> my_size = (rank == 1) ? 1142 : 1088;
>>> MPI_Bcast(arr, my_size, MPI_INT, 0, MPI_COMM_WORLD);
>>> MPI_Finalize();
>>> return 0;
>>> }
>>> 
>>> I tried it on 1.5.1, and I get MPI_ERR_TRUNCATE instead, so this might have 
>>> already been fixed.
>>> 
>>> -- Jeremiah Willcock
>>> 
>>> 
>>> On Thu, 10 Feb 2011, Jeremiah Willcock wrote:
>>> 
 FYI, I am having trouble finding a small test case that will trigger this 
 on 1.5; I'm either getting deadlocks or MPI_ERR_TRUNCATE, so it could have 
 been fixed.  What are the triggering rules for different broadcast 
 algorithms?  It could be that only certain sizes or only certain BTLs 
 trigger it.
 -- Jeremiah Willcock
 On Thu, 10 Feb 2011, Jeff Squyres wrote:
> Nifty!  Yes, I agree that that's a poor error message.  It's probably 
> (unfortunately) being propagated up from the underlying point-to-point 
> system, where an ERR_IN_STATUS would actually make sense.
> I'll file a ticket about this.  Thanks for the heads up.
> On Feb 9, 2011, at 4:49 PM, Jeremiah Willcock wrote:
>> On Wed, 9 Feb 2011, Jeremiah Willcock wrote:
>>> I get the following Open MPI error from 1.4.1:
>>> *** An error occurred in MPI_Bcast
>>> *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
>>> *** MPI_ERR_IN_STATUS: error code in status
>>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>> (hostname and port removed from each line).  There is no MPI_Status 
>> returned by MPI_Bcast, so I don't know what the error is.  Is this 
>>> something that people have seen before?
>> For the record, this appears to be caused by specifying inconsistent 
>> data sizes on the different ranks in the broadcast operation.  The error 
>> message could still be improved, though.
>> -- Jeremiah Willcock
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 

"I disapprove of what you say, but I will defend to the death your right to say 
it"
  -- Evelyn Beatrice Hall




Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?

2011-02-14 Thread Jeff Squyres
Thanks Jeremiah; I filed the following ticket about this:

https://svn.open-mpi.org/trac/ompi/ticket/2723


On Feb 10, 2011, at 3:24 PM, Jeremiah Willcock wrote:

> I forgot to mention that this was tested with 3 or 4 ranks, connected via TCP.
> 
> -- Jeremiah Willcock
> 
> On Thu, 10 Feb 2011, Jeremiah Willcock wrote:
> 
>> Here is a small test case that hits the bug on 1.4.1:
>> 
>> #include <mpi.h>
>> 
>> int arr[1142];
>> 
>> int main(int argc, char** argv) {
>> int rank, my_size;
>> MPI_Init(&argc, &argv);
>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> my_size = (rank == 1) ? 1142 : 1088;
>> MPI_Bcast(arr, my_size, MPI_INT, 0, MPI_COMM_WORLD);
>> MPI_Finalize();
>> return 0;
>> }
>> 
>> I tried it on 1.5.1, and I get MPI_ERR_TRUNCATE instead, so this might have 
>> already been fixed.
>> 
>> -- Jeremiah Willcock
>> 
>> 
>> On Thu, 10 Feb 2011, Jeremiah Willcock wrote:
>> 
>>> FYI, I am having trouble finding a small test case that will trigger this 
>>> on 1.5; I'm either getting deadlocks or MPI_ERR_TRUNCATE, so it could have 
>>> been fixed.  What are the triggering rules for different broadcast 
>>> algorithms?  It could be that only certain sizes or only certain BTLs 
>>> trigger it.
>>> -- Jeremiah Willcock
>>> On Thu, 10 Feb 2011, Jeff Squyres wrote:
 Nifty!  Yes, I agree that that's a poor error message.  It's probably 
 (unfortunately) being propagated up from the underlying point-to-point 
 system, where an ERR_IN_STATUS would actually make sense.
 I'll file a ticket about this.  Thanks for the heads up.
 On Feb 9, 2011, at 4:49 PM, Jeremiah Willcock wrote:
> On Wed, 9 Feb 2011, Jeremiah Willcock wrote:
>> I get the following Open MPI error from 1.4.1:
>> *** An error occurred in MPI_Bcast
>> *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
>> *** MPI_ERR_IN_STATUS: error code in status
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> (hostname and port removed from each line).  There is no MPI_Status 
>> returned by MPI_Bcast, so I don't know what the error is.  Is this 
>> something that people have seen before?
> For the record, this appears to be caused by specifying inconsistent data 
> sizes on the different ranks in the broadcast operation.  The error 
> message could still be improved, though.
> -- Jeremiah Willcock
 -- 
 Jeff Squyres
 jsquy...@cisco.com
 For corporate legal information go to:
 http://www.cisco.com/web/about/doing_business/legal/cri/
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?

2011-02-10 Thread Jeremiah Willcock
I forgot to mention that this was tested with 3 or 4 ranks, connected via 
TCP.


-- Jeremiah Willcock

On Thu, 10 Feb 2011, Jeremiah Willcock wrote:


Here is a small test case that hits the bug on 1.4.1:

#include <mpi.h>

int arr[1142];

int main(int argc, char** argv) {
 int rank, my_size;
 MPI_Init(&argc, &argv);
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 my_size = (rank == 1) ? 1142 : 1088;
 MPI_Bcast(arr, my_size, MPI_INT, 0, MPI_COMM_WORLD);
 MPI_Finalize();
 return 0;
}

I tried it on 1.5.1, and I get MPI_ERR_TRUNCATE instead, so this might have 
already been fixed.


-- Jeremiah Willcock


On Thu, 10 Feb 2011, Jeremiah Willcock wrote:

FYI, I am having trouble finding a small test case that will trigger this 
on 1.5; I'm either getting deadlocks or MPI_ERR_TRUNCATE, so it could have 
been fixed.  What are the triggering rules for different broadcast 
algorithms?  It could be that only certain sizes or only certain BTLs 
trigger it.


-- Jeremiah Willcock

On Thu, 10 Feb 2011, Jeff Squyres wrote:

Nifty!  Yes, I agree that that's a poor error message.  It's probably 
(unfortunately) being propagated up from the underlying point-to-point 
system, where an ERR_IN_STATUS would actually make sense.


I'll file a ticket about this.  Thanks for the heads up.


On Feb 9, 2011, at 4:49 PM, Jeremiah Willcock wrote:


On Wed, 9 Feb 2011, Jeremiah Willcock wrote:


I get the following Open MPI error from 1.4.1:

*** An error occurred in MPI_Bcast
*** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
*** MPI_ERR_IN_STATUS: error code in status
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

(hostname and port removed from each line).  There is no MPI_Status 
returned by MPI_Bcast, so I don't know what the error is.  Is this 
something that people have seen before?


For the record, this appears to be caused by specifying inconsistent data 
sizes on the different ranks in the broadcast operation.  The error 
message could still be improved, though.


-- Jeremiah Willcock



--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/





Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?

2011-02-10 Thread Jeremiah Willcock

Here is a small test case that hits the bug on 1.4.1:

#include <mpi.h>

int arr[1142];

int main(int argc, char** argv) {
  int rank, my_size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  my_size = (rank == 1) ? 1142 : 1088;  /* rank 1 passes a larger count than the root: counts are inconsistent */
  MPI_Bcast(arr, my_size, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}

I tried it on 1.5.1, and I get MPI_ERR_TRUNCATE instead, so this might 
have already been fixed.


-- Jeremiah Willcock


On Thu, 10 Feb 2011, Jeremiah Willcock wrote:

FYI, I am having trouble finding a small test case that will trigger this on 
1.5; I'm either getting deadlocks or MPI_ERR_TRUNCATE, so it could have been 
fixed.  What are the triggering rules for different broadcast algorithms?  It 
could be that only certain sizes or only certain BTLs trigger it.


-- Jeremiah Willcock

On Thu, 10 Feb 2011, Jeff Squyres wrote:

Nifty!  Yes, I agree that that's a poor error message.  It's probably 
(unfortunately) being propagated up from the underlying point-to-point 
system, where an ERR_IN_STATUS would actually make sense.


I'll file a ticket about this.  Thanks for the heads up.


On Feb 9, 2011, at 4:49 PM, Jeremiah Willcock wrote:


On Wed, 9 Feb 2011, Jeremiah Willcock wrote:


I get the following Open MPI error from 1.4.1:

*** An error occurred in MPI_Bcast
*** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
*** MPI_ERR_IN_STATUS: error code in status
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

(hostname and port removed from each line).  There is no MPI_Status 
returned by MPI_Bcast, so I don't know what the error is.  Is this 
something that people have seen before?


For the record, this appears to be caused by specifying inconsistent data 
sizes on the different ranks in the broadcast operation.  The error 
message could still be improved, though.


-- Jeremiah Willcock



--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/





Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?

2011-02-10 Thread Jeremiah Willcock
FYI, I am having trouble finding a small test case that will trigger this 
on 1.5; I'm either getting deadlocks or MPI_ERR_TRUNCATE, so it could have 
been fixed.  What are the triggering rules for different broadcast 
algorithms?  It could be that only certain sizes or only certain BTLs 
trigger it.


-- Jeremiah Willcock

On Thu, 10 Feb 2011, Jeff Squyres wrote:


Nifty!  Yes, I agree that that's a poor error message.  It's probably 
(unfortunately) being propagated up from the underlying point-to-point system, 
where an ERR_IN_STATUS would actually make sense.

I'll file a ticket about this.  Thanks for the heads up.


On Feb 9, 2011, at 4:49 PM, Jeremiah Willcock wrote:


On Wed, 9 Feb 2011, Jeremiah Willcock wrote:


I get the following Open MPI error from 1.4.1:

*** An error occurred in MPI_Bcast
*** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
*** MPI_ERR_IN_STATUS: error code in status
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

(hostname and port removed from each line).  There is no MPI_Status returned by 
MPI_Bcast, so I don't know what the error is.  Is this something that people 
have seen before?


For the record, this appears to be caused by specifying inconsistent data sizes 
on the different ranks in the broadcast operation.  The error message could 
still be improved, though.

-- Jeremiah Willcock



--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/





Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?

2011-02-10 Thread Jeff Squyres
Nifty!  Yes, I agree that that's a poor error message.  It's probably 
(unfortunately) being propagated up from the underlying point-to-point system, 
where an ERR_IN_STATUS would actually make sense.

I'll file a ticket about this.  Thanks for the heads up.
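
For comparison, here is a minimal sketch (not code from this thread) of the
point-to-point style of completion where MPI_ERR_IN_STATUS actually carries
information: multi-request calls such as MPI_Waitall can return it, and the
per-request error code is then found in each status's MPI_ERROR field, which
is exactly the detail MPI_Bcast has no status to hand back.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, sendval, recvval, rc, i;
    MPI_Request reqs[2];
    MPI_Status stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Return error codes instead of aborting, so they can be inspected. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Each rank posts one self-send and one self-receive, purely to have
       two outstanding requests that complete together. */
    sendval = rank;
    MPI_Irecv(&recvval, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &reqs[1]);

    rc = MPI_Waitall(2, reqs, stats);
    if (rc == MPI_ERR_IN_STATUS) {
        /* Only here does "error code in status" mean something: each
           request's own error is in stats[i].MPI_ERROR. */
        for (i = 0; i < 2; ++i) {
            if (stats[i].MPI_ERROR != MPI_SUCCESS) {
                char msg[MPI_MAX_ERROR_STRING];
                int len;
                MPI_Error_string(stats[i].MPI_ERROR, msg, &len);
                fprintf(stderr, "request %d failed: %s\n", i, msg);
            }
        }
    }

    MPI_Finalize();
    return 0;
}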


On Feb 9, 2011, at 4:49 PM, Jeremiah Willcock wrote:

> On Wed, 9 Feb 2011, Jeremiah Willcock wrote:
> 
>> I get the following Open MPI error from 1.4.1:
>> 
>> *** An error occurred in MPI_Bcast
>> *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
>> *** MPI_ERR_IN_STATUS: error code in status
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> 
>> (hostname and port removed from each line).  There is no MPI_Status returned 
>> by MPI_Bcast, so I don't know what the error is.  Is this something that 
>> people have seen before?
> 
> For the record, this appears to be caused by specifying inconsistent data 
> sizes on the different ranks in the broadcast operation.  The error message 
> could still be improved, though.
> 
> -- Jeremiah Willcock


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] MPI_ERR_IN_STATUS from MPI_Bcast?

2011-02-09 Thread Jeremiah Willcock

On Wed, 9 Feb 2011, Jeremiah Willcock wrote:


I get the following Open MPI error from 1.4.1:

*** An error occurred in MPI_Bcast
*** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
*** MPI_ERR_IN_STATUS: error code in status
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

(hostname and port removed from each line).  There is no MPI_Status returned 
by MPI_Bcast, so I don't know what the error is.  Is this something that 
people have seen before?
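
One way to see the underlying error despite the missing status argument is to
set MPI_ERRORS_RETURN on the communicator and decode MPI_Bcast's return value
with MPI_Error_string. A debugging sketch (not code from the original message);
note that whether the job can usefully continue, or even return from the call,
after such an error is implementation-dependent:

#include <mpi.h>
#include <stdio.h>

int arr[1142];

int main(int argc, char **argv) {
    int rank, my_size, rc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Ask for error codes back instead of the default abort. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Counts intentionally differ across ranks, as in the test case
       posted elsewhere in this thread. */
    my_size = (rank == 1) ? 1142 : 1088;
    rc = MPI_Bcast(arr, my_size, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: MPI_Bcast failed: %s\n", rank, msg);
    }

    MPI_Finalize();
    return 0;
}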


For the record, this appears to be caused by specifying inconsistent data 
sizes on the different ranks in the broadcast operation.  The error 
message could still be improved, though.
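
For comparison, a conforming version of the test case (a sketch, not code from
the original messages): MPI_Bcast requires every rank, root included, to pass a
matching count and datatype.

#include <mpi.h>

int arr[1142];

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* All ranks use the same count as the root. */
    MPI_Bcast(arr, 1142, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}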


-- Jeremiah Willcock