Re: [OMPI users] Bad MPI_Bcast behaviour when running over openib

2009-09-11 Thread Ake Sandgren
On Fri, 2009-09-11 at 13:18 +0200, Ake Sandgren wrote:
> Hi!
> 
> The following code shows a bad behaviour when running over openib.

Oops. Red face, big time.
I happened to run the IB test between two systems that don't have IB
connectivity.

Goes and hides in a dark corner...
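
One way to avoid that trap, for what it's worth, is to force the run to use
only the openib BTL, e.g.

  mpirun -mca btl openib,self -np 2 -host nodeA,nodeB a.out

so the job aborts instead of silently falling back to tcp when there is no IB
path between the hosts. nodeA/nodeB are placeholders here, and
"ompi_info | grep btl" shows which BTL components are actually built; treat
the exact flags as a sketch rather than a recipe tested against this setup.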

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Bad MPI_Bcast behaviour when running over openib

2009-09-11 Thread Jeff Squyres
Cisco is no longer an IB vendor, but I seem to recall that these kinds  
of errors typically indicated a fabric problem.  Have you run layer 0  
and 1 diagnostics to ensure that the fabric is clean?
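
In practice, "layer 0 and 1 diagnostics" usually means the standard InfiniBand
diagnostic tools. Exactly which ones are available depends on the OFED stack
installed, so take the commands below as a sketch rather than a prescription:

  ibstat          # per-HCA port state, link width and speed
  ibhosts         # hosts visible to the subnet manager
  ibcheckerrors   # sweep the fabric for ports with non-zero error counters
  ibdiagnet       # full fabric sweep: bad links, topology problems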



On Sep 11, 2009, at 8:09 AM, Rolf Vandevaart wrote:


Hi, how exactly do you run this to get this error?  I tried and it
worked for me.

burl-ct-x2200-16 50 =>mpirun -mca btl_openib_warn_default_gid_prefix 0
-mca btl self,sm,openib -np 2 -host burl-ct-x2200-16,burl-ct-x2200-17
-mca btl_openib_ib_timeout 16 a.out
I am 0 at 1252670691
I am 1 at 1252670559
I am 0 at 1252670692
I am 1 at 1252670559
  burl-ct-x2200-16 51 =>

Rolf

On 09/11/09 07:18, Ake Sandgren wrote:
> Hi!
>
> The following code shows a bad behaviour when running over openib.
>
> Openmpi: 1.3.3
> With openib it dies with "error polling HP CQ with status WORK REQUEST
> FLUSHED ERROR status number 5 ", with tcp or shmem it works as expected.
>
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <time.h>
> #include "mpi.h"
>
> int main(int argc, char *argv[])
> {
>     int rank;
>     int n;
>
>     MPI_Init(&argc, &argv);
>
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     fprintf(stderr, "I am %d at %d\n", rank, (int)time(NULL));
>     fflush(stderr);
>
>     n = 4;
>     MPI_Bcast(&n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD);
>     fprintf(stderr, "I am %d at %d\n", rank, (int)time(NULL));
>     fflush(stderr);
>     if (rank == 0) {
>         /* rank 0 stays out of the MPI library for a long time */
>         sleep(60);
>     }
>     MPI_Barrier(MPI_COMM_WORLD);
>
>     MPI_Finalize();
>     exit(0);
> }
>
> I know about the internal Open MPI reason for it to behave as it does,
> but I think that the code should be allowed to behave as it does.
>
> This example is a bit contrived, but there are codes where a similar
> situation can occur, i.e. the Bcast sender doing lots of other work
> after the Bcast before the next MPI call. VASP is a candidate for this.
>


--

=
rolf.vandeva...@sun.com
781-442-3043
=
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Bad MPI_Bcast behaviour when running over openib

2009-09-11 Thread Rolf Vandevaart
Hi, how exactly do you run this to get this error?  I tried and it 
worked for me.


burl-ct-x2200-16 50 =>mpirun -mca btl_openib_warn_default_gid_prefix 0 
-mca btl self,sm,openib -np 2 -host burl-ct-x2200-16,burl-ct-x2200-17 
-mca btl_openib_ib_timeout 16 a.out

I am 0 at 1252670691
I am 1 at 1252670559
I am 0 at 1252670692
I am 1 at 1252670559
 burl-ct-x2200-16 51 =>

Rolf

On 09/11/09 07:18, Ake Sandgren wrote:

Hi!

The following code shows a bad behaviour when running over openib.

Openmpi: 1.3.3
With openib it dies with "error polling HP CQ with status WORK REQUEST
FLUSHED ERROR status number 5 ", with tcp or shmem it works as expected.


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    int n;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    fprintf(stderr, "I am %d at %d\n", rank, (int)time(NULL));
    fflush(stderr);

    n = 4;
    MPI_Bcast(&n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD);
    fprintf(stderr, "I am %d at %d\n", rank, (int)time(NULL));
    fflush(stderr);
    if (rank == 0) {
        /* rank 0 stays out of the MPI library for a long time */
        sleep(60);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    exit(0);
}

I know about the internal Open MPI reason for it to behave as it does,
but I think that the code should be allowed to behave as it does.

This example is a bit contrived, but there are codes where a similar
situation can occur, i.e. the Bcast sender doing lots of other work
after the Bcast before the next MPI call. VASP is a candidate for this.




--

=
rolf.vandeva...@sun.com
781-442-3043
=


[OMPI users] Bad MPI_Bcast behaviour when running over openib

2009-09-11 Thread Ake Sandgren
Hi!

The following code shows a bad behaviour when running over openib.

Openmpi: 1.3.3
With openib it dies with "error polling HP CQ with status WORK REQUEST
FLUSHED ERROR status number 5 ", with tcp or shmem it works as expected.


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    int n;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    fprintf(stderr, "I am %d at %d\n", rank, (int)time(NULL));
    fflush(stderr);

    n = 4;
    MPI_Bcast(&n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD);
    fprintf(stderr, "I am %d at %d\n", rank, (int)time(NULL));
    fflush(stderr);
    if (rank == 0) {
        /* rank 0 stays out of the MPI library for a long time */
        sleep(60);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    exit(0);
}

I know about the internal Open MPI reason for it to behave as it does,
but I think that the code should be allowed to behave as it does.

This example is a bit contrived, but there are codes where a similar
situation can occur, i.e. the Bcast sender doing lots of other work
after the Bcast before the next MPI call. VASP is a candidate for this.
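
A generic way to keep a pattern like this alive, when a non-blocking broadcast
is not available, is for the root to re-enter the MPI library now and then
during its long stretch of work, e.g. with a cheap MPI_Iprobe, so the library
can keep progressing the outstanding broadcast traffic. A minimal sketch, not
verified against this reproducer, that would stand in for the sleep(60) above:

#include <unistd.h>   /* sleep */
#include "mpi.h"

/* Do "seconds" seconds of work in 1-second slices, re-entering the MPI
   library between slices so it can progress any outstanding traffic. */
static void long_work_with_progress(int seconds)
{
    int i, flag;
    for (i = 0; i < seconds; i++) {
        sleep(1);   /* stands in for one slice of real computation */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                   &flag, MPI_STATUS_IGNORE);
    }
}

Calling long_work_with_progress(60) where the reproducer calls sleep(60) keeps
rank 0 entering the library once a second; whether that is enough to avoid the
openib error on a given fabric is an assumption, not something tested here.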

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se