Re: [OMPI users] Bad MPI_Bcast behaviour when running over openib
On Fri, 2009-09-11 at 13:18 +0200, Ake Sandgren wrote:
> Hi!
>
> The following code shows a bad behaviour when running over openib.

Oops. Red face, big time. I happened to run the IB test between two
systems that don't have IB connectivity. Goes and hides in a dark
corner...

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se  Phone: +46 90 7866134  Fax: +46 90 7866126
Mobile: +46 70 7716134  WWW: http://www.hpc2n.umu.se
Re: [OMPI users] Bad MPI_Bcast behaviour when running over openib
Cisco is no longer an IB vendor, but I seem to recall that these kinds
of errors typically indicated a fabric problem. Have you run layer 0
and layer 1 diagnostics to ensure that the fabric is clean?

On Sep 11, 2009, at 8:09 AM, Rolf Vandevaart wrote:

> Hi, how exactly do you run this to get this error? I tried and it
> worked for me.
>
> burl-ct-x2200-16 50 =>mpirun -mca btl_openib_warn_default_gid_prefix 0
> -mca btl self,sm,openib -np 2 -host burl-ct-x2200-16,burl-ct-x2200-17
> -mca btl_openib_ib_timeout 16 a.out
> I am 0 at 1252670691
> I am 1 at 1252670559
> I am 0 at 1252670692
> I am 1 at 1252670559
> burl-ct-x2200-16 51 =>
>
> Rolf
>
> On 09/11/09 07:18, Ake Sandgren wrote:
> [...]
>
> -- 
> = rolf.vandeva...@sun.com  781-442-3043 =

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquy...@cisco.com
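For reference, the layer 0/1 checks suggested above are typically done with
the standard OFED/infiniband-diags command-line tools (tool names below are
those utilities; the exact options available depend on the installed OFED
release, so treat this as a sketch rather than a prescribed procedure):

```shell
# Check the local HCA: the port should be "Active" with a valid LID.
ibstat

# Show the state and speed of every link in the fabric; look for links
# stuck in INIT or running below the expected width/speed.
iblinkinfo

# Sweep the whole fabric: reports bad links, duplicate GUIDs,
# excessive symbol errors, and similar problems.
ibdiagnet

# Per-port error counters; rising SymbolErrors/LinkRecovers usually
# point at a bad cable or connector.
ibcheckerrors
```

All of these must be run on a node attached to the fabric (some require
root or access to the subnet manager), which is why they are shown here
only as a checklist.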
Re: [OMPI users] Bad MPI_Bcast behaviour when running over openib
Hi, how exactly do you run this to get this error? I tried and it
worked for me.

burl-ct-x2200-16 50 =>mpirun -mca btl_openib_warn_default_gid_prefix 0
-mca btl self,sm,openib -np 2 -host burl-ct-x2200-16,burl-ct-x2200-17
-mca btl_openib_ib_timeout 16 a.out
I am 0 at 1252670691
I am 1 at 1252670559
I am 0 at 1252670692
I am 1 at 1252670559
burl-ct-x2200-16 51 =>

Rolf

On 09/11/09 07:18, Ake Sandgren wrote:
> Hi!
>
> The following code shows a bad behaviour when running over openib.
>
> Openmpi: 1.3.3
> With openib it dies with "error polling HP CQ with status WORK REQUEST
> FLUSHED ERROR status number 5", with tcp or shmem it works as expected.
>
> [...]

-- 
= rolf.vandeva...@sun.com  781-442-3043 =
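Since the original report says the same binary works over tcp and shared
memory, one quick way to isolate the transport is to run the identical
command with only the BTL selection changed (a sketch reusing the host
names from Rolf's example; on a different cluster substitute your own):

```shell
# Force TCP (the combination reported to work):
mpirun -mca btl self,sm,tcp -np 2 \
    -host burl-ct-x2200-16,burl-ct-x2200-17 a.out

# Force openib (the failing case in the report):
mpirun -mca btl self,sm,openib -np 2 \
    -host burl-ct-x2200-16,burl-ct-x2200-17 a.out
```

If only the openib run fails, that points at the IB path (fabric or
openib BTL) rather than at the application code.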
[OMPI users] Bad MPI_Bcast behaviour when running over openib
Hi!

The following code shows a bad behaviour when running over openib.

Openmpi: 1.3.3
With openib it dies with "error polling HP CQ with status WORK REQUEST
FLUSHED ERROR status number 5", with tcp or shmem it works as expected.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    int n;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    fprintf(stderr, "I am %d at %d\n", rank, time(NULL));
    fflush(stderr);

    n = 4;
    MPI_Bcast(&n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD);
    fprintf(stderr, "I am %d at %d\n", rank, time(NULL));
    fflush(stderr);
    if (rank == 0) {
        sleep(60);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    exit(0);
}

I know about the internal Open MPI reason for it behaving as it does,
but I think that this usage should still be allowed.

This example is a bit engineered, but there are codes where a similar
situation can occur, i.e. the Bcast sender doing lots of other work
after the Bcast before the next MPI call. VASP is a candidate for this.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se  Phone: +46 90 7866134  Fax: +46 90 7866126
Mobile: +46 70 7716134  WWW: http://www.hpc2n.umu.se
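A minimal compile-and-run sketch for the reproducer above (the source
file name and the host names are placeholders, not from the original
report; `mpicc` and `mpirun` are the standard Open MPI wrappers):

```shell
# Build the reproducer, assuming it is saved as bcast_test.c.
mpicc -O2 -o bcast_test bcast_test.c

# Run one rank on each of two IB-connected nodes, restricting Open MPI
# to the self, shared-memory, and openib BTLs as in the report.
mpirun -mca btl self,sm,openib -np 2 -host node1,node2 ./bcast_test
```

Both ranks should print their second "I am ... at ..." line almost
immediately after the Bcast; the reported failure appears during the
60-second window in which rank 0 sleeps before the Barrier.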