Re: [OMPI users] SIGV at MPI_Cart_sub
Anyway, after compiling my code with icc/11.1.069, the job runs without hanging or the SIGSEGV that occurred before when using the icc/12.1.0 module. I should also point out that with icc/12.1.0 I was getting strange output or hangs, which I worked around by renaming parameters inside the function. For example, for a function declared as

    time(..., size_t *P, ...) { }

and called as

    time(.., p, ..);

I had to stop using the name *P inside the time function, as follows:

    time(..., size_t *P, ...)
    {
        int bestP = *P;
        // and maybe again, as with the later bug that I solved:
        int bP = bestP;
        // then start using bP :)
        ...
    }

Thanks guys for the help; I guess the problem is solved by compiling with the old compiler.
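For what it's worth, the workaround described above can be sketched as follows. The doubled copy is taken from the email, but the function name best_time, the signature, and the placeholder body are assumptions, not the poster's actual code:

```c
#include <stddef.h>

/* Sketch of the renaming workaround described above.  The point is
 * the pattern: dereference the pointer parameter P exactly once into
 * a local, then use only the local from there on (here copied twice,
 * as in the email). */
static int best_time(size_t *P)
{
    int bestP = (int)*P;   /* first copy */
    int bP = bestP;        /* second copy, as with the later bug */
    /* ... only bP is used from here on ... */
    return bP;
}
```

If the miscompilation really involved the pointer parameter, this pattern would merely hide it; it is a workaround, not a fix.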
Re: [OMPI users] SIGV at MPI_Cart_sub
It is a good question; I asked it myself at first, but then decided it should be correct. Anyway, to confirm it, here is the code snippet of the program:

    ...
    int ranks[size];
    for (i = 0; i < size; ++i) {
        ranks[i] = i;
    }
    ...
    for (p = 8; p <= size; p += 4) {
        MPI_Barrier(MPI_COMM_WORLD);
        if (!grid_init(p, 1))
            continue;
        if ((p >= m) || (p >= k) || (p >= n))
            break;
        MPI_Group_incl(world_group, p, ranks, &working_group);
        MPI_Comm_create(MPI_COMM_WORLD, working_group, &working_comm);
        if (working_comm != MPI_COMM_NULL) {
            ...
            variant_run(..., C, m, k, n, my_rank, p, working_comm);
            ...
            MPI_Group_free(&working_group);
            MPI_Comm_free(&working_comm);
        }
    }

Inside variant_run, it calls this function, where the error is:

    void Compute_SUMMA1(Matrix *A, Matrix *B, Matrix *C,
                        size_t M, size_t K, size_t N,
                        size_t my_rank, size_t size, MPI_Comm comm)
    {
        C->block_matrix = gsl_matrix_calloc(A->block_matrix->size1,
                                            B->block_matrix->size2);
        C->distribution_type = TwoD_Block;

        MPI_Comm grid_comm;
        int dim[2], period[2], reorder = 0, ndims = 2;
        int coord[2], id;
        dim[0] = global.PR;
        dim[1] = global.PC;
        period[0] = 0;
        period[1] = 0;

        int ss, rr;
        MPI_Group comm_group;
        MPI_Comm_group(comm, &comm_group);
        MPI_Group_size(comm_group, &ss);
        MPI_Group_rank(comm_group, &rr);
        if (ss == 6) {
            //printf("M %d K %d N %d\n", M, K, N);
            //printf("my_rank in comm %d my_rank in world_comm %d\n", rr, my_rank);
            //printf("comm size %d my_rank in comm %d my_rank in world_comm %d\n", ss, rr, my_rank);
            //printf("SUMMA ... PR %d PC %d\n", global.PR, global.PC);
        }

        //MPI_Barrier(comm);
        //if (my_rank == 0)
        //    printf("my_rank %d ndims %d dim[0] %d dim[1] %d period[0] %d period[1] %d reorder %d\n",
        //           my_rank, ndims, dim[0], dim[1], period[0], period[1], reorder);
        //if (comm == MPI_COMM_NULL)
        //    printf("my_rank %d comm is empty\n", my_rank);

        MPI_Cart_create(comm, ndims, dim, period, reorder, &grid_comm);

        MPI_Comm Acomm, Bcomm;
        // create row and column subgrids
        int remain[2]; //, mdims, dims[2], row_coords[2];
        remain[0] = 1;
        remain[1] = 0;
        MPI_Cart_sub(grid_comm, remain, &Acomm);
        remain[0] = 0;
        remain[1] = 1;
        MPI_Cart_sub(grid_comm, remain, &Bcomm);
        ...
    }

As you can see, all ranks call grid_init, a global function that computes the grid dims: executed for 24 ranks it produces 4x6, and for 96 it produces 8x12, storing the result in a global structure as PR and PC. It is executed by all processes; I checked rank 0 and some other processes and the result is correct, so I assume it is correct for all the other processes as well. So grid_comm, which is the input to MPI_Cart_sub, is correct. The ranks in working_comm and in MPI_COMM_WORLD should be the same, and that holds given how the rank array is filled at the beginning of this snippet.

On Tue, Jan 10, 2012 at 5:25 PM, Jeff Squyres wrote:
> This may be a dumb question, but are you 100% sure that the input values
> are correct?
>
> On Jan 10, 2012, at 8:16 AM, Anas Al-Trad wrote:
>
> > Hi Ralph, I changed the intel icc module from 12.1.0 to 11.1.069, the
> > previous default one used on the Neolith Cluster. I submitted the job and I
> > am still waiting for the result.
> > Here is the message of the segmentation fault:
> >
> > [n764:29867] *** Process received signal ***
> > [n764:29867] Signal: Floating point exception (8)
> > [n764:29867] Signal code: Integer divide-by-zero (1)
> > [n764:29867] Failing at address: 0x2ba640e74627
> > [n764:29867] [ 0] /lib64/libc.so.6 [0x2ba641e162d0]
> > [n764:29867] [ 1] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_coords+0x43) [0x2ba640e74627]
> > [n764:29867] [ 2] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_sub+0x1d5) [0x2ba640e74acd]
> > [n764:29867] [ 3] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(MPI_Cart_sub+0x35) [0x2ba640e472d9]
> > [n764:29867] [ 4] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(Compute_SUMMA1+0x226) [0x4088da]
> > [n764:29867] [ 5] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(variant_run+0xb2) [0x409058]
> > [n764:29867] [ 6] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(main+0xf90) [0x40eeba]
> > [n764:29867] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba641e03994]
> > [n764:29867] [ 8] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o [0x403fd9]
> > [n764:29867] *** End of error message ***
> >
> > when I run my application, sometimes I get this error and sometimes it
> > is stuck in the middle.
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
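The snippet above never shows grid_init itself. As a minimal sketch (the name choose_grid_dims and everything in the body are assumptions, not the poster's actual code), a helper producing the 4x6 and 8x12 grids described in the thread, together with a validation that would catch zeroed dims before they ever reach MPI_Cart_create, could look like this:

```c
#include <stddef.h>

/* Hypothetical sketch of a grid_init-style helper: pick the most
 * "square" factorization PR x PC of p, with PR <= PC.  Returns 1 if
 * the resulting grid is usable, 0 otherwise. */
static int choose_grid_dims(int p, int *PR, int *PC)
{
    int d, pr = 1;
    /* find the largest divisor of p that is <= sqrt(p) */
    for (d = 1; d * d <= p; ++d)
        if (p % d == 0)
            pr = d;
    *PR = pr;
    *PC = p / pr;
    /* dims must be positive and cover exactly p processes; a zeroed
     * dim here is precisely what makes the Cartesian coordinate
     * arithmetic divide by zero later */
    return (*PR > 0 && *PC > 0 && *PR * *PC == p);
}
```

For p = 24 this yields 4x6 and for p = 96 it yields 8x12, matching the grids described in the thread; asserting the return value (and the PR/PC globals) immediately before MPI_Cart_create would rule corrupted dims in or out as the cause of the SIGFPE.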
Re: [OMPI users] SIGV at MPI_Cart_sub
This may be a dumb question, but are you 100% sure that the input values are correct?

On Jan 10, 2012, at 8:16 AM, Anas Al-Trad wrote:

> Hi Ralph, I changed the intel icc module from 12.1.0 to 11.1.069, the
> previous default one used on the Neolith Cluster. I submitted the job and I
> am still waiting for the result. Here is the message of the segmentation fault:
>
> [n764:29867] *** Process received signal ***
> [n764:29867] Signal: Floating point exception (8)
> [n764:29867] Signal code: Integer divide-by-zero (1)
> [n764:29867] Failing at address: 0x2ba640e74627
> [n764:29867] [ 0] /lib64/libc.so.6 [0x2ba641e162d0]
> [n764:29867] [ 1] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_coords+0x43) [0x2ba640e74627]
> [n764:29867] [ 2] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_sub+0x1d5) [0x2ba640e74acd]
> [n764:29867] [ 3] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(MPI_Cart_sub+0x35) [0x2ba640e472d9]
> [n764:29867] [ 4] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(Compute_SUMMA1+0x226) [0x4088da]
> [n764:29867] [ 5] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(variant_run+0xb2) [0x409058]
> [n764:29867] [ 6] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(main+0xf90) [0x40eeba]
> [n764:29867] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba641e03994]
> [n764:29867] [ 8] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o [0x403fd9]
> [n764:29867] *** End of error message ***
>
> when I run my application, sometimes I get this error and sometimes it is
> stuck in the middle.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] SIGV at MPI_Cart_sub
Hi Ralph, I changed the intel icc module from 12.1.0 to 11.1.069, the previous default one used on the Neolith Cluster. I submitted the job and I am still waiting for the result. Here is the message of the segmentation fault:

[n764:29867] *** Process received signal ***
[n764:29867] Signal: Floating point exception (8)
[n764:29867] Signal code: Integer divide-by-zero (1)
[n764:29867] Failing at address: 0x2ba640e74627
[n764:29867] [ 0] /lib64/libc.so.6 [0x2ba641e162d0]
[n764:29867] [ 1] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_coords+0x43) [0x2ba640e74627]
[n764:29867] [ 2] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_sub+0x1d5) [0x2ba640e74acd]
[n764:29867] [ 3] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(MPI_Cart_sub+0x35) [0x2ba640e472d9]
[n764:29867] [ 4] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(Compute_SUMMA1+0x226) [0x4088da]
[n764:29867] [ 5] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(variant_run+0xb2) [0x409058]
[n764:29867] [ 6] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(main+0xf90) [0x40eeba]
[n764:29867] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba641e03994]
[n764:29867] [ 8] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o [0x403fd9]
[n764:29867] *** End of error message ***

When I run my application, sometimes I get this error and sometimes it gets stuck in the middle.
Re: [OMPI users] SIGV at MPI_Cart_sub
Have you tried the suggested fix from the email thread Paul cited? That sounds to me like the most likely cause of the problem, assuming it comes from inside OMPI. Have you looked at the backtrace to see whether it is indeed inside OMPI vs. your code?

On Jan 10, 2012, at 6:13 AM, Anas Al-Trad wrote:

> Thanks Paul,
> yes I use Intel 12.1.0, and this error is intermittent, not always produced
> but most of the times it occurs.
> My program is large and contains many files that are related to each other, I
> don't think it will help if I take the snippet of the code. The program runs
> parallel matrix multiplication algorithms. I don't know if it is because of
> my code or not, but I run the program for small matrices sizes and the
> program completes until the end without error while for large inputs it will
> hang or give that SIGSEGV.
>
> Regards,
> Anas
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] SIGV at MPI_Cart_sub
Thanks Paul, yes, I use Intel 12.1.0, and this error is intermittent: it is not always produced, but it occurs most of the time. My program is large and contains many files that depend on each other, so I don't think a snippet of the code would help. The program runs parallel matrix multiplication algorithms. I don't know whether it is because of my code or not, but when I run the program on small matrix sizes it completes without error, while for large inputs it hangs or gives that SIGSEGV. Regards, Anas
Re: [OMPI users] SIGV at MPI_Cart_sub
A blind guess: did you use the Intel compiler? If so, there is/was a bug leading to a SIGSEGV _in Open MPI itself_:

http://www.open-mpi.org/community/lists/users/2012/01/18091.php

If the SIGSEGV arises not in Open MPI but in the application itself, it may be a programming issue. In any case, a more precise answer is impossible without seeing some code snippets and/or logs.

Best,
Paul

Anas Al-Trad wrote:
> Dear people,
> In my application, I get a segmentation fault (integer divide-by-zero) when
> calling the MPI_Cart_sub routine. My program is as follows: I have 128 ranks,
> and I make a new communicator of the first 96 ranks via MPI_Comm_create. Then
> I create an 8x12 grid by calling MPI_Cart_create. After creating the grid, if
> I call MPI_Cart_sub I get that error. This error also happens when I use a
> communicator of 24 ranks and create a 4x6 grid. Can you please help me solve
> this?
> Regards, Anas
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Dipl.-Inform. Paul Kapinos - High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241/80-24915
[OMPI users] SIGV at MPI_Cart_sub
Dear people,
In my application, I get a segmentation fault (integer divide-by-zero) when calling the MPI_Cart_sub routine. My program is as follows: I have 128 ranks, and I make a new communicator of the first 96 ranks via MPI_Comm_create. Then I create an 8x12 grid by calling MPI_Cart_create. After creating the grid, if I call MPI_Cart_sub I get that error. This error also happens when I use a communicator of 24 ranks and create a 4x6 grid. Can you please help me solve this?
Regards, Anas
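The failing frame in the backtraces elsewhere in this thread is mca_topo_base_cart_coords, which converts a rank into grid coordinates by repeated integer division by the grid dims. A sketch of that arithmetic (mirroring the idea, not Open MPI's actual code) shows why a zeroed dim, e.g. from grid dimensions that were never filled in, turns into the reported integer divide-by-zero:

```c
/* Row-major rank -> coordinates conversion as a Cartesian topology
 * implementation performs it internally.  If any dims[i] were 0, the
 * division below would be an integer divide-by-zero (SIGFPE), so we
 * guard against it and return -1 instead of trapping. */
static int cart_coords(int rank, int ndims, const int dims[], int coords[])
{
    int i, nnodes = 1;
    for (i = 0; i < ndims; ++i) {
        if (dims[i] <= 0)
            return -1;        /* corrupted grid dims: refuse */
        nnodes *= dims[i];
    }
    for (i = 0; i < ndims; ++i) {
        nnodes /= dims[i];    /* the divisor the real code uses */
        coords[i] = rank / nnodes;
        rank %= nnodes;
    }
    return 0;
}
```

On an 8x12 grid this maps rank 13 to coordinates (1, 1); with a zeroed dim it returns -1 instead of crashing, which is why verifying dims before the MPI_Cart_create/MPI_Cart_sub calls is a cheap first diagnostic.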