Re: [OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Anas Al-Trad
Anyway, after compiling my code with icc/11.1.069, the job runs without hanging
or the SIGSEGV that occurred before when using the icc/12.1.0 module.

I also have to point out that when I was using icc/12.1.0 I was getting strange
output or hangs, and I worked around them by changing the names of parameters
inside the function. For example, if I have a function like this

time( ..., size_t *P, ...){}

and call it like this:
time(..,p,..);

then I have to change the name of *P inside the time function as follows:
time( ..., size_t *P, ...)
{
int bestP = *P; // copy the parameter into a local (and maybe again, as with the later bug I worked around)
int bP = bestP;
// then start using bP :)
...
}
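
For reference, here is a minimal, self-contained sketch of that workaround. The
function name run_step is made up for illustration; bestP and bP are the names
from the snippet above, and the only thing taken from the description is the
pattern of copying the pointer parameter into differently named locals before
using it:

#include <stddef.h>
#include <stdio.h>

/* Hypothetical function illustrating the workaround: the pointer
 * parameter P is copied into locals with other names before use. */
static void run_step(size_t *P)
{
    size_t bestP = *P;   /* first copy of the parameter */
    size_t bP = bestP;   /* copied again, as described above */

    /* use bP from here on instead of dereferencing P directly */
    printf("running with p = %zu\n", bP);
}

int main(void)
{
    size_t p = 96;
    run_step(&p);
    return 0;
}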

Thanks, guys, for the help. It seems the problem is resolved when compiling
with the older compiler.


Re: [OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Anas Al-Trad
It is a good question. I asked it myself at first and concluded the inputs
should be correct, but I want to confirm that anyway. Here is the code snippet
of the program:
...
int ranks[size];
for(i=0; i < size; ++i)
{
ranks[i] = i;
}
...

for(p=8; p <= (size); p+=4)
{
  MPI_Barrier(MPI_COMM_WORLD);
  if(!grid_init(p, 1)) continue;
  if( (p>=m) || (p>=k) || (p>=n) )
break;

  MPI_Group_incl(world_group, p, ranks, &working_group);
  MPI_Comm_create(MPI_COMM_WORLD, working_group, &working_comm);

  if(working_comm != MPI_COMM_NULL)
  {
...
variant_run(, C, m, k, n, my_rank, p, working_comm);
...
    MPI_Group_free(&working_group);
    MPI_Comm_free(&working_comm);
  }

Inside variant_run, it calls this function, where the error occurs:
void Compute_SUMMA1(Matrix* A, Matrix* B, Matrix *C, size_t M, size_t K,
size_t N, size_t my_rank, size_t size, MPI_Comm comm)
{
C->block_matrix = gsl_matrix_calloc(A->block_matrix->size1,
B->block_matrix->size2);
C->distribution_type = TwoD_Block;

MPI_Comm grid_comm;
int dim[2], period[2], reorder = 0, ndims = 2;
int coord[2], id;

dim[0] = global.PR; dim[1] = global.PC;
period[0] = 0; period[1] = 0;

int ss, rr;
MPI_Group comm_group;
MPI_Comm_group(comm, &comm_group);
MPI_Group_size(comm_group, &ss);
MPI_Group_rank(comm_group, &rr);
if(ss == 6)
{
//printf("M %d K %d N %d
//printf("my_rank in comm %d   my_rank in world_comm %d\n", rr, my_rank);
//printf(" comm size %d  my_rank in comm %d   my_rank in world_comm %d\n",
ss, rr, my_rank);
//printf("SUMMA ... PR %d  PC %d\n", global.PR, global.PC);
}
//MPI_Barrier(comm);
// if(my_rank == 0)
// printf("my_rank %d  ndims %d  dim[0] %d  dim[1] %d  period[0] %d
 period[1] %d  reorder %d\n",
//my_rank, ndims, dim[0], dim[1], period[0], period[1], reorder);
// if(comm == MPI_COMM_NULL)
//   printf("my_rank %d  comm is empty\n", my_rank);
//
MPI_Cart_create(comm, ndims, dim, period, reorder, &grid_comm);

MPI_Comm Acomm, Bcomm;

// create column subgrids
int remain[2]; //, mdims, dims[2], row_coords[2];
remain[0] = 1;
remain[1] = 0;
MPI_Cart_sub(grid_comm, remain, &Acomm);

remain[0] = 0;
remain[1] = 1;
MPI_Cart_sub(grid_comm, remain, &Bcomm);
...
}


As you can see, all ranks call grid_init, which is a global function that
computes the grid dimensions: when executed with 24 ranks it produces 4x6, with
96 ranks it produces 8x12, and it stores the result in a global structure as PR
and PC. It is executed by all processes; I checked the result for rank 0 and
some other processes and it is correct, so I assume it is correct for all the
other processes as well.

So grid_comm, which is the input to MPI_Cart_sub, should be correct. The ranks
in working_comm and in MPI_COMM_WORLD should be the same, which should be
correct given how the ranks array is filled at the beginning of this code
snippet.
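
To be extra safe, one could add a quick check right before MPI_Cart_create that
neither dimension is zero and that PR*PC matches the communicator size, since a
zero dimension could make the integer divisions used for cartesian coordinates
divide by zero. A minimal sketch, assuming only the global.PR / global.PC
fields shown above (the helper name check_grid_dims and everything else here is
illustrative, not taken from the real code):

#include <mpi.h>
#include <stdio.h>

/* Illustrative sanity check: abort early if the grid shape cannot
 * cover the communicator, instead of crashing inside MPI_Cart_sub. */
static void check_grid_dims(MPI_Comm comm, int pr, int pc)
{
    int comm_size;
    MPI_Comm_size(comm, &comm_size);

    if (pr <= 0 || pc <= 0 || pr * pc != comm_size) {
        fprintf(stderr, "bad grid: PR=%d PC=%d, comm size=%d\n",
                pr, pc, comm_size);
        MPI_Abort(comm, 1);
    }
}

It would be called as check_grid_dims(comm, global.PR, global.PC); just before
the MPI_Cart_create call in Compute_SUMMA1.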



On Tue, Jan 10, 2012 at 5:25 PM, Jeff Squyres  wrote:

> This may be a dumb question, but are you 100% sure that the input values
> are correct?
>
> On Jan 10, 2012, at 8:16 AM, Anas Al-Trad wrote:
>
> >  Hi Ralph, I changed the Intel icc module from 12.1.0 to 11.1.069, the
> > previous default one used on the Neolith cluster. I submitted the job and am
> > still waiting for the result. Here is the message of the segmentation fault:
> >
> > [n764:29867] *** Process received signal ***
> > [n764:29867] Signal: Floating point exception (8)
> > [n764:29867] Signal code: Integer divide-by-zero (1)
> > [n764:29867] Failing at address: 0x2ba640e74627
> > [n764:29867] [ 0] /lib64/libc.so.6 [0x2ba641e162d0]
> > [n764:29867] [ 1]
> /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_coords+0x43)
> [0x2ba640e74627]
> > [n764:29867] [ 2]
> /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_sub+0x1d5)
> [0x2ba640e74acd]
> > [n764:29867] [ 3]
> /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(MPI_Cart_sub+0x35)
> [0x2ba640e472d9]
> > [n764:29867] [ 4]
> /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(Compute_SUMMA1+0x226)
> [0x4088da]
> > [n764:29867] [ 5]
> /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(variant_run+0xb2)
> [0x409058]
> > [n764:29867] [ 6]
> /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(main+0xf90) [0x40eeba]
> > [n764:29867] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4)
> [0x2ba641e03994]
> > [n764:29867] [ 8] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o
> [0x403fd9]
> > [n764:29867] *** End of error message ***
> >
> > When I run my application, sometimes I get this error and sometimes it
> > hangs in the middle.
> >

Re: [OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Jeff Squyres
This may be a dumb question, but are you 100% sure that the input values are 
correct?

On Jan 10, 2012, at 8:16 AM, Anas Al-Trad wrote:

>  Hi Ralph, I changed the Intel icc module from 12.1.0 to 11.1.069, the
> previous default one used on the Neolith cluster. I submitted the job and am
> still waiting for the result. Here is the message of the segmentation fault:
> 
> [n764:29867] *** Process received signal ***
> [n764:29867] Signal: Floating point exception (8)
> [n764:29867] Signal code: Integer divide-by-zero (1)
> [n764:29867] Failing at address: 0x2ba640e74627
> [n764:29867] [ 0] /lib64/libc.so.6 [0x2ba641e162d0]
> [n764:29867] [ 1] 
> /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_coords+0x43)
>  [0x2ba640e74627]
> [n764:29867] [ 2] 
> /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_sub+0x1d5)
>  [0x2ba640e74acd]
> [n764:29867] [ 3] 
> /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(MPI_Cart_sub+0x35) 
> [0x2ba640e472d9]
> [n764:29867] [ 4] 
> /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(Compute_SUMMA1+0x226) 
> [0x4088da]
> [n764:29867] [ 5] 
> /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(variant_run+0xb2) 
> [0x409058]
> [n764:29867] [ 6] 
> /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(main+0xf90) [0x40eeba]
> [n764:29867] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba641e03994]
> [n764:29867] [ 8] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o 
> [0x403fd9]
> [n764:29867] *** End of error message ***
> 
> When I run my application, sometimes I get this error and sometimes it
> hangs in the middle.
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Anas Al-Trad
 Hi Ralph, I changed the Intel icc module from 12.1.0 to 11.1.069, the
previous default one used on the Neolith cluster. I submitted the job and am
still waiting for the result. Here is the message of the segmentation fault:

[n764:29867] *** Process received signal ***
[n764:29867] Signal: Floating point exception (8)
[n764:29867] Signal code: Integer divide-by-zero (1)
[n764:29867] Failing at address: 0x2ba640e74627
[n764:29867] [ 0] /lib64/libc.so.6 [0x2ba641e162d0]
[n764:29867] [ 1]
/software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_coords+0x43)
[0x2ba640e74627]
[n764:29867] [ 2]
/software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_sub+0x1d5)
[0x2ba640e74acd]
[n764:29867] [ 3]
/software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(MPI_Cart_sub+0x35)
[0x2ba640e472d9]
[n764:29867] [ 4]
/home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(Compute_SUMMA1+0x226)
[0x4088da]
[n764:29867] [ 5]
/home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(variant_run+0xb2)
[0x409058]
[n764:29867] [ 6]
/home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(main+0xf90) [0x40eeba]
[n764:29867] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba641e03994]
[n764:29867] [ 8] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o
[0x403fd9]
[n764:29867] *** End of error message ***

When I run my application, sometimes I get this error and sometimes it hangs
in the middle.


Re: [OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Ralph Castain
Have you tried the suggested fix from the email thread Paul cited? Sounds to me 
like the most likely cause of the problem, assuming it comes from inside OMPI.

Have you looked at the backtrace to see if it is indeed inside OMPI vs your 
code?

On Jan 10, 2012, at 6:13 AM, Anas Al-Trad wrote:

> 
> Thanks Paul,
> yes, I use Intel 12.1.0, and this error is intermittent; it is not always
> produced, but it occurs most of the time.
> My program is large and contains many interrelated files, so I don't think a
> snippet of the code would help. The program runs parallel matrix
> multiplication algorithms. I don't know whether it is because of my code or
> not, but when I run the program with small matrix sizes it completes to the
> end without error, while for large inputs it hangs or gives that SIGSEGV.
> 
> Regards,
> Anas
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Anas Al-Trad
Thanks Paul,
yes, I use Intel 12.1.0, and this error is intermittent; it is not always
produced, but it occurs most of the time.
My program is large and contains many interrelated files, so I don't think a
snippet of the code would help. The program runs parallel matrix
multiplication algorithms. I don't know whether it is because of my code or
not, but when I run the program with small matrix sizes it completes to the
end without error, while for large inputs it hangs or gives that SIGSEGV.

Regards,
Anas


Re: [OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Paul Kapinos

A blind guess: did you use the Intel compiler?
If so, there is/was a bug leading to a SIGSEGV _in Open MPI itself_.

http://www.open-mpi.org/community/lists/users/2012/01/18091.php

If the SIGSEGV arises not in Open MPI but in the application itself, it may be
a programming issue. In any case, a more precise answer is impossible without
seeing any code snippets and/or logs.


Best,
Paul


Anas Al-Trad wrote:
Dear people,
   In my application, I get a segmentation fault
(integer divide-by-zero) when calling the MPI_Cart_sub routine. My program is
as follows: I have 128 ranks, and I make a new communicator of the first 96
ranks via MPI_Comm_create. Then I create an 8x12 grid by calling
MPI_Cart_create. After creating the grid, if I call MPI_Cart_sub, then I
get that error.


This error also happens when I use a communicator of 24 ranks and create
a 4x6 grid. Can you please help me solve this?


Regards,
Anas









--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915




[OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Anas Al-Trad
Dear people,
   In my application, I get a segmentation fault (integer
divide-by-zero) when calling the MPI_Cart_sub routine. My program is as follows:
I have 128 ranks, and I make a new communicator of the first 96 ranks via
MPI_Comm_create. Then I create an 8x12 grid by calling MPI_Cart_create.
After creating the grid, if I call MPI_Cart_sub, then I get that error.

This error also happens when I use a communicator of 24 ranks and create a
4x6 grid. Can you please help me solve this?
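
For reference, here is a minimal sketch of the sequence of calls described
above (the first 96 of the 128 ranks placed into a new communicator, then an
8x12 grid and its column/row subgrids). It is only an illustration of the
described setup, not my actual program, and it has to be run with at least 96
processes:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm working_comm;
    MPI_Group world_group, working_group;
    int i, ranks[96];

    MPI_Init(&argc, &argv);

    for (i = 0; i < 96; ++i)   /* the first 96 ranks of MPI_COMM_WORLD */
        ranks[i] = i;

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, 96, ranks, &working_group);
    MPI_Comm_create(MPI_COMM_WORLD, working_group, &working_comm);

    if (working_comm != MPI_COMM_NULL) {
        MPI_Comm grid_comm, col_comm, row_comm;
        int dims[2] = {8, 12}, periods[2] = {0, 0}, remain[2];

        /* 8x12 grid over the 96 ranks */
        MPI_Cart_create(working_comm, 2, dims, periods, 0, &grid_comm);

        remain[0] = 1; remain[1] = 0;
        MPI_Cart_sub(grid_comm, remain, &col_comm);   /* the call that fails */

        remain[0] = 0; remain[1] = 1;
        MPI_Cart_sub(grid_comm, remain, &row_comm);

        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
        MPI_Comm_free(&grid_comm);
        MPI_Comm_free(&working_comm);
    }

    MPI_Group_free(&working_group);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}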

Regards,
Anas