Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-06-03 Thread Ralph Castain
Yeah, I think we've concluded that this is just a bug in the compiler and not something wrong in OMPI itself. Sadly, compilers (just like all software) also have bugs. I'd just use the upgraded version as they apparently fixed the problem. On Jun 3, 2014, at 4:43 AM, Alain Miniussi wrote: >

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-06-03 Thread Alain Miniussi
Please note that I had the problem with 13.1.0 but not with 13.1.1. On 28/05/2014 00:47, Ralph Castain wrote: On May 27, 2014, at 3:32 PM, Alain Miniussi wrote: Unfortunately, the debug library works like a charm (which makes the uninitialized variable issue more likely). Indeed - sounds

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-27 Thread Nathan Hjelm
On Wed, May 28, 2014 at 12:32:35AM +0200, Alain Miniussi wrote: > Unfortunately, the debug library works like a charm (which makes the > uninitialized variable issue more likely). > > Still, the stack trace points to mca_btl_openib_add_procs in > ompi/mca/btl/openib/btl_openib.c and there is only on

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-27 Thread Ralph Castain
On May 27, 2014, at 3:32 PM, Alain Miniussi wrote: > Unfortunately, the debug library works like a charm (which makes the > uninitialized variable issue more likely). Indeed - sounds like there is some optimization occurring that triggers the problem. > > Still, the stack trace points to mca_

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-27 Thread Alain Miniussi
Unfortunately, the debug library works like a charm (which makes the uninitialized variable issue more likely). Still, the stack trace points to mca_btl_openib_add_procs in ompi/mca/btl/openib/btl_openib.c and there is only one division in that function (although not floating point) at the end:
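The "(although not floating point)" remark fits the signal reported in this thread: on x86 Linux an integer division by zero is delivered as SIGFPE, even though no floating-point arithmetic is involved. The division from btl_openib.c is not shown in this excerpt, so the following is only a minimal sketch of how such a division can be guarded. `checked_div` and its parameters are hypothetical names for illustration, not Open MPI code.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper, not taken from Open MPI: divide only when it is safe.
 * Returns false and leaves *out untouched instead of trapping when den is
 * zero or when INT_MIN / -1 would overflow. */
static bool checked_div(int num, int den, int *out)
{
    if (den == 0) {
        return false;   /* an unguarded num / 0 raises SIGFPE on x86 Linux */
    }
    if (num == INT_MIN && den == -1) {
        return false;   /* quotient does not fit in int; also traps on x86 */
    }
    *out = num / den;
    return true;
}
```

A caller would test the return value and handle the zero-divisor case explicitly, which also makes an unexpectedly zero (for example, uninitialized) divisor visible in optimized builds.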

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-27 Thread Ralph Castain
Ah, good. On the setup that fails, could you use gdb to find the line number where it is dividing by zero? It could be an uninitialized variable that gcc inits one way and icc inits another. On May 27, 2014, at 4:49 AM, Alain Miniussi wrote: > So it's working with a gcc compiled openmpi: > >

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-27 Thread Alain Miniussi
So it's working with a gcc compiled openmpi:
[alainm@gurney mpi]$ /softs/openmpi-1.8.1-gnu447/bin/mpicc --showme
gcc -I/softs/openmpi-1.8.1-gnu447/include -pthread -Wl,-rpath -Wl,/softs/openmpi-1.8.1-gnu447/lib -Wl,--enable-new-dtags -L/softs/openmpi-1.8.1-gnu447/lib -lmpi
(reverse-i-search)`m

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-27 Thread Alain Miniussi
Hi Gus, Yes I did, with the same result on each process. Actually the problem was spotted on a real code although I just posted the minimal version. Alain On 26/05/2014 17:14, Gustavo Correa wrote: Hi Alain Have you tried this? mpiexec -np 2 ./a.out Note: mpicc to compile, mpiexec to exec

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-26 Thread Ralph Castain
If you wouldn't mind, yes - let's see if it is a problem with icc. We know some versions have bugs, though this may not be the issue here On May 26, 2014, at 7:39 AM, Alain Miniussi wrote: > > Hi, > > Did that too, with the same result: > > [alainm@tagir mpi]$ mpirun -n 1 ./a.out > [tagir:05

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-26 Thread Gustavo Correa
Hi Alain Have you tried this? mpiexec -np 2 ./a.out Note: mpicc to compile, mpiexec to execute. I hope this helps, Gus Correa On May 26, 2014, at 9:59 AM, Alain Miniussi wrote: > > Hi, > > I have a failure with the following minimalistic testcase: > $: more ./test.c > #include "mpi.h" > >

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-26 Thread Alain Miniussi
Hi,
Did that too, with the same result:
[alainm@tagir mpi]$ mpirun -n 1 ./a.out
[tagir:05123] *** Process received signal ***
[tagir:05123] Signal: Floating point exception (8)
[tagir:05123] Signal code: Integer divide-by-zero (1)
[tagir:05123] Failing at address: 0x2adb507b3d9f
[tagir:05123] [
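The "Signal code: Integer divide-by-zero (1)" line in the trace corresponds to the `si_code` value that POSIX defines for SIGFPE (`FPE_INTDIV`). As a rough sketch of how such a code can be turned into text, the helper below maps a few `si_code` constants to descriptions. `fpe_code_name` is a hypothetical name, and this is not claimed to be how Open MPI produces its message.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose the FPE_* si_code constants */
#endif
#include <signal.h>

/* Hypothetical helper: describe the si_code delivered along with SIGFPE. */
static const char *fpe_code_name(int si_code)
{
    switch (si_code) {
    case FPE_INTDIV: return "Integer divide-by-zero";
    case FPE_INTOVF: return "Integer overflow";
    case FPE_FLTDIV: return "Floating point divide-by-zero";
    case FPE_FLTOVF: return "Floating point overflow";
    default:         return "Unknown SIGFPE code";
    }
}
```

In a handler installed with `sigaction` and the `SA_SIGINFO` flag, the value to pass would be the `si_code` field of the `siginfo_t` argument.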

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-26 Thread Ralph Castain
Strange - I note that you are running these as singletons. Can you try running it under mpirun? mpirun -n 1 ./a.out just to see if it is the singleton that is causing the problem, or something in the openib btl itself. On May 26, 2014, at 6:59 AM, Alain Miniussi wrote: > > Hi, > > I have

[OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-26 Thread Alain Miniussi
Hi,
I have a failure with the following minimalistic testcase:
$: more ./test.c
#include "mpi.h"
int main(int argc, char* argv[]) {
    MPI_Init(&argc,&argv);
    MPI_Finalize();
    return 0;
}
$: mpicc -v
icc version 13.1.1 (gcc version 4.4.7 compatibility)
$: mpicc ./test.c
$: ./a.out
[tagir