Re: [OMPI users] Still "illegal instruction"

2016-09-22 Thread Mahmood Naderan
>Thx for sharing, quite interesting. But does this mean that there is no
>working command line flag for gcc to switch this off (like the -march=bdver1
>that Gilles mentioned) or to tell me what it thinks it should compile for?

Well, that didn't work. Maybe I messed something up, since I recompiled the
programs multiple times with different configs and options. I will try one
more time.
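
One way to check whether a rebuilt binary still contains FMA3 instructions is to scan its disassembly. A minimal sketch, assuming GNU objdump and grep are available; the function name count_fma3 and the binary path in the comment are illustrative, not from the thread:

```shell
# Count FMA3 instructions in disassembly text read from stdin.
# FMA3 mnemonics carry an operand-order suffix (132, 213 or 231),
# which FMA4 mnemonics such as vfmaddpd lack.
count_fma3() {
    grep -cE '[[:space:]]vfn?m(add|sub|addsub|subadd)(132|213|231)[ps][sd]'
}

# Intended use (hypothetical path, not run here):
#   objdump -d ./transiesta | count_fma3
```

A count of zero on every object and library that ends up in the executable would suggest the rebuild really dropped FMA3.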



Regards,
Mahmood
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Still "illegal instruction"

2016-09-22 Thread Reuti

> On 22.09.2016 at 17:20, Mahmood Naderan wrote:
> 
> Although this problem is not related to OMPI *at all*, I think it is good to 
> tell the others what was going on. Finally, I caught the illegal instruction 
> :)
> 
> Briefly, I built the serial version of Siesta on the frontend and ran it 
> directly on the compute node. Fortunately, "x/i $pc" from GDB showed that the 
> illegal instruction was a FMA3 instruction. More detail is available at 
> https://gcc.gnu.org/ml/gcc-help/2016-09/msg00084.html
> 
> According to Wikipedia,
> 
>   • FMA4 is supported in AMD processors starting with the Bulldozer 
> architecture. FMA4 was realized in hardware before FMA3.
>   • FMA3 is supported in AMD processors starting with the Piledriver 
> architecture and Intel starting with Haswell processors and Broadwell 
> processors since 2014.
> Therefore, code built on the frontend (Piledriver) can contain FMA3 
> instructions, which the compute node (Bulldozer) doesn't recognize.

Thx for sharing, quite interesting. But does this mean that there is no 
working command line flag for gcc to switch this off (like the -march=bdver1 
that Gilles mentioned) or to tell me what it thinks it should compile for?

For pgcc there is -show and I can spot the target it discovered in the 
USETPVAL= line.
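
For gcc, a rough equivalent is to ask the compiler to print its target settings with -Q --help=target and keep the relevant lines. A minimal sketch, assuming a gcc that supports those options; the helper name target_summary is made up for illustration:

```shell
# Keep only the -march/-mtune lines and the FMA switches from the output
# of `gcc -Q --help=target`, read on stdin.
target_summary() {
    grep -E '^[[:space:]]*-m(arch=|tune=|fma4?)([[:space:]]|$)'
}

# Intended use (not run here):
#   gcc -march=native -Q --help=target | target_summary
```

Running it once on the frontend and once on a compute node shows whether the two compilers report different target settings.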

-- Reuti

> 
> The solution was (as others suggested) building Siesta on the compute node. I 
> have to say that I tested all related programs (OMPI, ScaLAPACK, OpenBLAS) 
> sequentially on the compute node in order to find which one generates the 
> illegal instruction.
> 
> Anyway... thanks a lot for your comments. Hope this helps others in the 
> future.
> 
> 
> 
> Regards,
> Mahmood
> 
> 

Re: [OMPI users] Still "illegal instruction"

2016-09-21 Thread Mahmood Naderan
Dear Gilles,
It seems that using GDB with MPI is a bit tricky. I read the FAQ about that.

Please see the post at https://gcc.gnu.org/ml/gcc-help/2016-09/msg00078.html



>i guess your gdb is also a bit too old to support all operations on a core
>file
>(fwiw, i am able to do that on RHEL7)
This is Rocks-6 and GDB is 7.2. It seems that it doesn't support the
"info proc mapping" command.


I will try your suggestion by modifying the code. Meanwhile do you have any
comment about that post (the link above)?

Regards,
Mahmood

Re: [OMPI users] Still "illegal instruction"

2016-09-16 Thread Mahmood Naderan
OK Gilles, let me try that. I will troubleshoot with gcc mailing list and
will come back later.


Regards,
Mahmood

Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Gilles Gouaillardet
Mahmood,

note you have to compile the source file that contains the snippet
with '-g -O0', and link with '-g -O0'

also, there was a typo in the gdb command,
please read "frame 1" instead of "frame #1"

Cheers,

Gilles

On Fri, Sep 16, 2016 at 12:53 PM, Gilles Gouaillardet
 wrote:
> Mahmood,
>
> -march=bdver1
>
> should be ok on your nodes.
> from the gcc command line, i was expecting -march=xxx, but it is
> missing (your gcc might be a bit older for that)
> note you have to recompile all your libs (openblas and friends) with
> -march=bdver1
>
> i guess your gdb is also a bit too old to support all operations on a core 
> file
> (fwiw, i am able to do that on RHEL7)
>
> at first, i recommend you find the smallest number of nodes necessary
> to reproduce the issue.
> ideally, you would confirm the app is working fine by running it
> exclusively on the frontend.
>
> if you do not have a parallel debugger, then you have to manually
> parallel debug your app.
>
> i usually update my main app like this
>
> int _dbg=1;
>
> MPI_Init(...);
> printf("gdb --pid=%d\n", getpid());
> while (_dbg) poll(NULL, 0, 1);
>
> rebuild and run.
>
> then log into the compute nodes, and run the gdb command that was
> displayed previously
> you usually have to (for all your MPI tasks, in different terminals)
> bt
> frame #1
> set _dbg=0
> c
>
> and wait for a crash
>
> hopefully, you will be able to run
> disas
> info proc mapping
> x /100x $rip
>
> Cheers,
>
> Gilles
>
>
> On Fri, Sep 16, 2016 at 2:54 AM, Mahmood Naderan  wrote:
>> The differences are very very minor
>>
>> root@cluster:tpar# echo | gcc -v -E - 2>&1 | grep cc1
>>  /usr/libexec/gcc/x86_64-redhat-linux/4.4.7/cc1 -E -quiet -v -
>> -mtune=generic
>>
>> [root@compute-0-1 ~]# echo | gcc -v -E - 2>&1 | grep cc1
>>  /usr/libexec/gcc/x86_64-redhat-linux/4.4.6/cc1 -E -quiet -v -
>> -mtune=generic
>>
>>
>> Even I tried to compile the program with -march=amdfam10. Something like
>> these
>>
>> /export/apps/siesta/openmpi-2.0.0/bin/mpifort -c -g -Os -march=amdfam10
>> `FoX/FoX-config --fcflags`  -DMPI -DFC_HAVE_FLUSH -DFC_HAVE_ABORT
>> -DTRANSIESTA /export/apps/siesta/siesta-4.0/Src/pspltm1.F
>>
>> But got the same error.
>>
>> /proc/cpuinfo on the frontend shows (family 21, model 2) and on the compute
>> node it shows (family 21, model 1).
>>
>>
>>
>>>That being said, my best bet is you compile on a compute node ...
>> gcc is there on the computes, but the NFS permission is another issue. It
>> seems that nodes are not able to write on /share (the one which is shared
>> between frontend and computes).
>>
>>
>>
>> An important question is that, how can I find out what is the name of the
>> illegal instruction. Then, I hope to find the document that points which
>> instruction set (avx, sse4, ...) contains that instruction.
>>
>> Is there any option in mpirun to turn on the verbosity to see more
>> information?
>>
>> Regards,
>> Mahmood
>>
>>
>>


Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Gilles Gouaillardet
Mahmood,

-march=bdver1

should be ok on your nodes.
from the gcc command line, i was expecting -march=xxx, but it is
missing (your gcc might be a bit older for that)
note you have to recompile all your libs (openblas and friends) with
-march=bdver1

i guess your gdb is also a bit too old to support all operations on a core file
(fwiw, i am able to do that on RHEL7)

at first, i recommend you find the smallest number of nodes necessary
to reproduce the issue.
ideally, you would confirm the app is working fine by running it
exclusively on the frontend.

if you do not have a parallel debugger, then you have to manually
parallel debug your app.

i usually update my main app like this

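volatile int _dbg = 1;   /* set to 0 from gdb to leave the loop */

MPI_Init(...);
printf("gdb --pid=%d\n", getpid());   /* the command to run on this node */
fflush(stdout);   /* show it even when stdout is buffered */
while (_dbg) poll(NULL, 0, 1);   /* sleep 1 ms per pass until released */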

rebuild and run.

then log into the compute nodes, and run the gdb command that was
displayed previously
you usually have to (for all your MPI tasks, in different terminals)
bt
frame #1
set _dbg=0
c

and wait for a crash
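
The attach-and-release steps above can be scripted per rank. A minimal sketch, assuming _dbg is reachable from frame 1 as described above; the helper name release_rank is hypothetical:

```shell
# Attach gdb to one waiting MPI rank and release it from the _dbg loop.
# $1 = PID printed by the rank ("gdb --pid=<PID>").
release_rank() {
    gdb --pid="$1" -ex 'bt' -ex 'frame 1' -ex 'set var _dbg=0' -ex 'continue'
}

# Intended use, one terminal per rank (not run here):
#   release_rank 5383
```

gdb stays attached after "continue", so the session is still there when the illegal instruction is hit.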

hopefully, you will be able to run
disas
info proc mapping
x /100x $rip

Cheers,

Gilles


On Fri, Sep 16, 2016 at 2:54 AM, Mahmood Naderan  wrote:
> The differences are very very minor
>
> root@cluster:tpar# echo | gcc -v -E - 2>&1 | grep cc1
>  /usr/libexec/gcc/x86_64-redhat-linux/4.4.7/cc1 -E -quiet -v -
> -mtune=generic
>
> [root@compute-0-1 ~]# echo | gcc -v -E - 2>&1 | grep cc1
>  /usr/libexec/gcc/x86_64-redhat-linux/4.4.6/cc1 -E -quiet -v -
> -mtune=generic
>
>
> Even I tried to compile the program with -march=amdfam10. Something like
> these
>
> /export/apps/siesta/openmpi-2.0.0/bin/mpifort -c -g -Os -march=amdfam10
> `FoX/FoX-config --fcflags`  -DMPI -DFC_HAVE_FLUSH -DFC_HAVE_ABORT
> -DTRANSIESTA /export/apps/siesta/siesta-4.0/Src/pspltm1.F
>
> But got the same error.
>
> /proc/cpuinfo on the frontend shows (family 21, model 2) and on the compute
> node it shows (family 21, model 1).
>
>
>
>>That being said, my best bet is you compile on a compute node ...
> gcc is there on the computes, but the NFS permission is another issue. It
> seems that nodes are not able to write on /share (the one which is shared
> between frontend and computes).
>
>
>
> An important question is that, how can I find out what is the name of the
> illegal instruction. Then, I hope to find the document that points which
> instruction set (avx, sse4, ...) contains that instruction.
>
> Is there any option in mpirun to turn on the verbosity to see more
> information?
>
> Regards,
> Mahmood
>
>
>


Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Matthieu Brucher
I don't think there is anything OpenMPI can do for you here. The issue is
clearly in how you are compiling your application.
To start, you can try to compile without the --march=generic and use
something as generic as possible (i.e. only SSE2). Then if this doesn't
work for your app, do the same for any 3rd party library.

Cheers,

2016-09-15 19:01 GMT+01:00 Mahmood Naderan :

> Excuse me, which is most suitable for me to find the name of the illegal
> instruction?
>
> --verbose
> --debug-level
> --debug-daemons
> --debug-daemons-file
>
>
> Regards,
> Mahmood
>
>
>



-- 
Information System Engineer, Ph.D.
Blog: http://blog.audio-tk.com/
LinkedIn: http://www.linkedin.com/in/matthieubrucher

Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Reuti

On 15.09.2016 at 19:54, Mahmood Naderan wrote:

> The differences are very very minor
> 
> root@cluster:tpar# echo | gcc -v -E - 2>&1 | grep cc1
>  /usr/libexec/gcc/x86_64-redhat-linux/4.4.7/cc1 -E -quiet -v - -mtune=generic
> 
> [root@compute-0-1 ~]# echo | gcc -v -E - 2>&1 | grep cc1
>  /usr/libexec/gcc/x86_64-redhat-linux/4.4.6/cc1 -E -quiet -v - -mtune=generic
> 
> 
> Even I tried to compile the program with -march=amdfam10. Something like these
> 
> /export/apps/siesta/openmpi-2.0.0/bin/mpifort -c -g -Os -march=amdfam10   
> `FoX/FoX-config --fcflags`  -DMPI -DFC_HAVE_FLUSH -DFC_HAVE_ABORT 
> -DTRANSIESTA /export/apps/siesta/siesta-4.0/Src/pspltm1.F
> 
> But got the same error.
> 
> /proc/cpuinfo on the frontend shows (family 21, model 2) and on the compute 
> node it shows (family 21, model 1).

Just out of curiosity: what are their model names?


> >That being said, my best bet is you compile on a compute node ...
> gcc is there on the computes, but the NFS permission is another issue. It 
> seems that nodes are not able to write on /share (the one which is shared 
> between frontend and computes).

Would it work to compile with a shared target and copy it to /shared on the 
frontend?

-- Reuti


> An important question is that, how can I find out what is the name of the 
> illegal instruction. Then, I hope to find the document that points which 
> instruction set (avx, sse4, ...) contains that instruction.
> 
> Is there any option in mpirun to turn on the verbosity to see more 
> information?
> 
> Regards,
> Mahmood
> 
> 


Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Mahmood Naderan
Excuse me, which is most suitable for me to find the name of the illegal
instruction?

--verbose
--debug-level
--debug-daemons
--debug-daemons-file


Regards,
Mahmood

Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Mahmood Naderan
The differences are very very minor

root@cluster:tpar# echo | gcc -v -E - 2>&1 | grep cc1
 /usr/libexec/gcc/x86_64-redhat-linux/4.4.7/cc1 -E -quiet -v -
-mtune=generic

[root@compute-0-1 ~]# echo | gcc -v -E - 2>&1 | grep cc1
 /usr/libexec/gcc/x86_64-redhat-linux/4.4.6/cc1 -E -quiet -v -
-mtune=generic


I even tried to compile the program with -march=amdfam10. Something like
this:

/export/apps/siesta/openmpi-2.0.0/bin/mpifort -c -g -Os -march=amdfam10
`FoX/FoX-config --fcflags`  -DMPI -DFC_HAVE_FLUSH -DFC_HAVE_ABORT
-DTRANSIESTA /export/apps/siesta/siesta-4.0/Src/pspltm1.F

But got the same error.

/proc/cpuinfo on the frontend shows (family 21, model 2) and on the compute
node it shows (family 21, model 1).
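
The family and model numbers can be pulled out of /proc/cpuinfo in one step, which makes the two nodes easy to compare. A minimal sketch, assuming the usual Linux layout of the "cpu family" and "model" lines; cpu_family_model is an illustrative name:

```shell
# Print "<family> <model>" from /proc/cpuinfo-style text on stdin.
# The "model name" line is skipped on purpose; on multi-core machines
# the values of the last listed CPU win.
cpu_family_model() {
    awk -F ':[ \t]*' '
        /^cpu family/    { family = $2 }
        /^model[ \t]*:/  { model  = $2 }
        END              { print family, model }
    '
}

# Intended use (not run here):
#   cpu_family_model < /proc/cpuinfo
```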



>That being said, my best bet is you compile on a compute node ...
gcc is there on the computes, but the NFS permission is another issue. It
seems that nodes are not able to write on /share (the one which is shared
between frontend and computes).



An important question is how I can find out the name of the illegal
instruction. Then I hope to find the document that says which
instruction set (avx, sse4, ...) contains that instruction.
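
One way to get the name of the faulting instruction from a core file is to ask gdb for the instruction at the saved program counter. A minimal sketch, assuming gdb is installed on the node and the core matches the executable; faulting_insn is an illustrative name:

```shell
# Print the instruction at the program counter recorded in a core file.
# $1 = executable that produced the core, $2 = core file.
faulting_insn() {
    gdb --batch -ex 'x/i $pc' "$1" "$2"
}

# Intended use (not run here):
#   faulting_insn /share/apps/siesta/siesta-4.0/tpar/transiesta core.5383
```

The mnemonic that comes back can then be looked up in the CPU vendor's instruction reference to see which extension it belongs to.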

Is there any option in mpirun to turn on the verbosity to see more
information?

Regards,
Mahmood

Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Gilles Gouaillardet
if gcc is installed on your compute node, you can run

echo | gcc -v -E - 2>&1 | grep cc1

and look for the -march=xxx parameter
/* you might want to compare that with your frontend */

And/or you can run
grep family /proc/cpuinfo
on your compute node
Then
man gcc
on your front end node

From my gcc, -march=bdver1 for Family 15h, -march=barcelona for family 10h
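
Put as a lookup, with /proc/cpuinfo reporting the family in decimal (21 is 15h, 16 is 10h). A minimal sketch covering only the two families named above; march_for_family is an illustrative name, and whether a given gcc accepts these values has to be checked against its own man page:

```shell
# Map the decimal "cpu family" value from /proc/cpuinfo to a -march name.
# Only the two families mentioned in this thread are covered.
march_for_family() {
    case "$1" in
        21) echo bdver1 ;;      # 21 decimal = Family 15h
        16) echo barcelona ;;   # 16 decimal = Family 10h
        *)  echo unknown ;;
    esac
}
```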

That being said, my best bet is you compile on a compute node ...

Cheers,

Gilles


On Thursday, September 15, 2016, Mahmood Naderan 
wrote:

> Although the CPUs are nearly the same, the CPU flags are different.
> I noticed that the frontend has fma, f16c, tch, tce, tbm and bmi1 while
> the compute nodes don't have them.
>
> I guess that since the programs were compiled on the frontend (6380),
> the optimization phase emitted some special instructions which aren't
> available on the compute nodes (6282).
>
> Maybe this is not really related to OMPI, but does anybody know which compiler
> flags are related to these special instructions?
>
>
>
>
> >Ok, you can try this under gdb
> >info proc mapping
> >info registers
> >x /100x $rip
> >x /100x $eip
>
> The process is dead, so some commands are invalid.
>
> Program terminated with signal 4, Illegal instruction.
> #0  0x00000000008da76e in ?? ()
> (gdb) info proc mapping
> No /proc directory: '/proc/5383'
> (gdb) info registers
> rax            0x0                 0
> rbx            0x448f980           71891328
> rcx            0x7fff52810b00      140734577576704
> rdx            0x448f980           71891328
> rsi            0x448f980           71891328
> rdi            0x8                 8
> rbp            0x448f980           0x448f980
> rsp            0x7fff52810ae8      0x7fff52810ae8
> r8             0x1                 1
> r9             0x9c0               2496
> r10            0x44af480           72021120
> r11            0x44b1b80           72031104
> r12            0x8                 8
> r13            0x8                 8
> r14            0x9                 9
> r15            0x13880             80000
> rip            0x8da76e            0x8da76e
> eflags         0x10246             [ PF ZF IF RF ]
> cs             0x33                51
> ss             0x2b                43
> ds             0x0                 0
> es             0x0                 0
> fs             0x0                 0
> gs             0x0                 0
> (gdb) x /100x $rip
> 0x8da76e:   Cannot access memory at address 0x8da76e
> (gdb) x /100x $eip
> Value can't be converted to integer.
> (gdb)
>
>
>
> Regards,
> Mahmood

Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Mahmood Naderan
Although the CPUs are nearly the same, the CPU flags are different.
I noticed that the frontend has fma, f16c, tch, tce, tbm and bmi1 while the
compute nodes don't have them.

I guess that since the programs were compiled on the frontend (6380), the
optimization phase emitted some special instructions which aren't
available on the compute nodes (6282).
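
The flag comparison can be done mechanically instead of by eye. A minimal sketch, assuming each flag list is the space-separated text after "flags :" in /proc/cpuinfo; flags_only_in_first is an illustrative name:

```shell
# Print the flags present in the first list but absent from the second.
# Each argument is one space-separated flag list.
flags_only_in_first() {
    for flag in $1; do
        case " $2 " in
            *" $flag "*) ;;                   # also present in the second list
            *) printf '%s\n' "$flag" ;;       # only in the first list
        esac
    done
}

# Intended use (not run here):
#   frontend=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
#   compute=$(ssh compute-0-1 "grep -m1 '^flags' /proc/cpuinfo" | cut -d: -f2)
#   flags_only_in_first "$frontend" "$compute"
```

Any flag it prints names an extension that code built for the frontend may use and the compute node cannot execute.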

Maybe this is not really related to OMPI, but does anybody know which compiler
flags are related to these special instructions?




>Ok, you can try this under gdb
>info proc mapping
>info registers
>x /100x $rip
>x /100x $eip

The process is dead, so some commands are invalid.

Program terminated with signal 4, Illegal instruction.
#0  0x00000000008da76e in ?? ()
(gdb) info proc mapping
No /proc directory: '/proc/5383'
(gdb) info registers
rax            0x0                 0
rbx            0x448f980           71891328
rcx            0x7fff52810b00      140734577576704
rdx            0x448f980           71891328
rsi            0x448f980           71891328
rdi            0x8                 8
rbp            0x448f980           0x448f980
rsp            0x7fff52810ae8      0x7fff52810ae8
r8             0x1                 1
r9             0x9c0               2496
r10            0x44af480           72021120
r11            0x44b1b80           72031104
r12            0x8                 8
r13            0x8                 8
r14            0x9                 9
r15            0x13880             80000
rip            0x8da76e            0x8da76e
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) x /100x $rip
0x8da76e:   Cannot access memory at address 0x8da76e
(gdb) x /100x $eip
Value can't be converted to integer.
(gdb)



Regards,
Mahmood

Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Gilles Gouaillardet
Ok, you can try this under gdb

info proc mapping

info registers

x /100x $rip

x /100x $eip



I remember you are running on AMD CPUs, which is why Intel-only
instructions must be avoided.


Cheers,

Gilles

On Thursday, September 15, 2016, Mahmood Naderan 
wrote:

> disas command fails.
>
> Program terminated with signal 4, Illegal instruction.
> #0  0x00000000008da76e in ?? ()
> (gdb) bt
> #0  0x00000000008da76e in ?? ()
> #1  0x00000000008da970 in ?? ()
> #2  0x0000000000bfe9f8 in ?? ()
> #3  0x0000000000000000 in ?? ()
> (gdb) disas
> No function contains program counter for selected frame.
>
>
> >Btw, did you run some simple applications with openmpi 2.0.0 ?
> >We do have bits of assembly code, and even if i do not believe they are
> >specific to intel cpus, i might be wrong and that could be the root cause.
>
> I didn't run the tests. But I am pretty sure that OpenMPI is working
> because, other applications (not siesta) have no problem.
> Please note that the CPUs are AMD. Frontend is Opteron 6380 and the
> compute nodes are 6282SE
>
> >Also, did you run
> >make check
> >After you built openmpi ?
>
> All are OK. Please see below.
>
> 
> 
> Testsuite summary for Open MPI 2.0.0
> 
> 
> # TOTAL: 2
> # PASS:  2
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  0
> # XPASS: 0
> # ERROR: 0
> 
> 
>
>
> Regards,
> Mahmood
>
>
>

Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Mahmood Naderan
disas command fails.

Program terminated with signal 4, Illegal instruction.
#0  0x00000000008da76e in ?? ()
(gdb) bt
#0  0x00000000008da76e in ?? ()
#1  0x00000000008da970 in ?? ()
#2  0x0000000000bfe9f8 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb) disas
No function contains program counter for selected frame.


>Btw, did you run some simple applications with openmpi 2.0.0 ?
>We do have bits of assembly code, and even if i do not believe they are
>specific to intel cpus, i might be wrong and that could be the root cause.

I didn't run the tests. But I am pretty sure that OpenMPI is working
because other applications (not Siesta) have no problems.
Please note that the CPUs are AMD. The frontend is an Opteron 6380 and the
compute nodes are 6282SE.

>Also, did you run
>make check
>After you built openmpi ?

All are OK. Please see below.


Testsuite summary for Open MPI 2.0.0

# TOTAL: 2
# PASS:  2
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0



Regards,
Mahmood

Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Gilles Gouaillardet
--core=... is the right syntax, sorry about that
No need to recompile with -g, binary is good enough here

Then you need to run
disas
in gdb, to disassemble the instruction at 0x08da76e
And then, still in gdb
info maps
or
show maps
To find out the library this instruction is coming from
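
For a process that is still alive, the same lookup can be done outside gdb against /proc/<PID>/maps. A minimal sketch, assuming the standard Linux maps format; mapping_for_addr is an illustrative name, and it does not help once only a core file is left:

```shell
# Print the permissions and file of the mapping that contains an address.
# $1 = address (e.g. 0x8da76e); /proc/<PID>/maps text is read on stdin.
# Addresses beyond the signed 64-bit range (such as the vsyscall page)
# are not handled reliably by shell arithmetic.
mapping_for_addr() {
    addr=$(( $1 ))
    while read -r range perms offset dev inode path; do
        start=$(( 0x${range%-*} ))
        end=$(( 0x${range#*-} ))
        if [ "$addr" -ge "$start" ] && [ "$addr" -lt "$end" ]; then
            printf '%s %s\n' "$perms" "${path:-[anonymous]}"
        fi
    done
}

# Intended use (not run here):
#   mapping_for_addr 0x8da76e < /proc/5383/maps
```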

OpenBLAS is fine, my question is if you compiled it by yourself, and on the 
same platform 

Btw, did you run some simple applications with openmpi 2.0.0 ?
We do have bits of assembly code, and even if i do not believe they are 
specific to intel cpus, i might be wrong and that could be the root cause.
Also, did you run
make check
After you built openmpi ?

Cheers,

Gilles

Mahmood Naderan  wrote:
>>gdb --pid=core.5383
>
>
>Are you sure about the syntax?
>
>PID must be a running process. I see --core which seems to be relevant here.
>
>
>Both OpenMPI and Siesta were compiled with O flags. This is not appropriate 
>for gdb. Should I compile both of them with debug symbols?
>
>
>>Btw, did you compile lapack and friends by yourself ?
>
>I use ScaLAPACK, which needs BLAS. I use OpenBLAS instead of netlib's BLAS.
>
>
>
>$ gdb --core=core.5383
>
>Try: yum --enablerepo='*-debug*' install 
>/usr/lib/debug/.build-id/e1/ddc85f7caa9f2571545a58479d64ba676217dd
>[New Thread 5383]
>[New Thread 5416]
>[New Thread 5401]
>[New Thread 5388]
>[New Thread 5407]
>[New Thread 5406]
>[New Thread 5418]
>[New Thread 5393]
>[New Thread 5391]
>[New Thread 5387]
>[New Thread 5405]
>[New Thread 5389]
>[New Thread 5408]
>[New Thread 5417]
>[New Thread 5394]
>[New Thread 5506]
>[New Thread 5404]
>[New Thread 5392]
>[New Thread 5410]
>[New Thread 5411]
>[New Thread 5395]
>[New Thread 5409]
>[New Thread 5403]
>[New Thread 5414]
>[New Thread 5396]
>[New Thread 5412]
>[New Thread 5419]
>[New Thread 5413]
>[New Thread 5509]
>[New Thread 5415]
>[New Thread 5397]
>[New Thread 5420]
>[New Thread 5398]
>[New Thread 5399]
>Core was generated by `/share/apps/siesta/siesta-4.0/tpar/transiesta'.
>Program terminated with signal 4, Illegal instruction.
>#0  0x00000000008da76e in ?? ()
>(gdb) bt
>#0  0x00000000008da76e in ?? ()
>#1  0x00000000008da970 in ?? ()
>#2  0x0000000000bfe9f8 in ?? ()
>#3  0x0000000000000000 in ?? ()
>(gdb)
>
>
>
>Regards,
>Mahmood
>
>

Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Mahmood Naderan
>gdb --pid=core.5383

Are you sure about the syntax?
PID must be a running process. I see --core which seems to be relevant
here.

Both OpenMPI and Siesta were compiled with O flags. This is not appropriate
for gdb. Should I compile both of them with debug symbols?

>Btw, did you compile lapack and friends by yourself ?
I use ScaLAPACK, which needs BLAS. I use OpenBLAS instead of netlib's BLAS.


$ gdb --core=core.5383

Try: yum --enablerepo='*-debug*' install
/usr/lib/debug/.build-id/e1/ddc85f7caa9f2571545a58479d64ba676217dd
[New Thread 5383]
[New Thread 5416]
[New Thread 5401]
[New Thread 5388]
[New Thread 5407]
[New Thread 5406]
[New Thread 5418]
[New Thread 5393]
[New Thread 5391]
[New Thread 5387]
[New Thread 5405]
[New Thread 5389]
[New Thread 5408]
[New Thread 5417]
[New Thread 5394]
[New Thread 5506]
[New Thread 5404]
[New Thread 5392]
[New Thread 5410]
[New Thread 5411]
[New Thread 5395]
[New Thread 5409]
[New Thread 5403]
[New Thread 5414]
[New Thread 5396]
[New Thread 5412]
[New Thread 5419]
[New Thread 5413]
[New Thread 5509]
[New Thread 5415]
[New Thread 5397]
[New Thread 5420]
[New Thread 5398]
[New Thread 5399]
Core was generated by `/share/apps/siesta/siesta-4.0/tpar/transiesta'.
Program terminated with signal 4, Illegal instruction.
#0  0x00000000008da76e in ?? ()
(gdb) bt
#0  0x00000000008da76e in ?? ()
#1  0x00000000008da970 in ?? ()
#2  0x0000000000bfe9f8 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb)

Regards,
Mahmood

Re: [OMPI users] Still "illegal instruction"

2016-09-15 Thread Gilles Gouaillardet
Mahmood,

You can

gdb --pid=core.5383
And then
bt
And then
disas
And "scroll" until the current instruction
Iirc, there is a star at the beginning of this line
You can also try
show maps
Or
info maps
(I cannot remember the syntax...)

Btw, did you compile lapack and friends by yourself ?

Mahmood Naderan  wrote:
>Hi,
>
>After upgrading OpenMPI (from 1.6.5 to 2.0.0) and my program (from 3.2 to 
>4.0), still the parallel run aborts with the "Illegal instruction" error in 
>the middle on the run.
>
>
>I wonder why this happens and how can I debug more? How can I find that this 
>error is related to the program itself, mpi or system libraries?
>
>
>Gilles gave a suggestion about using ulimit to create a core file 
>(https://mail-archive.com/users@lists.open-mpi.org/msg29919.html). Please see 
>the following:
>
>
>mahmood@cluster:tran$ cat sc.sh
>#!/bin/bash
>ulimit -c unlimited
>exec /share/apps/siesta/siesta-4.0/tpar/transiesta < trans-cc.fdf
>mahmood@cluster:tran$ cat hosts.txt
>compute-0-1
>mahmood@cluster:tran$ hostname
>cluster
>mahmood@cluster:tran$ #/share/apps/siesta/openmpi-2.0.0/bin/mpirun -hostfile 
>hosts.txt -np 15 sc.sh
>
>
>
>--
>mpirun noticed that process rank 0 with PID 5383 on node compute-0-1 exited on 
>signal 4 (Illegal instruction).
>--
>
>
>
>Now I see a file core.5383
>
>It is a very huge file (1290018816 bytes)!!! 
>
>How can I process that?
>
>
>Regards,
>Mahmood
>
>

[OMPI users] Still "illegal instruction"

2016-09-15 Thread Mahmood Naderan
Hi,
After upgrading OpenMPI (from 1.6.5 to 2.0.0) and my program (from 3.2 to
4.0), the parallel run still aborts with the "Illegal instruction" error in
the middle of the run.

I wonder why this happens and how I can debug it further. How can I find out
whether this error is related to the program itself, MPI or system libraries?

Gilles gave a suggestion about using ulimit to create a core file (
https://mail-archive.com/users@lists.open-mpi.org/msg29919.html). Please
see the following:

mahmood@cluster:tran$ cat sc.sh
#!/bin/bash
ulimit -c unlimited
exec /share/apps/siesta/siesta-4.0/tpar/transiesta < trans-cc.fdf
mahmood@cluster:tran$ cat hosts.txt
compute-0-1
mahmood@cluster:tran$ hostname
cluster
mahmood@cluster:tran$ #/share/apps/siesta/openmpi-2.0.0/bin/mpirun
-hostfile hosts.txt -np 15 sc.sh

--
mpirun noticed that process rank 0 with PID 5383 on node compute-0-1 exited
on signal 4 (Illegal instruction).
--



Now I see a file core.5383
It is a very huge file (1290018816 bytes)!!!
How can I process that?

Regards,
Mahmood