Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-03 Thread Nate Chambers
Gilles,
Yes, I saw that github thread, but wasn't certain this was the same issue.
Very possible that it is. Oddly enough, that github code doesn't crash for
us.

Adding a sleep call doesn't help. It's actually now crashing on the
MPI.init(args) call itself, and the JVM is reporting the error. Earlier it
would get past this point. I'm not certain why this has changed all of a
sudden. We did change a bit of our unrelated java code...

Below is the output. It matches that previous report more closely.


Nate

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x2b00ad2807cf, pid=28537, tid=47281916847872
#
# JRE version: 7.0_21-b11
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.21-b01 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x57c7cf]  jni_GetStringUTFChars+0x9f
#
# Failed to write core dump. Core dumps have been disabled. To enable core
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /gpfs/home/nchamber/hs_err_pid28537.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x2b198c15b7cf, pid=28538, tid=47388736182016
#
# JRE version: 7.0_21-b11
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.21-b01 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x57c7cf]  jni_GetStringUTFChars+0x9f
#
# Failed to write core dump. Core dumps have been disabled. To enable core
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /gpfs/home/nchamber/hs_err_pid28538.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 28537 on node r3n70 exited on
signal 6 (Aborted).
--------------------------------------------------------------------------
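
(Side note: the JVM report above says core dumps are disabled. To capture
one on the next crash, the limit can be raised inside the launch command --
a sketch, assuming bash is available on the compute nodes:

mpirun -np 2 bash -c 'ulimit -c unlimited && exec java -mx4g FeaturizeDay datadir/ days.txt'

Each rank then raises its soft core-dump limit, if the hard limit allows,
before exec'ing the JVM.)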

On Mon, Aug 3, 2015 at 2:47 PM, Gilles Gouaillardet wrote:

> Nate,
>
> a similar issue has already been reported at
> https://github.com/open-mpi/ompi/issues/369, but we have
> not yet been able to figure out what is going wrong.
>
> right after MPI_Init(), can you add
> Thread.sleep(5000);
> and see if it helps ?
>
> Cheers,
>
> Gilles
>
>
> On 8/4/2015 8:36 AM, Nate Chambers wrote:
>
> We've been struggling with this error for a while, so hoping someone more
> knowledgeable can help!
>
> Our java MPI code exits with a segfault during its normal operation, *but
> the segfault occurs before our code ever uses MPI functionality like
> sending/receiving.* We've removed all message calls and any use of
> MPI.COMM_WORLD from the code. The segfault occurs if we call MPI.init(args)
> in our code, and does not if we comment that line out. Further vexing us,
> the crash doesn't happen at the point of the MPI.init call, but later on in
> the program. I don't have an easy-to-run example here because our non-MPI
> code is so large and complicated. We have run simpler test programs with
> MPI and the segfault does not occur.
>
> We have isolated the line where the segfault occurs. However, if we
> comment that out, the program will run longer, but then segfault at a
> later point in the code (seemingly random, yet deterministic). Does anyone
> have tips on how to debug this? We have tried several flags with mpirun,
> but found no good clues.
>
> We have also tried several MPI versions, including stable 1.8.7 and the
> most recent 1.8.8rc1.
>
>
> ATTACHED
> - config.log from installation
> - output from `ompi_info -all`
>
>
> OUTPUT FROM RUNNING
>
> > mpirun -np 2 java -mx4g FeaturizeDay datadir/ days.txt
> ...
> some normal output from our code
> ...
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 29646 on node r9n69 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27386.php
>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/08/27387.php
>


Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-03 Thread Gilles Gouaillardet

Nate,

a similar issue has already been reported at
https://github.com/open-mpi/ompi/issues/369, but we have
not yet been able to figure out what is going wrong.

right after MPI_Init(), can you add
Thread.sleep(5000);
and see if it helps ?
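
For concreteness, here is a minimal sketch of where that pause would go,
assuming the 1.8-series Java bindings (where MPI.Init takes and returns
the argument array):

import mpi.MPI;

public class InitSleepTest {
    public static void main(String[] args) throws Exception {
        args = MPI.Init(args);  // initialization only, no communication
        Thread.sleep(5000);     // the suggested pause right after init
        // ... rest of the application ...
        MPI.Finalize();
    }
}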

Cheers,

Gilles

On 8/4/2015 8:36 AM, Nate Chambers wrote:
We've been struggling with this error for a while, so hoping someone 
more knowledgeable can help!


Our java MPI code exits with a segfault during its normal operation, 
*but the segfault occurs before our code ever uses MPI functionality 
like sending/receiving.* We've removed all message calls and any use 
of MPI.COMM_WORLD from the code. The segfault occurs if we call 
MPI.init(args) in our code, and does not if we comment that line out. 
Further vexing us, the crash doesn't happen at the point of the 
MPI.init call, but later on in the program. I don't have an 
easy-to-run example here because our non-MPI code is so large and 
complicated. We have run simpler test programs with MPI and the 
segfault does not occur.


We have isolated the line where the segfault occurs. However, if we 
comment that out, the program will run longer, but then segfault at a 
later point in the code (seemingly random, yet deterministic). Does 
anyone have tips on how to debug this? We have tried several flags 
with mpirun, but found no good clues.


We have also tried several MPI versions, including stable 1.8.7 and 
the most recent 1.8.8rc1.



ATTACHED
- config.log from installation
- output from `ompi_info -all`


OUTPUT FROM RUNNING

> mpirun -np 2 java -mx4g FeaturizeDay datadir/ days.txt
...
some normal output from our code
...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 29646 on node r9n69 exited 
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------





_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/08/27386.php




[OMPI users] segfault on java binding from MPI.init()

2015-08-03 Thread Nate Chambers
We've been struggling with this error for a while, so hoping someone more
knowledgeable can help!

Our java MPI code exits with a segfault during its normal operation, *but
the segfault occurs before our code ever uses MPI functionality like
sending/receiving.* We've removed all message calls and any use of
MPI.COMM_WORLD from the code. The segfault occurs if we call MPI.init(args)
in our code, and does not if we comment that line out. Further vexing us,
the crash doesn't happen at the point of the MPI.init call, but later on in
the program. I don't have an easy-to-run example here because our non-MPI
code is so large and complicated. We have run simpler test programs with
MPI and the segfault does not occur.

We have isolated the line where the segfault occurs. However, if we comment
that out, the program will run longer, but then segfault at a later point
in the code (seemingly random, yet deterministic). Does anyone have tips on
how to debug this? We have tried several flags with mpirun, but found no
good clues.

We have also tried several MPI versions, including stable 1.8.7 and the
most recent 1.8.8rc1.
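
For reference, a stripped-down sketch of the pattern described above (the
class and helper names here are placeholders; the real application is far
larger):

import mpi.MPI;

public class SegfaultSkeleton {
    public static void main(String[] args) throws Exception {
        args = MPI.Init(args);   // commenting this out makes the crash go away
        runNonMpiWorkload();     // the segfault strikes somewhere in here
        MPI.Finalize();          // never reached when the JVM crashes
    }

    static void runNonMpiWorkload() {
        // placeholder for the large, MPI-free application code
    }
}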


ATTACHED
- config.log from installation
- output from `ompi_info -all`


OUTPUT FROM RUNNING

> mpirun -np 2 java -mx4g FeaturizeDay datadir/ days.txt
...
some normal output from our code
...
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 29646 on node r9n69 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------


config.log.bz2
Description: BZip2 compressed data


ompi_info.txt.bz2
Description: BZip2 compressed data


Re: [OMPI users] pbs vs openmpi node allocation

2015-08-03 Thread Gus Correa

Hi Abhisek

On 08/03/2015 12:59 PM, abhisek Mondal wrote:

Hi,

   I'm using openmpi-1.6.4 to distribute a job across 2 different nodes
using this command:
"mpirun --hostfile myhostfile -np 10 nwchem my_code.nw"
Here, "myhostfile" contains:
cx0937 slots=5
cx0934 slots=5


I am assuming by pbs you mean Torque.
If your Open MPI was built with Torque support (--with-tm),
then you don't even need the --hostfile option
(and probably shouldn't use it).
Unless nwchem behaves in a very non-standard way,
which I don't really know.

Open MPI will use the nodes provided by Torque.

To check this, do:

ompi_info |grep tm

In the Open MPI parlance Torque is "tm".



But as I have to submit the jobs using a .pbs script, I'm wondering in
this case how "mpirun" is going to choose the nodes (free node allocation
is done by pbs) from "myhostfile".
I mean, does "mpirun" wait until the specific nodes (as mentioned in
myhostfile) become free and then start?
How can I forward the allocated node names (by pbs) to the mpirun command?

A little light on this matter would be really great.



Your script only starts after Torque allocates the nodes and starts the 
script on the first node.

mpirun doesn't choose the nodes; it uses the ones Torque allocated.
If you are using Torque, it may be worth looking into some of its 
environment variables.


"man qsub" will tell you a lot about them, and probably will clarify
many things more.

Some very useful ones are:

   PBS_O_WORKDIR
      the absolute path of the current working directory of the
      qsub command.

   PBS_JOBID
      the job identifier assigned to the job by the batch system.

   PBS_JOBNAME
      the job name supplied by the user.

   PBS_NODEFILE
      the name of the file containing the list of nodes assigned
      to the job (for parallel and cluster systems).



***

In your script you can

cd $PBS_O_WORKDIR

as by default Torque puts you in your home directory on the compute
node, which may not be where you want to be.


Another way to document the nodes that you're using is to put this
line in your script:

cat $PBS_NODEFILE

which will list the nodes (each node repeated as many times as the
cores/CPUs you requested from it).  Actually, if you ever want to use the
mpirun --hostfile option, the actual node file would be $PBS_NODEFILE.
[You don't need to do it if Open MPI was built with Torque support.]
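
Putting those pieces together, a minimal job script might look like the
sketch below (the job name and resource request are placeholders matching
your two-node, five-slot example):

#!/bin/bash
#PBS -N my_nwchem_job
#PBS -l nodes=2:ppn=5

# Start where qsub was invoked, not in $HOME on the compute node.
cd $PBS_O_WORKDIR

# Document which nodes Torque assigned to this job.
cat $PBS_NODEFILE

# With Torque (tm) support built in, mpirun takes the node list and
# slot counts from Torque itself, so no --hostfile or -np is needed.
mpirun nwchem my_code.nw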



I hope this helps.
Gus Correa



Thank you.

--
Abhisek Mondal
Research Fellow
Structural Biology and Bioinformatics
Indian Institute of Chemical Biology
Kolkata 700032
INDIA


_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/08/27383.php





Re: [OMPI users] pbs vs openmpi node allocation

2015-08-03 Thread Andrus, Brian Contractor
Abhisek,

Generally, if you built openmpi with PBS support, it will automatically use
the appropriate nodes with 'mpirun'.
If not, you can use the environment variables provided to your session:

mpirun --hostfile $PBS_NODEFILE -np $(cat $PBS_NODEFILE|wc -l) 
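
For example, inside a job script (a sketch assuming a Torque/PBS session,
with the nwchem command taken from your original post):

#!/bin/bash
#PBS -l nodes=2:ppn=5
cd $PBS_O_WORKDIR

# Without tm support, hand Torque's node list to mpirun explicitly.
# $PBS_NODEFILE holds one line per allocated slot, so its line count
# is the process count.
NP=$(cat $PBS_NODEFILE | wc -l)
mpirun --hostfile $PBS_NODEFILE -np $NP nwchem my_code.nw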


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of abhisek Mondal
Sent: Monday, August 03, 2015 10:00 AM
To: Open MPI Users
Subject: [OMPI users] pbs vs openmpi node allocation

Hi,

  I'm using openmpi-1.6.4 to distribute a job across 2 different nodes using this 
command:
"mpirun --hostfile myhostfile -np 10 nwchem my_code.nw"
Here, "myhostfile" contains:
cx0937 slots=5
cx0934 slots=5

But as I have to submit the jobs using a .pbs script, I'm wondering in this case 
how "mpirun" is going to choose the nodes (free node allocation is done by pbs) 
from "myhostfile".
I mean, does "mpirun" wait until the specific nodes (as mentioned in 
myhostfile) become free and then start?
How can I forward the allocated node names (by pbs) to the mpirun command?

A little light on this matter would be really great.

Thank you.

--
Abhisek Mondal
Research Fellow
Structural Biology and Bioinformatics
Indian Institute of Chemical Biology
Kolkata 700032
INDIA


[OMPI users] pbs vs openmpi node allocation

2015-08-03 Thread abhisek Mondal
Hi,

  I'm using openmpi-1.6.4 to distribute a job across 2 different nodes using
this command:
"mpirun --hostfile myhostfile -np 10 nwchem my_code.nw"
Here, "myhostfile" contains:
cx0937 slots=5
cx0934 slots=5

But as I have to submit the jobs using a .pbs script, I'm wondering in this
case how "mpirun" is going to choose the nodes (free node allocation is done
by pbs) from "myhostfile".
I mean, does "mpirun" wait until the specific nodes (as mentioned in
myhostfile) become free and then start?
How can I forward the allocated node names (by pbs) to the mpirun command?

A little light on this matter would be really great.

Thank you.

-- 
Abhisek Mondal
Research Fellow
Structural Biology and Bioinformatics
Indian Institute of Chemical Biology
Kolkata 700032
INDIA