Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-24 Thread Gilles Gouaillardet
Siegmar,

How did you configure Open MPI? Which Java version did you use?

I just found a regression: you currently have to explicitly add
CFLAGS=-D_REENTRANT CPPFLAGS=-D_REENTRANT
to your configure command line.
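A full configure command might then look like this sketch (the install prefix and the extra options are illustrative assumptions, not taken from the thread; only the two -D_REENTRANT flags are the workaround itself):

```shell
# Sketch only: adjust prefix and options to your environment.
# The -D_REENTRANT flags work around the regression mentioned above.
../openmpi-dev-124-g91e9686/configure \
    --prefix=$HOME/local/openmpi \
    --enable-mpi-java \
    --enable-debug \
    CFLAGS=-D_REENTRANT CPPFLAGS=-D_REENTRANT
```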

If you want to debug this issue (I cannot reproduce it on a Solaris 11
x86 virtual machine),
you can apply the attached patch, make sure you configure with
--enable-debug, and run

OMPI_ATTACH=1 mpiexec -n 1 java InitFinalizeMain

Then you will need to attach gdb to the *java* process, set the _dbg
local variable to zero, and continue.
You should get a clean stack trace, and hopefully we will be able to help.

Cheers,

Gilles

On 2014/10/24 0:03, Siegmar Gross wrote:
> Hello Oscar,
>
> do you have time to look into my problem? Probably Takahiro has a
> point and gdb behaves differently on Solaris and Linux, so that
> the differing outputs have no meaning. I tried to debug my Java
> program, but without success so far, because I wasn't able to get
> into the Java program to set a breakpoint or to see the code. Have
> you succeeded in debugging an mpiJava program? If so, how must I call
> gdb (I normally use "gdb mpiexec" and then "run -np 1 java ...")?
> What can I do to get helpful information to track the error down?
> I have attached the error log file. Perhaps you can see if something
> is going wrong with the Java interface. Thank you very much for your
> help and any hints for the usage of gdb with mpiJava in advance.
> Please let me know if I can provide anything else.
>
>
> Kind regards
>
> Siegmar
>
>
>>> I think that it must have to do with MPI, because everything
>>> works fine on Linux and my Java program works fine with an older
>>> MPI version (openmpi-1.8.2a1r31804) as well.
>> Yes. I also think it must have to do with MPI,
>> but on the java process side, not the mpiexec process side.
>>
>> When you run a Java MPI program via mpiexec, the mpiexec process
>> launches a java process. When the java process (your
>> Java program) calls an MPI method, the native part (written in C/C++)
>> of the MPI library is called. It runs in the java process, not in
>> the mpiexec process. I suspect that part.
>>
>>> On Solaris things are different.
>> Are you saying the following difference?
>> After this line,
>>> 881 ORTE_ACTIVATE_JOB_STATE(jdata, ORTE_JOB_STATE_INIT);
>> Linux shows
>>> orte_job_state_to_str (state=1)
>>> at ../../openmpi-dev-124-g91e9686/orte/util/error_strings.c:217
>>> 217 switch(state) {
>> but Solaris shows
>>> orte_util_print_name_args (name=0x100118380 )
>>> at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:122
>>> 122 if (NULL == name) {
>> Each macro is defined as:
>>
>> #define ORTE_ACTIVATE_JOB_STATE(j, s)   \
>> do {\
>> orte_job_t *shadow=(j); \
>> opal_output_verbose(1, orte_state_base_framework.framework_output, \
>> "%s ACTIVATE JOB %s STATE %s AT %s:%d",  \
>> ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), \
>> (NULL == shadow) ? "NULL" : \
>> ORTE_JOBID_PRINT(shadow->jobid), \
>> orte_job_state_to_str((s)), \
>> __FILE__, __LINE__); \
>> orte_state.activate_job_state(shadow, (s)); \
>> } while(0);
>>
>> #define ORTE_NAME_PRINT(n) \
>> orte_util_print_name_args(n)
>>
>> #define ORTE_JOBID_PRINT(n) \
>> orte_util_print_jobids(n)
>>
>> I'm not sure, but I think the gdb on Solaris steps into
>> orte_util_print_name_args, but gdb on Linux doesn't step into
>> orte_util_print_name_args and orte_util_print_jobids for some
>> reason, or orte_job_state_to_str is evaluated before them.
>>
>> So I think it's not an important difference.
>>
>> You showed the following lines.
> orterun (argc=5, argv=0x7fffe0d8)
> at 
> ../../../../openmpi-dev-124-g91e9686/orte/tools/orterun/orterun.c:1084
> 1084while (orte_event_base_active) {
> (gdb) 
> 1085opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
> (gdb) 
>> I'm not familiar with this code, but I think this part (in the mpiexec
>> process) is only waiting for the java process to terminate (normally
>> or abnormally). So I think the problem is not in the mpiexec process
>> but in the java process.
>>
>> Regards,
>> Takahiro
>>
>>> Hi Takahiro,
>>>
mpiexec and java run as distinct processes. Your JRE message
says the java process raises SEGV, so you should trace the java
process, not the mpiexec process. Moreover, your JRE message
says the crash happened outside the Java Virtual Machine, in
native code, so the usual Java debugger is useless.
You should trace the native-code part of the java process.

Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-23 Thread Kawashima, Takahiro
Hi Siegmar,

The attached JRE log shows very important information.

When the JRE loads the MPI class, the JNI_OnLoad function in
libmpi_java.so (part of the Open MPI library; written in C) is called,
and probably the mca_base_var_cache_files function passes NULL
to the asprintf function. I don't know how this situation occurs.
You may be able to track this down by inserting debug printf calls in
the Open MPI code shown in the stack trace, or by using gdb or
something similar.

hs_err_pid13080.log:

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), 
si_addr=0x

Stack: [0x7b40,0x7b50],  sp=0x7b4fc730,  free 
space=1009k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libc.so.1+0x3c7f0]  strlen+0x50
C  [libc.so.1+0xaf640]  vsnprintf+0x84
C  [libc.so.1+0xaadb4]  vasprintf+0x20
C  [libc.so.1+0xaaf04]  asprintf+0x28
C  [libopen-pal.so.0.0.0+0xaf3cc]  mca_base_var_cache_files+0x160
C  [libopen-pal.so.0.0.0+0xaed90]  mca_base_var_init+0x4e8
C  [libopen-pal.so.0.0.0+0xb260c]  register_variable+0x214
C  [libopen-pal.so.0.0.0+0xb36a0]  mca_base_var_register+0x104
C  [libmpi_java.so.0.0.0+0x221e8]  JNI_OnLoad+0x128
C  [libjava.so+0x10860]  Java_java_lang_ClassLoader_00024NativeLibrary_load+0xb8
j  java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+-665819
j  java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+0
j  java.lang.ClassLoader.loadLibrary0(Ljava/lang/Class;Ljava/io/File;)Z+328
j  java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)V+290
j  java.lang.Runtime.loadLibrary0(Ljava/lang/Class;Ljava/lang/String;)V+54
j  java.lang.System.loadLibrary(Ljava/lang/String;)V+7
j  mpi.MPI.<clinit>()V+28


Regards,
Takahiro

> Hello Oscar,
> 
> do you have time to look into my problem? Probably Takahiro has a
> point and gdb behaves differently on Solaris and Linux, so that
> the differing outputs have no meaning. I tried to debug my Java
> program, but without success so far, because I wasn't able to get
> into the Java program to set a breakpoint or to see the code. Have
> you succeeded in debugging an mpiJava program? If so, how must I call
> gdb (I normally use "gdb mpiexec" and then "run -np 1 java ...")?
> What can I do to get helpful information to track the error down?
> I have attached the error log file. Perhaps you can see if something
> is going wrong with the Java interface. Thank you very much for your
> help and any hints for the usage of gdb with mpiJava in advance.
> Please let me know if I can provide anything else.
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> > > I think that it must have to do with MPI, because everything
> > > works fine on Linux and my Java program works fine with an older
> > > MPI version (openmpi-1.8.2a1r31804) as well.
> > 
> > Yes. I also think it must have to do with MPI,
> > but on the java process side, not the mpiexec process side.
> > 
> > When you run a Java MPI program via mpiexec, the mpiexec process
> > launches a java process. When the java process (your
> > Java program) calls an MPI method, the native part (written in C/C++)
> > of the MPI library is called. It runs in the java process, not in
> > the mpiexec process. I suspect that part.
> > 
> > > On Solaris things are different.
> > 
> > Are you saying the following difference?
> > After this line,
> > > 881 ORTE_ACTIVATE_JOB_STATE(jdata, ORTE_JOB_STATE_INIT);
> > Linux shows
> > > orte_job_state_to_str (state=1)
> > > at ../../openmpi-dev-124-g91e9686/orte/util/error_strings.c:217
> > > 217 switch(state) {
> > but Solaris shows
> > > orte_util_print_name_args (name=0x100118380 )
> > > at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:122
> > > 122 if (NULL == name) {
> > 
> > Each macro is defined as:
> > 
> > #define ORTE_ACTIVATE_JOB_STATE(j, s)   \
> > do {\
> > orte_job_t *shadow=(j); \
> > opal_output_verbose(1, orte_state_base_framework.framework_output, \
> > "%s ACTIVATE JOB %s STATE %s AT %s:%d", \
> > ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), \
> > (NULL == shadow) ? "NULL" : \
> > ORTE_JOBID_PRINT(shadow->jobid), \
> > orte_job_state_to_str((s)), \
> > __FILE__, __LINE__);\
> > orte_state.activate_job_state(shadow, (s)); \
> > } while(0);
> > 
> > #define ORTE_NAME_PRINT(n) \
> > orte_util_print_name_args(n)
> > 
> > #define ORTE_JOBID_PRINT(n) \
> > orte_util_print_jobids(n)
> > 
> > I'm not sure, but I think the gdb on Solaris steps 

Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-23 Thread Oscar Vega-Gisbert

Hello Siegmar,

If your Java program only calls MPI.Init and MPI.Finalize, you don't 
need to debug the Java side. The JNI layer is very thin, so I think the 
problem is not in Java. Also, if the process crashes on the JNI side, 
debugging Java won't provide you useful information.


But if you want to debug 2 processes, you can do the following.

You must launch 2 instances of the Java debugger (jdb) or NetBeans, 
Eclipse, ... listening on port 8000.

The 2 processes must be launched with the necessary parameters to attach 
to the listening debuggers:


mpirun -np 2 java -agentlib:jdwp=transport=dt_socket,\
server=n,address=localhost:8000 Hello

I checked it on NetBeans and it works.
Here you have more details about debugging:

http://docs.oracle.com/javase/8/docs/technotes/guides/jpda/conninv.html
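With jdb instead of an IDE, the setup above might look like this two-terminal sketch (a single rank shown for simplicity; with server=n the JVM is the side that connects out to the listening debugger):

```shell
# Terminal 1: start jdb listening on port 8000 for an incoming JVM connection
jdb -listen 8000

# Terminal 2: launch the MPI Java program; the JVM attaches back to jdb
mpiexec -np 1 java \
    -agentlib:jdwp=transport=dt_socket,server=n,address=localhost:8000 \
    Hello
```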

Regards,
Oscar

On 23/10/14 17:03, Siegmar Gross wrote:

Hello Oscar,

do you have time to look into my problem? Probably Takahiro has a
point and gdb behaves differently on Solaris and Linux, so that
the differing outputs have no meaning. I tried to debug my Java
program, but without success so far, because I wasn't able to get
into the Java program to set a breakpoint or to see the code. Have
you succeeded in debugging an mpiJava program? If so, how must I call
gdb (I normally use "gdb mpiexec" and then "run -np 1 java ...")?
What can I do to get helpful information to track the error down?
I have attached the error log file. Perhaps you can see if something
is going wrong with the Java interface. Thank you very much for your
help and any hints for the usage of gdb with mpiJava in advance.
Please let me know if I can provide anything else.


Kind regards

Siegmar



I think that it must have to do with MPI, because everything
works fine on Linux and my Java program works fine with an older
MPI version (openmpi-1.8.2a1r31804) as well.

Yes. I also think it must have to do with MPI,
but on the java process side, not the mpiexec process side.

When you run a Java MPI program via mpiexec, the mpiexec process
launches a java process. When the java process (your
Java program) calls an MPI method, the native part (written in C/C++)
of the MPI library is called. It runs in the java process, not in
the mpiexec process. I suspect that part.


On Solaris things are different.

Are you saying the following difference?
After this line,

881 ORTE_ACTIVATE_JOB_STATE(jdata, ORTE_JOB_STATE_INIT);

Linux shows

orte_job_state_to_str (state=1)
 at ../../openmpi-dev-124-g91e9686/orte/util/error_strings.c:217
217 switch(state) {

but Solaris shows

orte_util_print_name_args (name=0x100118380 )
 at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:122
122 if (NULL == name) {

Each macro is defined as:

#define ORTE_ACTIVATE_JOB_STATE(j, s)   \
 do {\
 orte_job_t *shadow=(j); \
 opal_output_verbose(1, orte_state_base_framework.framework_output, \
 "%s ACTIVATE JOB %s STATE %s AT %s:%d",  \
 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), \
 (NULL == shadow) ? "NULL" : \
 ORTE_JOBID_PRINT(shadow->jobid),\
 orte_job_state_to_str((s)), \
 __FILE__, __LINE__);   \
 orte_state.activate_job_state(shadow, (s)); \
 } while(0);

#define ORTE_NAME_PRINT(n) \
 orte_util_print_name_args(n)

#define ORTE_JOBID_PRINT(n) \
 orte_util_print_jobids(n)

I'm not sure, but I think the gdb on Solaris steps into
orte_util_print_name_args, but gdb on Linux doesn't step into
orte_util_print_name_args and orte_util_print_jobids for some
reason, or orte_job_state_to_str is evaluated before them.

So I think it's not an important difference.

You showed the following lines.

orterun (argc=5, argv=0x7fffe0d8)
 at

../../../../openmpi-dev-124-g91e9686/orte/tools/orterun/orterun.c:1084

1084while (orte_event_base_active) {
(gdb)
1085opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
(gdb)

I'm not familiar with this code, but I think this part (in the mpiexec
process) is only waiting for the java process to terminate (normally
or abnormally). So I think the problem is not in the mpiexec process
but in the java process.

Regards,
Takahiro


Hi Takahiro,


mpiexec and java run as distinct processes. Your JRE message
says the java process raises SEGV, so you should trace the java
process, not the mpiexec process. Moreover, your JRE message
says the crash happened outside the Java Virtual Machine, in
native code, so the usual Java debugger is useless.
You should trace the native-code part of the java process.
Unfortunately I don't know how to debug such a thing.

I think that it must have to do with MPI, 

Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris

2014-10-23 Thread Siegmar Gross
Hello Oscar,

do you have time to look into my problem? Probably Takahiro has a
point and gdb behaves differently on Solaris and Linux, so that
the differing outputs have no meaning. I tried to debug my Java
program, but without success so far, because I wasn't able to get
into the Java program to set a breakpoint or to see the code. Have
you succeeded in debugging an mpiJava program? If so, how must I call
gdb (I normally use "gdb mpiexec" and then "run -np 1 java ...")?
What can I do to get helpful information to track the error down?
I have attached the error log file. Perhaps you can see if something
is going wrong with the Java interface. Thank you very much for your
help and any hints for the usage of gdb with mpiJava in advance.
Please let me know if I can provide anything else.


Kind regards

Siegmar


> > I think that it must have to do with MPI, because everything
> > works fine on Linux and my Java program works fine with an older
> > MPI version (openmpi-1.8.2a1r31804) as well.
> 
> Yes. I also think it must have to do with MPI,
> but on the java process side, not the mpiexec process side.
> 
> When you run a Java MPI program via mpiexec, the mpiexec process
> launches a java process. When the java process (your
> Java program) calls an MPI method, the native part (written in C/C++)
> of the MPI library is called. It runs in the java process, not in
> the mpiexec process. I suspect that part.
> 
> > On Solaris things are different.
> 
> Are you saying the following difference?
> After this line,
> > 881 ORTE_ACTIVATE_JOB_STATE(jdata, ORTE_JOB_STATE_INIT);
> Linux shows
> > orte_job_state_to_str (state=1)
> > at ../../openmpi-dev-124-g91e9686/orte/util/error_strings.c:217
> > 217 switch(state) {
> but Solaris shows
> > orte_util_print_name_args (name=0x100118380 )
> > at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:122
> > 122 if (NULL == name) {
> 
> Each macro is defined as:
> 
> #define ORTE_ACTIVATE_JOB_STATE(j, s)   \
> do {\
> orte_job_t *shadow=(j); \
> opal_output_verbose(1, orte_state_base_framework.framework_output, \
> "%s ACTIVATE JOB %s STATE %s AT %s:%d",   \
> ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), \
> (NULL == shadow) ? "NULL" : \
> ORTE_JOBID_PRINT(shadow->jobid),  \
> orte_job_state_to_str((s)), \
> __FILE__, __LINE__);  \
> orte_state.activate_job_state(shadow, (s)); \
> } while(0);
> 
> #define ORTE_NAME_PRINT(n) \
> orte_util_print_name_args(n)
> 
> #define ORTE_JOBID_PRINT(n) \
> orte_util_print_jobids(n)
> 
> I'm not sure, but I think the gdb on Solaris steps into
> orte_util_print_name_args, but gdb on Linux doesn't step into
> orte_util_print_name_args and orte_util_print_jobids for some
> reason, or orte_job_state_to_str is evaluated before them.
> 
> So I think it's not an important difference.
> 
> You showed the following lines.
> > > > orterun (argc=5, argv=0x7fffe0d8)
> > > > at 
../../../../openmpi-dev-124-g91e9686/orte/tools/orterun/orterun.c:1084
> > > > 1084while (orte_event_base_active) {
> > > > (gdb) 
> > > > 1085opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
> > > > (gdb) 
> 
> I'm not familiar with this code, but I think this part (in the mpiexec
> process) is only waiting for the java process to terminate (normally
> or abnormally). So I think the problem is not in the mpiexec process
> but in the java process.
> 
> Regards,
> Takahiro
> 
> > Hi Takahiro,
> > 
> > > mpiexec and java run as distinct processes. Your JRE message
> > > says the java process raises SEGV, so you should trace the java
> > > process, not the mpiexec process. Moreover, your JRE message
> > > says the crash happened outside the Java Virtual Machine, in
> > > native code, so the usual Java debugger is useless.
> > > You should trace the native-code part of the java process.
> > > Unfortunately I don't know how to debug such a thing.
> > 
> > I think that it must have to do with MPI, because everything
> > works fine on Linux and my Java program works fine with an older
> > MPI version (openmpi-1.8.2a1r31804) as well.
> > 
> > linpc1 x 112 mpiexec -np 1 java InitFinalizeMain
> > Hello!
> > linpc1 x 113 
> > 
> > Therefore I single stepped through the program on Linux as well
> > and found a difference launching the process. On Linux I get the
> > following sequence.
> > 
> > Breakpoint 1, rsh_launch (jdata=0x614aa0)
> > at 
../../../../../openmpi-dev-124-g91e9686/orte/mca/plm/rsh/plm_rsh_module.c:876
> > 876 if (ORTE_FLAG_TEST(jdata, ORTE_JOB_FLAG_RESTART)) {
> > (gdb) s
> > 881