Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
Siegmar,

How did you configure Open MPI? Which Java version did you use? I just found a regression: you currently have to explicitly add CFLAGS=-D_REENTRANT CPPFLAGS=-D_REENTRANT to your configure command line.

If you want to debug this issue (I cannot reproduce it on a Solaris 11 x86 virtual machine), you can apply the attached patch, make sure you configure with --enable-debug, and run

OMPI_ATTACH=1 mpiexec -n 1 java InitFinalizeMain

Then you will need to attach gdb to the *java* process, set the _dbg local variable to zero, and continue. You should get a clean stack trace, and hopefully we will be able to help.

Cheers,

Gilles

On 2014/10/24 0:03, Siegmar Gross wrote:
> Hello Oscar,
>
> do you have time to look into my problem? Probably Takahiro has a
> point and gdb behaves differently on Solaris and Linux, so that
> the differing outputs have no meaning. I tried to debug my Java
> program, but without success so far, because I wasn't able to get
> into the Java program to set a breakpoint or to see the code. Have
> you succeeded in debugging an mpiJava program? If so, how must I call
> gdb (I normally use "gdb mpiexec" and then "run -np 1 java ...")?
> What can I do to get helpful information to track the error down?
> I have attached the error log file. Perhaps you can see whether
> something is going wrong with the Java interface. Thank you very much
> for your help and any hints on the usage of gdb with mpiJava in
> advance. Please let me know if I can provide anything else.
>
> Kind regards
>
> Siegmar
>
>>> I think that it must have to do with MPI, because everything
>>> works fine on Linux and my Java program works fine with an older
>>> MPI version (openmpi-1.8.2a1r31804) as well.
>>
>> Yes. I also think it must have to do with MPI.
>> But on the java process side, not the mpiexec process side.
>>
>> When you run a Java MPI program via mpiexec, the mpiexec process
>> launches a java process.
>> When the java process (your Java program) calls an MPI method, the
>> native part (written in C/C++) of the MPI library is called. It runs
>> in the java process, not in the mpiexec process. I suspect that part.
>>
>>> On Solaris things are different.
>>
>> Are you referring to the following difference?
>> After this line,
>>> 881     ORTE_ACTIVATE_JOB_STATE(jdata, ORTE_JOB_STATE_INIT);
>> Linux shows
>>> orte_job_state_to_str (state=1)
>>>     at ../../openmpi-dev-124-g91e9686/orte/util/error_strings.c:217
>>> 217     switch(state) {
>> but Solaris shows
>>> orte_util_print_name_args (name=0x100118380)
>>>     at ../../openmpi-dev-124-g91e9686/orte/util/name_fns.c:122
>>> 122     if (NULL == name) {
>>
>> Each macro is defined as:
>>
>> #define ORTE_ACTIVATE_JOB_STATE(j, s)                                  \
>>     do {                                                               \
>>         orte_job_t *shadow=(j);                                        \
>>         opal_output_verbose(1, orte_state_base_framework.framework_output, \
>>                             "%s ACTIVATE JOB %s STATE %s AT %s:%d",    \
>>                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),        \
>>                             (NULL == shadow) ? "NULL" :                \
>>                             ORTE_JOBID_PRINT(shadow->jobid),           \
>>                             orte_job_state_to_str((s)),                \
>>                             __FILE__, __LINE__);                       \
>>         orte_state.activate_job_state(shadow, (s));                    \
>>     } while(0);
>>
>> #define ORTE_NAME_PRINT(n) \
>>     orte_util_print_name_args(n)
>>
>> #define ORTE_JOBID_PRINT(n) \
>>     orte_util_print_jobids(n)
>>
>> I'm not sure, but I think gdb on Solaris steps into
>> orte_util_print_name_args, while gdb on Linux doesn't step into
>> orte_util_print_name_args and orte_util_print_jobids for some
>> reason, or orte_job_state_to_str is evaluated before them.
>>
>> So I think it's not an important difference.
>>
>> You showed the following lines.
>>> orterun (argc=5, argv=0x7fffe0d8)
>>>     at ../../../../openmpi-dev-124-g91e9686/orte/tools/orterun/orterun.c:1084
>>> 1084    while (orte_event_base_active) {
>>> (gdb)
>>> 1085    opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
>>> (gdb)
>> I'm not familiar with this code, but I think this part (in the
>> mpiexec process) is only waiting for the java process to terminate
>> (normally or abnormally).
>> So I think the problem is not in the mpiexec process
>> but in the java process.
>>
>> Regards,
>> Takahiro
>>
>>> Hi Takahiro,
>>>
>>>> mpiexec and java run as distinct processes. Your JRE message
>>>> says the java process raises SEGV. So you should trace the java
>>>> process, not the mpiexec process. And more, your JRE message says
>>>> the crash happened outside the Java Virtual Machine in native
>>>> code, so a usual Java program debugger is useless. You should
>>>> trace the native code part of the java process.
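Gilles's OMPI_ATTACH workflow above relies on a wait-for-debugger loop guarded by a `_dbg` variable. The actual patch is attached to his mail and not shown here; the following is only a minimal sketch of the usual pattern, with a hypothetical function name:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hedged sketch of a debugger attach-wait loop (the real patch may
 * differ; wait_for_debugger is a made-up name for illustration).
 * Returns 1 if it waited for a debugger, 0 if OMPI_ATTACH was unset. */
static volatile int _dbg;

static int wait_for_debugger(void)
{
    if (NULL == getenv("OMPI_ATTACH")) {
        return 0;                 /* no attach requested, run normally */
    }
    _dbg = 1;
    printf("pid %d waiting; attach gdb and `set var _dbg = 0`\n",
           (int)getpid());
    while (_dbg) {
        sleep(1);                 /* spin until gdb clears the flag */
    }
    return 1;
}
```

With --enable-debug symbols present, `gdb -p <pid>` on the java process, `set var _dbg = 0`, and `continue` let the process run on to the actual fault with a clean stack.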
Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
Hi Siegmar,

The attached JRE log shows very important information. When the JRE loads the MPI class, the JNI_OnLoad function in libmpi_java.so (Open MPI library; written in C) is called, and probably the mca_base_var_cache_files function passes NULL to the asprintf function. I don't know how this situation occurs. You may be able to track it down by inserting debug printf calls in the Open MPI code shown in the stack trace, or by using gdb or something.

hs_err_pid13080.log:

siginfo: si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x

Stack: [0x7b40,0x7b50], sp=0x7b4fc730, free space=1009k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libc.so.1+0x3c7f0]  strlen+0x50
C  [libc.so.1+0xaf640]  vsnprintf+0x84
C  [libc.so.1+0xaadb4]  vasprintf+0x20
C  [libc.so.1+0xaaf04]  asprintf+0x28
C  [libopen-pal.so.0.0.0+0xaf3cc]  mca_base_var_cache_files+0x160
C  [libopen-pal.so.0.0.0+0xaed90]  mca_base_var_init+0x4e8
C  [libopen-pal.so.0.0.0+0xb260c]  register_variable+0x214
C  [libopen-pal.so.0.0.0+0xb36a0]  mca_base_var_register+0x104
C  [libmpi_java.so.0.0.0+0x221e8]  JNI_OnLoad+0x128
C  [libjava.so+0x10860]  Java_java_lang_ClassLoader_00024NativeLibrary_load+0xb8
j  java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+-665819
j  java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+0
j  java.lang.ClassLoader.loadLibrary0(Ljava/lang/Class;Ljava/io/File;)Z+328
j  java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)V+290
j  java.lang.Runtime.loadLibrary0(Ljava/lang/Class;Ljava/lang/String;)V+54
j  java.lang.System.loadLibrary(Ljava/lang/String;)V+7
j  mpi.MPI.<clinit>()V+28

Regards,
Takahiro

> Hello Oscar,
> [Siegmar's message and the earlier exchange, quoted in full above; trimmed here]
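Takahiro's stack trace shows strlen called from inside vasprintf. Passing NULL for a %s conversion is undefined behaviour: glibc happens to print "(null)", but Solaris libc hands the pointer straight to strlen, which faults exactly as in hs_err_pid13080.log. A minimal sketch of a defensive guard (the wrapper names are invented for illustration, not Open MPI API):

```c
#define _GNU_SOURCE               /* asprintf is a GNU/BSD extension */
#include <stdio.h>
#include <stdlib.h>

/* Substitute a visible marker so NULL never reaches a %s conversion. */
static const char *str_or_marker(const char *s)
{
    return s ? s : "(null)";
}

/* Hypothetical example of the kind of call mca_base_var_cache_files
 * makes: build a message from a path that might be NULL. */
static char *describe_file(const char *path)
{
    char *out = NULL;
    if (asprintf(&out, "cache file: %s", str_or_marker(path)) < 0) {
        return NULL;              /* allocation failure */
    }
    return out;                   /* caller frees */
}
```

On Solaris the unguarded version, `asprintf(&out, "cache file: %s", NULL)`, would crash with the strlen frame seen in the log.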
Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
Hello Siegmar,

If your Java program only calls MPI.Init and MPI.Finalize, you don't need to debug the Java side. The JNI layer is very thin, so I think the problem is not in Java. Also, if the process crashes on the JNI side, debugging won't provide you useful information.

But if you want to debug the 2 processes, you can do the following. You must launch 2 instances of the Java debugger (jdb) or NetBeans, Eclipse, ... listening on port 8000. The 2 processes must be launched with the necessary parameters to attach to the listening debuggers:

mpirun -np 2 java -agentlib:jdwp=transport=dt_socket,\
server=n,address=localhost:8000 Hello

I checked it on NetBeans and it works. Here you have more details about debugging:

http://docs.oracle.com/javase/8/docs/technotes/guides/jpda/conninv.html

Regards,
Oscar

On 23/10/14 17:03, Siegmar Gross wrote:
> Hello Oscar,
> [Siegmar's message and the earlier exchange, quoted in full above; trimmed here]
Re: [OMPI users] which info is needed for SIGSEGV in Java for openmpi-dev-124-g91e9686 on Solaris
Hello Oscar,

do you have time to look into my problem? Probably Takahiro has a point and gdb behaves differently on Solaris and Linux, so that the differing outputs have no meaning. I tried to debug my Java program, but without success so far, because I wasn't able to get into the Java program to set a breakpoint or to see the code. Have you succeeded in debugging an mpiJava program? If so, how must I call gdb (I normally use "gdb mpiexec" and then "run -np 1 java ...")? What can I do to get helpful information to track the error down? I have attached the error log file. Perhaps you can see whether something is going wrong with the Java interface. Thank you very much for your help and any hints on the usage of gdb with mpiJava in advance. Please let me know if I can provide anything else.

Kind regards

Siegmar

> [Takahiro's reply, quoted in full above; trimmed here]
>
> > Hi Takahiro,
> >
> > > mpiexec and java run as distinct processes. Your JRE message
> > > says the java process raises SEGV. So you should trace the java
> > > process, not the mpiexec process. And more, your JRE message
> > > says the crash happened outside the Java Virtual Machine in
> > > native code. So a usual Java program debugger is useless.
> > > You should trace the native code part of the java process.
> > > Unfortunately I don't know how to debug such one.
> >
> > I think that it must have to do with MPI, because everything
> > works fine on Linux and my Java program works fine with an older
> > MPI version (openmpi-1.8.2a1r31804) as well.
> >
> > linpc1 x 112 mpiexec -np 1 java InitFinalizeMain
> > Hello!
> > linpc1 x 113
> >
> > Therefore I single stepped through the program on Linux as well
> > and found a difference launching the process. On Linux I get the
> > following sequence.
> >
> > Breakpoint 1, rsh_launch (jdata=0x614aa0)
> >     at ../../../../../openmpi-dev-124-g91e9686/orte/mca/plm/rsh/plm_rsh_module.c:876
> > 876     if (ORTE_FLAG_TEST(jdata, ORTE_JOB_FLAG_RESTART)) {
> > (gdb) s
> > 881
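Siegmar's trace stops while stepping into line 881, the ORTE_ACTIVATE_JOB_STATE macro quoted earlier in the thread. Takahiro's point that gdb may enter orte_util_print_name_args on one platform and orte_job_state_to_str on another comes down to C leaving the evaluation order of function arguments unspecified. A small self-contained demonstration (all names invented for illustration):

```c
#include <stdio.h>

/* C does not specify the order in which a call's arguments are
 * evaluated, so when gdb steps into a call like
 * opal_output_verbose(..., ORTE_NAME_PRINT(...), orte_job_state_to_str(...))
 * the first helper it enters can differ between compilers.
 * trace() records the order this compiler actually chose. */
static int order[3];
static int next = 0;

static int trace(int id)
{
    order[next++] = id;   /* remember when this argument was evaluated */
    return id;
}

static int sum3(int a, int b, int c)
{
    return a + b + c;
}

static void show_order(void)
{
    /* The sum is always the same; the printed sequence is
     * compiler-dependent (gcc, for instance, often goes right to left). */
    printf("result %d, evaluated as %d,%d,%d\n",
           sum3(trace(1), trace(2), trace(3)),
           order[0], order[1], order[2]);
}
```

Whatever sequence show_order prints, the result is identical, which is why, as Takahiro says, the differing step order is not an important difference.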