Re: [OMPI users] No core dump in some cases

2016-05-16 Thread Dave Love
Gilles Gouaillardet  writes:

> Are you sure ulimit -c unlimited is *really* applied on all hosts
>
>
> can you please run the simple program below and confirm that ?

Nothing specifically wrong with that, but it's worth installing
procenv(1) as a general solution to checking the (generalized)
environment of a job.  It's packaged for Debian/Ubuntu and Fedora/EPEL,
at least.
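
As a sketch (the host names below are placeholders), running it under
mpirun shows the environment and resource limits each rank actually
inherits on every node:

  mpirun -np 2 -host nodeA,nodeB procenv

The resource-limits section of its output should then report the core
file size limit that was really in effect on each host.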


Re: [OMPI users] No core dump in some cases

2016-05-12 Thread Gilles Gouaillardet
>>> >>> POSIX message queues (bytes, -q) 819200
>>> >>> real-time priority  (-r) 0
>>> >>> stack size  (kbytes, -s) 8192
>>> >>> cpu time   (seconds, -t) unlimited
>>> >>> max user processes  (-u) 4096
>>> >>> virtual memory  (kbytes, -v) unlimited
>>> >>> file locks  (-x) unlimited
>>> >>> [durga@smallMPI ~]$
>>> >>>
>>> >>>
>>> >>> I do realize that my setup is very unusual (I am a quasi-developer
>>> of MPI
>>> >>> whereas most other folks in this list are likely end-users), but
>>> somehow
>>> >>> just disabling this 'execinfo' MCA would allow me to make progress
>>> (and also
>>> >>> find out why/where MPI_Init() is crashing!). Is there any way I can
>>> do that?
>>> >>>
>>> >>> Thank you
>>> >>> Durga
>>> >>>
>>> >>> The surgeon general advises you to eat right, exercise regularly and
>>> quit
>>> >>> ageing.
>>> >>>
>>> >>> On Wed, May 11, 2016 at 8:58 PM, Gilles Gouaillardet <
>>> gil...@rist.or.jp >
>>> >>> wrote:
>>> >>>>
>>> >>>> Are you sure ulimit -c unlimited is *really* applied on all hosts
>>> >>>>
>>> >>>>
>>> >>>> can you please run the simple program below and confirm that ?
>>> >>>>
>>> >>>>
>>> >>>> Cheers,
>>> >>>>
>>> >>>>
>>> >>>> Gilles
>>> >>>>
>>> >>>>
>>> >>>> #include <stdio.h>
>>> >>>> #include <sys/time.h>
>>> >>>> #include <sys/resource.h>
>>> >>>> #include <mpi.h>
>>> >>>>
>>> >>>> int main(int argc, char *argv[]) {
>>> >>>> struct rlimit rlim;
>>> >>>> char * c = (char *)0;
>>> >>>> getrlimit(RLIMIT_CORE, &rlim);
>>> >>>> printf ("before MPI_Init : %d %d\n", rlim.rlim_cur,
>>> rlim.rlim_max);
>>> >>>> MPI_Init(&argc, &argv);
>>> >>>> getrlimit(RLIMIT_CORE, &rlim);
>>> >>>> printf ("after MPI_Init : %d %d\n", rlim.rlim_cur,
>>> rlim.rlim_max);
>>> >>>> *c = 0;
>>> >>>> MPI_Finalize();
>>> >>>> return 0;
>>> >>>> }
>>> >>>>
>>> >>>>
>>> >>>> On 5/12/2016 4:22 AM, dpchoudh . wrote:
>>> >>>>
>>> >>>> Hello Gilles
>>> >>>>
>>> >>>> Thank you for the advice. However, that did not seem to make any
>>> >>>> difference. Here is what I did (on the cluster that generates .btr
>>> files for
>>> >>>> core dumps):
>>> >>>>
>>> >>>> [durga@smallMPI git]$ ompi_info --all | grep opal_signal
>>> >>>>MCA opal base: parameter "opal_signal" (current value:
>>> >>>> "6,7,8,11", data source: default, level: 3 user/all, type: string)
>>> >>>> [durga@smallMPI git]$
>>> >>>>
>>> >>>>
>>> >>>> According to <signal.h>, signals 6,7,8,11 are these:
>>> >>>>
>>> >>>> #define SIGABRT    6    /* Abort (ANSI).  */
>>> >>>> #define SIGBUS     7    /* BUS error (4.2 BSD).  */
>>> >>>> #define SIGFPE     8    /* Floating-point exception (ANSI).  */
>>> >>>> #define SIGSEGV    11   /* Segmentation violation (ANSI).  */
>>> >>>>
>>> >>>> And thus I added the following just after MPI_Init()
>>> >>>>
>>> >>>> MPI_Init(&argc, &argv);
>>> >>>> signal(SIGABRT, SIG_DFL);
>>> >>>> signal(SIGBUS, SIG_DFL);
>>> >>>> signal(SIGFPE, SIG_DFL);
>>> >>>> signal(SIGSEGV, SIG_DFL);
>>> >>>> signal(SIGTERM, SIG_DFL);

Re: [OMPI users] OMPI users] No core dump in some cases

2016-05-12 Thread Gilles Gouaillardet
>>>> data seg size           (kbytes, -d) unlimited
>>>> scheduling priority             (-e) 0
>>>> file size               (blocks, -f) unlimited
>>>> pending signals                 (-i) 216524
>>>> max locked memory       (kbytes, -l) unlimited
>>>> max memory size         (kbytes, -m) unlimited
>>>> open files                      (-n) 1024
>>>> pipe size            (512 bytes, -p) 8
>>>> POSIX message queues     (bytes, -q) 819200
>>>> real-time priority              (-r) 0
>>>> stack size              (kbytes, -s) 8192
>>>> cpu time               (seconds, -t) unlimited
>>>> max user processes              (-u) 4096
>>>> virtual memory          (kbytes, -v) unlimited
>>>> file locks                      (-x) unlimited
>>>> [durga@smallMPI ~]$
>>>>
>>>>
>>>> I do realize that my setup is very unusual (I am a quasi-developer of MPI
>>>> whereas most other folks in this list are likely end-users), but somehow
>>>> just disabling this 'execinfo' MCA would allow me to make progress (and 
>>>> also
>>>> find out why/where MPI_Init() is crashing!). Is there any way I can do 
>>>> that?
>>>>
>>>> Thank you
>>>> Durga
>>>>
>>>> The surgeon general advises you to eat right, exercise regularly and quit
>>>> ageing.
>>>>
>>>> On Wed, May 11, 2016 at 8:58 PM, Gilles Gouaillardet 
>>>> wrote:
>>>>>
>>>>> Are you sure ulimit -c unlimited is *really* applied on all hosts
>>>>>
>>>>>
>>>>> can you please run the simple program below and confirm that ?
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>>
>>>>> Gilles
>>>>>
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <sys/time.h>
>>>>> #include <sys/resource.h>
>>>>> #include <mpi.h>
>>>>>
>>>>> int main(int argc, char *argv[]) {
>>>>>     struct rlimit rlim;
>>>>>     char * c = (char *)0;
>>>>>     getrlimit(RLIMIT_CORE, &rlim);
>>>>>     printf ("before MPI_Init : %d %d\n", rlim.rlim_cur, rlim.rlim_max);
>>>>>     MPI_Init(&argc, &argv);
>>>>>     getrlimit(RLIMIT_CORE, &rlim);
>>>>>     printf ("after MPI_Init : %d %d\n", rlim.rlim_cur, rlim.rlim_max);
>>>>>     *c = 0;
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>
>>>>>
>>>>> On 5/12/2016 4:22 AM, dpchoudh . wrote:
>>>>>
>>>>> Hello Gilles
>>>>>
>>>>> Thank you for the advice. However, that did not seem to make any
>>>>> difference. Here is what I did (on the cluster that generates .btr files 
>>>>> for
>>>>> core dumps):
>>>>>
>>>>> [durga@smallMPI git]$ ompi_info --all | grep opal_signal
>>>>>            MCA opal base: parameter "opal_signal" (current value:
>>>>> "6,7,8,11", data source: default, level: 3 user/all, type: string)
>>>>> [durga@smallMPI git]$
>>>>>
>>>>>
>>>>> According to <signal.h>, signals 6,7,8,11 are these:
>>>>>
>>>>> #define SIGABRT    6    /* Abort (ANSI).  */
>>>>> #define SIGBUS     7    /* BUS error (4.2 BSD).  */
>>>>> #define SIGFPE     8    /* Floating-point exception (ANSI).  */
>>>>> #define SIGSEGV    11   /* Segmentation violation (ANSI).  */
>>>>>
>>>>> And thus I added the following just after MPI_Init()
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>     signal(SIGABRT, SIG_DFL);
>>>>>     signal(SIGBUS, SIG_DFL);
>>>>>     signal(SIGFPE, SIG_DFL);
>>>>>     signal(SIGSEGV, SIG_DFL);
>>>>>     signal(SIGTERM, SIG_DFL);
>>>>>
>>>>> (I added the 'SIGTERM' part later, just in case it would make a
>>>>> difference; it didn't)
>>>>>
>>>>> The resulting code still generates .btr files instead of core files.
>>>>>
>>>>> It looks like the 'execinfo' MCA component is being used as the
>>>>> back

Re: [OMPI users] No core dump in some cases

2016-05-12 Thread dpchoudh .
 ./btrtest[0x400829]
>> >>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9473bbeb15]
>> >>> ./btrtest[0x4006d9]
>> >>> ./btrtest[0x400829]
>> >>> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fdfe2d8eb15]
>> >>> ./btrtest[0x4006d9]
>> >>> after MPI_Init : -1 -1
>> >>> after MPI_Init : -1 -1
>> >>> ---
>> >>> Primary job  terminated normally, but 1 process returned
>> >>> a non-zero exit code. Per user-direction, the job has been aborted.
>> >>> ---
>> >>>
>> >>>
>> --
>> >>> mpirun detected that one or more processes exited with non-zero
>> status,
>> >>> thus causing
>> >>> the job to be terminated. The first process to do so was:
>> >>>
>> >>>   Process name: [[9384,1],1]
>> >>>   Exit code:1
>> >>>
>> >>>
>> --
>> >>>
>> >>>
>> >>> [durga@smallMPI ~]$ ulimit -a
>> >>> core file size  (blocks, -c) unlimited
>> >>> data seg size   (kbytes, -d) unlimited
>> >>> scheduling priority (-e) 0
>> >>> file size   (blocks, -f) unlimited
>> >>> pending signals (-i) 216524
>> >>> max locked memory   (kbytes, -l) unlimited
>> >>> max memory size (kbytes, -m) unlimited
>> >>> open files  (-n) 1024
>> >>> pipe size(512 bytes, -p) 8
>> >>> POSIX message queues (bytes, -q) 819200
>> >>> real-time priority  (-r) 0
>> >>> stack size  (kbytes, -s) 8192
>> >>> cpu time   (seconds, -t) unlimited
>> >>> max user processes  (-u) 4096
>> >>> virtual memory  (kbytes, -v) unlimited
>> >>> file locks  (-x) unlimited
>> >>> [durga@smallMPI ~]$
>> >>>
>> >>>
>> >>> I do realize that my setup is very unusual (I am a quasi-developer of
>> MPI
>> >>> whereas most other folks in this list are likely end-users), but
>> somehow
>> >>> just disabling this 'execinfo' MCA would allow me to make progress
>> (and also
>> >>> find out why/where MPI_Init() is crashing!). Is there any way I can
>> do that?
>> >>>
>> >>> Thank you
>> >>> Durga
>> >>>
>> >>> The surgeon general advises you to eat right, exercise regularly and
>> quit
>> >>> ageing.
>> >>>
>> >>> On Wed, May 11, 2016 at 8:58 PM, Gilles Gouaillardet <
>> gil...@rist.or.jp>
>> >>> wrote:
>> >>>>
>> >>>> Are you sure ulimit -c unlimited is *really* applied on all hosts
>> >>>>
>> >>>>
>> >>>> can you please run the simple program below and confirm that ?
>> >>>>
>> >>>>
>> >>>> Cheers,
>> >>>>
>> >>>>
>> >>>> Gilles
>> >>>>
>> >>>>
>> >>>> #include <stdio.h>
>> >>>> #include <sys/time.h>
>> >>>> #include <sys/resource.h>
>> >>>> #include <mpi.h>
>> >>>>
>> >>>> int main(int argc, char *argv[]) {
>> >>>> struct rlimit rlim;
>> >>>> char * c = (char *)0;
>> >>>> getrlimit(RLIMIT_CORE, &rlim);
>> >>>> printf ("before MPI_Init : %d %d\n", rlim.rlim_cur,
>> rlim.rlim_max);
>> >>>> MPI_Init(&argc, &argv);
>> >>>> getrlimit(RLIMIT_CORE, &rlim);
>> >>>> printf ("after MPI_Init : %d %d\n", rlim.rlim_cur,
>> rlim.rlim_max);
>> >>>> *c = 0;
>> >>>> MPI_Finalize();
>> >>>> return 0;
>> >>>> }
>> >>>>
>> >>>>
>> >>>> On 5/12/2016 4:22 AM, dpchoudh . wrote:
>> >>>>
>> >>>> Hello Gilles
>> >>>>
>> >>

Re: [OMPI users] No core dump in some cases

2016-05-12 Thread Gilles Gouaillardet
>>>> Hello Gilles
>>>>
>>>> Thank you for the advice. However, that did not seem to make any
>>>> difference. Here is what I did (on the cluster that generates
.btr files for
>>>> core dumps):
>>>>
>>>> [durga@smallMPI git]$ ompi_info --all | grep opal_signal
>>>>MCA opal base: parameter "opal_signal" (current value:
>>>> "6,7,8,11", data source: default, level: 3 user/all, type:
string)
>>>> [durga@smallMPI git]$
>>>>
>>>>
>>>> According to <signal.h>, signals 6,7,8,11 are these:
>>>>
>>>> #define SIGABRT    6    /* Abort (ANSI).  */
>>>> #define SIGBUS     7    /* BUS error (4.2 BSD).  */
>>>> #define SIGFPE     8    /* Floating-point exception (ANSI).  */
>>>> #define SIGSEGV    11   /* Segmentation violation (ANSI).  */
>>>>
>>>> And thus I added the following just after MPI_Init()
>>>>
>>>> MPI_Init(&argc, &argv);
>>>> signal(SIGABRT, SIG_DFL);
>>>> signal(SIGBUS, SIG_DFL);
>>>> signal(SIGFPE, SIG_DFL);
>>>> signal(SIGSEGV, SIG_DFL);
>>>> signal(SIGTERM, SIG_DFL);
>>>>
>>>> (I added the 'SIGTERM' part later, just in case it would make a
>>>> difference; it didn't)
>>>>
>>>> The resulting code still generates .btr files instead of core
files.
>>>>
>>>> It looks like the 'execinfo' MCA component is being used as the
>>>> backtrace mechanism:
>>>>
>>>> [durga@smallMPI git]$ ompi_info | grep backtrace
>>>>MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0,
Component
>>>> v3.0.0)
>>>>
>>>> However, I could not find any way to choose 'none' instead of
'execinfo'
>>>>
>>>> And the strange thing is, on the cluster where regular core
dump is
>>>> happening, the output of
>>>> $ ompi_info | grep backtrace
>>>> is identical to the above. (Which kind of makes sense because
they were
>>>> created from the same source with the same configure options.)
>>>>
>>>> Sorry to harp on this, but without a core file it is hard to
debug the
>>>> application (e.g. examine stack variables).
>>>>
>>>> Thank you
>>>> Durga
>>>>
>>>>
>>>> The surgeon general advises you to eat right, exercise
regularly and
>>>> quit ageing.
>>>>
>>>> On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet
>>>> mailto:gilles.gouaillar...@gmail.com>> wrote:
>>>>>
>>>>> Durga,
>>>>>
>>>>> you might wanna try to restore the signal handler for other
signals as
>>>>> well
>>>>> (SIGSEGV, SIGBUS, ...)
>>>>> ompi_info --all | grep opal_signal
>>>>> does list the signal you should restore the handler
>>>>>
>>>>>
>>>>> only one backtrace component is built (out of several
candidates :
>>>>> execinfo, none, printstack)
>>>>> nm -l libopen-pal.so | grep backtrace
>>>>> will hint you which component was built
>>>>>
>>>>> your two similar distros might have different backtrace
component
>>>>>
>>>>>
>>>>>
>>>>> Gus,
>>>>>
>>>>> btr is a plain text file with a back trace "ala" gdb
>>>>>
>>>>>
>>>>>
>>>>> Nathan,
>>>>>
>>>>> i did a 'grep btr' and could not find anything :-(
>>>>> opal_backtrace_buffer and opal_backtrace_print are only used
with
>>>>> stderr.
>>>>> so i am puzzled who creates the tracefile name and where ...
>>>>> also, no stack is printed by default unless
opal_abort_print_stack is
>>>>> true
>>>>>
>>>>> Cheers,
>>>>>
>>>>> 

Re: [OMPI users] No core dump in some cases

2016-05-12 Thread dpchoudh .
> >>>> "6,7,8,11", data source: default, level: 3 user/all, type: string)
> >>>> [durga@smallMPI git]$
> >>>>
> >>>>
> >>>> According to <signal.h>, signals 6,7,8,11 are these:
> >>>>
> >>>> #define SIGABRT    6    /* Abort (ANSI).  */
> >>>> #define SIGBUS     7    /* BUS error (4.2 BSD).  */
> >>>> #define SIGFPE     8    /* Floating-point exception (ANSI).  */
> >>>> #define SIGSEGV    11   /* Segmentation violation (ANSI).  */
> >>>>
> >>>> And thus I added the following just after MPI_Init()
> >>>>
> >>>> MPI_Init(&argc, &argv);
> >>>> signal(SIGABRT, SIG_DFL);
> >>>> signal(SIGBUS, SIG_DFL);
> >>>> signal(SIGFPE, SIG_DFL);
> >>>> signal(SIGSEGV, SIG_DFL);
> >>>> signal(SIGTERM, SIG_DFL);
> >>>>
> >>>> (I added the 'SIGTERM' part later, just in case it would make a
> >>>> difference; it didn't)
> >>>>
> >>>> The resulting code still generates .btr files instead of core files.
> >>>>
> >>>> It looks like the 'execinfo' MCA component is being used as the
> >>>> backtrace mechanism:
> >>>>
> >>>> [durga@smallMPI git]$ ompi_info | grep backtrace
> >>>>MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component
> >>>> v3.0.0)
> >>>>
> >>>> However, I could not find any way to choose 'none' instead of
> 'execinfo'
> >>>>
> >>>> And the strange thing is, on the cluster where regular core dump is
> >>>> happening, the output of
> >>>> $ ompi_info | grep backtrace
> >>>> is identical to the above. (Which kind of makes sense because they
> were
> >>>> created from the same source with the same configure options.)
> >>>>
> >>>> Sorry to harp on this, but without a core file it is hard to debug the
> >>>> application (e.g. examine stack variables).
> >>>>
> >>>> Thank you
> >>>> Durga
> >>>>
> >>>>
> >>>> The surgeon general advises you to eat right, exercise regularly and
> >>>> quit ageing.
> >>>>
> >>>> On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet
> >>>>  wrote:
> >>>>>
> >>>>> Durga,
> >>>>>
> >>>>> you might wanna try to restore the signal handler for other signals
> as
> >>>>> well
> >>>>> (SIGSEGV, SIGBUS, ...)
> >>>>> ompi_info --all | grep opal_signal
> >>>>> does list the signal you should restore the handler
> >>>>>
> >>>>>
> >>>>> only one backtrace component is built (out of several candidates :
> >>>>> execinfo, none, printstack)
> >>>>> nm -l libopen-pal.so | grep backtrace
> >>>>> will hint you which component was built
> >>>>>
> >>>>> your two similar distros might have different backtrace component
> >>>>>
> >>>>>
> >>>>>
> >>>>> Gus,
> >>>>>
> >>>>> btr is a plain text file with a back trace "ala" gdb
> >>>>>
> >>>>>
> >>>>>
> >>>>> Nathan,
> >>>>>
> >>>>> i did a 'grep btr' and could not find anything :-(
> >>>>> opal_backtrace_buffer and opal_backtrace_print are only used with
> >>>>> stderr.
> >>>>> so i am puzzled who creates the tracefile name and where ...
> >>>>> also, no stack is printed by default unless opal_abort_print_stack is
> >>>>> true
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Gilles
> >>>>>
> >>>>>
> >>>>> On Wed, May 11, 2016 at 3:43 PM, dpchoudh . 
> wrote:
> >>>>> > Hello Nathan
> >>>>> >
> >>>>> > Thank you for your response. Could you please be more specific?
> >>>>> > Adding the
> >>>>> > following after MPI_Init() does not seem to make a difference.

Re: [OMPI users] No core dump in some cases

2016-05-12 Thread Gilles Gouaillardet
>>> scheduling priority (-e) 0
>>> file size   (blocks, -f) unlimited
>>> pending signals (-i) 216524
>>> max locked memory   (kbytes, -l) unlimited
>>> max memory size (kbytes, -m) unlimited
>>> open files  (-n) 1024
>>> pipe size(512 bytes, -p) 8
>>> POSIX message queues (bytes, -q) 819200
>>> real-time priority  (-r) 0
>>> stack size  (kbytes, -s) 8192
>>> cpu time   (seconds, -t) unlimited
>>> max user processes  (-u) 4096
>>> virtual memory  (kbytes, -v) unlimited
>>> file locks  (-x) unlimited
>>> [durga@smallMPI ~]$
>>>
>>>
>>> I do realize that my setup is very unusual (I am a quasi-developer of MPI
>>> whereas most other folks in this list are likely end-users), but somehow
>>> just disabling this 'execinfo' MCA would allow me to make progress (and also
>>> find out why/where MPI_Init() is crashing!). Is there any way I can do that?
>>>
>>> Thank you
>>> Durga
>>>
>>> The surgeon general advises you to eat right, exercise regularly and quit
>>> ageing.
>>>
>>> On Wed, May 11, 2016 at 8:58 PM, Gilles Gouaillardet 
>>> wrote:
>>>>
>>>> Are you sure ulimit -c unlimited is *really* applied on all hosts
>>>>
>>>>
>>>> can you please run the simple program below and confirm that ?
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Gilles
>>>>
>>>>
>>>> #include <stdio.h>
>>>> #include <sys/time.h>
>>>> #include <sys/resource.h>
>>>> #include <mpi.h>
>>>>
>>>> int main(int argc, char *argv[]) {
>>>> struct rlimit rlim;
>>>> char * c = (char *)0;
>>>> getrlimit(RLIMIT_CORE, &rlim);
>>>> printf ("before MPI_Init : %d %d\n", rlim.rlim_cur, rlim.rlim_max);
>>>> MPI_Init(&argc, &argv);
>>>> getrlimit(RLIMIT_CORE, &rlim);
>>>> printf ("after MPI_Init : %d %d\n", rlim.rlim_cur, rlim.rlim_max);
>>>> *c = 0;
>>>> MPI_Finalize();
>>>> return 0;
>>>> }
>>>>
>>>>
>>>> On 5/12/2016 4:22 AM, dpchoudh . wrote:
>>>>
>>>> Hello Gilles
>>>>
>>>> Thank you for the advice. However, that did not seem to make any
>>>> difference. Here is what I did (on the cluster that generates .btr files 
>>>> for
>>>> core dumps):
>>>>
>>>> [durga@smallMPI git]$ ompi_info --all | grep opal_signal
>>>>MCA opal base: parameter "opal_signal" (current value:
>>>> "6,7,8,11", data source: default, level: 3 user/all, type: string)
>>>> [durga@smallMPI git]$
>>>>
>>>>
>>>> According to <signal.h>, signals 6,7,8,11 are these:
>>>>
>>>> #define SIGABRT    6    /* Abort (ANSI).  */
>>>> #define SIGBUS     7    /* BUS error (4.2 BSD).  */
>>>> #define SIGFPE     8    /* Floating-point exception (ANSI).  */
>>>> #define SIGSEGV    11   /* Segmentation violation (ANSI).  */
>>>>
>>>> And thus I added the following just after MPI_Init()
>>>>
>>>> MPI_Init(&argc, &argv);
>>>> signal(SIGABRT, SIG_DFL);
>>>> signal(SIGBUS, SIG_DFL);
>>>> signal(SIGFPE, SIG_DFL);
>>>> signal(SIGSEGV, SIG_DFL);
>>>> signal(SIGTERM, SIG_DFL);
>>>>
>>>> (I added the 'SIGTERM' part later, just in case it would make a
>>>> difference; it didn't)
>>>>
>>>> The resulting code still generates .btr files instead of core files.
>>>>
>>>> It looks like the 'execinfo' MCA component is being used as the
>>>> backtrace mechanism:
>>>>
>>>> [durga@smallMPI git]$ ompi_info | grep backtrace
>>>>MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component
>>>> v3.0.0)
>>>>
>>>> However, I could not find any way to choose 'none' instead of 'execinfo'
>>>>
>>>> And the strange thing is, on the cluster where regular core dump is
>>>> happening, the output of
>>>> $ ompi_info | grep bac

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
>>> "6,7,8,11", data source: default, level: 3 user/all, type: string)
>>> [durga@smallMPI git]$
>>>
>>>
>>> According to <signal.h>, signals 6,7,8,11 are these:
>>>
>>> #define SIGABRT    6    /* Abort (ANSI).  */
>>> #define SIGBUS     7    /* BUS error (4.2 BSD).  */
>>> #define SIGFPE     8    /* Floating-point exception (ANSI).  */
>>> #define SIGSEGV    11   /* Segmentation violation (ANSI).  */
>>>
>>> And thus I added the following just after MPI_Init()
>>>
>>> MPI_Init(&argc, &argv);
>>> signal(SIGABRT, SIG_DFL);
>>> signal(SIGBUS, SIG_DFL);
>>> signal(SIGFPE, SIG_DFL);
>>> signal(SIGSEGV, SIG_DFL);
>>> signal(SIGTERM, SIG_DFL);
>>>
>>> (I added the 'SIGTERM' part later, just in case it would make a
>>> difference; it didn't)
>>>
>>> The resulting code still generates .btr files instead of core files.
>>>
>>> It looks like the 'execinfo' MCA component is being used as the
>>> backtrace mechanism:
>>>
>>> [durga@smallMPI git]$ ompi_info | grep backtrace
>>>MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component
>>> v3.0.0)
>>>
>>> However, I could not find any way to choose 'none' instead of 'execinfo'
>>>
>>> And the strange thing is, on the cluster where regular core dump is
>>> happening, the output of
>>> $ ompi_info | grep backtrace
>>> is identical to the above. (Which kind of makes sense because they were
>>> created from the same source with the same configure options.)
>>>
>>> Sorry to harp on this, but without a core file it is hard to debug the
>>> application (e.g. examine stack variables).
>>>
>>> Thank you
>>> Durga
>>>
>>>
>>> The surgeon general advises you to eat right, exercise regularly and
>>> quit ageing.
>>>
>>> On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com> wrote:
>>>
>>>> Durga,
>>>>
>>>> you might wanna try to restore the signal handler for other signals as
>>>> well
>>>> (SIGSEGV, SIGBUS, ...)
>>>> ompi_info --all | grep opal_signal
>>>> does list the signal you should restore the handler
>>>>
>>>>
>>>> only one backtrace component is built (out of several candidates :
>>>> execinfo, none, printstack)
>>>> nm -l libopen-pal.so | grep backtrace
>>>> will hint you which component was built
>>>>
>>>> your two similar distros might have different backtrace component
>>>>
>>>>
>>>>
>>>> Gus,
>>>>
>>>> btr is a plain text file with a back trace "ala" gdb
>>>>
>>>>
>>>>
>>>> Nathan,
>>>>
>>>> i did a 'grep btr' and could not find anything :-(
>>>> opal_backtrace_buffer and opal_backtrace_print are only used with
>>>> stderr.
>>>> so i am puzzled who creates the tracefile name and where ...
>>>> also, no stack is printed by default unless opal_abort_print_stack is
>>>> true
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>>
>>>> On Wed, May 11, 2016 at 3:43 PM, dpchoudh .  wrote:
>>>> > Hello Nathan
>>>> >
>>>> > Thank you for your response. Could you please be more specific?
>>>> Adding the
>>>> > following after MPI_Init() does not seem to make a difference.
>>>> >
>>>> > MPI_Init(&argc, &argv);
>>>> >   signal(SIGABRT, SIG_DFL);
>>>> >   signal(SIGTERM, SIG_DFL);
>>>> >
>>>> > I also find it puzzling that nearly identical OMPI distro running on a
>>>> > different machine shows different behaviour.
>>>> >
>>>> > Best regards
>>>> > Durga
>>>> >
>>>> > The surgeon general advises you to eat right, exercise regularly and
>>>> quit
>>>> > ageing.
>>>> >
>>>> > On Tue, May 10, 2016 at 10:02 AM, Hjelm, Nathan Thomas <
>>>> hje...@lanl.gov>
>>>> > wrote:
>>>> >>
>>>> >> btr files are indeed created by open mpi's backtrace mechanism.

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
hat?
>
> Thank you
> Durga
>
> The surgeon general advises you to eat right, exercise regularly and quit
> ageing.
>
> On Wed, May 11, 2016 at 8:58 PM, Gilles Gouaillardet 
> wrote:
>
>> Are you sure ulimit -c unlimited is *really* applied on all hosts
>>
>>
>> can you please run the simple program below and confirm that ?
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>> #include <stdio.h>
>> #include <sys/time.h>
>> #include <sys/resource.h>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[]) {
>> struct rlimit rlim;
>> char * c = (char *)0;
>> getrlimit(RLIMIT_CORE, &rlim);
>> printf ("before MPI_Init : %d %d\n", rlim.rlim_cur, rlim.rlim_max);
>> MPI_Init(&argc, &argv);
>> getrlimit(RLIMIT_CORE, &rlim);
>> printf ("after MPI_Init : %d %d\n", rlim.rlim_cur, rlim.rlim_max);
>> *c = 0;
>> MPI_Finalize();
>> return 0;
>> }
>>
>>
>> On 5/12/2016 4:22 AM, dpchoudh . wrote:
>>
>> Hello Gilles
>>
>> Thank you for the advice. However, that did not seem to make any
>> difference. Here is what I did (on the cluster that generates .btr files
>> for core dumps):
>>
>> [durga@smallMPI git]$ ompi_info --all | grep opal_signal
>>MCA opal base: parameter "opal_signal" (current value:
>> "6,7,8,11", data source: default, level: 3 user/all, type: string)
>> [durga@smallMPI git]$
>>
>>
>> According to <signal.h>, signals 6,7,8,11 are these:
>>
>> #define SIGABRT    6    /* Abort (ANSI).  */
>> #define SIGBUS     7    /* BUS error (4.2 BSD).  */
>> #define SIGFPE     8    /* Floating-point exception (ANSI).  */
>> #define SIGSEGV    11   /* Segmentation violation (ANSI).  */
>>
>> And thus I added the following just after MPI_Init()
>>
>> MPI_Init(&argc, &argv);
>> signal(SIGABRT, SIG_DFL);
>> signal(SIGBUS, SIG_DFL);
>> signal(SIGFPE, SIG_DFL);
>> signal(SIGSEGV, SIG_DFL);
>> signal(SIGTERM, SIG_DFL);
>>
>> (I added the 'SIGTERM' part later, just in case it would make a
>> difference; it didn't)
>>
>> The resulting code still generates .btr files instead of core files.
>>
>> It looks like the 'execinfo' MCA component is being used as the backtrace
>> mechanism:
>>
>> [durga@smallMPI git]$ ompi_info | grep backtrace
>>MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component
>> v3.0.0)
>>
>> However, I could not find any way to choose 'none' instead of 'execinfo'
>>
>> And the strange thing is, on the cluster where regular core dump is
>> happening, the output of
>> $ ompi_info | grep backtrace
>> is identical to the above. (Which kind of makes sense because they were
>> created from the same source with the same configure options.)
>>
>> Sorry to harp on this, but without a core file it is hard to debug the
>> application (e.g. examine stack variables).
>>
>> Thank you
>> Durga
>>
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>>
>> On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>>> Durga,
>>>
>>> you might wanna try to restore the signal handler for other signals as
>>> well
>>> (SIGSEGV, SIGBUS, ...)
>>> ompi_info --all | grep opal_signal
>>> does list the signal you should restore the handler
>>>
>>>
>>> only one backtrace component is built (out of several candidates :
>>> execinfo, none, printstack)
>>> nm -l libopen-pal.so | grep backtrace
>>> will hint you which component was built
>>>
>>> your two similar distros might have different backtrace component
>>>
>>>
>>>
>>> Gus,
>>>
>>> btr is a plain text file with a back trace "ala" gdb
>>>
>>>
>>>
>>> Nathan,
>>>
>>> i did a 'grep btr' and could not find anything :-(
>>> opal_backtrace_buffer and opal_backtrace_print are only used with stderr.
>>> so i am puzzled who creates the tracefile name and where ...
>>> also, no stack is printed by default unless opal_abort_print_stack is
>>> true
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Wed, May 11, 2016 at 3:43 PM, dpchoudh

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
>> #define SIGSEGV    11   /* Segmentation violation (ANSI).  */
>>
>> And thus I added the following just after MPI_Init()
>>
>> MPI_Init(&argc, &argv);
>> signal(SIGABRT, SIG_DFL);
>> signal(SIGBUS, SIG_DFL);
>> signal(SIGFPE, SIG_DFL);
>> signal(SIGSEGV, SIG_DFL);
>> signal(SIGTERM, SIG_DFL);
>>
>> (I added the 'SIGTERM' part later, just in case it would make a
>> difference; it didn't)
>>
>> The resulting code still generates .btr files instead of core files.
>>
>> It looks like the 'execinfo' MCA component is being used as the backtrace
>> mechanism:
>>
>> [durga@smallMPI git]$ ompi_info | grep backtrace
>>MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component
>> v3.0.0)
>>
>> However, I could not find any way to choose 'none' instead of 'execinfo'
>>
>> And the strange thing is, on the cluster where regular core dump is
>> happening, the output of
>> $ ompi_info | grep backtrace
>> is identical to the above. (Which kind of makes sense because they were
>> created from the same source with the same configure options.)
>>
>> Sorry to harp on this, but without a core file it is hard to debug the
>> application (e.g. examine stack variables).
>>
>> Thank you
>> Durga
>>
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>>
>> On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>>> Durga,
>>>
>>> you might wanna try to restore the signal handler for other signals as
>>> well
>>> (SIGSEGV, SIGBUS, ...)
>>> ompi_info --all | grep opal_signal
>>> does list the signal you should restore the handler
>>>
>>>
>>> only one backtrace component is built (out of several candidates :
>>> execinfo, none, printstack)
>>> nm -l libopen-pal.so | grep backtrace
>>> will hint you which component was built
>>>
>>> your two similar distros might have different backtrace component
>>>
>>>
>>>
>>> Gus,
>>>
>>> btr is a plain text file with a back trace "ala" gdb
>>>
>>>
>>>
>>> Nathan,
>>>
>>> i did a 'grep btr' and could not find anything :-(
>>> opal_backtrace_buffer and opal_backtrace_print are only used with stderr.
>>> so i am puzzled who creates the tracefile name and where ...
>>> also, no stack is printed by default unless opal_abort_print_stack is
>>> true
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Wed, May 11, 2016 at 3:43 PM, dpchoudh . < 
>>> dpcho...@gmail.com> wrote:
>>> > Hello Nathan
>>> >
>>> > Thank you for your response. Could you please be more specific? Adding
>>> the
>>> > following after MPI_Init() does not seem to make a difference.
>>> >
>>> > MPI_Init(&argc, &argv);
>>> >   signal(SIGABRT, SIG_DFL);
>>> >   signal(SIGTERM, SIG_DFL);
>>> >
>>> > I also find it puzzling that nearly identical OMPI distro running on a
>>> > different machine shows different behaviour.
>>> >
>>> > Best regards
>>> > Durga
>>> >
>>> > The surgeon general advises you to eat right, exercise regularly and
>>> quit
>>> > ageing.
>>> >
>>> > On Tue, May 10, 2016 at 10:02 AM, Hjelm, Nathan Thomas <
>>> hje...@lanl.gov>
>>> > wrote:
>>> >>
>>> >> btr files are indeed created by open mpi's backtrace mechanism. I
>>> think we
>>> >> should revisit it at some point but for now the only effective way i
>>> have
>>> >> found to prevent it is to restore the default signal handlers after
>>> >> MPI_Init.
>>> >>
>>> >> Excuse the quoting style. Good sucks.
>>> >>
>>> >>
>>> >> 
>>> >> From: users on behalf of dpchoudh .
>>> >> Sent: Monday, May 09, 2016 2:59:37 PM
>>> >> To: Open MPI Users
>>> >> Subject: Re: [OMPI users] No core dump in some cases
>>> >>
>>> >> Hi Gus
>>> >>
>>> >> Thanks for your suggestion. But I am not using any resource man

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread Gilles Gouaillardet
The surgeon general advises you to eat right, exercise regularly
and quit ageing.

On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:

Durga,

you might wanna try to restore the signal handler for other
signals as well
(SIGSEGV, SIGBUS, ...)
ompi_info --all | grep opal_signal
does list the signal you should restore the handler


only one backtrace component is built (out of several
candidates :
execinfo, none, printstack)
nm -l libopen-pal.so | grep backtrace
will hint you which component was built

your two similar distros might have different backtrace component



Gus,

btr is a plain text file with a back trace "ala" gdb



Nathan,

i did a 'grep btr' and could not find anything :-(
opal_backtrace_buffer and opal_backtrace_print are only used
with stderr.
so i am puzzled who creates the tracefile name and where ...
also, no stack is printed by default unless
opal_abort_print_stack is true

Cheers,

Gilles


On Wed, May 11, 2016 at 3:43 PM, dpchoudh .
<dpcho...@gmail.com> wrote:
> Hello Nathan
>
> Thank you for your response. Could you please be more
specific? Adding the
> following after MPI_Init() does not seem to make a difference.
>
> MPI_Init(&argc, &argv);
>   signal(SIGABRT, SIG_DFL);
>   signal(SIGTERM, SIG_DFL);
>
> I also find it puzzling that nearly identical OMPI distro
running on a
> different machine shows different behaviour.
>
> Best regards
> Durga
>
> The surgeon general advises you to eat right, exercise
regularly and quit
> ageing.
>
> On Tue, May 10, 2016 at 10:02 AM, Hjelm, Nathan Thomas
<hje...@lanl.gov>
> wrote:
>>
>> btr files are indeed created by open mpi's backtrace
mechanism. I think we
>> should revisit it at some point but for now the only
effective way i have
>> found to prevent it is to restore the default signal
handlers after
>> MPI_Init.
    >>
    >> Excuse the quoting style. Good sucks.
>>
>>
>> 
>> From: users on behalf of dpchoudh .
>> Sent: Monday, May 09, 2016 2:59:37 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] No core dump in some cases
>>
>> Hi Gus
>>
>> Thanks for your suggestion. But I am not using any
resource manager (i.e.
>> I am launching mpirun from the bash shell.). In fact, both
of the two
>> clusters I talked about run CentOS 7 and I launch the job
the same way on
>> both of these, yet one of them creates standard core files
and the other
>> creates the 'btr; files. Strange thing is, I could not
find anything on the
>> .btr (= Backtrace?) files on Google, which is any I asked
on this forum.
>>
>> Best regards
>> Durga
>>
>> The surgeon general advises you to eat right, exercise
regularly and quit
>> ageing.
>>
>> On Mon, May 9, 2016 at 12:04 PM, Gus Correa
>> <g...@ldeo.columbia.edu> wrote:
>> Hi Durga
>>
>> Just in case ...
>> If you're using a resource manager to start the jobs
(Torque, etc),
>> you need to have them set the limits (for coredump size,
stacksize, locked
>> memory size, etc).
>> This way the jobs will inherit the limits from the
>> resource manager daemon.
>> On Torque (which I use) I do this on the pbs_mom daemon
>> init script (I am still before the systemd era, that
lovely POS).
>> And set the hard/soft limits on /etc/security/limits.conf
as well.
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 05/07/2016 12:27 PM, Jeff Squyres (jsquyres) wrote:
>> I'm afraid I don't know what a .btr file is -- that is not
something that
>> is controlled by Open MPI.
>>
>> You might want to look into your OS settings to see if it
has s

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
> I do realize that my setup is very unusual (I am a quasi-developer of MPI
> whereas most other folks in this list are likely end-users), but somehow
> just disabling this 'execinfo' MCA would allow me to make progress (and
> also find out why/where MPI_Init() is crashing!). Is there any way I can do
> that?
>
> Thank you
> Durga
>
> The surgeon general advises you to eat right, exercise regularly and quit
> ageing.
>
> On Wed, May 11, 2016 at 8:58 PM, Gilles Gouaillardet 
> wrote:
>
>> Are you sure ulimit -c unlimited is *really* applied on all hosts
>>
>>
>> can you please run the simple program below and confirm that ?
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>> #include <stdio.h>
>> #include <sys/time.h>
>> #include <sys/resource.h>
>> #include <mpi.h>
>>
>> int main(int argc, char *argv[]) {
>> struct rlimit rlim;
>> char * c = (char *)0;
>> getrlimit(RLIMIT_CORE, &rlim);
>> printf ("before MPI_Init : %d %d\n", rlim.rlim_cur, rlim.rlim_max);
>> MPI_Init(&argc, &argv);
>> getrlimit(RLIMIT_CORE, &rlim);
>> printf ("after MPI_Init : %d %d\n", rlim.rlim_cur, rlim.rlim_max);
>> *c = 0;
>> MPI_Finalize();
>> return 0;
>> }
>>
>>
>> On 5/12/2016 4:22 AM, dpchoudh . wrote:
>>
>> Hello Gilles
>>
>> Thank you for the advice. However, that did not seem to make any
>> difference. Here is what I did (on the cluster that generates .btr files
>> for core dumps):
>>
>> [durga@smallMPI git]$ ompi_info --all | grep opal_signal
>>MCA opal base: parameter "opal_signal" (current value:
>> "6,7,8,11", data source: default, level: 3 user/all, type: string)
>> [durga@smallMPI git]$
>>
>>
>> According to <signal.h>, signals 6,7,8,11 are these:
>>
>> #define SIGABRT    6    /* Abort (ANSI).  */
>> #define SIGBUS     7    /* BUS error (4.2 BSD).  */
>> #define SIGFPE     8    /* Floating-point exception (ANSI).  */
>> #define SIGSEGV    11   /* Segmentation violation (ANSI).  */
>>
>> And thus I added the following just after MPI_Init()
>>
>> MPI_Init(&argc, &argv);
>> signal(SIGABRT, SIG_DFL);
>> signal(SIGBUS, SIG_DFL);
>> signal(SIGFPE, SIG_DFL);
>> signal(SIGSEGV, SIG_DFL);
>> signal(SIGTERM, SIG_DFL);
>>
>> (I added the 'SIGTERM' part later, just in case it would make a
>> difference; it didn't)
>>
>> The resulting code still generates .btr files instead of core files.
>>
>> It looks like the 'execinfo' MCA component is being used as the backtrace
>> mechanism:
>>
>> [durga@smallMPI git]$ ompi_info | grep backtrace
>>MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component
>> v3.0.0)
>>
>> However, I could not find any way to choose 'none' instead of 'execinfo'
>>
>> And the strange thing is, on the cluster where regular core dump is
>> happening, the output of
>> $ ompi_info | grep backtrace
>> is identical to the above. (Which kind of makes sense because they were
>> created from the same source with the same configure options.)
>>
>> Sorry to harp on this, but without a core file it is hard to debug the
>> application (e.g. examine stack variables).
>>
>> Thank you
>> Durga
>>
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>>
>> On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>>> Durga,
>>>
>>> you might wanna try to restore the signal handler for other signals as
>>> well
>>> (SIGSEGV, SIGBUS, ...)
>>> ompi_info --all | grep opal_signal
>>> does list the signal you should restore the handler
>>>
>>>
>>> only one backtrace component is built (out of several candidates :
>>> execinfo, none, printstack)
>>> nm -l libopen-pal.so | grep backtrace
>>> will hint you which component was built
>>>
>>> your two similar distros might have different backtrace component
>>>
>>>
>>>
>>> Gus,
>>>
>>> btr is a plain text file with a back trace "ala" gdb
>>>
>>>
>>>
>>> Nathan,
>>>
>>> i did a 'grep btr' and could not find anything :-(
>>> opal_backtrace_buffer and opal_backtrace_print are only used with stderr.
>>> so i am puzzled who creates the tracefile n

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread Ralph Castain
>> signal(SIGTERM, SIG_DFL);
>> 
>> (I added the 'SIGTERM' part later, just in case it would make a difference; 
>> it didn't)
>> 
>> The resulting code still generates .btr files instead of core files.
>> 
>> It looks like the 'execinfo' MCA component is being used as the backtrace 
>> mechanism:
>> 
>> [durga@smallMPI git]$ ompi_info | grep backtrace
>>MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v3.0.0)
>> 
>> However, I could not find any way to choose 'none' instead of 'execinfo'
>> 
>> And the strange thing is, on the cluster where regular core dump is 
>> happening, the output of 
>> $ ompi_info | grep backtrace
>> is identical to the above. (Which kind of makes sense because they were 
>> created from the same source with the same configure options.)
>> 
>> Sorry to harp on this, but without a core file it is hard to debug the 
>> application (e.g. examine stack variables).
>> 
>> Thank you
>> Durga
>> 
>> 
>> The surgeon general advises you to eat right, exercise regularly and quit 
>> ageing.
>> 
>> On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet < 
>> gilles.gouaillar...@gmail.com> wrote:
>> Durga,
>> 
>> you might wanna try to restore the signal handler for other signals as well
>> (SIGSEGV, SIGBUS, ...)
>> ompi_info --all | grep opal_signal
>> does list the signal you should restore the handler
>> 
>> 
>> only one backtrace component is built (out of several candidates :
>> execinfo, none, printstack)
>> nm -l libopen-pal.so | grep backtrace
>> will hint you which component was built
>> 
>> your two similar distros might have different backtrace component
>> 
>> 
>> 
>> Gus,
>> 
>> btr is a plain text file with a back trace "ala" gdb
>> 
>> 
>> 
>> Nathan,
>> 
>> i did a 'grep btr' and could not find anything :-(
>> opal_backtrace_buffer and opal_backtrace_print are only used with stderr.
>> so i am puzzled who creates the tracefile name and where ...
>> also, no stack is printed by default unless opal_abort_print_stack is true
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>> On Wed, May 11, 2016 at 3:43 PM, dpchoudh . < 
>> dpcho...@gmail.com> wrote:
>> > Hello Nathan
>> >
>> > Thank you for your response. Could you please be more specific? Adding the
>> > following after MPI_Init() does not seem to make a difference.
>> >
>> > MPI_Init(&argc, &argv);
>> >   signal(SIGABRT, SIG_DFL);
>> >   signal(SIGTERM, SIG_DFL);
>> >
>> > I also find it puzzling that nearly identical OMPI distro running on a
>> > different machine shows different behaviour.
>> >
>> > Best regards
>> > Durga
>> >
>> > The surgeon general advises you to eat right, exercise regularly and quit
>> > ageing.
>> >
>> > On Tue, May 10, 2016 at 10:02 AM, Hjelm, Nathan Thomas <hje...@lanl.gov>
>> > wrote:
>> >>
>> >> btr files are indeed created by open mpi's backtrace mechanism. I think we
>> >> should revisit it at some point but for now the only effective way i have
>> >> found to prevent it is to restore the default signal handlers after
>> >> MPI_Init.
>> >>
>> >> Excuse the quoting style. Good sucks.
>> >>
>> >>
>> >> 
>> >> From: users on behalf of dpchoudh .
>> >> Sent: Monday, May 09, 2016 2:59:37 PM
>> >> To: Open MPI Users
>> >> Subject: Re: [OMPI users] No core dump in some cases
>> >>
>> >> Hi Gus
>> >>
>> >> Thanks for your suggestion. But I am not using any resource manager (i.e.
>> >> I am launching mpirun from the bash shell.). In fact, both of the two
>> >> clusters I talked about run CentOS 7 and I launch the job the same way on
>> >> both of these, yet one of them creates standard core files and the other
>> >> creates the 'btr; files. Strange thing is, I could not find anything on 
>> >> the
>> >> .btr (= Backtrace?) files on Google, which is any I asked on this forum.
>> >

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
>> Durga,
>>
>> you might wanna try to restore the signal handler for other signals as
>> well
>> (SIGSEGV, SIGBUS, ...)
>> ompi_info --all | grep opal_signal
>> does list the signal you should restore the handler
>>
>>
>> only one backtrace component is built (out of several candidates :
>> execinfo, none, printstack)
>> nm -l libopen-pal.so | grep backtrace
>> will hint you which component was built
>>
>> your two similar distros might have different backtrace component
>>
>>
>>
>> Gus,
>>
>> btr is a plain text file with a back trace "ala" gdb
>>
>>
>>
>> Nathan,
>>
>> i did a 'grep btr' and could not find anything :-(
>> opal_backtrace_buffer and opal_backtrace_print are only used with stderr.
>> so i am puzzled who creates the tracefile name and where ...
>> also, no stack is printed by default unless opal_abort_print_stack is true
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Wed, May 11, 2016 at 3:43 PM, dpchoudh . < 
>> dpcho...@gmail.com> wrote:
>> > Hello Nathan
>> >
>> > Thank you for your response. Could you please be more specific? Adding
>> the
>> > following after MPI_Init() does not seem to make a difference.
>> >
>> > MPI_Init(&argc, &argv);
>> >   signal(SIGABRT, SIG_DFL);
>> >   signal(SIGTERM, SIG_DFL);
>> >
>> > I also find it puzzling that nearly identical OMPI distro running on a
>> > different machine shows different behaviour.
>> >
>> > Best regards
>> > Durga
>> >
>> > The surgeon general advises you to eat right, exercise regularly and
>> quit
>> > ageing.
>> >
>> > On Tue, May 10, 2016 at 10:02 AM, Hjelm, Nathan Thomas > >
>> > wrote:
>> >>
>> >> btr files are indeed created by open mpi's backtrace mechanism. I
>> think we
>> >> should revisit it at some point but for now the only effective way i
>> have
>> >> found to prevent it is to restore the default signal handlers after
>> >> MPI_Init.
>> >>
>> >> Excuse the quoting style. Good sucks.
>> >>
>> >>
>> >> 
>> >> From: users on behalf of dpchoudh .
>> >> Sent: Monday, May 09, 2016 2:59:37 PM
>> >> To: Open MPI Users
>> >> Subject: Re: [OMPI users] No core dump in some cases
>> >>
>> >> Hi Gus
>> >>
>> >> Thanks for your suggestion. But I am not using any resource manager
>> (i.e.
>> >> I am launching mpirun from the bash shell.). In fact, both of the two
>> >> clusters I talked about run CentOS 7 and I launch the job the same way
>> on
>> >> both of these, yet one of them creates standard core files and the
>> other
>> >> creates the 'btr; files. Strange thing is, I could not find anything
>> on the
>> >> .btr (= Backtrace?) files on Google, which is any I asked on this
>> forum.
>> >>
>> >> Best regards
>> >> Durga
>> >>
>> >> The surgeon general advises you to eat right, exercise regularly and
>> quit
>> >> ageing.
>> >>
>> >> On Mon, May 9, 2016 at 12:04 PM, Gus Correa
>> >> <g...@ldeo.columbia.edu> wrote:
>> >> Hi Durga
>> >>
>> >> Just in case ...
>> >> If you're using a resource manager to start the jobs (Torque, etc),
>> >> you need to have them set the limits (for coredump size, stacksize,
>> locked
>> >> memory size, etc).
>> >> This way the jobs will inherit the limits from the
>> >> resource manager daemon.
>> >> On Torque (which I use) I do this on the pbs_mom daemon
>> >> init script (I am still before the systemd era, that lovely POS).
>> >> And set the hard/soft limits on /etc/security/limits.conf as well.
>> >>
>> >> I hope this helps,
>> >> Gus Correa
>> >>
>> >> On 05/07/2016 12:27 PM, Jeff Squyres (jsquyres) wrote:
>> >> I'm afraid I don't know what a .btr file is -- that is not something
>> that
>> >> is controlled by Open MPI.
>> >>
>> >> You might want to look into your OS settings to see if it has some
>> kind of
>> >> alternate corefile mechanism...?

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread Gilles Gouaillardet

Are you sure ulimit -c unlimited is *really* applied on all hosts


can you please run the simple program below and confirm that ?


Cheers,


Gilles


#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
struct rlimit rlim;
char * c = (char *)0;
getrlimit(RLIMIT_CORE, &rlim);
printf ("before MPI_Init : %d %d\n", rlim.rlim_cur, rlim.rlim_max);
MPI_Init(&argc, &argv);
getrlimit(RLIMIT_CORE, &rlim);
printf ("after MPI_Init : %d %d\n", rlim.rlim_cur, rlim.rlim_max);
*c = 0;
MPI_Finalize();
return 0;
}
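
Note that rlim_cur and rlim_max have type rlim_t rather than int, so the
%d conversions above print RLIM_INFINITY as -1 on 64-bit Linux; the
"after MPI_Init : -1 -1" lines quoted elsewhere in this thread therefore
mean the core file limit really was unlimited inside MPI_Init. A cleaner
print, as a sketch, would be:

  printf ("before MPI_Init : %llu %llu\n",
          (unsigned long long) rlim.rlim_cur,
          (unsigned long long) rlim.rlim_max);

And if the check shows the limit is not applied on some remote host, one
workaround (a sketch, with ./a.out standing in for the real binary) is to
raise it in a per-rank wrapper, assuming the hard limit on that host
already allows it:

  mpirun -np 2 bash -c 'ulimit -c unlimited && exec ./a.out'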


On 5/12/2016 4:22 AM, dpchoudh . wrote:

Hello Gilles

Thank you for the advice. However, that did not seem to make any 
difference. Here is what I did (on the cluster that generates .btr 
files for core dumps):


[durga@smallMPI git]$ ompi_info --all | grep opal_signal
   MCA opal base: parameter "opal_signal" (current value: 
"6,7,8,11", data source: default, level: 3 user/all, type: string)

[durga@smallMPI git]$


According to <signal.h>, signals 6,7,8,11 are these:

#define SIGABRT    6    /* Abort (ANSI).  */
#define SIGBUS     7    /* BUS error (4.2 BSD).  */
#define SIGFPE     8    /* Floating-point exception (ANSI).  */
#define SIGSEGV    11   /* Segmentation violation (ANSI).  */

And thus I added the following just after MPI_Init()

MPI_Init(&argc, &argv);
signal(SIGABRT, SIG_DFL);
signal(SIGBUS, SIG_DFL);
signal(SIGFPE, SIG_DFL);
signal(SIGSEGV, SIG_DFL);
signal(SIGTERM, SIG_DFL);

(I added the 'SIGTERM' part later, just in case it would make a 
difference; it didn't)


The resulting code still generates .btr files instead of core files.

It looks like the 'execinfo' MCA component is being used as the 
backtrace mechanism:


[durga@smallMPI git]$ ompi_info | grep backtrace
   MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component 
v3.0.0)


However, I could not find any way to choose 'none' instead of 'execinfo'

And the strange thing is, on the cluster where regular core dump is 
happening, the output of

$ ompi_info | grep backtrace
is identical to the above. (Which kind of makes sense because they 
were created from the same source with the same configure options.)


Sorry to harp on this, but without a core file it is hard to debug the 
application (e.g. examine stack variables).


Thank you
Durga


The surgeon general advises you to eat right, exercise regularly and 
quit ageing.


On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet 
<gilles.gouaillar...@gmail.com>
wrote:


Durga,

you might wanna try to restore the signal handler for other
signals as well
(SIGSEGV, SIGBUS, ...)
ompi_info --all | grep opal_signal
does list the signal you should restore the handler


only one backtrace component is built (out of several candidates :
execinfo, none, printstack)
nm -l libopen-pal.so | grep backtrace
will hint you which component was built

your two similar distros might have different backtrace component



Gus,

btr is a plain text file with a back trace "ala" gdb



Nathan,

i did a 'grep btr' and could not find anything :-(
opal_backtrace_buffer and opal_backtrace_print are only used with
stderr.
so i am puzzled who creates the tracefile name and where ...
also, no stack is printed by default unless opal_abort_print_stack
is true

Cheers,

Gilles


On Wed, May 11, 2016 at 3:43 PM, dpchoudh . <dpcho...@gmail.com> wrote:
> Hello Nathan
>
> Thank you for your response. Could you please be more specific?
Adding the
> following after MPI_Init() does not seem to make a difference.
>
> MPI_Init(&argc, &argv);
>   signal(SIGABRT, SIG_DFL);
>   signal(SIGTERM, SIG_DFL);
>
> I also find it puzzling that nearly identical OMPI distro
running on a
> different machine shows different behaviour.
>
> Best regards
> Durga
>
> The surgeon general advises you to eat right, exercise regularly
and quit
> ageing.
>
> On Tue, May 10, 2016 at 10:02 AM, Hjelm, Nathan Thomas
<hje...@lanl.gov>
> wrote:
>>
>> btr files are indeed created by open mpi's backtrace mechanism.
I think we
>> should revisit it at some point but for now the only effective
way i have
>> found to prevent it is to restore the default signal handlers after
>> MPI_Init.
    >>
    >> Excuse the quoting style. Good sucks.
>>
>>
>> 
>> From: users on behalf of dpchoudh .
>> Sent: Monday, May 09, 2016 2:59:37 PM
>> To: Open MPI Users

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
Hello Gilles

Thank you for the advice. However, that did not seem to make any
difference. Here is what I did (on the cluster that generates .btr files
for core dumps):

[durga@smallMPI git]$ ompi_info --all | grep opal_signal
   MCA opal base: parameter "opal_signal" (current value:
"6,7,8,11", data source: default, level: 3 user/all, type: string)
[durga@smallMPI git]$


According to <signal.h>, signals 6,7,8,11 are these:

#define SIGABRT    6    /* Abort (ANSI).  */
#define SIGBUS     7    /* BUS error (4.2 BSD).  */
#define SIGFPE     8    /* Floating-point exception (ANSI).  */
#define SIGSEGV    11   /* Segmentation violation (ANSI).  */

And thus I added the following just after MPI_Init()

MPI_Init(&argc, &argv);
signal(SIGABRT, SIG_DFL);
signal(SIGBUS, SIG_DFL);
signal(SIGFPE, SIG_DFL);
signal(SIGSEGV, SIG_DFL);
signal(SIGTERM, SIG_DFL);

(I added the 'SIGTERM' part later, just in case it would make a difference;
it didn't)

The resulting code still generates .btr files instead of core files.

It looks like the 'execinfo' MCA component is being used as the backtrace
mechanism:

[durga@smallMPI git]$ ompi_info | grep backtrace
   MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component
v3.0.0)

However, I could not find any way to choose 'none' instead of 'execinfo'

And the strange thing is, on the cluster where regular core dump is
happening, the output of
$ ompi_info | grep backtrace
is identical to the above. (Which kind of makes sense because they were
created from the same source with the same configure options.)
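
(Since the two builds are identical, one more thing worth checking is
whether the runtime MCA settings differ: Open MPI also reads defaults
from

  $prefix/etc/openmpi-mca-params.conf
  $HOME/.openmpi/mca-params.conf

and a parameter set in one of those files on only one cluster could
explain the different behaviour.)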

Sorry to harp on this, but without a core file it is hard to debug the
application (e.g. examine stack variables).

Thank you
Durga


The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Wed, May 11, 2016 at 3:37 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Durga,
>
> you might wanna try to restore the signal handler for other signals as well
> (SIGSEGV, SIGBUS, ...)
> ompi_info --all | grep opal_signal
> does list the signal you should restore the handler
>
>
> only one backtrace component is built (out of several candidates :
> execinfo, none, printstack)
> nm -l libopen-pal.so | grep backtrace
> will hint you which component was built
>
> your two similar distros might have different backtrace component
>
>
>
> Gus,
>
> btr is a plain text file with a back trace "ala" gdb
>
>
>
> Nathan,
>
> i did a 'grep btr' and could not find anything :-(
> opal_backtrace_buffer and opal_backtrace_print are only used with stderr.
> so i am puzzled who creates the tracefile name and where ...
> also, no stack is printed by default unless opal_abort_print_stack is true
>
> Cheers,
>
> Gilles
>
>
> On Wed, May 11, 2016 at 3:43 PM, dpchoudh .  wrote:
> > Hello Nathan
> >
> > Thank you for your response. Could you please be more specific? Adding
> the
> > following after MPI_Init() does not seem to make a difference.
> >
> > MPI_Init(&argc, &argv);
> >   signal(SIGABRT, SIG_DFL);
> >   signal(SIGTERM, SIG_DFL);
> >
> > I also find it puzzling that nearly identical OMPI distro running on a
> > different machine shows different behaviour.
> >
> > Best regards
> > Durga
> >
> > The surgeon general advises you to eat right, exercise regularly and quit
> > ageing.
> >
> > On Tue, May 10, 2016 at 10:02 AM, Hjelm, Nathan Thomas 
> > wrote:
> >>
> >> btr files are indeed created by open mpi's backtrace mechanism. I think
> we
> >> should revisit it at some point but for now the only effective way i
> have
> >> found to prevent it is to restore the default signal handlers after
> >> MPI_Init.
> >>
> >> Excuse the quoting style. Good sucks.
> >>
> >>
> >> 
> >> From: users on behalf of dpchoudh .
> >> Sent: Monday, May 09, 2016 2:59:37 PM
> >> To: Open MPI Users
> >> Subject: Re: [OMPI users] No core dump in some cases
> >>
> >> Hi Gus
> >>
> >> Thanks for your suggestion. But I am not using any resource manager
> (i.e.
> >> I am launching mpirun from the bash shell.). In fact, both of the two
> >> clusters I talked about run CentOS 7 and I launch the job the same way
> on
> >> both of these, yet one of them creates standard core files and the other
> >> creates the 'btr; files. Strange thing is, I could not find anything on
> the
> >> .btr (= Backtrace?) files on Google, which is any I as

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread Gilles Gouaillardet
Durga,

you might wanna try to restore the signal handler for other signals as well
(SIGSEGV, SIGBUS, ...)
ompi_info --all | grep opal_signal
does list the signal you should restore the handler
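
A generic way to do that (just a sketch; the helper name below is made up,
and the list should be whatever ompi_info reports for opal_signal) is to
parse the comma-separated list and restore SIG_DFL for each entry right
after MPI_Init:

  #include <signal.h>
  #include <stdlib.h>
  #include <string.h>

  /* Restore the default action for every signal number in a
   * comma-separated list such as "6,7,8,11". */
  static void restore_default_handlers(const char *list)
  {
      char *copy = strdup(list);
      if (copy == NULL)
          return;
      for (char *tok = strtok(copy, ","); tok != NULL; tok = strtok(NULL, ","))
          signal(atoi(tok), SIG_DFL);
      free(copy);
  }

  /* right after MPI_Init(&argc, &argv): */
  restore_default_handlers("6,7,8,11");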


only one backtrace component is built (out of several candidates :
execinfo, none, printstack)
nm -l libopen-pal.so | grep backtrace
will hint you which component was built
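
Because only one of those components ends up in the library, there is no
"none" component left to select at run time; getting "none" means
rebuilding Open MPI without execinfo. If I remember the configure syntax
correctly it is something like

  ./configure --enable-mca-no-build=backtrace-execinfo ...

but treat the exact option name as an assumption and check
./configure --help for the spelling.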

your two similar distros might have different backtrace component



Gus,

btr is a plain text file with a back trace "ala" gdb



Nathan,

i did a 'grep btr' and could not find anything :-(
opal_backtrace_buffer and opal_backtrace_print are only used with stderr.
so i am puzzled who creates the tracefile name and where ...
also, no stack is printed by default unless opal_abort_print_stack is true
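
(For reference, an MCA parameter like that can be switched on per run,
e.g. with

  mpirun --mca opal_abort_print_stack 1 ...

or by exporting OMPI_MCA_opal_abort_print_stack=1 in the environment.)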

Cheers,

Gilles


On Wed, May 11, 2016 at 3:43 PM, dpchoudh .  wrote:
> Hello Nathan
>
> Thank you for your response. Could you please be more specific? Adding the
> following after MPI_Init() does not seem to make a difference.
>
> MPI_Init(&argc, &argv);
>   signal(SIGABRT, SIG_DFL);
>   signal(SIGTERM, SIG_DFL);
>
> I also find it puzzling that nearly identical OMPI distro running on a
> different machine shows different behaviour.
>
> Best regards
> Durga
>
> The surgeon general advises you to eat right, exercise regularly and quit
> ageing.
>
> On Tue, May 10, 2016 at 10:02 AM, Hjelm, Nathan Thomas 
> wrote:
>>
>> btr files are indeed created by open mpi's backtrace mechanism. I think we
>> should revisit it at some point but for now the only effective way i have
>> found to prevent it is to restore the default signal handlers after
>> MPI_Init.
>>
>> Excuse the quoting style. Good sucks.
>>
>>
>> ________________
>> From: users on behalf of dpchoudh .
>> Sent: Monday, May 09, 2016 2:59:37 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] No core dump in some cases
>>
>> Hi Gus
>>
>> Thanks for your suggestion. But I am not using any resource manager (i.e.
>> I am launching mpirun from the bash shell.). In fact, both of the two
>> clusters I talked about run CentOS 7 and I launch the job the same way on
>> both of these, yet one of them creates standard core files and the other
>> creates the 'btr' files. Strange thing is, I could not find anything on the
>> .btr (= Backtrace?) files on Google, which is why I asked on this forum.
>>
>> Best regards
>> Durga
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>>
>> On Mon, May 9, 2016 at 12:04 PM, Gus Correa
>> mailto:g...@ldeo.columbia.edu>> wrote:
>> Hi Durga
>>
>> Just in case ...
>> If you're using a resource manager to start the jobs (Torque, etc),
>> you need to have them set the limits (for coredump size, stacksize, locked
>> memory size, etc).
>> This way the jobs will inherit the limits from the
>> resource manager daemon.
>> On Torque (which I use) I do this on the pbs_mom daemon
>> init script (I am still before the systemd era, that lovely POS).
>> And set the hard/soft limits on /etc/security/limits.conf as well.
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 05/07/2016 12:27 PM, Jeff Squyres (jsquyres) wrote:
>> I'm afraid I don't know what a .btr file is -- that is not something that
>> is controlled by Open MPI.
>>
>> You might want to look into your OS settings to see if it has some kind of
>> alternate corefile mechanism...?
>>
>>
>> On May 6, 2016, at 8:58 PM, dpchoudh .
>> mailto:dpcho...@gmail.com>> wrote:
>>
>> Hello all
>>
>> I run MPI jobs (for test purpose only) on two different 'clusters'. Both
>> 'clusters' have two nodes only, connected back-to-back. The two are very
>> similar, but not identical, both software and hardware wise.
>>
>> Both have ulimit -c set to unlimited. However, only one of the two creates
>> core files when an MPI job crashes. The other creates a text file named
>> something like
>>
>> .80s-,.btr
>>
>> I'd much prefer a core file because that allows me to debug with a lot
>> more options than a static text file with addresses. How do I get a core
>> file in all situations? I am using MPI source from the master branch.
>>
>> Thanks in advance
>> Durga
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>> ___
>> users mailing list
>> us...@open-mpi.org

Re: [OMPI users] No core dump in some cases

2016-05-11 Thread dpchoudh .
Hello Nathan

Thank you for your response. Could you please be more specific? Adding the
following after MPI_Init() does not seem to make a difference.

MPI_Init(&argc, &argv);
  signal(SIGABRT, SIG_DFL);
  signal(SIGTERM, SIG_DFL);

I also find it puzzling that nearly identical OMPI distro running on a
different machine shows different behaviour.

Best regards
Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Tue, May 10, 2016 at 10:02 AM, Hjelm, Nathan Thomas 
wrote:

> btr files are indeed created by open mpi's backtrace mechanism. I think we
> should revisit it at some point but for now the only effective way i have
> found to prevent it is to restore the default signal handlers after
> MPI_Init.
>
> Excuse the quoting style. Good sucks.
>
>
> 
> From: users on behalf of dpchoudh .
> Sent: Monday, May 09, 2016 2:59:37 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] No core dump in some cases
>
> Hi Gus
>
> Thanks for your suggestion. But I am not using any resource manager (i.e.
> I am launching mpirun from the bash shell.). In fact, both of the two
> clusters I talked about run CentOS 7 and I launch the job the same way on
> both of these, yet one of them creates standard core files and the other
> creates the 'btr' files. Strange thing is, I could not find anything on the
> .btr (= Backtrace?) files on Google, which is why I asked on this forum.
>
> Best regards
> Durga
>
> The surgeon general advises you to eat right, exercise regularly and quit
> ageing.
>
> On Mon, May 9, 2016 at 12:04 PM, Gus Correa  g...@ldeo.columbia.edu>> wrote:
> Hi Durga
>
> Just in case ...
> If you're using a resource manager to start the jobs (Torque, etc),
> you need to have them set the limits (for coredump size, stacksize, locked
> memory size, etc).
> This way the jobs will inherit the limits from the
> resource manager daemon.
> On Torque (which I use) I do this on the pbs_mom daemon
> init script (I am still before the systemd era, that lovely POS).
> And set the hard/soft limits on /etc/security/limits.conf as well.
>
> I hope this helps,
> Gus Correa
>
> On 05/07/2016 12:27 PM, Jeff Squyres (jsquyres) wrote:
> I'm afraid I don't know what a .btr file is -- that is not something that
> is controlled by Open MPI.
>
> You might want to look into your OS settings to see if it has some kind of
> alternate corefile mechanism...?
>
>
> On May 6, 2016, at 8:58 PM, dpchoudh .  dpcho...@gmail.com>> wrote:
>
> Hello all
>
> I run MPI jobs (for test purpose only) on two different 'clusters'. Both
> 'clusters' have two nodes only, connected back-to-back. The two are very
> similar, but not identical, both software and hardware wise.
>
> Both have ulimit -c set to unlimited. However, only one of the two creates
> core files when an MPI job crashes. The other creates a text file named
> something like
>
> .80s-,.btr
>
> I'd much prefer a core file because that allows me to debug with a lot
> more options than a static text file with addresses. How do I get a core
> file in all situations? I am using MPI source from the master branch.
>
> Thanks in advance
> Durga
>
> The surgeon general advises you to eat right, exercise regularly and quit
> ageing.
> ___
> users mailing list
> us...@open-mpi.org<mailto:us...@open-mpi.org>
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29124.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org<mailto:us...@open-mpi.org>
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29141.php
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29154.php
>


Re: [OMPI users] No core dump in some cases

2016-05-10 Thread Gus Correa

On 05/09/2016 04:59 PM, dpchoudh . wrote:

Hi Gus

Thanks for your suggestion. But I am not using any resource manager
(i.e. I am launching mpirun from the bash shell.). In fact, both of the
two clusters I talked about run CentOS 7 and I launch the job the same
way on both of these, yet one of them creates standard core files and
the other creates the 'btr' files. Strange thing is, I could not find
anything on the .btr (= Backtrace?) files on Google, which is why I
asked on this forum.

Best regards
Durga

The surgeon general advises you to eat right, exercise regularly and
quit ageing.


Hi Durga

My search showed something, but quite weirdly related to databases.
Maybe the same file extension is used for two different things?
Does "file *.btr" tell you anything?

Databases:

http://cs.pervasive.com/forums/p/14533/50237.aspx

... more databases ...

http://www.openthefile.net/extension/btr

... binary tree indexes ...

http://www.velocityreviews.com/threads/index-btr-file-in-windows-xp-help-please.307459/

... and a catalog of butterflies!  :)

http://filext.com/file-extension/BTR
http://review-tech.appspot.com/btr-file.html

Oh well ...

... and finally a previous incarnation of an OpenMPI 1.6.5 question 
similar to yours (where .btr stands for backtrace):


http://stackoverflow.com/questions/25275450/cause-all-processes-running-under-openmpi-to-dump-core

Could this be due to an (unlikely) mix of OpenMPI 1.10 with 1.6.5?

Gus Correa



On Mon, May 9, 2016 at 12:04 PM, Gus Correa mailto:g...@ldeo.columbia.edu>> wrote:

Hi Durga

Just in case ...
If you're using a resource manager to start the jobs (Torque, etc),
you need to have them set the limits (for coredump size, stacksize,
locked memory size, etc).
This way the jobs will inherit the limits from the
resource manager daemon.
On Torque (which I use) I do this on the pbs_mom daemon
init script (I am still before the systemd era, that lovely POS).
And set the hard/soft limits on /etc/security/limits.conf as well.

I hope this helps,
Gus Correa

On 05/07/2016 12:27 PM, Jeff Squyres (jsquyres) wrote:

I'm afraid I don't know what a .btr file is -- that is not
something that is controlled by Open MPI.

You might want to look into your OS settings to see if it has
some kind of alternate corefile mechanism...?


On May 6, 2016, at 8:58 PM, dpchoudh . mailto:dpcho...@gmail.com>> wrote:

Hello all

I run MPI jobs (for test purpose only) on two different
'clusters'. Both 'clusters' have two nodes only, connected
back-to-back. The two are very similar, but not identical,
both software and hardware wise.

Both have ulimit -c set to unlimited. However, only one of
the two creates core files when an MPI job crashes. The
other creates a text file named something like

.80s-,.btr

I'd much prefer a core file because that allows me to debug
with a lot more options than a static text file with
addresses. How do I get a core file in all situations? I am
using MPI source from the master branch.

Thanks in advance
Durga

The surgeon general advises you to eat right, exercise
regularly and quit ageing.
___
users mailing list
us...@open-mpi.org 
Subscription:
https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2016/05/29124.php




___
users mailing list
us...@open-mpi.org 
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2016/05/29141.php




___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/05/29145.php





Re: [OMPI users] No core dump in some cases

2016-05-10 Thread Hjelm, Nathan Thomas
btr files are indeed created by open mpi's backtrace mechanism. I think we 
should revisit it at some point but for now the only effective way i have found 
to prevent it is to restore the default signal handlers after MPI_Init.

Excuse the quoting style. Good sucks.



From: users on behalf of dpchoudh .
Sent: Monday, May 09, 2016 2:59:37 PM
To: Open MPI Users
Subject: Re: [OMPI users] No core dump in some cases

Hi Gus

Thanks for your suggestion. But I am not using any resource manager (i.e. I am 
launching mpirun from the bash shell.). In fact, both of the two clusters I 
talked about run CentOS 7 and I launch the job the same way on both of these, 
yet one of them creates standard core files and the other creates the 'btr' 
files. Strange thing is, I could not find anything on the .btr (= Backtrace?) 
files on Google, which is why I asked on this forum.

Best regards
Durga

The surgeon general advises you to eat right, exercise regularly and quit 
ageing.

On Mon, May 9, 2016 at 12:04 PM, Gus Correa 
mailto:g...@ldeo.columbia.edu>> wrote:
Hi Durga

Just in case ...
If you're using a resource manager to start the jobs (Torque, etc),
you need to have them set the limits (for coredump size, stacksize, locked 
memory size, etc).
This way the jobs will inherit the limits from the
resource manager daemon.
On Torque (which I use) I do this on the pbs_mom daemon
init script (I am still before the systemd era, that lovely POS).
And set the hard/soft limits on /etc/security/limits.conf as well.

I hope this helps,
Gus Correa

On 05/07/2016 12:27 PM, Jeff Squyres (jsquyres) wrote:
I'm afraid I don't know what a .btr file is -- that is not something that is 
controlled by Open MPI.

You might want to look into your OS settings to see if it has some kind of 
alternate corefile mechanism...?


On May 6, 2016, at 8:58 PM, dpchoudh . 
mailto:dpcho...@gmail.com>> wrote:

Hello all

I run MPI jobs (for test purpose only) on two different 'clusters'. Both 
'clusters' have two nodes only, connected back-to-back. The two are very 
similar, but not identical, both software and hardware wise.

Both have ulimit -c set to unlimited. However, only one of the two creates core 
files when an MPI job crashes. The other creates a text file named something 
like
.80s-,.btr

I'd much prefer a core file because that allows me to debug with a lot more 
options than a static text file with addresses. How do I get a core file in all 
situations? I am using MPI source from the master branch.

Thanks in advance
Durga

The surgeon general advises you to eat right, exercise regularly and quit 
ageing.
___
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/05/29124.php



___
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/05/29141.php



Re: [OMPI users] No core dump in some cases

2016-05-09 Thread dpchoudh .
Hi Gus

Thanks for your suggestion. But I am not using any resource manager (i.e. I
am launching mpirun from the bash shell.). In fact, both of the two
clusters I talked about run CentOS 7 and I launch the job the same way on
both of these, yet one of them creates standard core files and the other
creates the 'btr' files. Strange thing is, I could not find anything on the
.btr (= Backtrace?) files on Google, which is why I asked on this forum.

Best regards
Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Mon, May 9, 2016 at 12:04 PM, Gus Correa  wrote:

> Hi Durga
>
> Just in case ...
> If you're using a resource manager to start the jobs (Torque, etc),
> you need to have them set the limits (for coredump size, stacksize, locked
> memory size, etc).
> This way the jobs will inherit the limits from the
> resource manager daemon.
> On Torque (which I use) I do this on the pbs_mom daemon
> init script (I am still before the systemd era, that lovely POS).
> And set the hard/soft limits on /etc/security/limits.conf as well.
>
> I hope this helps,
> Gus Correa
>
> On 05/07/2016 12:27 PM, Jeff Squyres (jsquyres) wrote:
>
>> I'm afraid I don't know what a .btr file is -- that is not something that
>> is controlled by Open MPI.
>>
>> You might want to look into your OS settings to see if it has some kind
>> of alternate corefile mechanism...?
>>
>>
>> On May 6, 2016, at 8:58 PM, dpchoudh .  wrote:
>>>
>>> Hello all
>>>
>>> I run MPI jobs (for test purpose only) on two different 'clusters'. Both
>>> 'clusters' have two nodes only, connected back-to-back. The two are very
>>> similar, but not identical, both software and hardware wise.
>>>
>>> Both have ulimit -c set to unlimited. However, only one of the two
>>> creates core files when an MPI job crashes. The other creates a text file
>>> named something like
>>>
>>> .80s-,.btr
>>>
>>> I'd much prefer a core file because that allows me to debug with a lot
>>> more options than a static text file with addresses. How do I get a core
>>> file in all situations? I am using MPI source from the master branch.
>>>
>>> Thanks in advance
>>> Durga
>>>
>>> The surgeon general advises you to eat right, exercise regularly and
>>> quit ageing.
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2016/05/29124.php
>>>
>>
>>
>>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29141.php
>


Re: [OMPI users] No core dump in some cases

2016-05-09 Thread Gus Correa

Hi Durga

Just in case ...
If you're using a resource manager to start the jobs (Torque, etc),
you need to have them set the limits (for coredump size, stacksize, 
locked memory size, etc).

This way the jobs will inherit the limits from the
resource manager daemon.
On Torque (which I use) I do this on the pbs_mom daemon
init script (I am still before the systemd era, that lovely POS).
And set the hard/soft limits on /etc/security/limits.conf as well.
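
A minimal sketch of what I mean (the exact values and the init script path
are site-specific, so take these as assumptions rather than a recipe):
in the pbs_mom init script, before the daemon is started,

    ulimit -c unlimited
    ulimit -s unlimited
    ulimit -l unlimited

and in /etc/security/limits.conf something like

    *   soft   core   unlimited
    *   hard   core   unlimited

so that the jobs started by the daemon inherit a core file size limit
large enough to actually write a core dump.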

I hope this helps,
Gus Correa

On 05/07/2016 12:27 PM, Jeff Squyres (jsquyres) wrote:

I'm afraid I don't know what a .btr file is -- that is not something that is 
controlled by Open MPI.

You might want to look into your OS settings to see if it has some kind of 
alternate corefile mechanism...?



On May 6, 2016, at 8:58 PM, dpchoudh .  wrote:

Hello all

I run MPI jobs (for test purpose only) on two different 'clusters'. Both 
'clusters' have two nodes only, connected back-to-back. The two are very 
similar, but not identical, both software and hardware wise.

Both have ulimit -c set to unlimited. However, only one of the two creates core 
files when an MPI job crashes. The other creates a text file named something 
like
.80s-,.btr

I'd much prefer a core file because that allows me to debug with a lot more 
options than a static text file with addresses. How do I get a core file in all 
situations? I am using MPI source from the master branch.

Thanks in advance
Durga

The surgeon general advises you to eat right, exercise regularly and quit 
ageing.
___
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2016/05/29124.php







Re: [OMPI users] No core dump in some cases

2016-05-07 Thread Jeff Squyres (jsquyres)
I'm afraid I don't know what a .btr file is -- that is not something that is 
controlled by Open MPI.

You might want to look into your OS settings to see if it has some kind of 
alternate corefile mechanism...?


> On May 6, 2016, at 8:58 PM, dpchoudh .  wrote:
> 
> Hello all
> 
> I run MPI jobs (for test purpose only) on two different 'clusters'. Both 
> 'clusters' have two nodes only, connected back-to-back. The two are very 
> similar, but not identical, both software and hardware wise.
> 
> Both have ulimit -c set to unlimited. However, only one of the two creates 
> core files when an MPI job crashes. The other creates a text file named 
> something like
> .80s-,.btr
> 
> I'd much prefer a core file because that allows me to debug with a lot more 
> options than a static text file with addresses. How do I get a core file in 
> all situations? I am using MPI source from the master branch.
> 
> Thanks in advance
> Durga
> 
> The surgeon general advises you to eat right, exercise regularly and quit 
> ageing.
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/05/29124.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] No core dump in some cases

2016-05-06 Thread dpchoudh .
Hello all

I run MPI jobs (for test purpose only) on two different 'clusters'. Both
'clusters' have two nodes only, connected back-to-back. The two are very
similar, but not identical, both software and hardware wise.

Both have ulimit -c set to unlimited. However, only one of the two creates
core files when an MPI job crashes. The other creates a text file named
something like
.80s-,.btr

I'd much prefer a core file because that allows me to debug with a lot more
options than a static text file with addresses. How do I get a core file in
all situations? I am using MPI source from the master branch.

Thanks in advance
Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.