Hi George,

I did not manage to trigger the core dump in a simpler test case, but
can reproduce the other parts: I can trigger a jsrun failing, and from
that point on all subsequent jsruns will also fail.

To reproduce, perform the following steps:

--------------------------------------------
$ bsub -W 1:00 -nnodes 32 -P BIP178 -Is /bin/bash
# on the batch node
$ cp /autofs/nccs-svm1_home1/merzky1/jsrun_test.tgz .
$ tar zxf jsrun_test.tgz
$ cd jsrun_test
$ source runme.sh
------------------------------------------------

After that, you can watch your jsrun processes with `ps`.  If you
trigger the error, you will see non-empty `unit.*.err` files in that
directory.  If you don't see those, and all jsrun processes are done,
you may need to remove all *.out and *.err files and try the last
command again.  I never needed more than 3 attempts to see failing
jsruns, and usually it 'worked' on the first attempt.
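
In case it is useful, here is a minimal sketch for checking whether the
error was triggered.  It only assumes the `unit.*.err` naming mentioned
above; the function name `list_failed_units` is made up for this mail
and is not part of jsrun_test.tgz.

```shell
# Hypothetical helper (not part of jsrun_test.tgz): print every
# non-empty unit.*.err file found in the given directory.
list_failed_units() {
    dir="${1:-.}"
    for f in "$dir"/unit.*.err; do
        # -s is true only if the file exists and has a size greater than zero
        if [ -s "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
    return 0
}
```

Running `list_failed_units . | wc -l` in the jsrun_test directory then
gives the number of units that wrote something to stderr.  If that is 0
and no jsrun process is left, remove the *.out and *.err files and
source runme.sh again.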

Best, Andre.

On Tue, Feb 12, 2019 at 12:53 AM Andre Merzky <an...@merzky.net> wrote:
>
> On Mon, Feb 11, 2019 at 10:04 PM George Markomanolis via RT
> <h...@nccs.gov> wrote:
> >
> > Hi Andre,
> >
> > I have no permissions to copy the files from your home. I would need the
> > core file and the binary to check if I can extract something more. Was it
> > possible to reproduce it with more simple cases?
>
> Apologies - I fixed the permission for the core file.  As for the
> binary: the binary is jsrun, which is a system utility - it is *not*
> my own application:
>
> core.113220: ELF 64-bit LSB core file 64-bit PowerPC or cisco 7500,
> version 1 (SYSV), SVR4-style, from
> '/opt/ibm/spectrum_mpi/jsm_pmix/bin/stock/jsrun -U
> /ccs/home/merzky1/radical.pil', real uid: 13416, effective uid: 13416,
> real gid: 24502, effective gid: 24502, execfn:
> '/opt/ibm/spectrum_mpi/jsm_pmix/bin/stock/jsrun', platform: 'power9'
>
>
> Best, Andre.
>
>
> > regards,
> > George
> >
> > On Wed Feb 06 17:59:08 2019, an...@merzky.net wrote:
> >
> >   Hi George.
> >
> >   comments inlined below.
> >
> >
> >   On Wed, Feb 6, 2019 at 7:35 PM George Markomanolis via RT <h...@nccs.gov> wrote:
> >   >
> >   > Hi Andre,
> >   >
> >   > Initially, could you unload xalt for testing before you submit your job?
> >   >   module unload xalt
> >
> >   Alas, the problem persists.  I should note that I have been running
> >   similar workloads successfully over the last days (or at least more
> >   successfully).  Now it fails consistently with the error described here
> >   (I only saw one core dump though).
> >
> >   I should check if this is workload dependent - I'll ping back if I
> >   see a difference in that respect.
> >
> >
> >   > The second error, we have seen it sometimes and it disappears, but we
> >   > can't reproduce it. Could you send us the submission script?
> >
> >   The submission mechanism is unfortunately not a single script, but a
> >   rather involved framework.  But basically we create a resource file
> >   like this:
> >
> >   $ cat unit.000115/unit.000115.rs
> >   RS 0: { host: 3 cpu: 35 36 37 }
> >   RS 1: { host: 3 cpu: 38 39 40 }
> >
> >   and then run with this command:
> >
> >   $ grep jsrun unit.000115/unit.000115.sh
> >   /sw/summit/xalt/1.1.3/bin/jsrun -U
> >   /ccs/home/merzky1/radical.pilot.sandbox/rp.session.login2.merzky1.017933.0015/pilot.0000/unit.000115//unit.000115.rs
> >   -a 1 -E "LD_LIBRARY_PATH" -E "PATH" -E "PYTHONPATH"
> >   -E "NODE_LFS_PATH" /bin/sleep "10"
> >
> >   The resource files vary and can specify up to 100 nodes, and the
> >   workload can also vary - here it is just a test, obviously.
> >
> >   I'll try to reproduce this in a simple submission script.
> >
> >   > I don't have access to the core file, could you copy it somewhere with
> >   > access as also the binary?
> >
> >   I copied the core file into my home directory, which should be world
> >   readable.   The executable is jsrun - otherwise I would not have
> >   bothered you guys :-)
> >
> >   $ file rp.session.login2.merzky1.017933.0015/pilot.0000/unit.000113/core.113220
> >   rp.session.login2.merzky1.017933.0015/pilot.0000/unit.000113/core.113220:
> >   ELF 64-bit LSB core file 64-bit PowerPC or cisco 7500, version 1
> >   (SYSV), SVR4-style, from
> >   '/opt/ibm/spectrum_mpi/jsm_pmix/bin/stock/jsrun -U
> >   /ccs/home/merzky1/radical.pil', real uid: 13416, effective uid: 13416,
> >   real gid: 24502, effective gid: 24502, execfn:
> >   '/opt/ibm/spectrum_mpi/jsm_pmix/bin/stock/jsrun', platform: 'power9'
> >
> >   > Just to be sure you have one job with 128 jsrun calls?
> >
> >   Yes, aehm, this is a small scale test.  We are working with a pilot
> >   system which launches many small tasks within a larger job allocation.
> >   We are just getting started on summit and are now testing jsrun
> >   capabilities.  As said earlier, I had several runs at larger scale w/o
> >   seeing this specific problem.
> >
> >   Let me know if you need more info or want me to run any tests.
> >
> >   Thanks, Andre.
> >
> >   > regards,
> >   > George
> >   >
> >   > On Wed Feb 06 05:26:32 2019, an...@merzky.net wrote:
> >   >
> >   >  Your name: Andre Merzky
> >   >  Your username: merzky1
> >   >  Email address: an...@merzky.net
> >   >  Subject of your question/problem: jsrun core dump
> >   >  Describe your question/problem:
> >   >
> >   >     Dear support,
> >   >
> >   >     I am executing 128 jsruns in a sufficiently large job. Out of those,
> >   >     one fails with:
> >   >     ```
> >   >     cat unit.000113/STDERR
> >   >     /autofs/nccs-svm1_sw/summit/xalt/1.1.3/bin/xalt_helper_functions.sh:
> >   >     line 185: 113220 Segmentation fault (core dumped) $MY_CMD "$@"
> >   >     ```
> >   >     The core is in
> >   >     /autofs/nccs-svm1_home1/merzky1/radical.pilot.sandbox/rp.session.login2.merzky1.017933.0015/pilot.0000/unit.000113/core.113220,
> >   >     I will leave it there.
> >   >
> >   >     97 more jsruns fail with:
> >   >     ```
> >   >     $ cat unit.000112/*ERR
> >   >     Error: Locate pipe file
> >   >     /tmp/jsm.batch4.13416/168421/JSM_rm_port_13416_168421 timed out.
> >   >     Error message: No such file or directory
> >   >     02-06-2019 05:18:25:897 112266 main: Error initializing RM connection.
> >   >     Exiting.
> >   >     ```
> >   >     which I assume is caused by the first one failing. While I see jsruns
> >   >     failing from time to time, this is the first one where subsequent
> >   >     instances do not succeed.
> >   >
> >   >     Let me know if you need more information.
> >   >
> >   >     Thanks, Andre
> >   >
> >
> >
_______________________________________________
mtt-users mailing list
mtt-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/mtt-users
