Re: [OMPI users] Debugging a crash
On 29/01/21 15:58, Gilles Gouaillardet via users wrote:

Hi Gilles. Thanks for the answer.

> the mpirun command line starts 2 MPI tasks, but the error log mentions
> rank 56, so unless there is a copy/paste error, this is highly
> suspicious.

Uhm... going to re-check. Most probably it's just my error substituting
a variable, but it's worth checking again.

> I invite you to check the filesystem usage on this node, and make sure
> there is a similar amount of available space in /tmp and /dev/shm (or
> another filesystem if you use a non-standard $TMPDIR).

Well, on all those nodes /tmp is disk-based (~52G available) while
/dev/shm is a tmpfs with 239G "available", mounted as

  tmpfs  /dev/shm  tmpfs  defaults,size=95%

(we had to increase the default to 95% because that's required by the
new release of a library that "mixes" MPI and OpenMP to squeeze out a
bit more speed by reducing communication overhead).

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
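[Editor's note: for reference, the persistent form of the mount described above would be an /etc/fstab entry roughly like the following; only the size=95% option comes from the message, the dump/pass fields are the usual defaults.]

```
tmpfs  /dev/shm  tmpfs  defaults,size=95%  0  0
```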
Re: [OMPI users] Debugging a crash
Diego,

the mpirun command line starts 2 MPI tasks, but the error log mentions
rank 56, so unless there is a copy/paste error, this is highly
suspicious.

I invite you to check the filesystem usage on this node, and make sure
there is a similar amount of available space in /tmp and /dev/shm (or
another filesystem if you use a non-standard $TMPDIR).

Cheers,

Gilles

On Fri, Jan 29, 2021 at 10:50 PM Diego Zuccato via users wrote:
>
> Hello all.
>
> I'm having a problem with a job: if it gets scheduled on a specific node
> of our cluster, it fails with:
> -8<--
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> [str957-mtx-10:38099] *** Process received signal ***
> [str957-mtx-10:38099] Signal: Segmentation fault (11)
> [str957-mtx-10:38099] Signal code: Address not mapped (1)
> [str957-mtx-10:38099] Failing at address: 0x7f98cb266008
> [str957-mtx-10:38099] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f98ca553730]
> [str957-mtx-10:38099] [ 1] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7f98c8a99936]
> [str957-mtx-10:38099] [ 2] /lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7f98c8a82733]
> [str957-mtx-10:38099] [ 3] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7f98c8a995b4]
> [str957-mtx-10:38099] [ 4] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7f98c8bdc46e]
> [str957-mtx-10:38099] [ 5] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7f98c8b9488d]
> [str957-mtx-10:38099] [ 6] /lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7f98c8b50d7c]
> [str957-mtx-10:38099] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7f98c8c3afe4]
> [str957-mtx-10:38099] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7f98c946d656]
> [str957-mtx-10:38099] [ 9] /lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7f98ca2c111a]
> [str957-mtx-10:38099] [10] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7f98cae1ce62]
> [str957-mtx-10:38099] [11] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x7f98cae4b17e]
> [str957-mtx-10:38099] [12] Arepo(+0x3940)[0x561b45905940]
> [str957-mtx-10:38099] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f98ca3a409b]
> [str957-mtx-10:38099] [14] Arepo(+0x3d3a)[0x561b45905d3a]
> [str957-mtx-10:38099] *** End of error message ***
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 56 with PID 37999 on node
> str957-mtx-10 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> slurmstepd-str957-mtx-00: error: *** JOB 12129 ON str957-mtx-00
> CANCELLED AT 2021-01-28T14:11:33 ***
> -8<--
> [I cut out the other repetitions of the stack trace for brevity.]
>
> The command used to launch it is:
>   mpirun --mca mpi_leave_pinned 0 --mca oob_tcp_listen_mode listen_thread
>     -np 2 --map-by socket Arepo someargs
>
> The same job, when scheduled on another node, works without problems.
> As far as I could check, the nodes are configured the same (they were
> actually installed from the same series of scripts, following the same
> procedure: it was a set of 16 nodes and just one is giving trouble).
> I tried simpler MPI codes and could not reproduce the error. Other
> users are running different codes on the same node without problems.
> Packages are the same on all nodes. I already double-checked that the
> kernel module config is the same and memlock is unlimited.
> Any hint where to look?
>
> Tks.
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
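[Editor's note: a minimal sketch of the check Gilles suggests, to be run on the failing node; it assumes GNU df and the standard mount points named in the thread. The crash is inside PMIx's shared-memory datastore (pmix_common_dstor_init), which backs its segments with files on a tmp filesystem, so exhausted space there is a plausible way to get a segfault instead of a clean error.]

```shell
# Compare available space in the disk-backed /tmp and the
# tmpfs-backed /dev/shm, as suggested in the reply above.
# A large imbalance (or a near-full filesystem) on the one
# failing node would explain why only that node crashes.
df -h /tmp /dev/shm
```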
[OMPI users] Debugging a crash
Hello all.

I'm having a problem with a job: if it gets scheduled on a specific node
of our cluster, it fails with:

-8<--
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[str957-mtx-10:38099] *** Process received signal ***
[str957-mtx-10:38099] Signal: Segmentation fault (11)
[str957-mtx-10:38099] Signal code: Address not mapped (1)
[str957-mtx-10:38099] Failing at address: 0x7f98cb266008
[str957-mtx-10:38099] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f98ca553730]
[str957-mtx-10:38099] [ 1] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x2936)[0x7f98c8a99936]
[str957-mtx-10:38099] [ 2] /lib/x86_64-linux-gnu/libmca_common_dstore.so.1(pmix_common_dstor_init+0x9d3)[0x7f98c8a82733]
[str957-mtx-10:38099] [ 3] /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so(+0x25b4)[0x7f98c8a995b4]
[str957-mtx-10:38099] [ 4] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_gds_base_select+0x12e)[0x7f98c8bdc46e]
[str957-mtx-10:38099] [ 5] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_rte_init+0x8cd)[0x7f98c8b9488d]
[str957-mtx-10:38099] [ 6] /lib/x86_64-linux-gnu/libpmix.so.2(PMIx_Init+0xdc)[0x7f98c8b50d7c]
[str957-mtx-10:38099] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so(ext2x_client_init+0xc4)[0x7f98c8c3afe4]
[str957-mtx-10:38099] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so(+0x2656)[0x7f98c946d656]
[str957-mtx-10:38099] [ 9] /lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7f98ca2c111a]
[str957-mtx-10:38099] [10] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7f98cae1ce62]
[str957-mtx-10:38099] [11] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0x6e)[0x7f98cae4b17e]
[str957-mtx-10:38099] [12] Arepo(+0x3940)[0x561b45905940]
[str957-mtx-10:38099] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f98ca3a409b]
[str957-mtx-10:38099] [14] Arepo(+0x3d3a)[0x561b45905d3a]
[str957-mtx-10:38099] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 56 with PID 37999 on node
str957-mtx-10 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
slurmstepd-str957-mtx-00: error: *** JOB 12129 ON str957-mtx-00
CANCELLED AT 2021-01-28T14:11:33 ***
-8<--

[I cut out the other repetitions of the stack trace for brevity.]

The command used to launch it is:

  mpirun --mca mpi_leave_pinned 0 --mca oob_tcp_listen_mode listen_thread
    -np 2 --map-by socket Arepo someargs

The same job, when scheduled on another node, works without problems.
As far as I could check, the nodes are configured the same (they were
actually installed from the same series of scripts, following the same
procedure: it was a set of 16 nodes and just one is giving trouble).
I tried simpler MPI codes and could not reproduce the error. Other users
are running different codes on the same node without problems. Packages
are the same on all nodes. I already double-checked that the kernel
module config is the same and memlock is unlimited.

Any hint where to look?

Tks.

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
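[Editor's note: the memlock check mentioned above can be repeated per node with something like the following sketch; note that `ulimit -l` must be run from the same environment the job actually inherits (e.g. from within a Slurm-launched shell), since limits set in an interactive login do not necessarily match.]

```shell
# Print the locked-memory limit that processes started in this
# environment inherit; per the message above it should be
# "unlimited" on every node of the set.
ulimit -l
```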