Hi Goetz,

Thanks for trying these other versions. Looks like a bug. Could you post the config.log output from your build of 2.1.0 to the list?
Also, could you try running the job using this extra command line arg to see if the problem goes away?

    mpirun --mca btl ^vader (rest of your args)

Howard

Götz Waschk <goetz.was...@gmail.com> wrote on Wed, Mar 22, 2017 at 13:09:

On Wed, Mar 22, 2017 at 7:46 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
> Hi Goetz,
>
> Would you mind testing against the 2.1.0 release or the latest from the
> 1.10.x series (1.10.6)?

Hi Howard,

after sending my mail I have tested both 1.10.6 and 2.1.0 and I have received the same error. I have also tested outside of slurm using ssh, same problem.

Here's the message from 2.1.0:

[pax11-10:21920] *** Process received signal ***
[pax11-10:21920] Signal: Bus error (7)
[pax11-10:21920] Signal code: Non-existant physical address (2)
[pax11-10:21920] Failing at address: 0x2b5d5b752290
[pax11-10:21920] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b5d446e9370]
[pax11-10:21920] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b5d531645e0]
[pax11-10:21920] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b5d44f607c1]
[pax11-10:21920] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x2b51)[0x2b5d53162b51]
[pax11-10:21920] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x3f)[0x2b5d5bb0a17f]
[pax11-10:21920] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xa7a)[0x2b5d5bafe0aa]
[pax11-10:21920] [ 6] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x399)[0x2b5d44480429]
[pax11-10:21920] [ 7] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b5d444486ab]
[pax11-10:21920] [ 8] IMB-MPI1[0x40b2ff]
[pax11-10:21920] [ 9] IMB-MPI1[0x402646]
[pax11-10:21920] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b5d44917b35]
[pax11-10:21920] [11] IMB-MPI1[0x401f79]
[pax11-10:21920] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 320 with PID 21920 on node pax11-10 exited on signal 7 (Bus error).
--------------------------------------------------------------------------

Regards,
Götz Waschk
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users