Hi Goetz

Thanks for trying these other versions.  This looks like a bug.  Could you
post the config.log from your 2.1.0 build to the list?

Also, could you try running the job with this extra command-line argument to
see if the problem goes away?

mpirun --mca btl ^vader (rest of your args)
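
The same exclusion can also be set through the environment, which may be
easier inside a batch script. A minimal sketch, assuming a POSIX shell: Open
MPI reads MCA parameters from environment variables prefixed with OMPI_MCA_,
and a leading ^ excludes the listed components.

```shell
# Equivalent to passing "--mca btl ^vader" on the mpirun command line.
# The leading ^ excludes the listed component(s); the quotes make sure the
# shell passes the value through unchanged.
export OMPI_MCA_btl='^vader'

# Confirm the setting before launching the job as usual.
echo "OMPI_MCA_btl=${OMPI_MCA_btl}"
```

With the variable exported, mpirun can be started with the original arguments
and no extra --mca option.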

Howard

Götz Waschk <goetz.was...@gmail.com> wrote on Wed, Mar 22, 2017 at 13:09:

On Wed, Mar 22, 2017 at 7:46 PM, Howard Pritchard <hpprit...@gmail.com>
wrote:
> Hi Goetz,
>
> Would you mind testing against the 2.1.0 release or the latest from the
> 1.10.x series (1.10.6)?

Hi Howard,

After sending my mail I tested both 1.10.6 and 2.1.0 and got the same
error. I also tested outside of Slurm, launching over ssh, and saw the
same problem.

Here's the message from 2.1.0:
[pax11-10:21920] *** Process received signal ***
[pax11-10:21920] Signal: Bus error (7)
[pax11-10:21920] Signal code: Non-existant physical address (2)
[pax11-10:21920] Failing at address: 0x2b5d5b752290
[pax11-10:21920] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b5d446e9370]
[pax11-10:21920] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b5d531645e0]
[pax11-10:21920] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b5d44f607c1]
[pax11-10:21920] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x2b51)[0x2b5d53162b51]
[pax11-10:21920] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x3f)[0x2b5d5bb0a17f]
[pax11-10:21920] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xa7a)[0x2b5d5bafe0aa]
[pax11-10:21920] [ 6] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x399)[0x2b5d44480429]
[pax11-10:21920] [ 7] /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b5d444486ab]
[pax11-10:21920] [ 8] IMB-MPI1[0x40b2ff]
[pax11-10:21920] [ 9] IMB-MPI1[0x402646]
[pax11-10:21920] [10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b5d44917b35]
[pax11-10:21920] [11] IMB-MPI1[0x401f79]
[pax11-10:21920] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 320 with PID 21920 on node pax11-10
exited on signal 7 (Bus error).
--------------------------------------------------------------------------


Regards, Götz Waschk
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users