Hi Goetz,

Would you mind testing against the 2.1.0 release or the latest from the
1.10.x series (1.10.6)?

Thanks,

Howard


2017-03-22 6:25 GMT-06:00 Götz Waschk <goetz.was...@gmail.com>:

> Hi everyone,
>
> I'm testing a new machine with 32 nodes of 32 cores each using the IMB
> benchmark. It works fine with 512 processes, but it crashes with
> 1024 processes after running for a minute:
>
> [pax11-17:16978] *** Process received signal ***
> [pax11-17:16978] Signal: Bus error (7)
> [pax11-17:16978] Signal code: Non-existant physical address (2)
> [pax11-17:16978] Failing at address: 0x2b147b785450
> [pax11-17:16978] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b1473b13370]
> [pax11-17:16978] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x8e)[0x2b14794a413e]
> [pax11-17:16978] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(ompi_free_list_grow+0x199)[0x2b147384f309]
> [pax11-17:16978] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_vader.so(+0x270d)[0x2b14794a270d]
> [pax11-17:16978] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x43)[0x2b1479ae3a13]
> [pax11-17:16978] [ 5] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x89a)[0x2b1479ad90ca]
> [pax11-17:16978] [ 6] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_ring+0x3f1)[0x2b147ad6ec41]
> [pax11-17:16978] [ 7] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(MPI_Allreduce+0x17b)[0x2b147387d6bb]
> [pax11-17:16978] [ 8] IMB-MPI1[0x40b316]
> [pax11-17:16978] [ 9] IMB-MPI1[0x407284]
> [pax11-17:16978] [10] IMB-MPI1[0x40250e]
> [pax11-17:16978] [11]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1473d41b35]
> [pax11-17:16978] [12] IMB-MPI1[0x401f79]
> [pax11-17:16978] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 552 with PID 0 on node pax11-17
> exited on signal 7 (Bus error).
> --------------------------------------------------------------------------
>
> The program is launched from the Slurm batch system using mpirun. The
> same application works fine when using MVAPICH2 instead.
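>
> A minimal standalone reproducer along these lines should hit the same
> MPI_Allreduce path (a sketch only, not the actual IMB source; the file
> name, buffer size, and iteration count are illustrative):
>
>   /* allreduce_repro.c -- build with mpicc, launch with mpirun */
>   #include <mpi.h>
>   #include <stdlib.h>
>
>   int main(int argc, char **argv)
>   {
>       MPI_Init(&argc, &argv);
>
>       /* Illustrative message size; IMB sweeps many sizes. */
>       const int count = 1 << 20;
>       double *sendbuf = malloc(count * sizeof(double));
>       double *recvbuf = malloc(count * sizeof(double));
>       for (int i = 0; i < count; i++)
>           sendbuf[i] = 1.0;
>
>       /* The trace shows the tuned ring allreduce under
>        * MPI_Allreduce, so repeated allreduces exercise that path. */
>       for (int iter = 0; iter < 1000; iter++)
>           MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE,
>                         MPI_SUM, MPI_COMM_WORLD);
>
>       free(sendbuf);
>       free(recvbuf);
>       MPI_Finalize();
>       return 0;
>   }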
>
> Regards, Götz Waschk
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
