Hello all,
Thank you all for the suggestions. Takahiro's suggestion has gotten me to a
point where all of the tests will run, but as soon as it gets to the cleanup
step IMB will segfault again. I opened an issue on IMB's GitHub, but I
guess I am not going to be able to get much help from them. So I
On Wed, 20 Feb 2019 10:46:10 -0500
Adam LeBlanc wrote:
> Hello,
>
> When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
> mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node
> --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca
> pml ob1 --mca
OpenMPI v4.0.0 signal 11 (Segmentation fault)
Hello Adam,
IMB had a bug related to Reduce_scatter.
https://github.com/intel/mpi-benchmarks/pull/11
I'm not sure this bug is the cause but you can try the patch.
https://github.com/intel/mpi-benchmarks/commit/841446d8cf4ca1f607c0f24b9a424ee39ee1f569
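For reference, Reduce_scatter is the collective where each rank contributes
a full send buffer and gets back one reduced piece, so a mis-sized receive
buffer can easily overrun memory. A minimal correct call looks like the
sketch below; the buffer sizes and counts are illustrative only, not taken
from IMB's code:

  #include <mpi.h>
  #include <stdlib.h>

  /* Each rank receives recvcounts[rank] reduced elements;
   * the send buffer must cover the sum of all recvcounts. */
  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int size, rank;
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int *recvcounts = malloc(size * sizeof(int));
      for (int i = 0; i < size; i++)
          recvcounts[i] = 4;                      /* 4 elements per rank */

      int total = 4 * size;                       /* sum of all recvcounts */
      double *sendbuf = malloc(total * sizeof(double));
      double *recvbuf = malloc(recvcounts[rank] * sizeof(double));
      for (int i = 0; i < total; i++)
          sendbuf[i] = rank + i;

      MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts,
                         MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      free(sendbuf); free(recvbuf); free(recvcounts);
      MPI_Finalize();
      return 0;
  }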
Thanks,
Takahiro Kawashima,
Fujitsu
I was not able to reproduce the issue with openib on 4.0; instead I
randomly segfault in MPI_Finalize during the grdma cleanup.
I could however reproduce the TCP timeout part with both 4.0 and master, on
a pretty sane cluster (only 3 interfaces: lo, eth0 and virbr0). With no
surprise, the
Hello Howard,
Thanks for all of the help and suggestions; I will look into them. I also
realized that my Ansible wasn't set up properly for handling tar files, so
the nightly build didn't even install, but I will do it by hand and will give
you an update tomorrow somewhere in the afternoon.
Thanks,
Hello Adam,
This helps some. Could you post the first 20 lines of your config.log? This
will help in trying to reproduce. The content of your host file (you can use
generic names for the nodes if that's an issue to publicize) would also
help, as would the number of nodes and the number of MPI processes/node.
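If useful, "head -n 20 config.log" will grab exactly those lines. And for
reference, an Open MPI host file is just one node per line, optionally with
a slot count; the names below are made-up placeholders:

  node01 slots=2
  node02 slots=2
  node03 slots=2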
On the tcp side it doesn't segfault anymore, but it will time out on some
tests; on the openib side it will still segfault. Here is the output:
[pandora:19256] *** Process received signal ***
[pandora:19256] Signal: Segmentation fault (11)
[pandora:19256] Signal code: Address not mapped (1)
Can you try the latest 4.0.x nightly snapshot and see if the problem still
occurs?
https://www.open-mpi.org/nightly/v4.0.x/
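The usual sequence for one of those tarballs is roughly the following (the
tarball name is a placeholder, since it changes with every nightly build):

  tar xzf openmpi-v4.0.x-<timestamp>.tar.gz
  cd openmpi-v4.0.x-<timestamp>
  ./configure --prefix=$HOME/ompi-nightly
  make -j 8 all install

and then point your PATH and LD_LIBRARY_PATH at the new prefix.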
> On Feb 20, 2019, at 1:40 PM, Adam LeBlanc wrote:
>
> I do, here is the output:
>
> 2 total processes killed (some possibly by mpirun during cleanup)
>
I do, here is the output:
2 total processes killed (some possibly by mpirun during cleanup)
[pandora:12238] *** Process received signal ***
[pandora:12238] Signal: Segmentation fault (11)
[pandora:12238] Signal code: Invalid permissions (2)
[pandora:12238] Failing at address: 0x7f5c8e31fff0
Hi Adam,
As a sanity check, if you try to use --mca btl self,vader,tcp
do you still see the segmentation fault?
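Something along these lines, keeping the rest of your original command (the
benchmark binary at the end is left as a placeholder, since your command
line was cut off in the mail):

  mpirun --map-by node --mca btl self,vader,tcp --mca pml ob1 \
      -np 6 -hostfile /home/aleblanc/ib-mpi-hosts <benchmark>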
Howard
On Wed, Feb 20, 2019 at 08:50 Adam LeBlanc <
alebl...@iol.unh.edu> wrote:
> Hello,
>
> When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
> mpirun
Hello,
When I do a run with OpenMPI v4.0.0 on Infiniband with this command: mpirun
--mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
btl_openib_allow_ib 1 -np 6
-hostfile /home/aleblanc/ib-mpi-hosts
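For anyone skimming the thread, the flags in that command break down as
follows (my paraphrase of standard Open MPI MCA parameters, not from the
original mail):

  --mca btl_openib_warn_no_device_params_found 0  silences the warning about
                                                  missing device parameters
  --map-by node                                   round-robins ranks across nodes
  --mca orte_base_help_aggregate 0                shows every help message
                                                  instead of aggregating them
  --mca btl openib,vader,self                     selects the InfiniBand,
                                                  shared-memory and self BTLs
  --mca pml ob1                                   forces the ob1 point-to-point
                                                  layer (rather than UCX)
  --mca btl_openib_allow_ib 1                     re-enables openib on
                                                  InfiniBand, which 4.0
                                                  disables by default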