Perhaps I spoke too soon. Now, with the Mellanox OFED stack, we occasionally
get the following failure on exit:
[compute-4-20:68008:0:68008] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
0 0x000000000002a3c5 opal_free_list_destruct() opal_free_list.c:0
1 0x0000000000001e89 mca_rcache_grdma_finalize() rcache_grdma_module.c:0
2 0x00000000000cbfdf mca_rcache_base_module_destroy() ???:0
3 0x000000000000dfef device_destruct() btl_openib_component.c:0
4 0x0000000000009c61 mca_btl_openib_finalize() ???:0
5 0x00000000000796f3 mca_btl_base_close() btl_base_frame.c:0
6 0x0000000000062c99 mca_base_framework_close() ???:0
7 0x0000000000062c99 mca_base_framework_close() ???:0
8 0x0000000000052a2a ompi_mpi_finalize() ???:0
9 0x0000000000046449 mpi_finalize__() ???:0
It appears to be non-deterministic, as far as my users can tell.
I don't know where to begin debugging this, but it started when we switched
from the OFED packages shipped with CentOS to the Mellanox OFED stack (which,
incidentally, seems to fail to recognize our oldest FDR IB cards).
If anyone has suggestions, I'd appreciate them.