I manage a very heterogeneous cluster: nodes of different ages with different processors, different amounts of RAM, etc. One user reports that on certain nodes, his jobs keep crashing with the errors below. His application uses OpenMPI 1.10.3, which I know is an ancient version of OpenMPI, but someone else in his research group built the code with that, so that's what he's stuck with.

I did a Google search for "Signal code: Non-existant physical address", and it appears this may be a bug in 1.10.3 that occurs on certain hardware. Can anyone else confirm this? The full error message is below:

[dawson120:29064] *** Process received signal ***
[dawson120:29062] *** Process received signal ***
[dawson120:29062] Signal: Bus error (7)
[dawson120:29062] Signal code: Non-existant physical address (2)
[dawson120:29062] Failing at address: 0x7ff3f030f180
[dawson120:29067] *** Process received signal ***
[dawson120:29067] Signal: Bus error (7)
[dawson120:29067] Signal code: Non-existant physical address (2)
[dawson120:29067] Failing at address: 0x7fb2b8a61d18
[dawson120:29077] *** Process received signal ***
[dawson120:29078] *** Process received signal ***
[dawson120:29078] Signal: Bus error (7)
[dawson120:29078] Signal code: Non-existant physical address (2)
[dawson120:29078] Failing at address: 0x7f60a13d2c98
[dawson120:29078] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7f60b7efd7e0]
[dawson120:29078] [ 1] /usr/pppl/intel/2015-pkgs/openmpi-1.10.3/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x84)[0x7f60b20f6ea4]

I've asked the user to switch to a newer version of OpenMPI, but since his whole research group uses the same application and someone else built it, he's not in a position to do that. For now, he's excluding the "bad" nodes with Slurm's -x (--exclude) option.
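For anyone unfamiliar with that workaround, here's a minimal sketch of a batch script using Slurm's --exclude directive; the node names and script contents are hypothetical, only the --exclude usage is the point:

```shell
#!/bin/bash
# Hypothetical job script illustrating the workaround.
#SBATCH --job-name=mpi_app
#SBATCH --nodes=4
# Keep the job off the nodes that trigger the bus error
# (dawson120 is from the error log; other names are examples):
#SBATCH --exclude=dawson120,dawson121

srun ./app
```

The same option works on the srun/sbatch command line, e.g. `sbatch -x dawson120 job.sh`.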

I just want to know if this is in fact a bug in 1.10.3, or if there's something we can do to fix this error.

Thanks,

--
Prentice
