I manage a very heterogeneous cluster. I have nodes of different ages
with different processors, different amounts of RAM, etc. One user is
reporting that on certain nodes, his jobs keep crashing with the errors
below. His application uses OpenMPI 1.10.3, which I know is ancient,
but someone else in his research group built the code against it, so
that's the version he's stuck with.
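In case it helps with diagnosis, this is roughly how I'd confirm which
Open MPI the prebuilt binary actually resolves on one of the failing
nodes ("his_app" is just a placeholder for the user's executable):

    $ ldd ./his_app | grep -i libmpi    # which libmpi.so the binary links against
    $ ompi_info | grep "Open MPI:"      # Open MPI version in the current environment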
I searched Google for "Signal code: Non-existant physical address",
and it appears that this may be a bug in 1.10.3 that happens on certain
hardware. Can anyone else confirm this? The full error message is below:
[dawson120:29064] *** Process received signal ***
[dawson120:29062] *** Process received signal ***
[dawson120:29062] Signal: Bus error (7)
[dawson120:29062] Signal code: Non-existant physical address (2)
[dawson120:29062] Failing at address: 0x7ff3f030f180
[dawson120:29067] *** Process received signal ***
[dawson120:29067] Signal: Bus error (7)
[dawson120:29067] Signal code: Non-existant physical address (2)
[dawson120:29067] Failing at address: 0x7fb2b8a61d18
[dawson120:29077] *** Process received signal ***
[dawson120:29078] *** Process received signal ***
[dawson120:29078] Signal: Bus error (7)
[dawson120:29078] Signal code: Non-existant physical address (2)
[dawson120:29078] Failing at address: 0x7f60a13d2c98
[dawson120:29078] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7f60b7efd7e0]
[dawson120:29078] [ 1] /usr/pppl/intel/2015-pkgs/openmpi-1.10.3/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x84)[0x7f60b20f6ea4]
I've asked the user to switch to a newer version of OpenMPI, but since
his whole research group uses the same application and someone else
built it, he's not in a position to do that. For now, he's excluding
the "bad" nodes with Slurm's -x (--exclude) option.
I just want to know if this is in fact a bug in 1.10.3, or if there's
something we can do to fix this error.
Thanks,
--
Prentice