I manage a very heterogeneous cluster. I have nodes of different ages
with different processors, different amounts of RAM, etc. One user is
reporting that on certain nodes, his jobs keep crashing with the errors
below. His application uses OpenMPI 1.10.3, which I know is ancient,
but someone else in his research group built the code against it, so
that's the version he's stuck with.
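In case it helps with diagnosis, this is roughly how I'd confirm which
Open MPI the prebuilt binary actually resolves on one of the failing
nodes ("his_app" is just a placeholder for the user's executable):

    $ ldd ./his_app | grep -i libmpi    # which libmpi.so the binary links against
    $ ompi_info | grep "Open MPI:"      # Open MPI version in the current environment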
I searched Google for "Signal code: Non-existant physical address",
and it appears that this may be a bug in 1.10.3 that happens on certain
hardware. Can anyone else confirm this? The full error message is below:
[dawson120:29064] *** Process received signal ***
[dawson120:29062] *** Process received signal ***
[dawson120:29062] Signal: Bus error (7)
[dawson120:29062] Signal code: Non-existant physical address (2)
[dawson120:29062] Failing at address: 0x7ff3f030f180
[dawson120:29067] *** Process received signal ***
[dawson120:29067] Signal: Bus error (7)
[dawson120:29067] Signal code: Non-existant physical address (2)
[dawson120:29067] Failing at address: 0x7fb2b8a61d18
[dawson120:29077] *** Process received signal ***
[dawson120:29078] *** Process received signal ***
[dawson120:29078] Signal: Bus error (7)
[dawson120:29078] Signal code: Non-existant physical address (2)
[dawson120:29078] Failing at address: 0x7f60a13d2c98
[dawson120:29078] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x7f60b7efd7e0]
[dawson120:29078] [ 1] /usr/pppl/intel/2015-pkgs/openmpi-1.10.3/lib/openmpi/mca_allocator_bucket.so(mca_allocator_bucket_alloc_align+0x84)[0x7f60b20f6ea4]
I've asked the user to switch to a newer version of OpenMPI, but since
his whole research group uses the same application and someone else
built it, he's not in a position to do that. For now, he's excluding
the "bad" nodes with Slurm's -x (--exclude) option.
I just want to know if this is in fact a bug in 1.10.3, or if there's
something we can do to fix this error.
Thanks,
--
Prentice