Apologies, I have not taken the time to read your comprehensive diagnostics!
As Gus says, this sounds like a memory problem. My suspicion would be the kernel Out Of Memory (OOM) killer. Log into those nodes (or ask your systems manager to do this) and look closely at /var/log/messages, where the kernel leaves a notification whenever the OOM killer kicks in and - well - kills large-memory processes! Grep for "killed process" in /var/log/messages, ignoring case (grep -i), since the kernel writes it as "Killed process".
http://linux-mm.org/OOM_Killer
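If it is handier to script the check across several nodes, here is a minimal sketch of the same search in Python. It assumes the log lives at /var/log/messages and is readable by whoever runs it (usually root); some distributions log kernel messages elsewhere. The names find_oom_kills and scan_log are made up for this illustration.

```python
import re

# Case-insensitive, because the kernel writes the line as "Killed process".
OOM_PATTERN = re.compile(r"killed process", re.IGNORECASE)


def find_oom_kills(lines):
    """Return the lines that report a process killed by the OOM killer."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan one log file; the default path is an assumption, adjust per distro."""
    # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
    with open(path, errors="replace") as log:
        return find_oom_kills(log)
```

On a node you would call scan_log() and print whatever it returns. An empty list only means nothing matched in that one file, so rotated logs may still need checking.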