On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:

> Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is
> it always the same node that crashes? And so on.

Another thought on hardware errors... I have actually seen bad RAM cause spontaneous reboots with no Linux warnings. Do you have any hardware diagnostics from your server vendor that you can run?

A simple way to test your RAM (it's not completely comprehensive, but it does check for a surprisingly wide array of memory issues) is to do something like this (a rough C sketch):

-----
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t i;
    size_t increment = (size_t)1 << 30;  /* 1 GB */
    size_t size = increment;
    void *probe;
    volatile int *ptr;  /* volatile so the write and readback are not optimized away */

    /* Find the biggest amount of memory that you can malloc */
    while (increment >= 1024) {
        probe = malloc(size);
        if (NULL != probe) {
            free(probe);
            size += increment;
        } else {
            size -= increment;
            increment /= 2;
        }
    }
    printf("I can malloc %zu bytes\n", size);

    /* Malloc that huge chunk of memory */
    ptr = malloc(size);
    if (NULL == ptr) {
        printf("Final malloc failed\n");
        return 1;
    }

    /* Write a value to every int, then read it straight back */
    for (i = 0; i < size / sizeof(int); ++i) {
        ptr[i] = 37;
        if (ptr[i] != 37) {
            printf("Readback error at byte offset %zu!\n", i * sizeof(int));
        }
    }
    printf("All done\n");

    free((void *)ptr);
    return 0;
}
-----

Depending on how much memory you have, that might take a little while to run (all the memory has to be paged in, etc.). You might want to add a status output to show progress, and/or write/read a page at a time for better efficiency, etc. But you get the idea.

Hope that helps.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
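
The status-output and page-at-a-time suggestions above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a vetted diagnostic: the helper name check_region, its parameters, and the choice of stderr for progress messages are all made up for this example, and the chunk size is left to the caller (4096 is a common page size, but that is an assumption about the machine).

-----
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the original post): fill `buf` one chunk
 * at a time with `pattern`, read each chunk back, and return the number
 * of bytes that did not read back correctly.
 *
 * chunk  - bytes to write before reading back (e.g. 4096, an assumed page size)
 * report - print progress to stderr roughly every `report` bytes; 0 disables it
 */
size_t check_region(volatile unsigned char *buf, size_t size, size_t chunk,
                    unsigned char pattern, size_t report)
{
    size_t errors = 0;
    size_t next_report = report;
    size_t off, i, len;

    assert(chunk > 0);

    for (off = 0; off < size; off += len) {
        /* The last chunk may be shorter than the others */
        len = (size - off < chunk) ? size - off : chunk;

        /* Write the whole chunk first... */
        for (i = 0; i < len; ++i) {
            buf[off + i] = pattern;
        }
        /* ...then read it back and count mismatches */
        for (i = 0; i < len; ++i) {
            if (buf[off + i] != pattern) {
                ++errors;
            }
        }

        /* Status output so a long run shows signs of life */
        if (report != 0 && off + len >= next_report) {
            fprintf(stderr, "checked %zu of %zu bytes\n", off + len, size);
            next_report += report;
        }
    }
    return errors;
}
-----

You would call it on the big malloc'ed buffer, e.g. check_region(ptr, size, 4096, 0x25, (size_t)1 << 30) to get a progress line about once per GB. One caveat: a chunk as small as a page is likely to be read back from the CPU cache rather than from RAM, so a larger chunk, or a separate read pass over the whole buffer, may exercise the memory itself more.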