Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

Jeff Squyres Fri, 7 Sep 2012 08:02:15 -0400

On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:

> Also look for hardware errors.  Perhaps you have some bad RAM somewhere.  Is 
> it always the same node that crashes?  And so on.



Another thought on hardware errors... I actually have seen bad RAM cause 
spontaneous reboots with no Linux warnings.

Do you have any hardware diagnostics from your server vendor that you can run?

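A simple way to test your RAM (it's not comprehensive, but it does catch a 
surprisingly wide array of memory issues) is to malloc as much memory as you 
can, write to all of it, and read it back.  Something like this:

-----
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t GB = (size_t) 1 << 30;
    size_t i, count, size = 0, increment = GB, errors = 0;
    volatile int *ptr;
    void *probe;

    // Find (roughly) the biggest amount of memory that you can malloc:
    // keep growing while malloc succeeds, halve the step when it fails.
    // "size" is always the last size that malloc'ed successfully.
    while (increment >= 1024) {
        // Guard against size_t wraparound before probing a bigger size
        probe = (size + increment > size) ? malloc(size + increment) : NULL;
        if (NULL != probe) {
            free(probe);
            size += increment;
        } else {
            increment /= 2;
        }
    }
    printf("I can malloc %zu bytes\n", size);

    // Malloc that huge chunk of memory
    ptr = malloc(size);
    if (NULL == ptr) {
        printf("Could not malloc %zu bytes\n", size);
        return 1;
    }

    // Write a known value to every int and read it back.  Index into
    // the buffer (rather than incrementing ptr) so it can be freed, and
    // use volatile so the compiler cannot optimize the readback away.
    count = size / sizeof(int);
    for (i = 0; i < count; ++i) {
        ptr[i] = 37;
        if (ptr[i] != 37) {
            printf("Readback error at index %zu!\n", i);
            ++errors;
        }
    }

    free((void *) ptr);
    printf("All done (%zu errors)\n", errors);
    return errors ? 1 : 0;
}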
-----

Depending on how much memory you have, that might take a little while to run 
(all the memory has to be paged in, etc.).  You might want to add a status 
output to show progress, and/or write/read a page at a time for better 
efficiency, etc.  But you get the idea.
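
For what it's worth, here is a rough sketch of the "chunk at a time, with 
progress output" variant.  Treat it as an illustration, not a real memory 
tester: the names check_chunk and test_memory, the two bit patterns, and the 
roughly-every-10% progress report are all my own assumptions, and the chunk 
size is whatever you pass in (4096 bytes is a common page size, but check 
your system).

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: fill one chunk with a pattern, then read the
// whole chunk back.  Returns the number of words that read back wrong.
// volatile keeps the compiler from optimizing the readback away.
static size_t check_chunk(volatile unsigned int *chunk, size_t words,
                          unsigned int pattern)
{
    size_t i, errors = 0;

    for (i = 0; i < words; ++i) {
        chunk[i] = pattern;
    }
    for (i = 0; i < words; ++i) {
        if (chunk[i] != pattern) {
            ++errors;
        }
    }
    return errors;
}

// Hypothetical driver: malloc "size" bytes, test them "chunk_bytes" at
// a time, and print progress to stderr roughly every 10%.  Returns the
// error count, or (size_t) -1 if the malloc fails.
size_t test_memory(size_t size, size_t chunk_bytes)
{
    size_t words = size / sizeof(unsigned int);
    size_t step = chunk_bytes / sizeof(unsigned int);
    size_t done = 0, errors = 0, next_report = 0;
    unsigned int *buf;

    if (0 == words || 0 == step) {
        return 0;   // nothing to test
    }
    buf = malloc(words * sizeof(unsigned int));
    if (NULL == buf) {
        return (size_t) -1;
    }

    while (done < words) {
        size_t n = (words - done < step) ? words - done : step;

        // Two complementary patterns, so each bit is tested as 0 and 1
        errors += check_chunk(buf + done, n, 0x55555555u);
        errors += check_chunk(buf + done, n, 0xAAAAAAAAu);
        done += n;

        if (done >= next_report) {
            fprintf(stderr, "tested %zu of %zu words, %zu errors\n",
                    done, words, errors);
            next_report += words / 10 + 1;
        }
    }

    free(buf);
    return errors;
}
```

You would call test_memory(size, 4096) with the size reported by the malloc 
probe; anything other than 0 back means either the malloc failed or some 
words did not read back correctly.  One caveat: a readback right after a 
write may well be served from the CPU cache, so a chunk size larger than your 
caches is more likely to exercise the RAM itself.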

Hope that helps.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
