Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Gus Correa
On 09/07/2012 08:02 AM, Jeff Squyres wrote:
> On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:
>> Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is it always the same node that crashes? And so on.
> Another thought on hardware errors... I actually have seen bad RAM

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Gus Correa
On 09/03/2012 04:39 PM, Andrea Negri wrote:
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200

Re: [OMPI users] problem with rankfile

2012-09-07 Thread Ralph Castain
On Sep 7, 2012, at 5:41 AM, Siegmar Gross wrote:
> Hi,
>
> are the following outputs helpful to find the error with
> a rankfile on Solaris?
If you can't bind on the new Solaris machine, then the rankfile won't do you any good. It looks like we are

Re: [OMPI users] problem with rankfile

2012-09-07 Thread Siegmar Gross
Hi, are the following outputs helpful to find the error with a rankfile on Solaris? I wrapped long lines so that they are easier to read. Have you had time to look at the segmentation fault with a rankfile which I reported in my last email (see below)? "tyr" is a two processor single core

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Jeff Squyres
On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:
> Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is
> it always the same node that crashes? And so on.
Another thought on hardware errors... I actually have seen bad RAM cause spontaneous reboots with no Linux

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Jeff Squyres
On Sep 5, 2012, at 3:59 AM, Andrea Negri wrote:
> I have tried with these flags (I use gcc 4.7 and Open MPI 1.6), but
> the program doesn't crash; a node goes down and the rest of them remain
> waiting for a signal (there is an ALLREDUCE in the code).
>
> Anyway, yesterday some processes died (without

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Andrea Negri
George,
I have made some modifications to the code; however, this is the first part of my zmp_list:
!ZEUSMP2 CONFIGURATION FILE
LGEOM= 2, LDIMEN = 2 /
LRAD = 0, XHYDRO = .TRUE., XFORCE = .TRUE., XMHD = .false.,

Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-07 Thread Randolph Pullen
One system is actually an i5-2400. Maybe it's throttling back on 2 cores to save power? The other (i7) shows consistent CPU MHz on all cores.
From: Yevgeny Kliteynik
To: Randolph Pullen; OpenMPI Users

Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-07 Thread Randolph Pullen
Yevgeny,
The ibstat results:
CA 'mthca0'
    CA type: MT25208 (MT23108 compat mode)
    Number of ports: 2
    Firmware version: 4.7.600
    Hardware version: a0
    Node GUID: 0x0005ad0c21e0
    System image GUID: 0x0005ad000100d050
    Port 1: