Re: [OMPI users] Disconnections

2009-07-01 Thread Ralph Castain
On Jul 1, 2009, at 3:10 PM, Daniel Miles wrote: Hi, everybody. I’m having trouble where one of my client nodes crashes while I have an MPI job on it. When this happens, the mpirun process on the head node never returns. This shouldn't happen - we should cleanly abort. What version are

[OMPI users] Disconnections

2009-07-01 Thread Daniel Miles
Hi, everybody. I'm having trouble where one of my client nodes crashes while I have an MPI job on it. When this happens, the mpirun process on the head node never returns. I can kill it with a SIGINT (ctrl-c) and it still cleans up its child processes on the remaining healthy client nodes but I

[OMPI users] quadrics support?

2009-07-01 Thread Michael Di Domenico
Did the quadrics support for OpenMPI ever materialize? I can't find any documentation on the web about it, and the few mailing list messages I came across showed some hints that it might be in progress, but that was way back in 2007. Thanks

Re: [OMPI users] btl_openib_connect_oob.c:459:qp_create_one:errorcreating qp

2009-07-01 Thread Jeff Squyres
On Jul 1, 2009, at 8:01 AM, Jeff Squyres (jsquyres) wrote: Random thought: would it be easy for the output of cat /dev/knem to indicate whether IOAT hardware is present? Well *that* was replying to the wrong message. :-) A real reply is below... > I have problems running large jobs on a

Re: [OMPI users] btl_openib_connect_oob.c:459:qp_create_one: errorcreating qp

2009-07-01 Thread Jeff Squyres
Random thought: would it be easy for the output of cat /dev/knem to indicate whether IOAT hardware is present? On Jul 1, 2009, at 5:23 AM, Jose Gracia wrote: Dear all, I have problems running large jobs on a PC cluster with OpenMPI V1.3. Typically the error appears only for processor count

[OMPI users] btl_openib_connect_oob.c:459:qp_create_one: error creating qp

2009-07-01 Thread Jose Gracia
Dear all, I have problems running large jobs on a PC cluster with OpenMPI V1.3. Typically the error appears only for processor count >= 2048 (actually cores), sometimes also below. The nodes (Intel Nehalem, 2 procs, 4 cores each) run (scientific?) linux. $> uname -a Linux cl3fr1