Problem resolved: I set ConnectTimeout N in /etc/ssh/ssh_config, and mpirun now exits after N seconds.
Thanks a lot!

From: buptzh...@hotmail.com
To: us...@open-mpi.org
List-Post: users@lists.open-mpi.org
Date: Thu, 2 Apr 2009 11:05:25 +0800
Subject: Re: [OMPI users] Beginner's question: how to avoid a running mpi job hang if host or network failed or orted deamon killed?

Thank you very much! The option -mca orte_heartbeat_rate N is very useful for detecting failures such as a host or network failure, or a killed orted daemon, while an MPI job is running.

I have another question: I use ssh for Open MPI remote connections, but sometimes a host does not answer ssh login requests even though it still answers ping, perhaps because of an OS problem. If such a faulty host is in the hostfile, the "mpirun -hostfile..." command hangs even if I set -mca orte_heartbeat_rate 5. Are there any other options to avoid this?

Thanks a lot!

From: r...@lanl.gov
To: us...@open-mpi.org
List-Post: users@lists.open-mpi.org
Date: Wed, 1 Apr 2009 07:34:46 -0600
Subject: Re: [OMPI users] Beginner's question: how to avoid a running mpi job hang if host or network failed or orted deamon killed?

There is indeed a heartbeat mechanism you can use - it is "off" by default. You can set it to check every N seconds with:

-mca orte_heartbeat_rate N

on your command line. Or if you want it to always run, add "orte_heartbeat_rate = N" to your default MCA param file. OMPI will declare the orted "dead" if two consecutive heartbeats are not seen.

Let me know how it works for you - it hasn't been extensively tested, but has worked so far.

Ralph

On Apr 1, 2009, at 6:07 AM, Guanyinzhu wrote:

I mean that I killed the orted daemon process while the MPI job was running, but the MPI job hung and did not notice that one of its ranks had failed.

> Date: Wed, 1 Apr 2009 19:09:34 +0800
> From: ml.jgmben...@mailsnare.net
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Beginner's question: how to avoid a running mpi job hang if host or network failed or orted deamon killed?
>
> Is there a firewall somewhere ?
>
> Jerome
>
> Guanyinzhu wrote:
> > Hi!
> >
> > I'm using OpenMPI 1.3 on ten nodes connected with Gigabit Ethernet on Redhat Linux x86_64.
> >
> > I ran a test like this: I just killed the orted process, and the job hung for a long time (2~3 hours, after which I killed the job).
> >
> > I have the following questions:
> >
> > When the network or a host fails, or the orted daemon is killed by accident, how long does it take for the running MPI job to notice and exit?
> >
> > Does OpenMPI support a heartbeat mechanism, or how could I quickly detect the failure to avoid the MPI job hanging?
> >
> > Thanks a lot!
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
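
For anyone finding this thread later, here is a rough sketch of how the two settings discussed above (ssh's ConnectTimeout and -mca orte_heartbeat_rate) might be combined with a pre-flight check that drops hosts which do not answer ssh before mpirun is started. It is only an illustration, not something posted in the thread: the helper names, the timeout values, and the assumption that each hostfile line looks like "hostname [slots=N]" are all made up for the example, and it passes the surviving hosts with mpirun's -host option instead of rewriting the hostfile.

```python
import subprocess


def parse_hostfile(text):
    """Return host names from hostfile text, skipping comments and blank lines.

    Assumes each entry looks like "hostname [slots=N]" (an assumption of this sketch).
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line:
            hosts.append(line.split()[0])  # keep only the host name
    return hosts


def ssh_probe_cmd(host, connect_timeout=5):
    """Build an ssh command that runs a trivial remote command.

    BatchMode=yes makes ssh fail instead of waiting for a password prompt.
    """
    return ["ssh", "-o", f"ConnectTimeout={connect_timeout}",
            "-o", "BatchMode=yes", host, "true"]


def host_answers_ssh(host, connect_timeout=5, hard_limit=15):
    """Return True if the probe finishes successfully within hard_limit seconds.

    hard_limit is a wall-clock backstop in case ssh itself stalls.
    """
    try:
        result = subprocess.run(ssh_probe_cmd(host, connect_timeout),
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL,
                                timeout=hard_limit)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def mpirun_cmd(hosts, program, heartbeat_rate=5):
    """Build an mpirun command line with the heartbeat enabled."""
    return ["mpirun", "-mca", "orte_heartbeat_rate", str(heartbeat_rate),
            "-host", ",".join(hosts), program]
```

A possible use would be to read the hostfile, keep only the hosts for which host_answers_ssh() returns True, and pass the result of mpirun_cmd() to subprocess.run(). The values 5 and 15 are arbitrary and would need tuning for a real cluster.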