As far as I know, OMPI combines the fault-tolerance features of FT-MPI,
LA-MPI, and LAM/MPI. Is this statement still correct? Or, as you say, does
OMPI support only checkpoint/restart (as in LAM/MPI)? I don't know the
details of FT-MPI or LA-MPI; are they not useful or necessary?

In fact, what I really want to know is this: suppose I run a job on N
processors with OMPI, and one or more of these processors crash. What would
OMPI's fault-tolerance mechanism do then? And what should the sysadmin do in
the meantime (for example, restart the crashed node)?

In my understanding, after the crash the sysadmin should restart the crashed
node (if it can be restarted) and then trigger the rollback with some sort of
command, while OMPI suspends all the computing processes and waits for the
rollback command. Is this correct?

Thanks again.

--------- Original Message ---------
From: "Open MPI Users" <us...@open-mpi.org>
To: "Open MPI Users" <us...@open-mpi.org>
Subject: Re: [OMPI users] 2012/06/18 14:35:07 Auto-saved draft
Date: 2012/06/20 01:26:08, Wednesday

That's a little bit strong - OMPI still supports checkpoint/restart as a
fault tolerance mechanism. There really isn't anything the sysadmin has to
do, though - what is required is that users periodically order their programs
to checkpoint so they can be restarted after a failure.

Checkpointing is typically done either by the app itself (say, when it
reaches a point it considers a good one to save), or by a script that simply
orders a checkpoint every so many seconds.

What we have said is that we don't believe the FT "run thru failure" position
pushed by UTK is particularly required at this time.
Partly a question of impact vs benefit, mostly due to competing approaches
offering equivalent fault recovery capability with less impact. But that's a
separate discussion.

On Jun 19, 2012, at 11:16 AM, George Bosilca wrote:

It has been clearly stated that the official position pushed forward by a
majority of the Open MPI developer community is that fault tolerance is not
needed, so we (read this as the official version of Open MPI) do not support
it.

However, a group of researchers has been working toward a version of Open MPI
that supports the last fault tolerance proposal submitted for consideration
to the MPI Forum. You can access it at
https://bitbucket.org/jjhursey/ompi-ulfm-rts.

  george.

On Jun 19, 2012, at 09:58, 陈松 wrote:

Hi all,

Can anyone explain the fault-tolerance features in OpenMPI to me? I've read
the FAQs and some of the papers on this topic listed on open-mpi.org, but I
still can't figure out what the fault-tolerance mechanism in OpenMPI does
when one node of my supercomputer system fails during a computation, and what
we system administrators should do after the failure (or before it). Can
anyone help me? My boss wants me to deploy OpenMPI on our system because he
wants the fault-tolerance feature.

Thanks very much.

---------------
CHEN Song
R&D Department
National Supercomputer Center in Tianjin
Binhai New Area, Tianjin, China
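
For the "script that just orders a checkpoint every so many seconds"
mentioned in the reply above, a minimal sketch could look like the following.
It assumes an Open MPI build with checkpoint/restart support whose
ompi-checkpoint tool takes the PID of mpirun as its argument; the function
name and its parameters are hypothetical and not part of Open MPI.

```python
import subprocess
import time


def checkpoint_periodically(mpirun_pid, interval_s, max_checkpoints,
                            run=subprocess.call, sleep=time.sleep):
    """Order a checkpoint every `interval_s` seconds, up to `max_checkpoints` times.

    Returns the list of exit codes and stops at the first non-zero one
    (for example, because the job has already finished).
    """
    # Assumption: `ompi-checkpoint <PID of mpirun>` is available on PATH.
    command = ["ompi-checkpoint", str(mpirun_pid)]
    exit_codes = []
    for _ in range(max_checkpoints):
        sleep(interval_s)            # wait before ordering the next checkpoint
        code = run(command)          # ask the running job to save a snapshot
        exit_codes.append(code)
        if code != 0:                # checkpoint failed or job is gone: stop
            break
    return exit_codes
```

Calling checkpoint_periodically(12345, 600, 100) would order a checkpoint of
the job started by the mpirun with PID 12345 every ten minutes, at most 100
times. The run and sleep parameters exist only so the loop can be exercised
without a real MPI job. Restarting from a saved snapshot after a failure is a
separate step and is not shown here.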




_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
