On Wed, Sep 19, 2012 at 6:29 PM, Reuti <[email protected]> wrote:
> Hi,
>
> On 19.09.2012 at 11:26, Mazouzi wrote:
>
> > We have an MPI application (OpenMPI compiled with the sge flag) executing
> > on some nodes.
> >
> > When a node (a slave) crashes, we receive an email like this:
> >
> > Job 156880 caused action: PE Job 156880 will be deleted
> > User = xxxxx
> > Queue = q.all@node05
> > Start Time = <unknown>
> > End Time = <unknown>
> > failed before writing exit_status:shepherd exited with exit status 19: before writing exit_status
> >
> > But qstat shows that the job is still running and that node05 is holding
> > a job instance (ssh to node05 shows 0 processes).
> >
> > Is that normal behaviour? I expected all processes to be killed.
> >
> > Here is the PE configuration:
> >
> > qconf -sp mpi
> > pe_name mpi
> > slots 999
> > user_lists NONE
> > xuser_lists NONE
> > start_proc_args /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
>
> For the latest Open MPI you don't need these two scripts any longer. They
> prepare a hostfile for the old MPICH(1).
>
> Do you use the "builtin" startup in SGE, or `ssh` to reach the slave nodes?
>

Hi Reuti,

We are using the builtin startup.

> -- Reuti
>
> > stop_proc_args /opt/sge/mpi/stopmpi.sh
> > allocation_rule $fill_up
> > control_slaves TRUE
> > job_is_first_task FALSE
> > urgency_slots min
> > accounting_summary FALSE
> >
> > Thanks.
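
For reference, Reuti's remark about the two scripts could translate into a PE definition like the one below. This is only a sketch: the `/bin/true` placeholders for start_proc_args and stop_proc_args are an assumption (they are the no-op values commonly used for tightly integrated Open MPI), not something stated in this thread. All other values are copied unchanged from the configuration quoted above.

```
pe_name            mpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
```

Such a change would normally be made with `qconf -mp mpi`. It only removes the unneeded MPICH(1) hostfile scripts and is not claimed here to fix the stale job instance on node05.
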
_______________________________________________ users mailing list [email protected] https://gridengine.org/mailman/listinfo/users
