On Thu, Oct 18, 2012 at 11:54 AM, Reuti <[email protected]> wrote:
> On 19.09.2012 at 19:58, Mazouzi wrote:
>
> > On Wed, Sep 19, 2012 at 6:29 PM, Reuti <[email protected]> wrote:
> > Hi,
> >
> > On 19.09.2012 at 11:26, Mazouzi wrote:
> >
> > > We have an MPI application (OpenMPI compiled with the sge flag) executing on some nodes.
> > >
> > > When a node (a slave) crashes, we receive an email like this:
> > >
> > > Job 156880 caused action: PE Job 156880 will be deleted
> > > User = xxxxx
> > > Queue = q.all@node05
> > > Start Time = <unknown>
> > > End Time = <unknown>
> > > failed before writing exit_status:shepherd exited with exit status 19: before writing exit_status
> > >
> > > But qstat shows that the job is running and node05 is holding a job instance (ssh to node05 shows 0 processes).
> > >
> > > Is that normal behaviour? I expected all processes to be killed.
> > >
> > > Here is the PE configuration:
> > >
> > > qconf -sp mpi
> > > pe_name mpi
> > > slots 999
> > > user_lists NONE
> > > xuser_lists NONE
> > > start_proc_args /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
> >
> > For the latest Open MPI you don't need these two scripts any longer. They prepare a hostfile for the old MPICH(1).
> >
> > You use the "builtin" startup in SGE, or `ssh` to go to the slave nodes?
> >
> > Hi Reuti,
> > We are using builtin startup.
>
> Was this solved, or is it still persistent?

Hi Reuti,

We kept "start_proc_args" because some users still use mpirun -machinefile $TMPDIR/machines.

The start procedure just produces the machine file.

So we are waiting for the next crash to see.

Thanks.

> -- Reuti
>
> > -- Reuti
> >
> > > stop_proc_args /opt/sge/mpi/stopmpi.sh
> > > allocation_rule $fill_up
> > > control_slaves TRUE
> > > job_is_first_task FALSE
> > > urgency_slots min
> > > accounting_summary FALSE
> > >
> > > Thanks.
> > >
> > > _______________________________________________
> > > users mailing list
> > > [email protected]
> > > https://gridengine.org/mailman/listinfo/users
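For readers wondering what "just produces the machine file" amounts to, here is a rough Python sketch of that step. It is not the actual startmpi.sh; it only assumes the usual $pe_hostfile layout of one line per host with four columns (hostname, slot count, queue, processor range) and repeats each hostname once per granted slot. The function name expand_pe_hostfile and the sample host node04 are made up for illustration.

```python
def expand_pe_hostfile(text):
    """Return one hostname per granted slot, in hostfile order.

    Assumes each line looks like: <host> <slots> <queue> <processor range>
    """
    machines = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        machines.extend([host] * slots)
    return machines


# Hypothetical hostfile content: node04 is invented, node05 is the host
# named in the thread; the slot counts are invented too.
sample = """node04 2 q.all@node04 UNDEFINED
node05 1 q.all@node05 UNDEFINED
"""

if __name__ == "__main__":
    # Print what a machines file built from the sample would contain.
    print("\n".join(expand_pe_hostfile(sample)))
```

Writing the returned list, one name per line, to $TMPDIR/machines would give users of mpirun -machinefile the file they expect. With the tight integration in Open MPI, mpirun can get the host list from SGE itself, which is why Reuti says the scripts are no longer needed.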
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
