On Wed, Sep 19, 2012 at 6:29 PM, Reuti <[email protected]> wrote:

> Hi,
>
> Am 19.09.2012 um 11:26 schrieb Mazouzi:
>
> > We have an MPI application (Open MPI compiled with the SGE flag) running
> > on some nodes.
> >
> > When a node (a slave) crashes, we receive an email like this:
> >
> > Job 156880 caused action: PE Job 156880 will be deleted
> >  User        = xxxxx
> >  Queue       = q.all@node05
> >  Start Time  = <unknown>
> >  End Time    = <unknown>
> > failed before writing exit_status:shepherd exited with exit status 19: before writing exit_status
> >
> > But qstat shows that the job is still running and that node05 is holding
> > a job instance (an ssh to node05 shows 0 processes).
> >
> > Is that normal behaviour? I expected all processes to be killed.
> >
> > Here is the pe configuration:
> >
> > qconf -sp   mpi
> > pe_name           mpi
> > slots              999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
>
> For the latest Open MPI you don't need these two scripts any longer. They
> prepare a hostfile for the old MPICH(1).
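>
> As a rough sketch (an assumption on my side, modelled on the PE usually
> suggested for Open MPI's tight integration and not tested against your
> setup), the two entries could simply point to a no-op:
>
> ```
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> ```
>
> The remaining entries, in particular `control_slaves TRUE`, would stay as
> they are.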
>
> Do you use the "builtin" startup in SGE, or `ssh` to reach the slave nodes?
>
Hi Reuti,

We are using the builtin startup.

> -- Reuti
>
>
> > stop_proc_args     /opt/sge/mpi/stopmpi.sh
> > allocation_rule    $fill_up
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary FALSE
> >
> > Thanks.
> >
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
>
>
