On Thu, Oct 18, 2012 at 11:54 AM, Reuti <[email protected]> wrote:

> On 19.09.2012 at 19:58, Mazouzi wrote:
>
> > On Wed, Sep 19, 2012 at 6:29 PM, Reuti <[email protected]> wrote:
> > Hi,
> >
> > On 19.09.2012 at 11:26, Mazouzi wrote:
> >
> > > We have an MPI (OpenMPI compiled with sge flag) application executing on some nodes.
> > >
> > > When a node (a slave) crashes, we receive an email like this:
> > >
> > > Job 156880 caused action: PE Job 156880 will be deleted
> > >  User        = xxxxx
> > >  Queue       = q.all@node05
> > >  Start Time  = <unknown>
> > >  End Time    = <unknown>
> > > failed before writing exit_status:shepherd exited with exit status 19: before writing exit_status
> > >
> > > But qstat shows that the job is running and node05 is holding a job instance (ssh to node05 shows 0 processes).
> > >
> > > Is that normal behaviour? I expected all processes to be killed.
> > >
> > > Here is the pe configuration:
> > >
> > > qconf -sp   mpi
> > > pe_name           mpi
> > > slots              999
> > > user_lists         NONE
> > > xuser_lists        NONE
> > > start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
> >
> > For the latest Open MPI you don't need these two scripts any longer. They prepare a hostfile for the old MPICH(1).
> >
> > Do you use the "builtin" startup in SGE, or `ssh` to go to the slave nodes?
> >
> > Hi Reuti,
> > We are using builtin startup.
>
> Was this solved, or does it still persist?
>

Hi Reuti,

We kept "start_proc_args" because some users still run mpirun -machinefile
$TMPDIR/machines.
The start procedure only produces the machine file.
So we are waiting for the next crash to see whether the problem persists.
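
For reference, what such a start procedure does can be sketched roughly as
below. This is only an illustration in Python, not the actual startmpi.sh. It
assumes the usual $pe_hostfile layout (one line per host: hostname, slot
count, queue, processor range), and the function names are made up for this
example.

```python
import os


def pe_hostfile_to_machines(pe_hostfile_text):
    """Expand pe_hostfile lines into one hostname per granted slot.

    Assumed input format, one host per line:
        <hostname> <slots> <queue> <processor range>
    """
    machines = []
    for line in pe_hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        # A host that was granted N slots appears N times in the machine file.
        machines.extend([host] * slots)
    return machines


def write_machinefile(pe_hostfile_path, tmpdir):
    """Hypothetical helper: write <tmpdir>/machines from a pe_hostfile."""
    with open(pe_hostfile_path) as f:
        hosts = pe_hostfile_to_machines(f.read())
    target = os.path.join(tmpdir, "machines")
    with open(target, "w") as out:
        out.write("".join(h + "\n" for h in hosts))
    return target
```

A job script can then keep calling mpirun -machinefile $TMPDIR/machines as
before. The sketch only builds the file and takes no part in starting or
killing the slave processes.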

Thx.

>
> -- Reuti
>
>
> > -- Reuti
> >
> >
> > > stop_proc_args     /opt/sge/mpi/stopmpi.sh
> > > allocation_rule    $fill_up
> > > control_slaves     TRUE
> > > job_is_first_task  FALSE
> > > urgency_slots      min
> > > accounting_summary FALSE
> > >
> > > Thanks.
> > >
> > > _______________________________________________
> > > users mailing list
> > > [email protected]
> > > https://gridengine.org/mailman/listinfo/users
> >
> >
>
>