Mazouzi <[email protected]> writes:

> Hi,
>
> We have an MPI application (Open MPI compiled with SGE support) running on
> several nodes.
>
> When a node (a slave) crashes, we receive an email like this:
>
> Job 156880 caused action: PE Job 156880 will be deleted
>
>  User        = xxxxx
>  Queue       = q.all@node05
>  Start Time  = <unknown>
>  End Time    = <unknown>
> failed before writing exit_status:shepherd exited with exit status 19:
> before writing exit_status
>
>
> But qstat shows that the job is still running, and node05 is still holding a
> job instance (ssh to node05 shows 0 processes).
>
> Is that normal behaviour? I expected all processes to be killed.

Well, yes.  These may be relevant:
<https://arc.liv.ac.uk/trac/SGE/ticket/1283>,
<https://arc.liv.ac.uk/trac/SGE/ticket/1346>.  If the job won't die
otherwise, try qdel -f.

> Here is the pe configuration:
>
> qconf -sp   mpi
> pe_name           mpi
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args     /opt/sge/mpi/stopmpi.sh

I don't know what those two scripts actually do in this case, but with
Open MPI you should set both start_proc_args and stop_proc_args to "none"
(assuming it was built with SGE integration; if it wasn't, it should be).

> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
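
Putting that together, and keeping your other settings as they are, the
PE would look roughly like the sketch below. The only changes from your
"qconf -sp mpi" output are the two *_proc_args lines. I'm assuming NONE
is the spelling your version accepts for "none" (check sge_pe(5)), and
the lines starting with # are annotations for the reader, not something
qconf prints.

```text
# Annotation only, not qconf output: NONE is an assumed spelling of "none";
# verify against sge_pe(5) on your installation before applying.
pe_name            mpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
```

With control_slaves TRUE, an SGE-aware Open MPI starts its remote
processes under Grid Engine's control, so the rsh-catching wrapper set up
by startmpi.sh should not be needed.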

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
