Hi,

We have an MPI application (Open MPI compiled with SGE support) running on
several nodes.

When a slave node crashes, we receive an email like this:

Job 156880 caused action: PE Job 156880 will be deleted

 User        = xxxxx
 Queue       = q.all@node05
 Start Time  = <unknown>
 End Time    = <unknown>
failed before writing exit_status:shepherd exited with exit status 19:
before writing exit_status


However, qstat shows that the job is still running and that node05 is still
holding a job instance, even though ssh to node05 shows no processes from the job.

Is this normal behaviour? I would expect all processes of the job to be killed.

Here is the pe configuration:

qconf -sp mpi
pe_name           mpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/sge/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
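
In case it helps anyone checking the same settings on their side, here is a
minimal Python sketch that parses the qconf -sp output above into a dictionary,
so that the fields related to slave control (control_slaves, job_is_first_task)
can be inspected from a script. The names parse_pe_config and PE_TEXT are only
illustrative and are not part of SGE; the text is the PE configuration pasted
above.

```python
# Illustrative sketch: parse `qconf -sp <pe>` output into a dict.
# PE_TEXT and parse_pe_config are hypothetical names, not SGE tooling.
PE_TEXT = """\
pe_name           mpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/sge/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
"""


def parse_pe_config(text):
    """Split each line into (key, value) at the first run of whitespace."""
    config = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:  # skip blank or malformed lines
            config[parts[0]] = parts[1].strip()
    return config


pe = parse_pe_config(PE_TEXT)
print("control_slaves    =", pe["control_slaves"])
print("job_is_first_task =", pe["job_is_first_task"])
print("start_proc_args   =", pe["start_proc_args"])
```

On a live system the same function could be fed the real command output instead
of the pasted text.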

Thanks.
