Hi,

We have an MPI application (Open MPI compiled with the SGE flag) running on several nodes.

When a node (a slave) crashes, we receive an email like this:

    Job 156880 caused action: PE Job 156880 will be deleted
    User = xxxxx
    Queue = q.all@node05
    Start Time = <unknown>
    End Time = <unknown>
    failed before writing exit_status:shepherd exited with exit status 19: before writing exit_status

However, qstat shows that the job is still running and that node05 is still holding a job instance (ssh to node05 shows 0 processes).

Is that normal behaviour? I expected all processes to be killed.

Here is the PE configuration:

    qconf -sp mpi
    pe_name            mpi
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
    stop_proc_args     /opt/sge/mpi/stopmpi.sh
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE

Thanks.
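In case it helps anyone compare settings, here is a minimal sketch (not part of our setup) that turns the `qconf -sp mpi` output above into a dictionary, so that values such as control_slaves can be checked from a script. The function name parse_pe_config and the PE_TEXT constant are hypothetical names chosen for illustration; the text itself is copied from this post.

```python
# Text copied from the `qconf -sp mpi` output quoted in this post.
PE_TEXT = """\
pe_name            mpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/sge/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
"""


def parse_pe_config(text):
    """Parse 'key value' lines into a dict (hypothetical helper).

    The first whitespace-separated token is the key; the rest of the
    line, which may itself contain spaces, is the value.
    """
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        config[key] = value
    return config


if __name__ == "__main__":
    pe = parse_pe_config(PE_TEXT)
    # Print the settings most relevant to how slave tasks are started.
    for key in ("control_slaves", "job_is_first_task", "allocation_rule"):
        print(key, "=", pe[key])
```

On a live system the same function could presumably be fed the output of `qconf -sp mpi` captured with subprocess, but that is an assumption and has not been tried here.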
