Mazouzi <[email protected]> writes:

> Hi,
>
> We have an MPI (OpenMPI compiled with sge flag) application executing on
> some nodes.
>
> When a node crash (a slave) we receive an email like this :
>
> Job 156880 caused action: PE Job 156880 will be deleted
>
> User = xxxxx
> Queue = q.all@node05
> Start Time = <unknown>
> End Time = <unknown>
> failed before writing exit_status:shepherd exited with exit status 19:
> before writing exit_status
>
>
> But qstat show that the job is running and node05 is holding a job instance
> (ssh in node05 show 0 process)
>
> Is that a normal behaviour ? I expect all process will be killed.

Well, yes. These may be relevant:
<https://arc.liv.ac.uk/trac/SGE/ticket/1283>,
<https://arc.liv.ac.uk/trac/SGE/ticket/1346>. If the job won't die
otherwise, try qdel -f.

> Here is the pe configuration:
>
> qconf -sp mpi
> pe_name mpi
> slots 999
> user_lists NONE
> xuser_lists NONE
> start_proc_args /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args /opt/sge/mpi/stopmpi.sh

I don't know what those two scripts are actually doing in this case, but
with Open MPI you should set both start_proc_args and stop_proc_args to
"none" (assuming it was built with SGE integration, and if it wasn't, it
should be).

> allocation_rule $fill_up
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
> accounting_summary FALSE

-- 
Community Grid Engine: http://arc.liv.ac.uk/SGE/
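
For reference, here is a sketch of what the quoted mpi PE might look like with the start and stop procedures disabled, as suggested in the reply. Every value except the two *_proc_args lines is copied from the quoted configuration. Writing the value as NONE is an assumption based on the "none" wording in the reply. Some installations instead use /bin/true as a no-op, so check sge_pe(5) for your version. The definition can be edited with qconf -mp mpi.

```
pe_name            mpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
```

control_slaves stays TRUE, which is what an SGE-integrated Open MPI needs in order to start its remote processes under Grid Engine's control.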
