Hi,

On 19.09.2012 at 11:26, Mazouzi wrote:

> We have an MPI (Open MPI compiled with the sge flag) application executing on
> some nodes.
>
> When a node (a slave) crashes, we receive an email like this:
>
> Job 156880 caused action: PE Job 156880 will be deleted
> User = xxxxx
> Queue = q.all@node05
> Start Time = <unknown>
> End Time = <unknown>
> failed before writing exit_status:shepherd exited with exit status 19: before
> writing exit_status
>
> But qstat shows that the job is running and node05 is holding a job instance
> (ssh to node05 shows 0 processes).
>
> Is that normal behaviour? I expected all processes to be killed.
>
> Here is the pe configuration:
>
> qconf -sp mpi
> pe_name mpi
> slots 999
> user_lists NONE
> xuser_lists NONE
> start_proc_args /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile

For the latest Open MPI you don't need these two scripts any longer. They prepare a hostfile for the old MPICH(1).

Do you use the "builtin" startup in SGE, or `ssh` to reach the slave nodes?

-- Reuti

> stop_proc_args /opt/sge/mpi/stopmpi.sh
> allocation_rule $fill_up
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
> accounting_summary FALSE
>
> Thanks.
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
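
To illustrate the point about the start/stop scripts: a minimal sketch of the same PE with the two MPICH(1) helper scripts dropped might look like the fragment below. It assumes an Open MPI built with SGE support, which reads the granted hosts from SGE itself. Using `/bin/true` as a no-op placeholder is an assumption here; depending on the SGE version, `NONE` may also be accepted for these two entries. All other values are copied from the configuration quoted above.

```
pe_name            mpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
```

`control_slaves TRUE` stays as it is, since that is what lets SGE start and track the slave tasks itself.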
