Hi,

On 19.09.2012 at 11:26, Mazouzi wrote:

> We have an MPI (OpenMPI compiled with sge flag) application executing on some 
> nodes. 
> 
> When a node (a slave) crashes, we receive an email like this:
> 
> Job 156880 caused action: PE Job 156880 will be deleted
>  User        = xxxxx
>  Queue       = q.all@node05
>  Start Time  = <unknown>
>  End Time    = <unknown>
> failed before writing exit_status:shepherd exited with exit status 19: before 
> writing exit_status
> 
> But qstat shows that the job is still running and that node05 is holding a 
> job instance (ssh into node05 shows 0 processes).
> 
> Is that normal behaviour? I expected all processes to be killed.
> 
> Here is the PE configuration:
> 
> qconf -sp   mpi
> pe_name           mpi
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile

With recent Open MPI versions you no longer need the two scripts 
(`startmpi.sh` and `stopmpi.sh`). They prepare a hostfile for the old MPICH(1).
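
As a rough sketch of what that could look like: the PE below keeps every 
value from your configuration and only swaps the two `*_proc_args` entries 
for `/bin/true`, a no-op. This assumes your Open MPI build really has SGE 
support compiled in, so that it reads the granted hosts from SGE itself; I 
have not tested it on your cluster.

```text
pe_name            mpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
```

The PE can be edited with `qconf -mp mpi`.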

Do you use the "builtin" startup in SGE, or `ssh`, to reach the slave nodes?
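
If you are unsure, one way to check is to look at the `rsh_command` entry 
in the cluster configuration. The small helper below is only a sketch: 
`startup_method` is a name made up for this mail, and it assumes that 
`qconf -sconf` prints the entry as `rsh_command <value>` on a single line.

```shell
# Hypothetical helper (not part of SGE): reads cluster configuration text
# on stdin and prints the first argument of the rsh_command entry,
# or "unset" if no such entry is present.
startup_method() {
    awk '$1 == "rsh_command" { print $2; found = 1 }
         END { if (!found) print "unset" }'
}

# Usage on a submit or admin host (assumes qconf is in PATH):
#   qconf -sconf | startup_method
```

A result of `builtin` would mean the built-in startup is used; a path such 
as `/usr/bin/ssh` would mean the slave tasks are started via `ssh`.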

-- Reuti


> stop_proc_args     /opt/sge/mpi/stopmpi.sh
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
> 
> Thanks.
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

