Re: [gridengine users] SGE-6.2u5: Sudden death of all cluster jobs.

Reuti Mon, 14 Mar 2011 06:38:14 -0700

On 14.03.2011 at 12:45, Erik Soyez wrote:

> Good day,
> 
> one of our customers suffered an incident that I've never seen before.
> 
> On Friday night all jobs running on the cluster died within a few hours:
> ------------------------------------------------------------------------
> Application:  CFX
> Integration:  Tight
> ------------------------------------------------------------------------
> 
> Afterwards no new jobs could be submitted until all execds(!) had
> been restarted.  Unfortunately I could not look at the cluster myself
> when it happened, so I have to rely on log files etc., which do not
> seem to fit together.  Sorry for this lengthy email, but I need
> some hints to understand what is going on:
>


What is the setting of rsh_command/-daemon in SGE's configuration?
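
In case it helps, here is a minimal sketch for pulling those entries out of the configuration. It assumes `qconf` is in your PATH; the helper name `show_remote_startup` is made up for illustration only.

```shell
# Hypothetical helper: keep only the remote-startup entries
# (rsh/rlogin/qlogin command and daemon) from `qconf -sconf` style output
# read on stdin.
show_remote_startup() {
    grep -E '^(rsh|rlogin|qlogin)_(command|daemon)[[:space:]]'
}

# Usage on a live cluster (not executed here):
#   qconf -sconf | show_remote_startup              # global configuration
#   qconf -sconf <exec_host> | show_remote_startup  # host-local overrides
```

A host-local configuration can override the global one, so it may be worth checking one of the affected exec hosts as well.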

-- Reuti


> ------------------------------------------------------------------------
> In "qmaster/messages" (job 18690 did not even exist at that time):
> ------------------------------------------------------------------------
> 03/14/2011 08:28:54|worker|xxxxxxxxx1|E|unable to find job 18690 from the scheduler order package
> 03/14/2011 08:28:54|schedu|xxxxxxxxx1|E|unable to find job 18690 from the scheduler order package
> 
>> qstat -j 18690
> Following jobs do not exist:
> 18690
> 
>> qacct -j 18690
> error: job id 18690 not found
> 
>> cat /opt/sge/6.2u5/default/spool/qmaster/jobseqnum
> 18718
> ------------------------------------------------------------------------
> 
> 
> E-Mail:
> ------------------------------------------------------------------------
> GE 6.2u5: Job 17765 failed
> ------------------------------------------------------------------------
>               :
>               :
> failed assumedly before job:can not find an unused add_grp_id
> Shepherd pe_hostfile:
> xxxxxx208.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
> xxxxxx207.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
> xxxxxx205.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
> xxxxxx204.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
> ------------------------------------------------------------------------
> 
> But:
> ------------------------------------------------------------------------
> gid_range                    20000-20100
> ------------------------------------------------------------------------
> This should be more than enough, shouldn't it?
> 
> 
> 
> The application log files show some totally different error messages:
> 
> One outfile:
> ------------------------------------------------------------------------
> +--------------------------------------------------------------------+
> |                              Warning!                              |
> |                                                                    |
> | /opt/sge/6.2u5/mpi/rsh connection to host                          |
> | xxxxxx206.xxxxx.xxxxx.xxx produces the following output after the  |
> | output of the command:                                             |
> |                                                                    |
> |   TRUE                                                             |
> |                                                                    |
> | This may cause problems spawning parallel slaves.                  |
> +--------------------------------------------------------------------+
> ------------------------------------------------------------------------
> 
> Another outfile:
> ------------------------------------------------------------------------
> +--------------------------------------------------------------------+
> |                An error has occurred in cfx5solve:                 |
> |                                                                    |
> | Remote connection to xxxxxx216.xxxxx.xxxxx.xxx was terminated due  |
> | to a timeout.  It was interrupted by signal TERM (15)  It gave the |
> | following output:                                                  |
> |                                                                    |
> |    /opt/sge/6.2u5/bin/lx26-amd64/qrsh -inherit -nostdin xxxxxx216- |
> | .xxxxx.xxxxx.xxx echo TRUE                                         |
> |    error: got no connection within 60 seconds. "Timeout occured w- |
> | hile waiting for connection"                                       |
>                               :
>                               :
>                               :
> ------------------------------------------------------------------------
> 
> Any idea whether these are several minor problems or one major problem?
> 
> Many thanks!
> 
> Erik Soyez.
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

