Re: [gridengine users] SGE-6.2u5: Sudden death of all cluster jobs.

Reuti Tue, 15 Mar 2011 03:13:35 -0700

On 15.03.2011 at 09:53, Erik Soyez wrote:

> Thanks Reuti, here you go....
> ------------------------------------------------------------------------
> gid_range                    20000-20100
> qlogin_command               builtin
> qlogin_daemon                builtin
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin
> ------------------------------------------------------------------------


OK, this looks fine.


> ....furthermore:
> ------------------------------------------------------------------------
> export CFX5RSH='/opt/sge/6.2u5/mpi/rsh'
> ------------------------------------------------------------------------

Which parallel library does this application use in the end? Maybe a firewall
was switched on, blocking certain ports (were only the execds restarted, or
the complete nodes)? And did no filesystem fill up, which would prevent SGE
from creating temporary job information (on the nodes or on the qmaster)?
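
To rule out the full-filesystem case quickly, something like the following
could be run on the qmaster and on each node. It is only a sketch: the helper
name full_filesystems and the 95% threshold are my own choices, and the
spool path in the comment is the one quoted in this thread, so adjust it to
the local installation.

```shell
# Read `df -P` output on stdin and print every mount point whose usage
# is at or above the given percentage (default 95, an arbitrary choice).
# Assumes mount points without spaces in their names.
full_filesystems() {
    awk -v limit="${1:-95}" '
        NR > 1 {                      # skip the df header line
            pct = $5
            sub(/%/, "", pct)         # "99%" -> "99"
            if (pct + 0 >= limit) print $6, pct "%"
        }'
}

# Example call (not executed here):
#   df -P /tmp /opt/sge/6.2u5/default/spool | full_filesystems 95
```

No output means none of the listed filesystems is at or above the threshold.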

The "gid_range                    20000-20100" should be fine, as it's per host.
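
For reference, the range above holds 101 IDs. As far as I know, the execd
hands out one ID from this range per job (and per tightly integrated task)
running on that host, so this is the number to compare against the slots per
node. A small sketch to count the IDs in a gid_range value; the function name
count_gids is made up for illustration:

```shell
# Count the group IDs covered by a gid_range value, which is a
# comma-separated list of "m-n" ranges and/or single numbers.
count_gids() {
    total=0
    old_ifs=$IFS
    IFS=','
    for part in $1; do
        case $part in
            *-*)                       # a range "m-n", both ends inclusive
                lo=${part%-*}
                hi=${part#*-}
                total=$((total + hi - lo + 1))
                ;;
            *)                         # a single ID
                total=$((total + 1))
                ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

# Example call (not executed here):
#   count_gids "$(qconf -sconf | awk '$1 == "gid_range" { print $2 }')"
```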

-- Reuti


> Erik Soyez.
> 
> 
> On Mon, 14 Mar 2011, Reuti wrote:
> 
>> What is the setting of rsh_command/-daemon in SGE's configuration?
>> 
>> -- Reuti
> 
> 
>> On 14.03.2011 at 12:45, Erik Soyez wrote:
>> 
>>> Good day,
>>> 
>>> one of our customers suffered an incident that I've never seen before.
>>> 
>>> On Friday night all jobs running on the cluster died within a few hours:
>>> ------------------------------------------------------------------------
>>> Application:        CFX
>>> Integration:        Tight
>>> ------------------------------------------------------------------------
>>> 
>>> Afterwards no new jobs could be submitted until all execds(!) had
>>> been restarted.  Unfortunately I could not look at the cluster myself
>>> when it happened, so I have to rely on log files etc., which do not
>>> seem to fit together - sorry for this lengthy email, but I need
>>> some hint to understand what's going on:
>>> 
>>> ------------------------------------------------------------------------
>>> In "qmaster/messages" (job 18690 did not even exist at that time):
>>> ------------------------------------------------------------------------
>>> 03/14/2011 08:28:54|worker|xxxxxxxxx1|E|unable to find job 18690 from the scheduler order package
>>> 03/14/2011 08:28:54|schedu|xxxxxxxxx1|E|unable to find job 18690 from the scheduler order package
>>> 
>>>> qstat -j 18690
>>> Following jobs do not exist:
>>> 18690
>>> 
>>>> qacct -j 18690
>>> error: job id 18690 not found
>>> 
>>>> cat /opt/sge/6.2u5/default/spool/qmaster/jobseqnum
>>> 18718
>>> ------------------------------------------------------------------------
>>> 
>>> 
>>> E-Mail:
>>> ------------------------------------------------------------------------
>>> GE 6.2u5: Job 17765 failed
>>> ------------------------------------------------------------------------
>>>             :
>>>             :
>>> failed assumedly before job:can not find an unused add_grp_id
>>> Shepherd pe_hostfile:
>>> xxxxxx208.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
>>> xxxxxx207.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
>>> xxxxxx205.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
>>> xxxxxx204.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
>>> ------------------------------------------------------------------------
>>> 
>>> But:
>>> ------------------------------------------------------------------------
>>> gid_range                    20000-20100
>>> ------------------------------------------------------------------------
>>> This should be more than enough, shouldn't it?
>>> 
>>> 
>>> 
>>> The application log files show some totally different error messages:
>>> 
>>> One outfile:
>>> ------------------------------------------------------------------------
>>> +--------------------------------------------------------------------+
>>> |                              Warning!                              |
>>> |                                                                    |
>>> | /opt/sge/6.2u5/mpi/rsh connection to host                          |
>>> | xxxxxx206.xxxxx.xxxxx.xxx produces the following output after the  |
>>> | output of the command:                                             |
>>> |                                                                    |
>>> |   TRUE                                                             |
>>> |                                                                    |
>>> | This may cause problems spawning parallel slaves.                  |
>>> +--------------------------------------------------------------------+
>>> ------------------------------------------------------------------------
>>> 
>>> Another outfile:
>>> ------------------------------------------------------------------------
>>> +--------------------------------------------------------------------+
>>> |                An error has occurred in cfx5solve:                 |
>>> |                                                                    |
>>> | Remote connection to xxxxxx216.xxxxx.xxxxx.xxx was terminated due  |
>>> | to a timeout.  It was interrupted by signal TERM (15)  It gave the |
>>> | following output:                                                  |
>>> |                                                                    |
>>> |    /opt/sge/6.2u5/bin/lx26-amd64/qrsh -inherit -nostdin xxxxxx216- |
>>> | .xxxxx.xxxxx.xxx echo TRUE                                         |
>>> |    error: got no connection within 60 seconds. "Timeout occured w- |
>>> | hile waiting for connection"                                       |
>>>                             :
>>>                             :
>>>                             :
>>> ------------------------------------------------------------------------
>>> 
>>> Any ideas if these are different minor problems or one major problem?
> 
> 
> -- 
> Vorstand/Board of Management:
> Dr. Bernd Finkbeiner, Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech
> Vorsitzender des Aufsichtsrats/
> Chairman of the Supervisory Board:
> Michel Lepert
> Sitz/Registered Office: Tuebingen
> Registergericht/Registration Court: Stuttgart
> Registernummer/Commercial Register No.: HRB 382196 
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

