[gridengine users] SGE-6.2u5: Sudden death of all cluster jobs.

Erik Soyez Mon, 14 Mar 2011 04:46:35 -0700

Good day,

one of our customers suffered an incident that I've never seen before.


On Friday night all jobs running on the cluster died within a few hours:
------------------------------------------------------------------------
Application:    CFX
Integration:    Tight
------------------------------------------------------------------------

Afterwards no new jobs could be submitted until all execds(!) had been
restarted.  Unfortunately I could not look at the cluster myself when it
happened, so I have to rely on log files etc., which do not seem to fit
together. Sorry for this lengthy email, but I need some hints to
understand what's going on:


------------------------------------------------------------------------
In "qmaster/messages" (job 18690 did not even exist at that time):
------------------------------------------------------------------------
03/14/2011 08:28:54|worker|xxxxxxxxx1|E|unable to find job 18690 from the scheduler order package
03/14/2011 08:28:54|schedu|xxxxxxxxx1|E|unable to find job 18690 from the scheduler order package

qstat -j 18690

Following jobs do not exist:
18690

qacct -j 18690

error: job id 18690 not found

cat /opt/sge/6.2u5/default/spool/qmaster/jobseqnum

18718
------------------------------------------------------------------------
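
To line up the qmaster log with the other log files by time, the
pipe-separated entries above can be split into fields. This is only a
minimal sketch: the field names (timestamp, thread, host, level, message)
are my own labels for the five columns visible in the excerpt, and
parse_messages_line is a hypothetical helper, not part of Grid Engine.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    # Field names are assumptions based on the columns seen in the excerpt.
    timestamp: datetime
    thread: str
    host: str
    level: str
    message: str


def parse_messages_line(line: str) -> Optional[LogEntry]:
    """Split one 'date time|thread|host|level|message' line into fields.

    Returns None for lines that do not match, e.g. wrapped continuations.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, thread, host, level, message = parts
    try:
        timestamp = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return LogEntry(timestamp, thread, host, level, message.strip())


sample = ("03/14/2011 08:28:54|worker|xxxxxxxxx1|E|"
          "unable to find job 18690 from the scheduler order package")
entry = parse_messages_line(sample)
print(entry.level, entry.thread, entry.timestamp.isoformat())
```

Filtering the parsed entries on level "E" and on the time window of the
incident would show whether the qmaster logged anything while the jobs
were dying.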


E-Mail:
------------------------------------------------------------------------
GE 6.2u5: Job 17765 failed
------------------------------------------------------------------------
                :
                :
failed assumedly before job:can not find an unused add_grp_id
Shepherd pe_hostfile:
xxxxxx208.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx207.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx205.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx204.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
------------------------------------------------------------------------

But:
------------------------------------------------------------------------
gid_range                    20000-20100
------------------------------------------------------------------------
This should be more than enough, shouldn't it?
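
A quick sanity check of that range, as a sketch only. It assumes that each
job (and, with tight integration, each task started on a host) takes one
additional group ID from gid_range on that host; that is my reading of the
configuration and may not be complete. count_gids is a hypothetical helper,
and the figure of 4 slots per host is taken from the pe_hostfile above.

```python
def count_gids(gid_range: str) -> int:
    """Count the group IDs in a comma-separated list of 'n-m' ranges."""
    total = 0
    for part in gid_range.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
        else:
            low = high = int(part)  # a single ID rather than a range
        if high < low:
            raise ValueError(f"invalid range: {part!r}")
        total += high - low + 1
    return total


available = count_gids("20000-20100")  # value from the configuration above
slots_per_host = 4                     # from the pe_hostfile above
print(f"{available} additional group IDs for {slots_per_host} slots per host")
```

If the range is this much larger than the number of slots per host and the
error still appears, one possible explanation is that IDs are not being
released again, but the logs shown here do not confirm that.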



The application log files show some totally different error messages:

One outfile:
------------------------------------------------------------------------
 +--------------------------------------------------------------------+
 |                              Warning!                              |
 |                                                                    |
 | /opt/sge/6.2u5/mpi/rsh connection to host                          |
 | xxxxxx206.xxxxx.xxxxx.xxx produces the following output after the  |
 | output of the command:                                             |
 |                                                                    |
 |   TRUE                                                             |
 |                                                                    |
 | This may cause problems spawning parallel slaves.                  |
 +--------------------------------------------------------------------+
------------------------------------------------------------------------

Another outfile:
------------------------------------------------------------------------
 +--------------------------------------------------------------------+
 |                An error has occurred in cfx5solve:                 |
 |                                                                    |
 | Remote connection to xxxxxx216.xxxxx.xxxxx.xxx was terminated due  |
 | to a timeout.  It was interrupted by signal TERM (15)  It gave the |
 | following output:                                                  |
 |                                                                    |
 |    /opt/sge/6.2u5/bin/lx26-amd64/qrsh -inherit -nostdin xxxxxx216- |
 | .xxxxx.xxxxx.xxx echo TRUE                                         |
 |    error: got no connection within 60 seconds. "Timeout occured w- |
 | hile waiting for connection"                                       |
                                :
                                :
                                :
------------------------------------------------------------------------
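
The boxed cfx5solve output breaks long words with a trailing "-", which
makes it awkward to search the outfiles for host names or for the timeout
message. The following sketch rejoins such lines. It assumes that a "-" at
the end of a boxed line always marks a wrapped word, which will be wrong
for text that really ends in a hyphen. unwrap_box is a hypothetical helper.

```python
from typing import Iterable, List


def unwrap_box(lines: Iterable[str]) -> List[str]:
    """Strip the '|' borders of a boxed message and rejoin wrapped words."""
    logical: List[str] = []
    pending = ""
    for raw in lines:
        text = raw.strip()
        if not (text.startswith("|") and text.endswith("|")):
            continue  # skip '+----+' borders and anything outside the box
        content = text[1:-1].strip()
        if not content and not pending:
            continue  # skip empty spacer lines inside the box
        if content.endswith("-"):
            pending += content[:-1]  # assumed to be a wrapped word
            continue
        logical.append(pending + content)
        pending = ""
    if pending:
        logical.append(pending)
    return logical


box = [
    " |    /opt/sge/6.2u5/bin/lx26-amd64/qrsh -inherit -nostdin xxxxxx216- |",
    " | .xxxxx.xxxxx.xxx echo TRUE                                         |",
    ' |    error: got no connection within 60 seconds. "Timeout occured w- |',
    ' | hile waiting for connection"                                       |',
]
for line in unwrap_box(box):
    print(line)
```

With the lines rejoined, the failing host of each outfile can be collected
and compared against the hosts named in the failure e-mails.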

Any ideas whether these are several minor problems or one major problem?

Many thanks!

Erik Soyez.


