Good day,
one of our customers suffered an incident that I've never seen before.
On Friday night all jobs running on the cluster died within a few hours:
------------------------------------------------------------------------
Application: CFX
Integration: Tight
------------------------------------------------------------------------
Afterwards no new jobs could be submitted until all execds(!) had been
restarted. Unfortunately I could not look at the cluster myself when it
happened, so I have to rely on log files etc., which do not seem to fit
together. Sorry for this lengthy email, but I need some hints to
understand what is going on:
------------------------------------------------------------------------
In "qmaster/messages" (job 18690 did not even exist at that time):
------------------------------------------------------------------------
03/14/2011 08:28:54|worker|xxxxxxxxx1|E|unable to find job 18690 from the
scheduler order package
03/14/2011 08:28:54|schedu|xxxxxxxxx1|E|unable to find job 18690 from the
scheduler order package
qstat -j 18690
Following jobs do not exist:
18690
qacct -j 18690
error: job id 18690 not found
cat /opt/sge/6.2u5/default/spool/qmaster/jobseqnum
18718
------------------------------------------------------------------------
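To pull every error entry for one job out of "qmaster/messages", a small filter like the following might help. It is only a sketch: it assumes each entry sits on a single line in the pipe-separated layout shown above (date and time, thread, host, level, message), and the names LogEntry, parse_line and errors_for_job are made up for this example.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one messages line into its five fields, or return None."""
    # Split at most four times so that a '|' inside the message survives.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return LogEntry(*parts)


def errors_for_job(lines: Iterable[str], job_id: int) -> List[LogEntry]:
    """Return all level-E entries whose message mentions 'job <job_id>'."""
    pattern = re.compile(r"\bjob %d\b" % job_id)
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry and entry.level == "E" and pattern.search(entry.message):
            hits.append(entry)
    return hits
```

It could be used as errors_for_job(open(path_to_messages), 18690), where path_to_messages is the location of the qmaster messages file on the master host.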
E-Mail:
------------------------------------------------------------------------
GE 6.2u5: Job 17765 failed
------------------------------------------------------------------------
:
:
failed assumedly before job:can not find an unused add_grp_id
Shepherd pe_hostfile:
xxxxxx208.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx207.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx205.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx204.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
------------------------------------------------------------------------
But:
------------------------------------------------------------------------
gid_range 20000-20100
------------------------------------------------------------------------
This should be more than enough, shouldn't it?
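For reference, the range 20000-20100 contains 101 group IDs. As far as I understand it, each job or task running on a host holds one of them at a time, so the error would suggest that IDs from the range are still attached to leftover processes. A rough sketch for checking this on an execution host follows. It assumes Linux, where /proc/<pid>/status has a "Groups:" line, and a single "low-high" range as configured here. The function names are made up for this example.

```python
import glob
from typing import Iterable, Set


def parse_gid_range(spec: str) -> range:
    """Turn a single 'low-high' gid_range value into an inclusive range."""
    low, high = (int(part) for part in spec.strip().split("-", 1))
    if high < low:
        raise ValueError("invalid gid_range: %r" % spec)
    return range(low, high + 1)


def gids_in_use(status_texts: Iterable[str], gid_range: range) -> Set[int]:
    """Collect supplementary group IDs inside gid_range from status texts."""
    used = set()
    for text in status_texts:
        for line in text.splitlines():
            if line.startswith("Groups:"):
                for token in line.split()[1:]:
                    gid = int(token)
                    if gid in gid_range:
                        used.add(gid)
    return used


def read_proc_status_files() -> Iterable[str]:
    """Yield the contents of /proc/<pid>/status for all live processes."""
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as handle:
                yield handle.read()
        except OSError:
            # The process exited between listing and reading; skip it.
            continue
```

Running gids_in_use(read_proc_status_files(), parse_gid_range("20000-20100")) on a node with no running jobs should return an empty set. Any IDs it does report belong to processes that still carry an additional group ID.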
The application log files show some totally different error messages:
One outfile:
------------------------------------------------------------------------
+--------------------------------------------------------------------+
| Warning! |
| |
| /opt/sge/6.2u5/mpi/rsh connection to host |
| xxxxxx206.xxxxx.xxxxx.xxx produces the following output after the |
| output of the command: |
| |
| TRUE |
| |
| This may cause problems spawning parallel slaves. |
+--------------------------------------------------------------------+
------------------------------------------------------------------------
Another outfile:
------------------------------------------------------------------------
+--------------------------------------------------------------------+
| An error has occurred in cfx5solve: |
| |
| Remote connection to xxxxxx216.xxxxx.xxxxx.xxx was terminated due |
| to a timeout. It was interrupted by signal TERM (15) It gave the |
| following output: |
| |
| /opt/sge/6.2u5/bin/lx26-amd64/qrsh -inherit -nostdin xxxxxx216- |
| .xxxxx.xxxxx.xxx echo TRUE |
| error: got no connection within 60 seconds. "Timeout occured w- |
| hile waiting for connection" |
:
:
:
------------------------------------------------------------------------
Any idea whether these are several unrelated minor problems or one major
problem?
Many thanks!
Erik Soyez.