Re: [gridengine users] SGE-6.2u5: Sudden death of all cluster jobs.

Erik Soyez Tue, 15 Mar 2011 01:53:50 -0700

Thanks Reuti, here you go....
------------------------------------------------------------------------
gid_range                    20000-20100
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
------------------------------------------------------------------------


....furthermore:
------------------------------------------------------------------------
export CFX5RSH='/opt/sge/6.2u5/mpi/rsh'
------------------------------------------------------------------------

Erik Soyez.


On Mon, 14 Mar 2011, Reuti wrote:

What is the setting of rsh_command/-daemon in SGE's configuration?

-- Reuti

On 14.03.2011 at 12:45, Erik Soyez wrote:

Good day,

one of our customers suffered an incident that I've never seen before.

On Friday night all jobs running on the cluster died within a few hours:
------------------------------------------------------------------------
Application:    CFX
Integration:    Tight
------------------------------------------------------------------------

Afterwards no new jobs could be submitted until all execds(!) had been
restarted.  Unfortunately I could not look at the cluster myself while
it was happening, so I have to rely on log files etc., which do not
seem to fit together - sorry for this lengthy email, but I need some
hints to understand what's going on:

------------------------------------------------------------------------
In "qmaster/messages" (job 18690 did not even exist at that time):
------------------------------------------------------------------------
03/14/2011 08:28:54|worker|xxxxxxxxx1|E|unable to find job 18690 from the scheduler order package
03/14/2011 08:28:54|schedu|xxxxxxxxx1|E|unable to find job 18690 from the scheduler order package

qstat -j 18690

Following jobs do not exist:
18690

qacct -j 18690

error: job id 18690 not found

cat /opt/sge/6.2u5/default/spool/qmaster/jobseqnum

18718
------------------------------------------------------------------------
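
When cross-checking a job number against "qmaster/messages", it can help to split each entry into its pipe-separated fields first. The following is a minimal sketch, assuming every entry sits on one line and has the five fields seen in the excerpt above (timestamp, thread, host, level, text). The names parse_message_line and lines_for_job are hypothetical helpers, not part of SGE.

```python
import re
from typing import Iterable, List, NamedTuple


class MessageLine(NamedTuple):
    """One entry of a qmaster 'messages' file (field layout is an assumption)."""
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_message_line(line: str) -> MessageLine:
    """Split 'timestamp|thread|host|level|text' into named fields."""
    # Split on the first four '|' only, so a '|' inside the text is preserved.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected message format: {line!r}")
    return MessageLine(*parts)


def lines_for_job(lines: Iterable[str], job_id: int) -> List[MessageLine]:
    """Return the parsed entries whose text mentions 'job <job_id>' exactly."""
    # \b keeps job 18690 from also matching job 186901.
    pattern = re.compile(rf"\bjob {int(job_id)}\b")
    return [
        entry
        for entry in (parse_message_line(raw) for raw in lines)
        if pattern.search(entry.text)
    ]
```

Applied to the two entries above, lines_for_job(..., 18690) would return both, one from the "worker" thread and one from the "schedu" thread, each with level "E".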


E-Mail:
------------------------------------------------------------------------
GE 6.2u5: Job 17765 failed
------------------------------------------------------------------------
                :
                :
failed assumedly before job:can not find an unused add_grp_id
Shepherd pe_hostfile:
xxxxxx208.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx207.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx205.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx204.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
------------------------------------------------------------------------

But:
------------------------------------------------------------------------
gid_range                    20000-20100
------------------------------------------------------------------------
This should be more than enough, shouldn't it?
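
To put a number on that question, here is a minimal sketch that counts the group IDs a gid_range setting provides. It assumes the value is a comma-separated list of "low-high" ranges (accepting single values as well is an extra assumption of this sketch), and that each job or tightly integrated task running on a host occupies one ID from the range while it runs. gid_range_size is a hypothetical helper name, not an SGE tool.

```python
def gid_range_size(spec: str) -> int:
    """Count the group IDs covered by a gid_range string such as '20000-20100'."""
    total = 0
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
        else:
            # Assumption: a bare number stands for a range of one ID.
            low = high = int(part)
        if high < low:
            raise ValueError(f"descending range: {part!r}")
        # Both ends of the range are inclusive.
        total += high - low + 1
    return total


print(gid_range_size("20000-20100"))  # 101
```

Under those assumptions the range offers 101 IDs per host, compared with the 4 slots per host listed in the pe_hostfile above.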



The application log files show some totally different error messages:

One outfile:
------------------------------------------------------------------------
+--------------------------------------------------------------------+
|                              Warning!                              |
|                                                                    |
| /opt/sge/6.2u5/mpi/rsh connection to host                          |
| xxxxxx206.xxxxx.xxxxx.xxx produces the following output after the  |
| output of the command:                                             |
|                                                                    |
|   TRUE                                                             |
|                                                                    |
| This may cause problems spawning parallel slaves.                  |
+--------------------------------------------------------------------+
------------------------------------------------------------------------

Another outfile:
------------------------------------------------------------------------
+--------------------------------------------------------------------+
|                An error has occurred in cfx5solve:                 |
|                                                                    |
| Remote connection to xxxxxx216.xxxxx.xxxxx.xxx was terminated due  |
| to a timeout.  It was interrupted by signal TERM (15)  It gave the |
| following output:                                                  |
|                                                                    |
|    /opt/sge/6.2u5/bin/lx26-amd64/qrsh -inherit -nostdin xxxxxx216- |
| .xxxxx.xxxxx.xxx echo TRUE                                         |
|    error: got no connection within 60 seconds. "Timeout occured w- |
| hile waiting for connection"                                       |
                                :
                                :
                                :
------------------------------------------------------------------------

Any ideas whether these are several minor problems or one major problem?



--
Vorstand/Board of Management:

Dr. Bernd Finkbeiner, Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech

Vorsitzender des Aufsichtsrats/
Chairman of the Supervisory Board:
Michel Lepert
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart

Registernummer/Commercial Register No.: HRB 382196


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
