On 14.03.2011 at 12:45, Erik Soyez wrote:

> Good day,
>
> one of our customers suffered an incident that I've never seen before.
>
> On Friday night all jobs running on the cluster died within a few hours:
> ------------------------------------------------------------------------
> Application: CFX
> Integration: Tight
> ------------------------------------------------------------------------
>
> Afterwards no new jobs could be submitted, only after all execds(!) had
> been restarted. Unfortunately I could not have a look onto the cluster
> myself when it had happened, so have to rely on log files etc. which
> do not seem to fit together - sorry for this lengthy email, but I need
> some hint to understand what's going on:
>

What is the setting of rsh_command/-daemon in SGE's configuration?

-- Reuti

> ------------------------------------------------------------------------
> In "qmaster/messages" (job 18690 did not even exist at that time):
> ------------------------------------------------------------------------
> 03/14/2011 08:28:54|worker|xxxxxxxxx1|E|unable to find job 18690 from the
> scheduler order package
> 03/14/2011 08:28:54|schedu|xxxxxxxxx1|E|unable to find job 18690 from the
> scheduler order package
>
>> qstat -j 18690
> Following jobs do not exist:
> 18690
>
>> qacct -j 18690
> error: job id 18690 not found
>
>> cat /opt/sge/6.2u5/default/spool/qmaster/jobseqnum
> 18718
> ------------------------------------------------------------------------
>
>
> E-Mail:
> ------------------------------------------------------------------------
> GE 6.2u5: Job 17765 failed
> ------------------------------------------------------------------------
> :
> :
> failed assumedly before job:can not find an unused add_grp_id
> Shepherd pe_hostfile:
> xxxxxx208.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
> xxxxxx207.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
> xxxxxx205.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
> xxxxxx204.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
> ------------------------------------------------------------------------
>
> But:
> ------------------------------------------------------------------------
> gid_range 20000-20100
> ------------------------------------------------------------------------
> This should be more than enough, shouldn't it?
>
>
>
> The application log files show some totally different error messages:
>
> One outfile:
> ------------------------------------------------------------------------
> +--------------------------------------------------------------------+
> | Warning!                                                           |
> |                                                                    |
> | /opt/sge/6.2u5/mpi/rsh connection to host                          |
> | xxxxxx206.xxxxx.xxxxx.xxx produces the following output after the  |
> | output of the command:                                             |
> |                                                                    |
> | TRUE                                                               |
> |                                                                    |
> | This may cause problems spawning parallel slaves.                  |
> +--------------------------------------------------------------------+
> ------------------------------------------------------------------------
>
> Another outfile:
> ------------------------------------------------------------------------
> +--------------------------------------------------------------------+
> | An error has occurred in cfx5solve:                                |
> |                                                                    |
> | Remote connection to xxxxxx216.xxxxx.xxxxx.xxx was terminated due  |
> | to a timeout. It was interrupted by signal TERM (15) It gave the   |
> | following output:                                                  |
> |                                                                    |
> | /opt/sge/6.2u5/bin/lx26-amd64/qrsh -inherit -nostdin xxxxxx216-    |
> | .xxxxx.xxxxx.xxx echo TRUE                                         |
> | error: got no connection within 60 seconds. "Timeout occured w-    |
> | hile waiting for connection"                                       |
> :
> :
> :
> ------------------------------------------------------------------------
>
> Any ideas if these are different minor problems or one major problem?
>
> Many thanks!
>
> Erik Soyez.
>
> --
> Vorstand/Board of Management:
> Dr. Bernd Finkbeiner, Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech
> Vorsitzender des Aufsichtsrats/
> Chairman of the Supervisory Board:
> Michel Lepert
> Sitz/Registered Office: Tuebingen
> Registergericht/Registration Court: Stuttgart
> Registernummer/Commercial Register No.: HRB 382196
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
