On 15.03.2011 at 09:53, Erik Soyez wrote:

> Thanks Reuti, here you go....
> ------------------------------------------------------------------------
> gid_range                    20000-20100
> qlogin_command               builtin
> qlogin_daemon                builtin
> rlogin_command               builtin
> rlogin_daemon                builtin
> rsh_command                  builtin
> rsh_daemon                   builtin
> ------------------------------------------------------------------------

Ok, this looks fine.

> ....furthermore:
> ------------------------------------------------------------------------
> export CFX5RSH='/opt/sge/6.2u5/mpi/rsh'
> ------------------------------------------------------------------------

Which parallel library does this application use in the end? Maybe a firewall was switched on and is blocking certain ports (were only the execds restarted, or the complete node)? And did no filesystem run full, which would prevent SGE from creating temporary job information (on the nodes or the qmaster)?

The "gid_range 20000-20100" should be fine, as it's per host.

-- Reuti

> Erik Soyez.
>
>
> On Mon, 14 Mar 2011, Reuti wrote:
>
>> What is the setting of rsh_command/-daemon in SGE's configuration?
>>
>> -- Reuti
>
>
>> On 14.03.2011 at 12:45, Erik Soyez wrote:
>>
>>> Good day,
>>>
>>> one of our customers suffered an incident that I've never seen before.
>>>
>>> On Friday night all jobs running on the cluster died within a few hours:
>>> ------------------------------------------------------------------------
>>> Application: CFX
>>> Integration: Tight
>>> ------------------------------------------------------------------------
>>>
>>> Afterwards no new jobs could be submitted, only after all execds(!) had
>>> been restarted. Unfortunately I could not have a look onto the cluster
>>> myself when it had happened, so have to rely on log files etc.
>>> which do not seem to fit together - sorry for this lengthy email, but I
>>> need some hint to understand what's going on:
>>>
>>> ------------------------------------------------------------------------
>>> In "qmaster/messages" (job 18690 did not even exist at that time):
>>> ------------------------------------------------------------------------
>>> 03/14/2011 08:28:54|worker|xxxxxxxxx1|E|unable to find job 18690 from the scheduler order package
>>> 03/14/2011 08:28:54|schedu|xxxxxxxxx1|E|unable to find job 18690 from the scheduler order package
>>>
>>>> qstat -j 18690
>>> Following jobs do not exist:
>>> 18690
>>>
>>>> qacct -j 18690
>>> error: job id 18690 not found
>>>
>>>> cat /opt/sge/6.2u5/default/spool/qmaster/jobseqnum
>>> 18718
>>> ------------------------------------------------------------------------
>>>
>>>
>>> E-Mail:
>>> ------------------------------------------------------------------------
>>> GE 6.2u5: Job 17765 failed
>>> ------------------------------------------------------------------------
>>> :
>>> :
>>> failed assumedly before job:can not find an unused add_grp_id
>>> Shepherd pe_hostfile:
>>> xxxxxx208.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
>>> xxxxxx207.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
>>> xxxxxx205.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
>>> xxxxxx204.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
>>> ------------------------------------------------------------------------
>>>
>>> But:
>>> ------------------------------------------------------------------------
>>> gid_range                    20000-20100
>>> ------------------------------------------------------------------------
>>> This should be more than enough, shouldn't it?
>>>
>>>
>>>
>>> The application log files show some totally different error messages:
>>>
>>> One outfile:
>>> ------------------------------------------------------------------------
>>> +--------------------------------------------------------------------+
>>> | Warning!                                                           |
>>> |                                                                    |
>>> | /opt/sge/6.2u5/mpi/rsh connection to host                          |
>>> | xxxxxx206.xxxxx.xxxxx.xxx produces the following output after the  |
>>> | output of the command:                                             |
>>> |                                                                    |
>>> | TRUE                                                               |
>>> |                                                                    |
>>> | This may cause problems spawning parallel slaves.                  |
>>> +--------------------------------------------------------------------+
>>> ------------------------------------------------------------------------
>>>
>>> Another outfile:
>>> ------------------------------------------------------------------------
>>> +--------------------------------------------------------------------+
>>> | An error has occurred in cfx5solve:                                |
>>> |                                                                    |
>>> | Remote connection to xxxxxx216.xxxxx.xxxxx.xxx was terminated due |
>>> | to a timeout. It was interrupted by signal TERM (15) It gave the   |
>>> | following output:                                                  |
>>> |                                                                    |
>>> | /opt/sge/6.2u5/bin/lx26-amd64/qrsh -inherit -nostdin xxxxxx216-    |
>>> | .xxxxx.xxxxx.xxx echo TRUE                                         |
>>> | error: got no connection within 60 seconds. "Timeout occured w-    |
>>> | hile waiting for connection"                                       |
>>> :
>>> :
>>> :
>>> ------------------------------------------------------------------------
>>>
>>> Any ideas if these are different minor problems or one major problem?
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
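
As a rough cross-check of the two points Reuti raises in his reply (the size of gid_range per host, and whether a filesystem ran full), here is a minimal shell sketch. The helper names gid_range_size and fs_usage_percent are made up for illustration and are not part of SGE. The sketch assumes a single low-high range and, as far as I understand it, that each job or tightly integrated task running on a host takes one additional group id from that range.

```shell
#!/bin/sh
# Hypothetical helpers, not part of SGE.

# Count the ids in a single "low-high" gid_range value such as "20000-20100".
# Assumption: one range only, no comma-separated list.
gid_range_size() {
    low=${1%-*}     # text before the '-'
    high=${1#*-}    # text after the '-'
    echo $(( $high - $low + 1 ))
}

# Print the usage percentage (digits only) of the filesystem holding a path.
# df -P gives one POSIX-format line per filesystem; column 5 is the capacity.
fs_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

gid_range_size "20000-20100"    # prints 101
fs_usage_percent "/"            # usage of the root filesystem on this machine
```

With 4 slots per host in the pe_hostfile quoted earlier, 101 ids per host would leave plenty of headroom, which fits Reuti's remark. For the filesystem check, the path to pass would be the spool area on the qmaster and on each node, for example the /opt/sge/6.2u5/default/spool directory mentioned in the thread.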
