I am still looking into the problem, so far without results. The only progress I have made is finding that some jobs are executed correctly but the cluster loses track of them. For instance, from the qmaster messages:
12/03/2013 11:03:56|worker|headnode1|E|writing job finish information: can't locate queue "<unknown queue>"
12/03/2013 11:03:56|worker|headnode1|I|removing trigger to terminate job 418968.101
12/03/2013 11:03:56|worker|headnode1|W|job 418968.101 failed on host <unknown host> in recognizing job because: execd doesn't know this job

The qacct output for this job is:

qname        UNKNOWN
hostname     UNKNOWN
group        1405785
owner        mideag
project      NONE
department   defaultdepartment
jobname      rec-FULLPM-dev.TRx6-1x3000-1x2000-1x1000-1x1000-1x2000-E10.16.18.800.70.20000_body
jobnumber    418968
taskid       101
account      sge
priority     0
qsub_time    Thu Jan  1 01:00:00 1970
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        1
failed       21 : in recognizing job
exit_status  0
ru_wallclock 0
ru_utime     0.000
ru_stime     0.000
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined

Could it be a problem to have the qmaster spool directory mounted over NFS?

Excerpts from nserrano's message of 2013-11-29 13:25:41 +0100:
> Excerpts from Reuti's message of 2013-11-28 19:11:57 +0100:
> > On 27.11.2013 at 10:24, Nicolás Serrano Martínez-Santos wrote:
> > >
> > > Excerpts from Reuti's message of 2013-11-26 19:37:34 +0100:
> > >>
> > >> But the process is also gone from the node, and not in some
> > >> uninterruptible kernel sleep?
> > >
> > > It is gone.
> > >
> > >> What's in the script /scripts/sgeepilog.sh - anything that could hang?
> > >
> > > Please find it attached. However, the wait does not always return -1 in
> > > the epilog but sometimes also in the main script.
> > >
> > >> Are you using -notify and s_rt at the same time? At least for the CPU
> > >> time I spot 36000 as s_cpu, which I suggest removing. It has no direct
> > >> effect as you have an h_cpu in addition anyway. Having -notify and a
> > >> soft warning at the same time could result in a warning for the warning,
> > >> and the job is never killed but warned every 90 seconds or so. Maybe
> > >> something similar is happening when you have s_cpu and s_rt being
> > >> triggered almost at the same time.
> > >
> > > We are not using those two options. This is what the typical qstat of a
> > > process looks like:
> >
> > Good.
> > >
> > > ==============================================================
> > > job_number:                 294730
> > > exec_file:                  job_scripts/294730
> > > submission_time:            Tue Nov 19 17:30:06 2013
> > > owner:                      adgipas
> > > uid:                        3155
> > > group:                      20040059
> > > gid:                        3091
> > > sge_o_home:                 /h/adgipas
> > > sge_o_log_name:             adgipas
> > > sge_o_path:                 /home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
> > > sge_o_shell:                /bin/bash
> > > sge_o_workdir:              /home/adgipas/proves_cart
> > > sge_o_host:                 mainnode
> > > account:                    sge
> > > cwd:                        /h/adgipas/proves_cart
> > > reserve:                    y
> > > merge:                      y
> > > hard resource_list:         h_cpu=72000,h_rt=72000,h_vmem=5120M
> > > mail_list:                  [email protected]
> > > notify:                     FALSE
> > > job_name:                   cart_700.standard.triphoneme.train-em-MIX01-ITER1-estimate
> > > jobshare:                   0
> > > shell_list:                 NONE:/bin/bash
> > > env_list:                   PATH=/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
> > > script_file:                STDIN
> > > jid_predecessor_list (req): cart_700.standard.triphoneme.train-init
> > > jid_successor_list:         294731
> > > job-array tasks:            1-500:1
> > > usage 334:                  cpu=10:08:02, mem=182410.00000 GBs, io=0.00000, vmem=5.000G, maxvmem=5.000G
> > > scheduling info:            queue instance "[email protected]" dropped because it is disabled
> > >                             queue instance "[email protected]" dropped because it is disabled
> > >
> > > -----------------
> > >
> > > Another peculiarity of the cluster is that all jobs are submitted with
> > > -R y; could that also cause a problem? I read one of your mails,
> > >
> > > http://gridengine.org/pipermail/users/2012-October/005077.html
> > >
> > > but I don't think it is related to this problem.
> >
> > I think so too.
> >
> > Are these serial jobs (I see no PE requested, but Open MPI in the path)?
> > Does the called application do strange things like jumping out of the
> > process tree, so that it is no longer under SGE control?

> No, there are some parallel jobs that also suffer from this problem.

> > Is the spool directory local for each node or on some kind of NFS?

> The spool of the execution host is local, but the spool of the master host
> is mounted by NFS.

--
NiCo
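
P.S. In case it is useful, something like the commands below should show where
the two spool areas actually live and whether the qmaster one really sits on
NFS, plus whatever the node still remembers about task 418968.101 from the
messages above. This is only a sketch: it assumes $SGE_ROOT is set, the default
cell name, and the standard per-host messages file under the execd spool
directory.

  # Assumes $SGE_ROOT is set; fall back to the default cell name.
  CELL=${SGE_CELL:-default}

  # qmaster spool directory as recorded in the cell's bootstrap file
  grep qmaster_spool_dir "$SGE_ROOT/$CELL/common/bootstrap"

  # filesystem type behind it; "nfs" here would confirm the NFS-mounted qmaster spool
  df -T "$(awk '$1 == "qmaster_spool_dir" {print $2}' "$SGE_ROOT/$CELL/common/bootstrap")"

  # execd spool directory from the global configuration (local on our nodes)
  qconf -sconf | grep execd_spool_dir

  # run on the execution node itself: look for traces of the lost array task in
  # the local execd messages file (standard layout: execd_spool_dir/<host>/messages;
  # <host> may need to be the short hostname)
  grep 418968 "$(qconf -sconf | awk '$1 == "execd_spool_dir" {print $2}')/$(hostname)/messages"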
