Excerpts from Reuti's message of 2013-11-28 19:11:57 +0100:
> On 27.11.2013 at 10:24, Nicolás Serrano Martínez-Santos wrote:
> 
> > Excerpts from Reuti's message of 2013-11-26 19:37:34 +0100:
> >> 
> >> But the process is also gone from the node, and not in some
> >> uninterruptible kernel sleep?
> >> 
> > 
> > It is gone.
> > 
> >> 
> >> What's in the script /scripts/sgeepilog.sh - anything that could hang?
> >> 
> > 
> > Please find it attached. However, wait does not return -1 only in the
> > epilog; sometimes it happens in the main script as well.
> > 
> >> Are you using -notify and s_rt at the same time? At least for the CPU
> >> time I spot 36000 as s_cpu, which I suggest to remove. It has no direct
> >> effect as you have an h_cpu in addition anyway. Having -notify and a
> >> soft warning at the same time could result in a warning for the
> >> warning, and the job is never killed but warned every 90 seconds or
> >> so. Maybe something similar is happening when you have s_cpu and s_rt
> >> being triggered almost at the same time.
> >> 
> > 
> > We are not using those two options. This is what the typical qstat
> > output of a job looks like:
> 
> Good.
> 
> > ==============================================================
> > job_number:                 294730
> > exec_file:                  job_scripts/294730
> > submission_time:            Tue Nov 19 17:30:06 2013
> > owner:                      adgipas
> > uid:                        3155
> > group:                      20040059
> > gid:                        3091
> > sge_o_home:                 /h/adgipas
> > sge_o_log_name:             adgipas
> > sge_o_path:                 /home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
> > sge_o_shell:                /bin/bash
> > sge_o_workdir:              /home/adgipas/proves_cart
> > sge_o_host:                 mainnode
> > account:                    sge
> > cwd:                        /h/adgipas/proves_cart
> > reserve:                    y
> > merge:                      y
> > hard resource_list:         h_cpu=72000,h_rt=72000,h_vmem=5120M
> > mail_list:                  [email protected]
> > notify:                     FALSE
> > job_name:                   cart_700.standard.triphoneme.train-em-MIX01-ITER1-estimate
> > jobshare:                   0
> > shell_list:                 NONE:/bin/bash
> > env_list:                   PATH=/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
> > script_file:                STDIN
> > jid_predecessor_list (req): cart_700.standard.triphoneme.train-init
> > jid_successor_list:         294731
> > job-array tasks:            1-500:1
> > usage  334:                 cpu=10:08:02, mem=182410.00000 GBs, io=0.00000, vmem=5.000G, maxvmem=5.000G
> > scheduling info:            queue instance "[email protected]" dropped because it is disabled
> >                             queue instance "[email protected]" dropped because it is disabled
> > 
> > -----------------
> > 
> > Another peculiarity of the cluster is that all jobs are submitted with
> > -R y; could this also cause any problem? I read one of your mails,
> > 
> > http://gridengine.org/pipermail/users/2012-October/005077.html
> > 
> > but I don't think it is related to this problem.
> 
> I think so too.
> 
> Are these serial jobs (I see no PE requested, but Open MPI is in the
> path)? Does the called application do strange things like jumping out
> of the process tree so that it is no longer under SGE control?

No, there are some parallel jobs that also suffer from this problem.
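(Regarding the wait mentioned above: the relevant handling is
essentially the pattern below. This is a simplified sketch with made-up
names, not the literal script.)

#!/bin/bash
# Start the payload in the background and wait for it, so the
# wrapper can log how it ended.
./long_task &          # made-up payload name
pid=$!

wait "$pid"
status=$?

if [ "$status" -gt 128 ]; then
    # The child died from a signal: status = 128 + signal number,
    # e.g. 137 = 128 + 9 (SIGKILL), which is what arrives when
    # h_rt or h_cpu is exceeded.
    echo "payload killed by signal $((status - 128))" >&2
else
    echo "payload exited with status $status" >&2
fi
exit "$status"

So a status above 128, whether seen in the main script or in the
epilog, points to the payload being signalled rather than exiting on
its own.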

> Is the spool directory local for each node or on some kind of NFS?

The spool of the execution hosts is local, but the spool of the master
host is mounted via NFS.

-- 
NiCo
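P.S. In case it helps: a submission matching the qstat output above
would be roughly the following (the job script is fed on stdin, as
script_file is STDIN; this is reconstructed from the listed fields,
not the exact command we run):

qsub -cwd -R y -j y -S /bin/bash \
     -l h_cpu=72000,h_rt=72000,h_vmem=5120M \
     -t 1-500 \
     -hold_jid cart_700.standard.triphoneme.train-init \
     -N cart_700.standard.triphoneme.train-em-MIX01-ITER1-estimate < script.sh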
