On 27.11.2013 at 10:24, Nicolás Serrano Martínez-Santos wrote:
> Excerpts from Reuti's message of 2013-11-26 19:37:34 +0100:
>>
>> But the process is also gone from the node, and not in some uninterruptible
>> kernel sleep?
>>
>
> It is gone.
>
>>
>> What's in the script /scripts/sgeepilog.sh - anything that could hang?
>>
>
> Please find it attached. However, the wait does not always return -1 in the
> epilog but sometimes also in the main script.
>
>> Are you using -notify and s_rt at the same time? At least for the CPU time I
>> spot 36000 as s_cpu, which I suggest you remove. It has no direct effect, as
>> you have an h_cpu in addition anyway. Having -notify and a soft warning at
>> the same time could result in a warning for the warning, and the job is
>> never killed but warned every 90 seconds or so. Maybe something similar is
>> happening when you have s_cpu and s_rt being triggered almost at the same
>> time.
>>
>
> We are not using those two options. This is what the typical qstat of a
> job looks like:
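As an aside on the "-1" from wait: in POSIX shells, `wait` on a child that was killed by a signal returns 128 plus the signal number, so the value seen in the epilog log may simply be such a status mangled by whatever prints it. A minimal sketch, using SIGTERM only as a stand-in for whatever signal SGE actually delivers (e.g. SIGUSR1/SIGUSR2 with -notify, or SIGKILL when h_rt/h_cpu is exceeded):

```shell
#!/bin/sh
# Sketch: the exit status `wait` reports when a child dies from a signal.
# SIGTERM here is a stand-in for the signal the scheduler would send.
sleep 60 &
pid=$!
kill -TERM "$pid"
wait "$pid"
status=$?
# POSIX shells encode "killed by signal N" as 128 + N; SIGTERM is 15.
echo "wait returned $status"   # prints "wait returned 143", not -1
```

So a status of 143 (or 137 for SIGKILL) in the epilog would point at the job being signalled rather than exiting on its own.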
Good.

> ==============================================================
> job_number:                 294730
> exec_file:                  job_scripts/294730
> submission_time:            Tue Nov 19 17:30:06 2013
> owner:                      adgipas
> uid:                        3155
> group:                      20040059
> gid:                        3091
> sge_o_home:                 /h/adgipas
> sge_o_log_name:             adgipas
> sge_o_path:                 /home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
> sge_o_shell:                /bin/bash
> sge_o_workdir:              /home/adgipas/proves_cart
> sge_o_host:                 mainnode
> account:                    sge
> cwd:                        /h/adgipas/proves_cart
> reserve:                    y
> merge:                      y
> hard resource_list:         h_cpu=72000,h_rt=72000,h_vmem=5120M
> mail_list:                  [email protected]
> notify:                     FALSE
> job_name:                   cart_700.standard.triphoneme.train-em-MIX01-ITER1-estimate
> jobshare:                   0
> shell_list:                 NONE:/bin/bash
> env_list:                   PATH=/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
> script_file:                STDIN
> jid_predecessor_list (req): cart_700.standard.triphoneme.train-init
> jid_successor_list:         294731
> job-array tasks:            1-500:1
> usage 334:                  cpu=10:08:02, mem=182410.00000 GBs, io=0.00000, vmem=5.000G, maxvmem=5.000G
> scheduling info:            queue instance "[email protected]" dropped because it is disabled
>                             queue instance "[email protected]" dropped because it is disabled
>
> -----------------
>
> Another peculiarity of the cluster is that all jobs are submitted with
> -R y; could that also cause any problem?
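Regarding -R y: as far as I know, a reservation request only takes effect when the scheduler configuration permits reservations at all (max_reservation greater than 0); otherwise it is a no-op, though with many small jobs holding reservations scheduling can slow down. A quick way to check, assuming access to the SGE admin tools (this queries the running qmaster, so it is not runnable outside the cluster):

```shell
# Show whether resource reservation is enabled in the scheduler
# configuration; max_reservation is the number of jobs allowed to
# reserve resources, default_duration the assumed runtime for jobs
# without an h_rt request.
qconf -ssconf | grep -E 'max_reservation|default_duration'
```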
> I read in one of your mails
>
> http://gridengine.org/pipermail/users/2012-October/005077.html
>
> but I don't think it is related to this problem.

I think so too. Are these serial jobs? (I see no PE requested, but Open MPI is in the path.) Does the called application do strange things like jumping out of the process tree, so that its children are no longer under SGE control? Is the spool directory local on each node, or on some kind of NFS?

-- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
