Excerpts from Reuti's message of 2013-11-28 19:11:57 +0100:
> Am 27.11.2013 um 10:24 schrieb Nicolás Serrano Martínez-Santos:
> 
> > Excerpts from Reuti's message of 2013-11-26 19:37:34 +0100:
> >> 
> >> But the process is also gone from the node, and not in some 
> >> uninterruptible kernel sleep?
> >> 
> > 
> > It is gone.
> > 
> >> 
> >> What's in the script /scripts/sgeepilog.sh - anything that could hang?
> >> 
> > 
> > Please find it attached. However, the wait does not always return -1 in the
> > epilog; sometimes it happens in the main script as well. 
> > 
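
(For anyone following along: a minimal sketch of how the status reported by
wait can be captured for debugging in a bash script; the command name and log
path are illustrative, not our exact epilog.)

    # run the payload in the background, reap it, and log what wait reports
    payload_command &
    pid=$!
    wait "$pid"
    rc=$?
    echo "$(date) wait(${pid}) exited with status ${rc}" >> /tmp/epilog-wait.log
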
> >> Are you using -notify and s_rt at the same time? At least for the CPU time 
> >> I spot 36000 as s_cpu, which I suggest removing. It has no direct effect 
> >> as you have an h_cpu in addition anyway. Having -notify and a soft warning 
> >> at the same time could result in a warning for the warning, so the job is 
> >> never killed but only warned every 90 seconds or so. Maybe something similar 
> >> is happening when you have s_cpu and s_rt being triggered at almost the same 
> >> time.
> >> 
> > 
> > We are not using those two options. This is what the typical qstat -j 
> > output for a job looks like:
> 
> Good.
> 
> > ==============================================================
> > job_number:                 294730
> > exec_file:                  job_scripts/294730
> > submission_time:            Tue Nov 19 17:30:06 2013
> > owner:                      adgipas
> > uid:                        3155
> > group:                      20040059
> > gid:                        3091
> > sge_o_home:                 /h/adgipas
> > sge_o_log_name:             adgipas
> > sge_o_path:                 /home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
> > sge_o_shell:                /bin/bash
> > sge_o_workdir:              /home/adgipas/proves_cart
> > sge_o_host:                 mainnode
> > account:                    sge
> > cwd:                        /h/adgipas/proves_cart
> > reserve:                    y
> > merge:                      y
> > hard resource_list:         h_cpu=72000,h_rt=72000,h_vmem=5120M
> > mail_list:                  [email protected]
> > notify:                     FALSE
> > job_name:                   cart_700.standard.triphoneme.train-em-MIX01-ITER1-estimate
> > jobshare:                   0
> > shell_list:                 NONE:/bin/bash
> > env_list:                   PATH=/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
> > script_file:                STDIN
> > jid_predecessor_list (req):  cart_700.standard.triphoneme.train-init
> > jid_successor_list:          294731
> > job-array tasks:            1-500:1
> > usage  334:                 cpu=10:08:02, mem=182410.00000 GBs, io=0.00000, vmem=5.000G, maxvmem=5.000G
> > scheduling info:            queue instance "[email protected]" dropped because it is disabled
> >                             queue instance "[email protected]" dropped because it is disabled
> > 
> > -----------------
> > 
> > Another peculiarity of the cluster is that all jobs are submitted 
> > with -R y; could that also cause a problem? I read in one of your mails
> > 
> > http://gridengine.org/pipermail/users/2012-October/005077.html
> > 
> > but I don't think it is related to this problem.
> 
> I think so too.
> 
> Are these serial jobs? (I see no PE requested, but Open MPI is in the path.) 
> Does the called application do strange things like jumping out of the process 
> tree, so that its processes are no longer under SGE control?

No, there are some parallel jobs that also suffer from this problem.
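
(While a job is running, the process tree on the execution host can be
inspected to see whether anything has detached from the shepherd; a minimal
sketch using GNU ps, nothing SGE-specific:)

    # show the process tree with group/session ids;
    # job processes should appear under sge_shepherd
    ps -eo pid,ppid,pgid,sid,args --forest
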

> 
> Is the spool directory local for each node or on some kind of NFS?
>

The spool of the execution hosts is local, but the spool of the master host is 
mounted via NFS.
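
For reference, a quick way to verify which filesystem backs a spool directory
on a host (the path assumes the common layout under $SGE_ROOT with cell
"default"; adjust it to your installation):

    # "nfs" in the Type column means NFS-mounted; ext4/xfs etc. means local
    df -PT "$SGE_ROOT/default/spool/$(hostname -s)"
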

-- 
NiCo