I am still looking into the problem, so far without results.

The only progress I have made so far is that I have found that some jobs are
executed correctly, but the cluster loses track of them. For instance, from the
qmaster messages:

12/03/2013 11:03:56|worker|headnode1|E|writing job finish information: can't locate queue "<unknown queue>"
12/03/2013 11:03:56|worker|headnode1|I|removing trigger to terminate job 418968.101
12/03/2013 11:03:56|worker|headnode1|W|job 418968.101 failed on host <unknown host> in recognizing job because: execd doesn't know this job
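
For reference, this is roughly how I am trying to cross-check a lost task on both sides (the paths assume a default $SGE_ROOT layout, "node042" is just a placeholder for whichever node the task was dispatched to, and since the execd spool here is local the second path will probably differ on this cluster):

# qmaster side: every message mentioning the job
grep 418968 $SGE_ROOT/default/spool/qmaster/messages

# execd side: does the node's messages file know the task at all?
ssh node042 grep 418968 $SGE_ROOT/default/spool/node042/messages

If the execd messages file has no trace of the task, that would at least tell me whether the job ever reached the node.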

The qacct output for this job is:

qname        UNKNOWN             
hostname     UNKNOWN             
group        1405785             
owner        mideag              
project      NONE                
department   defaultdepartment   
jobname      rec-FULLPM-dev.TRx6-1x3000-1x2000-1x1000-1x1000-1x2000-E10.16.18.800.70.20000_body
jobnumber    418968              
taskid       101                 
account      sge                 
priority     0                   
qsub_time    Thu Jan  1 01:00:00 1970
start_time   -/-
end_time     -/-
granted_pe   NONE                
slots        1                   
failed       21  : in recognizing job
exit_status  0                   
ru_wallclock 0            
ru_utime     0.000        
ru_stime     0.000        
ru_maxrss    0                   
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    0                   
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     0                   
ru_nivcsw    0                   
cpu          0.000        
mem          0.000             
io           0.000             
iow          0.000             
maxvmem      0.000
arid         undefined
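
One way to see how many jobs end up like this would be to list the failed=21 entries directly in the accounting file. This assumes the classic colon-separated accounting format and, if I read accounting(5) correctly, that field 6 is the job number, field 5 the job name and field 12 the failed code:

# job number and name of every accounting record with failed code 21
awk -F: '$12 == 21 {print $6, $5}' $SGE_ROOT/default/common/accounting | sort -u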

Could it be a problem to have the qmaster spool directory mounted via NFS?
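
In case it matters, this is the kind of check I can run on the spooling setup (assuming the usual cell name "default"):

# spooling method (classic vs. berkeleydb) and qmaster spool location
cat $SGE_ROOT/default/common/bootstrap

# how that directory is actually mounted
mount | grep -i nfs

I ask because, as far as I know, Berkeley DB spooling is sensitive to NFS locking, while classic spooling should be less picky about it.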

Excerpts from nserrano's message of 2013-11-29 13:25:41 +0100:
> Excerpts from Reuti's message of 2013-11-28 19:11:57 +0100:
> > On 27.11.2013, at 10:24, Nicolás Serrano Martínez-Santos wrote:
> > 
> > > Excerpts from Reuti's message of 2013-11-26 19:37:34 +0100:
> > >> 
> > >> But the process is also gone from the node, and not in some 
> > >> uninterruptible kernel sleep?
> > >> 
> > > 
> > > It is gone.
> > > 
> > >> 
> > >> What's in the script: /scripts/sgeepilog.sh - anything what could hang?
> > >> 
> > > 
> > > Please find it attached. However, the wait does not always return -1 in 
> > > the
> > > epilog but sometimes also in the main script. 
> > > 
> > >> Are you using -notify and s_rt at the same time? At least for the CPU 
> > >> time I spot 36000 as s_cpu which I suggest to remove. It has no direct 
> > >> effect as you have a h_cpu in addition anyway. Having -notify and a soft 
> > >> warning at the same time could result in a warning for the warning and 
> > >> the job is never killed but warned every 90 seconds or so. Maybe 
> > >> something similar is happening when you have s_cpu and s_rt being 
> > >> triggered almost at the same time.
> > >> 
> > > 
> > > We are not using those two options. This is what the typical qstat output
> > > of a job looks like:
> > 
> > Good.
> > 
> > > ==============================================================
> > > job_number:                 294730
> > > exec_file:                  job_scripts/294730
> > > submission_time:            Tue Nov 19 17:30:06 2013
> > > owner:                      adgipas
> > > uid:                        3155
> > > group:                      20040059
> > > gid:                        3091
> > > sge_o_home:                 /h/adgipas
> > > sge_o_log_name:             adgipas
> > > sge_o_path:                 
> > > /home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
> > > sge_o_shell:                /bin/bash
> > > sge_o_workdir:              /home/adgipas/proves_cart
> > > sge_o_host:                 mainnode
> > > account:                    sge
> > > cwd:                        /h/adgipas/proves_cart
> > > reserve:                    y
> > > merge:                      y
> > > hard resource_list:         h_cpu=72000,h_rt=72000,h_vmem=5120M
> > > mail_list:                  [email protected]
> > > notify:                     FALSE
> > > job_name:                   
> > > cart_700.standard.triphoneme.train-em-MIX01-ITER1-estimate
> > > jobshare:                   0
> > > shell_list:                 NONE:/bin/bash
> > > env_list:                   
> > > PATH=/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
> > > script_file:                STDIN
> > > jid_predecessor_list (req): cart_700.standard.triphoneme.train-init
> > > jid_successor_list:         294731
> > > job-array tasks:            1-500:1
> > > usage  334:                 cpu=10:08:02, mem=182410.00000 GBs, io=0.00000, vmem=5.000G, maxvmem=5.000G
> > > scheduling info:            queue instance "[email protected]" dropped because it is disabled
> > >                             queue instance "[email protected]" dropped because it is disabled
> > > 
> > > -----------------
> > > 
> > > Another peculiarity of the cluster is that all jobs are submitted
> > > with -R y; could that also cause any problem? I read in one of your mails
> > > 
> > > http://gridengine.org/pipermail/users/2012-October/005077.html
> > > 
> > > but I don't think it is related to this problem.
> > 
> > I think so too.
> > 
> > Are these serial jobs (I see no PE requested, but Open MPI in the path)?
> > Does the called application do strange things like jumping out of the
> > process tree so that it is no longer under SGE control?
> 
> No, there are some parallel jobs that also suffer from this problem.
> 
> > 
> > Is the spool directory local for each node on some kind of NFS?
> >
> 
> The spool of the execution hosts is local, but the spool of the master host
> is mounted via NFS.
> 
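
Regarding the wait that sometimes returns -1 (mentioned above): one thing I may try next is logging the child PID and the status that wait actually sees, both in the main script and in the epilog. Just a sketch, assuming the payload is started in the background; the variable names and the log path are made up:

# hypothetical debug logging around the existing wait call
child_pid=$!
wait "$child_pid"
rc=$?
echo "$(date '+%F %T') job=$JOB_ID.$SGE_TASK_ID pid=$child_pid wait_rc=$rc" \
  >> /tmp/sge_wait_debug.$JOB_ID.log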

-- 
NiCo