On 29.08.2013 at 09:09, Guillermo Marco Puche wrote:

> That's the config of one of my queues. I think many things will be badly
> configured regarding thresholds.
>
> qconf -sq shudra.q
> qname                 shudra.q
> hostlist              @allhosts
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE,[compute-0-2.local=load_avg=8,swap_used=12G], \
>                       [compute-0-1.local=load_avg=8,swap_used=12G], \
>                       [compute-0-3.local=load_avg=8,swap_used=12G], \
>                       [compute-0-5.local=load_avg=8,swap_used=12G], \
>                       [compute-0-4.local=load_avg=8,swap_used=12G], \
>                       [compute-0-0.local=load_avg=8,swap_used=12G]
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make mpi mpich orte smp
> rerun                 TRUE
> slots                 0,[compute-0-2.local=8],[compute-0-3.local=8], \
>                       [compute-0-1.local=8],[compute-0-5.local=8], \
>                       [compute-0-4.local=8],[compute-0-0.local=8]
> tmpdir                /tmp
> shell                 /bin/csh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        mem_free=30G

Do you have more than one queue on these machines? Usually (in fact almost always) such machine-dependent settings are made on the exec host with `qconf -me compute-0-1`, since they are a property of the node. With only one queue per machine, setting them in the queue works too.

Do you request mem_free in the job submission, and did you make it consumable? Otherwise setting it here has no effect: it is never consumed by the jobs and never checked.

-- Reuti

> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
>
>
> On 08/29/2013 08:36 AM, Guillermo Marco Puche wrote:
>> On 08/28/2013 05:57 PM, Dave Love wrote:
>>> Reuti <[email protected]> writes:
>>>
>>>>>>> • Job comes back to R status.
>>>>>>>
>>>>>> Do you use any checkpointing interface to restart the job? If so, it
>>>>>> should output "Rr" in `qstat` instead of a plain "R" for the SGE job
>>>>>> state.
>>>>>>
>>>>> No, I don't use any checkpointing interface.
>>>>>
>>>> Then the state should be "r".
>>>>
>>> There are some conditions (errors in prolog or pe_starter, I think)
>>> which can cause rescheduling (state Rr), but certainly plain R shouldn't
>>> happen (see sge_status(5) in the current man pages via the URL below).
>>>
>> Thank you Dave, I'm going to take a look at this right now.
>>
>> Maybe the problem is in my threshold configuration. I had to set thresholds
>> on all the compute nodes. This is because sometimes compute nodes in my
>> Rocks cluster went down due to memory usage (using all memory + swap).
>>
>> I would really appreciate a link if there's any specific configuration
>> manual on how to set thresholds correctly. Maybe I won't experience this
>> weird behavior with Java jobs and will get better performance overall.
>>
>> Thank you very much.
>>
>> Best regards,
>> Guillermo.
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>
> --
> Guillermo Marco Puche
>
> Bioinformatician, Computer Science Engineer.
> Sistemas Genómicos S.L.
> Phone: +34 902 364 669
> Fax: +34 902 364 670
> www.sistemasgenomicos.com
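
To make the mem_free remark concrete, here is a rough sketch of the three pieces that have to fit together. It assumes the stock complex definition and reuses the 30G value and the host name from this thread; the 4G request and the script name job.sh are placeholders only, not a recommendation for this cluster.

```
# qconf -mc   (complex definition: set the "consumable" column to YES)
#name      shortcut  type    relop  requestable  consumable  default  urgency
mem_free   mf        MEMORY  <=     YES          YES         0        0

# qconf -me compute-0-1   (exec host: the capacity the scheduler books against)
complex_values   mem_free=30G

# job submission: only a request like this actually consumes the resource
qsub -l mem_free=4G job.sh
```

With a consumable defined as YES, the request is counted per slot for parallel jobs, so the value should be chosen with the slot count of the PE request in mind.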
