On 29.08.2013, at 09:09, Guillermo Marco Puche wrote:

> That's one of my queue configs. I think there are probably many things badly
> configured regarding the thresholds.
> 
> qconf -sq shudra.q
> qname                 shudra.q
> hostlist              @allhosts
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE,[compute-0-2.local=load_avg=8,swap_used=12G], \
>                       [compute-0-1.local=load_avg=8,swap_used=12G], \
>                       [compute-0-3.local=load_avg=8,swap_used=12G], \
>                       [compute-0-5.local=load_avg=8,swap_used=12G], \
>                       [compute-0-4.local=load_avg=8,swap_used=12G], \
>                       [compute-0-0.local=load_avg=8,swap_used=12G]
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make mpi mpich orte smp
> rerun                 TRUE
> slots                 0,[compute-0-2.local=8],[compute-0-3.local=8], \
>                       [compute-0-1.local=8],[compute-0-5.local=8], \
>                       [compute-0-4.local=8],[compute-0-0.local=8]
> tmpdir                /tmp
> shell                 /bin/csh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        mem_free=30G

Do you have more than one queue on these machines? Often (that is: almost always)
these machine-dependent settings are made in `qconf -me compute-0-1`, as they are
a feature of the node (with just one queue it works in this case too).
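
For example, a sketch of what the exec host entry could then look like (the
values are illustrative, carried over from your queue config above):

   $ qconf -me compute-0-1
   hostname              compute-0-1.local
   load_scaling          NONE
   complex_values        mem_free=30G
   ...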

Do you request mem_free in the job submission, and did you make it consumable?
Otherwise setting it here has no influence - it's never consumed by the jobs and
never tested.
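
A minimal sketch of the two steps (assuming the stock mem_free complex - the
exact columns can be checked with `qconf -sc` - and job.sh is just a
placeholder):

   # 1. mark mem_free as consumable in the complex configuration
   $ qconf -mc
   #name     shortcut  type    relop  requestable  consumable  default  urgency
   mem_free  mf        MEMORY  <=     YES          YES         0        0

   # 2. request it at submission time, so it is actually consumed and tested
   $ qsub -l mem_free=4G job.sh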

-- Reuti


> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
> 
> 
> On 08/29/2013 08:36 AM, Guillermo Marco Puche wrote:
>> On 08/28/2013 05:57 PM, Dave Love wrote:
>>> Reuti <[email protected]> writes:
>>> 
>>> 
>>>>>>>         • Job comes back to R status.
>>>>>>> 
>>>>>> Do you use any checkpointing interface to restart the job? If so, it
>>>>>> should output "Rr" in `qstat` instead of a plain "R" for the SGE job
>>>>>> state.
>>>>>> 
>>>>> No, I don't use any checkpointing interface.
>>>>> 
>>>> Then the state should be "r".
>>>> 
>>> There are some conditions (errors in prolog or pe_starter, I think)
>>> which can cause rescheduling (state Rr), but certainly plain R shouldn't
>>> happen (see sge_status(5) in the current man pages via the URL below).
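>>> 
>>> For illustration (hypothetical job data; the "Rr" in the state column is
>>> how a rescheduled job shows up):
>>> 
>>>   $ qstat
>>>   job-ID  prior    name  user       state  submit/start at      queue                 slots
>>>   42      0.55500  test  guillermo  Rr     08/29/2013 09:00:00  shudra.q@compute-0-0  1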
>>> 
>>> 
>> Thank you Dave, I'm going to take a look at this right now.
>> 
>> Maybe the problem is in my thresholds configuration. I had to set thresholds
>> on all the compute nodes, because sometimes compute nodes in my Rocks cluster
>> went down due to memory usage (using all memory + swap).
>> 
>> I would really appreciate a link to a specific configuration manual on how to
>> set thresholds correctly. Maybe then I won't experience this weird behavior
>> with Java jobs and will get better overall performance.
>> 
>> 
>> Thank you very much.
>> 
>> Best regards,
>> Guillermo.
>> 
> 
> 
> -- 
> Guillermo Marco Puche
> 
> Bioinformatician, Computer Science Engineer.
> Sistemas Genómicos S.L.
> Phone: +34 902 364 669
> Fax: +34 902 364 670
> www.sistemasgenomicos.com


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
