Re: [gridengine users] Strange SGE PE issue (threaded PE with 999 slots but scheduler thinks the value is 0)

2020-06-11 Thread Reuti


On 11.06.2020 at 22:44, Chris Dagdigian wrote:

> 
> The root cause was strange so it's worth documenting here ...
> 
> I had created a new consumable and requestable resource called "gpu" 
> configured like this:
> 
> gpu    gpu    INT    <=    YES    YES    NONE    0
> 
> And on host A I had set "complex_values gpu=1" and on host B I set 
> "complex_values gpu=2" etc. etc. across the cluster. 
> 
> My mistake was setting the default value of the new complex entry to "NONE" 
> instead of "0", which is what you probably want when the attribute is of type 
> INT.
> 
> But this was bizarre: basically I had a bad default value for a requestable 
> resource, and as soon as we set that value at the execution host level it 
> instantly broke all of our parallel environments.  The SGE scheduler was 
> treating my mistake as if I had created a requestable resource of type FORCED 
> or something. 

Aha, a couple of days ago I got a request by PM from someone who swore that the 
configuration "h_vmem …  YES YES 0 0" had been working fine all along. Only 
after I suggested defining h_vmem at the exechost level to avoid 
oversubscription did all the jobs crash, because no memory was available (the 
default h_vmem = 0 was then applied automatically as a limit).

Essentially: the default value in a complex definition is ignored, as long as 
there is nothing to consume from. If it's not ignored, then the type has to 
match.
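
To make the rule above concrete, here is a toy Python model of it. This is only a sketch of the behaviour as described in this thread, not SGE's actual scheduler logic; the function name and its arguments are made up for illustration. It assumes a job that did not request the resource explicitly, and it treats "something to consume from" as the resource being set in complex_values at some level.

```python
def implicit_request(default, ctype, capacity_defined):
    """Toy model: what an unrequested consumable costs a job per slot.

    default          -- default column of the complex entry, as a string
    ctype            -- type column, e.g. "INT" (only INT/DOUBLE modelled)
    capacity_defined -- True once complex_values sets the resource on a
                        host, a queue, or globally (assumption of this model)
    """
    if not capacity_defined:
        # Nothing to consume from: the default is never looked at,
        # so even a nonsensical value goes unnoticed.
        return None
    # Now the default is applied, so it has to parse as the declared type.
    parser = int if ctype == "INT" else float
    return parser(default)


# Bad default, but no host defines "gpu" yet: nothing happens.
print(implicit_request("NONE", "INT", capacity_defined=False))
# Sane default after hosts define "gpu": each job implicitly consumes 0.
print(implicit_request("0", "INT", capacity_defined=True))
# Bad default after hosts define "gpu": the mismatch finally surfaces.
try:
    implicit_request("NONE", "INT", capacity_defined=True)
except ValueError as err:
    print("type mismatch:", err)
```

In the model the mismatch raises an exception; in the case reported here it showed up instead as the PE offering 0 slots.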

-- Reuti


> 
> Strange but resolved now. 
> 
> Regards
> Chris
> 
> 
> 
> 
> Reuti wrote on 6/11/20 4:17 PM:
>> Hi,
>> 
>> Any consumables in place like memory or other resource requests? Any output 
>> of `qalter -w v …` or "-w p"?
>> 
>> -- Reuti
>> 
>> 
>> 
>>> On 11.06.2020 at 20:32, Chris Dagdigian wrote:
>>> 
>>> Hi folks,
>>> 
>>> Got a bewildering situation I've never seen before with simple SMP/threaded 
>>> PE techniques
>>> 
>>> I made a brand new PE called threaded:
>>> 
>>> $ qconf -sp threaded
>>> pe_name            threaded
>>> slots              999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    NONE
>>> stop_proc_args     NONE
>>> allocation_rule    $pe_slots
>>> control_slaves     FALSE
>>> job_is_first_task  TRUE
>>> urgency_slots      min
>>> accounting_summary FALSE
>>> qsort_args         NONE
>>> 
>>> 
>>> And I attached that to all.q on an IDLE grid and submitted a job with '-pe 
>>> threaded 1' argument
>>> 
>>> However all "qstat -j" data is showing this scheduler decision line:
>>> 
>>> cannot run in PE "threaded" because it only offers 0 slots
>>> 
>>> 
>>> I'm sort of lost on how to debug this because I can't figure out how to 
>>> probe where SGE is keeping track of PE specific slots.  With other stuff I 
>>> can look at complex_values reported by execution hosts or I can use an "-F" 
>>> argument to qstat to dump the live state and status of a requestable 
>>> resource but I don't really have any debug or troubleshooting ideas for 
>>> "how to figure out why SGE thinks there are 0 slots when the static PE on 
>>> an idle cluster has been set to contain 999 slots" 
>>> 
>>> Anyone seen something like this before?  I don't think I've ever seen this 
>>> particular issue with an SGE parallel environment before ...
>>> 
>>> 
>>> Chris
>>> 
>>> ___
>>> users mailing list
>>> 
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
> 




Re: [gridengine users] Strange SGE PE issue (threaded PE with 999 slots but scheduler thinks the value is 0)

2020-06-11 Thread Feng Zhang
Is "threaded" added to all.q?

You can also check "qconf -srqs" to see if there is any limit.

On Thu, Jun 11, 2020 at 2:33 PM Chris Dagdigian  wrote:
>
> Hi folks,
>
> Got a bewildering situation I've never seen before with simple SMP/threaded 
> PE techniques
>
> I made a brand new PE called threaded:
>
> $ qconf -sp threaded
> pe_name            threaded
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    NONE
> stop_proc_args     NONE
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
>
>
> And I attached that to all.q on an IDLE grid and submitted a job with '-pe 
> threaded 1' argument
>
> However all "qstat -j" data is showing this scheduler decision line:
>
> cannot run in PE "threaded" because it only offers 0 slots
>
>
> I'm sort of lost on how to debug this because I can't figure out how to probe 
> where SGE is keeping track of PE specific slots.  With other stuff I can look 
> at complex_values reported by execution hosts or I can use an "-F" argument 
> to qstat to dump the live state and status of a requestable resource but I 
> don't really have any debug or troubleshooting ideas for "how to figure out 
> why SGE thinks there are 0 slots when the static PE on an idle cluster has 
> been set to contain 999 slots"
>
> Anyone seen something like this before?  I don't think I've ever seen this 
> particular issue with an SGE parallel environment before ...
>
>
> Chris
>


Re: [gridengine users] Strange SGE PE issue (threaded PE with 999 slots but scheduler thinks the value is 0)

2020-06-11 Thread Chris Dagdigian


The root cause was strange so it's worth documenting here ...

I had created a new consumable and requestable resource called "gpu" 
configured like this:


gpu gpu    INT   <=    YES YES    NONE    0

And on host A I had set "complex_values gpu=1" and on host B I set 
"complex_values gpu=2" etc. etc. across the cluster.


My mistake was setting the default value of the new complex entry to 
"NONE" instead of "0", which is what you probably want when the attribute 
is of type INT.
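
A mistake like this can be caught before it bites. The Python sketch below is a hypothetical helper, not part of SGE. It assumes the eight-column layout that "qconf -sc" prints (name, shortcut, type, relop, requestable, consumable, default, urgency) and only checks INT and DOUBLE entries; other types are skipped.

```python
def check_complex_default(line):
    """Return a warning string for a suspicious default, else None."""
    fields = line.split()
    if len(fields) != 8:
        return "expected 8 columns, got %d" % len(fields)
    name, _shortcut, ctype, _relop, _req, consumable, default, _urgency = fields
    if consumable.upper() == "NO":
        return None  # the default only matters for consumables
    parsers = {"INT": int, "DOUBLE": float}
    parser = parsers.get(ctype.upper())
    if parser is None:
        return None  # other types are not checked in this sketch
    try:
        parser(default)
    except ValueError:
        return "%s: default %r is not a valid %s" % (name, default, ctype)
    return None


# The entry from this thread, then the corrected one.
print(check_complex_default("gpu gpu INT <= YES YES NONE 0"))
# prints: gpu: default 'NONE' is not a valid INT
print(check_complex_default("gpu gpu INT <= YES YES 0 0"))
# prints: None
```

Feeding each non-comment line of "qconf -sc" output through such a check would flag the "gpu" entry immediately.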


But this was bizarre: basically I had a bad default value for a 
requestable resource, and as soon as we set that value at the 
execution host level it instantly broke all of our parallel 
environments.  The SGE scheduler was treating my mistake as if I had 
created a requestable resource of type FORCED or something.


Strange but resolved now.

Regards
Chris




Reuti wrote on 6/11/20 4:17 PM:

Hi,

Any consumables in place like memory or other resource requests? Any output of `qalter -w 
v …` or "-w p"?

-- Reuti



On 11.06.2020 at 20:32, Chris Dagdigian wrote:

Hi folks,

Got a bewildering situation I've never seen before with simple SMP/threaded PE 
techniques

I made a brand new PE called threaded:

$ qconf -sp threaded
pe_name            threaded
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE


And I attached that to all.q on an IDLE grid and submitted a job with '-pe 
threaded 1' argument

However all "qstat -j" data is showing this scheduler decision line:

cannot run in PE "threaded" because it only offers 0 slots


I'm sort of lost on how to debug this because I can't figure out how to probe where SGE is keeping 
track of PE specific slots.  With other stuff I can look at complex_values reported by execution 
hosts or I can use an "-F" argument to qstat to dump the live state and status of a 
requestable resource but I don't really have any debug or troubleshooting ideas for "how to 
figure out why SGE thinks there are 0 slots when the static PE on an idle cluster has been set to 
contain 999 slots"

Anyone seen something like this before?  I don't think I've ever seen this 
particular issue with an SGE parallel environment before ...


Chris



Re: [gridengine users] Strange SGE PE issue (threaded PE with 999 slots but scheduler thinks the value is 0)

2020-06-11 Thread Reuti
Hi,

Any consumables in place like memory or other resource requests? Any output of 
`qalter -w v …` or "-w p"?

-- Reuti


> On 11.06.2020 at 20:32, Chris Dagdigian wrote:
> 
> Hi folks,
> 
> Got a bewildering situation I've never seen before with simple SMP/threaded 
> PE techniques
> 
> I made a brand new PE called threaded:
> 
> $ qconf -sp threaded
> pe_name            threaded
> slots              999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    NONE
> stop_proc_args     NONE
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> qsort_args         NONE
> 
> 
> And I attached that to all.q on an IDLE grid and submitted a job with '-pe 
> threaded 1' argument
> 
> However all "qstat -j" data is showing this scheduler decision line:
> 
> cannot run in PE "threaded" because it only offers 0 slots
> 
> 
> I'm sort of lost on how to debug this because I can't figure out how to probe 
> where SGE is keeping track of PE specific slots.  With other stuff I can look 
> at complex_values reported by execution hosts or I can use an "-F" argument 
> to qstat to dump the live state and status of a requestable resource but I 
> don't really have any debug or troubleshooting ideas for "how to figure out 
> why SGE thinks there are 0 slots when the static PE on an idle cluster has 
> been set to contain 999 slots" 
> 
> Anyone seen something like this before?  I don't think I've ever seen this 
> particular issue with an SGE parallel environment before ...
> 
> 
> Chris
> 


[gridengine users] Strange SGE PE issue (threaded PE with 999 slots but scheduler thinks the value is 0)

2020-06-11 Thread Chris Dagdigian

Hi folks,

Got a bewildering situation I've never seen before with simple 
SMP/threaded PE techniques


I made a brand new PE called threaded:

$ qconf -sp threaded
pe_name            threaded
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE
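
For scripted checks, a listing like the one above can be turned into a dictionary so the configured slot count can be compared against what the scheduler reports. This is a minimal Python sketch; parse_pe_config is a hypothetical helper, and it assumes one "key value" pair per line separated by whitespace, as in the output shown here.

```python
def parse_pe_config(text):
    """Parse 'key value' lines (as printed by qconf -sp) into a dict."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("$"):
            continue  # skip blank lines and a pasted shell prompt line
        parts = line.split(None, 1)
        config[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return config


sample = """\
$ qconf -sp threaded
pe_name            threaded
slots              999
allocation_rule    $pe_slots
"""
pe = parse_pe_config(sample)
print(pe["pe_name"], pe["slots"], pe["allocation_rule"])
# prints: threaded 999 $pe_slots
```

All values stay strings; converting "slots" with int() is left to the caller.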


And I attached that to all.q on an IDLE grid and submitted a job with 
'-pe threaded 1' argument


However all "qstat -j" data is showing this scheduler decision line:

cannot run in PE "threaded" because it only offers 0 slots


I'm sort of lost on how to debug this because I can't figure out how to 
probe where SGE is keeping track of PE specific slots.  With other stuff 
I can look at complex_values reported by execution hosts or I can use an 
"-F" argument to qstat to dump the live state and status of a 
requestable resource but I don't really have any debug or 
troubleshooting ideas for "how to figure out why SGE thinks there are 0 
slots when the static PE on an idle cluster has been set to contain 999 
slots"


Anyone seen something like this before?  I don't think I've ever seen 
this particular issue with an SGE parallel environment before ...



Chris
