The root cause was strange so it's worth documenting here ...
I had created a new consumable and requestable resource called "gpu"
configured like this:
gpu gpu INT <= YES YES NONE 0
And on host A I had set "complex_values gpu=1" and on host B I set
"complex_values gpu=2" etc. etc. across the cluster.
My mistake was setting the default value of the new complex entry to
"NONE" instead of "0", which is what you probably want when the attribute
is of type INT.
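For anyone who wants to check their own complex configuration for the same mistake, here is a minimal sketch. It assumes the usual eight-column layout printed by "qconf -sc" (name, shortcut, type, relop, requestable, consumable, default, urgency). The function name check_complex_defaults is made up for illustration, and it only looks at INT entries with a plain integer default, so treat it as a starting point rather than a complete validator.

```shell
# Hypothetical helper: reads complex definitions on stdin and reports
# INT entries whose default column is not a plain integer.
# Assumes the eight-column layout of "qconf -sc" output.
check_complex_defaults() {
  awk '
    # Skip comment/header lines and anything without all eight columns.
    /^#/ || NF < 8 { next }
    # Column 3 is the type, column 7 is the default value.
    $3 == "INT" && $7 !~ /^-?[0-9]+$/ {
      printf "%s: INT complex has non-numeric default \"%s\"\n", $1, $7
    }
  '
}

# Intended usage on a host with SGE installed:
#   qconf -sc | check_complex_defaults
```

Fed the "gpu" line above, this would flag the NONE default; after changing the default to 0 via "qconf -mc" it should print nothing for that entry.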
But this was bizarre: basically I had a bad default value for a
requestable resource, and as soon as we set that value at the
execution host level it instantly broke all of our parallel
environments. The SGE scheduler was treating my mistake as if I had
created a requestable resource of type FORCED or something.
Strange but resolved now.
Regards
Chris
Reuti wrote on 6/11/20 4:17 PM:
Hi,
Any consumables in place like memory or other resource requests?
Any output of `qalter -w v …` or "-w p"?
-- Reuti
On 11.06.2020 at 20:32, Chris Dagdigian <d...@sonsorol.org> wrote:
Hi folks,
Got a bewildering situation I've never seen before with simple SMP/threaded PE
techniques.
I made a brand new PE called threaded:
$ qconf -sp threaded
pe_name threaded
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
qsort_args NONE
And I attached that to all.q on an IDLE grid and submitted a job with the '-pe
threaded 1' argument.
However, all "qstat -j" data is showing this scheduler decision line:
cannot run in PE "threaded" because it only offers 0 slots
I'm sort of lost on how to debug this because I can't figure out how to probe where SGE is keeping
track of PE-specific slots. With other stuff I can look at complex_values reported by execution
hosts, or I can use an "-F" argument to qstat to dump the live state and status of a
requestable resource, but I don't really have any debug or troubleshooting ideas for "how to
figure out why SGE thinks there are 0 slots when the static PE on an idle cluster has been set to
contain 999 slots".
Anyone seen something like this before? I don't think I've ever seen this
particular issue with an SGE parallel environment before ...
Chris
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users