Hi all,

I have a 27-node cluster. Currently all 320 slots are in use, every one of them by a single-slot job.

At the top of my waiting queue there are 28 different jobs requesting 3 to 12 cores through two different parallel environments. All of them were submitted with -R y, yet they are being ignored and overtaken by the many single-slot jobs behind them in the queue.

I have enabled scheduler logging. Over the last 4 hours it has logged 724 job starts, spread across all 27 nodes. Not a single job on the system requests -l h_rt, yet single-core jobs keep being scheduled while all the parallel jobs starve.
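
To put those numbers in perspective, here is a rough back-of-the-envelope estimate. It assumes the cluster stays full and in steady state over the 4-hour window, so that Little's law (jobs in the system = start rate x mean runtime) applies. The figures come from this post; the steady-state assumption and the function names are mine.

```python
# Figures from the post; steady state over the window is an assumption.
TOTAL_SLOTS = 320      # every slot is held by a single-slot job
STARTS_LOGGED = 724    # job starts seen in the scheduler log
WINDOW_HOURS = 4.0     # length of the observation window


def starts_per_hour(starts, hours):
    """Average number of job starts per hour over the window."""
    return starts / hours


def implied_mean_runtime_hours(running_jobs, rate_per_hour):
    """Little's law: running jobs = start rate x mean runtime."""
    return running_jobs / rate_per_hour


rate = starts_per_hour(STARTS_LOGGED, WINDOW_HOURS)
mean_runtime = implied_mean_runtime_hours(TOTAL_SLOTS, rate)
print(f"{rate:.0f} starts/hour, mean runtime about {mean_runtime * 60:.0f} minutes")
```

If that estimate is anywhere near right, the single-slot jobs finish far sooner than any 24-hour runtime the scheduler might assume for them.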

As far as I understand, backfilling is defeating my reservations, even though no job requests any run time. However, if I set default_duration to INFINITY, all the RESERVING log messages disappear.
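
For context, the textbook backfilling rule can be sketched as below. This is a simplified model of backfilling in general, not Grid Engine's actual scheduler code: a job may jump ahead only if its assumed runtime ends before the reserved start time. The function name and the 2-hour reservation are made up for illustration.

```python
import math


def can_backfill(now, assumed_runtime, reservation_start):
    """True if a job started now is expected to end before the reservation begins.

    All arguments are in seconds. This is the generic backfilling test,
    not a copy of what Grid Engine does internally.
    """
    return now + assumed_runtime <= reservation_start


HOUR = 3600
now = 0
reservation_start = 2 * HOUR  # hypothetical: the reserved job could start in 2 hours

for label, runtime in [
    ("1 hour", 1 * HOUR),
    ("24 hours", 24 * HOUR),
    ("infinite", math.inf),
]:
    print(f"assumed runtime {label}: backfill allowed = "
          f"{can_backfill(now, runtime, reservation_start)}")
```

Under this model, a job that is assumed to need 24 hours, or to run forever, never fits in front of a reservation that starts sooner than that, while a job with a short assumed runtime does.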

Additionally, for some odd reason, I only see RESERVING messages for jobs requesting a fixed number of slots (-pe whatever N). Jobs requesting a slot range (-pe threaded 4-10) seem to reserve nothing.

My scheduler configuration is as follows:

# qconf -ssconf
algorithm                         default
schedule_interval                 0:0:5
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            MONITOR=1
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=0.187000,mem=0.116000,io=0.697000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         1000000000
weight_tickets_share              1000000000
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OSF
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   50
default_duration                  24:00:00
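
The reservation-related settings are easy to lose in that dump, so here is a small sketch that pulls them out. It parses an excerpt of the output above as plain "name value" lines. The helper names are hypothetical, and the reading that max_reservation 0 switches reservations off is my understanding of the setting, not something I have verified here.

```python
# Excerpt of the `qconf -ssconf` output from this post.
SSCONF_TEXT = """\
schedule_interval                 0:0:5
params                            MONITOR=1
max_reservation                   50
default_duration                  24:00:00
"""


def parse_ssconf(text):
    """Split each 'name value' line into a dict entry."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def monitor_enabled(params):
    """Look for MONITOR=1 or MONITOR=true in the params value."""
    for item in params.replace(",", " ").split():
        key, _, value = item.partition("=")
        if key.upper() == "MONITOR":
            return value.lower() in ("1", "true")
    return False


def reservation_summary(settings):
    """Collect the settings that matter for reservations."""
    max_res = int(settings.get("max_reservation", "0"))
    return {
        # Assumption: a max_reservation of 0 means no reservations are made.
        "reservations_enabled": max_res > 0,
        "max_reservation": max_res,
        "default_duration": settings.get("default_duration", "unset"),
        "monitor_on": monitor_enabled(settings.get("params", "none")),
    }


summary = reservation_summary(parse_ssconf(SSCONF_TEXT))
for name, value in summary.items():
    print(f"{name}: {value}")
```

On a live system the same functions could be fed the full text of `qconf -ssconf` instead of the excerpt.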


I have also tested with params PROFILE=1 and default_duration INFINITY. With those settings, not a single reservation is logged in /opt/gridengine/default/common/schedule, and new jobs keep starting.
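
To check which jobs actually get reservations, the schedule file can be tallied by state. The sketch below assumes each record has nine colon-separated fields (job id, task id, state, start time, duration, level, object, resource, utilization) and that a line of eight colons separates scheduling runs. That layout is my reading of the MONITOR output and should be checked against the real file. The sample lines are invented for illustration.

```python
from collections import defaultdict

# Invented sample records in the assumed layout:
# jobid:taskid:state:start_time:duration:level:object:resource:utilization
SAMPLE_LINES = [
    "::::::::",
    "101:1:STARTING:1300000000:86400:Q:all.q@node01:slots:1.000000",
    "102:1:RESERVING:1300003600:86400:P:whatever:slots:4.000000",
    "102:1:RESERVING:1300003600:86400:Q:all.q@node02:slots:4.000000",
    "::::::::",
    "103:1:STARTING:1300000005:86400:Q:all.q@node03:slots:1.000000",
]


def jobs_by_state(lines):
    """Map each state (STARTING, RESERVING, ...) to the set of job ids seen."""
    jobs = defaultdict(set)
    for line in lines:
        fields = line.strip().split(":")
        # Skip run separators and anything that is not a nine-field record.
        if len(fields) != 9 or not fields[0]:
            continue
        job_id, state = fields[0], fields[2]
        jobs[state].add(job_id)
    return jobs


states = jobs_by_state(SAMPLE_LINES)
for state in sorted(states):
    print(f"{state}: {len(states[state])} job(s): {sorted(states[state])}")
```

Against the real log, the same function can be given the open file, for example `jobs_by_state(open("/opt/gridengine/default/common/schedule"))`, and the RESERVING job ids compared with the parallel jobs at the top of the queue.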


What am I missing? Is it possible to disable backfilling? Are my reservations really working?

Thanks in advance,

Txema