On 04.10.2013 at 22:43, Alex Chekholko wrote:

> Hi Txema,
>
> Are the jobs submitted with "-R y"?
>
> Maybe also try
>
> load_formula slots
>
>
> More philosophically, how do you expect backfilling to work if there are no
> hard runtime limits on jobs?
>
> Regards,
> Alex
>
> On 10/04/2013 07:25 AM, Txema Heredia wrote:
>> Hi all,
>>
>> I have a 27-node cluster. Currently there are 320 out of 320 slots
>> filled up. All by jobs requesting 1-slot.
>>
>> At the top of my waiting queue there are 28 different jobs requesting 3
>> to 12 cores using two different parallel environments. All these jobs
>> are requesting -R y. They are being ignored and overrun by the myriad of
>> 1-slot requesting jobs behind them in the waiting queue.
>>
>> I have enabled the scheduler logging. During the last 4 hours, it has
>> logged 724 new jobs starting, in all the 27 nodes. Not a single job on
>> the system is requesting -l h_rt, but single-core jobs keep being
>> scheduled and all the parallel jobs are starving.
>>
>> As far as I understand, the backfilling is killing my reservations, even
>> if no one is requesting any kind of time, but if I set the
>> "default_duration" to INFINITY, all the RESERVING log messages disappear.

With this setting SGE assumes by default that every job runs forever, so
nothing can be reserved. Even if some jobs request h_rt: SGE judges INFINITY
to be smaller than INFINITY, so running jobs without an h_rt request may
allow other jobs to slip in because of this, whether they request h_rt or not.

-- Reuti

>> Additionally, for some odd reason, I only receive RESERVING messages from
>> the jobs requesting a given number of slots (-pe whatever N). The jobs
>> requesting a slot-range (-pe threaded 4-10) seem to reserve nothing.
>>
>> My scheduler configuration is as follows:
>>
>> # qconf -ssconf
>> algorithm                         default
>> schedule_interval                 0:0:5
>> maxujobs                          0
>> queue_sort_method                 load
>> job_load_adjustments              np_load_avg=0.50
>> load_adjustment_decay_time        0:7:30
>> load_formula                      np_load_avg
>> schedd_job_info                   true
>> flush_submit_sec                  0
>> flush_finish_sec                  0
>> params                            MONITOR=1
>> reprioritize_interval             0:0:0
>> halftime                          168
>> usage_weight_list                 cpu=0.187000,mem=0.116000,io=0.697000
>> compensation_factor               5.000000
>> weight_user                       0.250000
>> weight_project                    0.250000
>> weight_department                 0.250000
>> weight_job                        0.250000
>> weight_tickets_functional         1000000000
>> weight_tickets_share              1000000000
>> share_override_tickets            TRUE
>> share_functional_shares           TRUE
>> max_functional_jobs_to_schedule   200
>> report_pjob_tickets               TRUE
>> max_pending_tasks_per_job         50
>> halflife_decay_list               none
>> policy_hierarchy                  OSF
>> weight_ticket                     0.010000
>> weight_waiting_time               0.000000
>> weight_deadline                   3600000.000000
>> weight_urgency                    0.100000
>> weight_priority                   1.000000
>> max_reservation                   50
>> default_duration                  24:00:00
>>
>>
>> I have also tested it with params PROFILE=1 and default_duration
>> INFINITY. But, when I set it, not a single reservation is logged in
>> /opt/gridengine/default/common/schedule and new jobs keep starting.
>>
>>
>> What am I missing? Is it possible to kill the backfilling? Are my
>> reservations really working?
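
To make the point about default_duration INFINITY more concrete, here is a toy model in Python. It is only a sketch and not SGE's scheduler code: the function names, the one-slot-per-job simplification and the example end times are all assumptions made up for illustration. It shows that a reservation needs a finite point in time at which enough running jobs are expected to end, and that no such point exists once every job is assumed to run forever.

```python
import math


def reservation_start(end_times, slots_needed):
    """Earliest time at which `slots_needed` one-slot jobs have finished.

    Returns math.inf when too few jobs are ever expected to finish, which
    means no reservation can be placed on the timeline.
    """
    finite = sorted(t for t in end_times if not math.isinf(t))
    if len(finite) < slots_needed:
        return math.inf
    return finite[slots_needed - 1]


def fits_before(now, duration, start):
    """True if a job started at `now` ends no later than the reservation."""
    return now + duration <= start


# Hypothetical end times (hours from now) of four running one-slot jobs.
finite_ends = [2, 10, 20, 24]     # every job has a finite assumed runtime
infinite_ends = [math.inf] * 4    # default_duration INFINITY, no h_rt given

print(reservation_start(finite_ends, 3))     # a 3-slot job can be planned
print(reservation_start(infinite_ends, 3))   # no start time can be planned
print(fits_before(2, 1, 20))     # 1-hour job in the slot freed at hour 2
print(fits_before(2, 24, 20))    # 24-hour job would overlap the reservation
```

In this model, finite end times give the 3-slot job a planned start, and only jobs short enough to end before that start may be backfilled. With infinite end times there is no start to plan, so nothing holds back the 1-slot jobs.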
>>
>> Thanks in advance,
>>
>> Txema
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>
> --
> Alex Chekholko [email protected]
