On 07.10.2013 at 15:59, Txema Heredia wrote:

> On 07/10/13 14:58, Reuti wrote:
>> Hi,
>> 
>> On 07.10.2013 at 13:15, Txema Heredia wrote:
>> 
>>> The problem is that, right now, making h_rt mandatory is not an option.
>>> So we need to work on the assumption that all jobs will last to infinity
>>> and beyond.
>>> 
>>> Right now, the scheduler configuration is:
>>> max_reservation 50
>>> default_duration 24:00:00
>>> 
>>> During the weekend, most of the parallel (and -R y) jobs started
>>> running, but now there is something fishy in my queues:
>>> 
>>> The first 3 jobs in my waiting queue belong to user1. All 3 jobs request 
>>> -pe mpich_round 12, -R y and -l h_vmem=4G (h_vmem is set to consumable = 
>>> YES, not JOB).
>> Which amount of memory did you specify in the exechost definition, i.e. 
>> what's in the machine physically?
>> 
>> -- Reuti
> 
> 26 nodes have 96GB of ram. One node has 48GB.

And you defined it on an exechost level under "complex_values"? - Reuti

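[Editorial note: the per-host memory capacity Reuti is asking about is usually inspected and set along these lines; the host name `node01` is a placeholder, not from this thread.]

```shell
# Show one execution host's definition; h_vmem only acts as a per-host
# consumable if it is listed under complex_values here.
qconf -se node01

# Non-interactive way to set the consumable capacity on that host,
# e.g. 96 GB of h_vmem (qconf -me node01 opens an editor instead).
qconf -mattr exechost complex_values h_vmem=96G node01
```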

> Currently the nodes have between 4 and 10 free slots each, and between 26
> and 82.1 GB of free memory.
> 
> The first jobs in my waiting queue (after the 3 reserving ones) request a
> measly 0.9G, 3G and 12G, all with slots=1 and -R n. None of them gets
> scheduled. But if I manually increase their priority so they are placed
> BEFORE the 3 -R y jobs, they are scheduled immediately.
> 
>> 
>> 
>>> This user has already one job like these running. User1 has a RQS that 
>>> limits him to use only 12 slots in the whole cluster. Thus the 3 waiting 
>>> jobs will not be able to run until the first one finishes.
>>> 
>>> This is the current schedule log:
>>> 
>>> # grep "::::\|RESERVING" schedule | tail -200 | grep "::::\|Q:all" | tail -37 | sort
>>> ::::::::
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734185:1:RESERVING:1381142325:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734186:1:RESERVING:1381228785:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 2734187:1:RESERVING:1381315245:86460:Q:[email protected]:slots:1.000000
>>> 
>>> 
>>> Right now, the cluster is using 190 of its 320 total slots. The schedule
>>> log says that the 3 waiting jobs from user1 are the only jobs making any
>>> kind of reservation. These jobs are reserving a total of 36 cores, and
>>> are effectively blocking 36 already-free slots, because the RQS doesn't
>>> allow user1 to use more than 12 slots at once. This is not "nice", but I
>>> understand that the scheduler has its limitations and cannot predict the
>>> future.
>>> 
>>> Taking into account the running jobs plus the slots & memory locked by
>>> the reserving jobs, a grand total of 226 slots is locked, leaving 94 free
>>> slots.
>>> 
>>> Here comes the problem: even though there are 94 free slots and plenty
>>> of spare memory, NONE of the 4300 waiting jobs is running. There are
>>> nodes with 6 free slots and 59 GB of free RAM, but none of the waiting
>>> jobs gets scheduled. New jobs only start running when one of the 190
>>> slots occupied by running jobs is freed. None of these other waiting
>>> jobs requests -R y, -pe, or h_rt.
>>> 
>>> 
>>> Additionally, this is causing some odd behaviour. It seems that, on
>>> each scheduler run, the scheduler tries to start jobs in those "blocked
>>> slots" but fails for no apparent reason. Some of the jobs even try to
>>> start twice, but almost none (generally none at all) actually gets to
>>> run:
>>> 
>>> # tail -2000 schedule | grep -A 1000 "::::::" | grep "Q:all" | grep STARTING | sort
>>> 2734121:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734122:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734123:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734124:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734125:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734126:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734127:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734128:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734129:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734130:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734131:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734132:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734133:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734134:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734135:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734136:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734137:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734138:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734139:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734140:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734141:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734142:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734143:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734144:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734145:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734146:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734147:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734148:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734149:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734150:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734151:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734152:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734153:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734154:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734155:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734156:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734157:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734158:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734159:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734160:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2734161:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735158:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735159:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735160:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735161:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735162:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735163:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735164:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735165:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735166:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735167:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735168:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735169:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735170:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735171:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735172:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735173:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735174:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735175:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735176:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735177:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735178:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735179:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735180:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735181:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735182:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735183:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735184:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735185:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735186:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735187:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735188:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735189:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735190:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735191:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735192:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2735193:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743479:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743480:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743481:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743482:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743483:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743484:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743485:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743486:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743487:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743488:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743489:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743490:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743491:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743492:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743493:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743494:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743495:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743496:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743497:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743498:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743499:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743500:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743501:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743502:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743503:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743504:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743505:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743506:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743507:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743508:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743509:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743510:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743511:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743512:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743513:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743514:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743515:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743516:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743517:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 2743518:1:STARTING:1381144160:86460:Q:[email protected]:slots:1.000000
>>> 
>>> 
>>> Even though these jobs appear listed as "starting", they are not running
>>> at all. Yet they issue a new "starting" message on every scheduling
>>> interval.
>>> 
>>> Why are the reservations blocking a third of the cluster? It shouldn't
>>> be a backfilling issue: they are blocking three times the number of
>>> slots actually reserved. And why can't the "starting" jobs run?
>>> 
>>> Txema
>>> 
>>> 
>>> 
>>> On 07/10/13 09:28, Christian Krause wrote:
>>>> Hello,
>>>> 
>>>> We solved it the way that `h_rt` is set to FORCED in the complex list:
>>>> 
>>>>     #name   shortcut  type  relop  requestable  consumable  default  urgency
>>>>     #----------------------------------------------------------------------
>>>>     h_rt    h_rt      TIME  <=     FORCED       YES         0:0:0    0
>>>> 
>>>> And we have a JSV rejecting jobs that don't request it (because
>>>> otherwise they would be pending indefinitely, unless you have a default
>>>> duration or use qalter).
>>>> 
>>>> You could also use a JSV to enforce that only jobs with large resource
>>>> requests (in your case, more than some number of slots) are allowed to
>>>> request a reservation, i.e.:
>>>> 
>>>>     # pseudo JSV code
>>>>     SLOT_RESERVATION_THRESHOLD=...
>>>>     if slots < SLOT_RESERVATION_THRESHOLD then
>>>>         "disable reservation / reject"
>>>>     else
>>>>         "enable reservation"
>>>>     fi
>>>> 
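[Editorial note: a runnable version of that sketch, as a client-side Bourne shell JSV, might look roughly like the following. The threshold value, and the choice to force `-R n` rather than reject the job, are assumptions, not something stated in this thread.]

```shell
#!/bin/sh
# Sketch of a client-side JSV: only jobs requesting at least
# SLOT_RESERVATION_THRESHOLD slots keep (or get) a reservation.
SLOT_RESERVATION_THRESHOLD=4   # assumed value, tune for your site

jsv_on_verify() {
    # pe_max is the upper bound of the -pe slot range; unset for serial jobs
    slots=$(jsv_get_param pe_max)
    if [ -z "$slots" ] || [ "$slots" -lt "$SLOT_RESERVATION_THRESHOLD" ]; then
        jsv_set_param R n    # small/serial job: no reservation
    else
        jsv_set_param R y    # large parallel job: reserve slots
    fi
    jsv_accept "job verified"
}

# Pull in the JSV protocol helpers shipped with Grid Engine and start the
# verification loop (guarded so the sketch can be read outside an SGE host).
if [ -n "$SGE_ROOT" ] && [ -f "$SGE_ROOT/util/resources/jsv/jsv_include.sh" ]; then
    . "$SGE_ROOT/util/resources/jsv/jsv_include.sh"
    jsv_main
fi
```

Submitters would then pick it up via `qsub -jsv /path/to/script.sh`, or it could be enforced site-wide through the `jsv_url` setting in the global configuration.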
>>>> 
>>>> On Fri, Oct 04, 2013 at 04:25:29PM +0200, Txema Heredia wrote:
>>>>> Hi all,
>>>>> 
>>>>> I have a 27-node cluster. Currently 320 out of 320 slots are filled,
>>>>> all by jobs requesting a single slot.
>>>>> 
>>>>> At the top of my waiting queue there are 28 different jobs requesting
>>>>> 3 to 12 cores, using two different parallel environments. All of these
>>>>> jobs request -R y. They are being ignored and overrun by the myriad of
>>>>> 1-slot jobs behind them in the waiting queue.
>>>>> 
>>>>> I have enabled scheduler logging. During the last 4 hours it has
>>>>> logged 724 new jobs starting, across all 27 nodes. Not a single job on
>>>>> the system requests -l h_rt, but single-core jobs keep being scheduled
>>>>> and all the parallel jobs are starving.
>>>>> 
>>>>> As far as I understand, backfilling is killing my reservations, even
>>>>> though nobody requests any kind of run time. Yet if I set
>>>>> "default_duration" to INFINITY, all the RESERVING log messages
>>>>> disappear.
>>>>> 
>>>>> Additionally, for some odd reason, I only receive RESERVING messages
>>>>> from jobs requesting a fixed number of slots (-pe whatever N). Jobs
>>>>> requesting a slot range (-pe threaded 4-10) seem to reserve nothing.
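[Editorial note: to make the two cases concrete, the difference described above is between submissions like these. The PE names match the thread; the job script name is a placeholder.]

```shell
# Fixed slot count: RESERVING entries do show up in the schedule file
qsub -pe mpich_round 12 -R y -l h_vmem=4G job.sh

# Slot range: reportedly produces no RESERVING entries at all
qsub -pe threaded 4-10 -R y job.sh
```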
>>>>> 
>>>>> My scheduler configuration is as follows:
>>>>> 
>>>>> # qconf -ssconf
>>>>> algorithm                         default
>>>>> schedule_interval                 0:0:5
>>>>> maxujobs                          0
>>>>> queue_sort_method                 load
>>>>> job_load_adjustments              np_load_avg=0.50
>>>>> load_adjustment_decay_time        0:7:30
>>>>> load_formula                      np_load_avg
>>>>> schedd_job_info                   true
>>>>> flush_submit_sec                  0
>>>>> flush_finish_sec                  0
>>>>> params                            MONITOR=1
>>>>> reprioritize_interval             0:0:0
>>>>> halftime                          168
>>>>> usage_weight_list                 cpu=0.187000,mem=0.116000,io=0.697000
>>>>> compensation_factor               5.000000
>>>>> weight_user                       0.250000
>>>>> weight_project                    0.250000
>>>>> weight_department                 0.250000
>>>>> weight_job                        0.250000
>>>>> weight_tickets_functional         1000000000
>>>>> weight_tickets_share              1000000000
>>>>> share_override_tickets            TRUE
>>>>> share_functional_shares           TRUE
>>>>> max_functional_jobs_to_schedule   200
>>>>> report_pjob_tickets               TRUE
>>>>> max_pending_tasks_per_job         50
>>>>> halflife_decay_list               none
>>>>> policy_hierarchy                  OSF
>>>>> weight_ticket                     0.010000
>>>>> weight_waiting_time               0.000000
>>>>> weight_deadline                   3600000.000000
>>>>> weight_urgency                    0.100000
>>>>> weight_priority                   1.000000
>>>>> max_reservation                   50
>>>>> default_duration                  24:00:00
>>>>> 
>>>>> 
>>>>> I have also tested it with params PROFILE=1 and default_duration
>>>>> INFINITY. But when I set that, not a single reservation is logged in
>>>>> /opt/gridengine/default/common/schedule, and new jobs keep starting.
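[Editorial note: for anyone reproducing this, the two reservation knobs can be inspected and changed non-interactively roughly as follows; the values are simply the ones discussed in the thread, and `sed -i` assumes GNU sed.]

```shell
# Show the current scheduler configuration
qconf -ssconf

# Non-interactive change: dump the config, edit the two reservation
# settings, then load the modified file back.
qconf -ssconf > sconf.txt
sed -i -e 's/^default_duration.*/default_duration                  INFINITY/' \
       -e 's/^max_reservation.*/max_reservation                   50/' sconf.txt
qconf -Msconf sconf.txt
```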
>>>>> 
>>>>> 
>>>>> What am I missing? Is it possible to kill the backfilling? Are my
>>>>> reservations really working?
>>>>> 
>>>>> Thanks in advance,
>>>>> 
>>>>> Txema
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
>>> 
> 

