Anyone?
It seems to me that right now, if a job ends up in the short queue, its higher
priority stops all scheduling, at least until a node frees up so that the short
queue jobs can start (no clue why they aren't started on nodes that have free
consumable resources but are running jobs from another partition). Right now it
also seems that main partition jobs are waiting in the queue:
[root@slurm-1 out]# qstat -q
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
main -- -- 2880 -- 1790 556 -- E R
short -- -- 60 -- 0 910 -- E R
----- -----
1790 1466
[root@slurm-1 out]# sinfo -Nle -p main
Mon Feb 4 12:47:03 2013
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
wn-v-[2036,2072,2108,2180,2324,2396,2468,2540,2684,2936,3260,3404,3440,3620,3728,3836,3872,3944,4016,4052,4376,4412,4592,4628,4772,4844,4988,5096,5132,5168,5348,5384,5456,5564,5600,5708,5852,5924,6068,6104,6140,6284,6428,6536,6572] 45 main* allocated 32 2:16:1 65536 0 1 (null) none
wn-v-[7001,7003,7005,7007-7008,7013,7015,7017-7019,7021,7023,7026-7027,7029-7030,7033,7035,7039] 19 main* allocated 24 2:12:1 49152 0 1 (null) none
(so all nodes are allocated, none idle)
And the core count is:
[root@slurm-1 out]# sinfo -N -le -p main|grep wn|awk '{s+=$2*$5} END {print s}'
1896
therefore there have to be over 100 free cores, but jobs aren't starting.
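
To spell out that estimate, here is a minimal sketch of the arithmetic. The node counts and CPUs per node are copied from the sinfo output and the running count from the Run column of qstat -q; the variable names are only illustrative, and it assumes every running job occupies exactly one core.

```shell
# Node groups from the sinfo output: "<number of nodes> <CPUs per node>".
total_cores=$(printf '%s\n' '45 32' '19 24' | awk '{s += $1 * $2} END {print s}')

# Running jobs in main according to qstat -q (short has 0 running).
running_jobs=1790

# With single-core jobs, every running job holds exactly one core.
free_cores=$(( total_cores - running_jobs ))

echo "total=${total_cores} running=${running_jobs} free=${free_cores}"
```

This prints total=1896 running=1790 free=106, so roughly 106 cores sit unused while 556 main jobs and 910 short jobs wait in the queue.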
As already said, we use single core jobs and have consumable resources as cores:
SelectTypeParameters=CR_Core
Any ideas on how to solve this so that short partition jobs get high priority
and start first, but without this kind of waiting for a node to free up? Right
now it may well mean that the whole cluster has to drain, since the jobs started
at about the same time and have about the same length, so waiting for a whole
node to free up might take a day or so... And this wastes resources.
On 01.02.2013, at 16:42, Mario Kadastik <[email protected]> wrote:
>
> Hi,
>
> we would like to configure our cluster for two main situations: main jobs and
> short jobs. We'd like to give the short jobs pre-emption by raising their
> priority. Right now we've configured two partitions (main and short) and have
> put all nodes into both partitions. We use CPU cores as the consumable
> resource, and the short partition has priority 10x that of the main one.
>
> However it seems that this causes scheduling issues:
>
> [root@slurm-1 log]# qstat -q
> Queue Memory CPU Time Walltime Node Run Que Lm State
> ---------------- ------ -------- -------- ---- --- --- -- -----
> main -- -- 2880 -- 1652 0 -- E R
> short -- -- 60 -- 0 2 -- E R
> ----- -----
> 1652 2
>
> while in fact we have 1896 job slots. The reason is that all nodes are in use:
>
> [root@slurm-1 log]# sinfo -Nle -p main
> Fri Feb 1 15:39:00 2013
> NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
> wn-v-[2036,2072,2108,2180,2324,2396,2468,2540,2684,2936,3260,3404,3440,3620,3728,3836,3872,3944,4016,4052,4376,4412,4592,4628,4772,4844,4988,5096,5132,5168,5348,5384,5456,5564,5600,5708,5852,5924,6068,6104,6140,6284,6428,6536,6572] 45 main* allocated 32 2:16:1 65536 0 1 (null) none
> wn-v-[7001,7003,7005,7007-7008,7013,7015,7017-7019,7021,7023,7026-7027,7029-7030,7033,7035,7039] 19 main* allocated 24 2:12:1 49152 0 1 (null) none
>
>
> Scheduling pool data:
> ----------------------------------------------------------------------------------
>                        Total  Usable   Free   Node        Time  Other
> Pool     Memory  Cpus  Nodes   Nodes  Nodes  Limit       Limit  traits
> ----------------------------------------------------------------------------------
> main*   65536Mb    32     45      45      0  UNLIM  2-00:00:00
> main*   49152Mb    24     19      19      0  UNLIM  2-00:00:00
> short   65536Mb    32     45      45      0  UNLIM    01:00:00
> short   49152Mb    24     19      19      0  UNLIM    01:00:00
>
> However, considering that ALL jobs we use are 1 core jobs, it means that even
> though every single node has 1 or more jobs on it, it still has free cores.
> Looking at sjstat -v output I see that those two short jobs are held:
>
> 106495 joosep 1 short PD 0:00 1:00:00 N/A (Resources)
> 106496 joosep 1 short PD 0:00 1:00:00 N/A (Priority)
>
> one for resources and the other for priority (because its priority is lower
> than that of the job held for resources).
>
> Now I guess the problem is that there were no free nodes (as opposed to free
> cores) and the running jobs belonged to a different partition. How to best
> solve this? We'd like to have most jobs in the main queue, where their
> priority is based on fairshare and queue time, but we'd also like to give
> users the ability to send short testing jobs at any point in time and have
> them move high up in the queue and execute on the next free cores. We'd
> rather not dedicate specific nodes to the short queue because that has two
> distinct disadvantages:
> 1) resource utilization suffers if there aren't short jobs all the time
> 2) if those nodes go down for whatever reason, then there are no slots left
> for short.
>
> What is the best way to solve this?
>
> Thanks,
>
> Mario Kadastik, PhD
> Researcher
>
> ---
> "Physics is like sex, sure it may have practical reasons, but that's not why
> we do it"
> -- Richard P. Feynman