[slurm-dev] Re: Pre-empting short jobs

Mario Kadastik Mon, 04 Feb 2013 02:54:04 -0800

Anyone? 

It seems to me that right now, if a job ends up in the short queue, all 
scheduling stops because of its higher priority, at least until a whole node 
frees up so that short queue jobs can start. (I have no clue why they aren't 
started on nodes that have free consumable resources but are running jobs from 
another partition.) Right now main partition jobs also seem to be waiting in 
the queue:


[root@slurm-1 out]# qstat -q
Queue            Memory CPU Time Walltime Node  Run Que Lm State
---------------- ------ -------- -------- ----  --- --- -- -----
main               --      --        2880   --  1790 556 --  E R 
short              --      --          60   --    0 910 --  E R 
                                               ----- -----
                                                1790  1466

[root@slurm-1 out]# sinfo -Nle -p main
Mon Feb  4 12:47:03 2013
NODELIST
  NODES PARTITION     STATE CPUS  S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
wn-v-[2036,2072,2108,2180,2324,2396,2468,2540,2684,2936,3260,3404,3440,3620,3728,3836,3872,3944,4016,4052,4376,4412,4592,4628,4772,4844,4988,5096,5132,5168,5348,5384,5456,5564,5600,5708,5852,5924,6068,6104,6140,6284,6428,6536,6572]
     45     main* allocated   32 2:16:1  65536        0      1   (null) none
wn-v-[7001,7003,7005,7007-7008,7013,7015,7017-7019,7021,7023,7026-7027,7029-7030,7033,7035,7039]
     19     main* allocated   24 2:12:1  49152        0      1   (null) none

(so all nodes are allocated, none idle)

And the core count is:
[root@slurm-1 out]# sinfo -N -le -p main|grep wn|awk '{s+=$2*$5} END {print s}'
1896

Therefore, with 1790 single-core jobs running on 1896 cores, there have to be 
over 100 free cores, but jobs aren't starting. 
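
As a sanity check on that number, here is a minimal Python sketch of the 
arithmetic. It uses only figures from the sinfo and qstat output above (45 
nodes with 32 cores, 19 nodes with 24 cores, 1790 running jobs) and assumes 
that every job takes exactly one core. The variable names are made up for 
illustration.

```python
# Node groups taken from the sinfo output above: (node count, cores per node).
node_groups = [(45, 32), (19, 24)]

# Running jobs in the main partition according to qstat -q.
running_jobs = 1790

# Assumption: every job is a single-core job, so one job occupies one core.
cores_per_job = 1

total_cores = sum(nodes * cores for nodes, cores in node_groups)
free_cores = total_cores - running_jobs * cores_per_job

print(f"total cores: {total_cores}")
print(f"free cores:  {free_cores}")
```

This gives 1896 cores in total, matching the awk result, and 106 cores that 
should be free if the single-core assumption holds.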

As already said, we run single-core jobs and use cores as the consumable 
resource:

SelectTypeParameters=CR_Core

Any ideas how to solve this so that short partition jobs get high priority and 
start first, but without the scheduler waiting for whole nodes to free up? As 
it stands, the whole cluster may effectively have to drain, because the jobs 
started at about the same time and have about the same length, so waiting for 
a whole node to free up might take a day or so. That wastes resources. 

On 01.02.2013, at 16:42, Mario Kadastik <[email protected]> wrote:

> 
> Hi,
> 
> we would like to configure our cluster for two main situations: main jobs 
> and short jobs. We'd like to give the short jobs pre-emption by raising 
> their priority. Right now we have configured two partitions (main and short) 
> and put all nodes into both partitions. We use consumable resources with CPU 
> cores as the consumable, and the short partition has 10x the priority of the 
> main one. 
> 
> However it seems that this causes scheduling issues:
> 
> [root@slurm-1 log]# qstat -q
> Queue            Memory CPU Time Walltime Node  Run Que Lm State
> ---------------- ------ -------- -------- ----  --- --- -- -----
> main               --      --        2880   --  1652   0 --  E R 
> short              --      --          60   --    0   2 --  E R 
>                                               ----- -----
>                                                1652     2
> 
> while in fact we have 1896 job slots. The reason is that all nodes are in use:
> 
> [root@slurm-1 log]# sinfo -Nle -p main
> Fri Feb  1 15:39:00 2013
> NODELIST
>   NODES PARTITION     STATE CPUS  S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
> wn-v-[2036,2072,2108,2180,2324,2396,2468,2540,2684,2936,3260,3404,3440,3620,3728,3836,3872,3944,4016,4052,4376,4412,4592,4628,4772,4844,4988,5096,5132,5168,5348,5384,5456,5564,5600,5708,5852,5924,6068,6104,6140,6284,6428,6536,6572]
>      45     main* allocated   32 2:16:1  65536        0      1   (null) none
> wn-v-[7001,7003,7005,7007-7008,7013,7015,7017-7019,7021,7023,7026-7027,7029-7030,7033,7035,7039]
>      19     main* allocated   24 2:12:1  49152        0      1   (null) none
> 
> 
> Scheduling pool data:
> --------------------------------------------------------------------------
>                            Total  Usable   Free   Node   Time       Other
> Pool         Memory  Cpus  Nodes   Nodes  Nodes  Limit  Limit      traits
> --------------------------------------------------------------------------
> main*       65536Mb    32     45      45      0  UNLIM 2-00:00:00
> main*       49152Mb    24     19      19      0  UNLIM 2-00:00:00
> short       65536Mb    32     45      45      0  UNLIM   01:00:00
> short       49152Mb    24     19      19      0  UNLIM   01:00:00
> 
> However, considering that ALL the jobs we run are 1-core jobs, even though 
> every single node has one or more jobs on it, it still has free cores. 
> Looking at the sjstat -v output I see that those two short jobs are held:
> 
> 106495   joosep   1 short   PD   0:00   1:00:00   N/A  (Resources)
> 106496   joosep   1 short   PD   0:00   1:00:00   N/A  (Priority)
> 
> one for resources and the other for priority (because its priority is lower 
> than that of the one held for resources). 
> 
> Now I guess the problem is that there were no free nodes (as opposed to free 
> cores) and the running jobs were from a different partition. How to best 
> solve this? We'd like to have most jobs in the main queue, where their 
> priority is based on fairshare and queue time, but we'd also like to give 
> users the ability to send short testing jobs at any point in time and have 
> them go high up in the queue and execute on the next free cores. We'd rather 
> not dedicate specific nodes to the short queue because that has two distinct 
> disadvantages:
> 1) poor resource utilization if there aren't short jobs all the time
> 2) if those nodes go down for whatever reason, then there are no slots left 
> for short. 
> 
> What is the best way to solve this?
> 
> Thanks,
> 
> Mario Kadastik, PhD
> Researcher
> 
> ---
>  "Physics is like sex, sure it may have practical reasons, but that's not why 
> we do it" 
>     -- Richard P. Feynman

Mario Kadastik, PhD
Researcher

---
  "Physics is like sex, sure it may have practical reasons, but that's not why 
we do it" 
     -- Richard P. Feynman
