[slurm-dev] Re: fairshare incrementing

Ralph Castain Fri, 23 Aug 2013 12:20:38 -0700

Ah, never mind - I see the difference now. Was looking for some info to be 
different



On Aug 23, 2013, at 12:17 PM, Ralph Castain <[email protected]> wrote:

> Perhaps it is a copy/paste error - but those two tables are identical
> 
> On Aug 23, 2013, at 12:14 PM, Alan V. Cowles <[email protected]> wrote:
> 
>> 
>> Final update for the day: we have found what is causing priority to be 
>> overlooked (the affected jobs have no account attached); we just don't know 
>> what is causing that...
>> 
>> [root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R" |grep user1
>> (null)  181378    lowmem testbatc user1  PENDING       0:00 UNLIMITED    1 (Priority)
>> (null)  181379    lowmem testbatc user1  PENDING       0:00 UNLIMITED    1 (Priority)
>> (null)  181380    lowmem testbatc user1  PENDING       0:00 UNLIMITED    1 (Priority)
>> (null)  181381    lowmem testbatc user1  PENDING       0:00 UNLIMITED    1 (Priority)
>> (null)  181382    lowmem testbatc user1  PENDING       0:00 UNLIMITED    1 (Priority)
>> (null)  181383    lowmem testbatc user1  PENDING       0:00 UNLIMITED    1 (Priority)
>> (null)  181384    lowmem testbatc user1  PENDING       0:00 UNLIMITED    1 (Priority)
>> (null)  181385    lowmem testbatc user1  PENDING       0:00 UNLIMITED    1 (Priority)
>> (null)  181386    lowmem testbatc user1  PENDING       0:00 UNLIMITED    1 (Priority)
>> (null)  181387    lowmem testbatc user1  PENDING       0:00 UNLIMITED    1 (Priority)
>> 
>> Compared to:
>> 
>> [root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R" |grep user2
>> account  181378    lowmem testbatc user2  PENDING       0:00 UNLIMITED    1 (Priority)
>> account  181379    lowmem testbatc user2  PENDING       0:00 UNLIMITED    1 (Priority)
>> account  181380    lowmem testbatc user2  PENDING       0:00 UNLIMITED    1 (Priority)
>> account  181381    lowmem testbatc user2  PENDING       0:00 UNLIMITED    1 (Priority)
>> account  181382    lowmem testbatc user2  PENDING       0:00 UNLIMITED    1 (Priority)
>> account  181383    lowmem testbatc user2  PENDING       0:00 UNLIMITED    1 (Priority)
>> account  181384    lowmem testbatc user2  PENDING       0:00 UNLIMITED    1 (Priority)
>> account  181385    lowmem testbatc user2  PENDING       0:00 UNLIMITED    1 (Priority)
>> account  181386    lowmem testbatc user2  PENDING       0:00 UNLIMITED    1 (Priority)
>> account  181387    lowmem testbatc user2  PENDING       0:00 UNLIMITED    1 (Priority)
>> 
>> 
>> We have tried to create new users and new accounts this afternoon and all of 
>> them show (null) as their account when we break out the formatting rules on 
>> sacct.
>> 
>> sacctmgr add account accountname
>> sacctmgr add user username defaultaccount accountname
>> 
>> We even have one case where all users under an account are working fine 
>> except a user we added yesterday... so at some point in the past (logs 
>> aren't helping us thus far) the ability to actually sync up a user and an 
>> account for accounting purposes has left us. Also, I have failed to mention 
>> until now that we are still running Slurm 2.5.4; my apologies for that.
>> 
>> AC
>> 
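The comparison between the two squeue listings above can be automated. The following is only a minimal sketch: it assumes the account (%a) is the first whitespace-separated field and the job ID (%.7i) the second, as in the format string used in this thread. The function name jobs_without_account is hypothetical.

```python
def jobs_without_account(squeue_lines):
    """Return the job IDs of squeue lines whose account field is '(null)'.

    Assumes the account is the first field and the job ID the second,
    matching the --format string quoted in this thread.
    """
    missing = []
    for line in squeue_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "(null)":
            missing.append(fields[1])
    return missing


# Two lines taken from the listings in this thread (unwrapped).
sample = [
    "(null)  181378    lowmem testbatc user1  PENDING       0:00 UNLIMITED    1 (Priority)",
    "account  181378    lowmem testbatc user2  PENDING       0:00 UNLIMITED    1 (Priority)",
]

print(jobs_without_account(sample))
```

Feeding it the full squeue output, one line per job, would list every pending job that has no account attached.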
>> 
>> On 08/23/2013 11:22 AM, Alan V. Cowles wrote:
>>> Sorry to spam the list, but we wanted to keep the updates flowing.
>>> 
>>> We managed to find the issue in the MySQL database we use for job 
>>> accounting: the column for that value was defined as smallint(5), so it was 
>>> cutting larger values off. After some SQL magic we now have the appropriate 
>>> UIDs showing up. A new monkey wrench: some test jobs submitted by user3 
>>> below get their fairshare value of 5000 as expected, just not user2's... we 
>>> just cleared his jobs from the queue and submitted another 100 jobs for 
>>> testing, and none of them got a fairshare value...
>>> 
>>> In his entire history of using our cluster he hasn't submitted over 5000 
>>> jobs, in fact:
>>> 
>>> [root@slurm-master ~]# sacct -c --format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | grep user2 | wc -l
>>> 2573
>>> 
>>> So we can't figure out why he's being overlooked.
>>> 
>>> AC
>>> 
>>> 
>>> On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
>>>> We think we may be onto something. In sacct we were looking at the jobs 
>>>> submitted by the users and found that many users share the same uid number 
>>>> in the Slurm database. It seems to correlate with the size of the user's 
>>>> uid number in our LDAP directory: users whose uid numbers are greater 
>>>> than 65535 get truncated to that number, while users with uid numbers below 
>>>> that keep their correct uid numbers (user2 in the sample output below).
>>>> 
>>>> 
>>>> 
>>>> 
>>>> [root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state |grep user2|head
>>>> user2  27545 30548               bwa node01-1 2013-07-08T13:04:25   00:00:48      0:0 COMPLETED
>>>> user2  27545 30571               bwa node01-1 2013-07-08T15:18:00   00:00:48      0:0 COMPLETED
>>>> user2  27545 30573               bwa node01-1 2013-07-09T09:40:59   00:00:48      0:0 COMPLETED
>>>> user2  27545 30618              grep node01-1 2013-07-09T11:57:12   00:00:48      0:0 COMPLETED
>>>> user2  27545 30619                bc node01-1 2013-07-09T11:58:08   00:00:48      0:0 CANCELLED
>>>> user2  27545 30620                du node01-1 2013-07-09T11:58:19   00:00:48      0:0 COMPLETED
>>>> user2  27545 30621                wc node01-1 2013-07-09T11:58:43   00:00:48      0:0 COMPLETED
>>>> user2  27545 30622              zcat node01-1 2013-07-09T11:58:54   00:00:48      0:0 COMPLETED
>>>> user2  27545 30623              zcat node01-1 2013-07-09T12:12:56   00:00:48      0:0 COMPLETED
>>>> user2  27545 30624              zcat node01-1 2013-07-09T12:26:37   00:00:48      0:0 CANCELLED
>>>> [root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state |grep user1|head
>>>> user1  65535 83           impute2_w+ node01-1 2013-04-17T09:29:47   00:00:48      0:0 FAILED
>>>> user1  65535 84           impute2_w+ node01-1 2013-04-17T09:30:17   00:00:48      0:0 FAILED
>>>> user1  65535 85           impute2_w+ node01-1 2013-04-17T09:30:40   00:00:48      0:0 FAILED
>>>> user1  65535 86           impute2_w+ node01-1 2013-04-17T09:40:45   00:00:48      0:0 FAILED
>>>> user1  65535 87                 date node01-1 2013-04-17T09:42:36   00:00:48      0:0 COMPLETED
>>>> user1  65535 88             hostname node01-1 2013-04-17T09:42:37   00:00:48      0:0 COMPLETED
>>>> user1  65535 89           impute2_w+ node01-1 2013-04-17T09:48:50   00:00:48      0:0 FAILED
>>>> user1  65535 90           impute2_w+ node01-1 2013-04-17T09:48:56   00:00:48      0:0 FAILED
>>>> user1  65535 91           impute2_w+ node01-1 2013-04-17T09:49:56   00:00:48      0:0 FAILED
>>>> user1  65535 92           impute2_w+ node01-1 2013-04-17T09:50:06   00:00:48      0:0 FAILED
>>>> [root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state |grep user3|head
>>>> user3  65535 5             script.sh node09-1 2013-04-09T15:55:07   00:00:48      0:0 FAILED
>>>> user3  65535 6             script.sh node09-1 2013-04-09T15:55:13    INVALID      0:0 COMPLETED
>>>> user3  65535 8                  bash node09-1 2013-04-09T15:57:34   00:00:48      0:0 COMPLETED
>>>> user3  65535 7                  bash node09-1 2013-04-09T15:57:21   00:00:48      0:0 COMPLETED
>>>> user3  65535 23            script.sh node09-1 2013-04-09T16:10:02   00:00:48      0:0 COMPLETED
>>>> user3  65535 27            script.sh node09-+ 2013-04-09T16:18:33   00:00:48      0:0 CANCELLED
>>>> user3  65535 28            script.sh node01-+ 2013-04-09T16:18:55   00:00:48      0:0 CANCELLED
>>>> user3  65535 30            script.sh node01-+ 2013-04-09T16:34:12   00:00:48      0:0 CANCELLED
>>>> user3  65535 31            script.sh node01-+ 2013-04-09T16:34:17   00:00:48      0:0 CANCELLED
>>>> user3  65535 32            script.sh node01-+ 2013-04-09T16:34:21   00:00:48      0:0 CANCELLED
>>>> 
>>>> We are thinking perhaps this could lead to our major issues with the 
>>>> system and priority factoring.
>>>> 
>>>> AC
>>>> 
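The collapse of distinct uid numbers onto 65535 described above is consistent with a 16-bit unsigned column that caps out-of-range values at its maximum. The sketch below only illustrates that effect. It assumes the column was an unsigned SMALLINT on a MySQL server not running in strict mode, which the thread does not confirm. The uid numbers given for user1 and user3 are hypothetical; only user2's 27545 comes from the output above.

```python
from collections import defaultdict

SMALLINT_UNSIGNED_MAX = 65535  # 2**16 - 1, the largest value the column can hold


def store_as_smallint_unsigned(value: int) -> int:
    """Clamp a value into the 0..65535 range instead of rejecting it."""
    return max(0, min(value, SMALLINT_UNSIGNED_MAX))


# user2's uid is taken from the sacct output; the other two are made up
# purely to illustrate uid numbers larger than 65535.
ldap_uids = {"user1": 70001, "user2": 27545, "user3": 81234}

# Group users by the uid that would actually end up in the database.
stored = defaultdict(list)
for user, uid in ldap_uids.items():
    stored[store_as_smallint_unsigned(uid)].append(user)

for uid, users in stored.items():
    print(uid, users)
```

Every user whose real uid number exceeds 65535 ends up under the same stored value, so any accounting keyed on that column can no longer tell those users apart.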
>>>> On 08/23/2013 07:56 AM, Alan V. Cowles wrote:
>>>>> Hey guys,
>>>>> 
>>>>> So in the past we had 3 prioritization factors in effect: partition, age 
>>>>> and fairshare, and they were working wonderfully. Currently partition has 
>>>>> no effect for us, as it's all one large shared partition, so everyone gets 
>>>>> the same value there. So everything is balanced on age and fairshare. In 
>>>>> the past age and fairshare worked splendidly, and we have it set, as I 
>>>>> understand it, to refresh counters every 2 weeks... so basically everyone 
>>>>> had a blank slate this past weekend. Our current issue is as follows...
>>>>> 
>>>>> A problematic user has submitted 70k jobs to a partition with 512 slots 
>>>>> and she is currently consuming all slots... basically locking up the 
>>>>> queue for anybody else that wants to try and work.
>>>>> 
>>>>> Normally fairshare kicks in and jumps other users to the top of the queue 
>>>>> but when a new user submitted 25 jobs (vs the 70k) he didn't get any 
>>>>> fairshare weighting at all...
>>>>> 
>>>>> JOBID     USER     PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS       NICE
>>>>> 162986    uid1         8371        371          0          0       8000          0          0
>>>>> 162987    uid1         8371        371          0          0       8000          0          0
>>>>> 162988    uid1         8371        371          0          0       8000          0          0
>>>>> 180698    uid2         8320        321          0          0       8000          0          0
>>>>> 180699    uid2         8320        321          0          0       8000          0          0
>>>>> 180700    uid2         8320        321          0          0       8000          0          0
>>>>> 180701    uid2         8320        321          0          0       8000          0          0
>>>>> 
>>>>> 
>>>>> I'm used to seeing a user like that get 5000 fairshare to start out 
>>>>> with... Thoughts?
>>>>> 
>>>>> AC
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>
