[slurm-dev] Re: fairshare incrementing

Alan V. Cowles Fri, 23 Aug 2013 07:33:53 -0700

We think we may be onto something. In sacct we were looking at the jobs submitted by the users, and found that many users share the same uid number in the slurm database. It seems to correlate with the size of the user's uid number in our ldap directory: users whose uid numbers are greater than 65535 get truncated to that number, while users with uid numbers below that keep their correct uid numbers (user2 in the sample output below).





[root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state | grep user2 | head
user2  27545 30548               bwa node01-1 2013-07-08T13:04:25   00:00:48      0:0                  COMPLETED
user2  27545 30571               bwa node01-1 2013-07-08T15:18:00   00:00:48      0:0                  COMPLETED
user2  27545 30573               bwa node01-1 2013-07-09T09:40:59   00:00:48      0:0                  COMPLETED
user2  27545 30618              grep node01-1 2013-07-09T11:57:12   00:00:48      0:0                  COMPLETED
user2  27545 30619                bc node01-1 2013-07-09T11:58:08   00:00:48      0:0                  CANCELLED
user2  27545 30620                du node01-1 2013-07-09T11:58:19   00:00:48      0:0                  COMPLETED
user2  27545 30621                wc node01-1 2013-07-09T11:58:43   00:00:48      0:0                  COMPLETED
user2  27545 30622              zcat node01-1 2013-07-09T11:58:54   00:00:48      0:0                  COMPLETED
user2  27545 30623              zcat node01-1 2013-07-09T12:12:56   00:00:48      0:0                  COMPLETED
user2  27545 30624              zcat node01-1 2013-07-09T12:26:37   00:00:48      0:0                  CANCELLED
[root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state | grep user1 | head
user1  65535 83           impute2_w+ node01-1 2013-04-17T09:29:47   00:00:48      0:0                     FAILED
user1  65535 84           impute2_w+ node01-1 2013-04-17T09:30:17   00:00:48      0:0                     FAILED
user1  65535 85           impute2_w+ node01-1 2013-04-17T09:30:40   00:00:48      0:0                     FAILED
user1  65535 86           impute2_w+ node01-1 2013-04-17T09:40:45   00:00:48      0:0                     FAILED
user1  65535 87                 date node01-1 2013-04-17T09:42:36   00:00:48      0:0                  COMPLETED
user1  65535 88             hostname node01-1 2013-04-17T09:42:37   00:00:48      0:0                  COMPLETED
user1  65535 89           impute2_w+ node01-1 2013-04-17T09:48:50   00:00:48      0:0                     FAILED
user1  65535 90           impute2_w+ node01-1 2013-04-17T09:48:56   00:00:48      0:0                     FAILED
user1  65535 91           impute2_w+ node01-1 2013-04-17T09:49:56   00:00:48      0:0                     FAILED
user1  65535 92           impute2_w+ node01-1 2013-04-17T09:50:06   00:00:48      0:0                     FAILED
[root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state | grep user3 | head
user3  65535 5             script.sh node09-1 2013-04-09T15:55:07   00:00:48      0:0                     FAILED
user3  65535 6             script.sh node09-1 2013-04-09T15:55:13    INVALID      0:0                  COMPLETED
user3  65535 8                  bash node09-1 2013-04-09T15:57:34   00:00:48      0:0                  COMPLETED
user3  65535 7                  bash node09-1 2013-04-09T15:57:21   00:00:48      0:0                  COMPLETED
user3  65535 23            script.sh node09-1 2013-04-09T16:10:02   00:00:48      0:0                  COMPLETED
user3  65535 27            script.sh node09-+ 2013-04-09T16:18:33   00:00:48      0:0                  CANCELLED
user3  65535 28            script.sh node01-+ 2013-04-09T16:18:55   00:00:48      0:0                  CANCELLED
user3  65535 30            script.sh node01-+ 2013-04-09T16:34:12   00:00:48      0:0                  CANCELLED
user3  65535 31            script.sh node01-+ 2013-04-09T16:34:17   00:00:48      0:0                  CANCELLED
user3  65535 32            script.sh node01-+ 2013-04-09T16:34:21   00:00:48      0:0                  CANCELLED

We think this could be what is behind our major issues with the system and priority factoring.
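
A rough way to check how many accounts are affected is to list the directory accounts whose uid is above 65535, and to list the uids that the accounting data records for more than one user name. The sketch below is only an illustration: the function names `wide_uids` and `shared_uids` are made up here, the first assumes that `getent passwd` is able to enumerate the ldap users (enumeration is sometimes disabled), and the second assumes sacct output in the pipe-separated form produced by `-n -P --format=User,UID`.

```shell
#!/bin/sh
# Hypothetical helper: read passwd-format lines (name:passwd:uid:gid:...)
# on stdin and print "name uid" for every account whose uid is above 65535.
wide_uids() {
    awk -F: '$3 > 65535 { print $1, $3 }'
}

# Hypothetical helper: read "user|uid" lines on stdin and print every uid
# that is recorded for more than one distinct user name, followed by a
# comma-separated list of those names.
shared_uids() {
    sort -u | awk -F'|' '
        $1 != "" && $2 != "" {
            count[$2]++
            names[$2] = (names[$2] == "" ? $1 : names[$2] "," $1)
        }
        END {
            for (uid in count)
                if (count[uid] > 1)
                    print uid, names[uid]
        }' | sort -n
}

# Example usage (not run here; both commands need a live system):
#   getent passwd | wide_uids
#   sacct -a -n -P --format=User,UID | shared_uids
```

If every name printed by the first command also shows up next to 65535 in the second, that would be consistent with the uid being stored in a field that cannot hold values above 65535.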


AC

On 08/23/2013 07:56 AM, Alan V. Cowles wrote:

Hey guys,
So in the past we had 3 prioritization factors in effect: partition, age and fairshare, and they were working wonderfully. Currently partition has no effect for us, as it's all one large shared partition, so everyone gets the same value there. That leaves everything balanced between age and fairshare. In the past age and fairshare worked splendidly, and we have it set, as I understand it, to refresh counters every 2 weeks, so basically everyone had a blank slate this past weekend. Our current issue is as follows.
A problematic user has submitted 70k jobs to a partition with 512 slots and she is currently consuming all slots, basically locking up the queue for anybody else that wants to try and work.
Normally fairshare kicks in and jumps other users to the top of the queue, but when a new user submitted 25 jobs (vs the 70k) he didn't get any fairshare weighting at all:
JOBID   USER  PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
162986  uid1      8371  371          0        0       8000    0     0
162987  uid1      8371  371          0        0       8000    0     0
162988  uid1      8371  371          0        0       8000    0     0
180698  uid2      8320  321          0        0       8000    0     0
180699  uid2      8320  321          0        0       8000    0     0
180700  uid2      8320  321          0        0       8000    0     0
180701  uid2      8320  321          0        0       8000    0     0
I'm used to seeing a user like that get 5000 fairshare to start out with... Thoughts?
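
To see at a glance which users are getting no fairshare component at all, the priority listing can be summarised per user. This is a minimal sketch, not a definitive tool: the function name `zero_fairshare_users` is made up, and it assumes the column order in the header above (JOBID, USER, PRIORITY, AGE, FAIRSHARE, ...), which may differ between Slurm versions.

```shell
#!/bin/sh
# Hypothetical helper: read priority rows on stdin, assuming the columns
# JOBID USER PRIORITY AGE FAIRSHARE ... in that order, and print
# "user job_count" for every user whose jobs all show a fairshare of 0.
zero_fairshare_users() {
    awk '
        $1 ~ /^[0-9]+$/ {                 # skip the header and blank lines
            jobs[$2]++
            if ($5 + 0 != 0) nonzero[$2] = 1
        }
        END {
            for (u in jobs)
                if (!(u in nonzero))
                    print u, jobs[u]
        }' | sort
}

# Example usage (not run here; needs a live system):
#   sprio -l | zero_fairshare_users
```

A user who has not run anything recently and still appears in this list is the case described above.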
AC
