We think we may be onto something. Looking in sacct at the jobs submitted by our users, we found that many users share the same uid number in the Slurm database. It seems to correlate with the size of the user's uid number in our LDAP directory: users whose uid number is greater than 65535 (the largest value a 16-bit unsigned field can hold) all get recorded as 65535, while users with uid numbers below that keep their correct uid numbers (user2 in the sample output below).




[root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state | grep user2 | head
user2  27545 30548               bwa node01-1 2013-07-08T13:04:25   00:00:48      0:0                  COMPLETED
user2  27545 30571               bwa node01-1 2013-07-08T15:18:00   00:00:48      0:0                  COMPLETED
user2  27545 30573               bwa node01-1 2013-07-09T09:40:59   00:00:48      0:0                  COMPLETED
user2  27545 30618              grep node01-1 2013-07-09T11:57:12   00:00:48      0:0                  COMPLETED
user2  27545 30619                bc node01-1 2013-07-09T11:58:08   00:00:48      0:0                  CANCELLED
user2  27545 30620                du node01-1 2013-07-09T11:58:19   00:00:48      0:0                  COMPLETED
user2  27545 30621                wc node01-1 2013-07-09T11:58:43   00:00:48      0:0                  COMPLETED
user2  27545 30622              zcat node01-1 2013-07-09T11:58:54   00:00:48      0:0                  COMPLETED
user2  27545 30623              zcat node01-1 2013-07-09T12:12:56   00:00:48      0:0                  COMPLETED
user2  27545 30624              zcat node01-1 2013-07-09T12:26:37   00:00:48      0:0                  CANCELLED
[root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state | grep user1 | head
user1  65535 83           impute2_w+ node01-1 2013-04-17T09:29:47   00:00:48      0:0                     FAILED
user1  65535 84           impute2_w+ node01-1 2013-04-17T09:30:17   00:00:48      0:0                     FAILED
user1  65535 85           impute2_w+ node01-1 2013-04-17T09:30:40   00:00:48      0:0                     FAILED
user1  65535 86           impute2_w+ node01-1 2013-04-17T09:40:45   00:00:48      0:0                     FAILED
user1  65535 87                 date node01-1 2013-04-17T09:42:36   00:00:48      0:0                  COMPLETED
user1  65535 88             hostname node01-1 2013-04-17T09:42:37   00:00:48      0:0                  COMPLETED
user1  65535 89           impute2_w+ node01-1 2013-04-17T09:48:50   00:00:48      0:0                     FAILED
user1  65535 90           impute2_w+ node01-1 2013-04-17T09:48:56   00:00:48      0:0                     FAILED
user1  65535 91           impute2_w+ node01-1 2013-04-17T09:49:56   00:00:48      0:0                     FAILED
user1  65535 92           impute2_w+ node01-1 2013-04-17T09:50:06   00:00:48      0:0                     FAILED
[root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state | grep user3 | head
user3  65535 5             script.sh node09-1 2013-04-09T15:55:07   00:00:48      0:0                     FAILED
user3  65535 6             script.sh node09-1 2013-04-09T15:55:13    INVALID      0:0                  COMPLETED
user3  65535 8                  bash node09-1 2013-04-09T15:57:34   00:00:48      0:0                  COMPLETED
user3  65535 7                  bash node09-1 2013-04-09T15:57:21   00:00:48      0:0                  COMPLETED
user3  65535 23            script.sh node09-1 2013-04-09T16:10:02   00:00:48      0:0                  COMPLETED
user3  65535 27            script.sh node09-+ 2013-04-09T16:18:33   00:00:48      0:0                  CANCELLED
user3  65535 28            script.sh node01-+ 2013-04-09T16:18:55   00:00:48      0:0                  CANCELLED
user3  65535 30            script.sh node01-+ 2013-04-09T16:34:12   00:00:48      0:0                  CANCELLED
user3  65535 31            script.sh node01-+ 2013-04-09T16:34:17   00:00:48      0:0                  CANCELLED
user3  65535 32            script.sh node01-+ 2013-04-09T16:34:21   00:00:48      0:0                  CANCELLED

We suspect this could be what is behind our major issues with the system and priority factoring.
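For illustration, here is a minimal Python sketch of the two ways a large uid could end up in a 16-bit unsigned field. The LDAP uid numbers for user1 and user3 are hypothetical (only 27545 for user2 comes from the sacct output above), and where the narrowing actually happens in our setup is not established here. The point is only that every affected user showing exactly 65535 looks like saturation (clamping) rather than wrap-around, which would scatter the values.

```python
UINT16_MAX = 0xFFFF  # 65535, the largest value a 16-bit unsigned field can hold


def clamp_to_uint16(uid: int) -> int:
    """Saturate: any uid above 65535 is stored as 65535."""
    return min(uid, UINT16_MAX)


def wrap_to_uint16(uid: int) -> int:
    """Wrap around: keep only the low 16 bits of the uid."""
    return uid & UINT16_MAX


# Hypothetical LDAP uid numbers for user1 and user3; user2's 27545 is
# the only value taken from the sacct output.
ldap_uids = {"user1": 70001, "user2": 27545, "user3": 81234}

for name, uid in ldap_uids.items():
    print(f"{name}: ldap={uid} clamped={clamp_to_uint16(uid)} "
          f"wrapped={wrap_to_uint16(uid)}")
```

With clamping, user1 and user3 both collapse onto 65535 and become indistinguishable by uid, which matches what sacct reports; with wrapping they would get two different, smaller numbers.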

AC

On 08/23/2013 07:56 AM, Alan V. Cowles wrote:
Hey guys,

So in the past we had 3 prioritization factors in effect: partition, age and fairshare, and they were working wonderfully. Currently partition has no effect for us, as it's all one large shared partition, so everyone gets the same value there. That leaves everything balanced between age and fairshare. Those two have worked splendidly, and as I understand it we have the counters set to refresh every 2 weeks, so basically everyone had a blank slate this past weekend. Our current issue is as follows...

A problematic user has submitted 70k jobs to a partition with 512 slots and she is currently consuming all slots... basically locking up the queue for anybody else that wants to try and work.

Normally fairshare kicks in and jumps other users to the top of the queue but when a new user submitted 25 jobs (vs the 70k) he didn't get any fairshare weighting at all...

JOBID   USER  PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
162986  uid1      8371  371          0        0       8000    0     0
162987  uid1      8371  371          0        0       8000    0     0
162988  uid1      8371  371          0        0       8000    0     0
180698  uid2      8320  321          0        0       8000    0     0
180699  uid2      8320  321          0        0       8000    0     0
180700  uid2      8320  321          0        0       8000    0     0
180701  uid2      8320  321          0        0       8000    0     0


I'm used to seeing a user like that get 5000 fairshare to start out with... Thoughts?

AC


