Ah, never mind - I see the difference now. I was looking for some other info to be different.
On Aug 23, 2013, at 12:17 PM, Ralph Castain <[email protected]> wrote:

> Perhaps it is a copy/paste error - but those two tables are identical
>
> On Aug 23, 2013, at 12:14 PM, Alan V. Cowles <[email protected]> wrote:
>
>> Final update for the day: we have found what is causing priority to be
>> overlooked, we just don't know what is causing that...
>>
>> [root@cluster-login ~]# squeue --format="%a %.7i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R" |grep user1
>> (null)  181378 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
>> (null)  181379 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
>> (null)  181380 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
>> (null)  181381 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
>> (null)  181382 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
>> (null)  181383 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
>> (null)  181384 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
>> (null)  181385 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
>> (null)  181386 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
>> (null)  181387 lowmem testbatc user1 PENDING 0:00 UNLIMITED 1 (Priority)
>>
>> Compared to:
>>
>> [root@cluster-login ~]# squeue --format="%a %.7i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R" |grep user2
>> account 181378 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
>> account 181379 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
>> account 181380 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
>> account 181381 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
>> account 181382 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
>> account 181383 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
>> account 181384 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
>> account 181385 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
>> account 181386 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
>> account 181387 lowmem testbatc user2 PENDING 0:00 UNLIMITED 1 (Priority)
>>
>> We have tried to create new users and new accounts this afternoon, and all
>> of them show (null) as their account when we break out the formatting rules
>> on sacct.
>>
>> sacctmgr add account accountname
>> sacctmgr add user username defaultaccount accountname
>>
>> We even have one case where all users under an account are working fine
>> except a user we added yesterday... so at some point in the past (logs
>> aren't helping us thus far) the ability to actually sync up a user and an
>> account for accounting purposes has left us. Also, I have failed to mention
>> to this point that we are still running Slurm 2.5.4, my apologies for that.
>>
>> AC
>>
>> On 08/23/2013 11:22 AM, Alan V. Cowles wrote:
>>> Sorry to spam the list, but we wanted to keep the updates coming.
>>>
>>> We managed to find the issue in the MySQL database we are using for job
>>> accounting: the column for that value (the uid) was set to smallint(5),
>>> so larger values were being cut off. Some SQL magic later, we now have
>>> the appropriate uids showing up. A new monkey wrench: some test jobs
>>> submitted by user3 below get their fairshare value of 5000 as expected,
>>> just not user2... We just cleared his jobs from the queue and submitted
>>> another 100 jobs for testing, and none of them got a fairshare value...
>>>
>>> In his entire history of using our cluster he hasn't submitted over 5000
>>> jobs, in fact:
>>>
>>> [root@slurm-master ~]# sacct -c --format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | grep user2 | wc -l
>>> 2573
>>>
>>> So we can't figure out why he's being overlooked.
>>>
>>> AC
>>>
>>> On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
>>>> We think we may be onto something: in sacct we were looking at the jobs
>>>> submitted by the users, and found that many users share the same uid
>>>> number in the Slurm database. It seems to correlate with the size of the
>>>> user's uid number in our LDAP directory... users whose uid numbers are
>>>> greater than 65535 get truncated to that number... users with uid
>>>> numbers below that keep their correct uid numbers (user2 in the sample
>>>> output below).
>>>>
>>>> [root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state |grep user2|head
>>>> user2 27545 30548 bwa  node01-1 2013-07-08T13:04:25 00:00:48 0:0 COMPLETED
>>>> user2 27545 30571 bwa  node01-1 2013-07-08T15:18:00 00:00:48 0:0 COMPLETED
>>>> user2 27545 30573 bwa  node01-1 2013-07-09T09:40:59 00:00:48 0:0 COMPLETED
>>>> user2 27545 30618 grep node01-1 2013-07-09T11:57:12 00:00:48 0:0 COMPLETED
>>>> user2 27545 30619 bc   node01-1 2013-07-09T11:58:08 00:00:48 0:0 CANCELLED
>>>> user2 27545 30620 du   node01-1 2013-07-09T11:58:19 00:00:48 0:0 COMPLETED
>>>> user2 27545 30621 wc   node01-1 2013-07-09T11:58:43 00:00:48 0:0 COMPLETED
>>>> user2 27545 30622 zcat node01-1 2013-07-09T11:58:54 00:00:48 0:0 COMPLETED
>>>> user2 27545 30623 zcat node01-1 2013-07-09T12:12:56 00:00:48 0:0 COMPLETED
>>>> user2 27545 30624 zcat node01-1 2013-07-09T12:26:37 00:00:48 0:0 CANCELLED
>>>>
>>>> [root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state |grep user1|head
>>>> user1 65535 83 impute2_w+ node01-1 2013-04-17T09:29:47 00:00:48 0:0 FAILED
>>>> user1 65535 84 impute2_w+ node01-1 2013-04-17T09:30:17 00:00:48 0:0 FAILED
>>>> user1 65535 85 impute2_w+ node01-1 2013-04-17T09:30:40 00:00:48 0:0 FAILED
>>>> user1 65535 86 impute2_w+ node01-1 2013-04-17T09:40:45 00:00:48 0:0 FAILED
>>>> user1 65535 87 date       node01-1 2013-04-17T09:42:36 00:00:48 0:0 COMPLETED
>>>> user1 65535 88 hostname   node01-1 2013-04-17T09:42:37 00:00:48 0:0 COMPLETED
>>>> user1 65535 89 impute2_w+ node01-1 2013-04-17T09:48:50 00:00:48 0:0 FAILED
>>>> user1 65535 90 impute2_w+ node01-1 2013-04-17T09:48:56 00:00:48 0:0 FAILED
>>>> user1 65535 91 impute2_w+ node01-1 2013-04-17T09:49:56 00:00:48 0:0 FAILED
>>>> user1 65535 92 impute2_w+ node01-1 2013-04-17T09:50:06 00:00:48 0:0 FAILED
>>>>
>>>> [root@slurm-master ~]# sacct -c --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state |grep user3|head
>>>> user3 65535 5  script.sh node09-1 2013-04-09T15:55:07 00:00:48 0:0 FAILED
>>>> user3 65535 6  script.sh node09-1 2013-04-09T15:55:13 INVALID  0:0 COMPLETED
>>>> user3 65535 8  bash      node09-1 2013-04-09T15:57:34 00:00:48 0:0 COMPLETED
>>>> user3 65535 7  bash      node09-1 2013-04-09T15:57:21 00:00:48 0:0 COMPLETED
>>>> user3 65535 23 script.sh node09-1 2013-04-09T16:10:02 00:00:48 0:0 COMPLETED
>>>> user3 65535 27 script.sh node09-+ 2013-04-09T16:18:33 00:00:48 0:0 CANCELLED
>>>> user3 65535 28 script.sh node01-+ 2013-04-09T16:18:55 00:00:48 0:0 CANCELLED
>>>> user3 65535 30 script.sh node01-+ 2013-04-09T16:34:12 00:00:48 0:0 CANCELLED
>>>> user3 65535 31 script.sh node01-+ 2013-04-09T16:34:17 00:00:48 0:0 CANCELLED
>>>> user3 65535 32 script.sh node01-+ 2013-04-09T16:34:21 00:00:48 0:0 CANCELLED
>>>>
>>>> We are thinking perhaps this could lead to our major issues with the
>>>> system and priority factoring.
>>>>
>>>> AC
>>>>
>>>> On 08/23/2013 07:56 AM, Alan V. Cowles wrote:
>>>>> Hey guys,
>>>>>
>>>>> So in the past we had 3 prioritization factors in effect: partition,
>>>>> age and fairshare, and they were working wonderfully. Currently
>>>>> partition has no effect for us as it's all one large shared partition,
>>>>> so everyone gets the same value there. So everything is balanced in age
>>>>> and fairshare. In the past age and fairshare worked splendidly, and we
>>>>> have it set, as I understand it, to refresh counters every 2 weeks...
>>>>> so basically everyone had a blank slate this past weekend. Our current
>>>>> issue is as follows...
>>>>>
>>>>> A problematic user has submitted 70k jobs to a partition with 512 slots
>>>>> and she is currently consuming all slots... basically locking up the
>>>>> queue for anybody else that wants to try and work.
>>>>>
>>>>> Normally fairshare kicks in and jumps other users to the top of the
>>>>> queue, but when a new user submitted 25 jobs (vs the 70k) he didn't get
>>>>> any fairshare weighting at all...
>>>>>
>>>>>  JOBID USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS NICE
>>>>> 162986 uid1     8371 371         0       0      8000   0    0
>>>>> 162987 uid1     8371 371         0       0      8000   0    0
>>>>> 162988 uid1     8371 371         0       0      8000   0    0
>>>>> 180698 uid2     8320 321         0       0      8000   0    0
>>>>> 180699 uid2     8320 321         0       0      8000   0    0
>>>>> 180700 uid2     8320 321         0       0      8000   0    0
>>>>> 180701 uid2     8320 321         0       0      8000   0    0
>>>>>
>>>>> I'm used to seeing a user like that get 5000 fairshare to start out
>>>>> with... Thoughts?
>>>>>
>>>>> AC
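
As a rough illustration of the uid collision described in the 10:31 AM message: if the accounting table stores uids in an unsigned SMALLINT column and MySQL is not running in strict mode, any value above 65535 is stored as 65535, so every such user ends up sharing one uid. The Python sketch below only mimics that clamping. It is not Slurm or MySQL code, the function names are made up for illustration, and the LDAP uids for user1 and user3 are invented (only user2's 27545 comes from the thread).

```python
from typing import Dict, List

# Largest value an unsigned 16-bit SMALLINT can hold (2**16 - 1).
SMALLINT_UNSIGNED_MAX = 65535


def store_uid_smallint(uid: int) -> int:
    """Mimic a non-strict insert into an unsigned SMALLINT column.

    Assumption: out-of-range values are clamped to the nearest end of the
    column's range rather than rejected.
    """
    return max(0, min(uid, SMALLINT_UNSIGNED_MAX))


def colliding_users(ldap_uids: Dict[str, int]) -> Dict[int, List[str]]:
    """Group user names by stored uid and keep only uids shared by several users."""
    by_stored: Dict[int, List[str]] = {}
    for name, uid in sorted(ldap_uids.items()):
        by_stored.setdefault(store_uid_smallint(uid), []).append(name)
    return {uid: names for uid, names in by_stored.items() if len(names) > 1}


# user2's uid is taken from the thread; the other two are hypothetical.
example_uids = {"user1": 70001, "user2": 27545, "user3": 123456}
collisions = colliding_users(example_uids)
```

With these example values, user2 keeps 27545 while user1 and user3 both collapse to 65535, which matches the pattern in the sacct output above.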
