There is an error in creating a second account for a user that results in 
subsequent jobs submitted by that user being run under the wrong default 
account.  The wrong account persists until slurmctld is stopped and 
restarted.  This occurs running 'slurmdbd' and using 'MySQL' for the 
database.  Here is an simple example:

Create a new account and then add a new user:

>sacctmgr add account name=acct01
>sacctmgr add user name=user01 account=acct01

After confirming the commit, using "mysql" to dump the records (some rows 
omitted for brevity) shows:

mysql> select id_assoc,is_def,user,acct,parent_acct from 
dja_cluster_assoc_table;
+----------+--------+---------+--------+-------------+
| id_assoc | is_def | user    | acct   | parent_acct |
+----------+--------+---------+--------+-------------+
|        1 |      0 |         | root   |             |
|       18 |      0 |         | acct01 | root        |
|       19 |      1 | user01  | acct01 |             |
+----------+--------+---------+--------+-------------+

which shows that the "add user" has marked "acct01" as the default account 
("is_def = 1") for this user.

The user "user01" can now submit jobs using "srun".   He can explicitly 
specify "account=acct01" on the command, or if not, the default will be 
"acct01", as shown by:

>srun hostname
stag
>sacct
       JobID    JobName  Partition    Account  AllocCPUS      State 
ExitCode
------------ ---------- ---------- ---------- ---------- ---------- 
--------
55             hostname    phoenix     acct01          1  COMPLETED 0:0

If we now add a second account for the same user:

>sacctmgr add user name=user01 account=acct02

and dump the records:

mysql> select id_assoc,is_def,user,acct,parent_acct from 
dja_cluster_assoc_table;
+----------+--------+---------+--------+-------------+
| id_assoc | is_def | user    | acct   | parent_acct |
+----------+--------+---------+--------+-------------+
|        1 |      0 |         | root   |             |
|       18 |      0 |         | acct01 | root        |
|       19 |      1 | user01  | acct01 |             |
|       20 |      0 |         | acct02 | root        |
|       21 |      0 | user01  | acct02 |             |
+----------+--------+---------+--------+-------------+

the user "user01" now has an association that includes "acct02".  But note 
that "acct01" is still marked as the default account for "user01".  But if 
we submit a job with "srun" now,  it gets run under account "acct02".

>srun hostname
stag
>sacct
       JobID    JobName  Partition    Account  AllocCPUS      State 
ExitCode
------------ ---------- ---------- ---------- ---------- ---------- 
--------
55             hostname    phoenix     acct01          1  COMPLETED 0:0
56             hostname    phoenix     acct02          1  COMPLETED 0:0

This situation persists until "slurmctld" is stopped and restarted.  After 
the restart, the "srun" with no explicit account parameter gets run under 
the default of "acct01", as it should.

I went round and round for a while with "sacctmgr", "slurmdbd", and 
"slurmctld",   but I think I have finally figured out what is happening. 
It starts with "sacctmgr", in the "sacctmgr_add_user" function,  where an 
"association record" is allocated and initialized.  The "is_def" field, 
like most of the others, is initialized to "NO_VAL".    During the rest of 
the processing, if a new user is being added, the assoc->acct is set and 
"is_def" ends up being set to "1".   But if the user already exists, and 
we are just creating a new association for that user, then "is_def" is not 
modified, and remains as "NO_VAL". 

Eventually the list of one or more associations is passed to 
"acct_storage_g_add_associations", which results in a RPC call to 
"slurmdbd",  where the information ends up being sent to "mysqld" to be 
added to the database.  All this time, the "is_def" field for the 
association with the "acct02" account is still "NO_VAL".   Somehow this 
field ends up as "0" in the actual database,  but that is not what 
concerns us here.   When the data "commit" is done,  slurmdbd also does an 
RPC call to "slurmctld" with a message type of "ACCOUNTING_UPDATE_MSG" to 
pass the changes to slurmctld, and passes the same set of association 
records.   In "assoc_mgr.c", in function "_set_user_default_acct",  the 
value of "is_def" is tested for non-zero, and if non-zero, the "acct" is 
set as the default account in the "user" record in the 
"assoc_mgr_user_list".   Of course, both "1" and "NO_VAL" are non-zero, so 
the default account is set to "acct02".   So for the life of the 
slurmctld,  the default for "user01" stays "acct02", even though the 
default in the database records is "acct01".   When slurmctld is stopped 
and restarted, the internal tables are rebuilt from the database records, 
so everything is set the way it should be.

The easiest way to fix this is to change the test of "is_def" in 
"assoc_mgr.c" to be an explicit test for "1" instead of just for non-zero. 
  That is what the patch below does.   This appears to solve the problem.  
Another possibility is to explicitly initialize "is_def" in the 
association record to "0" instead of "NO_VAL", but I don't know if this 
has implications elsewhere in the code.

        -Don Albert-

The following patch is based on SLURM version 2.2.5.

[stag] (dalbert) common> cvs diff -u assoc_mgr.c
Index: assoc_mgr.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/common/assoc_mgr.c,v
retrieving revision 1.1.1.32.2.1
diff -u -r1.1.1.32.2.1 assoc_mgr.c
--- assoc_mgr.c 6 May 2011 17:27:01 -0000       1.1.1.32.2.1
+++ assoc_mgr.c 26 May 2011 00:00:57 -0000
@@ -293,7 +293,7 @@
        xassert(assoc_mgr_user_list);

        /* set up the default if this is it */
-       if (assoc->is_def && (assoc->uid != NO_VAL)) {
+       if ((assoc->is_def == 1) && (assoc->uid != NO_VAL)) {
                slurmdb_user_rec_t *user = NULL;
                ListIterator user_itr =
                        list_iterator_create(assoc_mgr_user_list);

Reply via email to