There is an error in creating a second account for a user that results in
subsequent jobs submitted by that user being run under the wrong default
account. The wrong account persists until slurmctld is stopped and
restarted. This occurs when running 'slurmdbd' with 'MySQL' as the
database. Here is a simple example:
Create a new account and then add a new user:
>sacctmgr add account name=acct01
>sacctmgr add user name=user01 account=acct01
After confirming the commit, using "mysql" to dump the records (some rows
omitted for brevity) shows:
mysql> select id_assoc,is_def,user,acct,parent_acct from dja_cluster_assoc_table;
+----------+--------+---------+--------+-------------+
| id_assoc | is_def | user    | acct   | parent_acct |
+----------+--------+---------+--------+-------------+
|        1 |      0 |         | root   |             |
|       18 |      0 |         | acct01 | root        |
|       19 |      1 | user01  | acct01 |             |
+----------+--------+---------+--------+-------------+
which shows that the "add user" has marked "acct01" as the default account
("is_def = 1") for this user.
The user "user01" can now submit jobs using "srun". He can explicitly
specify "account=acct01" on the command, or if not, the default will be
"acct01", as shown by:
>srun hostname
stag
>sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
55             hostname    phoenix     acct01          1  COMPLETED      0:0
If we now add a second account for the same user:
>sacctmgr add user name=user01 account=acct02
and dump the records:
mysql> select id_assoc,is_def,user,acct,parent_acct from dja_cluster_assoc_table;
+----------+--------+---------+--------+-------------+
| id_assoc | is_def | user    | acct   | parent_acct |
+----------+--------+---------+--------+-------------+
|        1 |      0 |         | root   |             |
|       18 |      0 |         | acct01 | root        |
|       19 |      1 | user01  | acct01 |             |
|       20 |      0 |         | acct02 | root        |
|       21 |      0 | user01  | acct02 |             |
+----------+--------+---------+--------+-------------+
the user "user01" now has an association that includes "acct02". Note
that "acct01" is still marked as the default account for "user01". Yet if
we submit a job with "srun" now, it runs under account "acct02":
>srun hostname
stag
>sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
55             hostname    phoenix     acct01          1  COMPLETED      0:0
56             hostname    phoenix     acct02          1  COMPLETED      0:0
This situation persists until "slurmctld" is stopped and restarted. After
the restart, the "srun" with no explicit account parameter gets run under
the default of "acct01", as it should.
I went round and round for a while with "sacctmgr", "slurmdbd", and
"slurmctld", but I think I have finally figured out what is happening.
It starts with "sacctmgr", in the "sacctmgr_add_user" function, where an
"association record" is allocated and initialized. The "is_def" field,
like most of the others, is initialized to "NO_VAL". During the rest of
the processing, if a new user is being added, the assoc->acct is set and
"is_def" ends up being set to "1". But if the user already exists, and
we are just creating a new association for that user, then "is_def" is not
modified, and remains as "NO_VAL".
Eventually the list of one or more associations is passed to
"acct_storage_g_add_associations", which results in an RPC call to
"slurmdbd", where the information ends up being sent to "mysqld" to be
added to the database. All this time, the "is_def" field for the
association with the "acct02" account is still "NO_VAL". Somehow this
field ends up as "0" in the actual database, but that is not what
concerns us here. When the data "commit" is done, slurmdbd also does an
RPC call to "slurmctld" with a message type of "ACCOUNTING_UPDATE_MSG" to
pass the changes to slurmctld, and passes the same set of association
records. In "assoc_mgr.c", in function "_set_user_default_acct", the
value of "is_def" is tested for non-zero, and if non-zero, the "acct" is
set as the default account in the "user" record in the
"assoc_mgr_user_list". Of course, both "1" and "NO_VAL" are non-zero, so
the default account is set to "acct02". So for the life of the
slurmctld, the default for "user01" stays "acct02", even though the
default in the database records is "acct01". When slurmctld is stopped
and restarted, the internal tables are rebuilt from the database records,
so everything is set the way it should be.
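The difference between a non-zero test and an explicit test for "1" can be illustrated with a small sketch. The types and the function below are hypothetical and much simpler than the real "assoc_mgr.c" code; "SKETCH_NO_VAL" is an assumed stand-in for the sentinel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for SLURM's NO_VAL sentinel (value assumed, non-zero). */
#define SKETCH_NO_VAL 0xfffffffeu

/* Hypothetical in-memory user record kept by the controller. */
struct sketch_user {
	const char *name;
	const char *default_acct;
};

/* Hypothetical association update received from the database daemon. */
struct sketch_update {
	const char *acct;
	uint32_t is_def;
};

/* Apply an update to the user record. With strict == false the test
 * is "is_def is non-zero"; with strict == true it is "is_def == 1". */
static void sketch_apply_update(struct sketch_user *user,
				const struct sketch_update *upd,
				bool strict)
{
	assert(user != NULL && upd != NULL);

	bool is_default = strict ? (upd->is_def == 1) : (upd->is_def != 0);

	if (is_default)
		user->default_acct = upd->acct;
}
```

With an update of { "acct02", SKETCH_NO_VAL }, the non-strict test overwrites the default account, while the strict test leaves "acct01" in place.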
The easiest way to fix this is to change the test of "is_def" in
"assoc_mgr.c" to be an explicit test for "1" instead of just for non-zero.
That is what the patch below does. This appears to solve the problem.
Another possibility is to explicitly initialize "is_def" in the
association record to "0" instead of "NO_VAL", but I don't know if this
has implications elsewhere in the code.
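For comparison, the second possibility might look something like the sketch below. The initializer and struct are hypothetical, not the real SLURM initialization routine, and whether other code depends on "is_def" starting out as "NO_VAL" is not checked here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for SLURM's NO_VAL sentinel (value assumed, non-zero). */
#define SKETCH_NO_VAL 0xfffffffeu

/* Hypothetical, trimmed-down association record. */
struct sketch_assoc {
	const char *acct;
	uint32_t uid;
	uint32_t is_def;
};

/* Hypothetical initializer: other fields keep the sentinel, but
 * "is_def" starts at 0 so that "unset" and "not default" coincide. */
static void sketch_init_assoc(struct sketch_assoc *assoc)
{
	assert(assoc != NULL);

	assoc->acct = NULL;
	assoc->uid = SKETCH_NO_VAL;
	assoc->is_def = 0;
}

/* The unchanged non-zero test now behaves as intended. */
static int sketch_marks_default(const struct sketch_assoc *assoc)
{
	return assoc->is_def != 0;
}
```

With this initialization, a record whose "is_def" was never set no longer passes the non-zero test, so the existing check would not need to change.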
-Don Albert-
The following patch is based on SLURM version 2.2.5.
[stag] (dalbert) common> cvs diff -u assoc_mgr.c
Index: assoc_mgr.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/common/assoc_mgr.c,v
retrieving revision 1.1.1.32.2.1
diff -u -r1.1.1.32.2.1 assoc_mgr.c
--- assoc_mgr.c 6 May 2011 17:27:01 -0000 1.1.1.32.2.1
+++ assoc_mgr.c 26 May 2011 00:00:57 -0000
@@ -293,7 +293,7 @@
 	xassert(assoc_mgr_user_list);
 
 	/* set up the default if this is it */
-	if (assoc->is_def && (assoc->uid != NO_VAL)) {
+	if ((assoc->is_def == 1) && (assoc->uid != NO_VAL)) {
 		slurmdb_user_rec_t *user = NULL;
 		ListIterator user_itr =
 			list_iterator_create(assoc_mgr_user_list);