Hey Don, thanks for the great analysis of this bug. Attached is a more
complete patch fixing the real problem and also using your sanity check.
This will be in 2.2.6.
Danny
>
> There is an error in creating a second account for a user that results in
> subsequent jobs submitted by that user being run under the wrong default
> account. The wrong account persists until slurmctld is stopped and
> restarted. This occurs running 'slurmdbd' and using 'MySQL' for the
> database. Here is an simple example:
>
> Create a new account and then add a new user:
>
> >sacctmgr add account name=acct01
> >sacctmgr add user name=user01 account=acct01
>
> After confirming the commit, using "mysql" to dump the records (some rows
> omitted for brevity) shows:
>
> mysql> select id_assoc,is_def,user,acct,parent_acct from
> dja_cluster_assoc_table;
> +----------+--------+---------+--------+-------------+
> | id_assoc | is_def | user | acct | parent_acct |
> +----------+--------+---------+--------+-------------+
> | 1 | 0 | | root | |
> | 18 | 0 | | acct01 | root |
> | 19 | 1 | user01 | acct01 | |
> +----------+--------+---------+--------+-------------+
>
> which shows that the "add user" has marked "acct01" as the default account
> ("is_def = 1") for this user.
>
> The user "user01" can now submit jobs using "srun". He can explicitly
> specify "account=acct01" on the command, or if not, the default will be
> "acct01", as shown by:
>
> >srun hostname
> stag
> >sacct
> JobID JobName Partition Account AllocCPUS State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 55 hostname phoenix acct01 1 COMPLETED 0:0
>
> If we now add a second account for the same user:
>
> >sacctmgr add user name=user01 account=acct02
>
> and dump the records:
>
> mysql> select id_assoc,is_def,user,acct,parent_acct from
> dja_cluster_assoc_table;
> +----------+--------+---------+--------+-------------+
> | id_assoc | is_def | user | acct | parent_acct |
> +----------+--------+---------+--------+-------------+
> | 1 | 0 | | root | |
> | 18 | 0 | | acct01 | root |
> | 19 | 1 | user01 | acct01 | |
> | 20 | 0 | | acct02 | root |
> | 21 | 0 | user01 | acct02 | |
> +----------+--------+---------+--------+-------------+
>
> the user "user01" now has an association that includes "acct02". But note
> that "acct01" is still marked as the default account for "user01". But if we
> submit a job with "srun" now, it gets run under account "acct02".
>
> >srun hostname
> stag
> >sacct
> JobID JobName Partition Account AllocCPUS State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 55 hostname phoenix acct01 1 COMPLETED 0:0
> 56 hostname phoenix acct02 1 COMPLETED 0:0
>
> This situation persists until "slurmctld" is stopped and restarted. After
> the restart, the "srun" with no explicit account parameter gets run under the
> default of "acct01", as it should.
>
> I went round and round for a while with "sacctmgr", "slurmdbd", and
> "slurmctld", but I think I have finally figured out what is happening. It
> starts with "sacctmgr", in the "sacctmgr_add_user" function, where an
> "association record" is allocated and initialized. The "is_def" field, like
> most of the others, is initialized to "NO_VAL". During the rest of the
> processing, if a new user is being added, the assoc->acct is set and "is_def"
> ends up being set to "1". But if the user already exists, and we are just
> creating a new association for that user, then "is_def" is not modified, and
> remains as "NO_VAL".
>
> Eventually the list of one or more associations is passed to
> "acct_storage_g_add_associations", which results in a RPC call to "slurmdbd",
> where the information ends up being sent to "mysqld" to be added to the
> database. All this time, the "is_def" field for the association with the
> "acct02" account is still "NO_VAL". Somehow this field ends up as "0" in
> the actual database, but that is not what concerns us here. When the data
> "commit" is done, slurmdbd also does an RPC call to "slurmctld" with a
> message type of "ACCOUNTING_UPDATE_MSG" to pass the changes to slurmctld, and
> passes the same set of association records. In "assoc_mgr.c", in function
> "_set_user_default_acct", the value of "is_def" is tested for non-zero, and
> if non-zero, the "acct" is set as the default account in the "user" record in
> the "assoc_mgr_user_list". Of course, both "1" and "NO_VAL" are non-zero,
> so the default account is set to "acct02". So for the life of the
> slurmctld, the default for "user01" stays "acct02", even though the default
> in the database records is "acct01". When slurmctld is stopped and
> restarted, the internal tables are rebuilt from the database records, so
> everything is set the way it should be.
>
> The easiest way to fix this is to change the test of "is_def" in
> "assoc_mgr.c" to be an explicit test for "1" instead of just for non-zero.
> That is what the patch below does. This appears to solve the problem.
> Another possibility is to explicitly initialize "is_def" in the association
> record to "0" instead of "NO_VAL", but I don't know if this has implications
> elsewhere in the code.
>
> -Don Albert-
>
> The following patch is based on SLURM version 2.2.5.
>
> [stag] (dalbert) common> cvs diff -u assoc_mgr.c
> Index: assoc_mgr.c
> ===================================================================
> RCS file: /cvsroot/slurm/slurm/src/common/assoc_mgr.c,v
> retrieving revision 1.1.1.32.2.1
> diff -u -r1.1.1.32.2.1 assoc_mgr.c
> --- assoc_mgr.c 6 May 2011 17:27:01 -0000 1.1.1.32.2.1
> +++ assoc_mgr.c 26 May 2011 00:00:57 -0000
> @@ -293,7 +293,7 @@
> xassert(assoc_mgr_user_list);
>
> /* set up the default if this is it */
> - if (assoc->is_def && (assoc->uid != NO_VAL)) {
> + if ((assoc->is_def == 1) && (assoc->uid != NO_VAL)) {
> slurmdb_user_rec_t *user = NULL;
> ListIterator user_itr =
> list_iterator_create(assoc_mgr_user_list);
>
diff --git a/src/common/assoc_mgr.c b/src/common/assoc_mgr.c
index f1562c9..13117d6 100644
--- a/src/common/assoc_mgr.c
+++ b/src/common/assoc_mgr.c
@@ -293,7 +293,7 @@ static void _set_user_default_acct(slurmdb_association_rec_t *assoc)
xassert(assoc_mgr_user_list);
/* set up the default if this is it */
- if (assoc->is_def && (assoc->uid != NO_VAL)) {
+ if ((assoc->is_def == 1) && (assoc->uid != NO_VAL)) {
slurmdb_user_rec_t *user = NULL;
ListIterator user_itr =
list_iterator_create(assoc_mgr_user_list);
@@ -321,7 +321,7 @@ static void _set_user_default_wckey(slurmdb_wckey_rec_t *wckey)
xassert(assoc_mgr_user_list);
/* set up the default if this is it */
- if (wckey->is_def && (wckey->uid != NO_VAL)) {
+ if ((wckey->is_def == 1) && (wckey->uid != NO_VAL)) {
slurmdb_user_rec_t *user = NULL;
ListIterator user_itr =
list_iterator_create(assoc_mgr_user_list);
@@ -2398,6 +2398,11 @@ extern int assoc_mgr_update_assocs(slurmdb_update_object_t *update)
if (!object->usage)
object->usage =
create_assoc_mgr_association_usage();
+ /* If is_def is uninitialized the value will
+ be NO_VAL, so if it isn't 1 make it 0.
+ */
+ if (object->is_def != 1)
+ object->is_def = 0;
list_append(assoc_mgr_association_list, object);
object = NULL;
parents_changed = 1; /* set since we need to
@@ -2643,8 +2648,13 @@ extern int assoc_mgr_update_wckeys(slurmdb_update_object_t *update)
} else
object->uid = pw_uid;
- if (object->is_def)
+ /* If is_def is uninitialized the value will
+ be NO_VAL, so if it isn't 1 make it 0.
+ */
+ if (object->is_def == 1)
_set_user_default_wckey(object);
+ else
+ object->is_def = 0;
list_append(assoc_mgr_wckey_list, object);
object = NULL;
break;