Hey Don, thanks for the great analysis of this bug.  Attached is a more 
complete patch fixing the real problem and also using your sanity check.

This will be in 2.2.6.

Danny

> 
> There is an error in creating a second account for a user that results in 
> subsequent jobs submitted by that user being run under the wrong default 
> account.  The wrong account persists until slurmctld is stopped and 
> restarted.  This occurs running 'slurmdbd' and using 'MySQL' for the 
> database.  Here is an simple example:
> 
> Create a new account and then add a new user:
> 
> >sacctmgr add account name=acct01
> >sacctmgr add user name=user01 account=acct01
> 
> After confirming the commit, using "mysql" to dump the records (some rows 
> omitted for brevity) shows:
> 
> mysql> select id_assoc,is_def,user,acct,parent_acct from 
> dja_cluster_assoc_table;
> +----------+--------+---------+--------+-------------+
> | id_assoc | is_def | user    | acct   | parent_acct |
> +----------+--------+---------+--------+-------------+
> |        1 |      0 |         | root   |             |
> |       18 |      0 |         | acct01 | root        |
> |       19 |      1 | user01  | acct01 |             |
> +----------+--------+---------+--------+-------------+
> 
> which shows that the "add user" has marked "acct01" as the default account 
> ("is_def = 1") for this user.
> 
> The user "user01" can now submit jobs using "srun".   He can explicitly 
> specify "account=acct01" on the command, or if not, the default will be 
> "acct01", as shown by:
> 
> >srun hostname
> stag
> >sacct
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 55             hostname    phoenix     acct01          1  COMPLETED      0:0
> 
> If we now add a second account for the same user:
> 
> >sacctmgr add user name=user01 account=acct02
> 
> and dump the records:
> 
> mysql> select id_assoc,is_def,user,acct,parent_acct from 
> dja_cluster_assoc_table;
> +----------+--------+---------+--------+-------------+
> | id_assoc | is_def | user    | acct   | parent_acct |
> +----------+--------+---------+--------+-------------+
> |        1 |      0 |         | root   |             |
> |       18 |      0 |         | acct01 | root        |
> |       19 |      1 | user01  | acct01 |             |
> |       20 |      0 |         | acct02 | root        |
> |       21 |      0 | user01  | acct02 |             |
> +----------+--------+---------+--------+-------------+
> 
> the user "user01" now has an association that includes "acct02".  But note 
> that "acct01" is still marked as the default account for "user01".  But if we 
> submit a job with "srun" now,  it gets run under account "acct02".
> 
> >srun hostname
> stag
> >sacct
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 55             hostname    phoenix     acct01          1  COMPLETED      0:0
> 56             hostname    phoenix     acct02          1  COMPLETED      0:0
> 
> This situation persists until "slurmctld" is stopped and restarted.  After 
> the restart, the "srun" with no explicit account parameter gets run under the 
> default of "acct01", as it should.
> 
> I went round and round for a while with "sacctmgr", "slurmdbd", and 
> "slurmctld",   but I think I have finally figured out what is happening.   It 
> starts with "sacctmgr", in the "sacctmgr_add_user" function,  where an 
> "association record" is allocated and initialized.  The "is_def" field, like 
> most of the others, is initialized to "NO_VAL".    During the rest of the 
> processing, if a new user is being added, the assoc->acct is set and "is_def" 
> ends up being set to "1".   But if the user already exists, and we are just 
> creating a new association for that user, then "is_def" is not modified, and 
> remains as "NO_VAL".
> 
> Eventually the list of one or more associations is passed to 
> "acct_storage_g_add_associations", which results in a RPC call to "slurmdbd", 
>  where the information ends up being sent to "mysqld" to be added to the 
> database.  All this time, the "is_def" field for the association with the 
> "acct02" account is still "NO_VAL".   Somehow this field ends up as "0" in 
> the actual database,  but that is not what concerns us here.   When the data 
> "commit" is done,  slurmdbd also does an RPC call to "slurmctld" with a 
> message type of "ACCOUNTING_UPDATE_MSG" to pass the changes to slurmctld, and 
> passes the same set of association records.   In "assoc_mgr.c", in function 
> "_set_user_default_acct",  the value of "is_def" is tested for non-zero, and 
> if non-zero, the "acct" is set as the default account in the "user" record in 
> the "assoc_mgr_user_list".   Of course, both "1" and "NO_VAL" are non-zero, 
> so the default account is set to "acct02".   So for the life of the 
> slurmctld,  the default for "user01" stays "acct02", even though the default 
> in the database records is "acct01".   When slurmctld is stopped and 
> restarted, the internal tables are rebuilt from the database records, so 
> everything is set the way it should be.
> 
> The easiest way to fix this is to change the test of "is_def" in 
> "assoc_mgr.c" to be an explicit test for "1" instead of just for non-zero.   
> That is what the patch below does.   This appears to solve the problem.   
> Another possibility is to explicitly initialize "is_def" in the association 
> record to "0" instead of "NO_VAL", but I don't know if this has implications 
> elsewhere in the code.
> 
>         -Don Albert-
> 
> The following patch is based on SLURM version 2.2.5.
> 
> [stag] (dalbert) common> cvs diff -u assoc_mgr.c
> Index: assoc_mgr.c
> ===================================================================
> RCS file: /cvsroot/slurm/slurm/src/common/assoc_mgr.c,v
> retrieving revision 1.1.1.32.2.1
> diff -u -r1.1.1.32.2.1 assoc_mgr.c
> --- assoc_mgr.c 6 May 2011 17:27:01 -0000       1.1.1.32.2.1
> +++ assoc_mgr.c 26 May 2011 00:00:57 -0000
> @@ -293,7 +293,7 @@
>         xassert(assoc_mgr_user_list);
> 
>         /* set up the default if this is it */
> -       if (assoc->is_def && (assoc->uid != NO_VAL)) {
> +       if ((assoc->is_def == 1) && (assoc->uid != NO_VAL)) {
>                 slurmdb_user_rec_t *user = NULL;
>                 ListIterator user_itr =
>                         list_iterator_create(assoc_mgr_user_list);
> 

diff --git a/src/common/assoc_mgr.c b/src/common/assoc_mgr.c
index f1562c9..13117d6 100644
--- a/src/common/assoc_mgr.c
+++ b/src/common/assoc_mgr.c
@@ -293,7 +293,7 @@ static void _set_user_default_acct(slurmdb_association_rec_t *assoc)
 	xassert(assoc_mgr_user_list);
 
 	/* set up the default if this is it */
-	if (assoc->is_def && (assoc->uid != NO_VAL)) {
+	if ((assoc->is_def == 1) && (assoc->uid != NO_VAL)) {
 		slurmdb_user_rec_t *user = NULL;
 		ListIterator user_itr =
 			list_iterator_create(assoc_mgr_user_list);
@@ -321,7 +321,7 @@ static void _set_user_default_wckey(slurmdb_wckey_rec_t *wckey)
 	xassert(assoc_mgr_user_list);
 
 	/* set up the default if this is it */
-	if (wckey->is_def && (wckey->uid != NO_VAL)) {
+	if ((wckey->is_def == 1) && (wckey->uid != NO_VAL)) {
 		slurmdb_user_rec_t *user = NULL;
 		ListIterator user_itr =
 			list_iterator_create(assoc_mgr_user_list);
@@ -2398,6 +2398,11 @@ extern int assoc_mgr_update_assocs(slurmdb_update_object_t *update)
 			if (!object->usage)
 				object->usage =
 					create_assoc_mgr_association_usage();
+			/* If is_def is uninitialized the value will
+			   be NO_VAL, so if it isn't 1 make it 0.
+			*/
+			if (object->is_def != 1)
+				object->is_def = 0;
 			list_append(assoc_mgr_association_list, object);
 			object = NULL;
 			parents_changed = 1; /* set since we need to
@@ -2643,8 +2648,13 @@ extern int assoc_mgr_update_wckeys(slurmdb_update_object_t *update)
 			} else
 				object->uid = pw_uid;
 
-			if (object->is_def)
+			/* If is_def is uninitialized the value will
+			   be NO_VAL, so if it isn't 1 make it 0.
+			*/
+			if (object->is_def == 1)
 				_set_user_default_wckey(object);
+			else
+				object->is_def = 0;
 			list_append(assoc_mgr_wckey_list, object);
 			object = NULL;
 			break;

Reply via email to