Danny,
Sorry for the confusion. I knew I would muddy things by talking about two
problems that I thought might be related, and I didn't want to just dump
out the problem until I had found the root cause ;). Let's take it from the
top, then. We are using MySQL directly simply because we did not want to
add an extra layer and complicate testing on our XT SLURM installation.
That being said, I had planned on testing with slurmdbd once I had the
configuration I wanted in a good state and well tested. Let's move on to
the main issue:
The main problem I am seeing happens in the following situation on both my
Cray XT (SLURM 2.3.2) and x86 (confirmed this morning with 2.2.6) testing
platforms. The slurm.conf specifies
AccountingStorageEnforce=associations,limits,qos and AccountingStorageType
is set to the MySQL plugin. In sacctmgr, I have a list of accounts and
sub-accounts already created, resembling the following association tree
(sacctmgr: list association tree):
clusterx root .....
clusterx   parent1 ... (with only Share=#, QoS={list of QoS}, and
           DefaultQoS={ValidQoS} specified)
clusterx     child1 ... (with Partition=mypart, Share=Parent,
             QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child1 bobby ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child1 sue ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx   parent2
clusterx     child2 ... (with Partition=mypart, Share=Parent,
             QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child2 bobby ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child2 sue ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
So, if user fred runs srun -n1 hostname, an association error appears,
since he has no association yet. To follow my existing accounting structure
and let him submit jobs to the cluster, I need to add user fred to an
account. So I run the following commands in the sacctmgr "shell":
sacctmgr: list association tree
clusterx root .....
clusterx   parent1 ... (with only Share=#, QoS={list of QoS}, and
           DefaultQoS={ValidQoS} specified)
clusterx     child1 ... (with Partition=mypart, Share=Parent,
             QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child1 bobby ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child1 sue ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx   parent2
clusterx     child2 ... (with Partition=mypart, Share=Parent,
             QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child2 bobby ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child2 sue ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
sacctmgr: add user fred account=child1 Fairshare=parent QoS={all of the
QoSs I have} DefaultQoS={a Valid QoS} defaultaccount=child1 partition=mypart
CONFIRMATION AND SUCCESS MESSAGE
sacctmgr: list association tree
clusterx root .....
clusterx   parent1 ... (with only Share=#, QoS={list of QoS}, and
           DefaultQoS={ValidQoS} specified)
clusterx     child1 ... (with Partition=mypart, Share=Parent,
             QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child1 bobby ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child1 sue ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child1 fred ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx   parent2
clusterx     child2 ... (with Partition=mypart, Share=Parent,
             QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child2 bobby ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
clusterx       child2 sue ... (with Partition=mypart, Share=Parent,
               QoS={all of the QoSs}, and DefaultQoS={ValidQoS})
If I now run srun -n1 hostname AS FRED, I still get an association error
and it only goes away if I restart the slurmctld daemon.
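A quick way to show that the association really is in the database at this
point, even though slurmctld rejects the job, is to query sacctmgr
non-interactively. This is a minimal sketch: has_assoc is a hypothetical
helper name, and the SACCTMGR variable is only an assumption so that the
command can be swapped out when no cluster is at hand.

```shell
# Command used to query the accounting database. Overridable (an
# assumption for illustration) so the helper can be tried without SLURM.
SACCTMGR=${SACCTMGR:-sacctmgr}

# Hypothetical helper: succeed if the accounting database lists an
# association between user $1 and account $2.
# -n drops the header, -P prints fields separated by "|".
has_assoc() {
  "$SACCTMGR" -n -P list associations user="$1" format=account,user |
    grep -Fqx "$2|$1"
}
```

For example, has_assoc fred child1 && echo "association is in the database".
I would expect that to succeed on my systems right after the add, while
srun as fred keeps failing until slurmctld is restarted.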
****
Now for the other issue I was talking about. Say I now want to add an
association between fred and the child2 account. I would expect to be able
to simply run the sacctmgr "shell" and execute one of the following:
sacctmgr: modify user fred set account=child1,child2
OR
sacctmgr: modify user set account=child1,child2 where name=fred
OR
sacctmgr: modify user where name=fred set account=child1,child2
However, all of these come back with variations of the following error:
"unknown option: account=child1,child2
use keyword 'where' to modify condition"
In order to add the user to another account, I need to "create" a "new"
user (although, in retrospect, what is really being created is a new
association between a UID and an account; the user keyword still has to be
used):
sacctmgr: create user fred account=child2 partition=mypart fairshare=parent
defaultqos={a valid QoS}
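Since modify will not take account=, the add has to be repeated for each
extra account. A small loop can at least generate the commands. This is
only a sketch: add_user_cmds is a hypothetical helper, it prints the
commands instead of running them, and it leaves out the partition,
fairshare and QoS options shown above.

```shell
# Hypothetical helper: print one "sacctmgr add user" command per account.
# -i tells sacctmgr to commit without asking for confirmation.
# Site-specific options (partition=, fairshare=, defaultqos=) are not
# included; append whatever your setup needs.
# Usage: add_user_cmds USER ACCOUNT [ACCOUNT...]
add_user_cmds() {
  user=$1
  shift
  for acct in "$@"; do
    printf 'sacctmgr -i add user %s account=%s\n' "$user" "$acct"
  done
}
```

Running add_user_cmds fred child2 prints
"sacctmgr -i add user fred account=child2". The output can be reviewed
first and then piped to sh once it looks right.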
Hope this clears things up.
As I have said, we do have plans to move to slurmdbd at some point.
However, based on your response, we may need to move on this sooner than
expected.
Thanks for your help,
Fred
2012/2/22 Danny Auble <[email protected]>
> Is there a reason you aren't using the SlurmDBD? Using straight MySQL is
> not tested nearly as much as with the SlurmDBD on top of MySQL. I would
> still expect this to work though. When you create your associations you
> should see some output in the slurmctld.log informing of the changes. I
> think they are at the debug level.
>
> I am not sure what you are explaining with modify a user's account
> assignment without recreating the user? Meaning move a user from one
> account to another or change the user association's account? Perhaps you
> are just explaining the process of adding a user to an account?
>
>
> On 02/22/12 19:17, Frédérick Lessard wrote:
>
> Danny,
>
> Thanks for responding so quickly. Earlier this afternoon I confirmed the
> bug on SLURM 2.3.x (I can confirm the revision in the morning once I'm back
> at the office) on a Cray XT system using straight MySQL (I believe I have
> seen the same bug on 2.2.6 on an x86 cluster, but am now unsure and will
> check in the morning). My instinct tells me this is related to the fact
> that you can't modify a user's account assignment without recreating the
> user. As you may know, once a user is created, you can't modify his account
> assignment (can't do sacctmgr modify user fred account=newaccount1,
> newaccount2). I would like to try my hand at that problem later, but for
> now the fact that I need to restart slurmctld is a nuisance. Hopefully,
> this is just a config issue, which is why I was trying to figure out
> where/when the association data is sent to the slurmctld. Here are the
> steps to recreate the issue on a Cray XT.
>
> First off, I configure SLURM to use Associations, Limits and QoS as
> enforcement rules. Then, I create a set of QoS's and parent accounts, and
> assign users to them. At this point if a user tries to submit a job with
> srun, an error is generated about not having a valid association. In order
> to fix it, I need to restart the slurmctld. Once restarted, all is well.
> Now if I want to add a user, I will need to restart slurmctld again for the
> changes to take effect. As I said, I have definitely confirmed this on my
> Cray test system, but I will recheck on my 2.2.6 x86 test system.
>
> Many thanks!
>
> Fred
>
> 2012/2/22 Danny Auble <[email protected]>
>
>>
>> What version of SLURM are you using? This seems like a bug that would
>> bite a lot of people. I don't see it in 2.3 or 2.4. (with the SlurmDBD,
>> I didn't test a direct mysql plugin, but that should work as well.)
>>
>> Danny
>>
>> On 02/22/12 16:29, Frédérick Lessard wrote:
>> > Hello (I hope you have only received this email once...had problems
>> > sending it to the list...),
>> >
>> > I'm running into issues where information is not available for the
>> > slurmctld until the daemon is restarted following an update in
>> > sacctmgr and I would like to try and fix it in the code (and submit a
>> > patch back to you guys if I find it!). Can someone provide me with
>> > some direction as to where to look into the code to understand when
>> > the information does get propagated from the association database
>> > (because it does appear there) to the slurmctld?
>> >
>> > A bit more background: Whenever I add a user, I need to restart the
>> > slurmctld in order for that user to be allowed to use the association
>> > created between the user and the account. Same thing applies for adding
>> > more accounts to that user.... I'm using mysql as the accounting DB and
>> > enforce Associations, Limits and QoS in the slurm.conf.
>> >
>> > Thanks for your help
>> >
>> > Fred.
>> >
>>
>
>