[slurm-dev] Re: Need to restart slurmctld when adding user to accounting

2016-03-30 Thread Danny Auble
Chris is right.  If you ever hit this problem, the failure should be fairly
clearly marked in both the slurmctld and slurmdbd logs.  Usually the cause is
a firewall such as iptables, or different Slurm users set in the various .conf
files, as mentioned before.
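
A minimal sketch of checking the second cause, assuming both daemons take
their user from a SlurmUser= line in their .conf files and that an unset
value means root (the default, as far as I know). The function names
slurm_user and same_slurm_user and the example paths are hypothetical.

```shell
#!/bin/sh
# Print the SlurmUser value found in a Slurm config file.
# Assumption: an unset SlurmUser means the daemon runs as root.
slurm_user() {
    awk -F= '
        # Ignore whole-line comments.
        /^[ \t]*#/ { next }
        {
            key = $1
            gsub(/[ \t]/, "", key)
            # Match the key case-insensitively.
            if (tolower(key) == "slurmuser") {
                val = $2
                sub(/#.*/, "", val)      # drop a trailing comment
                gsub(/[ \t]/, "", val)   # drop surrounding whitespace
                found = val
            }
        }
        END { print (found != "" ? found : "root") }
    ' "$1"
}

# Succeed (exit status 0) only when both files resolve to the same user.
same_slurm_user() {
    [ "$(slurm_user "$1")" = "$(slurm_user "$2")" ]
}

# Example call (paths are assumptions; adjust to your install):
#   same_slurm_user /etc/slurm/slurm.conf /etc/slurm/slurmdbd.conf \
#       || echo "SlurmUser differs between slurm.conf and slurmdbd.conf"
```

This only compares names in the files; it does not confirm which UID the
running daemons actually use.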

On March 30, 2016 5:57:19 PM PDT, Gene Soudlenkov wrote:
>
>We've been having the same problem for years - and we still need to do
>it.
>
>Gene
>
>On 31/03/16 13:46, Christopher Samuel wrote:
>> On 31/03/16 11:33, Terri Knight wrote:
>>
>>> Upon further testing, I only need restart the slurmctld daemon to get
>>> the new user added such that he can run a job.
>> I think when you add a user with sacctmgr slurmdbd will try and do an
>> RPC to slurmctld on the registered clusters to inform them of this.
>>
>> If slurmdbd can't do so then you should see an error logged in the
>> slurmdbd logs and consequently slurmctld won't realise this new user
>> exists until it reloads its list of users from slurmdbd (say on a
>> restart).
>>
>> Check your slurmdbd logs and also check that:
>>
>> sacctmgr list cluster format=cluster,controlhost
>>
>> reports an IP address that slurmdbd can talk to for each cluster.
>>
>> Best of luck,
>> Chris
>
>-- 
>New Zealand eScience Infrastructure
>Centre for eResearch
>The University of Auckland
>e: g.soudlen...@auckland.ac.nz
>p: +64 9 3737599 ext 89834 c: +64 21 840 825 f: +64 9 373 7453
>w: www.nesi.org.nz


[slurm-dev] Re: Need to restart slurmctld when adding user to accounting

2016-03-30 Thread Gene Soudlenkov


We've been having the same problem for years - and we still need to do it.

Gene

On 31/03/16 13:46, Christopher Samuel wrote:

> On 31/03/16 11:33, Terri Knight wrote:
>
>> Upon further testing, I only need restart the slurmctld daemon to get
>> the new user added such that he can run a job.
>
> I think when you add a user with sacctmgr slurmdbd will try and do an
> RPC to slurmctld on the registered clusters to inform them of this.
>
> If slurmdbd can't do so then you should see an error logged in the
> slurmdbd logs and consequently slurmctld won't realise this new user
> exists until it reloads its list of users from slurmdbd (say on a restart).
>
> Check your slurmdbd logs and also check that:
>
> sacctmgr list cluster format=cluster,controlhost
>
> reports an IP address that slurmdbd can talk to for each cluster.
>
> Best of luck,
> Chris


--
New Zealand eScience Infrastructure
Centre for eResearch
The University of Auckland
e: g.soudlen...@auckland.ac.nz
p: +64 9 3737599 ext 89834 c: +64 21 840 825 f: +64 9 373 7453
w: www.nesi.org.nz


[slurm-dev] Re: Need to restart slurmctld when adding user to accounting

2016-03-30 Thread Christopher Samuel

On 31/03/16 11:33, Terri Knight wrote:

> Upon further testing, I only need restart the slurmctld daemon to get
> the new user added such that he can run a job.

I think when you add a user with sacctmgr slurmdbd will try and do an
RPC to slurmctld on the registered clusters to inform them of this.

If slurmdbd can't do so then you should see an error logged in the
slurmdbd logs and consequently slurmctld won't realise this new user
exists until it reloads its list of users from slurmdbd (say on a restart).

Check your slurmdbd logs and also check that:

sacctmgr list cluster format=cluster,controlhost

reports an IP address that slurmdbd can talk to for each cluster.
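
A rough sketch of screening that output, assuming sacctmgr is run with
-n (no header) and -P (fields separated by '|'). The function name
suspect_clusters is hypothetical, and treating an empty or loopback
ControlHost as suspicious is an illustrative heuristic, not a rule from
Slurm itself.

```shell
#!/bin/sh
# Read "cluster|controlhost" lines on stdin and print the clusters whose
# ControlHost looks unusable for a slurmdbd on another machine:
# empty, "localhost", or a 127.x loopback address.
suspect_clusters() {
    while IFS='|' read -r cluster host; do
        # Skip blank lines.
        [ -n "$cluster" ] || continue
        case "$host" in
            ''|localhost|127.*) printf '%s\n' "$cluster" ;;
        esac
    done
}

# Example call (needs a working sacctmgr):
#   sacctmgr -n -P list cluster format=cluster,controlhost | suspect_clusters
```

An address that passes this screen can still be blocked by a firewall, so
it is worth testing the connection from the slurmdbd host as well.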

Best of luck,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Need to restart slurmctld when adding user to accounting

2016-03-30 Thread Douglas Jacobsen
Sorry, you just said they were; I somehow misread this.  Try increasing the
logging level; perhaps the easiest way is to run slurmctld and slurmdbd
interactively with the -Dvvv arguments.  Then add a user and see if any
errors occur, particularly on the slurmctld side after the sacctmgr update
is done.

slurmdbd will send the accounting update to slurmctld slightly after
sacctmgr returns.
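
A small sketch for spotting the relevant error in the slurmctld output,
assuming the message has the "error: User <uid> not found" form quoted
later in this thread. The function name missing_uids and the example log
path are hypothetical.

```shell
#!/bin/sh
# Read slurmctld log lines on stdin and print each UID that appears in an
# "error: User <uid> not found" message, once per UID.
missing_uids() {
    sed -n 's/^.*error: User \([0-9][0-9]*\) not found.*$/\1/p' | sort -u
}

# Example call (log path is an assumption; use your SlurmctldLogFile):
#   missing_uids < /var/log/slurm/slurmctld.log
```

If the UID of a freshly added user shows up here, slurmctld has most
likely not received the accounting update from slurmdbd.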


Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center 
dmjacob...@lbl.gov

- __o
-- _ '\<,_
--(_)/  (_)__


On Wed, Mar 30, 2016 at 5:38 PM, Douglas Jacobsen wrote:

> Are both slurmdbd and slurmctld running as the same UID?  (If not, they
> need to be; I believe you can see the errors in the slurmdbd log at debug2
> or debug3.)
>
>
>
> 
> Doug Jacobsen, Ph.D.
> NERSC Computer Systems Engineer
> National Energy Research Scientific Computing Center
> 
> dmjacob...@lbl.gov
>
> - __o
> -- _ '\<,_
> --(_)/  (_)__
>
>
> On Wed, Mar 30, 2016 at 5:32 PM, Terri Knight wrote:
>
>>
>> I posted earlier (Dec 28, 2015) about this issue and was told to check
>> that the slurmdbd and slurmctld daemons were running as the same user - they
>> weren't at that time. I thought making that change would resolve the
>> problem but it did not.
>>
>> These daemons are now both running as root:
>> root  6463 1  0 17:01 ? 00:00:00 /share/apps/slurm-15.08.8/sbin/slurmdbd
>> root  6743 1  0 17:05 ? 00:00:00 /share/apps/slurm-15.08.8//sbin/slurmctld
>>
>> on the compute node:
>> root  7874 1  0 17:03 ? 00:00:00 /share/apps/slurm-15.08.8//sbin/slurmd
>>
>> Upon further testing, I only need restart the slurmctld daemon to get the
>> new user added such that he can run a job. So not as big a deal to me now
>> but it is different than in older versions of slurm.
>>
>> I'm adding a new user to an existing account and before I restart
>> slurmctld I see this in the slurmctld log when I try to "srun date" as that
>> user:
>>
>> [2016-03-30T17:04:50.107] error: User 9101 not found
>> [2016-03-30T17:04:50.107] _job_create: invalid account or partition for user 9101, account '(null)', and partition 'debug'
>> [2016-03-30T17:04:50.142] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
>> [2016-03-30T17:05:11.381] Terminate signal (SIGINT or SIGTERM) received
>>
>> Oddly the account is "null"
>>
>> Here is the command to add the user:
>> sacctmgr add user johndoe defaultaccount=boris partition=low,med,high,debug cluster=jane
>>
>> slurm-15.08.8 on Ubuntu 14.04.4
>>
>> Like I said, I can live with it since it's only 1 restart.
>>
>> Thanks,
>> Terri
>>
>
>


[slurm-dev] Re: Need to restart slurmctld when adding user to accounting

2016-03-30 Thread Douglas Jacobsen
Are both slurmdbd and slurmctld running as the same UID?  (If not, they need
to be; I believe you can see the errors in the slurmdbd log at debug2 or
debug3.)




Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center 
dmjacob...@lbl.gov

- __o
-- _ '\<,_
--(_)/  (_)__


On Wed, Mar 30, 2016 at 5:32 PM, Terri Knight wrote:

>
> I posted earlier (Dec 28, 2015) about this issue and was told to check
> that the slurmdbd and slurmctld daemons were running as the same user - they
> weren't at that time. I thought making that change would resolve the
> problem but it did not.
>
> These daemons are now both running as root:
> root  6463 1  0 17:01 ? 00:00:00 /share/apps/slurm-15.08.8/sbin/slurmdbd
> root  6743 1  0 17:05 ? 00:00:00 /share/apps/slurm-15.08.8//sbin/slurmctld
>
> on the compute node:
> root  7874 1  0 17:03 ? 00:00:00 /share/apps/slurm-15.08.8//sbin/slurmd
>
> Upon further testing, I only need restart the slurmctld daemon to get the
> new user added such that he can run a job. So not as big a deal to me now
> but it is different than in older versions of slurm.
>
> I'm adding a new user to an existing account and before I restart
> slurmctld I see this in the slurmctld log when I try to "srun date" as that
> user:
>
> [2016-03-30T17:04:50.107] error: User 9101 not found
> [2016-03-30T17:04:50.107] _job_create: invalid account or partition for user 9101, account '(null)', and partition 'debug'
> [2016-03-30T17:04:50.142] _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified
> [2016-03-30T17:05:11.381] Terminate signal (SIGINT or SIGTERM) received
>
> Oddly the account is "null"
>
> Here is the command to add the user:
> sacctmgr add user johndoe defaultaccount=boris partition=low,med,high,debug cluster=jane
>
> slurm-15.08.8 on Ubuntu 14.04.4
>
> Like I said, I can live with it since it's only 1 restart.
>
> Thanks,
> Terri
>