[slurm-dev] slurm_load_partitions: Unable to contact slurm controller (connect failure)

2016-10-24 Thread Peixin Qiao
Hello,

I installed slurm-llnl on a single Debian machine. When I ran slurmctld and
slurmd, I got the error:
slurm_load_partitions: Unable to contact slurm controller (connect failure).

The slurm.conf is as follows:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=debian
#ControlAddr=
#
AuthType=auth/none
CacheGroups=0
CryptoType=crypto/openssl
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobCompType=jobcomp/none
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=debian CPUs=4 RealMemory=5837 Sockets=4
PartitionName=debug Nodes=debian Default=YES
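
For reference, a minimal set of checks one might run for this kind of
"connect failure", assuming the slurm.conf above and default Debian paths
(hostname, port, and log locations are only illustrative):

  # is slurmctld actually running and listening on SlurmctldPort (6817)?
  ps aux | grep [s]lurmctld
  ss -tlnp | grep 6817

  # can this host reach the controller named in ControlMachine=debian?
  scontrol ping
  sinfo

  # if scontrol ping fails, run the daemons in the foreground with verbose
  # logging to see why they exit (bad StateSaveLocation, unresolvable
  # hostname, mismatched keys, ...):
  sudo slurmctld -D -vvvv
  sudo slurmd -D -vvvv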

Best Regards,
Peixin


[slurm-dev] Re: Impact to jobs when reconfiguring partitions?

2016-10-24 Thread Lachlan Musicman
On 25 October 2016 at 09:17, Tuo Chen Peng  wrote:

> Oh ok thanks for pointing this out.
>
> I thought the ‘scontrol update’ command was for letting slurmctld pick up
> any change in slurm.conf.
>
> But after reading the manual again, it seems this command changes settings
> at runtime rather than reading changes from slurm.conf.
>
>
>
> So is restarting slurmctld the only way to let it pick up changes in
> slurm.conf?
>
> And if I change (2.2) in my plan to
>
> (2.2) restart slurmctld to pick up changes in slurm.conf, then use ‘scontrol
> reconfigure’ to push changes to all nodes
>
> Do you see any impact to the running jobs in the cluster?
>
>
There shouldn't be any impact on running jobs at all, but of course there
are always caveats:
 - while slurmctld is restarting, no one will be able to submit any jobs
(although it should only take ~5 seconds to restart unless you have made an
error, in which case it will probably take a minute while you fix and/or
roll back, so no one should even notice)
 - as an extension of the above, if any job in the queue depends on a
running job, and that job finishes in the few seconds that slurmctld is
down, the dependency might not be processed until the controller is back
up... but I doubt that will matter.
 - I can't remember exactly what they do, but if you look over the list, I
think some people save the contents of /var/spool/slurmd (which I believe
holds the "state" information of all running jobs)

(note that none of these is a real concern; they are just possibilities)
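
If you want to be extra careful, something along these lines before the restart
(just a sketch - the actual paths come from StateSaveLocation and SlurmdSpoolDir
in your slurm.conf and will differ per site):

  # on the controller: check and copy the slurmctld state save directory
  scontrol show config | grep -i StateSaveLocation
  cp -a /var/spool/slurmctld /root/slurmctld-state-$(date +%F)

  # on each compute node: copy the slurmd spool directory
  cp -a /var/spool/slurmd /root/slurmd-spool-$(date +%F)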


L.


--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper


[slurm-dev] Re: Impact to jobs when reconfiguring partitions?

2016-10-24 Thread Tuo Chen Peng
Oh, ok, thanks for pointing this out.
I thought the ‘scontrol update’ command was for letting slurmctld pick up any
change in slurm.conf.
But after reading the manual again, it seems this command changes settings at
runtime rather than reading changes from slurm.conf.

So is restarting slurmctld the only way to let it pick up changes in slurm.conf?
And if I change (2.2) in my plan to
(2.2) restart slurmctld to pick up changes in slurm.conf, then use ‘scontrol
reconfigure’ to push changes to all nodes
Do you see any impact to the running jobs in the cluster?

Thanks

From: Lachlan Musicman [mailto:data...@gmail.com]
Sent: Monday, October 24, 2016 2:58 PM
To: slurm-dev
Subject: [slurm-dev] Re: Impact to jobs when reconfiguring partitions?

On 25 October 2016 at 08:42, Tuo Chen Peng wrote:
Hello all,
This is my first post in the mailing list - nice to join the community!

Welcome!


I have a general question regarding slurm partition change:
If I move one node from one partition to the other, will it cause any impact to 
the jobs that are still running on other nodes, in both partitions?

No, it shouldn't, depending on how you execute the plan...

But we would like to do this without interrupting existing, running jobs.
What would be the safe way to do this?

And here’s my plan:
(1) drain the node in main partition for the move, and only drain that node - 
keep other nodes available for job submission.
(2) move node from main partition to short job partition
(2.1) update slurm.conf on both control node and node to be moved, so that this 
node is listed under short job partition
(2.2) Run scontrol update on both control node and node just moved, to let 
slurm pick up configuration change.
(3) node should now be moved to short job partition, set the node back to 
normal / idle state.

Is “scontrol update” the right command to use in this case?
Does anyone see any impact / concern in above sequence?
I’m mostly worried about whether such a partition change could cause
users’ existing jobs to be killed or fail for some reason.

Looks correct except for 2.2 - my understanding is that you would need to
restart the slurmctld process (`systemctl restart slurm`) at this point - this
is when the slurm "head" node picks up the changes to slurm.conf - and then
run `scontrol reconfigure` to distribute that change to the nodes.


Cheers
L.




[slurm-dev] Re: Impact to jobs when reconfiguring partitions?

2016-10-24 Thread Lachlan Musicman
On 25 October 2016 at 08:42, Tuo Chen Peng  wrote:

> Hello all,
>
> This is my first post in the mailing list - nice to join the community!
>

Welcome!


>
>
> I have a general question regarding slurm partition change:
>
> If I move one node from one partition to the other, will it cause any
> impact to the jobs that are still running on other nodes, in both
> partitions?
>
>
No, it shouldn't, depending on how you execute the plan...


> But we would like to do this without interrupting existing, running jobs.
>
> What would be the safe way to do this?
>
>
>
> And here’s my plan:
>
> (1) drain the node in main partition for the move, and only drain that
> node - keep other nodes available for job submission.
>
> (2) move node from main partition to short job partition
>
> (2.1) update slurm.conf on both control node and node to be moved, so that
> this node is listed under short job partition
>
> (2.2) Run scontrol update on both control node and node just moved, to let
> slurm pick up configuration change.
>
> (3) node should now be moved to short job partition, set the node back to
> normal / idle state.
>
>
>
> Is “scontrol update” the right command to use in this case?
>
> Does anyone see any impact / concern in above sequence?
>
> I’m mostly worried about whether such a partition change could cause
> users’ existing jobs to be killed or fail for some reason.
>

Looks correct except for 2.2 - my understanding is that you would need to
restart the slurmctld process (`systemctl restart slurm`) at this point -
this is when the slurm "head" node picks up the changes to slurm.conf - and
then run `scontrol reconfigure` to distribute that change to the nodes.
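
Roughly, the whole sequence would look something like this (node and partition
names are made up, and the service unit may be called slurm or slurmctld
depending on how your packages set it up):

  # 1. drain the node that is moving
  scontrol update NodeName=node07 State=DRAIN Reason="moving to short partition"

  # 2. edit slurm.conf on the controller and on node07 so that node07 is
  #    listed under the short partition, then restart the controller
  sudo systemctl restart slurmctld

  # 2.1 push the new configuration out to all slurmd daemons
  scontrol reconfigure

  # 3. put the node back into service
  scontrol update NodeName=node07 State=RESUME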


Cheers
L.


[slurm-dev] Impact to jobs when reconfiguring partitions?

2016-10-24 Thread Tuo Chen Peng
Hello all,
This is my first post in the mailing list - nice to join the community!

I have a general question regarding slurm partition change:
If I move one node from one partition to the other, will it cause any impact to 
the jobs that are still running on other nodes, in both partitions?


I'm managing 1 cluster with 2 partitions:
- a main partition for long-running jobs
- a short job partition, dedicated to jobs that run for less than 2 hours

Right now we are seeing that our short job partition needs more resources, and
we are planning to move 1 node from the main partition to the short job partition.
But we would like to do this without interrupting existing, running jobs.
What would be the safe way to do this?

And here's my plan:
(1) drain the node in main partition for the move, and only drain that node - 
keep other nodes available for job submission.
(2) move node from main partition to short job partition
(2.1) update slurm.conf on both control node and node to be moved, so that this 
node is listed under short job partition
(2.2) Run scontrol update on both control node and node just moved, to let 
slurm pick up configuration change.
(3) node should now be moved to short job partition, set the node back to 
normal / idle state.

Is "scontrol update" the right command to use in this case?
Does anyone see any impact / concern in above sequence?
I'm mostly worried about whether such a partition change could cause
users' existing jobs to be killed or fail for some reason.

Thank you
TuoChen Peng




[slurm-dev] Re: set maximum CPU usage per user

2016-10-24 Thread Benjamin Redling

Hi,

On 10/21/2016 18:58, Steven Lo wrote:
> Is MaxTRESPerUser a better option to use?

if you only ever want to restrict every user alike, that seems reasonable.
I would choose whatever fits your needs right now and in the not so
distant future. That way you gain time to learn about the options slurm
provides.
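
For example, something along these lines (just a sketch - the QOS name and the
limit are placeholders, and it only takes effect if accounting is set up and
AccountingStorageEnforce includes limits/qos):

  # cap every user at 64 CPUs within the 'normal' QOS
  sacctmgr modify qos normal set MaxTRESPerUser=cpu=64

  # check the result
  sacctmgr show qos format=Name,MaxTRESPU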

Anyway, did you make any progress with your former setup?
Did you understand what happened?

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] Re: slurmd: fatal: Frontend not configured correctly in slurm.conf

2016-10-24 Thread Alexandre Strube
Slurm in Ubuntu is broken. See my previous messages here:

https://groups.google.com/forum/#!searchin/slurm-devel/strube|sort:relevance/slurm-devel/Hbh5BsvaVnA/u5zRC22NDAAJ
https://groups.google.com/forum/#!searchin/slurm-devel/strube|sort:relevance/slurm-devel/-JdsUDAteI8/uzMLSfSJDAAJ

and the specific bug reports for ubuntu:

https://bugs.launchpad.net/ubuntu/+source/slurm-llnl/+bug/1629025
https://bugs.launchpad.net/ubuntu/+source/slurm-llnl/+bug/1629027
https://bugs.launchpad.net/ubuntu/+source/slurm-llnl/+bug/1629030

Debian Jessie works perfectly, though, with exactly the same slurm.conf.


2016-10-24 17:05 GMT+02:00 Peixin Qiao :

> Hello,
>
> When I installed slurm and started it on Ubuntu 16.04, I got the error:
>
> slurmd: fatal: Frontend not configured correctly in slurm.conf. See man
> slurm.conf for FrontendName
>
> After reading man slurm.conf, I am still confused about how to change
> slurm.conf. Could you please help me with the detailed change in the
> slurm.conf file?
>
> Best Regards,
> Peixin
> Ph.D. candidate in Computer Science
> Illinois Institute of Technology
>



-- 
[]
Alexandre Strube
su...@ubuntu.com


[slurm-dev] slurmd: fatal: Frontend not configured correctly in slurm.conf

2016-10-24 Thread Peixin Qiao
Hello,

When I installed slurm and started it on Ubuntu 16.04, I got the error:

slurmd: fatal: Frontend not configured correctly in slurm.conf. See man
slurm.conf for FrontendName

After reading man slurm.conf, I am still confused about how to change
slurm.conf. Could you please help me with the detailed change in the
slurm.conf file?

Best Regards,
Peixin
Ph.D. candidate in Computer Science
Illinois Institute of Technology