[slurm-dev] Re: Share free cpus

2016-01-16 Thread Benjamin Redling
Hello Jordan, On 2016-01-16 01:21, Jordan Willis wrote: > If my partition is used up according to the node configuration, but still has > available CPUS, is there a way to allow a user to who only has a task that > takes 1 cpu on that node? > > For instance here is my partition: > > NODELIST

[slurm-dev] jobs vanishing w/o trace(?)

2016-01-16 Thread Benjamin Redling
Hello everybody, I lose every job that gets allocated on a certain node (KVM instance). Background: to enable and test the resources of a cluster of new machines I run Slurm 2.6 inside a Debian 7 KVM instance. Mainly because the hosts run Debian 8 and the old cluster is Debian 7. I prefer the D

[slurm-dev] Re: Share free cpus

2016-01-18 Thread Benjamin Redling
mem-per-cpu as part of SallocDefaultCommand in the slurm.conf and go with DefMemPerCPU, DefMemPerNode, MaxMemPerCPU and MaxMemPerNode as mentioned in the second last paragraph and let the user set --mem-per-cpu. As recommended. Regards, Benjamin On Jan 16, 2016, at 7:34 AM, Benjamin Redling m

[slurm-dev] Re: NodeName and PartitionName format in slurm.conf

2016-01-20 Thread Benjamin Redling
On 19.01.2016 at 20:37, Andrus, Brian Contractor wrote: I am testing our slurm to replace our torque/moab setup here. The issue I have is to try and put all our node names in the NodeName and PartitionName entries. In our cluster, we name our nodes compute-- That seems to be problem enough wit

[slurm-dev] Re: NodeName and PartitionName format in slurm.conf

2016-01-20 Thread Benjamin Redling
On 20.01.2016 at 11:00, Benjamin Redling wrote: On 19.01.2016 at 20:37, Andrus, Brian Contractor wrote: I am testing our slurm to replace our torque/moab setup here. The issue I have is to try and put all our node names in the NodeName and PartitionName entries. In our cluster, we name our

[slurm-dev] Re: jobs vanishing w/o trace(?)

2016-01-22 Thread Benjamin Redling
On 16.01.2016 at 21:10, Benjamin Redling wrote: I lose every job that gets allocated on a certain node (KVM instance). [...] Now I had to change the default route of the host because of a brittle non-Slurm instance with a web app. After starting the unchanged instance several days later

[slurm-dev] Re: jobs vanishing w/o trace(?)

2016-01-25 Thread Benjamin Redling
On 16.01.2016 at 21:10, Benjamin Redling wrote: [...] how is it at all possible that the jobs get lost? What happened that the slurm master thinks all went well? (Does it? Am I just missing something?) Where can I start to investigate next? I could fire several hundred jobs with a dummy

[slurm-dev] slurm-dev output dir missing (Re: Re: jobs vanishing w/o trace(?))

2016-01-27 Thread Benjamin Redling
On 25.01.2016 at 16:41, Benjamin Redling wrote: > I could fire several hundred jobs with a dummy shell script against that > node but as soon as one of my users tries a complex pipeline jobs get > lost with a slurm-*.out typo: lost _without_ a .out-file Question: > Wh

[slurm-dev] slurm-dev check health and trigger actions via monitoring (Re: Re: NHC and disk / dell server health)

2016-01-28 Thread Benjamin Redling
On 27.01.2016 at 09:53, Ole Holm Nielsen wrote: > On 01/27/2016 09:12 AM, Johan Guldmyr wrote: >> has anybody already made some custom NHC checks that can be used to >> check disk health or perhaps even hardware health on a dell server? >> I've been thinking of using smartctl + NHC to test if th

[slurm-dev] Re: Share free cpus

2016-01-29 Thread Benjamin Redling
On 18.01.2016 at 18:42, Benjamin Redling wrote: > On 18.01.2016 at 01:39, Jordan Willis wrote: >> CompleteWait=60 >> SlurmdUser=root > side note: really root? Why not a dedicated user? It is at least the Debian default and I just didn't se

[slurm-dev] Re: Ressouces allocation problem

2016-01-29 Thread Benjamin Redling
On 29.01.2016 at 15:08, David Roman wrote: > I created 2 jobs > Job_A uses 8 CPUS in partition DEV > Job_B uses 16 CPUS in partition LOW > > If I start Job_A before Job_B, all is ok. Job_A is in RUNNING state and Job_B > is in PENDING state > > BUT, If I start Job_B before Job_A. The both jobs a

[slurm-dev] Re: Ressouces allocation problem

2016-01-29 Thread Benjamin Redling
On 29.01.2016 at 15:31, Dennis Mungai wrote: > Add SHARE=FORCE to your partition settings for each partition entry in > the configuration file. https://computing.llnl.gov/linux/slurm/cons_res_share.html selection setting was: SelectType=select/cons_res SelectTypeParameters=CR_Core_Memory Share

[slurm-dev] Re: Share free cpus

2016-01-29 Thread Benjamin Redling
s running as "slurm". I went back to this thread because Brian Freed "slurmd node state down" on 28th Jan. got a reply by Trey Dockendorf that gave me that hint that I mixed things up a few days before /B > On 01/29/2016 08:10 AM, Benjamin Redling wrote: >> Am 18.0

[slurm-dev] Re: Ressouces allocation problem

2016-01-29 Thread Benjamin Redling
As far as I understood Slurm with setting Share=FORCE you risk over-committing. /Benjamin On 2016-01-29 16:10, Dennis Mungai wrote: > And with the SHARE=FORCE:8 parameter, each consumable processor, socket or > core can be shared by 8 jobs, as an example. > > On Jan 29, 2016 5:08 PM, David Rom

[slurm-dev] Re: Ressouces allocation problem

2016-01-29 Thread Benjamin Redling
> http://slurm.schedmd.com/cons_res_share.html Full ACK. Was leaving the office and just used the highest page ranking to make a point. > On January 29, 2016 6:42:24 AM PST, Benjamin Redling > wrote: >> >> On 29.01.2016 at 15:31, Dennis Mungai wrote: >>> Add SHARE=FORCE to

[slurm-dev] Re: Ressouces allocation problem

2016-01-29 Thread Benjamin Redling
sbatch: error: Batch job submission failed: Requested node configuration is > not available > I try the other solutions that you give me, and I tell you what happens. > > PS : I'm sorry, but my English is not very good. > > David > > > -Message d'ori

[slurm-dev] Re: Ressouces allocation problem

2016-01-29 Thread Benjamin Redling
llocation problem > > > Can you change your consumable resources from CR_Core_Memory to CR_CPU_Memory? > On Jan 29, 2016 5:42 PM, Benjamin Redling > mailto:benjamin.ra...@uni-jena.de>> wrote: > > On 29.01.2016 at 15:31, Dennis Mungai wrote: >> Add SHARE=F

[slurm-dev] Re: Ressouces allocation problem

2016-01-29 Thread Benjamin Redling
ume_. /Benjamin > > > I try the other solutions that you give me, and I tell you what happens. > > PS : I'm sorry, but my English is not very good. > > David > > > -Original Message- > From: Benjamin Redling [mailto:benjamin.ra...@uni-jena.

[slurm-dev] Re: Ressouces allocation problem

2016-02-01 Thread Benjamin Redling
On 2016-02-01 11:08, David Roman wrote: > The both nodes are the same. They are virtual machine (VMWARE) to do some > tests. That makes me wonder why changing fastschedule=0 to 1 results in comprehensible behavior. Have you looked into the log files on the master and the node? (Apart from that

[slurm-dev] Re: Job limits

2016-02-03 Thread Benjamin Redling
I haven't understood why qos GrpJobs= -- assoc per user? -- won't work for you. On 4 February 2016 01:50:22 CET, "Skouson, Gary B" wrote: > >I'd like a way to be able to limit the number of jobs that a user is >allowed to run before we only allow them to run by backfilling. > >For example, le

[slurm-dev] Re: slurm-dev Job not using cores from different nodes

2016-02-04 Thread Benjamin Redling
Can you post how you submitted the job? Mira on 60 cores needs MPI in your case. Multi threading works w/o BTW. Your config says 31 CPUs. Generated without incr index or intended? On 4 February 2016 18:02:15 CET, Pierre Schneeberger wrote: >Hi there, > >I'm setting up a small cluster composed

[slurm-dev] Re: Job limits

2016-02-04 Thread Benjamin Redling
On 2016-02-04 16:43, Skouson, Gary B wrote: > The GrpJobs limits the total number of jobs allowed to be running. Let's say > I want to allow 70 jobs per users. The GrpJobs would work fine for that. > However, I'd like to limit the number of jobs able to reserve resources in > the backfill s

[slurm-dev] Re: Setting up SLURM for a single multi-core node

2016-02-11 Thread Benjamin Redling
On 2016-02-11 07:36, Rohan Garg wrote: > [...] The machine has 16 physical cores > on 2 sockets with HyperThreading enabled. I'm using the EASY > scheduling algorithm with backfilling. The goal is to fully utilize all > the available cores at all times. > Given a list of three jobs with requireme

[slurm-dev] Re: Setting up SLURM for a single multi-core node

2016-02-11 Thread Benjamin Redling
On 2016-02-11 07:36, Rohan Garg wrote: > > Hello, > > I'm trying to set up SLURM-15.08.1 on a single multi-core node to > manage multi-threaded jobs. The machine has 16 physical cores > on 2 sockets with HyperThreading enabled. I'm using the EASY > scheduling algorithm with backfilling. The goal

[slurm-dev] Re: slurm-dev Job not using cores from different nodes

2016-02-12 Thread Benjamin Redling
On 10.02.2016 at 09:04, Pierre Schneeberger wrote: > I submitted the job with sbatch and the following command: > #!/bin/bash > #SBATCH -n 80 # number of cores > #SBATCH -o > /mnt/nfs/bio/HPC_related_material/Jobs_STDOUT_logs/slurm.%N.%j.out # STDOUT > #SBATCH -e > /mnt/nfs/bio/HPC_related_materi

[slurm-dev] Re: job can not requeue after preempted

2016-03-18 Thread Benjamin Redling
On 03/17/2016 04:01, 温圣召 wrote: > The preempted job1 show a PD reason of BeginTime > my job invocation at the info of them as follow: > [root@szwg]# sbatch --gres=gpu:4 -N 1 --partition=low mybatch.sh You request _4_ GPUs and 1 node. Your config says each node has Gres=gpu:2 > Submitted b

[slurm-dev] Re: job can not requeue after preempted

2016-03-19 Thread Benjamin Redling
On 2016-03-16 13:54, 温圣召 wrote: > my job ... can not be requeue when it preempted ... Can you please post the job invocation too? Does the preempted job1 show a PD reason (%R) in the queue? Regards, Benjamin -- FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html vox: +49 3641 9 44323 | fax: +49

[slurm-dev] Re: job can not requeue after preempted

2016-03-19 Thread Benjamin Redling
On 03/17/2016 04:01, 温圣召 wrote: > The preempted job1 show a PD reason of BeginTime > my job invocation at the info of them as follow: > [root@szwg]# sbatch --gres=gpu:4 -N 1 --partition=low mybatch.sh You request _4_ GPUs and 1 node. Your config says each node has Gres=gpu:2 > Submitted b

[slurm-dev] Re: Regards Postgres Plugin for SLURM

2016-03-21 Thread Benjamin Redling
+1 have been bitten by MySQL many times in my career. * Constraints silently ignored (luckily MyISAM is long gone) * Dumps that are coming from one instance not going back into the same * Huge BLOBs that never get out via a dump * BLOBs destroying the whole integrity of a DB ... I don't trust it

[slurm-dev] Re: Slurm error

2016-03-22 Thread Benjamin Redling
Hi Helmi Azizan, On 03/22/2016 05:11, Helmi Azizan wrote: > I created a new slurm.conf using the easy configurator but am still facing > the following error: hopefully the correct version, fitting to the 2.6 version you are using. > helmi@Dellrackmount:~$ srun -N1 /bin/hostname > srun: error: U

[slurm-dev] Re: Multiple Python versions

2016-03-24 Thread Benjamin Redling
+1 If a user needs a newer Python version and a paper submission is due I don't want to be a blocker as an admin. The conservative packages of server distributions are of no value to my users. They are free to run from their homes whatever they need, otherwise the clusters would be pointless! BR

[slurm-dev] slurm-dev please fix your quoting first (Re: Slurm Error)

2016-03-24 Thread Benjamin Redling
On 2016-03-24 04:25, Helmi Azizan wrote: You wrote > https://groups.google.com/d/msg/slurm-devel/LXmU3BoWGQw/ULqmA85qKAAJ I wrote: > hopefully the correct version, fitting to the 2.6 version you are using. You wrote: >> helmi@Dellrackmount:~$ srun -N1 /bin/hostname >> srun: error: Unable to all

[slurm-dev] Re: Overview of jobs in the cluster

2016-03-24 Thread Benjamin Redling
On 2016-03-24 13:59, Diego Zuccato wrote: > Is there an equivalent of torque's pbstop for SLURM? There are a lot of "rosetta" websites for workload schedulers. Most of the time Slurm, Torque, PBS and Sun Gridengine & variants are listed. > I already tried slurmtop, but it seems something is not

[slurm-dev] Re: Overview of jobs in the cluster

2016-03-25 Thread Benjamin Redling
On 2016-03-24 17:22, John DeSantis wrote: >>> What I'm looking for is a tool that gives me, for every node/cpu the >>> corresponding job. >> >> squeue -n >> >> As the man page explicitly mentions: can be a single node and >> either a NodeName or a NodeHostname > > I believe this is a typo, as w

[slurm-dev] Re: Overview of jobs in the cluster

2016-03-25 Thread Benjamin Redling
15.08 is in Debian testing. A bit risky but I would have a look with pinning what else would need an upgrade as a dependency. BR On 25 March 2016 11:01:20 CET, Diego Zuccato wrote: > >On 25/03/2016 09:59, Diego Zuccato wrote: > >> I'm using SLURM 14.03.9 (the one packaged in Debian 8) an

[slurm-dev] Re: PySlurm for SLURM 2.3.2 API

2016-04-14 Thread Benjamin Redling
On 04/14/2016 11:08, Naajil Aamir wrote: > Hi hope you are doing well. I am currently working on a scheduling policy > of slurm 2.3.2 for that i need *PYSLURM* version that is compatible with > slurm 2.3.3 which i am unable to find on internet. It would be a great help > if you could provide a lin

[slurm-dev] Re: scontrol update not allowing jobs

2016-04-15 Thread Benjamin Redling
On 04/15/2016 16:22, Glen MacLachlan wrote: > I tried that already by leaving the field blank as in "flags=" but that > has no effect. Should I change it to something else? I set my nodes to State=IDLE after maintenance (from DOWN, DRAIN/DOWN). Depending on your cases you might have to look at

[slurm-dev] Re: scontrol update not allowing jobs

2016-04-15 Thread Benjamin Redling
On 2016-04-15 16:54, Benjamin Redling wrote: > > On 04/15/2016 16:22, Glen MacLachlan wrote: >> I tried that already by leaving the field blank as in "flags=" but that >> has no effect. Should I change it to something else? > > I set my nodes to State=IDLE a

[slurm-dev] Re: More than one job/task per node?

2016-04-29 Thread Benjamin Redling
On 2016-04-29 07:36, Lachlan Musicman wrote: > I'm finding this a little confusing. > > We have a very simple script we are using to test/train staff how to use > SLURM (16.05-pre2). They are moving from an old Torque/Maui system. > > I have a test partition set up, > > from slurm.conf > > Nod

[slurm-dev] Re: How to get command of a running/pending job

2016-05-13 Thread Benjamin Redling
On 2016-05-13 05:58, Husen R wrote: > Does slurm provide feature to get command that being executed/will be > executed by running/pending jobs ? scontrol show --detail job or scontrol show -d job Benjamin -- FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html vox: +49 3641 9 44323 | fax: +49 3

[slurm-dev] Re: How to get command of a running/pending job

2016-05-17 Thread Benjamin Redling
On 05/17/2016 10:02, Loris Bennett wrote: > > Benjamin Redling > writes: > >> On 2016-05-13 05:58, Husen R wrote: >>> Does slurm provide feature to get command that being executed/will be >>> executed by running/pending jobs ? >> >> scontrol

[slurm-dev] Re: How to get command of a running/pending job

2016-05-17 Thread Benjamin Redling
On 2016-05-17 12:28, Carlos Fenoy wrote: > On Tue, May 17, 2016 at 10:02 AM, Loris Bennett > wrote: [...] >> Which version does this? 15.08.8 just seems to show the 'Command' entry, >> which is the file containing the actual command. > You will only see the script in the output of the scontrol

[slurm-dev] Re: How to get command of a running/pending job

2016-05-17 Thread Benjamin Redling
On 2016-05-17 12:19, Loris Bennett wrote: > > Benjamin Redling > writes: > >> On 05/17/2016 10:02, Loris Bennett wrote: >>> >>> Benjamin Redling >>> writes: >>> >>>> On 2016-05-13 05:58, Husen R wrote: [...] >>>

[slurm-dev] Re: NFSv4

2016-05-26 Thread Benjamin Redling
Hi, On 05/25/2016 13:21, Mike Johnson wrote: > I know this is a long-standing question, but thought it was worth > asking. I am in an environment that uses NFSv4, which obviously needs > user credentials to grant access to filesystems. Has anyone else > tackled the issue of unattended batch job

[slurm-dev] Re: Munge time error, but from WHICH node?

2016-05-26 Thread Benjamin Redling
On 05/26/2016 12:16, Per Lönnborg wrote: > Example from logfile below. LOTS of info saying that one or several > nodes has incorrect time. I want to see which node(s)! > Of course I can ask all nodes about the time, but it's a bit dull. Even > if we do it in parallel. A monitoring application i

[slurm-dev] Re: Processes sharing cores

2016-06-03 Thread Benjamin Redling
On 2016-06-03 21:25, Jason Bacon wrote: > It might be worth mentioning that the calcpi-parallel jobs are run with > --array (no srun). > > Disabling the task/affinity plugin and using "mpirun --bind-to core" > works around the issue. The MPI processes bind to specific cores and > the embarrassin

[slurm-dev] Re: How to setup node sequence

2016-06-13 Thread Benjamin Redling
On 06/13/2016 09:50, Husen R wrote: > Hi all, > > How to setup node sequence/order in slurm ? > I configured nodes in slurm.conf like this -> Nodes = head,compute,spare. > > Using that configuration, if I use one node in my job, I hope slurm will > choose head as computing node (as it is in a

[slurm-dev] Re: Allocation Error

2016-06-14 Thread Benjamin Redling
Hi, On 2016-06-14 20:19, Martin Kohn wrote: > As you can see even with an job array only one job runs. Below you can find > the script I submit and my configuration. > SchedulerType=sched/buildin > #SchedulerType=sched/backfill > #SchedulerPort=7321 > #SelectType=select/linear > SelectType=sele

[slurm-dev] Re: Allocation Error

2016-06-14 Thread Benjamin Redling
Hi, On 2016-06-14 20:19, Martin Kohn wrote: > As you can see even with an job array only one job runs. Below you can find > the script I submit and my configuration. > SchedulerType=sched/buildin > #SchedulerType=sched/backfill > #SchedulerPort=7321 > #SelectType=select/linear > SelectType=sele

[slurm-dev] Re: Allocation Error

2016-06-16 Thread Benjamin Redling
Hello Martin, On 2016-06-16 20:21, Martin Kohn wrote: > Hello Benjamin, > > thanks for your answer, I tried it to set SelectTypeParameters=CR_Core but no > success. > > Good know at least for me is that it is the default behavior from slurm to > take the entire node. As I'm coming from Torque

[slurm-dev] Re: how does slurm choose node to allocate ? how to modify this strategie ?

2016-07-06 Thread Benjamin Redling
Hi, On 07/06/2016 11:17, Laurent Facq wrote: > i would like to use only one partition with the 80 nodes, > and that users who need OPA nodes could add a constraint "OPA+IB" to > choose OPA+IB nodes > and, that users who dont need OPA are given IB nodes if some are free, > and OPA+IB nodes ONLY if

[slurm-dev] Re: Single CPU job consuming 8 CPUs

2016-07-12 Thread Benjamin Redling
Hi Yuri, On 2016-07-12 20:53, Yuri wrote: > In slurm.conf I have CPUs=4 for each node (but each node actually has a > Intel Core i7). My question is: why is slurm assigning only one job per > node and each job is consuming 8 CPUs? considering that you only provide "CPU... for each node" the sbat

[slurm-dev] Re: Oversubscription and running job priority

2016-07-26 Thread Benjamin Redling
Hi, On 2016-07-25 22:46, Joshua Baker-LePain wrote: > I think that my initial question was too complex/detailed. Let me ask a > more open-ended one. Do folks have any strategies they'd like to share > on partition setups that favor paying customers while also allowing for > usage of spare resou

[slurm-dev] Re: A question about SLURM environment

2016-07-29 Thread Benjamin Redling
On 2016-07-29 11:15, Kolodiev, Vladimir wrote: > Hello, > I am Vladimir Kolodiev, a SW engineer from Intel Corp. > > I work with SLURM now and I have a question about SLURM_STEP_NODELIST and > SLURM_TASKS_PER_NODE formats. > > I understood that their formats are "hostA[1-18,22],hostB,hostC[001-
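The two formats asked about above can be illustrated with a small sketch. It assumes the compressed forms Slurm normally emits: a bracketed host list such as "hostA[1-3,22],hostB" and a task count list such as "2(x3),1". The function names are made up for this example, and the host list handling is deliberately simplified (one bracket group per name). On a real cluster, `scontrol show hostnames` does the host list expansion for you.

```python
import re
from typing import List


def split_top_level(text: str) -> List[str]:
    """Split on commas that are not inside a [...] group."""
    parts, depth, current = [], 0, []
    for char in text:
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
        if char == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    parts.append("".join(current))
    return parts


def expand_hostlist(text: str) -> List[str]:
    """Expand e.g. "hostA[1-3,22],hostB" (simplified: one bracket group per name)."""
    hosts = []
    for part in split_top_level(text):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\](.*)", part)
        if match is None:
            hosts.append(part)  # plain host name without a range
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            low, _, high = item.partition("-")
            high = high or low  # single index such as "22"
            width = len(low)  # keep zero padding such as "001"
            for index in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{index:0{width}d}{suffix}")
    return hosts


def expand_tasks_per_node(value: str) -> List[int]:
    """Expand e.g. "2(x3),1" into one task count per node."""
    counts = []
    for item in value.split(","):
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item.strip())
        if match is None:
            raise ValueError(f"unexpected item: {item!r}")
        tasks = int(match.group(1))
        repeat = int(match.group(2) or 1)
        counts.extend([tasks] * repeat)
    return counts
```

Pairing the two results position by position gives the number of tasks per host, provided both variables describe the same job step.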

[slurm-dev] Re: configuring slurm nodes to use less than the total number of processes on a machine

2016-08-24 Thread Benjamin Redling
Hi, On 08/23/2016 22:29, Tom G wrote: > I have some slurm nodes with 8 core processors and hyperthreading, so 16 > CPUs in effect. I'd like to restrict slurm to only use 12 CPUs on this > machine. What are the right slurm.conf settings to do this? > Doing 8 or 16 CPUs seems straightforward sin

[slurm-dev] Re: Multiple simultaneous jobs on a single node on SLURM 15.08.

2016-08-30 Thread Benjamin Redling
Hi, I didn't see an answer so far, so I try to reason: On 08/29/2016 19:40, Luis Torres wrote: > We have recently deployed SLURM v 15.08.7-build1 on Ubuntu 16.04 > submission and execution nodes with apt-get; we built and installed the > source packages of the same release on Ubuntu 14.04 for th

[slurm-dev] Re: resource usage, TRES and --exclusive option

2016-09-01 Thread Benjamin Redling
Hi On 09/01/2016 10:16, Christof Koehler wrote: > Now, the point we are not sure about is what happens if a user allocates > 10 out of 40 and sets "--exclusive" (if possible). Is the usage of that > user (job) actually computed with 40 CPUs as most people would expect ? > As described before oth

[slurm-dev] Re: single node workstation

2016-09-09 Thread Benjamin Redling
Hi, I think your case is mentioned in the FAQ Q30 in the "NOTE". -- according to this you set CR_CPU and "CPU" only; no cores, no threads, ...: http://slurm.schedmd.com/faq.html [...] 30. Slurm documentation refers to CPUs, cores and threads. What exactly is considered a CPU? If your nodes are

[slurm-dev] Re: Configuring slurm to use all CPUs on a node

2016-09-13 Thread Benjamin Redling
On 09/12/2016 16:48, Uwe Sauter wrote: > > Try SelectTypeParameters=CR_Core instead of CR_CPU That alone is not sufficient: http://slurm.schedmd.com/faq.html#cpu_count BR -- FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html vox: +49 3641 9 44323 | fax: +49 3641 9 44321

[slurm-dev] Re: Configuring slurm to use all CPUs on a node

2016-09-13 Thread Benjamin Redling
On 09/12/2016 16:55, Uwe Sauter wrote: > > Also. CPUs=32 is wrong. You need > > Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Setting "CPU" is not wrong according to the FAQ: http://slurm.schedmd.com/faq.html#cpu_count BR -- FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html vox: +49 3641 9 443
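For readers comparing the two styles of node definition: the explicit topology quoted above and a plain CPU count describe the same number of logical CPUs, since 2 sockets x 8 cores x 2 threads = 32. A minimal sketch of that arithmetic follows. The node line and the helper name are hypothetical, and treating a missing key as 1 is an assumption of this example, not a statement about Slurm's defaults.

```python
import re


def logical_cpus_from_node_line(line: str) -> int:
    """Multiply Sockets, CoresPerSocket and ThreadsPerCore found in a node line.

    A key that is absent is treated as 1 (an assumption of this sketch).
    """
    values = dict(re.findall(r"(\w+)=(\S+)", line))
    sockets = int(values.get("Sockets", 1))
    cores = int(values.get("CoresPerSocket", 1))
    threads = int(values.get("ThreadsPerCore", 1))
    return sockets * cores * threads


# Hypothetical node definition using the values from the quoted message.
example = "NodeName=node01 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2"
print(logical_cpus_from_node_line(example))
```

Which of the two forms is appropriate together with a given SelectTypeParameters value is what the linked FAQ entry covers.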

[slurm-dev] Re: Configuring slurm to use all CPUs on a node

2016-09-13 Thread Benjamin Redling
On 09/12/2016 18:57, andrealphus wrote: > It doesnt seem like changing it to a different resource allocation > method makes a difference, and almost seems buggy to me, but I guess > is just a quirk of multithread systems. Your issue ("using all hyperthreads") was discussed multiple times on the l

[slurm-dev] Re: Slurm Drain/Down Issue

2016-10-04 Thread Benjamin Redling
ThreadsPerCore should be 1, you set it to 4. BR, Benjamin On 4 October 2016 16:41:33 CEST, evan clark wrote: > >I am not sure if this is the correct place to share this, but maybe >someone can point me in the correct directions. I recently setup a >Centos 7 based slurm cluster, however my nodes

[slurm-dev] Re: set maximum CPU usage per user

2016-10-19 Thread Benjamin Redling
Hi, what are your AccountingStorage settings? Esp. AccountingStorageEnforce. Did limits work before, or is this a first try? Regards, Benjamin On 19 October 2016 22:14:27 CEST, Steven Lo wrote: > > >By the way, we do have the following attribute set: > > PriorityType=priority/multifactor

[slurm-dev] Re: set maximum CPU usage per user

2016-10-20 Thread Benjamin Redling
Hi Steven, On 10/20/2016 00:22, Steven Lo wrote: > We have the attribute commented out: > #AccountingStorageEnforce=0 I think the best is to (re)visit "Accounting and Resource Limits": http://slurm.schedmd.com/accounting.html Right now I have no setup that needs accounting but as far as I curr

[slurm-dev] Re: set maximum CPU usage per user

2016-10-24 Thread Benjamin Redling
Hi, On 10/21/2016 18:58, Steven Lo wrote: > Is MaxTRESPerUser a better option to use? if you only ever want to restrict every user alike, that seems reasonable. I would choose whatever fits your needs right now and in the not so distant future. That way you gain time to learn about the options s

[slurm-dev] RE: slurm_load_partitions: Unable to contact slurm controller (connect failure)

2016-10-25 Thread Benjamin Redling
Hi, are you both working on the same cluster as the OP? On 10/25/2016 08:12, suprita.bot...@wipro.com wrote: > I have installed slurm on a 2 node cluster. > > On the master node when I run sinfo command I get below output. [...] > But on compute node:Slurmd daemon is also running but it gives

[slurm-dev] Re: Set Limit Time Per Job

2016-10-26 Thread Benjamin Redling
Hi, On 26.10.2016 at 15:35, Achi Hamza wrote: > But when i run a job more than 3 minutes it does not stop, like: > srun -n1 sleep 300 > > I also set MaxWall parameter but to no avail: > sacctmgr show qos format=MaxWall > MaxWall > --- > 00:03:00 > > Please advice where

[slurm-dev] Re: Job steps facility as in LoadLeveler?

2016-10-26 Thread Benjamin Redling
Hi, On 25.10.2016 at 13:20, Patrice Peterson wrote: > is there a built-in way to queue LoadLeveler-like job steps in SLURM? > Something like this: > > #!/bin/bash > #SBATCH --num-tasks=1 > echo "prepping data, simple stuff" > #SBATCH --- END STEP > > #SBATCH --num-tasks

[slurm-dev] Re: poor utilization, jobs not being scheduled

2016-10-29 Thread Benjamin Redling
Are you allowed and able to post the slurm.conf? What does sprio -o %Q -j say about that job? BR, Benjamin Am 29. Oktober 2016 20:56:18 MESZ, schrieb Vlad Firoiu : >I'm trying to figure out why utilization is low on our university >cluster. >It appears that many cores are available, but a minima

[slurm-dev] Re: poor utilization, jobs not being scheduled

2016-10-30 Thread Benjamin Redling
tyMaxAge=1-0 > >The particular job in question has 0 priority. > >On Sat, Oct 29, 2016 at 7:50 PM, Benjamin Redling < >benjamin.ra...@uni-jena.de> wrote: > >> Are you allowed and able to post the slurm.conf? What does sprio -o >%Q -j >> say about that job? BR, B

[slurm-dev] Re: poor utilization, jobs not being scheduled

2016-10-30 Thread Benjamin Redling
On 31.10.2016 at 00:23, Benjamin Redling wrote: > Are you aware that as long as SchedulerType Sorry, typo. I meant *SelectType* (The rest I wrote next is just unfiltered noise from my brain while skimming the conf:) >is not set to anything explicitly, select/linear is the default? &

[slurm-dev] Re: poor utilization, jobs not being scheduled

2016-10-30 Thread Benjamin Redling
On 31.10.2016 at 00:47, Vlad Firoiu wrote: > What do you mean the ScheduleType is not explicit? I see > `SchedulerType=sched/backfill`. (I don't know too much about slurm so I > am probably misunderstanding something.) Vlad, you are right: SchedulerType _is_ set. I meant SelectType. See my other

[slurm-dev] Re: How to distribute each task of an array job to a different compute node?

2016-11-02 Thread Benjamin Redling
Hi, try adding "-N 10" to explicitly ask for ten nodes too. If you have access to the slurm.conf and don't have to share the cluster, or share it with people with the same needs, you might like SelectTypeParameters=CR_LLN or LLN on a per partition basis. (see http://slurm.schedmd.com/slurm.conf)

[slurm-dev] Re: SLURM reports much higher memory usage than really used

2016-12-15 Thread Benjamin Redling
On 15 December 2016 14:48:24 CET, Stefan Doerr wrote: >$ sinfo --version >slurm 15.08.11 > >$ sacct --format="CPUTime,MaxRSS" -j 72491 > CPUTime MaxRSS >-- -- > 00:27:06 > 00:27:06 37316236K > > >I will have to ask the sysadms about cgroups since I'm just a user >here.

[slurm-dev] Re: Error showing in slurmd daemon startup

2016-12-24 Thread Benjamin Redling
On 24 December 2016 04:43:36 CET, Will Dennis wrote: >I see the following in the systemctl status output of the slurmd service >on my compute nodes: > >Dec 23 21:31:58 host01 slurmd[32101]: error: You are using cons_res or >gang scheduling with Fastschedule=0 and node configuration differs from

[slurm-dev] Re: Error showing in slurmd daemon startup

2016-12-24 Thread Benjamin Redling
Hi Will, On 24.12.2016 at 21:10, Will Dennis wrote: > Thanks for helping to interpret the error message… Clear enough to me now. You're welcome! I was a bit brief because I was on my mobile. > I was told (by one of my researchers) that setting “FastSchedule=0” would > "tell Slurm to get the h

[slurm-dev] Re: How to sent jobs for all nodes automatically

2017-03-06 Thread Benjamin Redling
Hi David, On 06.03.2017 at 12:05, David Ramírez wrote: > I have little problem. Slurm allocated job allocated nodes (When a nodes > is full, sent job to next one). > > I need use all nodes without order (customer like that) I don't know "without order", but you can spread the load with "least

[slurm-dev] RE: MaxJobs on association not being respected

2017-03-16 Thread Benjamin Redling
Hello Will, On 2017-03-15 18:13, Will Dennis wrote: > Here are their definitions in slurm.conf: > > # PARTITIONS > PartitionName=batch Nodes=[nodelist] Default=YES DefMemPerCPU=2048 > DefaultTime=01:00:00 MaxTime=05:00:00 PriorityTier=100 PreemptMode=off > State=UP > PartitionName=long Nodes=[

[slurm-dev] RE: MaxJobs on association not being respected

2017-03-17 Thread Benjamin Redling
Re hi, On 2017-03-17 03:01, Will Dennis wrote: > My slurm.conf: > https://paste.fedoraproject.org/paste/RedFSPXVlR2auRlevS5t~F5M1UNdIGYhyRLivL9gydE=/raw > >> Are you sure the current running config is the one in the file? >> Did you double check via "scontrol show config" > > Yes, all params se

[slurm-dev] Re: Fwd: Dependency Problem In Full Queue

2017-03-17 Thread Benjamin Redling
Good examples: https://hpc.nih.gov/docs/job_dependencies.html BR On 2017-03-15 17:37, Álvaro pc wrote: > Hi again! > > I would really like to know about the behaviour of --dependency argument.. > > Nobody know anything? > > *Álvaro Ponce Cabrera.* > > > 2017-03-14 12:31 GMT+01:00 Álvaro pc

[slurm-dev] Re: Scheduling jobs according to the CPU load

2017-03-19 Thread Benjamin Redling
On 19.03.2017 at 15:36, kesim wrote: > ... I only want to find > the solution for the trivial problem. I also think that slurm was designed > for HPC and it is performing well in such env. I agree with you that my > env. hardly qualifies as HPC but still one of the simplest concept > behind any sch

[slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

2017-03-21 Thread Benjamin Redling
Hi, if you don't want to depend on the whitespaces in the output of "uptime" (the number of fields depends on a locale) you can improve that via "awk '{print $3}' /proc/loadavg" (for the 15min avg) -- it's always better to avoid programmatically accessing output made for humans as long as possibl
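The same idea can be sketched in Python: read the fixed, whitespace-separated fields of /proc/loadavg rather than the human-oriented output of "uptime". The first three fields are the 1, 5 and 15 minute load averages, so the third one matches the awk call above. The function names are made up for this example, and the file exists only on Linux.

```python
from typing import Tuple


def parse_loadavg(text: str) -> Tuple[float, float, float]:
    """Return the 1, 5 and 15 minute load averages from a /proc/loadavg line."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def read_loadavg(path: str = "/proc/loadavg") -> Tuple[float, float, float]:
    """Read and parse the load averages from the given file (Linux only)."""
    with open(path) as handle:
        return parse_loadavg(handle.read())
```

Calling read_loadavg()[2] gives the 15 minute value on a Linux host. The standard library's os.getloadavg() returns the same three numbers without touching the file directly.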

[slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

2017-03-21 Thread Benjamin Redling
re hi, your script will occasionally fail because the number of fields in the output of "uptime" is variable. I was reminded by this one: http://stackoverflow.com/questions/11735211/get-last-five-minutes-load-average-using-ksh-with-uptime All the more reason to use /proc... Regards, Benjamin Am

[slurm-dev] Re: Re:Best Way to Schedule Jobs based on predetermined Lists

2017-04-05 Thread Benjamin Redling
On 05.04.2017 at 15:58, maviko.wag...@fau.de wrote: [...] > The purpose of this cluster is to investigate how smart distribution of > workloads based on predetermined performance and energy data can benefit > hpc-clusters that consist of heterogenous systems that differ greatly > regarding energy

[slurm-dev] Re: LDAP required?

2017-04-11 Thread Benjamin Redling
On 11 April 2017 08:21:31 CEST, Uwe Sauter wrote: > >Ray, > >if you're going with the easy "copy" method just be sure that the nodes >are all in the same state (user management-wise) before >you do your first copy. Otherwise you might accidentally delete already >existing users. > >I also enco

[slurm-dev] Re: LDAP required?

2017-04-11 Thread Benjamin Redling
AFAIK most requests never hit LDAP servers. In production there is always a cache on the client side -- nscd might have issues, but that's another story. Regards, Benjamin On 2017-04-11 15:32, Grigory Shamov wrote: > On a larger cluster, deploying NIS, LDAP etc. might require some > thought, becau

[slurm-dev] Re: Slurm with Torque

2017-04-16 Thread Benjamin Redling
Hello Mahmood, On 16.04.2017 at 16:11, Mahmood Naderan wrote: > Hi, > Currently, Torque is running on our cluster. I want to know, is it > possible to install Slurm, create some test partitions, submit some test > jobs and be sure that it is working while Torque is running? > Then we are able to

[slurm-dev] Re: SLURM terminating jobs before they finish

2017-04-17 Thread Benjamin Redling
Hi Batsirai, On 17.04.2017 at 14:54, Batsirai Mabvakure wrote: > SLURM had been running okay until recently; now my jobs are terminating before > they finish. > I have tried increasing memory using --mem, but still the jobs stop halfway with an error in the slurm.out file. > I then tried running ag

[slurm-dev] Re: Multinode MATLAB jobs

2017-05-31 Thread Benjamin Redling
Hi, On 31.05.2017 at 10:39, Loris Bennett wrote: > Does anyone know whether one can run multinode MATLAB jobs with Slurm > using only the Distributed Computing Toolbox? Or do I need to be > running a Distributed Computing Server too? if you can get your hands on the overpriced and underwhelming D

[slurm-dev] Re: Multifactor priority plugin

2017-06-06 Thread Benjamin Redling
Hello Sourabh, On 2017-06-06 10:52, sourabh shinde wrote: > Problem : > As per my understanding, high priority jobs are executed first and take > all of the available nodes. > I need at least one low or normal priority job to be executed in > parallel with the high priority jobs. I want

[slurm-dev] Re: Announcing Slurm Job Packs

2017-07-14 Thread Benjamin Redling
On 2017-07-13 18:51, Perry, Martin wrote: > This email is to announce the latest version of the job packs feature > (heterogeneous resources and MPI-MPMD tight integration support) as > open-source code. [...] > The code can be cloned from this branch: > https://github.com/RJMS-Bull/slurm/tree/dev

[slurm-dev] Re: Slurm with High Availabilty/Automatic failover

2017-07-26 Thread Benjamin Redling
Hello, On 25.07.2017 at 16:19, J. Smith wrote: > Does anyone have any suggestions for setting up high availability and > automatic failover between two servers that run a Controller daemon, > Database daemon and MySQL Database (i.e. replication vs. Galera cluster)? > > Any input would be appreciated

[slurm-dev] Re: Reason for job state CANCELLED

2017-07-29 Thread Benjamin Redling
On 29 July 2017 08:07:44 CEST, Florian Pommerening wrote: > >Hi everyone, > >is there a way to find out why a job was canceled by slurm? I would >like >to distinguish the cases where a resource limit was hit from all other >reasons (like a manual cancellation). In case a resource limit was

[slurm-dev] Re: Can you run SLURM on a single node ?

2017-08-10 Thread Benjamin Redling
On 10 August 2017 13:47:21 CEST, Sean McGrath wrote: > >Yes, you can run slurm on a single node. There is no need for a >different >head and compute node(s). > >You will need to set Shared=Yes if you want multiple people to be able >to run on >the machine simultaneously. > >The slurm.conf

[slurm-dev] Re: Fwd: using slurm at diffrent VLANS

2017-08-24 Thread Benjamin Redling
On 2017-08-24 09:18, nir wrote: [...] > slurm server ip 192.168.10.1 > compute nodes 10.2.2.3-40 > > Until yesterday the compute nodes were in the same VLAN as the slurm server, > but I had to move them to a new VLAN. > After I moved them there is ping connectivity between the slurm server and the > compute no

[slurm-dev] Re: Fwd: using slurm at diffrent VLANS

2017-08-24 Thread Benjamin Redling
Hi again, On 2017-08-24 12:55, nir wrote: > Thank you for your answer. > Yes I went over this guide. > I didn't find any problem since the compute nodes communicate with the slurm server. if you did so, what does "scontrol show node " give as a reason for "DOWN"? BR, BR -- FSU Jena | JULIELab.de/Staff/Ben
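For anyone checking many nodes at once, the reason can be pulled out of the "scontrol show node" text with a small helper. This is a minimal sketch that assumes the output contains a "Reason=" field for a DOWN node; the sample text below is abbreviated and made up for illustration, not real output from the poster's cluster, and the node name is hypothetical.

```python
import re


def down_reason(scontrol_output):
    """Return the text after 'Reason=' up to the end of that line, or None."""
    match = re.search(r"Reason=(.*)", scontrol_output)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    # Abbreviated, made-up sample in the key=value style scontrol prints.
    sample = (
        "NodeName=node01 CPUTot=8\n"
        "   State=DOWN ThreadsPerCore=1\n"
        "   Reason=Not responding [slurm@2017-08-24T10:00:00]\n"
    )
    print(down_reason(sample))
```

In practice the text would come from running "scontrol show node" for the node in question, for example through subprocess.run with capture_output=True.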

[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-13 Thread Benjamin Redling
On 13.09.2017 02:56, Christopher Samuel wrote: On 13/09/17 10:47, Lachlan Musicman wrote: Chris how does this sacrifice performance? If none of my software (bioinformatics/perl) is HT, surely I'm sacrificing capacity by leaving one thread unused as jobs take an entire core? A HT is not a cor

[slurm-dev] Re: How user can requeue an old job?

2017-09-14 Thread Benjamin Redling
On 14.09.2017 10:52, Taras Shapovalov wrote: Hey guys! As far as I know there is now a built-in 5 min time interval after a job is finished, which leads to the job's removal from Slurm "memory" (not from accounting). This is ok until users need to requeue the job for some reason. Thus if 5 mi

[slurm-dev] Re: How user can requeue an old job?

2017-09-14 Thread Benjamin Redling
On 14.09.2017 11:12, Merlin Hartley wrote: > I wonder: what would be the ramifications of setting this to 0 in production? "A value of zero prevents any job record purging" > Or is that option only really there for debugging? (just guessing) should be horrible: once "MaxJobCount" (see slurm.con

[slurm-dev] Re: Setting up Environment Modules package

2017-10-05 Thread Benjamin Redling
Hello Mike, On 10/4/17 6:10 PM, Mike Cammilleri wrote: I'm in search of a best practice for setting up Environment Modules for our Slurm 16.05.6 installation (we have not had the time to upgrade to 17.02 yet). We're a small group and had no explicit need for this in the beginning, but as we
