then to let PMIx handle pmix solely and let slurm handle the rest. Thanks!
Am I right in reading that you don't have to build slurm against PMIx?
So it interoperates fine if you just have it installed and
specify pmix as the launch option? That's neat.
-Paul Edmon-
On 11/28/2017 6
is the right way of building PMIx and Slurm such that they
interoperate properly?
Suffice it to say, little to no documentation exists on how to properly do
this, so any guidance would be much appreciated.
-Paul Edmon-
is substantial, thus the lag crossing back and forth can add up. I
would check to see if all your nodes can talk to each other and the
master, and that your timeouts are set high enough.
-Paul Edmon-
On 12/04/2017 01:57 PM, Stradling, Alden Reid (ars9ac) wrote:
I have a number of nodes that have, after our
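A sketch of the slurm.conf timeout knobs in question; the parameter names
are real but the values here are illustrative assumptions, not
recommendations:

    # slurm.conf
    SlurmdTimeout=300    # secs slurmctld waits on an unresponsive slurmd
                         # before marking the node down
    MessageTimeout=30    # secs allowed for an RPC round trip (default 10)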
There is a spank x11 plugin that I think pretty much everyone used:
https://github.com/hautreux/slurm-spank-x11
-Paul Edmon-
On 05/14/2018 02:44 PM, Mahmood Naderan wrote:
Hi,
I see --x11 option in [1], but there isn't any such option. Is that
for old versions? Also, there is a wrapper [2
Assuming you can build slurm and its dependencies this should work.
We've run slurm here with different OS's on various nodes for a while
and it works fine. That said I haven't tried odroids so I can't speak
specifically to that.
-Paul Edmon-
On 05/10/2018 08:26 AM, agostino bruno wrote
Not that I am aware of. Since the headers aren't really part of the
script, bash doesn't evaluate them, as far as I know.
-Paul Edmon-
On 05/10/2018 09:19 AM, Dmitri Chebotarov wrote:
Hello
Is it possible to access environment variables in a submit script?
E.g. $SCRATCH is set to a path and I
to
limit usage.
-Paul Edmon-
On 05/08/2018 10:08 AM, Renfro, Michael wrote:
That’s the first limit I placed on our cluster, and it has generally worked out
well (never used a job limit). A single account can get 1000 CPU-days in
whatever distribution they want. I’ve just added a root-only
If you are in SystemD land the command is:
systemctl restart slurmctld
-Paul Edmon-
On 06/05/2018 06:00 AM, Mahmood Naderan wrote:
Yes
Yes/No
:)
Regards,
Mahmood
On Tue, Jun 5, 2018 at 2:18 PM, Buckley, Ronan <mailto:ronan.buck...@dell.com>> wrote:
Hi All,
I need t
You will get whatever cores Slurm can find, which will be an assortment
of hosts.
-Paul Edmon-
On 6/20/2018 11:01 AM, Nathan Harper wrote:
sorry to hijack, but we've been considering a similar configuration,
but I was wondering what happens if you don't set a processor type
It sounds like your second partition is getting primarily scheduled by
the backfill scheduler. I would try the partition_job_depth option as
otherwise the main loop only looks at priority order and not by partition.
-Paul Edmon-
On 4/29/2018 5:32 AM, Zohar Roe MLM wrote:
Hello.
I am having
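A sketch of the option Paul suggests; the depth value is an illustrative
assumption:

    # slurm.conf -- have the main scheduling loop consider up to 100 jobs
    # from each partition, not just the highest-priority jobs overall
    SchedulerParameters=partition_job_depth=100

followed by "scontrol reconfigure" to pick up the change.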
jobs can't run due to some vagary in the logic
(typically because it thinks that it won't fit due to time constraints).
Anyways that's where I would start.
-Paul Edmon-
On 7/3/2018 5:22 PM, Christopher Benjamin Coffey wrote:
Hello!
We are having an issue with high priority gpu jobs blocking
script doesn't catch it.
-Paul Edmon-
On 1/15/2018 8:31 AM, John Hearns wrote:
Juan, my knee-jerk reaction is to say 'containerisation' here.
However I guess that means that Slurm would have to be able to inspect
the contents of a container, and I do not think that is possible.
I may be very
Yeah, in those situations I've found it best to have people wrap their
threaded programs in srun inside of sbatch. That way the scheduler
knows which process specifically gets the threading.
-Paul Edmon-
On 02/22/2018 10:39 AM, Loris Bennett wrote:
Hi Paul,
Paul Edmon <ped...@cfa.harvard.
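A minimal sketch of that wrapping pattern, with a placeholder program name
(my_threaded_app) and illustrative resource numbers:

    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -c 8                 # 8 CPUs for one multithreaded task
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    # launching through srun tells the scheduler exactly which process
    # owns the threads
    srun -c $SLURM_CPUS_PER_TASK ./my_threaded_app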
though so perhaps we avoided that particular query
due to that.
From past experience these major upgrades can take quite a bit of time
as they typically change a lot about the DB structure in between major
versions.
-Paul Edmon-
On 02/22/2018 06:17 AM, Malte Thoma wrote:
FYI:
* We broke our
Typically changes like this only impact pending or newly submitted
jobs. Running jobs usually are not impacted, though they will count
against any new restrictions that you put in place.
-Paul Edmon-
On 1/4/2018 6:44 AM, Juan A. Cordero Varelaq wrote:
Hi,
A couple of jobs have been
Restarting slurmd should be fine assuming they come back before the
communications time out. I restart slurmd's all the time and haven't
had any real problems.
-Paul Edmon-
On 7/27/2018 6:51 PM, Chris Harwell wrote:
It is possible, but double check your config for timeouts first.
On Fri
Generally it is best that they should be. Slurm maps the user's
environment into the job submission, so if things change in the OS
underneath it, it can lead to issues.
-Paul Edmon-
On 07/26/2018 12:39 PM, Liam Forbes wrote:
Morning All.
I'm attempting to set up a new submit host
So the recommendation I've gotten in the past is to use option number 4 from
this FAQ:
https://www.open-mpi.org/faq/?category=tuning#setting-mca-params
This works for both mpirun and srun in slurm because it's a flat file
that is read rather than options that are passed in.
-Paul Edmon-
On 07
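A sketch of that flat file (option 4 in the FAQ): one "name = value" pair
per line, read by OpenMPI at startup regardless of launcher. The parameter
shown is just an illustrative example:

    # $HOME/.openmpi/mca-params.conf
    # (or system-wide: $prefix/etc/openmpi-mca-params.conf)
    btl_tcp_if_include = eth0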
So there are different options you can set for ReturnToService in
slurm.conf which can affect how the node is handled on reconnect. You
can also up the timeouts for the daemons.
-Paul Edmon-
On 8/31/2018 5:06 PM, Renfro, Michael wrote:
Hey, folks. I’ve got a Slurm 17.02 cluster (RPMs
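A sketch of those settings; the values are illustrative:

    # slurm.conf
    ReturnToService=2    # a DOWN node becomes available again as soon as
                         # slurmd registers with a valid configuration
    SlurmdTimeout=600    # give nodes longer to respond before marked DOWN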
You could probably accomplish this using a job submit lua script and
some crafted QoS's. It would take some doing but I imagine it could work.
-Paul Edmon-
On 03/12/2018 02:46 PM, Keith Ball wrote:
Hi All,
We are looking to have time-based partitions; e.g. a"day" and "ni
I would recommend putting a cleanup process in your epilog script. We
have a check here that sees if the job completed, and if so it then
terminates all the user's processes with kill -9 to clean up any residuals.
If that fails it closes off the node so we can reboot it.
-Paul Edmon-
On 04/23
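A minimal sketch of that sort of epilog cleanup, assuming at most one job
per user on a node; the use of pkill and the drain reason are illustrative
choices, not Paul's actual script:

    #!/bin/bash
    # epilog: remove anything the job's owner left running
    if [ -n "$SLURM_JOB_USER" ] && [ "$SLURM_JOB_USER" != "root" ]; then
        pkill -9 -u "$SLURM_JOB_USER"
        if pgrep -u "$SLURM_JOB_USER" > /dev/null; then
            # kill -9 failed: close off the node so it can be rebooted
            scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN \
                Reason="unkillable processes after job $SLURM_JOB_ID"
        fi
    fi
    exit 0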
So if you use the showq utility it has functionality for that:
https://github.com/fasrc/slurm_showq
Happy to have contributors to this.
-Paul Edmon-
On 10/05/2018 09:56 AM, Alexandre Strube wrote:
Is there a way to show the actual position in the queue, given the
current priority? It’s
I'm not aware of one. This may be worth a feature request to the devs
at bugs.schedmd.com
-Paul Edmon-
On 10/16/18 7:29 AM, Antony Cleave wrote:
Hi All
Yes, I realise this is almost certainly the intended outcome. I have
wondered this for a long time but only recently got round to testing
in parallel jobs being distributed
across many nodes. Note that node *Weight* takes precedence over how
many idle resources are on each node. Also see the
*SelectParameters* configuration parameter *CR_LLN* to use the least
loaded nodes in every partition.
-Paul Edmon-
On 11/15/2018 4:25 AM
into the SchedMD guys to see if they have any more insight. Then again,
someone on this list might have seen the same issue.
-Paul Edmon-
On 11/7/18 10:20 AM, Scott Hazelhurst wrote:
Thanks, Paul, yes, it does seem a likely cause, but I can’t see the problem.
All machines have the same /etc/hosts file
rare though that we need to look back at that data.
-Paul Edmon-
On 10/01/2018 08:12 AM, Chris Samuel wrote:
On Saturday, 29 September 2018 1:18:24 AM AEST Ole Holm Nielsen wrote:
Does anyone have a good explanation of usage of the Archive and Purge
features for the Slurm database
restarting the service it
times out and the database only gets partially updated. In that case I
had to restore from the mysqldump I had made and try again. I also
highly recommend doing mysqldumps prior to major version updates.
-Paul Edmon-
On 09/25/2018 09:54 AM, Baker D.J. wrote
This is the idea behind XDMod's SUPReMM. It does generate a ton of data
though, so it does not scale to very active systems (i.e. churning over
tens of thousands of jobs).
https://github.com/ubccr/xdmod-supremm
-Paul Edmon-
On 12/9/2018 8:39 AM, Aravindh Sampathkumar wrote:
Hi All.
I
Your best bet is a LUA job submission script to strip these options from
the submissions.
-Paul Edmon-
On 11/27/18 11:48 AM, Aaron Jackson wrote:
Hi all,
I am wondering if it is possible to disable the use of the --nodelist
argument from srun/sbatch/salloc/etc? In the worst case I can just
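A sketch of the stripping approach in job_submit.lua; treat the req_nodes
field name as an assumption to verify against your Slurm version:

    -- job_submit.lua: silently drop any --nodelist request
    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.req_nodes ~= nil then
            slurm.log_user("--nodelist is disabled; ignoring requested nodes")
            job_desc.req_nodes = nil
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end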
I'm pretty sure that gres.conf has to be on all the nodes as well and
not just the master.
-Paul Edmon-
On 1/11/19 5:21 AM, Sean McGrath wrote:
Hi everyone,
Your help for this would be much appreciated please.
We have a cluster with 3 types of gpu configured in gres. Users can successfully
users and then map that in
to Slurm using sacctmgr.
It really depends on if your Slurm users are a subset of your regular
users or not.
-Paul Edmon-
On 9/12/2018 12:21 PM, Andre Torres wrote:
Hi all,
I’m new to slurm and I’m confused regarding user creation. I have an
installation
So the Lua script I posted only does it for people who submit to the
cluster. To do it for all users a simple bash script would work; I
don't have one put together, though.
-Paul Edmon-
On 09/13/2018 10:29 AM, Eric F. Alemany wrote:
Hi Paul
You said
“Another way would
Sure. Here is our lua script.
-Paul Edmon-
On 09/13/2018 07:28 AM, Andre Torres wrote:
That's interesting using AD to maintain uid consistency across all the nodes.
Like Loris, I'm also interested in your Lua script.
-
André
On 13/09/2018, 11:42, "slurm-users on behalf of Loris Be
only add them if they don't already exist so the impact is only
when new users appear.
-Paul Edmon-
On 09/13/2018 10:48 AM, Douglas Jacobsen wrote:
At one point in time we would also use the job_submit.lua to add
users, however, I cannot recommend it in general since job_submit runs
while
Users can control that:
https://slurm.schedmd.com/sbatch.html
-Paul Edmon-
On 09/13/2018 11:10 AM, Ariel Balter wrote:
Does anyone know how to change email settings?
On 9/13/2018 7:59 AM, Damien François wrote:
Just to add my 2c to the discussion: at our site, we use a utility we
wrote
several smaller purges. That at least worked for us
in the past.
-Paul Edmon-
On 4/4/19 9:38 AM, Julien Rey wrote:
Hello,
Our slurm accounting database is growing bigger and bigger (more than
100Gb) and is never being purged. We are running slurm 15.08.0-0pre1.
I would like to upgrade
just to see if there is any database work that was done.
-Paul Edmon-
On 4/5/19 9:05 AM, Julien Rey wrote:
Hi Paul, thanks for your advice. Actually I already tried what you
suggested. No matter what value I put after PurgeJobAfter, I always
end up with the same error:
sacctmgr archive dump Direc
a downtime for the dbd upgrade. That's not too bad though as we
pause all our jobs out of paranoia for upgrades.
-Paul Edmon-
On 3/1/19 8:10 AM, Ole Holm Nielsen wrote:
We're one of the many Slurm sites which run the slurmdbd database
daemon on the same server as the slurmctld daemon
A lot of this is automated in the new versions of slurm. You should
just need to run:
sacctmgr show runawayjobs
It will then give you an option to clean them and slurm will handle the
rest. If you add the -i option it will just clean them automatically.
-Paul Edmon-
On 3/6/2019 11:58 AM
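For reference, that boils down to:

    # list runaway jobs and be prompted to fix them
    sacctmgr show runawayjobs
    # fix them without prompting
    sacctmgr -i show runawayjobs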
Odds are the new version won't help for that. You will have to do some
mysql work to fix it then.
-Paul Edmon-
On 3/6/2019 1:23 PM, Brian Andrus wrote:
I am running the latest and did that, but it didn't change anything.
The jobs stay in the runaway state and no changes are made
We tried it once back when they first introduced it and shelved it after
we found that we didn't really need it.
-Paul Edmon-
On 3/4/19 2:26 PM, Christopher Samuel wrote:
Hi folks,
Anyone here tried Slurm's message aggregation (MsgAggregationParams in
slurm.conf) at all?
All the best
Exactly. The easiest way is just to underreport the amount of memory in
slurm. That way slurm will take care of it natively. We do this here as
well even though we have disks in order to make sure the OS has memory
left to run.
-Paul Edmon-
On 3/14/19 8:36 AM, Doug Meyer wrote:
We also run
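A sketch of the underreporting trick: in the node definition, declare less
RealMemory than the hardware actually has, so Slurm never hands that slice
to jobs. The node name and numbers are illustrative:

    # slurm.conf -- nodes have 192000 MB physically; keep ~8 GB for the OS
    NodeName=compute[001-100] CPUs=48 RealMemory=184000 State=UNKNOWN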
lua script. That would be my recommended method.
-Paul Edmon-
On 3/12/19 12:31 PM, David Baker wrote:
Hello,
I have set up a serial queue to run small jobs in the cluster.
Actually, I route jobs to this queue using the job_submit.lua script.
Any 1 node job using up to 20 cpus is routed
No. Jobs should continue as normal.
-Paul Edmon-
On 1/31/19 9:38 AM, Buckley, Ronan wrote:
Hi,
Does restarting the slurmctld daemon on a slurm head node affect
running slurm jobs on the compute nodes in any way?
Rgds
Nope, per the documentation you have to restart slurmctld to change
MaxJobCount.
-Paul Edmon-
On 1/31/19 5:58 AM, Buckley, Ronan wrote:
Hi,
I want to increase the MaxJobCount in the slurm.conf file from its
default value of 10,000. I want to increase it to 250,000.
The online
That should be it. It shouldn't impact running jobs.
-Paul Edmon-
On 1/29/19 5:47 AM, Buckley, Ronan wrote:
Hi,
I want to increase the MaxArraySize in the slurm.conf file from its
default value of 1001. I want to increase it to 1.
Is it a case of just adding “MaxArraySize=1
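Taking the two threads above together, a sketch of the edits; the
MaxArraySize value is an assumption since the original message is
truncated here:

    # slurm.conf
    MaxJobCount=250000    # requires a slurmctld restart, not a reconfigure
    MaxArraySize=10001    # highest allowed array index + 1 (illustrative)

followed by:

    systemctl restart slurmctld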
For reference we are running 18.08.7
-Paul Edmon-
On 5/29/19 10:39 AM, Paul Edmon wrote:
Sure. Here is what we have:
## Scheduling
#
### This section is specific to scheduling
### Tells the scheduler to enforce limits for all
PriorityWeightQOS=10
I'm happy to chat about any of the settings if you want, or share our
full config.
-Paul Edmon-
On 5/29/19 10:17 AM, Julius, Chad wrote:
All,
We rushed our Slurm install due to a short timeframe and missed some
important items. We are now looking to implement a better system
took too long to clean up, thus those jobs took forever to
schedule.
With the various improvements to the scheduler this may no longer be the
case, but I haven't taken the time to test it on our cluster as our
current setup has worked well.
-Paul Edmon-
On 5/29/19 11:04 AM, Kilian Cavalotti
/partition_prio or preempt/qos plugins.)
In general slurm will try not to preempt if it can avoid it. These
options can help to guide that a bit more intelligently.
-Paul Edmon-
On 5/29/19 8:53 AM, Mike Harvey wrote:
I am relatively new to SLURM, and am having difficulty configuring our
for
resource usage. It has worked pretty well for our purposes.
-Paul Edmon-
On 6/19/19 3:30 PM, Fulcomer, Samuel wrote:
(...and yes, the name is inspired by a certain OEM's software
licensing schemes...)
At Brown we run a ~400 node cluster containing nodes of multiple
architectures
then they have to build their own stack.
-Paul Edmon-
On 6/20/19 11:07 AM, Fulcomer, Samuel wrote:
...ah, got it. I was confused by "PI/Lab nodes" in your partition list.
Our QoS/account pair for each investigator condo is our approximate
equivalent of what you're doing with owned partition
been having a hard enough time
understanding our current system. It's not due to its complexity but
more that most people just flat out aren't cognizant of their usage and
think the resource is functionally infinite.
-Paul Edmon-
On 6/19/19 5:16 PM, Fulcomer, Samuel wrote:
Hi Paul,
Thanks
I don't know off hand. You can sort of construct a similar system in
Slurm, but I've never seen it as a native option.
-Paul Edmon-
On 6/20/19 10:32 AM, John Hearns wrote:
Paul, you refer to banking resources. Which leads me to ask are
schemes such as Gold used these days in Slurm?
Gold
have about using suspend is that while the job is suspended, the memory
that job was using is still allocated. Thus that may be why your jobs
are not moving immediately, as Slurm will still consider the memory
space allocated even though the CPU is now free.
-Paul Edmon-
On 7/8/19 6:03 PM, Hanu
as in one submission it will generate thousands of jobs, which
the scheduler can then handle sensibly. So I highly recommend using job
arrays.
-Paul Edmon-
On 8/27/19 3:45 AM, Guillaume Perrault Archambault wrote:
Hi Paul,
Thanks a lot for your suggestion.
The cluster I'm using has thousands
A QoS is probably your best bet. Another variant might be MCS, which
you can use to help reduce resource fragmentation. For limits though
QoS will be your best bet.
-Paul Edmon-
On 8/30/19 7:33 AM, Steven Dick wrote:
It would still be possible to use job arrays in this situation, it's
just
Yes, QoS's are dynamic.
-Paul Edmon-
On 8/30/19 2:58 PM, Guillaume Perrault Archambault wrote:
Hi Paul,
Thanks for your pointers.
I'll look into QOS and MCS after my paper deadline (Sept 5). Re
QOS, as expressed to Peter in the reply I just now sent, I wonder if
the QOS of a job can
for it at that point.
-Paul Edmon-
On 8/28/19 10:49 AM, David Baker wrote:
Hello,
I apologise that this email is a bit vague, however we are keen to
understand the role of the Slurm "StateSave" location. I can see the
value of the information in this location when, for example, we are
upgra
We've hit this before due to RPC saturation. I highly recommend using
max_rpc_cnt and/or defer for scheduling. That should help alleviate
this problem.
-Paul Edmon-
On 8/26/19 2:12 AM, Guillaume Perrault Archambault wrote:
Hello,
I wrote a regression-testing toolkit to manage large
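A sketch of those SchedulerParameters; the threshold is an illustrative
assumption:

    # slurm.conf
    # defer: skip the per-job scheduling attempt at submit time
    # max_rpc_cnt: back off scheduling while this many server threads are busy
    SchedulerParameters=defer,max_rpc_cnt=150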
re tightly with the
scheduler. Sometimes for older versions of MPI they need to use mpirun,
but by and large our community uses srun for the above reasons. It's
the more native slurm way of doing things with MPI.
-Paul Edmon-
On 9/17/19 4:12 AM, Marcus Wagner wrote:
Hi Jürgen,
we set in our modules the
Probably your best bet is to use QoS's to accomplish this. Be advised
that suspending jobs still leaves them in memory space.
-Paul Edmon-
On 9/18/19 9:16 PM, Benjamin Wong wrote:
Hello,
I plan to purchase a GPU machine with 8 GPUs which will be shared
between group A and group B. Group
All the aggregate historic data should be accessible via sacct. sstat is
for live jobs but sacct is for completed jobs.
-Paul Edmon-
On 10/30/2019 2:13 PM, Jacob Chappell wrote:
Is there a simple way to store sstat information permanently on job
completion? We already have job accounting
Yes they should be.
-Paul Edmon-
On 12/15/2019 10:28 AM, Raymond Muno wrote:
We are new to SLURM, migrating over from SGE.
When launching OpenMPI jobs (version 4.0.2 in this case) via srun, are
the MCA parameters followed when they are set via environmental
variables, e.g. OMPI_MCA_param
We do this by looking at gres. The info is in the job_desc.gres
variable. We basically do the inverse, where we ensure someone is
asking for the GPU before allowing them to submit to a gpu partition.
-Paul Edmon-
On 12/11/2019 12:32 PM, Grigory Shamov wrote:
Hi All,
I am trying
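A minimal sketch of that inverse check in job_submit.lua, assuming the
partition is literally named "gpu"; the field usage follows the
description above but should be verified against your Slurm version:

    -- job_submit.lua: require a GRES request for the gpu partition
    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.partition == "gpu" and
           (job_desc.gres == nil or not string.match(job_desc.gres, "gpu")) then
            slurm.log_user("jobs in the gpu partition must request --gres=gpu:N")
            return slurm.ERROR
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end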
preempt/partition_prio or preempt/qos plugins.)
-Paul Edmon-
On 10/25/19 7:21 AM, Oytun Peksel wrote:
Hi,
Let’s say I have two partitions assigned to the same single load in
the cluster.
LowPrio with PreemptMode=suspend Priority=1
HighPrio with PreemptMode=off Priority=5
I have 4 identical
We have been using:
https://github.com/fasrc/slurm-diamond-collector
for our setup, though it gives more of an overall look. We also use
this:
https://github.com/fasrc/lsload
-Paul Edmon-
On 10/16/19 4:53 PM, Will Dennis wrote:
Hi all,
We run a few Slurm clusters here, all using
It can also happen if you have a stalled-out filesystem or stuck
processes. I've gotten in the habit of doing a daily patrol for them to
clean them up. Most of the time you can just reopen the node, but
sometimes this indicates something is wedged.
-Paul Edmon-
On 10/22/2019 5:22 PM, Riebs
sshare is cumulative statistics, so no window is needed. It's just the
sum of the total usage for whatever window you set for fairshare. If
you set no window then it is everything.
-Paul Edmon-
On 3/2/20 10:34 AM, Enric Fortin wrote:
Hi everyone,
I’ve noticed that when using `sshare
Also if you want tracking of fairshare and other stats in graphite, you
can use these:
https://github.com/fasrc/slurm-diamond-collector
-Paul Edmon-
On 2/17/2020 8:57 AM, Chris Samuel wrote:
On 17/2/20 4:19 am, Parag Khuraswar wrote:
Does Slurm provide cluster usage reports like mentioned
I would recommend setting up XDMoD as it will calculate this, plus a
variety of other useful facts:
https://open.xdmod.org/8.5/index.html
Also if you like grafana you can use this:
https://github.com/fasrc/slurm-diamond-collector
-Paul Edmon-
On 4/2/2020 8:31 AM, Sudeep Narayan Banerjee
would have
everything governed purely by fairshare with one large queue and no QoS's.
For your setup, though, I think a combination of QoS's and partition
layout would fit the bill.
-Paul Edmon-
On 4/22/2020 5:43 PM, Paul Brunk wrote:
Hi all:
[ BTW this is the same situation that the submitter
if using the backfill scheduling plugin. In order to
eliminate some possible race conditions, the minimum non-zero value
for *MinJobAge* recommended is 2.
-Paul Edmon-
On 4/30/2020 3:39 AM, Gestió Servidors wrote:
Hello,
I would like to know if there exist any way to get the same
You could try holding the job and the releasing it. I've inquired of
SchedMD about this before and this is the response they gave:
https://bugs.schedmd.com/show_bug.cgi?id=8069
-Paul Edmon-
On 3/23/2020 8:05 AM, Sefa Arslan wrote:
Hi,
Due to lack of source in a partition, I updated the job
--parsable2 will print full names. You can also use -o to format your
output.
-Paul Edmon-
On 3/23/2020 10:46 AM, Sysadmin CAOS wrote:
Hi,
when I run "sshare -A myaccount -a" and myaccount contains usernames
with more than 10 characters, "sshare" output shows a "
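For reference, the two fixes that suggests:

    # pipe-delimited output with untruncated names
    sshare -A myaccount -a --parsable2
    # or widen the columns explicitly (field list illustrative)
    sshare -A myaccount -a -o Account%25,User%25,RawShares,FairShare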
will select which one their job will run on more
quickly. Then we rely on fairshare to adjudicate priority.
-Paul Edmon-
On 10/6/2020 11:37 AM, Jason Simms wrote:
Hello David,
I'm still relatively new at Slurm, but one way we handle this is that
for users/groups who have "bought in" to t
as there are numerous performance improvements.
For something straight out of the box though I would look at
defer/max_rpc_cnt as that will help the scheduler cope with high RPC
traffic.
-Paul Edmon-
On 8/17/2020 2:30 PM, Ransom, Geoffrey M. wrote:
Hello
We are having performance issues
you
want to cut that down by whatever means you think is reasonable.
-Paul Edmon-
On 8/18/2020 11:36 AM, Jason Simms wrote:
Hello everyone! We have a script that queries our LDAP server for any
users that have an entitlement to use the cluster, and if they don't
already have an account
We also have a git repo in which we manage our slurm.spec file with a
branch for each version and type so we can keep organized.
-Paul Edmon-
On 9/24/2020 3:31 PM, Dana, Jason T. wrote:
Hello,
I hopefully have a quick question.
I have compiled Slurm RPMs on a CentOS system with nvidia
is Association based. So you could just modify their account
directly and set it to something low.
You can also simply put their pending jobs in hold state. That way they
won't be considered for scheduling but won't be outright removed.
Setting fairshare to 0 has the same effect.
-Paul Edmon
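A sketch of both approaches; the user name is a placeholder:

    # zero out a user's fairshare so their pending work sorts last
    sacctmgr modify user where name=someuser set fairshare=0
    # or hold all of their pending jobs outright
    squeue -u someuser -h -t PD -o %i | xargs -r -n1 scontrol hold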
The bug site is the best way. The devs prioritize sponsored features
over general community requested features.
-Paul Edmon-
On 9/30/2020 11:34 AM, Ryan Novosielski wrote:
I’ve previously seen code contributed back in that way. See bug 1611
as an example (happened to have looked at that just
Probably the best way to accomplish this is via a job_submit.lua
script. That way you can reject at submission time. There isn't a
feature in the partition configurations that I am aware that can
accomplish this but a custom job_submit script certainly can.
-Paul Edmon-
On 9/30/2020 11:44
So the way we handle it is that we give a blanket fairshare to everyone
but then dial in our TRES charge back on a per partition basis based on
hardware. Our fairshare doc has a fuller explanation:
https://docs.rc.fas.harvard.edu/kb/fairshare/
-Paul Edmon-
On 9/17/2020 9:30 AM, Mark Dixon
reserved.
That's the natural understanding of suspend, but that's not the way
suspend actually works in Slurm.
-Paul Edmon-
On 9/16/2020 6:08 AM, SJTU wrote:
Hi,
I am using SLURM 19.05 and found that SLURM may launch jobs onto nodes with
suspended jobs, which leads to resource contention
No, you are only charged for time you actually use.
-Paul Edmon-
On 9/18/2020 11:09 AM, Angelo wrote:
Hi all,
Is the job limit time requested (--time=) considered in the classic
fairshrare algorithm?
Example: if I set the job time limit to 1 day (--time=24:00:00) and
the job ends in 4
This can happen if the underlying storage is wedged. I would check that
it is working properly.
Usually the only way to clear this state is to fix the stuck storage
or reboot the node.
-Paul Edmon-
On 10/24/2020 12:22 PM, Kimera Rodgers wrote:
I'm setting up slurm on an OpenHPC cluster
to down, then run a cancel over all the running jobs.
Pending jobs are left in place, and users are allowed to submit work
during the outage and when we reopen everything gets going again.
So there is a third option, though you have to accept that jobs will be
cancelled to pull it off.
-Paul
user can run without causing damage to themselves or the
underlying filesystems, and without interfering with other users. Practical
experience has led us to set that limit at 10,000 on our
cluster, but I imagine it will vary from location to location.
-Paul Edmon-
On 8/6/2020 10:31 PM
(130 as of last count) so our tuning has
been a bit more complicated. However the latest version of slurm
(20.02) vastly improved the backfill efficiency which has helped with
making sure the cluster is full. Nonetheless we still seem to average a
job per core per day here.
-Paul Edmon-
On 8
Try setting RawShares to something greater than 1. I've seen it be the
case that when you set 1 it creates weirdness like this.
-Paul Edmon-
On 7/9/2020 1:12 PM, Dumont, Joey wrote:
Hi,
We recently set up fair tree scheduling (we have 19.05 running), and
are trying to use sshare to see
You could set up a dummy node that has the features that are not active,
but not allow jobs to schedule to that node by setting it to DOWN. That
would be a hacky way of accomplishing this.
-Paul Edmon-
On 7/9/2020 7:15 PM, Raj Sahae wrote:
Hi all,
My apologies if this is sent twice
Another option would be to use the license feature and just set licenses
to 0 when they aren't available.
-Paul Edmon-
On 7/10/2020 12:42 PM, Raj Sahae wrote:
Hi Brian and Paul,
You both sent me suggestions about using an offline dummy node with
all features set. Thanks for your ideas
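A sketch of the license variant; the license name and counts are
illustrative:

    # slurm.conf -- advertise the scarce resource as a license
    Licenses=newhw:4
    # jobs request it with:  sbatch -L newhw:1 ...
    # when the resource goes away, set the count to 0 and run
    # "scontrol reconfigure"; requesting jobs will simply pend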
For the record we filed a bug on this years ago:
https://bugs.schedmd.com/show_bug.cgi?id=3875 Hasn't been fixed yet
though everyone seems to agree it is a good idea.
Florian's suggestion is probably the best stopgap until this feature is
implemented.
-Paul Edmon-
On 6/22/2020 7:11 AM
Yes. I have a discussion here which might be useful:
https://docs.rc.fas.harvard.edu/kb/fairshare/
Note this is using the classic fairshare not FairTree which is now the
default for Slurm.
-Paul Edmon-
On 6/25/2020 9:23 AM, Durai Arasan wrote:
Hi,
In slurm accounting
and won't impact larger work. I don't necessarily
recommend that. A single node with oversubscribe should be sufficient.
If you can't spare a single node then a VM would do the job.
-Paul Edmon-
On 6/11/2020 9:28 AM, Renfro, Michael wrote:
That’s close to what we’re doing, but without dedicated
Same here. Whenever we see rashes of Kill task failed it is invariably
symptomatic of one of our Lustre filesystems acting up or being saturated.
-Paul Edmon-
On 7/22/2020 3:21 PM, Ryan Cox wrote:
Angelos,
I'm glad you mentioned UnkillableStepProgram. We meant to look at
that a while ago
very useful.
-Paul Edmon-
On 7/16/2020 8:42 AM, Paul Edmon wrote:
A trick you can use to reset certain users (which I have used before)
is to simply delete them from the slurmdb and then readd them. At
least under the other fairshare system, which is what our site uses,
that would remove
assuming fairtree works the same way.
-Paul Edmon-
On 7/16/2020 5:49 AM, Gestió Servidors wrote:
Hello,
I will try to explain a scenario that occurs in my SLURM cluster. An
important number of users (accounts) belong to students of a certain
subject. That subject is 6 months in duration. When
Wow, nice find. I wasn't even aware of that one. Hopefully they will
support resetting to other values in the future, as that would be a
handy ability.
-Paul Edmon-
On 7/16/2020 12:56 PM, Sebastian T Smith wrote:
`sacctmgr` can be used to reset the accrued RawUsage value
https://slurm.schedmd.com/sacctmgr.html
-Paul Edmon-
On 7/27/2020 2:17 PM, Jason Simms wrote:
Dear all,
Apologies for the basic question. I've looked around online for an
answer to this, and I haven't found anything that has helped
accomplish exactly what I want. That said, it is also probable that
what