I suspect you may have set some "frontendname" or "frontendaddr" in your
slurm.conf that triggered that.
A FrontEnd node is one used to execute batch scripts in place of the
compute nodes (as on Cray ALPS systems). If that does not describe your
setup, you should not set it.
Brian Andrus
IIRC, Preemption is determined by partition first, not node.
Since your pending job is in the 'day' partition, it will not preempt
something in the 'night' partition (even if the node is in both).
Brian Andrus
On 8/19/2021 2:49 PM, Russell Jones wrote:
Hi all,
I co
Yep. I do it all the time when I forget to add a parent. Also when a
project/account changes who owns it.
sacctmgr will also tell you what it is going to change and give you 30
seconds to say yes; otherwise it doesn't make the change.
Brian Andrus
On 9/8/2021 3:41 AM, byron wrote:
Hi
a cluster. This may be a good option for you.
Brian Andrus
On 9/13/2021 7:14 AM, Ozeryan, Vladimir wrote:
*max_script_size=#*
Specify the maximum size of a batch script, in bytes. The default
value is 4 megabytes. Larger values may adversely impact system
performance.
I have users
Modify it and raise the priority to something very, very high.
scontrol update job=JOBID priority=999
Brian Andrus
On 9/16/2021 8:39 AM, 顏文 wrote:
Dear users
Thanks for the immediate replies. I currently have one important job
running. How do I prevent the running job from being preempted
Those would be considered separate for each job.
You may want to have your prolog check whether an epilog is still
running and wait for it to finish before starting its own prolog
work.
Brian Andrus
On 9/27/2021 9:15 AM, Joe Teumer wrote:
Should the PrologSlurmctld script only
.
Also helps with OOM killer situations.
Brian Andrus
On 10/1/2021 1:22 AM, Diego Zuccato wrote:
Hello all.
I just upgraded to Debian 11 that brings Slurm 21.08 and the newer
nodes upgraded w/o too many issues (just minor config changes, one
being RealMemory value in slurm.conf, since for
Something is very odd when you have the node reporting:
RealMemory=1 AllocMem=0 FreeMem=47563 Sockets=2 Boards=1
What do you get when you run 'slurmd -C' on the node?
Brian Andrus
From: Adam Xu
Sent: Tuesday, October 12, 2021 6:07 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] job is
You may have space, but do you have enough inodes?
Two different things to look at when trying to see why you cannot write
to a disk.
Also verify that it is writable by SlurmUser.
If something happened and it automatically remounted itself as
read-only, that can do it too.
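For instance, a quick check of both (the path is illustrative):

df -h /var/spool/slurm   # free space
df -i /var/spool/slurm   # free inodes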
Brian Andrus
That is interesting to me.
How do you use ulimit and systemd to limit user usage on the login
nodes? This sounds like something very useful.
Brian Andrus
On 10/31/2021 1:08 AM, Yair Yarom wrote:
Hi,
If it helps, this is our setup:
6 clusters (actually a bit more)
1 mysql + slurmdbd on the
I don't think slurm does what you think it does.
It manages the resources and schedule, not the actual hardware of a node.
You are likely looking for something more along the lines of a
hypervisor (if you are doing VMs) or a remote KVM (since you are
mentioning BIOS access).
Brian Andrus
On 11/12/2021
Maybe submit the job with the option to not start for 24 hours...
From https://slurm.schedmd.com/sbatch.html :
--begin=now+1hour
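So for the 24-hour case in question, something like (script name is
hypothetical):

sbatch --begin=now+24hours myjob.sh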
Brian Andrus
On 11/22/2021 8:28 PM, Jeherul Islam wrote:
Dear All,
Is there any way to configure slurm, that the High Priority job waits
for a certain amount of
one and set the job to use that node.
Brian Andrus
On 12/1/2021 12:06 PM, Benjamin Nacar wrote:
Based on some quick experiments, that doesn't do what I'm looking for.
I set LLN=YES for the default partition and ran my test job several
times, waiting each time for it to fin
imit, part_max_time and partition variables are mapped from
job_desc and part_list
Brian Andrus
On 12/2/2021 6:01 AM, mercan wrote:
Hi;
The EnforcePartLimits parameter in slurm.conf should be set to ALL or
ANY to enforce the time limit for a partition.
Regards.
Ahmet M.
2.12.2021 16:18 tarihinde
Your slurm needs to be built with that support. If you have mysql-devel
installed, it should pick it up; otherwise you can specify the location
with --with-mysql when you configure/build slurm
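One way to verify the build picked it up is to look for the resulting
plugin afterwards (the path is typical for RPM installs, but may vary):

ls /usr/lib64/slurm/accounting_storage_mysql.so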
Brian Andrus
On 12/2/2021 12:40 PM, Giuseppe G. A. Celano wrote:
Hi everyone,
I am having trouble
:41.022] fatal: You are running with a database but
for some reason we have no TRES from it. This should only happen if
the database is down and you don't have any state files.
On Thu, Dec 2, 2021 at 10:36 PM Brian Andrus wrote:
Your slurm needs to be built with that support. If you have
parts you need out of it.
Brian Andrus
On 12/3/2021 2:11 AM, Gestió Servidors wrote:
Hi,
Answering between lines...
> Hi;
>
> The EnforcePartLimits parameter in slurm.conf should be set to ALL
> or ANY
> to enforce the time limit for a partition.
>
> Regards.
>
> Ahmet M
Which version of Mariadb are you using?
Brian Andrus
On 12/3/2021 4:20 PM, Giuseppe G. A. Celano wrote:
After installation of libmariadb-dev, I have reinstalled the entire
slurm with ./configure + options, make, and make install. Still,
accounting_storage_mysql.so is missing.
On Sat, Dec
Indeed, this is accurate.
We regularly add nodes on the fly (cloud based cluster).
All that is needed is to get them all set in slurm.conf, restart
slurmctld and do 'scontrol reconfigure'.
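In practice that is roughly (assuming systemd on the controller):

# after adding the new NodeName/PartitionName lines to slurm.conf everywhere:
systemctl restart slurmctld
scontrol reconfigure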
Brian Andrus
On 12/13/2021 11:01 AM, Paul Brunk wrote:
Hi:
Normally, adding a new nod
All,
Trying to see if there is a simpler way to do this other than awk..
Is there a way to list only partitions a user has access to that are in
the 'UP' state?
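For reference, a rough starting point that lists UP partitions, though
it still leaves the per-user access check (AllowGroups/AllowAccounts)
to be filtered:

sinfo -h -o "%R %a" | awk '$2 == "up" {print $1}'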
Brian Andrus
file and folks will be able to login
again.
Brian Andrus
On 1/31/2022 9:18 PM, Sid Young wrote:
Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung.com/
W: (personal) https://z900collector.wordpress.com/
On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel
wrote:
That looks like a DNS issue.
Verify all your nodes are able to resolve the names of each other.
Check /etc/resolv.conf, /etc/hosts and /etc/slurm/slurm.conf on the
nodes (including head/login nodes) to ensure they all match.
Brian Andrus
On 2/1/2022 1:37 AM, Jeremy Fix wrote:
Hello
), you
should be good. You will still need to do the incremental for the db
changes, but no worries about state files either way.
Brian Andrus
On 2/2/2022 1:38 PM, Nathan Smith wrote:
The "Upgrades" section of the quick-start guide [0] warns:
Slurm permits upgrades to a new major rele
symlink that to /scratch which is where users are directed to.
You could just do "chmod 1777 /tmp" as well
Caveat: If this is the ephemeral ramdisk/ssd/etc disk that is created
each time the node starts up, you have to do the above step every boot.
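One way to automate that, as a sketch assuming systemd-tmpfiles and
that /scratch is the directory in question:

# /etc/tmpfiles.d/scratch.conf
d /scratch 1777 root root -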
Brian Andrus
On 2/2/2022 8:59 P
So if you run sshare on the head node, it shows your dummy user?
At any rate, just do a db dump (also known as a backup) and you can
restore that if you have an issue of any sort.
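For example, assuming the default database name of slurm_acct_db:

mysqldump --single-transaction slurm_acct_db > slurm_acct_db.sql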
Brian Andrus
On 2/7/2022 12:42 AM, Moshe Mergy wrote:
Hi all
I cloned the Slurm DB into a separated node
.conf
after shutting down slurmd.
You should not have anything listed as backuphost for anything on the
cloned db node.
It should only have 'localhost' for the SlurmctldHost and
AccountingStorageHost (slurm.conf) and DbdHost (slurmdbd.conf).
Brian Andrus
On 2/7/2022 8:51 AM, Mo
Moshe,
So it looks like you added the dummy user to the main database somehow.
I would suggest trying again cautiously, making a dummy2 user or such.
Your questions now are getting out of slurm and into mysql area, so may
be more appropriate in another forum.
Brian Andrus
On 2/7
Just curious as to expectations out here.
When /should/ slurm immediately reject a job?
Brian Andrus
On 2/8/2022 11:41 PM, Alexander Block wrote:
Hi Mike,
I'm just discussing a familiar case with SchedMD right now (ticket
13309). But it seems that it is not possible with Slurm to s
le way to do it would be to
have round-robin DNS or a load balancer in front of the slurmdbd servers
and let that be where clients access it.
Brian Andrus
On 2/15/2022 7:46 AM, Xand Meaden wrote:
Hello,
I'm wondering what others are doing to make their slurmdbd service
resilient? W
First look and I would guess that there are enough resources to satisfy
the requests of both jobs, so no need to suspend.
Having the node info and the job info to compare would be the next step.
Brian Andrus
On 2/18/2022 7:20 AM, Walls, Mitchell wrote:
Hello,
Hoping someone can shed some
configless so I use a
symlink to the slurm.conf file on a shared filesystem. This works great.
Anytime there are changes, a simple 'scontrol reconfigure' brings all
running nodes up to speed and any down nodes will automatically read the
latest.
Brian Andrus
On 2/23/2022 2:31 AM, Dav
Double-check you have all the packages.
When slurm is built, slurmrestd is a separate package and is only built
if the whole set was directed to do so. If they did not build it, you
will need to do so yourself. This will mean using your custom built
files throughout.
Brian Andrus
On 3/7
Depending on other variables, it is fine.
The 7 license job cannot run because there are only 5 available, so that
one has to wait.
Since there are 5 available, the 1 license job can run, so it does.
That is the simple view. Other variables such as job time could affect that.
Brian Andrus
to know what is available versus what you asked for. When
using exclusive, it becomes more like "I want at least X cores" and you
get "Ok, here are X cores or more"
Within your script, you could check for total cores, with something like
'srun lscpu' and parse the ou
It should exist in the user environment as well.
I would check the user's .bashrc and .bash_profile settings to see if
they are doing anything that will change that.
Brian Andrus
On 3/23/2022 7:42 AM, taleinterve...@sjtu.edu.cn wrote:
Hi, all:
We found a problem that slurm job with
("RUNNING","RU");print}'
Just add a 'sub' command for each substitution. It is tedious to set up
but will do the trick. You can also specify the specific field to do any
substitution on.
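Something along these lines, where the field number and state strings
are illustrative:

squeue -h -o '%i %T' | awk '{sub(/RUNNING/,"RU",$2); sub(/PENDING/,"PD",$2); print}'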
Brian Andrus
On 3/24/2022 6:12 AM, Chip Seraphine wrote:
I’m trying to sh
ep that is where you would do that and those
are subsets of sbatch.
Brian Andrus
On 3/31/2022 11:14 AM, David Henkemeyer wrote:
We noticed that we can pass --cpu_bind into an srun commandline, but
not sbatch. Why is that?
Thanks
David
, and are not usable outside of the
partition configuration.
Feature
All nodes with this single feature will be included as part of this
nodeset.
Nodes
List of nodes in this set.
NodeSet
Unique name for a set of nodes. Must not overlap with any NodeName
definitions.
Brian Andrus
All,
Not sure if this is already out there, but it would be nice to be able to
immediately reject interactive jobs that are going to be held due to an
upcoming maintenance window.
Does anyone already have this? If not, I suspect I will work on it as a lua
function for the job_submit.lua
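A minimal sketch of such a function (the maintenance start time is
hard-coded here as an assumption; a real version would look up the
reservation's start time instead):

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- interactive jobs (salloc/srun) arrive with no batch script
    if job_desc.script == nil or job_desc.script == '' then
        local maint_start = 1700000000    -- assumed epoch of the maintenance window
        local NO_VAL = 4294967294         -- sentinel slurm uses for "unset"
        local limit = job_desc.time_limit -- minutes
        if limit ~= nil and limit ~= NO_VAL and
           os.time() + limit * 60 > maint_start then
            slurm.log_user("Interactive jobs overlapping the maintenance window are rejected")
            return slurm.ERROR
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end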
Brian
You want to see what is output on the node itself when you run:
slurmd -C
Brian Andrus
On 4/5/2022 2:11 PM, Guertin, David S. wrote:
We've added a new GPU node to our cluster with 32 cores. It contains 2
16-core sockets, and hyperthreading is turned off, so the total is 32
cores. But
justified and size must be
specified. By default output is left justified.
suffix
Arbitrary string to append to the end of the field.
Brian Andrus
On 4/7/2022 11:02 AM, Nicholas Yue wrote:
Hi,
I am spinning up an MPI/Slurm cluster on AWS
I am attempting to script the
Check selinux.
Run "getenforce" on the node, if it returns 1, try setting "setenforce 0"
Slurm doesn't play well if selinux is enabled.
Brian Andrus
On 4/8/2022 10:53 AM, Nicolas Greneche wrote:
Hi,
I have an issue with pam_slurm_adopt when I moved from 21.08.5
Ok. Next I would check that the uid of the user is the same on the
compute node as the head node.
It looks like it is identifying the job, but doesn't see it as yours.
Brian Andrus
On 4/8/2022 1:40 PM, Nicolas Greneche wrote:
Hi Brian,
Thanks, SELinux is neither in strict or targeted
Not to steal his thunder, but Ole has done a great job with quite a few
things.
He has some job scripts at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
I fully expect him to chime in and offer additional great advice.
Brian Andrus
On 4/15/2022 7:28 AM, David Henkemeyer
'vanishes'. I suspect Nagios even has the
hooks to make that work. You could also email the user to let them know
their job was ended due to spot being pulled.
Just some ideas,
Brian Andrus
On 5/5/2022 6:28 AM, Steven Varga wrote:
Hi Tina,
Thank you for sharing. This matches my obser
on restart of the slurmctld daemon. May not exceed
65533.
so if you already have (by default) 5000 jobs being considered, the
remaining aren't even looked at.
Brian Andrus
On 5/12/2022 7:34 AM, David Henkemeyer wrote:
Question for the braintrust:
I have 3 partitions:
Double-check the account info on that node (c0801).
Could be the node does not recognize the uid being assigned to the user/job.
Brian Andrus
On 5/13/2022 2:31 PM, Williams, Jenny Avis wrote:
Yesterday I upgraded slurmdbd and slurmctld nodes from RHEL7 / Slurm
v. 20.11.8 to RHEL8.5 / Slurm
hings down into slurm steps, so you would be
able to get pretty good detailed info.
Brian Andrus
On 5/16/2022 6:44 AM, William Dear wrote:
Could anyone please recommend methods of tracking the performance of
individual tasks in a task array job? I have installed XDMoD but it
is focused so
You are starting to understand a major issue with most containers.
I suggest you check out Singularity, which was built from the ground up
to address most issues. And it can run other container types (eg: docker).
Brian Andrus
On 5/16/2022 10:49 PM, GHui wrote:
I use podman 4.0.2. And slurm
You need to step upgrade through major versions (not minor).
So 19.05=>20.x
I would highly recommend going to 21.08 while you are at it.
I just did the same migration (although they started at 18.x) with no
issues. Running jobs were not impacted and users didn't even notice.
Brian Andrus
watch
the logs to see when it is happy). Don't start slurmctld until that is
done. Waiting makes things easier.
Brian Andrus
On 5/17/2022 9:29 AM, Paul Edmon wrote:
I think it should be, but you should be able to run a test and find out.
-Paul Edmon-
On 5/17/22 12:13 PM, byron wrote:
Sor
the people you give the
permission to that they will not abuse it.
Brian Andrus
On 5/18/2022 12:22 AM, GHui wrote:
Hi, Brian Andrus
I think the main problem is that containers can cheat Slurm.
On 5/17/22 06:58:20, Brian Andrus wrote:
> You are starting to understand a major issue with most co
Add it to your slurm.conf
Then it is always there after a restart.
Brian Andrus
On 6/2/2022 12:05 PM, Hanby, Mike wrote:
Howdy,
I can’t seem to find a solution in ‘man slurm.conf’ for this. How can
I make the following persist a slurmctld restart:
scontrol update NodeName="
Offhand, I would suggest double-checking munge and the versions of
slurmd/slurmctld.
Brian Andrus
On 6/3/2022 3:17 AM, taleinterve...@sjtu.edu.cn wrote:
Hi, all:
Our cluster is set up with 2 slurm control nodes; scontrol show config is as below:
> scontrol show config
…
SlurmctldHost[0] = slu
rent users/groups.
Brian Andrus
On 6/9/2022 5:19 PM, Willy Markuske wrote:
Hello All,
I have a request from users for the ability to have persistent
interactive jobs. Currently some users are using srun to allocate an
interactive job and run their scripts, but sshd will close connections
aft
respectively.
*NOTE*: The suffix "*" identifies nodes that are presently not
responding.
Brian Andrus
On 6/26/2022 5:39 AM, z1...@arcor.de wrote:
Hello,
if I call "sinfo -o %all", the worker state includes only a single state
word like "DRNG".
It is clearer in
ensures both are getting
accurate and current information.
Brian Andrus
On 6/27/2022 9:15 AM, taleinterve...@sjtu.edu.cn wrote:
Hi, all:
We noticed that slurmdbd provide the conf option *DbdBackupHost* for
user to set a secondary slurmdbd node. Since slurmdbd is closely
related to database
Hmm. That would imply you could still use the tar file with something like:
rpmbuild -v -ta --define "_lto_cflags %{nil}" slurm-22.05.2.tar.bz2
Note, I have not tried this (no immediate access to RHEL9 derivative),
so YMMV.
Brian Andrus
On 7/21/2022 10:15 AM, Kilian Cavalotti
Verify that their uid on the node is the same as the uid your master sees
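For example, reusing the username from the error below:

# run on both the head node and the compute node; the output must match
id blahblah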
Brian Andrus
On 7/27/2022 8:53 AM, byron wrote:
Hi
When a user tries to login into a compute node on which they have a
running job they get the error
Access denied: user blahblah (uid=) has no active jobs on
Lloyd,
You could check out the order of entries in your pam.d/ssh (and
related/included) files
See where the slurm_pam_adopt is, how it is being called, and whether
there are settings that are interfering.
Does this occur only on a single node, or all of them?
Brian Andrus
On 7/27/2022 9:29
compute
nodes do.
Brian Andrus
On 8/2/2022 6:45 AM, Paul Edmon wrote:
No, the node running the slurmctld does not need access to any of the
customer facing filesystems or home directories. While all the login
and client nodes do, the slurmctld does not.
-Paul Edmon-
On 8/2/2022 9:30 AM
So an example of using slurm to reboot all nodes 3 at a time:
sinfo -h -o %n | xargs -I{} --max-procs=3 scontrol reboot {}
If you want to get fancy, make a script that does the reboot and waits
for the node to be back up before exiting and use that instead of the
'scontrol reboot' par
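A sketch of such a wrapper (the state strings and sleep intervals are
assumptions):

#!/bin/bash
# reboot_and_wait.sh <node>
node="$1"
scontrol reboot "$node"
sleep 60   # give the node time to actually go down
# wait until the node no longer reports a rebooting/down state
while sinfo -h -n "$node" -o %t | grep -qEi 'down|boot|unk'; do
  sleep 30
done

Then: sinfo -h -o %n | xargs -I{} --max-procs=3 ./reboot_and_wait.sh {}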
This is actually brilliant!
Brian Andrus
On 8/3/2022 10:20 PM, Gerhard Strangar wrote:
Phil Chiu wrote:
- Individual slurm jobs which reboot nodes - With a for loop, I could
submit a reboot job for each node. But I'm not sure how to limit this so at
most N jobs are ru
It looks to me like you have the same node in multiple partitions. If
the output you are getting is basically what you want, just pipe it to
'sort -u' or 'uniq'.
Brian Andrus
On 8/8/2022 10:14 AM, Borchert, Christopher B ERDC-RDE-ITL-MS CIV wrote:
Hello. How can I simply
Not sure if you can do all the things you intend, but the job_submit
script is precisely where you want to check submission options.
https://slurm.schedmd.com/job_submit_plugins.html
Brian Andrus
On 8/30/2022 12:58 PM, Davide DelVento wrote:
Hi,
I would like to soft-enforce license
I would be surprised if it were compiled without the support. However,
you could check and run something like:
strings /sbin/slurmctld | grep job_submit
(or wherever your slurmctld binary is). There should be quite a few
lines with that in it.
Brian Andrus
On 9/1/2022 10:54 AM, Davide
)
Usually it would be found at /usr/lib64/slurm/job_submit_lua.so
If that is there, you should be good with trying out a job_submit lua
script.
Brian Andrus
On 9/1/2022 1:24 PM, Davide DelVento wrote:
Thanks again, Brian, indeed that grep returns many hits, but none of
them includes lua, i.e
Try setting logging to debug mode, then you can get some info from the logs.
Brian Andrus
On 9/1/2022 8:15 PM, Davide DelVento wrote:
Thanks.
I did try a lua script as soon as I got your first email, but that
never worked (yes, I enabled it in slurm.conf and ran "scontrol
reconfigure"
Possibly way off base, but did you happen to do any of the editing in
Windows? Maybe running into the cr/lf issue for how windows saves text
files?
Brian Andrus
On 9/7/2022 5:21 AM, Davide DelVento wrote:
Thanks Ole, your wiki page sheds some light on this mystery.
Very frustrating that even
No, however a node can reside in multiple partitions.
So if you add those nodes to the partition the job runs in, they
will be available to it.
Brian Andrus
On 9/8/2022 11:38 AM, Purvesh Parmar wrote:
We require more nodes to run a single job which requires more nodes
than present in
I had completely forgotten about HETJOB supporting multiple partitions.
Thanks for reminding me.
Brian Andrus
On 9/12/2022 6:06 AM, Marcus Wagner wrote:
yes, that is possible by submitting a hetjob.
Best
Marcus
Am 08.09.2022 um 20:38 schrieb Purvesh Parmar:
We require more nodes to run a
configured?
Brian Andrus
On 9/15/2022 2:49 PM, Davide DelVento wrote:
I have a super simple prolog script, as follows (very similar to the
example one)
#!/bin/bash
if [[ $VAR == 1 ]]; then
echo "True"
fi
exit 0
This fails (and obviously causes great disruption to my production
j
is in the
database updated to match the number free from flexlm, to stop license
starvation when users outside slurm use them up so they really
aren't available to slurm.
Brian Andrus
On 9/15/2022 3:34 PM, Davide DelVento wrote:
I am a bit confused by remote licenses.
https://li
You can dynamically modify the weight of nodes with:
scontrol update nodename=<nodename> weight=<weight>
So, in theory, you could do that periodically to adjust the weights you
may want.
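A periodic sketch of that idea (the parsing and the weight formula are
assumptions; GNU grep -P is required):

#!/bin/bash
# weight each node by its current CPU load (illustrative formula)
for n in $(sinfo -h -N -o %N | sort -u); do
  load=$(scontrol show node "$n" | grep -oP 'CPULoad=\K[0-9]+' | head -1)
  scontrol update nodename="$n" weight=$((100 + ${load:-0}))
done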
Brian Andrus
On 9/15/2022 4:27 PM, Russell Smithies wrote:
Can I set dynamic or calculated “weights” for nodes
t)
2) Update the database (sacctmgr command)
As you can see, that 1st step would be highly dependent on you and your
environment. The 2nd step would be dependent on what things you are
tracking within that.
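For the 2nd step, something like this (resource name and server are
hypothetical):

free=42   # replace with the count parsed from flexlm in step 1
sacctmgr -i modify resource name=matlab server=flexlm set count=$free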
Brian Andrus
On 9/16/2022 5:01 AM, Davide DelVento wrote:
So if I understand corr
Feel free to do that. It is not something that
scales well, but it looks like you have a rather beginner cluster that
would never be impacted by such choices.
Brian Andrus
On 9/16/2022 10:00 AM, Davide DelVento wrote:
Thanks Brian.
I am still perplexed. What is a database to install, admin
ourself working on a large cluster
sometime in your career, I would not recommend using it there.
Brian Andrus
On 9/16/2022 3:06 PM, Davide DelVento wrote:
Hi Brian,
From your response, I speculate that my wording sounded harsh or
unrespectful. That was not my intention and therefore I sincer
Paul,
You are likely spot on with the InactiveLimit change. It may also be an
environment variable of TMOUT (under bash) set.
Brian Andrus
On 9/19/2022 5:46 AM, Paul Raines wrote:
I have had two nights where right at 3:35am a bunch of jobs were
killed early with TIMEOUT way before their
out of the mix.
Brian Andrus
On 9/23/2022 7:09 AM, Groner, Rob wrote:
I'm working through how to use the new dynamic node features in order
to take down a particular node, reconfigure it (using nvidia MIG to
change the number of graphic cores available) and give it back to slurm.
I
ynamic node.
What is the preferred method?
Rob
--------
*From:* slurm-users on behalf
of Brian Andrus
*Sent:* Friday, September 23, 2022 10:24 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* Re: [slurm-users] slurmd and dyna
YMMV, but if you aren't having excessive traffic to the
share, you should be good. I have yet to discover what would be
excessive enough to impact things.
The only use I have had for the HA is being able to keep the cluster
running/happy during maintenance.
Brian Andrus
On 10/24/2022 1:
It caches up to a point. As I understand it, that is about an hour
(depending on size and how busy the cluster is, as well as available
memory, etc).
Brian Andrus
On 10/31/2022 9:20 PM, Richard Chang wrote:
Hi,
Just for my info, I would like to know what happens when SlurmDBD
loses
Ole,
Fair enough, it is actually slurmctld that does the caching. Technical
typo on my part there.
Just trying to let the user know there is a window during which they
have to ensure no information is lost in a database outage.
Brian Andrus
On 11/1/2022 1:43 AM, Ole Holm Nielsen wrote:
Hi
processing
data. There are many ways to do that, but those designs fall under
MariaDB and not Slurm.
Brian Andrus
On 11/1/2022 6:49 PM, Richard Chang wrote:
Does it mean it is best to use a single slurmdbd host in my case?
My primary slurmctld is the backup slurmdbd host, and my worry is if
t
reset/recreate it.
That addresses even a botched software change.
Brian Andrus
On 11/23/2022 5:11 AM, Xaver Stiensmeier wrote:
Hello slurm-users,
The question can be found in a similar fashion here:
https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a
Steve,
I suspect you did not install the packages.
You need to install slurm-slurmctld to get the slurmctld systemd files:
# rpm -qlp slurm-slurmctld-20.11.9-1.el7.x86_64.rpm
/run/slurm/slurmctld.pid
/usr/lib/systemd/system/slurmctld.service
/usr/sbin/slurmctld
ed to submit at all? The reservation method can cause an sbatch
command to be rejected, if that is what you are looking for.
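For example, a maintenance reservation (name and times are illustrative):

scontrol create reservation reservationname=maint nodes=ALL users=root \
  starttime=2022-12-15T08:00:00 duration=240 flags=maint,ignore_jobs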
Brian Andrus
On 11/30/2022 6:29 AM, Richard Ems wrote:
Hi all,
I have to change our set up to be able to update the total number of
available licenses due to users che
I successfully built it for Rocky straight from the tgz file as usual
with rpmbuild -ta
Brian Andrus
On 12/2/2022 9:21 AM, David Thompson wrote:
Hi folks, I’m working on getting Slurm v22 RPMs built for our Alma 8
Slurm cluster. We would like to be able to use the sbatch --prefer
option
You may want to look here:
https://slurm.schedmd.com/heterogeneous_jobs.html
Brian Andrus
On 12/7/2022 12:42 AM, Le, Viet Duc wrote:
Dear slurm community,
I am encountering a unique situation where I need to allocate jobs to
nodes with different numbers of CPU cores. For instance
assigned to it. Also check the state of the nodes with 'sinfo'
It would also be good to ensure the node settings are right. Run 'slurmd
-C' on a node and see if the output matches what is in the config.
Brian Andrus
On 12/13/2022 1:38 AM, Gary Mansell wrote:
Dear Slurm Us
the many articles, wikis and videos
out there.
TL;DR: If you are going to be running efficient HPC jobs, you are indeed
better off with HT turned off.
Brian Andrus
On 12/13/2022 8:03 AM, Gary Mansell wrote:
Hi, thanks for getting back to me.
I have been doing some more experimenting, and I
Try:
sacctmgr list runawayjobs
Brian Andrus
On 12/20/2022 7:54 AM, Reed Dier wrote:
Hoping this is a fairly simple one.
This is a small internal cluster that we’ve been using for about 6
months now, and we’ve had some infrastructure instability in that
time, which I think may be the
Seems like the time may have been off on the db server at the insert/update.
You may want to dump the database, find which table/records need to be
updated, and try updating them. If anything went south, you could restore from
the dump.
Brian Andrus
On 12/20/2022 11:51 AM, Reed Dier wrote:
Just to
I suspect if you delete /var/lib/slurmrestd.socket and then start
slurmrestd, it will create it as the user you need it to be.
Or just change the owner of it to the slurmrestd owner.
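That is, something like (the user name is an assumption):

chown slurmrestd: /var/lib/slurmrestd.socket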
I have been running slurmrestd as a separate user for some time.
Brian Andrus
On 12/28/2022 3:20 PM, Chris
lurm/slurm.conf"
You can change those as needed. This made it listen on port 8081 only
(no socket and not 6820)
I was then able to just use curl on port 8081 to test things.
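For instance, assuming JWT auth is configured (the API version string
depends on your build):

export $(scontrol token)   # sets SLURM_JWT
curl -s -H "X-SLURM-USER-NAME: $USER" -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
  http://localhost:8081/slurm/v0.0.37/ping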
Hope that helps.
Brian Andrus
On 12/29/2022 6:49 AM, Chris Stackpole wrote:
Greetings,
Thanks for responding
ready.
Brian Andrus
On 1/4/2023 9:22 AM, Groner, Rob wrote:
We currently have a test cluster and a production cluster, all on the
same network. We try things on the test cluster, and then we gather
those changes and make a change to the production cluster. We're
doing that through two diffe
y with the new (known good) config.
Brian Andrus
On 1/17/2023 12:36 PM, Groner, Rob wrote:
So, you have two equal sized clusters, one for test and one for
production? Our test cluster is a small handful of machines compared
to our production.
We have a test slurm control node on a test cl
Then cluster_run.sh would call sbatch along with the appropriate commands.
Brian Andrus
On 2/7/2023 9:31 AM, Groner, Rob wrote:
I'm trying to setup the capability where a user can execute:
$: sbatch script_to_run.sh
and the end result is that a job is created on a node, and that job
wi
commands
are xterm, a shell script containing srun commands, and srun (see the
EXAMPLES section). *If no command is specified, then salloc runs the
user's default shell.*
Brian Andrus
On 2/8/2023 7:01 AM, Jeffrey T Frey wrote:
You may need srun to allocate a pty for the command.
efficient HPC jobs. The goal is that every process is utilizing the CPU
as close to 100% as possible, which would render hyper-threading moot.
Brian Andrus
On 2/13/2023 12:15 AM, Hermann Schwärzler wrote:
Hi Sebastian,
I am glad I could help (although not exactly as expected :-).
With