Hi all,
I have a few working GPU compute nodes. I bought a couple more
identical nodes. They are all diskless, so they all boot from the same
disk image.
For some reason slurmd refuses to start on the new nodes, and I'm not able
to find any differences in hardware or software. Google search
Bill wrote:
> Hi Alex,
>
> Try running nvidia-smi before starting slurmd; I ran into this issue too. I have to
> run nvidia-smi before slurmd when I reboot the system.
> Regards,
> Bill
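If the nvidia-smi trick turns out to be the fix, one way to make it stick across
reboots (a sketch, assuming slurmd runs under systemd and nvidia-smi lives in
/usr/bin) is a drop-in override for the slurmd unit:

# /etc/systemd/system/slurmd.service.d/override.conf  (hypothetical path)
[Service]
# create/initialize the /dev/nvidia* device files before slurmd starts
ExecStartPre=/usr/bin/nvidia-smi

Then: systemctl daemon-reload && systemctl restart slurmd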
Hello all,
My error was indeed just the comma in my gres.conf. I was confused because
I had the same file on my running nodes, but that's just because slurmd
started before the erroneous comma was added to the config.
So the error message was in fact correct: it could not find the
device.
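For anyone searching later, the offending line looked roughly like this (the
device paths here are illustrative; the point is just the stray comma):

# gres.conf -- broken: the trailing comma makes slurmd look for a device that doesn't exist
Name=gpu File=/dev/nvidia[0-3],
# fixed:
Name=gpu File=/dev/nvidia[0-3]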
Seems like your slurmctld is not running. Have you checked its log to see
why?
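Something like the following, on the controller host (the log path is whatever
SlurmctldLogFile points to in your slurm.conf; the one below is just a common
default):

systemctl status slurmctld
grep -i SlurmctldLogFile /etc/slurm/slurm.conf
tail -n 50 /var/log/slurm/slurmctld.log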
On Tue, Jul 31, 2018 at 8:35 AM Mahmood Naderan
wrote:
> Hi,
> It seems that squeue is broken due to the following error:
>
> [root@rocks7 ~]# squeue
> slurm_load_jobs error: Unable to contact slurm controller (conne
Hi,
Right now I have a cluster running SLURM v17.02.7 with:
JobAcctGatherType = jobacct_gather/none
The documentation says "NOTE: Changing this configuration parameter changes
the contents of the messages between Slurm daemons. Any previously running
job steps are managed by a slurmstepd d
Almost every place I worked built some site-specific tools for managing
jobs that some people found very useful. E.g.
https://github.com/StanfordBioinformatics/SJM
http://clusterjob.org/
There have also been some efforts to standardize this sort of thing:
https://www.commonwl.org/
I have not use
Hi Will,
You have bumped into the old adage: "HPC is just about moving the
bottlenecks around".
If your bottleneck is now your network, you may want to upgrade the
network. Then the disks will become your bottleneck :)
For GPU training-type jobs that load the same set of data over and over
agai
Hey Graziano,
To make your decision more "data-driven", you can pipe your SLURM
accounting logs into a tool like XDMoD, which will make pie charts of
usage by user, group, job, gres, etc.
https://open.xdmod.org/8.0/index.html
You may also consider assigning this task to one of your "machine
any
millions of jobs you want to process.
I'm not aware of command-line tools that produce pretty graphs suitable for
consumption by upper management :)
Regards,
Alex
On Thu, Mar 21, 2019 at 10:03 AM Noam Bernstein
wrote:
> On Mar 21, 2019, at 12:38 PM, Alex Chekholko wrote:
>
>
Hi Chris,
re: "can't run more than 1 job per node at a time. "
try "scontrol show config" and grep for defmem
IIRC by default the memory request for any job is all the memory in a node.
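For example (the value shown is illustrative; for a per-CPU default to take
effect, memory also needs to be a consumable resource, e.g.
SelectTypeParameters=CR_Core_Memory):

$ scontrol show config | grep -i DefMem
DefMemPerNode = 64000
# in slurm.conf, replace the whole-node default with a per-CPU default:
DefMemPerCPU=4000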
Regards,
Alex
On Thu, Apr 4, 2019 at 4:01 PM Andy Riebs wrote:
> in slurm.conf, on the line(s) starting "
Hi all,
I'm running on Ubuntu 18.04.2 LTS
munge is from the Ubuntu package
slurm v18.08.7, which I compile myself with
./configure --prefix=/tmp/slurm-build-7 --sysconfdir=/etc/slurm
--enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/
--without-shared-libslurm
Then I make a deb with fpm and ins
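The fpm step is roughly the following (a sketch; the package name, version,
and paths are illustrative, not my exact invocation):

fpm -s dir -t deb -n slurm -v 18.08.7 --prefix=/usr -C /tmp/slurm-build-7 .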
Hey Suzanne,
In order to "combine" RAM between different systems, you will need a
hardware/software solution like ScaleMP, or you need a software framework
like OpenMPI. If your software is already written to use MPI then, in a
sense, it is "combining" the memory.
SLURM is a resource manager and
Hi all,
My expectation is that the epilog script gets run no matter what happens to
the job (it fails, is canceled, times out, etc.). Is that true, or are there corner
cases? I hope I correctly understand the intended behavior.
My OS is Ubuntu 18.04.2 LTS and my SLURM is 18.08.7 built from source.
The e
I think this error usually means that your node cn7 has either the
wrong /etc/hosts or the wrong /etc/slurm/slurm.conf
E.g. try 'srun --nodelist=cn7 ping -c 1 cn7'
On Wed, May 29, 2019 at 6:00 AM Alexander Åhman
wrote:
> Hi,
> Have a very strange problem. The cluster has been working just
Hey Samuel,
Can't you just adjust the existing "cpu" limit numbers using those same
multipliers? Someone bought 100 CPUs 5 years ago; now that's ~70 CPUs.
Or vice versa: someone buys 100 CPUs today and gets a setting of 130 CPUs,
because the CPUs are normalized to the old performance. Since it
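Assuming the limits live in the accounting database as account/association
GrpTRES limits, the adjustment is just (account name and value illustrative):

sacctmgr modify account some_group set GrpTRES=cpu=70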
Hi Chad,
Here is the most generally useful process I ended up with, implemented in a
local custom utility script.
#Update slurm.conf everywhere
#Stop slurmctld
#Restart all slurmd processes
#Start slurmctld
per:
https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes
I think you only will
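The script itself is not much more than this (a sketch, assuming a clush/pdsh-style
parallel shell and standard paths; hostlists are illustrative):

# push the new slurm.conf everywhere (controller + compute nodes)
clush -w admin,node[01-99] --copy slurm.conf --dest /etc/slurm/slurm.conf
# stop the controller, bounce all the slurmd's, start the controller again
systemctl stop slurmctld
clush -w node[01-99] systemctl restart slurmd
systemctl start slurmctld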
Hey David,
Which distro? Which kernel version? Which systemd version? Which SLURM
version?
Based on some paths in your variables, I'm guessing an Ubuntu distro with
Debian SLURM packages?
Regards,
Alex
On Wed, Aug 21, 2019 at 5:24 AM David da Silva Pires <
david.pi...@butantan.gov.br> wrote:
>
Hi David,
I actually don't know much about cgroups, and I don't have a single-node
cluster.
Here are some cgroup-related settings from my regular Ubuntu 18.04 cluster,
running SLURM 18.08.7
root@cb-admin:~# cat /etc/slurm/slurm.conf | grep -i cgr
ProctrackType=proctrack/cgroup
TaskPlugin=task/cg
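A plausible companion /etc/slurm/cgroup.conf for those settings would be
something like this (illustrative, not my exact file):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes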
Sounds like maybe you didn't correctly roll out / update your slurm.conf
everywhere, as your RealMemory value is back to the large, incorrect number.
You need to update your slurm.conf everywhere and restart all the slurm
daemons.
I recommend the "safe procedure" from here:
https://wiki.fysik.dtu.dk/ni
Hi Mike,
IIRC if you have the default config, jobs get all the memory in the node,
thus you can only run one job at a time. Check:
root@admin:~# scontrol show config | grep DefMemPerNode
DefMemPerNode = 64000
Regards,
Alex
On Thu, Nov 7, 2019 at 1:21 PM Mike Mosley wrote:
> Greetings
Hi,
I had asked a similar question recently (maybe a year ago) and also got
crickets. I think in our case we were not able to ensure that the epilog
always ran for different types of job failures, so we just had the users
add some more cleanup code to the end of their jobs _and_ also run separate
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com
> ] *On Behalf Of *Alex Chekholko
> *Sent:* Monday, December 9, 2019 12:53 PM
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] Timeout and Epilogue
Hey Steve,
I think it doesn't just "power down" the nodes but deletes the instances.
So then when you need a new node, it creates one, then provisions the
config, then updates the slurm cluster config...
That's how I understand it, but I haven't tried running it myself.
Regards,
Alex
On Thu, De
Hey Dean,
Does 'scontrol show node
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
Also check that slurmd daemons on the compute nodes can talk to each other
(not just to the master). e.g. bottom of
https://slurm.schedmd.com/big_sys.html
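A quick sanity check of that connectivity (assuming the default ports, 6817 for
slurmctld and 6818 for slurmd; hostnames illustrative):

# from a compute node: can I reach the controller, and a peer's slurmd?
nc -zv master 6817
nc -zv node02 6818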
Regards,
Alex
Hey Dean,
Here is what I found in my build notes, which are now at least a year out of
date, but there are probably some more configure parameters you will want to
specify with the relevant directories:
./configure --prefix=/tmp/slurm-build --sysconfdir=/etc/slurm --enable-pam
--with-pam_dir=/lib/x86_64-li
Hey Sudeep,
Which flags to sreport have you tried? Which information was missing?
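For reference, the kind of invocation I would start from (dates and units
illustrative):

sreport cluster AccountUtilizationByUser start=2020-03-01 end=2020-04-01 -t hours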
Regards,
Alex
On Thu, Apr 2, 2020 at 10:29 PM Sudeep Narayan Banerjee <
snbaner...@iitgn.ac.in> wrote:
> Dear Steven: Yes, but am unable to get the desired data. Not sure which
> flags to use.
>
> Thanks & Regard
You will want to look at the output of 'sinfo' and 'scontrol show node' to
see what slurmctld thinks about your compute nodes; then on the compute
nodes you will want to check the status of the slurmd service ('systemctl
status -l slurmd') and possibly read through the slurmd logs as well.
On Mon,
Any time a node goes into DRAIN state you need to manually intervene and
put it back into service.
scontrol update nodename=ip-172-31-80-232 state=resume
On Mon, May 11, 2020 at 11:40 AM Joakim Hove wrote:
>
> You’re on the right track with the DRAIN state. The more specific answer
>> is in the
Hi Andrew,
I think maybe something is wrong with your slurmd, maybe something missing
from your install?
On the node (where slurmd is running), you should see a message similar to
this in slurmd.log
[2020-05-11T14:29:17.766] Gres Name=gpu Type=titanrtx Count=4 ID=7696487
File=/dev/nvidia[0-3] (n
Hi David,
There are several approaches to having a shared filesystem namespace without
an actual shared filesystem. One issue you will have to contend with is how
to handle any kind of filesystem caching (how much room to allocate for
local cache, how to handle cache inconsistencies).
examples:
gcs
Hey Raj,
To me this all sounds, at a high level, like a job for some kind of lightweight
middleware on top of SLURM, e.g. makefiles or something like that, where
each pipeline would be managed outside of slurm and would maybe submit a
job to install some software, then submit a job to run something o
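As a trivial sketch of that driver idea (the script names are made up), each
pipeline step can be chained with job dependencies:

# submit the install step, then make the run step wait for it to succeed
install_id=$(sbatch --parsable install_software.sh)
sbatch --dependency=afterok:${install_id} run_analysis.sh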
Hi,
Your job does not request any specific amount of memory, so it gets the
default request. I believe the default request is all the RAM in the node.
Try something like:
$ scontrol show config | grep -i defmem
DefMemPerNode = 64000
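and then have jobs request what they actually need, e.g. (values illustrative):

sbatch --mem=4G myjob.sh
# or, per allocated CPU:
sbatch --mem-per-cpu=2G myjob.sh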
Regards,
Alex
On Mon, Nov 23, 2020 at 12:33 PM Jan
This may be more "cargo cult" but I've advised users to add a "sleep 60" to
the end of their job scripts if they are "I/O intensive". Sometimes they
are somehow able to generate I/O in a way that slurm thinks the job is
finished, but the OS is still catching up on the I/O, and then slurm tries
to
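In practice the tail of the job script ends up looking something like this (a
cargo-cult sketch, not a real fix):

# ... the actual work above ...
sync       # flush what we can
sleep 60   # give the I/O a chance to settle before the job officially ends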
Hi Luke,
Yes, I think your request is unusual.
I believe in the past there have been a number of middle-wares that helped
with this kind of bureaucracy, things like
http://docs.adaptivecomputing.com/gold/
Regards,
Alex
On Thu, Dec 10, 2020 at 9:23 AM Luke Yeager wrote:
> (originally posted at
Hey Sajesh,
Each public cloud vendor provides a standard way to create a virtual
private network in their infrastructure and connect that private network to
your existing private network for your cluster. The devil is in the
networking details.
So in that case, you can just treat it as a new rac
Hi Jason,
Ultimately each site decides how/why to do it; in my case I tend to do big
"forklift upgrades", so I'm running 18.08 on the current cluster and will
go to latest SLURM for my next cluster build. But you may have good
reasons to upgrade slurm more often on your existing cluster. I don't
In my most recent experience, I have some SSDs in compute nodes that
occasionally just drop off the bus, so the compute node loses its OS disk.
I haven't thought about it too hard, but the default NHC scripts do not
notice that. Similarly, Paul's proposed script might need to also check
that the s
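A crude sketch of the kind of extra check I mean (device path illustrative; when
the SSD drops off the bus, the block device disappears and/or the root
filesystem goes read-only):

# is the OS disk still present, and is / still writable?
test -b /dev/sda || exit 1
touch /.nhc_write_test && rm -f /.nhc_write_test || exit 1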
I don't have specific answers to your questions but one thing you can do is
run the slurmd on one of your "nodes" and see what hardware specs SLURM
auto-detects.
Run "slurmd -C"; from the man page:
-C Print actual hardware configuration and exit. The format of
output is the same as used
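The output is basically a NodeName line you can paste (or adapt) into
slurm.conf; roughly like this, with made-up numbers:

$ slurmd -C
NodeName=node01 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=128000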
There was a previous thread where someone recommended a third-party script:
"pestat -G" that will parse the outputs of 'scontrol shown node' and
'scontrol show job' and add up the used GPUs perhaps?
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
On Fri, Mar 16, 2018 at 11:44 AM,
The thing you are describing is possible in both theory and practice.
Plenty of people use a scheduler on a single large host. The challenge
will be in enforcing user practices so that they don't just run commands
directly but instead go through the scheduler.
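On the Slurm side, a single big host is just a one-node cluster; a minimal
sketch (names and sizes illustrative) looks like:

# slurm.conf fragment
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=bignode CPUs=128 RealMemory=1024000
PartitionName=main Nodes=bignode Default=YES MaxTime=INFINITE State=UP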
On Fri, Apr 6, 2018 at 10:00 AM, Patrick Goetz
wrot
Hey Will,
It may be just as easy in your case to build it directly; it's just one
C file and a Makefile:
https://github.com/SchedMD/slurm/tree/master/contribs/pam
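i.e., after running ./configure at the top of the unpacked source tree,
building it is roughly (a sketch):

cd contribs/pam
make
# then install pam_slurm.so into your PAM module directory, e.g.
sudo make install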
Regards,
Alex
On Fri, May 4, 2018 at 2:11 PM, Will Dennis wrote:
> I just tried unpacking the original archive, and running “./co
Add a logging rule to your iptables and look at what traffic is actually
being blocked?
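e.g. something like this, inserted above your DROP/REJECT rules (the prefix is
illustrative), then watch the kernel log for traffic on the slurm ports:

iptables -I INPUT 1 -j LOG --log-prefix "iptables-in: " --log-level 4
# then: journalctl -k -f   (or tail -f /var/log/kern.log)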
On Wed, May 16, 2018 at 11:11 AM Sean Caron wrote:
> Hi all,
>
> Does anyone use SLURM in a scenario where there is an iptables firewall on
> the compute nodes on the same network it uses to communicate with
Hi all,
I have a cloud cluster running in GCP that seems to have gotten stuck
in a state where the slurmctld will not start/stop compute nodes; it
just sits there with thousands of jobs in the queue and only a few
compute nodes up and running (out of thousands).
I can try to kick it by setting no