Hi,
We are trying to set up a tiny Slurm cluster to manage shared access to the
GPU server in our team. Both slurmctld and slurmd are going to run on this
GPU server. But here is the problem: on one hand, we don't want to give
developers ssh access to that box, because otherwise they might bypass
Is that logged somewhere or do I need to capture the output from the make
command to a file?
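(If it isn't logged anywhere automatically, one simple way to keep a copy,
assuming a POSIX shell, is to tee the build output to a file and grep it
afterwards:
$ make 2>&1 | tee build.log
)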
-----Original Message-----
From: slurm-users On Behalf Of Kurt H Maier
Sent: Wednesday, December 11, 2019 6:32 PM
To: Slurm User Community List
Subject: Re: [slurm-users] Need help with controller
You prompted me to dig even deeper into my epilog. I was trying to
access a semaphore file in the user's home directory.
It seems that when the epilog is run, the ~ is not expanded in any way.
So I can't even use ~${SLURM_JOB_USER} to access their semaphore file.
Potentially problematic for
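A possible workaround, just a sketch assuming the semaphore file sits
directly under the user's home directory (the file name below is made up),
would be to resolve the home directory from the passwd database instead of
relying on tilde expansion:
# in the epilog, look up the job owner's home directory explicitly
USER_HOME=$(getent passwd "$SLURM_JOB_USER" | cut -d: -f6)
rm -f "$USER_HOME/.job_semaphore"   # hypothetical semaphore file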
On Wed, Dec 11, 2019 at 04:04:44PM -0700, Dean Schulze wrote:
> I tried again with a completely new system (virtual machine). I used the
> latest source, I used mysql instead of mariadb, and I installed all the
> client and dev libs (below). I still get the same error. It doesn't
> build the
Snapshot of a job_submit.lua we use to automatically route jobs to a GPU
partition if the user asks for a GPU:
https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5
All our users just use srun or sbatch with a default queue, and the plugin
handles it from there. There’s more
Hi Brian,
can you maybe elaborate on how exactly you verified that your epilog
does not run when a job exceeds its walltime limit? Does it run when
a job ends normally, or when a running job is cancelled by the user?
I am asking because in our environment the epilog also runs when a job
I tried again with a completely new system (virtual machine). I used the
latest source, I used mysql instead of mariadb, and I installed all the
client and dev libs (below). I still get the same error. It doesn't
build the /usr/lib/slurm/accounting_storage_mysql.so file.
Could the ./configure
Look for libmariadb-client. That's needed for slurmdbd on debian.
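A quick way to check whether configure actually picked up the MariaDB/MySQL
development files, as a rough sketch assuming you configure in the source
tree, is to look at config.log and at the config helpers on your PATH:
$ grep -i -e mysql -e mariadb config.log
$ which mysql_config mariadb_config
If those turn up nothing, the accounting_storage_mysql plugin won't be built.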
On Wed, Dec 11, 2019 at 11:43 AM Dean Schulze wrote:
>
> Turns out I've already got libmariadb-dev installed:
>
> $ dpkg -l | grep maria
> ii libmariadb-dev 3.0.3-1build1
>
All,
So I have verified that the Epilog script is NOT run for any job that times
out, even though the documentation
(https://slurm.schedmd.com/prolog_epilog.html) states "At job termination".
I guess timed-out jobs are not considered terminated??
So, is there a recommended way to have a cleanup
We do this by looking at gres. The info is in the job_desc.gres
variable. We basically do the inverse, where we ensure someone is
asking for a GPU before allowing them to submit to a GPU partition.
-Paul Edmon-
On 12/11/2019 12:32 PM, Grigory Shamov wrote:
Hi All,
I am trying the
Hi,
We are seeing strange behaviour from Slurm after updating from 18.08.7 to
18.08.8, for jobs using --exclusive and --mem-per-cpu.
Our nodes have 128 GB of memory and 28 cores.
$ srun --mem-per-cpu=3 -n 1 --exclusive hostname
=> works in 18.08.7
=> doesn’t work in 18.08.8
In 18.08.8 :
Partial progress. The scientist who developed the model took a look at the
output and found that instead of one model run being run in parallel, srun
had run multiple instances of the model, one per thread, which for this
test was 110 threads.
I have a feeling this just verified the same thing
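For comparison, the number of copies srun starts is governed by the task
count in the allocation, not by anything mpirun-specific, so a minimal MPI
batch script would look something like the sketch below (the binary name and
counts are made up):
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16   # MPI ranks per node, not one per hardware thread
srun ./model.exe               # srun launches exactly nodes * ntasks-per-node copies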
I tried a simple thing of swapping out mpirun in the sbatch script for
srun. Nothing more, nothing less.
The model is now working on at least two nodes; I will have to test again
on more, but this is progress.
Thanks,
Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great
Thanks all for the ideas and possibilities. I will answer all in turn.
Paul: Neither of the switches in use, Ethernet and Infiniband, has any
form of broadcast storm protection enabled.
Chris: I have passed on your question to the scientist who created
the sbatch script. I will also look into
Hi Angelines,
I use a plugin for that - I believe this one:
https://github.com/hpc2n/spank-private-tmp
which sort of does it all; your job sees an (empty) /tmp/.
(It doesn't do cleanup; I simply rely on the OS cleaning up /tmp at the
moment.)
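If explicit cleanup is ever wanted instead of relying on the OS, a rough
alternative, not part of that plugin and with a made-up path, would be an
epilog along these lines:
#!/bin/bash
# Epilog sketch: remove a hypothetical per-job scratch directory
JOB_TMP="/tmp/slurm_job_${SLURM_JOB_ID}"
[ -n "$SLURM_JOB_ID" ] && rm -rf "$JOB_TMP"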
Tina
On 05/12/2019 15:57, Angelines wrote:
> Hello,
I had a similar issue. Please check whether the home directory, or wherever
the data should be stored, is mounted on the nodes.
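A quick way to check from the head node, just a sketch assuming /home is
the mount point in question and the nodes are reachable through Slurm:
$ srun --nodes=16 --ntasks-per-node=1 df -h /home
(one df per node; adjust the node count and path to match your setup)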
On Tue, 2019-12-10 at 14:49 -0500, Chris Woelkers - NOAA Federal wrote:
> I have a 16 node HPC that is in the process of being upgraded from
> CentOS 6 to 7. All nodes are