[slurm-users] Is that possible to submit jobs to a Slurm cluster right from a developer's PC

2019-12-11 Thread Victor (Weikai) Xie
Hi, We are trying to set up a tiny Slurm cluster to manage shared access to the GPU server in our team. Both slurmctld and slurmd are going to run on this GPU server. But here is a problem. On one hand, we don't want to give developers ssh access to that box, because otherwise they might bypass
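Not from the thread itself, but one common way to do this is to make each developer PC a submit-only host: install the Slurm client commands, share the cluster's munge key and slurm.conf, and users can then run sbatch/srun without ever logging into the GPU server. A rough sketch, assuming Debian-style package names, config paths, and the hostname gpu-server:

  # On the developer PC:
  sudo apt-get install slurm-client munge
  # Copy the cluster's munge key and slurm.conf from the GPU server:
  sudo scp root@gpu-server:/etc/munge/munge.key /etc/munge/munge.key
  sudo scp root@gpu-server:/etc/slurm-llnl/slurm.conf /etc/slurm-llnl/slurm.conf
  sudo systemctl restart munge
  # Jobs can now be submitted without an interactive login on the server:
  sbatch --wrap "nvidia-smi"
  squeue -u "$USER"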

Re: [slurm-users] Need help with controller issues

2019-12-11 Thread dean.w.schulze
Is that logged somewhere or do I need to capture the output from the make command to a file? -Original Message- From: slurm-users On Behalf Of Kurt H Maier Sent: Wednesday, December 11, 2019 6:32 PM To: Slurm User Community List Subject: Re: [slurm-users] Need help with controller
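For reference (not from the reply itself): configure already records its library probes in config.log, and the make output can be captured with tee; the file names below are arbitrary.

  ./configure --prefix=/opt/slurm 2>&1 | tee configure.out
  grep -iE 'mysql|mariadb' config.log    # shows whether the MariaDB/MySQL client library was found
  make -j"$(nproc)" 2>&1 | tee make.out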

Re: [slurm-users] cleanup script after timeout

2019-12-11 Thread Brian Andrus
You prompted me to dig even deeper into my epilog. I was trying to access a semaphore file in the user's home directory. It seems that when the epilogue is run, the ~ is not expanded in any way. So I can't even use ~${SLURM_JOB_USER} to access their semaphore file. Potentially problematic for
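A sketch of one workaround (the semaphore file name is hypothetical): since the epilog runs as root/SlurmdUser, resolve the job owner's home directory explicitly with getent instead of relying on tilde expansion.

  #!/bin/bash
  # Epilog fragment: look up the owner's home directory from the passwd database.
  user_home=$(getent passwd "$SLURM_JOB_USER" | cut -d: -f6)
  semaphore="${user_home}/.job_${SLURM_JOB_ID}.lock"
  [ -n "$user_home" ] && rm -f "$semaphore"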

Re: [slurm-users] Need help with controller issues

2019-12-11 Thread Kurt H Maier
On Wed, Dec 11, 2019 at 04:04:44PM -0700, Dean Schulze wrote: > I tried again with a completely new system (virtual machine). I used the > latest source, I used mysql instead of mariadb, and I installed all the > client and dev libs (below). I still get the same error. It doesn't > build the

Re: [slurm-users] Lua jobsubmit plugin for cons_tres ?

2019-12-11 Thread Renfro, Michael
Snapshot of a job_submit.lua we use to automatically route jobs to a GPU partition if the user asks for a GPU: https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5 All our users just use srun or sbatch with a default queue, and the plugin handles it from there. There’s more
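A quick user-side check of this kind of routing plugin, assuming hypothetical partition names (the gist linked above is the actual implementation):

  sbatch --parsable --wrap hostname                # should stay in the default partition
  sbatch --parsable --gres=gpu:1 --wrap hostname   # should be routed to the gpu partition
  squeue -u "$USER" -o '%i %P %b'                  # job id, partition, requested gres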

Re: [slurm-users] cleanup script after timeout

2019-12-11 Thread Juergen Salk
Hi Brian, can you maybe elaborate on how exactly you verified that your epilog does not run when a job exceeds its walltime limit? Does it run when jobs end normally or when a running job is cancelled by the user? I am asking because in our environment the epilog also runs when a job
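One way to answer that question empirically (the log path is an assumption): add a line like the following to the epilog, then submit a short job that is allowed to hit its time limit.

  echo "$(date '+%F %T') epilog ran for job ${SLURM_JOB_ID} (user ${SLURM_JOB_USER})" \
      >> /var/log/slurm/epilog-debug.log

Something like sbatch -t 1 --wrap 'sleep 300', followed by sacct -j <jobid> -o JobID,State once the job has been killed, should then show whether a log line appears for a job that ended in the TIMEOUT state.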

Re: [slurm-users] Need help with controller issues

2019-12-11 Thread Dean Schulze
I tried again with a completely new system (virtual machine). I used the latest source, I used mysql instead of mariadb, and I installed all the client and dev libs (below). I still get the same error. It doesn't build the /usr/lib/slurm/accounting_storage_mysql.so file. Could the ./configure

Re: [slurm-users] Need help with controller issues

2019-12-11 Thread Eli V
Look for libmariadb-client. That's needed for slurmdbd on Debian. On Wed, Dec 11, 2019 at 11:43 AM Dean Schulze wrote: > > Turns out I've already got libmariadb-dev installed: > > $ dpkg -l | grep maria > ii libmariadb-dev 3.0.3-1build1 >
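A hedged sketch for Debian/Ubuntu (exact package names vary by release; the prefix is an assumption): the accounting_storage_mysql plugin is only built when configure finds the MariaDB/MySQL client development files, so install them and re-run configure before rebuilding.

  sudo apt-get install libmariadb-dev libmariadb-dev-compat   # compat package provides mysql_config
  ./configure --prefix=/opt/slurm && make -j"$(nproc)" && sudo make install
  ls /opt/slurm/lib/slurm/accounting_storage_mysql.so         # should exist after a successful rebuild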

[slurm-users] cleanup script after timeout

2019-12-11 Thread Brian Andrus
All, So I have verified that the Epilog script is NOT run for any job that times out, even though the documentation (https://slurm.schedmd.com/prolog_epilog.html) states "At job termination". I guess timeouts are not considered terminated?? So, is there a recommended way to have a cleanup
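Not from this thread, but one documented alternative for per-job cleanup is to have Slurm signal the batch script shortly before the time limit and trap it; the step name and scratch path below are hypothetical.

  #!/bin/bash
  #SBATCH -t 00:10:00
  #SBATCH --signal=B:USR1@60     # send USR1 to the batch shell 60 s before the limit

  cleanup() {
      echo "cleaning up scratch for job $SLURM_JOB_ID"
      rm -rf "/tmp/scratch_${SLURM_JOB_ID}"
  }
  trap cleanup USR1 EXIT

  mkdir -p "/tmp/scratch_${SLURM_JOB_ID}"
  srun ./long_running_step &     # run the real work in the background
  wait                           # so the shell can receive USR1 while it waits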

Re: [slurm-users] Lua jobsubmit plugin for cons_tres ?

2019-12-11 Thread Paul Edmon
We do this via looking at gres. The info is in the job_desc.gres variable. We basically do the inverse, where we ensure someone is asking for a GPU before allowing them to submit to a GPU partition. -Paul Edmon- On 12/11/2019 12:32 PM, Grigory Shamov wrote: Hi All, I am trying the

[slurm-users] Slurm 18.08.8 --mem-per-cpu + --exclusive = strange behavior

2019-12-11 Thread Beatrice Charton
Hi, We have a strange behaviour of Slurm after updating from 18.08.7 to 18.08.8, for jobs using --exclusive and --mem-per-cpu. Our nodes have 128GB of memory, 28 cores. $ srun --mem-per-cpu=3 -n 1 --exclusive hostname => works in 18.08.7 => doesn’t work in 18.08.8 In 18.08.8 :
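A small diagnostic sketch for comparing the two request styles (the node name is an assumption; the 28 x 3 arithmetic follows the example above):

  scontrol show node node001 | grep -Ei 'realmemory|allocmem'
  srun --exclusive --mem-per-cpu=3 -n 1 hostname   # the per-CPU form reported to fail on 18.08.8
  srun --exclusive --mem=84 -n 1 hostname          # same total expressed as a per-node limit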

Re: [slurm-users] Multi-node job failure

2019-12-11 Thread Chris Woelkers - NOAA Federal
Partial progress. The scientist that developed the model took a look at the output and found that instead of one model run being run in parallel, srun had run multiple instances of the model, one per thread, which for this test was 110 threads. I have a feeling this just verified the same thing

Re: [slurm-users] Multi-node job failure

2019-12-11 Thread Chris Woelkers - NOAA Federal
I tried a simple thing of swapping out mpirun in the sbatch script for srun. Nothing more, nothing less. The model is now working on at least two nodes; I will have to test again on more, but this is progress. Thanks, Chris Woelkers IT Specialist National Oceanic and Atmospheric Administration Great
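The thread only says that mpirun was swapped for srun; as a generic illustration, a batch script along these lines lets Slurm launch the MPI ranks itself (the binary name and task counts are assumptions):

  #!/bin/bash
  #SBATCH -J model_test
  #SBATCH -N 2                   # two nodes, as in the test above
  #SBATCH --ntasks-per-node=20   # hypothetical MPI rank count per node
  #SBATCH -t 02:00:00

  # srun takes the node/task layout from the allocation, so no -np or hostfile is needed.
  srun ./model.exe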

Re: [slurm-users] Multi-node job failure

2019-12-11 Thread Chris Woelkers - NOAA Federal
Thanks all for the ideas and possibilities. I will answer all in turn. Paul: Neither of the switches in use, Ethernet and Infiniband, have any form of broadcast storm protection enabled. Chris: I have passed on your question to the scientist that created the sbatch script. I will also look into

Re: [slurm-users] SLURM_TMPDIR

2019-12-11 Thread Tina Friedrich
Hi Angelines, I use a plugin for that - I believe this one https://github.com/hpc2n/spank-private-tmp which sort of does it all; your job sees an (empty) /tmp/. (It doesn't do cleanup; I simply rely on the OS cleaning up /tmp at the moment.) Tina On 05/12/2019 15:57, Angelines wrote: > Hello,
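A lighter-weight alternative sketch (not the spank plugin above; paths are assumptions): a TaskProlog can create a per-job directory and export SLURM_TMPDIR, since any line it prints in the form "export NAME=value" is added to the task's environment. A root Epilog would then have to remove the directory, which also covers the cleanup that relying on /tmp leaves out.

  #!/bin/bash
  # taskprolog.sh -- configured as TaskProlog= in slurm.conf; runs as the job user.
  tmpdir="/tmp/$(id -un)_${SLURM_JOB_ID}"
  mkdir -p "$tmpdir"
  echo "export SLURM_TMPDIR=$tmpdir"
  echo "export TMPDIR=$tmpdir"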

Re: [slurm-users] Multi-node job failure

2019-12-11 Thread Zacarias Benta
I had a similar issue; please check whether the home directory, or wherever the data should be stored, is mounted on the nodes. On Tue, 2019-12-10 at 14:49 -0500, Chris Woelkers - NOAA Federal wrote: > I have a 16 node HPC that is in the process of being upgraded from > CentOS 6 to 7. All nodes are
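One quick way to run that check across the cluster (nothing assumed beyond standard Slurm commands):

  # Ask every node to report whether the home path is mounted.
  for node in $(sinfo -N -h -o '%N' | sort -u); do
      srun -w "$node" -N1 -n1 df -h "$HOME" || echo "home not reachable on $node"
  done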