[slurm-dev] Re: Multiple simultaneous jobs on single node

2017-02-09 Thread Marc Rollins
Thank you! It appears that setting SelectTypeParameters=CR_CPU does what I want. On Thu, Feb 9, 2017 at 11:55 AM, Allan Streib wrote: > > I just got something similar working. I used CR_CPU for > SelectTypeParameters. > > Marc Rollins writes: > >
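For reference, the relevant slurm.conf lines for this fix look something like the following (a sketch; CR_CPU requires one of the consumable-resource select plugins):

```
# slurm.conf (fragment)
# Treat CPUs as individually consumable resources so that several
# jobs can run on the same node at once
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
```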

[slurm-dev] Re: Multiple simultaneous jobs on single node

2017-02-09 Thread Allan Streib
I just got something similar working. I used CR_CPU for SelectTypeParameters. Marc Rollins writes: > Hello, > > I am attempting to run multiple jobs simultaneously on a node with two GPUs. > However, all my attempts fail. Both > jobs are queued, but only one runs at

[slurm-dev] Multiple simultaneous jobs on single node

2017-02-09 Thread Marc Rollins
Hello, I am attempting to run multiple jobs simultaneously on a node with two GPUs. However, all my attempts fail. Both jobs are queued, but only one runs at a time while the other remains in the queue. My SLURM configuration is below. Any assistance is greatly appreciated. # slurm.conf

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Novosielski
I have used ulimits in the past to limit users to 768MB of RAM per process. This seemed to be enough to run anything they were actually supposed to be running. I would use cgroups on a more modern system (this was RHEL5). A related question: we used cgroups on a CentOS 6 system, but then switched our
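A per-process cap of that kind can be written in /etc/security/limits.conf (fragment; the 768MB figure comes from the message above, the group name is a placeholder):

```
# /etc/security/limits.conf (fragment)
# Limit each user process to ~768MB of virtual address space
# ("as" takes a value in KB: 768 * 1024 = 786432)
@users  hard  as  786432
```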

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Cox
If you're interested in the programmatic method I mentioned to increase limits for file transfers, https://github.com/BYUHPC/uft/tree/master/cputime_controls might be worth looking at. It works well for us, though a user will occasionally start using a new file transfer program that you

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Jason Bacon
That reminds me, we also don't allow file transfers through the head node: chmod 750 /usr/bin/sftp /usr/bin/scp /usr/bin/rsync All file transfer operations must go through one of the file servers. On 02/09/17 12:13, Nicholas McCollum wrote: While this isn't a SLURM issue, it's something we
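The chmod above simply strips world execute permission from the transfer tools; the effect can be demonstrated safely on a scratch file rather than on the real binaries:

```shell
# Demonstrate the permission change on a temporary file instead of
# /usr/bin/sftp, /usr/bin/scp and /usr/bin/rsync
tmp=$(mktemp)
chmod 750 "$tmp"
stat -c '%a' "$tmp"    # prints 750: owner rwx, group r-x, others nothing
rm -f "$tmp"
```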

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Nicholas McCollum
While this isn't a SLURM issue, it's something we all face. Due to my system being primarily students, it's something I face a lot. I second the use of ulimits, although this can kill off long running file transfers. What you can do to help out users is set a low soft limit and a somewhat
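The soft-plus-hard pattern described here might look like the following (fragment; the values are illustrative, not Nicholas's actual settings):

```
# /etc/security/limits.conf (fragment)
# Low soft limit that a user can raise themselves (e.g. with
# `ulimit -t` before a long file transfer), plus a hard ceiling
# they cannot exceed; values are CPU minutes
*  soft  cpu  30
*  hard  cpu  120
```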

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ole Holm Nielsen
We limit the cpu times in /etc/security/limits.conf so that user processes have a maximum of 10 minutes. It doesn't eliminate the problem completely, but it's fairly effective on users who misunderstood the role of login nodes. On Thu, Feb 9, 2017 at 6:38 PM +0100, "Jason Bacon"

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Jason Bacon
We simply make it impossible to run computational software on the head nodes. 1. No scientific software packages are installed on the local disk. 2. Our NFS-mounted application directory is mounted with noexec. Regards, Jason On 02/09/17 07:09, John Hearns wrote: Does anyone
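The noexec mount can be expressed in /etc/fstab along these lines (fragment; the server and mount point are placeholders):

```
# /etc/fstab (fragment)
# The applications tree is visible on the login node, but nothing
# on it can be executed there
fileserver:/export/apps  /apps  nfs  ro,noexec,nosuid  0  0
```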

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread John Hearns
Thanks to Ryan, Sarlo and Sean. > "Killed" isn't usually a helpful error message that they understand. Au contraire, I usually find that is a message they understand. Pour encourager les autres, you understand. -Original Message- From: Ryan Cox [mailto:ryan_...@byu.edu] Sent: 09

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Cox
John, We use /etc/security/limits.conf to set cputime limits on processes:

* hard cpu 60
root hard cpu unlimited

It works pretty well, but long-running file transfers can get killed. We have a script that looks for whitelisted programs to remove the limit from on a periodic basis. We

[slurm-dev] Re: sacctmgr case insensitive

2017-02-09 Thread Loris Bennett
Hi Daniel, Daniel Ruiz Molina writes: > Hi, > > I'm adding user to accounts in accounting information. However, some users in > my > system have capital letters and when I try to add them to their account, > sacctmgr returns this message: "There is no uid for user

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Sean McGrath
Hi, We use cgroups to limit usage to 3 cores and 4G of memory on the head nodes. I didn't do it, but I will copy and paste in our documentation below. Those limits (3 cores and 4G) are global to all non-root users, I think, as they apply to a group. We obviously don't do this on the nodes. We also
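On a system of that era, such a global limit is typically written with libcgroup; a sketch along these lines (group name, core list, and the rules-file matching are assumptions, not Sean's actual configuration):

```
# /etc/cgconfig.conf (fragment)
group login_users {
    cpuset {
        cpuset.cpus = "0-2";              # 3 cores
        cpuset.mems = "0";
    }
    memory {
        memory.limit_in_bytes = "4G";
    }
}

# /etc/cgrules.conf (fragment): place logins in the group
# (a real setup would exclude root, e.g. via an earlier root rule)
*  cpuset,memory  login_users/
```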

[slurm-dev] Re: Slurmd daemon doesn't start

2017-02-09 Thread David Ramírez
You need the nvidia module to have been loaded since machine boot. This script worked for me; I use CentOS. You can add it to init.d. Regards On 09/02/17 at 13:56, Christian Goll wrote: Hello Daniel, do /dev/nvidia[0-1] exist on the machines? If not, see under

[slurm-dev] sacctmgr case insensitive

2017-02-09 Thread Daniel Ruiz Molina
Hi, I'm adding users to accounts in the accounting information. However, some users in my system have capital letters in their names, and when I try to add them to their account, sacctmgr returns this message: "There is no uid for user 'MY_USER' Are you sure you want to continue?". Then, if I click "y", the user is
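The command in question is along these lines (sketch; the user and account names are placeholders):

```
# sacctmgr is asked to create an association for a user whose
# system name contains capital letters
sacctmgr add user name=MY_USER account=my_account
```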

[slurm-dev] Stopping compute usage on login nodes

2017-02-09 Thread John Hearns
Does anyone have a good suggestion for this problem? On a cluster I am implementing I noticed a user is running a code on 16 cores, on one of the login nodes, outside the batch system. What are the accepted techniques to combat this? Other than applying a LART, if you all know what this means.

[slurm-dev] Re: Slurmd daemon doesn't start

2017-02-09 Thread Christian Goll
Hello Daniel, do /dev/nvidia[0-1] exist on the machines? If not, see under http://docs.nvidia.com/cuda/cuda-installation-guide-linux/ There is a shell script which creates the device nodes for you. They are not always created during startup, especially if there is no X on the system. Kind
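The script referred to in the CUDA installation guide is essentially the following (reproduced as a sketch; it needs root and NVIDIA hardware, so treat it as a fragment rather than something to run as-is):

```shell
#!/bin/bash
# Create the /dev/nvidia* device nodes when no X server does it for you
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
  # Count the NVIDIA controllers to know how many /dev/nvidiaN nodes to make
  N3D=$(lspci | grep -i nvidia | grep "3D controller" | wc -l)
  NVGA=$(lspci | grep -i nvidia | grep "VGA compatible" | wc -l)
  N=$((N3D + NVGA - 1))
  for i in $(seq 0 $N); do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done
  mknod -m 666 /dev/nvidiactl c 195 255
fi
```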

[slurm-dev] Re: Abaqus with Slurm

2017-02-09 Thread John Hearns
Sean, much thankyou. Guinness owed if I am ever in Temple Bar soon. -Original Message- From: Sean McGrath [mailto:smcg...@tchpc.tcd.ie] Sent: 09 February 2017 11:58 To: slurm-dev Subject: [slurm-dev] Re: Abaqus with Slurm Hi, We have slurm 16.05.4 and the latest

[slurm-dev] Re: Abaqus with Slurm

2017-02-09 Thread Sean McGrath
Hi, We have slurm 16.05.4 and the latest version of Abaqus we use is 6.14. I remember running into a similar problem with Abaqus so I wrote some bad bash to populate the host list file; http://www.tchpc.tcd.ie/node/1261 The github script seems to be doing something similar but in a better way.
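A minimal version of that host-list construction might look like this (a sketch, not Sean's actual script; on a real job you would fill `hosts` from `scontrol show hostnames "$SLURM_JOB_NODELIST"` rather than the stand-in values used here):

```shell
# Build the mp_host_list entry Abaqus expects in its environment file.
# Stand-in values for illustration; under Slurm, use:
#   hosts=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
hosts="node01 node02"
cpus_per_node=${SLURM_CPUS_ON_NODE:-16}

mp_host_list="["
for h in $hosts; do
    mp_host_list="${mp_host_list}['$h',$cpus_per_node],"
done
mp_host_list="${mp_host_list%,}]"      # strip trailing comma, close the list

echo "mp_host_list=$mp_host_list"
# prints: mp_host_list=[['node01',16],['node02',16]]
```

Appending that line to abaqus_v6.env in the job's working directory is the usual way to hand the host list to Abaqus.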

[slurm-dev] Slurmd daemon doesn't start

2017-02-09 Thread Daniel Ruiz Molina
Hi, In my GPU cluster, the slurmd daemon doesn't start correctly because, when the daemon starts, it doesn't find the /dev/nvidia[0-1] devices (mapped in gres.conf). To solve this problem, I have added the attribute "ExecStartPre=@/usr/bin/nvidia-smi >/dev/null" to the service file, and now the daemon starts
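A slightly more systemd-idiomatic way to express that workaround is a drop-in unit (sketch; the drop-in path and file name are assumptions):

```
# /etc/systemd/system/slurmd.service.d/nvidia.conf (hypothetical drop-in)
[Service]
# Run nvidia-smi first so the /dev/nvidia* nodes exist before slurmd
# reads gres.conf; the leading "-" tells systemd to ignore its exit status
ExecStartPre=-/usr/bin/nvidia-smi
```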

[slurm-dev] Abaqus with Slurm

2017-02-09 Thread John Hearns
I would guess quite a few sites are using Abaqus with Slurm. I would be grateful for some pointers on the submission scripts for MPI parallel Abaqus runs. I am setting up Abaqus version 6.14-1 on a system with Slurm 16.05 and an Omnipath interconnect. Specifically I am using this script to
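A skeleton of the kind of submission script under discussion (a sketch only; the module name, input file, and Abaqus invocation details are assumptions, not the script John is actually using):

```shell
#!/bin/bash
#SBATCH --job-name=abq_test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# Site-specific module name; an assumption
module load abaqus/6.14-1

# Abaqus's bundled MPI is known to clash with this Slurm variable
unset SLURM_GTIDS

abaqus job=abq_test input=model.inp \
       cpus=$SLURM_NTASKS mp_mode=mpi interactive
```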