Re: [slurm-users] NIC gres types and lack of device files?

2017-11-14 Thread Christopher Samuel
On 14/11/17 22:12, Geert Geurts wrote: > I have no experience with gres nic config, but can't you use /sys/class/net > instead of /dev? Unfortunately not, that lists the p1p1 and p1p2 devices but not the mlx5_0 and mlx5_1 names that Open-MPI needs to use. :-( -- Christopher

Re: [slurm-users] NIC gres types and lack of device files?

2017-11-14 Thread Christopher Samuel
e closest to the allocated sockets from my reading. cheers! Chris -- Christopher Samuel, Senior Systems Administrator, Melbourne Bioinformatics - The University of Melbourne. Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

Re: [slurm-users] slurm-dev Mailing list changes this weekend, slurm-dev will become slurm-users

2017-11-05 Thread Christopher Samuel
On 06/11/17 06:46, Tim Wickberg wrote: > Welcome to the re-named mailing list. Thanks Tim, very much appreciate your work on this! -- Christopher Samuel, Senior Systems Administrator, Melbourne Bioinformatics - The University of Melbourne. Email: sam...@unimelb.edu.au Phone: +61 (

Re: [slurm-users] Quick hold on all partitions, all jobs

2017-11-08 Thread Christopher Samuel
On 09/11/17 11:00, Lachlan Musicman wrote: > I've just discovered that the partitions have a state, and it can be set > to UP, DOWN, DRAIN or INACTIVE. DRAIN the partitions to stop new jobs running, then you can work on how you suspend running jobs (good luck with that!). -- Chris
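A sketch of that approach with scontrol; the partition name and job ID here are hypothetical, and suspending jobs assumes they tolerate being paused:

```
# Stop new jobs starting in a partition; running jobs keep going
scontrol update PartitionName=batch State=DRAIN

# Pause a running job, and later release it
scontrol suspend 12345
scontrol resume 12345

# Put the partition back when finished
scontrol update PartitionName=batch State=UP
```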

Re: [slurm-users] Accounting not recording jobs

2018-05-08 Thread Christopher Samuel
On 08/05/18 18:31, sysadmin.caos wrote: My last job appears in the file /var/log/slurm/accounting... but I don't understand why it appears there when I have configured accounting with "AccountingStorageType=accounting_storage/slurmdbd" What does: sacctmgr list clusters say on the machine where

Re: [slurm-users] Slurm-17.11.5 + Pmix-2.1.1/Debugging

2018-05-08 Thread Christopher Samuel
On 09/05/18 10:23, Bill Broadley wrote: It's possible of course that it's entirely an openmpi problem, I'll be investigating and posting there if I can't find a solution. One of the changes in OMPI 3.1.0 was: - Update PMIx to version 2.1.1. So I'm wondering if previous versions were falling

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Christopher Samuel
On 02/05/18 10:15, R. Paul Wiegand wrote: Yes, I am sure they are all the same. Typically, I just scontrol reconfig; however, I have also tried restarting all daemons. Understood. Any diagnostics in the slurmd logs when trying to start a GPU job on the node? We are moving to 7.4 in a few

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Christopher Samuel
On 02/05/18 09:00, Kevin Manalo wrote: Also, I recall appending this to the bottom of [cgroup_allowed_devices_file.conf] .. Same as yours ... /dev/nvidia* There was a SLURM bug issue that made this clear, not so much in the website docs. That shouldn't be necessary, all we have for this

Re: [slurm-users] GPU / cgroup challenges

2018-05-01 Thread Christopher Samuel
On 02/05/18 09:31, R. Paul Wiegand wrote: Slurm 17.11.0 on CentOS 7.1 That's quite old (on both fronts, RHEL 7.1 is from 2015), we started on that same Slurm release but didn't do the GPU cgroup stuff until a later version (17.11.3 on RHEL 7.4). I don't see anything in the NEWS file about

Re: [slurm-users] Slurm and available libraries

2018-01-17 Thread Christopher Samuel
On 18/01/18 02:53, Loris Bennett wrote: This is all very OT, so it might be better to discuss it on, say, the OpenHPC mailing list, since as far as I can tell Spack, EasyBuild and Lmod (but not old or new 'environment-modules') are part of OpenHPC. Another place might be the Beowulf list, all

Re: [slurm-users] Best practice: How much node memory to specify in slurm.conf?

2018-01-17 Thread Christopher Samuel
On 18/01/18 01:52, Paul Edmon wrote: We've been typically taking 4G off the top for memory in our slurm.conf for the system and other processes.  This seems to work pretty well. Where I was working previously we'd discount the memory by the amount of GPFS page cache too, plus a little for

Re: [slurm-users] Slurm not starting

2018-01-15 Thread Christopher Samuel
On 16/01/18 04:22, Elisabetta Falivene wrote: slurmd: debug2: _slurm_connect failed: Connection refused slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused This sounds like the compute node cannot connect back to slurmctld on the management node, you

Re: [slurm-users] Getting a runtime percentage of % allocated on cluster

2018-02-05 Thread Christopher Samuel
On 06/02/18 09:09, Kevin Manalo wrote: Just to help 'sreport cluster utilization' is close to what I am looking for. Does this help? # sreport -t percent cluster utilization cheers, Chris
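The command in the reply, extended with an explicit reporting period (the dates are placeholders):

```
# Percentage utilization of the whole cluster over January
sreport -t percent cluster utilization start=2018-01-01 end=2018-02-01
```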

Re: [slurm-users] MariaDB lock problems for sacctmgr delete query

2018-02-16 Thread Christopher Samuel
Hi Ole, On 16/02/18 22:23, Ole Holm Nielsen wrote: Question: Is it safer to wait for 17.11.4 where the issue will presumably be solved? I don't think the commit has been backported to 17.11.x to date. It's in master (for 18.08) here: commit 4a16541bf0e005e1984afd4201b97df482e269ee Author:

Re: [slurm-users] ntasks and cpus-per-task

2018-02-22 Thread Christopher Samuel
On 22/02/18 18:49, Miguel Gutiérrez Páez wrote: What's the real meaning of ntasks? Has cpus-per-task and ntasks the same meaning in sbatch and srun? --ntasks is for parallel distributed jobs, where you can run lots of independent processes that collaborate using some form of communication
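The distinction as sbatch directives, a minimal sketch:

```
# MPI-style: 16 separate tasks of one CPU each, free to span nodes
#SBATCH --ntasks=16

# Threaded-style: a single task given 16 CPUs, all on one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
```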

Re: [slurm-users] slurm and dates?

2018-02-24 Thread Christopher Samuel
On 24/02/18 05:51, Michael Di Domenico wrote: some of those variables are spit out as dates. since the dates do not include a timezone field how should that date field be assumed to work? from the value i conclude that it's my localtime, but is the date being stored as UTC and converted

Re: [slurm-users] ntasks and cpus-per-task

2018-02-23 Thread Christopher Samuel
On 23/02/18 21:50, Loris Bennett wrote: OK, I'm confused now. Our main culprit for producing processes with incorrect affinity is ORCA [1]. It uses OpenMPI but also likes to start processes asynchronously via SSH within the node set. In that case (and for the general case where there are

Re: [slurm-users] how can users start their worker daemons using srun?

2018-08-28 Thread Christopher Samuel
On 29/08/18 09:10, Priedhorsky, Reid wrote: This is surprising to me, as my interpretation is that the first run should allocate only one CPU, leaving 35 for the second srun, which also only needs one CPU and need not wait. Is this behavior expected? Am I missing something? That's odd - and

Re: [slurm-users] Determine usage for a QOS?

2018-08-19 Thread Christopher Samuel
Hi Paul, On 20/08/18 11:36, Paul Edmon wrote: I don't really have enough experience with QoS's to give a slicker method but you could use squeue --qos to poll the QoS and then write a wrapper to do the summarization.  It's hacky but it should work. I was thinking sacct -q ${QOS} to pull info
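One way to total usage along the lines Chris describes: pull per-job CPUTimeRAW (core-seconds) with sacct and sum it. The sacct output embedded below is a hypothetical sample so the pipeline is self-contained; on a real cluster you would feed it from something like `sacct -X -P -q ${QOS} -o JobID,CPUTimeRAW -S <start> -E <end>` instead.

```shell
# Hypothetical sample of `sacct -X -P -q premium -o JobID,CPUTimeRAW` output:
sacct_output='JobID|CPUTimeRAW
100|3600
101|7200'

# Sum core-seconds across jobs and report core-hours:
printf '%s\n' "$sacct_output" |
  awk -F'|' 'NR > 1 { total += $2 } END { printf "%.1f core-hours\n", total / 3600 }'
```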

[slurm-users] Determine usage for a QOS?

2018-08-19 Thread Christopher Samuel
Hi folks, After an extended hiatus (I forgot to resubscribe after going away for a few weeks) I'm back.. ;-) We are using QOS's for projects which have been granted a fixed set of time for higher priority work which works nicely, but have just been asked the obvious question "how much time do

Re: [slurm-users] mpi on multiple nodes

2018-03-13 Thread Christopher Samuel
On 14/03/18 06:30, Mahmood Naderan wrote: I expected to see one compute-0-0.local and one compute-0-1.local messages. Any idea about that? You've asked for 2 MPI ranks each using 1 CPU, and as you've got 2 cores on one node and 4 on the other, Slurm can fit both ranks onto one of your nodes so

Re: [slurm-users] Memory allocation error

2018-03-13 Thread Christopher Samuel
On 14/03/18 07:11, Mahmood Naderan wrote: Any idea about that? You've not requested any memory in your batch job and I guess your default limit is too low. To get the 1GB (and a little head room) try: #SBATCH --mem=1100M That's a per node limit, so for MPI jobs (which Gaussian is not)
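As batch-script directives the suggestion looks like this; the input file and Gaussian command name are hypothetical:

```
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=1100M    # per-node limit: 1 GB for the job plus a little head room
g16 input.com          # hypothetical Gaussian invocation
```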

Re: [slurm-users] Creating .deb packages on ubuntu 16.04 LTS

2018-04-03 Thread Christopher Samuel
On 03/04/18 19:24, Arie Blumenzweig wrote: I need to do some minor changes in the preinst and postinst scripts of the debian packages. You probably need to look at what Debian and Ubuntu do for their packaging, for instance Ubuntu has information about it here:

Re: [slurm-users] MaxSubmitJobsPerUser?

2018-04-08 Thread Christopher Samuel
On 08/04/18 02:32, Dmitri Chebotarov wrote: MaxSubmitJobsPerUser seems to work when the QOS in which it is defined is the only QOS assigned to the user. When multiple QOSes are assigned to the user's account, and only one QOS defines MaxSubmitJobsPerUser, the MaxSubmitJobsPerUser is

Re: [slurm-users] Two lines are printed by sacct

2018-04-11 Thread Christopher Samuel
On 12/04/18 04:00, Mahmood Naderan wrote: Hi, Hi Mahmood, I would like to know why the sacct command which I am using to get some reports shows two lines for each job. sacct reports one line per job step by default, not per job. If you add the 'JobName' field to your sacct
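For example (job ID hypothetical):

```
# One line per step; the batch step appears with JobName "batch":
sacct -j 12345 -o JobID,JobName,State,Elapsed,MaxRSS

# To collapse to one line per job, ask for allocations only:
sacct -j 12345 -X -o JobID,JobName,State,Elapsed
```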

Re: [slurm-users] 17.11+auks+cgroups: finished jobs hang in completing state

2018-03-26 Thread Christopher Samuel
On 26/03/18 20:50, Robbert Eggermont wrote: The suggest fix (use sigkill instead of sigterm in slurm_spank_auks to stop auks) seems to work (so far). Excellent, so glad to hear that! All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Re: [slurm-users] 17.11+auks+cgroups: finished jobs hang in completing state

2018-03-25 Thread Christopher Samuel
On 26/03/18 12:43, Robbert Eggermont wrote: Does this sound familiar to anyone? Does the slurmd log report it trying to kill the auks process? Also you might want to have a look at: https://bugs.schedmd.com/show_bug.cgi?id=4733 to see if that bug fits what you're seeing. Basically I get a

Re: [slurm-users] GrpTRES

2018-03-26 Thread Christopher Samuel
On 25/03/18 15:18, Mahmood Naderan wrote: Same as before Hmm, could you do "sacct -j 13" to see what account the job ran under? I can see you're in the "root" account too, which has no limits. cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-21 Thread Christopher Samuel
On 22/03/18 00:09, Ole Holm Nielsen wrote: Chris, I don't understand what you refer to as "that"? Someone must have created /etc/pam.d/slurm.* files, and it doesn't seem to be the Slurm RPMs. Sorry Ole, just meant that PAM automates reading those files for you if you create them (and the

Re: [slurm-users] srun not allowed in a partition

2018-03-21 Thread Christopher Samuel
On 22/03/18 01:43, sysadmin.caos wrote: I'm trying to compile SLURM-17.02.7 with "lua" support by executing "./configure && make && make contribs && make install", but make does nothing in src/plugins/job_submit/lua and I don't know why... How do I compile that plugin? The rest of the

Re: [slurm-users] SLURM on Ubuntu 16.04

2018-04-25 Thread Christopher Samuel
On 26/04/18 09:58, Christopher Samuel wrote: Most importantly you will want to be sure that they have backported the patch to close CVE-2018-7033 (fixed in 17.11.5). Went and found their sources, there is no mention of this being fixed in the proposed version, so it seems that bionic

Re: [slurm-users] SLURM on Ubuntu 16.04

2018-04-25 Thread Christopher Samuel
On 26/04/18 09:49, Eric F. Alemany wrote: I am going to follow your suggestion to install slurm via ubuntu 18.04 package. Just be aware that the version in bionic is outdated, it's 17.11.2. Most importantly you will want to be sure that they have backported the patch to close CVE-2018-7033

Re: [slurm-users] pam_slurm_adopt does not constrain memory?

2018-10-24 Thread Christopher Samuel
On 24/10/18 9:37 pm, Chris Samuel wrote: We're on 17.11.7 (for the moment, starting to plan upgrade to 18.08.x). From the NEWS file in 17.11.x (in this case for 17.11.10): -- Fix pam_slurm_adopt to honor action_adopt_failure. Could explain why this isn't something we see consistently, and

Re: [slurm-users] pam_slurm_adopt does not constrain memory?

2018-10-24 Thread Christopher Samuel
On 25/10/18 2:29 pm, Christopher Samuel wrote: Could explain why this isn't something we see consistently, and why we're both seeing it currently. This seems to be a handy way to find any processes that are not properly constrained by Slurm cgroups on compute nodes (at least in our

Re: [slurm-users] Accounting - running with 'wrong' account on cluster

2018-11-06 Thread Christopher Samuel
On 7/11/18 2:44 pm, Brian Andrus wrote: Ah just scontrol reconfigure doesn't actually make it take effect. Restarting slurmctld did it. Phew! Glad to hear that's sorted out.. :-) -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Re: [slurm-users] constraints question

2018-11-11 Thread Christopher Samuel
Hi Doug, On 12/11/18 8:34 am, Douglas Jacobsen wrote: I think you'll need to update to 18.08 to get this working, constraint arithmetic and knl were not compatible until that release. Thanks! That's planned for us today (though we're not using constraints) and from the sound of it Tina

Re: [slurm-users] new user simple question re sacct output line2

2018-11-14 Thread Christopher Samuel
On 15/11/18 12:38 am, Matthew Goulden wrote: sacct output including the default headers is three lines. What is line 2 documenting? Most fields are blank. Ah, well it can be more than 3 lines.. ;-) [csamuel@farnarkle2 tmp]$ sbatch --wrap hostname Submitted batch job 1740982 When I use sacct

Re: [slurm-users] How to check the percent cpu of a job?

2018-11-21 Thread Christopher Samuel
On 22/11/18 5:41 am, Ryan Novosielski wrote: You can see, both of the above are examples of jobs that have allocated CPU numbers that are very different from the ultimate CPU load (the first one using way more than allocated, though they’re in a cgroup so theoretically isolated from the other

Re: [slurm-users] About x11 support

2018-11-21 Thread Christopher Samuel
On 22/11/18 5:04 am, Mahmood Naderan wrote: The idea is to have a job manager that find the best node for a newly submitted job. If the user has to manually ssh to a node, why one should use slurm or any other thing? You are in a really really unusual situation - in 15 years I've not come

Re: [slurm-users] $TMPDIR does not honor "TmpFS"

2018-11-21 Thread Christopher Samuel
On 22/11/18 12:38 am, Douglas Duckworth wrote: We are setting TmpFS=/scratchLocal in /etc/slurm/slurm.conf on nodes and controller. However $TMPDIR value seems to be /tmp not /scratchLocal. As a result users are writing to /tmp which we do not want. Our solution to that was to use a plugin
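TmpFS only tells the scheduler how much temporary space a node offers; it does not point $TMPDIR there for jobs. A task prolog is one way to do that. A minimal sketch, assuming a TaskProlog=/etc/slurm/task_prolog.sh entry in slurm.conf and that the per-job directory is created elsewhere (e.g. in the node prolog); the path is hypothetical:

```shell
#!/bin/bash
# Task prolog: lines printed to stdout as "export NAME=VALUE" are injected
# into the environment of every task in the job.
echo "export TMPDIR=/scratchLocal/${SLURM_JOB_USER}/${SLURM_JOB_ID}"
```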

Re: [slurm-users] ubuntu 16.04 > 18.04

2018-09-13 Thread Christopher Samuel
On 13/09/18 03:44, A wrote: Thinking about upgrading to Ubuntu 18.04 on my workstation, where I am running a single-node slurm setup. Any issues anyone has run across in the upgrade? If you are using slurmdbd that's too large a jump, you'll need to upgrade to an intermediate version first.

Re: [slurm-users] Multinode MPI job

2019-03-27 Thread Christopher Samuel
On 3/27/19 8:07 AM, Prentice Bisbal wrote: sbatch -n 24 -w  Node1,Node2 That will allocate 24 cores (tasks, technically) to your job, and only use Node1 and Node2. You did not mention any memory requirements of your job, so I assumed memory is not an issue and didn't specify any in my

Re: [slurm-users] Multinode MPI job

2019-03-27 Thread Christopher Samuel
On 3/27/19 11:29 AM, Mahmood Naderan wrote: Thank you very much. you are right. I got it. Cool, good to hear. I'd love to hear whether you get heterogeneous MPI jobs working too! All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] Multinode MPI job

2019-03-27 Thread Christopher Samuel
On 3/27/19 8:39 AM, Mahmood Naderan wrote: mpirun pw.x -imos2.rlx.in You will need to read the documentation for this: https://slurm.schedmd.com/heterogeneous_jobs.html Especially note both of these: IMPORTANT: The ability to execute a single application across more

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Christopher Samuel
On 2/22/19 3:54 PM, Aaron Jackson wrote: Happy to answer any questions about our setup. If folks are interested in a mailing list where this discussion would be decidedly on-topic then I'm happy to add people to the Beowulf list where there's a lot of other folks with expertise in this

Re: [slurm-users] Slurm message aggregation

2019-03-05 Thread Christopher Samuel
On 3/5/19 6:58 AM, Paul Edmon wrote: We tried it once back when they first introduced it and shelved it after we found that we didn't really need it. Thanks Paul. -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] How to enable QOS correctly?

2019-03-05 Thread Christopher Samuel
On 3/5/19 7:37 AM, Matthew BETTINGER wrote: Every time we attempt this no one can submit a job, slurm says waiting on resources I believe. We have accounting enabled and everyone is a member of the default qos group "normal". Is it also their default QOS? Do you still have the slurmctld

[slurm-users] Slurm message aggregation

2019-03-04 Thread Christopher Samuel
Hi folks, Anyone here tried Slurm's message aggregation (MsgAggregationParams in slurm.conf) at all? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] seff: incorrect memory usage (18.08.5-2)

2019-03-04 Thread Christopher Samuel
On 2/26/19 5:49 AM, Marcus Wagner wrote: If I remember right, there was a discussion lately on this list regarding JobAcctGatherType, yet I do not remember the outcome. It used to be that SchedMD would strongly recommend the non-group way of gathering information, but that never really

Re: [slurm-users] pmix and ucx versions compatibility with slurm

2019-02-26 Thread Christopher Samuel
On 2/26/19 5:13 AM, Daniel Letai wrote: I couldn't find any documentation regarding which api from pmix or ucx Slurm is using, and how stable those api are. There is information about PMIx at least on the SchedMD website: https://slurm.schedmd.com/mpi_guide.html#pmix For UCX I'd suggest

Re: [slurm-users] What is the 2^32-1 values in "stepd_connect to .4294967295 failed" telling you

2019-03-08 Thread Christopher Samuel
On 3/8/19 12:25 AM, Kevin Buckley wrote: error: stepd_connect to .1 failed: No such file or directory error: stepd_connect to .4294967295 failed: No such file or directory We can imagine why a job that got killed in step 0 might still be looking for the .1 step but the .2^32-1 is beyond our

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-19 Thread Christopher Samuel
On 3/19/19 5:31 AM, Peter Steinbach wrote: For example, let's say I have a 4-core GPU node called gpu1. A non-GPU job $ sbatch --wrap="sleep 10 && hostname" -c 3 Can you share the output for "scontrol show job [that job id]" once you submit this please? Also please share "scontrol show

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-20 Thread Christopher Samuel
On 3/20/19 9:09 AM, Peter Steinbach wrote: Interesting enough, if I add Cores=0-1 and Cores=2-3 to the gres.conf file, everything stops working again. :/ Should I send around scontrol outputs? And yes, I watched out to set the --mem flag for the job submission this time. Well there you've

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-20 Thread Christopher Samuel
On 3/20/19 4:20 AM, Frava wrote: Hi Chris, thank you for the reply. The team that manages that cluster is not very fond of upgrading SLURM, which I understand. Do be aware that Slurm 17.11 will stop being maintained once 19.05 is released in May. So basically my heterogeneous job that

Re: [slurm-users] Very large job getting starved out

2019-03-21 Thread Christopher Samuel
On 3/21/19 6:55 AM, David Baker wrote: it currently one of the highest priority jobs in the batch partition queue What does squeue -j 359323 --start say? -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Christopher Samuel
On 3/21/19 9:21 AM, Loris Bennett wrote: Chris, maybe you should look at EasyBuild (https://easybuild.readthedocs.io/en/latest/). That way you can install all the dependencies (such as zlib) as modules and be pretty much independent of the ancient packages your distro may provide (other

Re: [slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

2019-03-22 Thread Christopher Samuel
On 3/21/19 3:43 PM, Prentice Bisbal wrote: #!/bin/tcsh Old school script debugging trick - make that line: #!/bin/tcsh -x and then you'll see everything the script is doing. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-22 Thread Christopher Samuel
On 3/22/19 10:31 AM, Prentice Bisbal wrote: Most HPC centers have scheduled downtime on a regular basis. That's not been my experience before now; where I've worked in Australia we scheduled maintenance windows only when we absolutely had to, but there could be delays to them if there were

Re: [slurm-users] Segmentation fault when launching mpi jobs using Intel MPI

2019-02-06 Thread Christopher Samuel
On 2/6/19 9:06 AM, Bob Smith wrote: Any ideas on what is going on? Any reason you're not using "srun" to launch your code? https://slurm.schedmd.com/mpi_guide.html All the best, Chris

Re: [slurm-users] Strange error, submission denied

2019-02-14 Thread Christopher Samuel
On 2/14/19 12:22 AM, Marcus Wagner wrote: CPUs=96 Boards=1 SocketsPerBoard=4 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=191905 That's different to what you put in your config in the original email though. There you had: CPUs=48 Sockets=4 CoresPerSocket=12 ThreadsPerCore=2 This config

Re: [slurm-users] Reservation with memory

2019-02-15 Thread Christopher Samuel
On 2/15/19 7:17 AM, Arnaud Renard URCA wrote: Does any of you have a solution to consider memory when creating a reservation ? I don't think memory is currently supported for reservations via TRES, it's certainly not listed in the manual page for scontrol either in 18.08 or in master (which

Re: [slurm-users] Slurm 18.08.5 slurmctl error messages

2019-01-31 Thread Christopher Samuel
On 1/31/19 8:12 AM, Christopher Benjamin Coffey wrote: This seems to be related to jobs that can't start due to in our case: AssocGrpMemRunMinutes, and AssocGrpCPURunMinutesLimit Must be a bug relating to GrpTRESRunLimit it seems. Do you mean can't start due to not enough time, or can't

Re: [slurm-users] disable-bindings disables counting of gres resources

2019-04-15 Thread Christopher Samuel
On 4/15/19 8:15 AM, Peter Steinbach wrote: We had a feeling that cgroups might be more optimal. Could you point us to documentation that suggests cgroups to be a requirement? Oh it's not a requirement, just that without it there's nothing to stop a process using GPUs outside of its

Re: [slurm-users] Scontrol update: invalid user id

2019-04-15 Thread Christopher Samuel
On 4/15/19 3:03 PM, Andy Riebs wrote: Run "slurmd -Dvv" as root on one of the compute nodes and it will show you what it thinks is the socket/core/thread configuration. In fact: slurmd -C will tell you what it discovers in a way that you can use in the configuration file. All the best,

Re: [slurm-users] How to apply for multiple GPU cards from different worker nodes?

2019-04-16 Thread Christopher Samuel
On 4/16/19 1:15 AM, Ran Du wrote: And another question: how do we request a number of GPU cards that isn't exactly divisible by 8? For example, to request 10 GPU cards as 8 cards on one node and 2 cards on another node? There are new features coming in 19.05 for GPUs to better support

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-11 Thread Christopher Samuel
On 4/11/19 8:27 AM, Randall Radmer wrote: I guess my next question is, are there any negative repercussions to setting "Delegate=yes" in slurmd.service? This was Slurm bug 5292 and was fixed last year: https://bugs.schedmd.com/show_bug.cgi?id=5292 # Commit cecb39ff087731d2 adds Delegate=yes

Re: [slurm-users] Issue with x11

2019-05-15 Thread Christopher Samuel
On 5/15/19 11:36 AM, Mahmood Naderan wrote: I really like to know why x11 is not so friendly? For example, slurm works with MPI. Why not with X11?! Because MPI support is fundamental, X11 support is nice to have. I suspect 19.05 will make your life an awful lot easier! All the best, Chris

Re: [slurm-users] Issue with x11

2019-05-15 Thread Christopher Samuel
On 5/15/19 7:32 AM, Tina Friedrich wrote: Hadn't yet read that far - I plan to test 19.05 soon anyway. Will report. Cool, Tim has ripped out all the libssh code (which caused me issues at ${JOB-1} because it didn't play nicely with SSH keep alive messages) and replaced it with native

Re: [slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05

2019-06-06 Thread Christopher Samuel
On 6/6/19 10:21 AM, Levi Morrison wrote: This means all OpenMPI programs that end up calling `srun` on Slurm 19.05 will fail. Sounds like a good reason to file a bug. We're not on 19.05 yet so we're not affected (yet) but this may cause us some pain when we get to that point (though at

Re: [slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05

2019-06-06 Thread Christopher Samuel
On 6/6/19 12:01 PM, Kilian Cavalotti wrote: Levi did already. Aha, race condition between searching bugzilla and writing the email. ;-) -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] status of cloud nodes

2019-06-19 Thread Christopher Samuel
On 6/18/19 11:29 PM, nathan norton wrote: Without knowing the internals of slurm it feels like nodes that are turned off+cloud state don't exist in the system until they are on? Not quite, they exist internally but are not exposed until in use:

Re: [slurm-users] ConstrainRAMSpace=yes and page cache?

2019-06-21 Thread Christopher Samuel
On 6/13/19 5:27 PM, Kilian Cavalotti wrote: I would take a look at the various *KmemSpace options in cgroups.conf, they can certainly help with this. Specifically I think you'll want: ConstrainKmemSpace=no to fix this. This happens for NFS and Lustre based systems, I don't think it's a
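The fix Chris describes would look like this in cgroup.conf; the other constraint lines are shown only for context and should match your site's existing settings:

```
# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainKmemSpace=no   # kmem accounting miscounts kernel allocations for NFS/Lustre I/O
```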

Re: [slurm-users] Issue with x11

2019-05-14 Thread Christopher Samuel
On 5/14/19 5:09 PM, Mahmood Naderan wrote: Should I modify that parameter on compute-0-0 too? No, but you'll need to logout of rocks7 and ssh back into it. Or are you on the console of the system itself? -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] Issue with x11

2019-05-16 Thread Christopher Samuel
On 5/16/19 1:04 AM, Alan Orth wrote: but now we get a handful of nodes drained every day with reason "Kill task failed". In ten years of using SLURM I've never had so many problems as I'm having now. :\ We see "kill task failed" issues but as Marcus says that's not related to X11 support,

Re: [slurm-users] Issue with x11

2019-05-16 Thread Christopher Samuel
On 5/16/19 8:53 AM, Mahmood Naderan wrote: Can I ask what is the expected release date for 19.05? It seems that rc1 has already been released in May? Sometime in May hopefully! -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Christopher Samuel
On 4/26/19 7:29 AM, Riebs, Andy wrote: In a separate test that I had missed, even "srun hostname" took 5 minutes to run. So there was no remote file system or MPI involvement. Worth trying: srun /bin/hostname Just in case there's something weird in the path that causes it to hit a network

Re: [slurm-users] Hide Filesystem From Slurm

2019-07-11 Thread Christopher Samuel
On 7/11/19 8:19 AM, Douglas Duckworth wrote: I am wondering if it's possible to hide a file system that's world-writable on the compute nodes logically within Slurm, so that no job a user runs can possibly access it. Essentially we define $TMPDIR as /scratch, which Slurm

Re: [slurm-users] AllocNodes on partition no longer working

2019-08-14 Thread Christopher Samuel
On 8/14/19 10:46 AM, Sajdak, Doris wrote: We upgraded from version 18.08.4 to 19.05.1-2 today and are suddenly getting a permission denied error on partitions where we have AllocNodes set.  If we remove the AllocNodes constraint, the job submits successfully but then users can submit from

Re: [slurm-users] Slurm 19.05 --workdir non existent?

2019-08-15 Thread Christopher Samuel
On 8/15/19 11:02 AM, Mark Hahn wrote: it's in NEWS, if that counts.  also, I note that at least in this commit, --chdir is added but --workdir is not removed from option parsing. It went away here: commit 9118a41e13c2dfb347c19b607bcce91dae70f8c6 Author: Tim Wickberg Date: Tue Mar 12

Re: [slurm-users] AllocNodes on partition no longer working

2019-08-15 Thread Christopher Samuel
On 8/15/19 7:18 AM, Sajdak, Doris wrote: Thanks Chris! That worked. We'd tried IP address but not FQDN. Great to hear! -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] 19.05 and GPUs vs GRES

2019-09-05 Thread Christopher Samuel
On 8/13/19 10:44 PM, Barbara Krašovec wrote: We still have the gres configuration, users have their workload scripted and some still use sbatch with gres. Both options work. I missed this before Barbara, sorry - that's really good to know that the options aren't mutually exclusive, thank

Re: [slurm-users] 19.05 and GPUs vs GRES

2019-09-05 Thread Christopher Samuel
On 9/5/19 3:49 PM, Bill Broadley wrote: I have a user with a particularly flexible code that would like to run a single MPI job across multiple nodes, some with 8 GPUs each, some with 2 GPUs. Perhaps they could just specify a number of tasks with cpus per task, mem per task and GPUs per

Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-10 Thread Christopher Samuel
On 9/4/19 9:40 AM, Sam Gallop (NBI) wrote: I did play around with XFS quotas on our large systems (SGI UV300, HPE MC990-X and Superdome Flex) but it couldn't get it working how I wanted (or how I thought it should work). I'll re-visit it knowing that other people have got XFS quotas working.

[slurm-users] How to trigger kernel stacktraces for stuck processes from unkillable steps

2019-09-18 Thread Christopher Samuel
Hi all, At the Slurm User Group I mentioned about how to tell the kernel to dump information about stuck processes from your unkillable step script to the kernel log buffer (seen via dmesg and hopefully syslog'd somewhere useful for you). echo w > /proc/sysrq-trigger That's it.. ;-) You
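A sketch of wiring that into Slurm's unkillable-step handling, assuming an UnkillableStepProgram entry in slurm.conf pointing at a script like the one below. The paths are hypothetical, it needs root (which slurmd has), and whether SLURM_JOB_ID is set in this context is an assumption:

```
#!/bin/bash
# Dump kernel stacks of all blocked (D-state) tasks to the ring buffer...
echo w > /proc/sysrq-trigger
# ...then capture the tail of the buffer for later inspection.
dmesg | tail -n 200 > "/var/log/slurm/unkillable-${SLURM_JOB_ID:-unknown}.log"
```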

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-15 Thread Christopher Samuel
On 9/15/19 4:17 PM, Brian Andrus wrote: Are steps required to capture Max RSS? No, you should see a MaxRSS reported for the batch step, for instance: $ sacct -j $JOBID -o jobid,jobname,maxrss All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] Using the manager as compute Node

2019-08-05 Thread Christopher Samuel
On 8/5/19 8:00 AM, wodel youchi wrote: Do I have to declare it, for example with 10 CPUs and 32Gb of RAM to save the rest for the management, or will slurmctld take that in hand? You will need both to declare it and also use cgroups to enforce it so that processes can't overrun that limit.
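A sketch of such a configuration, with hypothetical node name, partition and sizes; the cgroup.conf constraints (together with TaskPlugin=task/cgroup in slurm.conf) provide the enforcement:

```
# slurm.conf -- advertise only part of the management node to jobs
NodeName=headnode CPUs=10 RealMemory=32768
PartitionName=local Nodes=headnode Default=NO State=UP

# cgroup.conf -- enforce those limits on job processes
ConstrainCores=yes
ConstrainRAMSpace=yes
```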

Re: [slurm-users] pam_slurm_adopt and memory constraints?

2019-07-17 Thread Christopher Samuel
On 7/17/19 4:05 AM, Andy Georges wrote: Can you show what your /etc/pam.d/sshd looks like? For us it's actually here:
---
# cat /etc/pam.d/common-account
#%PAM-1.0
#
# This file is autogenerated by pam-config. All changes
# will be

Re: [slurm-users] Invalid qos specification

2019-07-15 Thread Christopher Samuel
On 7/15/19 11:22 AM, Prentice Bisbal wrote: $ salloc -p general -q debug  -t 00:30:00 salloc: error: Job submit/allocate failed: Invalid qos specification what does: scontrol show part general say? Also, does the user you're testing as have access to that QOS? All the best, Chris --

Re: [slurm-users] pam_slurm_adopt and memory constraints?

2019-07-15 Thread Christopher Samuel
On 7/12/19 6:21 AM, Juergen Salk wrote: I suppose this is nevertheless the expected behavior and just the way it is when using pam_slurm_adopt to restrict access to the compute nodes? Is that right? Or did I miss something obvious? Could it be a RHEL7 specific issue? It looks like it's

Re: [slurm-users] oom-kill events for no good reason

2019-11-07 Thread Christopher Samuel
On 11/7/19 8:36 AM, David Baker wrote: We are dealing with some weird issue on our shared nodes where job appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process. We do see a lot of these events. In fact when I started to take a closer

Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7

2019-11-13 Thread Christopher Samuel
On 11/13/19 10:42 AM, Ole Holm Nielsen wrote: Your order of upgrading is *disrecommended*, see for example page 6 in the presentation "Field Notes From A MadMan, Tim Wickberg, SchedMD" in the page https://slurm.schedmd.com/publications.html Also the documentation for upgrading here:

Re: [slurm-users] Array jobs vs. many jobs

2019-11-22 Thread Christopher Samuel
Hi Ryan, On 11/22/19 12:18 PM, Ryan Novosielski wrote: Quick question that I'm not sure how to find the answer to otherwise: do array jobs have less impact on the scheduler in any way than a whole long list of jobs run the more traditional way? Less startup overhead, anything like that?
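The scheduler-overhead saving is that an array is held as a single job record until individual tasks are split out to run. A minimal sketch of the array form (script and input names are placeholders):

```shell
#!/bin/bash
# Sketch: 1000 independent tasks submitted as ONE array job instead of
# 1000 separate sbatch calls. "%50" throttles to 50 tasks running at once.
# Script and input names are illustrative.
#SBATCH --array=1-1000%50
#SBATCH --time=01:00:00

./process_one_input input.${SLURM_ARRAY_TASK_ID}
```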

Re: [slurm-users] understanding resource reservations

2019-10-21 Thread Christopher Samuel
On 10/21/19 3:05 PM, c b wrote: 1) It looks like there's a way to create a daily recurring reservation by specifying "flags=daily" .  How would I make a regular reservation for weekdays only? flags=WEEKDAY Repeat the reservation at the same time on every weekday (Monday, Tuesday,
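A sketch of what that looks like with scontrol; the reservation name, times, users, and node list are illustrative:

```shell
# Hedged sketch: a reservation repeated every weekday (Mon-Fri).
# Name, start time, duration, users and nodes are placeholders.
scontrol create reservation reservationname=daytime_debug \
    starttime=08:00:00 duration=10:00:00 \
    users=root flags=WEEKDAY nodes=node[01-04]
```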

Re: [slurm-users] can't get fairshare to be calculated per partition

2019-10-29 Thread Christopher Samuel
On 10/29/19 12:42 PM, Igor Feghali wrote: fairshare is being calculated for the entire cluster and not per partition. That's correct - jobs can request multiple partitions (and will run in the first one available to service it). All the best, Chris -- Chris Samuel :

Re: [slurm-users] How to share GPU resources? (MPS or another way?)

2019-10-09 Thread Christopher Samuel
On 10/8/19 12:30 PM, Goetz, Patrick G wrote: It looks like GPU resources can only be shared by processes run by the same user? This is touched on in this bug https://bugs.schedmd.com/show_bug.cgi?id=7834 where it appears that at one point MPS worked for multiple users. It may be that

Re: [slurm-users] Removing user from slurm configuration

2019-10-11 Thread Christopher Samuel
On 10/10/19 8:53 AM, Marcus Wagner wrote: if you REALLY want to get rid of that user, you might need to manipulate the SQL Database. Yeah, I really don't think that would be a safe thing to do. -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-25 Thread Christopher Samuel
On 2/25/20 11:41 AM, Dean Schulze wrote: I'm very interested in the "configless" setup for slurm.  Is the setup for configless documented somewhere? Looks like the website has already been updated for the 20.02 documentation, and it looks like it's here:
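In outline, the 20.02 configless mode has two pieces: a controller-side parameter and a slurmd start-up option. A sketch with an illustrative controller hostname:

```
# Sketch of the 20.02 "configless" setup (hostname is a placeholder).
# slurm.conf on the controller:
SlurmctldParameters=enable_configless

# On each compute node, point slurmd at the controller instead of
# shipping a local slurm.conf:
slurmd --conf-server slurmctl-host:6817
```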

Re: [slurm-users] Slurm 19.05 X11-forwarding

2020-02-29 Thread Christopher Samuel
On 2/28/20 8:56 PM, Pär Lundö wrote: I thought that I could run the srun-command with X11-forwarding called from an sbatch-jobarray-script and get the X11-forwarding to my display. No, I believe X11 forwarding can only work when you run "srun --x11" directly on a login node, not from inside
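The working pattern, for contrast, is to request forwarding directly from a login session that already has an X connection; it does not propagate into steps launched from inside an sbatch script:

```shell
# Run on a login node with X available (e.g. after "ssh -X cluster"):
srun --x11 --pty xterm
```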

Re: [slurm-users] Block interactive shell sessions

2020-03-05 Thread Christopher Samuel
On 3/5/20 9:22 AM, Luis Huang wrote: We would like to block certain nodes from accepting interactive jobs. Is this possible on slurm? My suggestion would be to make a partition for interactive jobs that only contains the nodes that you want to run them and then use the submit filter to
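A sketch of the partition side of that suggestion (node names and limits are placeholders); a job_submit plugin would then route interactive jobs, i.e. those without a batch script, into the dedicated partition and keep them off the others:

```
# slurm.conf sketch: a dedicated partition for interactive work.
# Node ranges and time limits are illustrative.
PartitionName=interactive Nodes=node[01-02] MaxTime=08:00:00 State=UP
PartitionName=batch Nodes=node[03-64] MaxTime=7-00:00:00 State=UP
```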

Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Christopher Samuel
On 2/27/20 11:23 AM, Robert Kudyba wrote: OK so does SLURM support MPS and if so what version? Would we need to enable cons_tres and use, e.g., --mem-per-gpu? Slurm 19.05 (and later) supports MPS - here's the docs from the most recent release of 19.05:
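For reference, the MPS gres in 19.05+ is configured roughly as below; the node name, device file, and counts are illustrative (per the Slurm MPS docs, the count is split as percentage shares of a GPU):

```
# gres.conf sketch on a GPU node: expose MPS as a consumable gres,
# where 100 units represent the whole of GPU 0. Values are placeholders.
Name=mps Count=100 File=/dev/nvidia0

# slurm.conf entry for that node:
NodeName=gpu01 Gres=gpu:1,mps:100

# A job then requests half of the GPU's MPS share:
sbatch --gres=mps:50 job.sh
```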
