Re: [slurm-users] PMIx and Slurm

2017-11-28 Thread Paul Edmon
then to let PMIx handle pmix solely and let slurm handle the rest.  Thanks! Am I right in reading that you don't have to build slurm against PMIx?  So it just interoperates with it fine if you just have it installed and specify pmix as the launch option?  That's neat. -Paul Edmon- On 11/28/2017 6

[slurm-users] PMIx and Slurm

2017-11-28 Thread Paul Edmon
is the right way of building PMIx and Slurm such that they interoperate properly? Suffice it to say little to no documentation exists on how to properly do this, so any guidance would be much appreciated. -Paul Edmon-

Re: [slurm-users] Intermittent "Not responding" status

2017-12-04 Thread Paul Edmon
is substantial, thus the lag crossing back and forth can add up. I would check to see if all your nodes can talk to each other and the master and if your Timeouts are set high enough. -Paul Edmon- On 12/04/2017 01:57 PM, Stradling, Alden Reid (ars9ac) wrote: I have a number of nodes that have, after our

Re: [slurm-users] x11 for interactive jobs

2018-05-14 Thread Paul Edmon
There is a spank x11 plugin that I think pretty much everyone used: https://github.com/hautreux/slurm-spank-x11 -Paul Edmon- On 05/14/2018 02:44 PM, Mahmood Naderan wrote: Hi, I see --x11 option in [1], but there isn't any such option. Is that for old versions? Also, there is a wrapper [2

Re: [slurm-users] Slurm Installation on different Unix environment

2018-05-10 Thread Paul Edmon
Assuming you can build slurm and its dependencies this should work.  We've run slurm here with different OS's on various nodes for a while and it works fine.  That said I haven't tried odroids so I can't speak specifically to that. -Paul Edmon- On 05/10/2018 08:26 AM, agostino bruno wrote

Re: [slurm-users] How to access environment variables in submit script?

2018-05-10 Thread Paul Edmon
Not that I am aware of.  Since the header isn't really part of the script bash doesn't evaluate them as far as I know. -Paul Edmon- On 05/10/2018 09:19 AM, Dmitri Chebotarov wrote: Hello Is it possible to access environment variables in a submit script? F.e. $SCRATCH is set to a path and I

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Paul Edmon
to limit usage. -Paul Edmon- On 05/08/2018 10:08 AM, Renfro, Michael wrote: That’s the first limit I placed on our cluster, and it has generally worked out well (never used a job limit). A single account can get 1000 CPU-days in whatever distribution they want. I’ve just added a root-only

Re: [slurm-users] Restart slurmctld

2018-06-05 Thread Paul Edmon
If you are in SystemD land the command is: systemctl restart slurmctld -Paul Edmon- On 06/05/2018 06:00 AM, Mahmood Naderan wrote: Yes Yes/No :) Regards, Mahmood On Tue, Jun 5, 2018 at 2:18 PM, Buckley, Ronan <mailto:ronan.buck...@dell.com>> wrote: Hi All, I need t

Re: [slurm-users] Multiple job constraints

2018-06-20 Thread Paul Edmon
You will get whatever cores Slurm can find which will be an assortment of hosts. -Paul Edmon- On 6/20/2018 11:01 AM, Nathan Harper wrote: sorry to hijack, but we've been considering a similar configuration, but I was wondering what happens if you don't set a processor type

Re: [slurm-users] Jobs in pending state

2018-04-29 Thread Paul Edmon
It sounds like your second partition is getting primarily scheduled by the backfill scheduler.  I would try the partition_job_depth option as otherwise the main loop only looks at priority order and not by partition. -Paul Edmon- On 4/29/2018 5:32 AM, Zohar Roe MLM wrote: Hello. I am having

Re: [slurm-users] Jobs blocking scheduling progress

2018-07-03 Thread Paul Edmon
jobs can't run due to some vagary in logic (typically because it thinks that it won't fit due to time constraints). Anyways that's where I would start. -Paul Edmon- On 7/3/2018 5:22 PM, Christopher Benjamin Coffey wrote: Hello! We are having an issue with high priority gpu jobs blocking

Re: [slurm-users] restrict application to a given partition

2018-01-15 Thread Paul Edmon
script doesn't catch it. -Paul Edmon- On 1/15/2018 8:31 AM, John Hearns wrote: Juan, my knee-jerk reaction is to say 'containerisation' here. However I guess that means that Slurm would have to be able to inspect the contents of a container, and I do not think that is possible. I may be very

Re: [slurm-users] ntasks and cpus-per-task

2018-02-22 Thread Paul Edmon
Yeah, in those situations I've found it best to have people wrap their threaded programs in srun inside of sbatch.  That way the scheduler knows which process specifically gets the threading. -Paul Edmon- On 02/22/2018 10:39 AM, Loris Bennett wrote: Hi Paul, Paul Edmon <ped...@cfa.harvard.
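For illustration only, here is a minimal sketch of what wrapping a threaded program in srun inside an sbatch script might look like. The function name `submit_threaded_job`, the program name `./my_threaded_program`, and the choice of 8 CPUs are hypothetical and not taken from the thread; the script is fed to sbatch on standard input.

```shell
# Hypothetical helper: submit a one-task job whose single task gets 8 CPUs.
# ./my_threaded_program is a placeholder for the real threaded binary.
submit_threaded_job() {
    sbatch <<'EOF'
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# Match the thread count to the CPUs Slurm allocated to the task.
export OMP_NUM_THREADS="$SLURM_CPUS_PER_TASK"
# Launch through srun so the step, not just the batch shell, owns the CPUs.
srun --cpus-per-task="$SLURM_CPUS_PER_TASK" ./my_threaded_program
EOF
}
```

The quoted heredoc keeps the variables unexpanded at submission time, so they are evaluated inside the job where Slurm has set them.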

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-02-22 Thread Paul Edmon
though so perhaps we avoided that particular query due to that. From past experience these major upgrades can take quite a bit of time as they typically change a lot about the DB structure in between major versions. -Paul Edmon- On 02/22/2018 06:17 AM, Malte Thoma wrote: FYI: * We broke our

Re: [slurm-users] Changing resource limits while running jobs

2018-01-04 Thread Paul Edmon
Typically changes like this only impact pending or newly submitted jobs.  Running jobs usually are not impacted, though they will count against any new restrictions that you put in place. -Paul Edmon- On 1/4/2018 6:44 AM, Juan A. Cordero Varelaq wrote: Hi, A couple of jobs have been

Re: [slurm-users] restart slurmd on nodes w/ running jobs?

2018-07-27 Thread Paul Edmon
Restarting slurmd should be fine assuming they come back before the communications time out.  I restart slurmd's all the time and haven't had any real problems. -Paul Edmon- On 7/27/2018 6:51 PM, Chris Harwell wrote: It is possible, but double check your config for timeouts first. On Fri

Re: [slurm-users] submit from node w/ different OS?

2018-07-26 Thread Paul Edmon
Generally it is best that they should be.  Slurm maps the user's environment into the job submission.  So if things change in the OS under it, it can lead to issues. -Paul Edmon- On 07/26/2018 12:39 PM, Liam Forbes wrote: Morning All. I'm attempting to set up a new submit host

Re: [slurm-users] slurm does not pass mca params to openmpi?

2018-07-19 Thread Paul Edmon
So the recommendation I've gotten in the past is to use option number 4 from this FAQ: https://www.open-mpi.org/faq/?category=tuning#setting-mca-params This works for both mpirun and srun in slurm because it's a flat file that is read rather than options that are passed in. -Paul Edmon- On 07

Re: [slurm-users] Recovering from network failures in Slurm (without killing or restarting active jobs)

2018-08-31 Thread Paul Edmon
So there are different options you can set for Return to Service in the slurm.conf which can affect how the node is handled on reconnect.  You can also up the timeouts for the daemons. -Paul Edmon- On 8/31/2018 5:06 PM, Renfro, Michael wrote: Hey, folks. I’ve got a Slurm 17.02 cluster (RPMs

Re: [slurm-users] Time-based partitions

2018-03-12 Thread Paul Edmon
You could probably accomplish this using a job submit lua script and some crafted QoS's.  It would take some doing but I imagine it could work. -Paul Edmon- On 03/12/2018 02:46 PM, Keith Ball wrote: Hi All, We are looking to have time-based partitions; e.g. a "day" and "ni

Re: [slurm-users] Job still running after process completed

2018-04-23 Thread Paul Edmon
I would recommend putting a clean up process in your epilog script.  We have a check here that sees if the job completed and if so it then terminates all the user processes by kill -9 to clean up any residuals. If it fails it closes off the node so we can reboot it. -Paul Edmon- On 04/23

Re: [slurm-users] Position in queue?

2018-10-05 Thread Paul Edmon
So if you use the showq utility it has functionality for that: https://github.com/fasrc/slurm_showq Happy to have contributors to this. -Paul Edmon- On 10/05/2018 09:56 AM, Alexandre Strube wrote: Is there a way to show the actual position in the queue, given the current priority? It’s

Re: [slurm-users] changing PriorityDecayHalfLife has no impact on stored accounting data

2018-10-16 Thread Paul Edmon
I'm not aware of one.  This may be worth a feature request to the devs at bugs.schedmd.com -Paul Edmon- On 10/16/18 7:29 AM, Antony Cleave wrote: Hi All Yes, I realise this is almost certainly the intended outcome. I have wondered this for a long time but only recently got round to testing

Re: [slurm-users] Spreading jobs across servers instead of loading up individual nodes

2018-11-15 Thread Paul Edmon
in parallel jobs being distributed across many nodes. Note that node *Weight* takes precedence over how many idle resources are on each node. Also see the *SelectParameters* configuration parameter *CR_LLN* to use the least loaded nodes in every partition. -Paul Edmon- On 11/15/2018 4:25 AM

Re: [slurm-users] srun problem -- Can't find an address, check slurm.conf

2018-11-07 Thread Paul Edmon
into the SchedMD guys to see if they have any more insight.  Then again some one on this list might have seen the same issue. -Paul Edmon- On 11/7/18 10:20 AM, Scott Hazelhurst wrote: Thanks, Paul, yes, it does seem a likely cause, but I can’t see the problem. All machines have the same /etc/hosts file

Re: [slurm-users] Upgrading a slurm on a cluster, 17.02 --> 18.08

2018-10-01 Thread Paul Edmon
rare though that we need to look back at that data. -Paul Edmon- On 10/01/2018 08:12 AM, Chris Samuel wrote: On Saturday, 29 September 2018 1:18:24 AM AEST Ole Holm Nielsen wrote: Does anyone have a good explanation of usage of the Archive and Purge features for the Slurm database

Re: [slurm-users] Upgrading a slurm on a cluster, 17.02 --> 18.08

2018-09-25 Thread Paul Edmon
restarting the service it times out and the database only gets partially update.  In which case I had to restore from the mysqldump I had made and tried again.  Also highly recommend doing mysqldumps prior to major version updates. -Paul Edmon- On 09/25/2018 09:54 AM, Baker D.J. wrote

Re: [slurm-users] CPU & memory usage summary for a job

2018-12-09 Thread Paul Edmon
This is the idea behind XDMod's SUPReMM.  It does generate a ton of data though, so it does not scale to very active systems (i.e. churning over tens of thousands of jobs). https://github.com/ubccr/xdmod-supremm -Paul Edmon- On 12/9/2018 8:39 AM, Aravindh Sampathkumar wrote: Hi All. I

Re: [slurm-users] Disabling --nodelist

2018-11-27 Thread Paul Edmon
Your best bet is a LUA job submission script to strip these options from the submissions. -Paul Edmon- On 11/27/18 11:48 AM, Aaron Jackson wrote: Hi all, I am wondering if it is possible to disable the use of the --nodelist argument from srun/sbatch/salloc/etc? In the worst case I can just

Re: [slurm-users] GPU gres error for 1 of 3 GPU types

2019-01-11 Thread Paul Edmon
I'm pretty sure that gres.conf has to be on all the nodes as well and not just the master. -Paul Edmon- On 1/11/19 5:21 AM, Sean McGrath wrote: Hi everyone, Your help for this would be much appreciated please. We have a cluster with 3 types of gpu configured in gres. Users can successfully

Re: [slurm-users] Create users

2018-09-12 Thread Paul Edmon
users and then map that in to Slurm using sacctmgr. It really depends on if your Slurm users are a subset of your regular users or not. -Paul Edmon- On 9/12/2018 12:21 PM, Andre Torres wrote: Hi all, I’m new to slurm and I’m confused regarding user creation. I have an installation

Re: [slurm-users] Create users

2018-09-13 Thread Paul Edmon
So the Lua script I posted only does it for people who submit to the cluster.  To do it for all users it should just be a simple bash script to do that, I don't have one put together though. -Paul Edmon- On 09/13/2018 10:29 AM, Eric F. Alemany wrote: Hi Paul You said “Another way would

Re: [slurm-users] Create users

2018-09-13 Thread Paul Edmon
Sure. Here is our lua script. -Paul Edmon- On 09/13/2018 07:28 AM, Andre Torres wrote: That's interesting using AD to maintain uid consistency across all the nodes. Like Loris, I'm also interested in your Lua script. - André On 13/09/2018, 11:42, "slurm-users on behalf of Loris Be

Re: [slurm-users] Create users

2018-09-13 Thread Paul Edmon
only add them if they don't already exist so the impact is only when new users appear. -Paul Edmon- On 09/13/2018 10:48 AM, Douglas Jacobsen wrote: At one point in time we would also use the job_submit.lua to add users, however, I cannot recommend it in general since job_submit runs while

Re: [slurm-users] Create users

2018-09-13 Thread Paul Edmon
Users can control that: https://slurm.schedmd.com/sbatch.html -Paul Edmon- On 09/13/2018 11:10 AM, Ariel Balter wrote: Does anyone know how to change email settings? On 9/13/2018 7:59 AM, Damien François wrote: Just to add my 2c to the discussion: at our site, we use a utility we wrote

Re: [slurm-users] slurmdbd purge not working

2019-04-04 Thread Paul Edmon
several smaller purges.  That at least worked for us in the past. -Paul Edmon- On 4/4/19 9:38 AM, Julien Rey wrote: Hello, Our slurm accounting database is growing bigger and bigger (more than 100Gb) and is never being purged. We are running slurm 15.08.0-0pre1. I would like to upgrade

Re: [slurm-users] slurmdbd purge not working

2019-04-05 Thread Paul Edmon
ust to see if there is any database work that was done. -Paul Edmon- On 4/5/19 9:05 AM, Julien Rey wrote: Hi Paul, thanks for your advice. Actually I already tried what you suggested. No matter what value do I put after PurgeJobAfter I always end up with the same error: sacctmgr archive dump Direc

Re: [slurm-users] Migrate the slurmdbd service to another server

2019-03-04 Thread Paul Edmon
a downtime for the dbd upgrade.  That's not too bad though as we pause all our jobs out of paranoia for upgrades. -Paul Edmon- On 3/1/19 8:10 AM, Ole Holm Nielsen wrote: We're one of the many Slurm sites which run the slurmdbd database daemon on the same server as the slurmctld daemon

Re: [slurm-users] sacct end time for failed jobs

2019-03-06 Thread Paul Edmon
A lot of this is automated in the new versions of slurm.  You should just need to run: sacctmgr show runawayjobs It will then give you an option to clean them and slurm will handle the rest.  If you add the -i option it will just clean them automatically. -Paul Edmon- On 3/6/2019 11:58 AM

Re: [slurm-users] sacct end time for failed jobs

2019-03-06 Thread Paul Edmon
Odds are the new version won't help for that.  You will have to do some mysql work to fix it then. -Paul Edmon- On 3/6/2019 1:23 PM, Brian Andrus wrote: I am running the latest and did that, but it didn't change anything. The jobs stay in the runaway state and no changes are made

Re: [slurm-users] Slurm message aggregation

2019-03-05 Thread Paul Edmon
We tried it once back when they first introduced it and shelved it after we found that we didn't really need it. -Paul Edmon- On 3/4/19 2:26 PM, Christopher Samuel wrote: Hi folks, Anyone here tried Slurm's message aggregation (MsgAggregationParams in slurm.conf) at all? All the best

Re: [slurm-users] How do I impose a limit the memory requested by a job?

2019-03-14 Thread Paul Edmon
Exactly.  The easiest way is just to underreport the amount of memory in slurm.  That way slurm will take care of it natively. We do this here as well even though we have disks in order to make sure the OS has memory left to run. -Paul Edmon- On 3/14/19 8:36 AM, Doug Meyer wrote: We also run

Re: [slurm-users] How do I impose a limit the memory requested by a job?

2019-03-12 Thread Paul Edmon
lua script.  That would be my recommended method. -Paul Edmon- On 3/12/19 12:31 PM, David Baker wrote: Hello, I have set up a serial queue to run small jobs in the cluster. Actually, I route jobs to this queue using the job_submit.lua script. Any 1 node job using up to 20 cpus is routed

Re: [slurm-users] service slurmctld restart

2019-01-31 Thread Paul Edmon
No.  Jobs should continue as normal. -Paul Edmon- On 1/31/19 9:38 AM, Buckley, Ronan wrote: Hi, Does restarting the slurmctld daemon on a slurm head node affect running slurm jobs on the compute nodes in any way? Rgds

Re: [slurm-users] Increase MaxJobCount in slurm.conf

2019-01-31 Thread Paul Edmon
Nope per the documentation you have to restart the slurmctld to change MaxJobCount. -Paul Edmon- On 1/31/19 5:58 AM, Buckley, Ronan wrote: Hi, I want to increase the MaxJobCount in the slurm.conf file from its default value of 10,000. I want to increase it to 250,000. The online

Re: [slurm-users] Increase MaxArraySize in slurm.conf

2019-01-29 Thread Paul Edmon
That should be it.  It shouldn't impact running jobs. -Paul Edmon- On 1/29/19 5:47 AM, Buckley, Ronan wrote: Hi, I want to increase the MaxArraySize in the slurm.conf file from its default value of 1001. I want to increase it to 1. Is it a case of just adding “MaxArraySize=1

Re: [slurm-users] Slurm Fairshare / Multifactor Priority

2019-05-29 Thread Paul Edmon
For reference we are running 18.08.7 -Paul Edmon- On 5/29/19 10:39 AM, Paul Edmon wrote: Sure.  Here is what we have: ## Scheduling # ### This section is specific to scheduling ### Tells the scheduler to enforce limits for all

Re: [slurm-users] Slurm Fairshare / Multifactor Priority

2019-05-29 Thread Paul Edmon
PriorityWeightQOS=10 I'm happy to chat about any of the settings if you want, or share our full config. -Paul Edmon- On 5/29/19 10:17 AM, Julius, Chad wrote: All, We rushed our Slurm install due to a short timeframe and missed some important items.  We are now looking to implement a better system

Re: [slurm-users] Slurm Fairshare / Multifactor Priority

2019-05-29 Thread Paul Edmon
took too long to clean up thus those jobs took forever to schedule. With the various improvements to the scheduler this may no longer be the case, but I haven't taken the time to test it on our cluster as our current set up has worked well. -Paul Edmon- On 5/29/19 11:04 AM, Kilian Cavalotti

Re: [slurm-users] Node weight / Job Preemption

2019-05-29 Thread Paul Edmon
/partition_prio or preempt/qos plugins.) In general slurm will try not to preempt if it can avoid it. These options can help to guide that a bit more intelligently. -Paul Edmon- On 5/29/19 8:53 AM, Mike Harvey wrote: I am relatively new to SLURM, and am having difficulty configuring our

Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-19 Thread Paul Edmon
for resource usage.  It has worked pretty well for our purposes. -Paul Edmon- On 6/19/19 3:30 PM, Fulcomer, Samuel wrote: (...and yes, the name is inspired by a certain OEM's software licensing schemes...) At Brown we run a ~400 node cluster containing nodes of multiple architectures

Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-20 Thread Paul Edmon
then they have to build their own stack. -Paul Edmon- On 6/20/19 11:07 AM, Fulcomer, Samuel wrote: ...ah, got it. I was confused by "PI/Lab nodes" in your partition list. Our QoS/account pair for each investigator condo is our approximate equivalent of what you're doing with owned partition

Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-20 Thread Paul Edmon
been having a hard enough time understanding our current system. It's not due to its complexity but more that most people just flat out aren't cognizant of their usage and think the resource is functionally infinite. -Paul Edmon- On 6/19/19 5:16 PM, Fulcomer, Samuel wrote: Hi Paul, Thanks

Re: [slurm-users] Proposal for new TRES - "Processor Performance Units"....

2019-06-20 Thread Paul Edmon
I don't know off hand.  You can sort of construct a similar system in Slurm, but I've never seen it as a native option. -Paul Edmon- On 6/20/19 10:32 AM, John Hearns wrote: Paul, you refer to banking resources. Which leads me to ask are schemes such as Gold used these days in Slurm? Gold

Re: [slurm-users] scavenger partition/qos

2019-07-09 Thread Paul Edmon
have about using suspend is that while the job is suspended, the memory that job was using is still allocated.  Thus that may be why your jobs are not moving immediately as Slurm will still consider the memory space allocated though the CPU is now free. -Paul Edmon- On 7/8/19 6:03 PM, Hanu

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Paul Edmon
as in one submission it will generate thousands of jobs which then the scheduler can handle sensibly. So I highly recommend using job arrays. -Paul Edmon- On 8/27/19 3:45 AM, Guillaume Perrault Archambault wrote: Hi Paul, Thanks a lot for your suggestion. The cluster I'm using has thousands
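As a rough sketch of the job-array approach recommended here: a single sbatch call with `--array` creates many tasks, and the `%N` suffix caps how many run at once. The function name `submit_array` and its arguments are hypothetical; inside the batch script each task would typically pick its work item from `$SLURM_ARRAY_TASK_ID`. The largest usable index is bounded by the cluster's MaxArraySize setting.

```shell
# Hypothetical helper: submit SCRIPT as an array of NTASKS tasks,
# with at most MAX_RUNNING of them running at the same time.
submit_array() {
    script="$1"
    ntasks="$2"
    max_running="$3"
    sbatch --array="1-${ntasks}%${max_running}" "$script"
}
```

For example, `submit_array work.sh 1000 50` would ask for tasks 1 through 1000 with no more than 50 running simultaneously.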

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Paul Edmon
A QoS is probably your best bet.  Another variant might be MCS, which you can use to help reduce resource fragmentation.  For limits though QoS will be your best bet. -Paul Edmon- On 8/30/19 7:33 AM, Steven Dick wrote: It would still be possible to use job arrays in this situation, it's just

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-30 Thread Paul Edmon
Yes, QoS's are dynamic. -Paul Edmon- On 8/30/19 2:58 PM, Guillaume Perrault Archambault wrote: Hi Paul, Thanks for your pointers. I'll look into QOS and MCS after my paper deadline (Sept 5). Re QOS, as expressed to Peter in the reply I just now sent, I wonder if the QOS of a job can

Re: [slurm-users] Slurm statesave directory -- location and management

2019-08-28 Thread Paul Edmon
for it at that point. -Paul Edmon- On 8/28/19 10:49 AM, David Baker wrote: Hello, I apologise that this email is a bit vague, however we are keen to understand the role of the Slurm "StateSave" location. I can see the value of the information in this location when, for example, we are upgra

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-26 Thread Paul Edmon
We've hit this before due to RPC saturation.  I highly recommend using max_rpc_cnt and/or defer for scheduling.  That should help alleviate this problem. -Paul Edmon- On 8/26/19 2:12 AM, Guillaume Perrault Archambault wrote: Hello, I wrote a regression-testing toolkit to manage large

Re: [slurm-users] MPI jobs via mirun vs. srun through PMIx.

2019-09-17 Thread Paul Edmon
re tightly with the scheduler.  Sometimes for older versions of MPI they need to use mpirun but by and large our community uses srun for the above reasons.  It's the more native slurm way of doing things with MPI. -Paul Edmon- On 9/17/19 4:12 AM, Marcus Wagner wrote: Hi Jürgen, we set in our modules the

Re: [slurm-users] Sharing a single machine between two groups; What's the best way define this in slurm config?

2019-09-19 Thread Paul Edmon
Probably your best bet is to use QoS's to accomplish this.  Be advised that suspending jobs still leaves them in memory space. -Paul Edmon- On 9/18/19 9:16 PM, Benjamin Wong wrote: Hello, I plan to purchase a GPU machine with 8 GPUs which will be shared between group A and group B.  Group

Re: [slurm-users] Store sstat information permanently on job completion?

2019-10-30 Thread Paul Edmon
All the aggregate historic data should be accessible via sacct. sstat is for live jobs but sacct is for completed jobs. -Paul Edmon- On 10/30/2019 2:13 PM, Jacob Chappell wrote: Is there a simple way to store sstat information permanently on job completion? We already have job accounting
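A minimal sketch of pulling the stored accounting record for a finished job with sacct, assuming job accounting is enabled as the original poster describes. The function name `job_summary` and the particular field list are illustrative choices, not something specified in the thread.

```shell
# Hypothetical helper: print pipe-delimited accounting fields for one job ID.
# --parsable2 avoids truncated columns and omits the trailing delimiter.
job_summary() {
    sacct -j "$1" --parsable2 \
          --format=JobID,JobName,State,Elapsed,TotalCPU,MaxRSS
}
```

Calling `job_summary 12345` would list the job and its steps; `sacct --helpformat` shows the other fields that can be requested.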

Re: [slurm-users] SLURM with OpenMPI

2019-12-15 Thread Paul Edmon
Yes they should be. -Paul Edmon- On 12/15/2019 10:28 AM, Raymond Muno wrote: We are new to SLURM, migrating over from SGE. When launching OpenMPI jobs (version 4.0.2 in this case) via srun, are the MCA parameters followed when they are set via environmental variables, e.g. OMPI_MCA_param

Re: [slurm-users] Lua jobsubmit plugin for cons_tres ?

2019-12-11 Thread Paul Edmon
We do this via looking at gres.  The info is in the job_desc.gres variable.  We basically do the inverse where we ensure some one is asking for the gpu before allowing them to submit to a gpu partition. -Paul Edmon- On 12/11/2019 12:32 PM, Grigory Shamov wrote: Hi All, I am trying

Re: [slurm-users] Preemption Priority

2019-10-25 Thread Paul Edmon
preempt/partition_prio or preempt/qos plugins.) -Paul Edmon- On 10/25/19 7:21 AM, Oytun Peksel wrote: Hi, Let’s say I have two partitions assigned to the same single load in the cluster. LowPrio with PreemptMode=suspend Priority=1 HighPrio with PreemptMode=off Priority=5 I have 4 identical

Re: [slurm-users] Statistics on node utilization?

2019-10-17 Thread Paul Edmon
We have been using: https://github.com/fasrc/slurm-diamond-collector For our set up.  Though it gives more of an over all look.  We also use this: https://github.com/fasrc/lsload -Paul Edmon- On 10/16/19 4:53 PM, Will Dennis wrote: Hi all, We run a few Slurm clusters here, all using

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2019-10-22 Thread Paul Edmon
It can also happen if you have a stalled out filesystem or stuck processes.  I've gotten in the habit of doing a daily patrol for them to clean them up.  Most of the time you can just reopen the node but sometimes this indicates something is wedged. -Paul Edmon- On 10/22/2019 5:22 PM, Riebs

Re: [slurm-users] sshare vs sreport

2020-03-02 Thread Paul Edmon
sshare is cumulative statistics, so no window is needed.  It's just the sum of the total usage for whatever window you set for fairshare.  If you set no window then it is everything. -Paul Edmon- On 3/2/20 10:34 AM, Enric Fortin wrote: Hi everyone, I’ve noticed that when using `sshare

Re: [slurm-users] Cluster usage with Slurm

2020-02-17 Thread Paul Edmon
Also if you want tracking of fairshare and other stats in graphite, you can use these: https://github.com/fasrc/slurm-diamond-collector -Paul Edmon- On 2/17/2020 8:57 AM, Chris Samuel wrote: On 17/2/20 4:19 am, Parag Khuraswar wrote: Does Slurm  provide cluster usage reports like mentioned

Re: [slurm-users] How many users are running jobs per day on average in slurm ?

2020-04-02 Thread Paul Edmon
I would recommend setting up XDMoD as it will calculate this, plus a variety of other useful facts: https://open.xdmod.org/8.5/index.html Also if you like grafana you can use this: https://github.com/fasrc/slurm-diamond-collector -Paul Edmon- On 4/2/2020 8:31 AM, Sudeep Narayan Banerjee

Re: [slurm-users] floating condo partition, , no pre-emption, guarantee a max pend time?

2020-04-23 Thread Paul Edmon
would have everything governed purely by fairshare with one large queue and no QoS's For your setup though I think a combination of QoS's and partition layout would fit the bill. -Paul Edmon- On 4/22/2020 5:43 PM, Paul Brunk wrote: Hi all: [ BTW this is the same situation that the submitter

Re: [slurm-users] How to get command from a finished job

2020-04-30 Thread Paul Edmon
if using the backfill scheduling plugin. In order to eliminate some possible race conditions, the minimum non-zero value for *MinJobAge* recommended is 2. -Paul Edmon- On 4/30/2020 3:39 AM, Gestió Servidors wrote: Hello, I would like to know if there exist any way to get the same

Re: [slurm-users] reseting SchedNodeList

2020-03-23 Thread Paul Edmon
You could try holding the job and then releasing it.  I've inquired of SchedMD about this before and this is the response they gave: https://bugs.schedmd.com/show_bug.cgi?id=8069 -Paul Edmon- On 3/23/2020 8:05 AM, Sefa Arslan wrote: Hi, Due to lack of source in a partition, I updated the job

Re: [slurm-users] sshare with usernames too long

2020-03-23 Thread Paul Edmon
--parsable2 will print full names.  You can also use -o to format your output. -Paul Edmon- On 3/23/2020 10:46 AM, Sysadmin CAOS wrote: Hi, when I run "sshare -A myaccount -a" and, myaccount contains usernames with more than 10 characters, "sshare" output shows a "
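A small sketch of the two options mentioned in this reply. The function names and the width of 30 characters are illustrative assumptions; the `%30` suffix on a field asks sshare to print that column 30 characters wide instead of the default.

```shell
# Hypothetical helper: pipe-delimited output, so names are never truncated.
sshare_parsable() {
    sshare -A "$1" -a --parsable2
}

# Hypothetical helper: tabular output with a wider User column.
sshare_wide() {
    sshare -A "$1" -a -o 'Account,User%30,RawShares,RawUsage,FairShare'
}
```

`sshare --helpformat` lists the field names accepted by `-o`.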

Re: [slurm-users] Controlling access to idle nodes

2020-10-06 Thread Paul Edmon
will select which one their job will run on more quickly.  Then we rely on fairshare to adjudicate priority. -Paul Edmon- On 10/6/2020 11:37 AM, Jason Simms wrote: Hello David, I'm still relatively new at Slurm, but one way we handle this is that for users/groups who have "bought in" to t

Re: [slurm-users] How to throttle sinfo/squeue/scontrol show so they don't throttle slurmctld

2020-08-17 Thread Paul Edmon
as there are numerous performance improvements. For something straight out of the box though I would look at defer/max_rpc_cnt as that will help the scheduler cope with high RPC traffic. -Paul Edmon- On 8/17/2020 2:30 PM, Ransom, Geoffrey M. wrote: Hello     We are having performance issues

Re: [slurm-users] Adding Users to Slurm's Database

2020-08-18 Thread Paul Edmon
you want to cut that down by whatever means you think is reasonable. -Paul Edmon- On 8/18/2020 11:36 AM, Jason Simms wrote: Hello everyone! We have a script that queries our LDAP server for any users that have an entitlement to use the cluster, and if they don't already have an account

Re: [slurm-users] Compiling Slurm with nvml support

2020-09-24 Thread Paul Edmon
. We also have a git repo in which we manage our slurm.spec file with a branch for each version and type so we can keep organized. -Paul Edmon- On 9/24/2020 3:31 PM, Dana, Jason T. wrote: Hello, I hopefully have a quick question. I have compiled Slurm RPMs on a CentOS system with nvidia

Re: [slurm-users] Quickly throttling/limiting a specific user's jobs

2020-09-22 Thread Paul Edmon
is Association based.  So you could just modify their account directly and set it to something low. You can also simply put their pending jobs in hold state.  That way they won't be considered for scheduling but won't be outright removed.  Setting fairshare to 0 has the same effect. -Paul Edmon
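One possible way to put all of a user's pending jobs on hold, as suggested here, is to list them with squeue and pass each ID to `scontrol hold`. This is only a sketch; the function name `hold_pending_jobs` is hypothetical and the thread does not say how the hold was actually applied.

```shell
# Hypothetical helper: hold every pending job belonging to the given user.
hold_pending_jobs() {
    user="$1"
    # -h drops the header, -t PENDING filters by state, -o '%i' prints job IDs.
    squeue -h -u "$user" -t PENDING -o '%i' | while read -r jobid; do
        scontrol hold "$jobid"
    done
}
```

The jobs stay in the queue and can be made eligible again later with `scontrol release` on the same IDs.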

Re: [slurm-users] How to contact slurm developers

2020-09-30 Thread Paul Edmon
The bug site is the best way.  The devs prioritize sponsored features over general community requested features. -Paul Edmon- On 9/30/2020 11:34 AM, Ryan Novosielski wrote: I’ve previously seen code contributed back in that way. See bug 1611 as an example (happened to have looked at that just

Re: [slurm-users] Limit a partition or host to jobs less than 4 cores?

2020-09-30 Thread Paul Edmon
Probably the best way to accomplish this is via a job_submit.lua script.  That way you can reject at submission time.  There isn't a feature in the partition configurations that I am aware of that can accomplish this but a custom job_submit script certainly can. -Paul Edmon- On 9/30/2020 11:44

Re: [slurm-users] Fair share per partition

2020-09-17 Thread Paul Edmon
So the way we handle it is that we give a blanket fairshare to everyone but then dial in our TRES charge back on a per partition basis based on hardware.  Our fairshare doc has a fuller explanation: https://docs.rc.fas.harvard.edu/kb/fairshare/ -Paul Edmon- On 9/17/2020 9:30 AM, Mark Dixon

Re: [slurm-users] SLURM launching jobs onto nodes with suspended jobs may lead to resource contention

2020-09-16 Thread Paul Edmon
reserved.  That's the natural understanding of suspend, but that's not the way suspend actually works in Slurm. -Paul Edmon- On 9/16/2020 6:08 AM, SJTU wrote: Hi, I am using SLURM 19.05 and found that SLURM may launch jobs onto nodes with suspended jobs, which leads to resource contention

Re: [slurm-users] job limit time requested on fairshare algorithm

2020-09-18 Thread Paul Edmon
No, you are only charged for time you actually use. -Paul Edmon- On 9/18/2020 11:09 AM, Angelo wrote: Hi all, Is the job limit time requested (--time=) considered in the classic fairshare algorithm? Example: if I set the job time limit to 1 day (--time=24:00:00) and the job ends in 4

Re: [slurm-users] Jobs stuck in "completing" (CG) state

2020-10-24 Thread Paul Edmon
This can happen if the underlying storage is wedged.  I would check that it is working properly. Usually the only way to clear this state is either fix the stuck storage or reboot the node. -Paul Edmon- On 10/24/2020 12:22 PM, Kimera Rodgers wrote: I'm setting up slurm on OpenHPC cluster

Re: [slurm-users] Reservation vs. Draining for Maintenance?

2020-08-06 Thread Paul Edmon
to down, then run a cancel over all the running jobs.  Pending jobs are left in place, and users are allowed to submit work during the outage and when we reopen everything gets going again. So there is a third option, though you have to accept that jobs will be cancelled to pull it off. -Paul

Re: [slurm-users] Tuning MaxJobs and MaxJobsSubmit per user and for the whole cluster?

2020-08-07 Thread Paul Edmon
user can run without causing damage to themselves or the underlying filesystems, or interfering with other users.  Practical experience has led to us setting that limit to be 10,000 on our cluster, but I imagine it will vary from location to location. -Paul Edmon- On 8/6/2020 10:31 PM

Re: [slurm-users] Tuning MaxJobs and MaxJobsSubmit per user and for the whole cluster?

2020-08-10 Thread Paul Edmon
(130 as of last count) so our tuning has been a bit more complicated.  However the latest version of slurm (20.02) vastly improved the backfill efficiency which has helped with making sure the cluster is full.  Nonetheless we still seem to average a job per core per day here. -Paul Edmon- On 8

Re: [slurm-users] priority/multifactor, sshare, and AccountingStorageEnforce

2020-07-09 Thread Paul Edmon
Try setting RawShares to something greater than 1.  I've seen cases where setting it to 1 creates weirdness like this. -Paul Edmon- On 7/9/2020 1:12 PM, Dumont, Joey wrote: Hi, We recently set up fair tree scheduling (we have 19.05 running), and are trying to use sshare to see

Re: [slurm-users] How to queue jobs based on non-existent features

2020-07-10 Thread Paul Edmon
You could set up a dummy node that has the features that are not active but not allow jobs to schedule to that node by setting it to DOWN.  That would be a hacky way of accomplishing this. -Paul Edmon- On 7/9/2020 7:15 PM, Raj Sahae wrote: Hi all, My apologies if this is sent twice
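
A minimal sketch of the last step, assuming the dummy node is already defined in slurm.conf with the desired features: it builds the `scontrol update` command that marks the node DOWN. The node name `dummy01`, the reason text, and the helper name are hypothetical, and the command is only constructed here, not run.

```python
import shlex
from typing import List


def mark_down_cmd(node: str, reason: str) -> List[str]:
    """Build the scontrol command that sets a node to DOWN with a reason."""
    return [
        "scontrol", "update",
        f"NodeName={node}",
        "State=DOWN",
        f"Reason={reason}",
    ]


cmd = mark_down_cmd("dummy01", "placeholder")
printable = shlex.join(cmd)  # shell-safe string for logging or review
```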

Re: [slurm-users] How to queue jobs based on non-existent features

2020-07-10 Thread Paul Edmon
Another option would be to use the license feature and just set licenses to 0 when they aren't available. -Paul Edmon- On 7/10/2020 12:42 PM, Raj Sahae wrote: Hi Brian and Paul, You both sent me suggestions about using an offline dummy node with all features set. Thanks for your ideas

Re: [slurm-users] [External] How to exclude nodes in sbatch/srun?

2020-06-22 Thread Paul Edmon
For the record we filed a bug on this years ago: https://bugs.schedmd.com/show_bug.cgi?id=3875  Hasn't been fixed yet though everyone seems to agree it is a good idea. Florian's suggestion is probably the best stopgap until this feature is implemented. -Paul Edmon- On 6/22/2020 7:11 AM

Re: [slurm-users] Difference between fairshare and fair-share?

2020-06-25 Thread Paul Edmon
Yes.  I have a discussion here which might be useful: https://docs.rc.fas.harvard.edu/kb/fairshare/ Note this is using the classic fairshare not FairTree which is now the default for Slurm. -Paul Edmon- On 6/25/2020 9:23 AM, Durai Arasan wrote: Hi, In slurm accounting

Re: [slurm-users] Make "srun --pty bash -i" always schedule immediately

2020-06-11 Thread Paul Edmon
and won't impact larger work.  I don't necessarily recommend that.  A single node with oversubscribe should be sufficient.  If you can't spare a single node then a VM would do the job. -Paul Edmon- On 6/11/2020 9:28 AM, Renfro, Michael wrote: That’s close to what we’re doing, but without dedicated

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2020-07-23 Thread Paul Edmon
Same here.  Whenever we see rashes of Kill task failed it is invariably symptomatic of one of our Lustre filesystems acting up or being saturated. -Paul Edmon- On 7/22/2020 3:21 PM, Ryan Cox wrote: Angelos, I'm glad you mentioned UnkillableStepProgram.  We meant to look at that a while ago

Re: [slurm-users] Reset Fair-share tree account values

2020-07-16 Thread Paul Edmon
very useful. -Paul Edmon- On 7/16/2020 8:42 AM, Paul Edmon wrote: A trick you can use to reset certain users (which I have used before) is to simply delete them from the slurmdb and then re-add them.  At least under the other fairshare system, which is what our site uses, that would remove

Re: [slurm-users] Reset Fair-share tree account values

2020-07-16 Thread Paul Edmon
assuming fairtree works the same way. -Paul Edmon- On 7/16/2020 5:49 AM, Gestió Servidors wrote: Hello, I will try to explain a scenario that occurs in my SLURM cluster. A large number of users (accounts) belong to students of a certain subject. That subject lasts 6 months. When

Re: [slurm-users] Reset Fair-share tree account values

2020-07-16 Thread Paul Edmon
Wow, nice find.  I wasn't even aware of that one.  Hopefully they will support the ability to reset to other values in the future as that would be a handy ability. -Paul Edmon- On 7/16/2020 12:56 PM, Sebastian T Smith wrote: `sacctmgr` can be used to reset the accrued RawUsage value
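
For reference, the reset discussed here can be sketched as a command builder. It assumes the `RawUsage=0` specification of `sacctmgr modify`, with zero being the only value accepted, as the thread notes. The account name `students` and the helper name are hypothetical, and the command is only constructed, not executed.

```python
import shlex
from typing import List


def reset_raw_usage_cmd(account: str) -> List[str]:
    """Build the sacctmgr command that zeroes an account's accrued usage."""
    return [
        "sacctmgr", "modify", "account",
        "where", f"name={account}",
        "set", "RawUsage=0",  # zero is the only value that can be set
    ]


cmd = reset_raw_usage_cmd("students")
printable = shlex.join(cmd)
```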

Re: [slurm-users] Reset FairShare?

2020-07-27 Thread Paul Edmon
://slurm.schedmd.com/sacctmgr.html -Paul Edmon- On 7/27/2020 2:17 PM, Jason Simms wrote: Dear all, Apologies for the basic question. I've looked around online for an answer to this, and I haven't found anything that has helped accomplish exactly what I want. That said, it is also probable that what
