[slurm-dev] Re: Thoughts on GrpCPURunMins as primary constraint?

2017-07-24 Thread Ryan Cox
sure users explicitly requesting slow nodes instead of just dumping them on ancient Opterons). Also, each user gets their own Account, so the QoS Grp limits apply to each human separately. Accounts would also have absolute core limits. Thank you for your thoughts! Corey -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Job Submit Lua Plugin

2017-06-27 Thread Ryan Cox
OS: Ubuntu 16.04 Lua: lua5.2 and liblua5.2-dev (I can use Lua interactively) SLURM version: 17.02.5, compiled from source (after installing Lua) using ./configure --prefix=/usr --sysconfdir=/etc/slurm Any guidance to get me up and running would be greatly appreciated! Thanks, Nathan -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Ryan Cox
Regards. > Could you set AllowedRamSpace/AllowedSwapSpace in /etc/slurm/cgroup.conf to some big number? That way the job memory limit will be the cgroup soft limit, and the cgroup hard limit, which is when the kernel will OOM kill the job, would be "job_memory_limit * AllowedRamSpace", that is, some large value? -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University
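The suggestion above amounts to a couple of lines in cgroup.conf. A minimal sketch follows; the 500% figures are placeholder assumptions, not values from the thread, and the parameter spelling in current Slurm documentation is AllowedRAMSpace:

```conf
# /etc/slurm/cgroup.conf (sketch)
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
# Hard limit becomes job_memory_limit * (AllowedRAMSpace / 100),
# so a large percentage turns the job's requested limit into a soft limit only.
AllowedRAMSpace=500
AllowedSwapSpace=500
```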

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Cox
blazed this trail before, but this is how I am going about it. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Cox
-- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

2016-09-19 Thread Ryan Cox
22.2 20.4 17 16.9799 * denotes the node where the batch script executes (node 0) CPU usage is cumulative since the start of the job Ryan On 09/19/2016 11:13 AM, Ryan Cox wrote: We use this script that we cobbled together: https://github.com/BYUHPC/slurm-random/blob/master/rjobstat

[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

2016-09-19 Thread Ryan Cox
We use this script that we cobbled together: https://github.com/BYUHPC/slurm-random/blob/master/rjobstat. It assumes that you're using cgroups. It uses ssh to connect to each node so it's not very scalable but it works well enough for us. Ryan On 09/18/2016 06:42 PM, Igor Yakushin wrote:
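One small, self-contained piece of what a monitor like rjobstat does is turning the raw cgroup byte counters into human-readable figures. The helper below is my own sketch, not code from rjobstat:

```python
def fmt_mem(usage_in_bytes: int) -> str:
    """Render a cgroup memory.usage_in_bytes counter as GiB,
    the way a per-node job monitor might display it."""
    return f"{usage_in_bytes / 2**30:.2f}G"

# A node reporting 17.5 GiB resident for the job:
print(fmt_mem(18790481920))  # 17.50G
```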

[slurm-dev] Re: scontrol update not allowing jobs

2016-04-15 Thread Ryan Cox
George Washington University 725 21st Street Washington, DC 20052 Suite 211, Corcoran Hall == On Fri, Apr 15, 2016 at 1:07 PM, Ryan Cox <ryan_...@byu.edu> wrote: Did you try this: --reservation=root_13

[slurm-dev] Re: scontrol update not allowing jobs

2016-04-15 Thread Ryan Cox
Did you try this: --reservation=root_13 On 04/15/2016 08:10 AM, Glen MacLachlan wrote: scontrol update not allowing jobs Dear all, Wrapping up a maintenance period and I want to run some test jobs before I release the reservation and allow regular user jobs to start running. I've modified

[slurm-dev] Re: AssocGrp*Limits being considered for scheduling

2016-02-23 Thread Ryan Cox
Coincidentally, I asked about that yesterday in a bug report: http://bugs.schedmd.com/show_bug.cgi?id=2465. The short answer is to use SchedulerParameters=assoc_limit_continue that was introduced in 15.08.8. It only works if the Reason for the job is something like Assoc*Limit. Ryan On
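The fix mentioned is a single slurm.conf setting (a sketch; as noted above it requires 15.08.8 or later):

```conf
# slurm.conf: keep scheduling past jobs that are pending on an
# association limit instead of stopping the queue scan at them.
SchedulerParameters=assoc_limit_continue
```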

[slurm-dev] Re: distribution for array jobs

2016-01-28 Thread Ryan Cox
ug slurm_ar user1 R 8:53:26 1 compute48* In my config, I have: SelectType=select/cons_res SelectTypeParameters=CR_CORE_MEMORY What am I missing to get more than one job to run on a node? Thanks in advance, Brian Andrus

[slurm-dev] Re: Slurmd restart without losing jobs?

2015-10-13 Thread Ryan Cox
there, and slurmctld decided the data was invalid and killed all jobs. (I don't know if this is still a problem.) -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Batch job submission failed: Invalid account or account/partition combination specified

2015-09-08 Thread Ryan Cox
We have seen similar issues on 14.11.8 but haven't bothered to diagnose or report it. I think I've seen it twice so far out of dozens of new users. Ryan On 09/07/2015 09:16 AM, Loris Bennett wrote: Hi, This problem occurs with 14.11.8. A user I set up today got the following error when

[slurm-dev] Re: Changing /dev file permissions for particular user

2015-06-24 Thread Ryan Cox
Be sure to test it first before trying anything else: https://stackoverflow.com/questions/18661976/reading-dev-cpu-msr-from-userspace-operation-not-permitted. We ran into this issue once when we had a trusted person and we couldn't easily grant him access to the MSRs. We couldn't find a good

[slurm-dev] Re: concurrent job limit

2015-06-11 Thread Ryan Cox
-- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: FAIR_TREE in SLURM 14.11

2015-06-04 Thread Ryan Cox
Telecommunications and Learning Technologies Phone: (979)458-2396 Email: treyd...@tamu.edu Jabber: treyd...@tamu.edu On Thu, Jun 4, 2015 at 11:51 AM, Ryan Cox <ryan_...@byu.edu> wrote: Trey, In http

[slurm-dev] Re: FAIR_TREE in SLURM 14.11

2015-06-04 Thread Ryan Cox
Phone: (979)458-2396 Email: treyd...@tamu.edu Jabber: treyd...@tamu.edu -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: GPU node allocation policy

2015-04-07 Thread Ryan Cox
- Senior Technologist, OIRT/High Perf Res Comp - MSB C630, Newark | novos...@rutgers.edu - 973/972.0922 (2x0922) On Apr 6, 2015, at 20:17, Ryan Cox <ryan_...@byu.edu> wrote: Chris, Just have GPU users request the numbers of CPU cores that they need and don't

[slurm-dev] Re: GPU node allocation policy

2015-04-06 Thread Ryan Cox
Chris, Just have GPU users request the numbers of CPU cores that they need and don't lie to Slurm about the number of cores. If a GPU user needs 4 cores and 4 GPUs, have them request that. That leaves 20 cores for others to use. Ryan On 04/06/2015 03:43 PM, Christopher B Coffey wrote:
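The request pattern described (4 cores and 4 GPUs, leaving the rest of the node free) would look like this in a batch script. This is a sketch only; the gres name gpu, the counts, and the application name are assumptions, not from the thread:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4          # the 4 CPU cores the GPU code actually needs
#SBATCH --gres=gpu:4        # the 4 GPUs
# On a 24-core node this leaves 20 cores allocatable to other jobs.
srun ./gpu_app
```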

[slurm-dev] RE: fairshare allocations

2015-01-21 Thread Ryan Cox
On 01/21/2015 09:23 AM, Bill Wichser wrote: A user underneath gets the expected 0.009091 normalized shares since there are a lot of fairshare=1 users there. The user3 gets basically 25x this value as the fairshare for user3=25 Yet the normalized shares is actually MORE than the

[slurm-dev] Re: [ sshare ] RAW Usage

2014-11-26 Thread Ryan Cox
*From:* Ryan Cox ryan_...@byu.edu *Sent:* 25 November 2014 17:43 *To:* slurm-dev *Subject:* [slurm-dev] Re: [ sshare ] RAW Usage Raw usage is a long double and the time added by jobs can be off by a few seconds. You can take a look at _apply_new_usage() in src/plugins/priority/multifactor

[slurm-dev] Re: [ sshare ] RAW Usage

2014-11-26 Thread Ryan Cox
in the SLURM accounting DB? I could not find any value for the JOB that corresponds to this RAW usage. Roshan From: Ryan Cox ryan_...@byu.edu Sent: 25 November 2014 17:43 To: slurm-dev Subject: [slurm-dev] Re: [ sshare ] RAW Usage Raw usage is a long double and the time added by jobs can

[slurm-dev] Re: [ sshare ] RAW Usage

2014-11-25 Thread Ryan Cox
Raw usage is a long double and the time added by jobs can be off by a few seconds. You can take a look at _apply_new_usage() in src/plugins/priority/multifactor/priority_multifactor.c to see exactly what happens. Ryan On 11/25/2014 10:34 AM, Roshan Mathew wrote: Hello SLURM users,

[slurm-dev] Re: How many accounts can SLURM support?

2014-11-19 Thread Ryan Cox
Dave, I have done testing on 5-6 year old hardware with 100,000 users randomly distributed in 10,000 accounts with semi-random depths with most being between 1-4 levels from root but some much deeper than that, plus 100,000 jobs pending. slurmctld startup time was really long but, after

[slurm-dev] Re: Non static partition definition

2014-10-30 Thread Ryan Cox
George, Wouldn't a QOS with GrpNodes=10 accomplish that? Ryan On 10/30/2014 11:47 AM, Brown George Andrew wrote: Hi, I would like to have a partition of N nodes without statically defining which nodes should belong to a partition and I'm trying to work out the best way to achieve this.
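Setting up such a QOS and attaching it might look like the following sacctmgr session. This is a sketch; the QOS and account names are invented for illustration:

```shell
# Create a QOS that caps its jobs at 10 nodes in use at once
sacctmgr add qos ten_nodes
sacctmgr modify qos ten_nodes set GrpNodes=10
# Attach it to the account whose jobs should be capped
sacctmgr modify account george_grp set qos+=ten_nodes
```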

[slurm-dev] Re: Understanding Fairshare and effect on background/backfill type partitions

2014-10-27 Thread Ryan Cox
Trey, I'm not sure why your jobs aren't starting. Someone else will have to answer that question. You can model an organizational hierarchy a lot better in 14.11 due to changes in Fairshare=parent for accounts. If you only want fairshare to matter at the research group and user levels but

[slurm-dev] RE: EXTERNAL: Re: question on multifactor priority plugin - fairshare basics

2014-10-16 Thread Ryan Cox
wishing the values would be between 0.0 and 1.0, but I can work with 0.5 as the max value. It just means that I need to double the PriorityWeightFairshare factor in order to achieve the intended relative weighting between Fairshare, QOS, Partitions, JobSize, Age. Ed *From:* Ryan Cox [mailto:ryan_

[slurm-dev] Re: question on multifactor priority plugin - fairshare basics

2014-10-14 Thread Ryan Cox
I assume you are using the default fairshare algorithm since you didn't specify otherwise. F=2**(-U/S) where U is Effectv Usage (often displayed in documentation as UE) and S is Norm Shares. See http://slurm.schedmd.com/priority_multifactor.html under the heading The SLURM Fair-Share
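The formula is easy to check numerically. A minimal sketch of the default algorithm's fairshare factor, with variable names of my own choosing:

```python
def fairshare_factor(effective_usage: float, norm_shares: float) -> float:
    """Classic Slurm multifactor fairshare: F = 2 ** (-U / S),
    where U is effective usage and S is normalized shares."""
    if norm_shares <= 0:
        return 0.0
    return 2 ** (-effective_usage / norm_shares)

# An association consuming exactly its share (U == S) lands at F = 0.5;
# one with no usage at all gets the maximum F = 1.0.
print(fairshare_factor(0.10, 0.10))  # 0.5
print(fairshare_factor(0.0, 0.10))   # 1.0
```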

[slurm-dev] Re: Authentication and invoking slurm commands from web app

2014-10-02 Thread Ryan Cox
-- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Submitting to multiple partitions with job_submit plugin (Was: Implementing fair-share policy using BLCR)

2014-09-29 Thread Ryan Cox
On 09/23/2014 11:27 AM, Trey Dockendorf wrote: Has anyone used the Lua job_submit plugin and also allows multiple partitions? I'm not even user what the partition value would be in the Lua code when a job is submitted with --partition=general,background, for example. We do. We use the

[slurm-dev] Re: Dynamic partitions on Linux cluster

2014-08-14 Thread Ryan Cox
have to use accounting to enforce the limits? Or is there another way that I don't see? Best regards, Uwe Sauter -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox
different from the DRF ones? Or if one does specify different rates, it might end up breaking some of the fairness properties that are described in the DRF paper and opens up the algorithm for gaming? -- Janne Blomqvist From: Ryan Cox [ryan_...@byu.edu] Sent

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox
Thanks. I can certainly call it that. My understanding is that this would be a slightly different implementation from Moab/Maui, but I don't know those as well so I could be wrong. Either way, the concept is similar enough that a more recognizable term might be good. Does anyone else

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-29 Thread Ryan Cox
in a bug report from the University of Chicago: http://bugs.schedmd.com/show_bug.cgi?id=858. Ryan On 07/25/2014 10:31 AM, Ryan Cox wrote: Bill and Don, We have wondered about this ourselves. I just came up with this idea and haven't thought it through completely, but option two seems like

[slurm-dev] Re: fairshare

2014-07-15 Thread Ryan Cox
just need to figure out a database query to cull this information? Thanks, Bill -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: installing slurm on CentOS 5.10

2014-06-24 Thread Ryan Cox
management system. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] LEVEL_BASED prioritization method

2014-06-20 Thread Ryan Cox
to our use case), see http://tech.ryancox.net/2014/06/problems-with-slurm-prioritization.html. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Fairshare=parent on an account: What should it do?

2014-06-10 Thread Ryan Cox
student to have administrative control over the subaccount since he actually knows the students but not have it affect priority calculations. Ryan -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: How to spread jobs among nodes?

2014-05-08 Thread Ryan Cox
', 'SelectTypeParameters' = 'CR_Core_Memory', 'SelectType'= 'select/cons_res', -- Perfection is just a word I use occasionally with mustard. --Atom Powers-- -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Need Help Understanding Cgroup Swapiness

2014-04-21 Thread Ryan Cox
the exceeding 50 MB or so... they would actually fit in the swap area and the job should not be killed... What am I missing here? Should the code itself be aware of the given mem.limit=9000MB? Thanks for any explanation. MG -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young

[slurm-dev] Re: SLURM as a load balancer for interactive use

2014-03-25 Thread Ryan Cox
for Science Ltd. E-Mail: olli-pekka.le...@csc.fi Tel: +358 50 381 8604 skype: oplehto // twitter: ople -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University http://tech.ryancox.net

[slurm-dev] Re: Job being canceled due to time limits

2013-09-05 Thread Ryan Cox
is being imposed about 5 minutes into the job. Thanks -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: job steps not properly identified for jobs using step_batch cgroups

2013-08-12 Thread Ryan Cox
andy wettstein hpc system administrator research computing center university of chicago 773.702.1104 -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: cgroups usage

2013-08-06 Thread Ryan Cox
. Is this amount realistic? Is there a more efficient method to control memory usage on nodes which are shared? Thank you for any advice, Kevin -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Job submit plugin to improve backfill

2013-06-28 Thread Ryan Cox
. -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Job Groups

2013-06-19 Thread Ryan Cox
at the documentation I don't see any way to do this other than what I stated above. -Paul Edmon- -- Ryan Cox Operations Director Fulton Supercomputing Lab Brigham Young University

[slurm-dev] Re: Job Groups

2013-06-19 Thread Ryan Cox
, Ryan Cox wrote: Paul, We were discussing this yesterday due to a user not limiting the amount of jobs hammering our storage. A QOS with a GrpJobs limit sounds like the best approach for both us and you. Ryan On 06/19/2013 09:36 AM, Paul Edmon wrote: I have a group here that wants to submit

[slurm-dev] Re: untracked processes

2013-02-21 Thread Ryan Cox
and then launches sub-processes to solve (MPI on a local system) without connecting the process IDs(?) In any event, I'm guessing I'm not the first person to run into this. Is there a recommended solution to configure SLURM to track codes like this? Thanks, ~Mike C. -- Ryan Cox