[slurm-dev] Re: How do the terms “job”, “task”, and “step” relate to each other?

2017-10-02 Thread Lipari, Don
Allow me to direct you to a doc I created that could help answer your question: https://hpc.llnl.gov/banks-jobs/running-jobs/slurm-user-manual And for a more general intro, see: https://hpc.llnl.gov/banks-jobs/running-jobs/batch-system-primer Don On 10/2/17, 6:16 AM, "Kynn Jones"

[slurm-dev] Re: Change in meaning of --nodelist

2017-07-26 Thread Lipari, Don
Loris, Taking Slurm v2.3 as a data point, the sbatch behavior seems consistent with what you report for v16.05. I deliberately attempted to request fewer nodes than I specified with the --nodelist option (attempting the potential behavior you describe below) and sbatch complained: $ sbatch

[slurm-dev] Re: LLNL sxterm command

2017-04-21 Thread Lipari, Don
sxterm is a script LLNL created to simplify the request for a job that launches an xterm window. While one can replicate everything sxterm does with sbatch + job script, it has become so popular with our users that I included it in the official Slurm commands listed in the manual you cite. Don

[slurm-dev] Slurm Training

2017-04-13 Thread Lipari, Don
A recent post on Slurm training prompted me to mention that I created for our users the following guides to using Slurm: https://hpc.llnl.gov/banks-jobs/running-jobs The Batch System Primer introduces new users to HPC batch scheduling concepts. The Slurm User Manual and Quick Start Guide

[slurm-dev] RE: sacctmgr modify cluster controlhost?

2016-05-10 Thread Lipari, Don
You might consider this fix: https://github.com/SchedMD/slurm/commit/79c9a49913625a0f790f29897f0f099156f94268 It was committed to the slurm-15.08 branch yesterday. Don Lipari > -Original Message- > From: Ryan Novosielski [mailto:novos...@rutgers.edu] > Sent: Tuesday, May 10, 2016 8:36

[slurm-dev] RE: multifactor2 priority weighting

2015-05-13 Thread Lipari, Don
The original multi-factor plugin generated fair-share factors ranging between 0.0 and 1.0. Under this formula, 0.5 was the factor a user/account would see if their usage was commensurate with their shares. The multi-factor2 plugin fair-share factors range between 0.0 and 100.0, with 1.0
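
As a rough illustration of the 0.0 to 1.0 range described in this post, the sketch below assumes the formula given in the Slurm multifactor priority documentation for the original plugin, F = 2^(-U/S), where U is normalized usage and S is normalized shares. The function and parameter names are invented for this example, and the handling of zero shares is an assumption.

```python
def fair_share_factor(usage: float, shares: float) -> float:
    """Fair-share factor, assuming F = 2 ** (-usage / shares).

    usage and shares are normalized fractions of the whole machine
    (hypothetical parameter names, not Slurm identifiers).
    """
    if shares <= 0:
        return 0.0  # assumption: no shares means no fair-share credit
    return 2.0 ** (-usage / shares)


# No usage, usage equal to shares, and usage at three times the shares.
print(fair_share_factor(0.0, 0.25))
print(fair_share_factor(0.25, 0.25))
print(fair_share_factor(0.75, 0.25))
```

Under this assumed formula, usage equal to shares gives exactly 0.5, which matches the midpoint described in the post.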

[slurm-dev] RE: multifactor2 priority weighting

2015-05-13 Thread Lipari, Don
. Don -Original Message- From: Lipari, Don [mailto:lipa...@llnl.gov] Sent: Wednesday, May 13, 2015 7:56 AM To: slurm-dev Subject: [slurm-dev] RE: multifactor2 priority weighting The original multi-factor plugin generated fair-share factors ranging between 0.0 and 1.0. Under

[slurm-dev] Re: possible bug in srun --unbuffered option

2015-03-09 Thread Lipari, Don
-Original Message- From: David Bigagli [mailto:da...@schedmd.com] Sent: Thursday, March 05, 2015 10:49 AM To: slurm-dev Subject: [slurm-dev] Re: possible bug in srun --unbuffered option Hi, can we see what script is broken? You may want to review #991 to get the full

[slurm-dev] RE: fairshare allocations

2015-01-21 Thread Lipari, Don
-Original Message- From: Bill Wichser [mailto:b...@princeton.edu] Sent: Wednesday, January 21, 2015 8:23 AM To: slurm-dev Subject: [slurm-dev] RE: fairshare allocations On 01/21/2015 11:07 AM, Lipari, Don wrote: -Original Message- From: Bill Wichser [mailto:b

[slurm-dev] RE: fairshare allocations

2015-01-21 Thread Lipari, Don
-Original Message- From: Bill Wichser [mailto:b...@princeton.edu] Sent: Wednesday, January 21, 2015 5:20 AM To: slurm-dev Subject: [slurm-dev] fairshare allocations The algorithm I use is fairtree under 14.11 but I believe that my question relates to any method. As a

[slurm-dev] RE: slurm and grid capabilities

2014-10-03 Thread Lipari, Don
Slurm has a multi-cluster (grid) functionality when Slurm’s database is installed and active. This allows one cluster to submit jobs, and invoke status commands, to another cluster. Jobs submitted that specify multiple cluster candidates will be submitted to the cluster that is projected to

[slurm-dev] RE: Pass environmental variables to srun

2014-07-29 Thread Lipari, Don
Mike, runjob is the task launcher on BG/Q. Take a look at the --envs option documented in the runjob man page. Then look at srun's man page for the --launcher-opts option. I believe the answer may involve the use of the two. It will be something like: srun [...] --launcher-opts=='--envs

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-25 Thread Lipari, Don
Bill, As I understand the dilemma you presented, you want to maximize the utilization of node resources when running with Slurm configured for SelectType=select/cons_res. To do this, you would like to nudge users into requesting only the amount of memory they will need for their jobs. The

[slurm-dev] RE: ReturnToService

2014-04-10 Thread Lipari, Don
First off, the assumption is that you have the same slurm.conf files across your management node and all the compute nodes. The node definitions listed in your slurm.conf file become the gold standard. For example: NodeName=rzmerl[1-152] NodeAddr=erzmerl[1-152] Sockets=2 CoresPerSocket=8

[slurm-dev] RE: rpmbuild params

2014-04-07 Thread Lipari, Don
-Original Message- From: Schmidtmann, Carl [mailto:carl.schmidtm...@rochester.edu] Sent: Monday, April 07, 2014 9:47 AM To: slurm-dev Subject: [slurm-dev] rpmbuild params I am trying to build the rpms for slurm using rpmbuild. I followed the instructions in the install guide

[slurm-dev] RE: sshare and sacct

2014-03-17 Thread Lipari, Don
Gary, The sacct command retrieves job and job step records from the slurmdb and reports the statistics for the requested job(s). The sshare command provides the basis for the fair-scheduling component of the multi-factor plugin. sshare lists the two components (shares and usage) which are

[slurm-dev] RE: sshare and sacct

2014-03-17 Thread Lipari, Don
for non-zero values was never implemented. Don - Gary Skouson -Original Message- From: Lipari, Don [mailto:lipa...@llnl.gov] Sent: Monday, March 17, 2014 8:47 AM To: slurm-dev Subject: [slurm-dev] RE: sshare and sacct Gary, The sacct command retrieves job and job step

[slurm-dev] RE: error: We have more allocated time than is possible...

2014-03-13 Thread Lipari, Don
with very old start times (or with excessive elapsed times) that are still running. sacct -a -X -S 1/1/13 -L -sR -o jobid,user,account,cluster,partition,nnodes,state,start,end,elapsed,exitcode Don -Original Message- From: Lipari, Don Sent: Thursday, May 02, 2013 8:55 AM To: slurm-dev

[slurm-dev] Re: Slurm configuration problem --Age factor not working @ all

2013-11-12 Thread Lipari, Don
There are two ways to confirm your PriorityWeightAge setting has been read in: $ scontrol show conf | grep PriorityWeightAge or $ sprio -w Once you confirm that the system has recognized the values you set in your slurm.conf file, if you still see the problem, I suggest you turn up debugging
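
A small sketch of how the first check could be scripted. It assumes the configuration dump is laid out as "Key = Value" lines; the sample text is made up for illustration, and on a real system it would come from the scontrol command quoted in the post.

```python
def parse_config(text: str) -> dict:
    """Turn 'Key = Value' lines into a dict; lines without '=' are skipped."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Made-up sample text (assumption), standing in for real scontrol output.
sample = """\
PriorityType            = priority/multifactor
PriorityWeightAge       = 1000
PriorityWeightFairShare = 10000
"""

weights = parse_config(sample)
print(weights.get("PriorityWeightAge", "not set"))
```

A missing key or a value of 0 would suggest the setting in slurm.conf was not picked up.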

[slurm-dev] RE: Job numeration reset

2013-10-01 Thread Lipari, Don
Vsevolod, If one of the remaining files that might cause a conflict with the newer SLURM package that you removed was job_state (located in your configured StateSaveLocation) then you deleted Slurm's record of the running job ID. It will reset to your configured FirstJobId after that. There is

[slurm-dev] RE: Problems with priority multifactor being ignored.

2013-09-20 Thread Lipari, Don
Alan, Without more info, it is hard to tell. But the first thing I would consider is that the lower priority jobs are being backfilled. See what you've defined in your slurm.conf: SchedulerType=sched/backfill Don -Original Message- From: Alan V. Cowles

[slurm-dev] RE: Invalid account or account/partition combination specified

2013-05-31 Thread Lipari, Don
Re-run your command and add the WithAssoc option. Then see whether there is a default account present for the cluster of interest. There is probably not and you will have to add it: sacctmgr modify user k202066 cluster=cluster_of_interest set DefaultAccount=intern While default accounts used

[slurm-dev] RE: error: We have more allocated time than is possible...

2013-05-02 Thread Lipari, Don
Have a look at http://bugs.schedmd.com/show_bug.cgi?id=247 and the patches that resulted. Don -Original Message- From: Luis Alves [mailto:luis.al...@csc.fi] Sent: Thursday, May 02, 2013 5:07 AM To: slurm-dev Subject: [slurm-dev] error: We have more allocated time than is

[slurm-dev] RE: Question on sreport

2013-04-25 Thread Lipari, Don
Dustin, Try adding the cluster=your_cluster option to your sreport command. If it succeeds, there could be a problem with the ClusterName setting in your slurm.conf file on the head node. If sreport still fails, look into your slurmdbd.log and see whether there are any entries added when you

[slurm-dev] Re: Interesting queries / plots for slurmdbd?

2013-03-31 Thread Lipari, Don
This could be what you remember: https://github.com/SchedMD/slurm/blob/master/contribs/web_apps/chart_stats.cgi It reads the parameters of the user’s selection, invokes sreport to retrieve the data, and displays the info in charts. I’m afraid I have not been actively maintaining this utility,

[slurm-dev] RE: Ghost JobID in Database?

2013-01-30 Thread Lipari, Don
Matteo, I suspect something happened that prevented the state change for your jobs below from being propagated to the database. You are going to have to modify them manually in mysql. Change the state for these jobs to 3 (JOB_COMPLETE). I would also update the time_end field to something

[slurm-dev] RE: Independent FairShare Configuration for N Cluster Segments

2013-01-28 Thread Lipari, Don
Matt, To allow the jobs from each owner (11, 12, and 13) to be prioritized based on their usage, do not include the fairshare=parent setting. Otherwise, your settings should result in usage that reflects the shares you have configured, assuming all three owners have been submitting enough jobs

[slurm-dev] Re: Access to account in lua job submit plugin

2012-11-05 Thread Lipari, Don
-Original Message- From: Moe Jette [mailto:je...@schedmd.com] Sent: Monday, November 05, 2012 12:24 PM To: slurm-dev Subject: [slurm-dev] Re: Access to account in lua job submit plugin Hi Kent, Responses in-line below. Quoting Kent Engström k...@nsc.liu.se: [...] -

[slurm-dev] Re: slurmdbd: more time than is possible

2012-10-18 Thread Lipari, Don
-Original Message- From: Loris Bennett [mailto:loris.benn...@fu-berlin.de] Sent: Thursday, October 18, 2012 5:33 AM To: slurm-dev Subject: [slurm-dev] Re: slurmdbd: more time than is possible Loris Bennett loris.benn...@fu-berlin.de writes: What additional information, if

[slurm-dev] RE: Reset job's priority to automatic

2012-09-21 Thread Lipari, Don
Miguel, Try holding and then releasing the job. Don From: Miguel Méndez [mailto:miguel.men...@uam.es] Sent: Friday, September 21, 2012 4:30 AM To: slurm-dev Subject: [slurm-dev] Reset job's priority to automatic Hi, I set some job's priorities manually to some value. After that, those job's

[slurm-dev] RE: Monitoring job resources

2012-09-11 Thread Lipari, Don
Yuri, Enable and configure the jobacct_gather plugin. Then try out the sstat command. Don -Original Message- From: Yuri D'Elia [mailto:wav...@thregr.org] Sent: Tuesday, September 11, 2012 7:39 AM To: slurm-dev Subject: [slurm-dev] Monitoring job resources Is it possible to

[slurm-dev] RE: Tools for parsing logs

2012-08-28 Thread Lipari, Don
Take a look at slurm/contribs/web_apps/chart_stats.cgi. It is something I threw together to graph accounting statistics. I have not updated it in quite a while, but it should give you some ideas. The tool generates reports of: Usage - Single User Usage - Single Account Usage - Top Ten Users

[slurm-dev] Re: Issue with Job Size Factor of Multifactor plugin

2012-08-22 Thread Lipari, Don
Miguel, While you make a good argument to base the job size factor on the number of active nodes, this could confuse users. You may get users who wonder why their priority is too low and look to you to explain how their job priority was calculated. It is easier to say that the job size

[slurm-dev] RE: Configuration best practices

2012-08-10 Thread Lipari, Don
Janne, For your first issue, I would suggest adding the following to your slurm.conf file: PriorityWeightJobSize=some significant number PriorityFavorSmall=YES This gives a priority boost to smaller jobs. Also, I assume you have the sched/backfill plugin enabled to give smaller/shorter jobs

[slurm-dev] Re: Implementing soft limits and notifications with Slurm/Moab

2012-06-05 Thread Lipari, Don
Subject: [slurm-dev] Re: Implementing soft limits and notifications with Slurm/Moab On Mon, Jun 4, 2012 at 1:48 PM, Lipari, Don lipa...@llnl.gov wrote: What appears to be happening is that Moab is sending the canceljob message to SLURM when the job's time limit expires. It should email

[slurm-dev] Re: Implementing soft limits and notifications with Slurm/Moab

2012-06-04 Thread Lipari, Don
Reading only the description below and looking at the SLURM code, I don't think the fix should be on the SLURM side. What appears to be happening is that Moab is sending the canceljob message to SLURM when the job's time limit expires. It should email the user at that point, but hold off

[slurm-dev] Minor improvement to the schedule() function

2012-05-24 Thread Lipari, Don
Here's a performance enhancement to https://github.com/SchedMD/slurm/commit/158bfbbb978163dd55d5fb0d393965535a1c2979 against the 2.4 branch: Don diff --git a/src/slurmctld/job_scheduler.c b/src/slurmctld/job_scheduler.c index 8057053..80491b4 100644 --- a/src/slurmctld/job_scheduler.c +++

[slurm-dev] Re: Minor bug in display of Partition Name in sinfo

2012-05-14 Thread Lipari, Don
The new sinfo -o %R format option to convey the star-less partition name conflicts with the existing %R option to specify reason. They're both listed in the sinfo man page: %R Partition name, also see %P %R The reason a node is unavailable (down, drained, draining, fail or failing

[slurm-dev] Dependent Job Prioritization Bug

2012-05-07 Thread Lipari, Don
The symptom is that SLURM schedules lower priority jobs to run when higher priority, dependent jobs have their dependencies satisfied. This happens because dependent jobs still have a priority of 1 when the job queue is sorted in the schedule() function. The proposed fix forces jobs to have
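
The symptom can be reproduced in a toy model. This is not Slurm code: the Job class, its field names, and the priority values are all hypothetical, and the sketch only shows the ordering problem, not the proposed fix.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    stored_priority: int    # value the sort actually sees
    computed_priority: int  # value the job should have once it is eligible


def schedule_order(jobs):
    """Return job names in the order a highest-priority-first sort would start them."""
    ranked = sorted(jobs, key=lambda job: job.stored_priority, reverse=True)
    return [job.name for job in ranked]


# The dependent job's dependency is satisfied, but its stored priority is still 1.
queue = [
    Job("dependent", stored_priority=1, computed_priority=5000),
    Job("independent", stored_priority=1000, computed_priority=1000),
]
print(schedule_order(queue))
```

Because the sort only sees the stale value of 1, the independent job is placed first even though the dependent job's computed priority is higher.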