Allow me to direct you to a doc I created that could help answer your question:
https://hpc.llnl.gov/banks-jobs/running-jobs/slurm-user-manual
And for a more general intro, see:
https://hpc.llnl.gov/banks-jobs/running-jobs/batch-system-primer
Don
On 10/2/17, 6:16 AM, "Kynn Jones"
Loris,
Taking Slurm v2.3 as a data point, the sbatch behavior seems consistent with
what you report for v16.05.
I deliberately requested fewer nodes than I specified with the --nodelist
option (testing the behavior you describe below), and sbatch complained:
$ sbatch
sxterm is a script LLNL created to simplify requesting a job that launches an
xterm window. While one can replicate everything sxterm does with sbatch plus a
job script, it has become so popular with our users that I included it among
the official Slurm commands listed in the manual you cite.
Don
A recent post on Slurm training prompted me to mention that I created for our
users the following guides to using Slurm:
https://hpc.llnl.gov/banks-jobs/running-jobs
The Batch System Primer introduces new users to HPC batch scheduling concepts.
The Slurm User Manual and Quick Start Guide
You might consider this fix:
https://github.com/SchedMD/slurm/commit/79c9a49913625a0f790f29897f0f099156f94268
It was committed to the slurm-15.08 branch yesterday.
Don Lipari
> -----Original Message-----
> From: Ryan Novosielski [mailto:novos...@rutgers.edu]
> Sent: Tuesday, May 10, 2016 8:36
The original multi-factor plugin generated fair-share factors ranging between
0.0 and 1.0. Under this formula, 0.5 was the factor a user/account would see
if their usage was commensurate with their shares. The multi-factor2 plugin's
fair-share factors range between 0.0 and 100.0, with 1.0
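The classic formula implied here can be sketched in Python. This is a simplified illustration based only on the description above; the function name and the normalization are assumptions, not Slurm source:

```python
# Simplified sketch of the original multi-factor fair-share formula:
#   F = 2 ** (-usage / shares)
# With usage commensurate with shares, F = 0.5; with no usage, F = 1.0.
# (Function name and normalization are illustrative, not from Slurm source.)
def fairshare_factor(effective_usage: float, normalized_shares: float) -> float:
    if normalized_shares <= 0.0:
        return 0.0  # no shares -> lowest possible factor
    return 2.0 ** (-effective_usage / normalized_shares)

print(fairshare_factor(0.25, 0.25))  # usage equal to shares
print(fairshare_factor(0.0, 0.25))   # no usage at all
```

The decay shape is the point: each doubling of usage relative to shares halves the factor.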
Don
-----Original Message-----
From: Lipari, Don [mailto:lipa...@llnl.gov]
Sent: Wednesday, May 13, 2015 7:56 AM
To: slurm-dev
Subject: [slurm-dev] RE: multifactor2 priority weighting
The original multi-factor plugin generated fair-share factors ranging
between 0.0 and 1.0. Under
-----Original Message-----
From: David Bigagli [mailto:da...@schedmd.com]
Sent: Thursday, March 05, 2015 10:49 AM
To: slurm-dev
Subject: [slurm-dev] Re: possible bug in srun --unbuffered option
Hi,
can we see what script is broken? You may want to review #991 to get the full
-----Original Message-----
From: Bill Wichser [mailto:b...@princeton.edu]
Sent: Wednesday, January 21, 2015 8:23 AM
To: slurm-dev
Subject: [slurm-dev] RE: fairshare allocations
On 01/21/2015 11:07 AM, Lipari, Don wrote:
-----Original Message-----
From: Bill Wichser [mailto:b
-----Original Message-----
From: Bill Wichser [mailto:b...@princeton.edu]
Sent: Wednesday, January 21, 2015 5:20 AM
To: slurm-dev
Subject: [slurm-dev] fairshare allocations
The algorithm I use is fairtree under 14.11, but I believe my question applies
to any method.
As a
Slurm has a multi-cluster (grid) capability when Slurm's database is installed
and active. This allows one cluster to submit jobs to, and invoke status
commands on, another cluster. Jobs that specify multiple candidate clusters
will be submitted to the cluster that is projected to
Mike,
runjob is the task launcher on BG/Q. Take a look at the --envs option
documented in the runjob man page.
Then look at srun's man page for the --launcher-opts option.
I believe the answer may involve combining the two. It will be something like:
srun [...] --launcher-opts='--envs
Bill,
As I understand the dilemma you presented, you want to maximize the utilization
of node resources when running with Slurm configured for
SelectType=select/cons_res. To do this, you would like to nudge users into
requesting only the amount of memory they will need for their jobs. The
First off, the assumption is that you have the same slurm.conf file across
your management node and all the compute nodes.
The node definitions listed in your slurm.conf file become the gold standard.
For example:
NodeName=rzmerl[1-152] NodeAddr=erzmerl[1-152] Sockets=2 CoresPerSocket=8
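To flesh out the example, a minimal hedged slurm.conf fragment; the node names, counts, and memory figures below are made up for illustration, not taken from the message above:

```
# Illustrative slurm.conf node/partition definitions; all names and
# hardware figures here are hypothetical.
NodeName=tux[1-4] Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64000 State=UNKNOWN
PartitionName=batch Nodes=tux[1-4] Default=YES MaxTime=24:00:00 State=UP
```

Because these definitions are the gold standard, slurmctld will typically drain a node that registers with fewer resources than configured here.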
-----Original Message-----
From: Schmidtmann, Carl [mailto:carl.schmidtm...@rochester.edu]
Sent: Monday, April 07, 2014 9:47 AM
To: slurm-dev
Subject: [slurm-dev] rpmbuild params
I am trying to build the rpms for slurm using rpmbuild. I followed the
instructions in the install guide
Gary,
The sacct command retrieves job and job step records from the slurmdb and
reports the statistics for the requested job(s).
The sshare command provides the basis for the fair-scheduling component of the
multi-factor plugin. sshare lists the two components (shares and usage) which
are
for non-zero values was never implemented.
Don
-
Gary Skouson
-----Original Message-----
From: Lipari, Don [mailto:lipa...@llnl.gov]
Sent: Monday, March 17, 2014 8:47 AM
To: slurm-dev
Subject: [slurm-dev] RE: sshare and sacct
Gary,
The sacct command retrieves job and job step
with very old start times (or with excessive elapsed times) that are still running.
sacct -a -X -S 1/1/13 -L -sR -o
jobid,user,account,cluster,partition,nnodes,state,start,end,elapsed,exitcode
Don
-----Original Message-----
From: Lipari, Don
Sent: Thursday, May 02, 2013 8:55 AM
To: slurm-dev
There are two ways to confirm your PriorityWeightAge setting has been read in:
$ scontrol show conf | grep PriorityWeightAge
or
$ sprio -w
Once you confirm that the system has recognized the values you set in your
slurm.conf file, if you still see the problem, I suggest you turn up debugging
Vsevolod,
If one of the remaining files you removed (to avoid a conflict with the newer
SLURM package) was job_state, located in your configured StateSaveLocation,
then you deleted Slurm's record of the running job ID. It will reset to your
configured FirstJobId after that.
There is
Alan,
Without more info, it is hard to tell. But the first thing I would consider is
that the lower priority jobs are being backfilled. See what you've defined in
your slurm.conf:
SchedulerType=sched/backfill
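For reference, a hedged sketch of the relevant slurm.conf lines; the SchedulerParameters values are illustrative examples, not recommendations:

```
# Backfill scheduler: lower-priority jobs may start early, provided they
# do not delay the expected start time of higher-priority jobs.
SchedulerType=sched/backfill
# Illustrative tuning knobs (values are examples only):
SchedulerParameters=bf_interval=30,bf_window=1440,bf_max_job_test=500
```

If backfill is active, squeue output for a backfilled job will show it running ahead of higher-priority pending jobs, which is exactly the symptom described.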
Don
-----Original Message-----
From: Alan V. Cowles
Re-run your command and add the WithAssoc option. Then see whether there is a
default account present for the cluster of interest.
There probably is not, and you will have to add it:
sacctmgr modify user k202066 cluster=cluster_of_interest set DefaultAccount=intern
While default accounts used
Have a look at http://bugs.schedmd.com/show_bug.cgi?id=247 and the patches that
resulted.
Don
-----Original Message-----
From: Luis Alves [mailto:luis.al...@csc.fi]
Sent: Thursday, May 02, 2013 5:07 AM
To: slurm-dev
Subject: [slurm-dev] error: We have more allocated time than is
Dustin,
Try adding the cluster=your_cluster option to your sreport command. If it
succeeds, there could be a problem with the ClusterName setting in your
slurm.conf file on the head node.
If sreport still fails, look into your slurmdbd.log and see whether there are
any entries added when you
This could be what you remember:
https://github.com/SchedMD/slurm/blob/master/contribs/web_apps/chart_stats.cgi
It reads the parameters of the user’s selection, invokes sreport to retrieve
the data, and displays the info in charts.
I’m afraid I have not been actively maintaining this utility,
Matteo,
I suspect something happened that prevented the state change for your jobs
below from being propagated to the database. You are going to have to modify
them manually in mysql. Change the state for these jobs to 3 (JOB_COMPLETE).
I would also update the time_end field to something
Matt,
To allow the jobs from each owner (11, 12, and 13) to be prioritized based on
their usage, do not include the fairshare=parent setting.
Otherwise, your settings should result in usage that reflects the shares you
have configured, assuming all three owners have been submitting enough jobs
-----Original Message-----
From: Moe Jette [mailto:je...@schedmd.com]
Sent: Monday, November 05, 2012 12:24 PM
To: slurm-dev
Subject: [slurm-dev] Re: Access to account in lua job submit plugin
Hi Kent,
Responses in-line below.
Quoting Kent Engström k...@nsc.liu.se:
[...]
-----Original Message-----
From: Loris Bennett [mailto:loris.benn...@fu-berlin.de]
Sent: Thursday, October 18, 2012 5:33 AM
To: slurm-dev
Subject: [slurm-dev] Re: slurmdbd: more time than is possible
Loris Bennett loris.benn...@fu-berlin.de writes:
What additional information, if
Miguel,
Try holding and then releasing the job.
Don
From: Miguel Méndez [mailto:miguel.men...@uam.es]
Sent: Friday, September 21, 2012 4:30 AM
To: slurm-dev
Subject: [slurm-dev] Reset job's priority to automatic
Hi,
I set some jobs' priorities manually to some value. After that, those jobs'
Yuri,
Enable and configure the jobacct_gather plugin. Then try out the sstat command.
Don
-----Original Message-----
From: Yuri D'Elia [mailto:wav...@thregr.org]
Sent: Tuesday, September 11, 2012 7:39 AM
To: slurm-dev
Subject: [slurm-dev] Monitoring job resources
Is it possible to
Take a look at slurm/contribs/web_apps/chart_stats.cgi. It is something I
threw together to graph accounting statistics. I have not updated it in quite
a while, but it should give you some ideas.
The tool generates reports of:
Usage - Single User
Usage - Single Account
Usage - Top Ten Users
Miguel,
While you make a good argument for basing the job size factor on the number of
active nodes, this could confuse users. You may get users who wonder why their
priority is so low and look to you to explain how their job priority was
calculated. It is easier to say that the job size
Janne,
For your first issue, I would suggest adding the following to your slurm.conf
file:
PriorityWeightJobSize=some significant number
PriorityFavorSmall=YES
This gives a priority boost to smaller jobs.
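The effect can be sketched as follows. This is a simplified model of the job-size factor under PriorityFavorSmall, not the exact Slurm computation:

```python
# Simplified model of the job-size priority factor.
# With PriorityFavorSmall=YES the factor grows as the job shrinks;
# the real Slurm computation differs in detail.
def job_size_factor(job_nodes: int, total_nodes: int, favor_small: bool = True) -> float:
    fraction = job_nodes / total_nodes
    return 1.0 - fraction if favor_small else fraction

# A 1-node job on a 100-node cluster gets nearly the full factor:
print(job_size_factor(1, 100))
```

The resulting factor is then scaled by PriorityWeightJobSize, so that weight controls how strongly small jobs are favored relative to the other priority components.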
Also, I assume you have the sched/backfill plugin enabled to give
smaller/shorter jobs
Subject: [slurm-dev] Re: Implementing soft limits and notifications
with Slurm/Moab
On Mon, Jun 4, 2012 at 1:48 PM, Lipari, Don lipa...@llnl.gov wrote:
What appears to be happening is that Moab is sending the canceljob
message to SLURM when the job's time limit expires. It should email
Reading only the description below and looking at the SLURM code, I don't think
the fix should be on the SLURM side.
What appears to be happening is that Moab is sending the canceljob message to
SLURM when the job's time limit expires. It should email the user at that
point, but hold off
Here's a performance enhancement to
https://github.com/SchedMD/slurm/commit/158bfbbb978163dd55d5fb0d393965535a1c2979
against the 2.4 branch:
Don
diff --git a/src/slurmctld/job_scheduler.c b/src/slurmctld/job_scheduler.c
index 8057053..80491b4 100644
--- a/src/slurmctld/job_scheduler.c
+++
The new sinfo -o %R format option to convey the star-less partition name
conflicts with the existing %R option to specify reason.
They're both listed in the sinfo man page:
%R Partition name, also see %P
%R The reason a node is unavailable (down, drained, draining, fail or
failing
The symptom is that SLURM schedules lower priority jobs to run when higher
priority, dependent jobs have their dependencies satisfied. This happens
because dependent jobs still have a priority of 1 when the job queue is sorted
in the schedule() function. The proposed fix forces jobs to have