We have a number of openings here at Harvard FAS RC. If you are
interested please check out our employment page for details:
https://www.rc.fas.harvard.edu/about/employment/
-Paul Edmon-
I would build MPI using the pmi libraries and slurm with pmi support.
Then you can launch any version of MPI using the same srun command and
leveraging pmi.
-Paul Edmon-
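For context, a minimal sketch of what this looks like in slurm.conf, assuming slurm was built against a PMI2 library (the value is illustrative):

```
MpiDefault=pmi2
```

Any PMI-aware MPI can then be launched the same way, e.g. `srun --mpi=pmi2 ./mpi_app`, without caring which MPI implementation built the binary.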
On 07/21/2017 10:03 AM, Manuel Rodríguez Pascual wrote:
Hi all,
I'm trying to provide support to users demanding
idle resources, the lower priority partition uses them. If the high
priority partition needs the resources it will requeue the jobs from the
lower priority partition.
-Paul Edmon-
On 7/11/2017 4:29 PM, David Perel wrote:
Hello --
Say on a cluster researcher X has his own reserved partition, XP,
where
Yeah, we keep around a test cluster environment for that purpose to vet
slurm upgrades before we roll them on the production cluster.
Thus far no problems. However, paranoia is usually a good thing for
cases like this.
-Paul Edmon-
On 06/26/2017 07:30 AM, Ole Holm Nielsen wrote:
On 06
all the user processes on the
cluster?
Cheers,
Loris
Paul Edmon <ped...@cfa.harvard.edu> writes:
If you follow the guide on the Slurm website you shouldn't have many problems.
We've made it standard practice here to set all partitions to DOWN and suspend
all the jobs when we do up
are being done.
-Paul Edmon-
On 06/20/2017 09:37 AM, Nicholas McCollum wrote:
I'm about to update 15.08 to the latest SLURM in August and would
appreciate any notes you have on the process.
I'm especially interested in maintaining the DB as well as
associations. I'd also like to keep
That's a really good point. Pruning your DB can also help with this.
-Paul Edmon-
On 06/20/2017 09:19 AM, Loris Bennett wrote:
Hi Tim,
Tim Fora <tf...@riseup.net> writes:
Hi,
Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to
start. Logs show most of the time was
to
about 30 min or so.
Beyond that I imagine some more specific DB optimization tricks could be
done, but I'm not a DB admin so I won't venture to say.
-Paul Edmon-
On 06/20/2017 08:42 AM, Tim Fora wrote:
Hi,
Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to
start. Logs show
to walk it up to the level you need. You can also have the
purge archive to disk which can be handy if you want to maintain
historical info. The purge itself runs monthly at midnight on the 1st
of the month.
-Paul Edmon-
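A sketch of the slurmdbd.conf knobs involved; the option names are from the slurmdbd.conf man page, but the retention values and archive path are illustrative:

```
PurgeJobAfter=6months
PurgeStepAfter=6months
ArchiveJobs=yes
ArchiveSteps=yes
ArchiveDir=/var/spool/slurm/archive
```

With Archive* enabled, purged records are written to files under ArchiveDir before being removed from the database.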
On 06/08/2017 11:36 AM, Rohan Gadalkar wrote:
Re: [slurm-dev] Re
Yes. Use the DefaultTime option.
*DefaultTime*
Run time limit used for jobs that don't specify a value. If not set
then MaxTime will be used. Format is the same as for MaxTime.
https://slurm.schedmd.com/slurm.conf.html
-Paul Edmon-
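A sketch of a partition definition using it (partition name, node list, and times are illustrative):

```
PartitionName=general Nodes=node[01-10] DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP
```

Jobs submitted without `-t/--time` then get one hour instead of inheriting the partition maximum.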
On 05/09/2017 05:35 AM, Georg Hildebrand wrote
*-w*, *--nodelist*=
Request a specific list of hosts. The job will contain /all/ of
these hosts and possibly additional hosts as needed to satisfy
resource requirements. The list may be specified as a
comma-separated list of hosts, a range of hosts (host[1-5,7,...] for
example), or
sacct is where you want to look:
https://slurm.schedmd.com/sacct.html
-Paul Edmon-
On 5/4/17 9:09 AM, Mahmood Naderan wrote:
User accounting
Hi,
I read the accounting page https://slurm.schedmd.com/accounting.html
however since it is quite large, I didn't get my answer!
I want to know
the normal slurm commands to effectively do the same thing.
-Paul Edmon-
On 04/20/2017 06:23 AM, Parag Khuraswar wrote:
Hi All,
Does SLURM support below features:-
Job Schedulers
• Workload cum resource manager with policy-aware, resource-aware and
topology-aware scheduling
Sometimes restarting slurm on the node and the master can purge the jobs
as well.
-Paul Edmon-
On 04/10/2017 03:59 PM, Douglas Meyer wrote:
Set the node to drain if other jobs are running, then down, and then resume. Down will
kill and clear any jobs.
scontrol update nodename=<node> state=<drain|down|resume>
You should look at LLN (least loaded nodes):
https://slurm.schedmd.com/slurm.conf.html
That should do what you want.
-Paul Edmon-
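A sketch of the slurm.conf settings involved, assuming the cons_res select plugin (the parameter combination is illustrative):

```
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_LLN
```

CR_LLN schedules jobs to the least loaded nodes globally; it can also be enabled per partition with `LLN=yes` on the PartitionName line.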
On 03/16/2017 12:54 PM, kesim wrote:
Fwd: Scheduling jobs according to the CPU load
-- Forwarded message --
From: *kesim* <ketiw...@gmail.
I'm fairly certain that is coming in the 17.02 release.
According to this presentation it was supposed to be in 16.05:
https://slurm.schedmd.com/SLUG15/MultiCluster.pdf
but from what I recall of the SC BoF full functionality was not going to
be available until 17.02.
-Paul Edmon-
On 2/12/2017
I would probably look at qos's for this:
https://slurm.schedmd.com/qos.html
https://slurm.schedmd.com/resource_limits.html
You can attach them to partitions as well which can be handy.
You probably want to use things like MaxJobs, MaxWallDurationPerJob,
MaxTRESperJob.
-Paul Edmon-
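A sketch under assumed names: create the QOS with sacctmgr, then attach it to a partition in slurm.conf (the qos name "throttle" and the limit values are made up):

```
# sacctmgr add qos throttle
# sacctmgr modify qos throttle set MaxJobsPerUser=10 MaxWall=3-00:00:00 MaxTRESPerJob=cpu=128
PartitionName=general Nodes=node[01-10] QOS=throttle
```

Once attached via `QOS=`, the limits apply to every job run in that partition.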
On 2/7
Are there other jobs in the queue? If so then those higher priority
jobs are likely waiting for resources before they run. Also the lower
priority jobs probably can't fit in before the higher priority jobs run,
thus they won't get backfilled. That's at least my guess.
-Paul Edmon-
On 01
not do nicing for
suspends. You could theoretically modify the slurm code to do that as
the whole thing is open source and contribute it back to the community,
but the feature doesn't exist currently.
-Paul Edmon-
On 01/06/2017 07:27 AM, Sophana Kok wrote:
Re: [slurm-dev] Re: preemptive fair
We do the same thing except with a prolog script which dumps the job
info to a flat file so we can look up historical jobs. Sadly the
slurmdbd does not store this info, so you have to do it yourself.
-Paul Edmon-
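A minimal sketch of such a prolog in Python; the log path and the field names are assumptions (exactly which SLURM_* variables slurm exports to a prolog varies by version), not the poster's actual script:

```python
#!/usr/bin/env python3
# Sketch of a slurmctld prolog that appends one line of job info to a flat
# file, so historical jobs can be looked up later without the slurmdbd.
import os
import time

def record_job(log_path):
    """Append a timestamped line describing the current job to log_path."""
    jobid = os.environ.get("SLURM_JOB_ID", "unknown")
    user = os.environ.get("SLURM_JOB_USER", "unknown")
    partition = os.environ.get("SLURM_JOB_PARTITION", "unknown")
    line = "%d jobid=%s user=%s partition=%s\n" % (
        int(time.time()), jobid, user, partition)
    with open(log_path, "a") as f:
        f.write(line)
    return line

if __name__ == "__main__":
    # Illustrative path; a real prolog would write somewhere durable.
    record_job("job_history.log")
```

A grep over the flat file then substitutes for the database lookup.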
On 01/06/2017 04:31 AM, Loris Bennett wrote:
Sean McGrath <s
I know that a query like this was added to 16.x series. Namely
sacctmgr show runawayjobs
I don't think it is in 15.x series though.
-Paul Edmon-
On 01/04/2017 10:00 AM, Chris Rutledge wrote:
Hello all,
We recently discovered ghost jobs in the
slurm_acct_db._job_table that were the result
with the proper associations. We key off the users primary group id for
their default association. Thus the users account is created when they
submit their first job.
This should spread out the load, unless of course you have all your
users submitting simultaneously.
-Paul Edmon-
On 12/08/2016
Sadly all backfill parameters are global. You can tell slurm to
schedule per partition using the bf_max_job_part flag. That's about it
though.
-Paul Edmon-
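A sketch of the global setting, with an illustrative cap:

```
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_part=100
```

With bf_max_job_part set, the backfill scheduler considers at most that many jobs per partition each cycle, which keeps one busy partition from starving the others of backfill attention.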
On 11/29/2016 01:09 PM, Kumar, Amit wrote:
Dear SLURM,
Can I specify partition specific backfill parameters?
Thank you,
Amit
associations of all users they are
coordinator of, but can only see themselves when listing users.
http://slurm.schedmd.com/slurm.conf.html
That should do what you want.
-Paul Edmon-
On 11/01/2016 07:53 AM, Nathan Harper wrote:
Re: [slurm-dev] Restrict users to see only jobs
is.
-Paul Edmon-
On 09/30/2016 10:49 AM, Sergio Iserte wrote:
Re: [slurm-dev] Re: How to set the maximum priority to a Slurm job?
(from StackOverflow.com)
Thanks,
however, I would like to give the maximum priority.
Should I give a large number to the parameter?
Thank you.
2016-09-30 16:44 GMT+02
scontrol update jobid=jobid priority=blah
That works at least on a per job basis.
-Paul Edmon-
On 09/30/2016 10:34 AM, Sergio Iserte wrote:
How to set the maximum priority to a Slurm job? (from StackOverflow.com)
Hello,
this is a copy of my own StackOverflow post:
http://stackoverflow.com/q
for presenting data, you can also just take
the graphs it generates and embed them elsewhere.
-Paul Edmon-
On 09/27/2016 08:21 AM, John Hearns wrote:
Hello all. What are the thoughts on a Slurm ‘dashboard’. The purpose
being to display cluster status on a large screen monitor.
I rather
Excellent. Thanks for the info.
-Paul Edmon-
On 09/26/2016 04:03 PM, Paul Hargrove wrote:
Re: [slurm-dev] Passing MCA parameters via srun
Paul,
If the user always wants a specific set of MCA options, then they
should be placed in $HOME/.openmpi/mca-params.conf
Otherwise, one should use
(specifically OpenMPI 1.10.2) such as to
standard MPI options:
mpirun -mca btl self,openib
My question is how do I get slurm to pass these options through when it
invokes MPI.
-Paul Edmon-
For those using graphite and diamond, check this out as they may be useful.
https://github.com/fasrc/slurm-diamond-collector
-Paul Edmon-
would be a break down of memory and cpu charges.
-Paul Edmon-
for a specific user to see if there was a job that it charged
inordinately more for and why.
My first logical step was to look at sacct but I didn't see an entry
that simply listed RawUsage for the job in terms of TRES. Even better
would be a break down of memory and cpu charges.
-Paul
the big ones or have been running for a long time.
-Paul Edmon-
On 06/10/2016 02:36 AM, Steffen Grunewald wrote:
Good morning everyone,
is there a way to control the order in which jobs get preempted?
That is, for a queue with PreemptMode=REQUEUE, it would make sense to
preempt jobs first
have a pam module for
handling who gets access. That plus cgroups should take care of your
security problems.
Anyways, suffice it to say slurm can work for your environment as your
environment is fairly similar to ours.
-Paul Edmon-
On 05/24/2016 07:33 AM, Šimon Tóth wrote:
Architectural
much internal slurm communications.
Thanks.
-Paul Edmon-
and slurmds.
-Paul Edmon-
On 5/4/2016 10:10 PM, Paul Edmon wrote:
Specifically the upgrade instructions are here:
http://slurm.schedmd.com/quickstart_admin.html
Look at the bottom of the page. If you follow the instructions you
should be fine. Though I would recommend pausing the scheduler
do a major upgrade like this there is no
rolling back due to the database and job structure changes.
-Paul Edmon-
On 5/4/2016 6:23 PM, Lachlan Musicman wrote:
Re: [slurm-dev] Slurm Upgrade Instructions needed
I would backup /etc/slurm.
That's about it.
Cheers
L.
--
The most dangerous
be caused by a node or client that is in a bad
state, but I can't figure out how to trace it back to which one. Does
anyone have any tricks for tracing this sort of error back? I turned on
the Protocol Debug Flag but none of the additional debug statements lead
to the culprit.
-Paul Edmon-
We use this python script as a slurmctld prolog to save ours. Basically
it pulls all the info from the slurm hash files and copies to a separate
filesystem. We used to do it via mysql but the database got too large.
We then use the get_jobscript to actually query the job scripts.
-Paul Edmon
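A sketch of what such a script might do; the hash.N/job.<id>/script layout under StateSaveLocation is an assumption about the state directory, not the poster's actual code:

```python
#!/usr/bin/env python3
# Sketch: pull batch scripts out of slurmctld's StateSaveLocation hash
# directories and copy them to a separate filesystem for later lookup.
import os
import shutil

def archive_job_scripts(state_save, archive_dir):
    """Copy every job script found under state_save into archive_dir."""
    copied = []
    for hashdir in sorted(os.listdir(state_save)):
        if not hashdir.startswith("hash."):
            continue
        for jobdir in sorted(os.listdir(os.path.join(state_save, hashdir))):
            script = os.path.join(state_save, hashdir, jobdir, "script")
            if os.path.isfile(script):
                os.makedirs(archive_dir, exist_ok=True)
                dest = os.path.join(archive_dir, jobdir + ".sh")
                shutil.copy2(script, dest)
                copied.append(dest)
    return copied
```

A separate lookup tool (the get_jobscript mentioned above) can then just read the archived copies instead of querying mysql.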
on this point.
-Paul Edmon-
to set up a data repository for historic fairshare data.
Still it is on our docket to do.
sshare is a great way of seeing things too, though it can be a bit much
for your average user.
-Paul Edmon-
On 03/01/2016 04:52 AM, Chris Samuel wrote:
Hi Loris,
On Tue, 1 Mar 2016 12:29:12 AM Loris
Ah, okay. In this case I wanted to print out something to the user and
still succeed, as it was just a warning to the user that their script
had been modified.
Thanks.
-Paul Edmon-
On 01/29/2016 12:36 PM, je...@schedmd.com wrote:
It works for me, but only for the job_submit function
You can also set all the partitions to down. All pending jobs will
pend, new jobs can be submitted and existing jobs will finish. However
no new jobs will be scheduled.
-Paul Edmon-
On 01/20/2016 03:41 PM, Trey Dockendorf wrote:
Re: [slurm-dev] Pause all new submissions
Can likely use
That sounds about right. We have about the same order of magnitude and
our last major upgrade took about an hour for the DB to update itself.
-Paul Edmon-
On 1/12/2016 5:04 PM, Andrew E. Bruno wrote:
We're planning an upgrade from 14.11.10 to 15.08.6 and in the past the
slurmdbd upgrades
for the info.
-Paul Edmon-
but it is about 10 times smaller in terms of number of
nodes than our main one. I'm guessing this is a scaling problem.
Thoughts? Anyone else using MsgAggregation?
-Paul Edmon-
I will have to try that out. Thanks for the info.
-Paul Edmon-
On 12/08/2015 01:54 PM, Danny Auble wrote:
Hey Paul, Unless you have a very busy cluster (100s of jobs a second)
or are running very large jobs (>2000 nodes) I don't think this will
be very useful. But I would exp
at least my thinking, but it's less seamless to the users as they
will have to consciously monitor what is going on.
-Paul Edmon-
On 11/19/2015 10:50 AM, Daniel Letai wrote:
Can you elaborate a little? I'm not sure what kind of QoS will help,
nor how to implement one that will satisfy
Okay, it's working now. I just had to be more patient for slurmctld to
pick up the DB change of adding the QOS. Thanks for the help.
-Paul Edmon-
On 11/11/2015 11:54 AM, Bruce Roberts wrote:
I believe the method James describes only allows that one qos to be
used by jobs on the partition
Thanks for the insight. I will try it out on my end.
-Paul Edmon-
On 11/11/2015 3:31 AM, Dennis Mungai wrote:
Re: [slurm-dev] Re: Partition QoS
Thanks a lot, James. Helped a bunch ;-)
*From:*James Oguya [mailto:oguyaja...@gmail.com]
*Sent:* Wednesday, November 11, 2015 11:21 AM
*To:* slurm
=5
MaxSubmitJobsPerUser=5
MaxCPUsPerUser=128
Thanks for the info.
-Paul Edmon-
I did that but it didn't pick it up. I must need to reconfigure again
after I made the qos. I will have to try it again. Let you know how it
goes.
-Paul Edmon-
On 11/10/2015 5:40 PM, Douglas Jacobsen wrote:
Re: [slurm-dev] Partition QoS
Hi Paul,
I did this by creating the qos, e.g
that are in a problematic state will
have their Reason field filled with something.
-Paul Edmon-
On 10/27/2015 05:03 AM, Всеволод Никоноров wrote:
Hello,
if a node is in a MIXED state, is it possible that there are "good" and "bad" states
mixed? I mean, if a node is MIXED when it
For clarity, they should not need to talk to the compute nodes unless
you intend to do interactive work. You should only need to talk to the
master to submit jobs.
-Paul Edmon-
On 10/26/2015 9:45 PM, Paul Edmon wrote:
What we did was that we just opened up port 6817 between the two
service to submit as all you need is the ability of the
login node to talk to the master.
-Paul Edmon-
On 10/26/2015 8:02 PM, Liam Forbes wrote:
Hello,
I’m in the process of setting up SLURM 15.08.X for the first time. I’ve got a
head node and ten compute nodes working fine for serial and parallel
issue I have ever seen where this becomes a problem is in
fringe cases during major version upgrades, even then it is rare.
-Paul Edmon-
On 10/12/2015 3:57 AM, Robbert Eggermont wrote:
Hello,
Some modifications to the slurm.conf require me to restart the slurmd
daemons on all nodes
commit hook uses scontrol to run a
test on the conf before pushing. This typically catches most errors.
Not all though.
-Paul Edmon-
On 10/12/2015 12:41 PM, Antony Cleave wrote:
While this is true be very, very careful when restarting the slurmd on
the controller node.
it's quite easy to miss
Typically that means that the master is having problems communicating
with the nodes. I would check your networking, especially your ACLs.
-Paul Edmon-
On 10/08/2015 10:15 AM, Fany Pagés Díaz wrote:
I have a cluster with 3 nodes, and yesterday it was incorrectly turned off
by electrical
The only other thing I can think of it check that the node daemons are
up and okay.
-Paul Edmon-
On 10/08/2015 11:04 AM, Fany Pagés Díaz wrote:
My network looks fine, and the communication with the nodes too.
The problem is with slurm: the slurmd daemon has failed. I never made
any
effect to what you are asking for.
That said they aren't exactly analogous and I can see situations where
one would want to do this sort of thing so that no one person can
monopolize the queue.
-Paul Edmon-
On 10/6/2015 6:30 PM, Kumar, Amit wrote:
Just wanted to jump the wagon, this feature
and --cpu_bind=none
although none of these have prevented the Fortran programs from clumping
on at least a few cores. I would greatly appreciate your help in trying to
figure out how to prevent this behavior under the newly changed core
permissions."
Thanks for the help.
-Paul Edmon-
of the scheduler by bumping this up if we can to 1024, or at least test
to see what happens when we do.
Thanks.
-Paul Edmon-
So the new scripts are in the R2015b prerelease
http://www.mathworks.com/downloads/web_downloads
The ones we have are customized for our site. Mathworks recommends
contacting them for assistance with customizing them for your own.
-Paul Edmon-
On 06/10/2015 09:53 AM, Sean McGrath wrote
Huh. Let me ask around here and see if we can share what they gave us
with the community.
-Paul Edmon-
On 6/10/2015 9:04 AM, Hadrian Djohari wrote:
Re: [slurm-dev] Re: Slurm integration scripts with Matlab
Hi Paul and others,
We have contacted Mathworks and they only came back with the 2011
We've been working with Mathworks to get this sorted so that DCS and
Matlab can work with Slurm. I think if you contact them you should be
able to get the scripts they gave us.
-Paul Edmon-
On 6/9/2015 5:24 PM, Hadrian Djohari wrote:
Slurm integration scripts with Matlab
Hi,
We
happened to us a couple of times when trying to debug the
scheduler. So in reality the optimization did nothing in the first
place. So any concerns about speed should be alleviated.
-Paul Edmon-
On 04/24/2015 03:31 PM, Chris Read wrote:
Re: [slurm-dev] Re: Slurm versions 14.11.6 is now available
Correction: 50,000 jobs per day; the 1.5 million figure is per month. My
bad. Still no degradation in performance seen in our environment.
-Paul Edmon-
On 04/24/2015 03:31 PM, Chris Read wrote:
Re: [slurm-dev] Re: Slurm versions 14.11.6 is now available
On Fri, Apr 24, 2015 at 4:41 AM, Janne
I've been curious about this for a bit. What is the procedure for
rolling back a minor and major release of slurm in case something goes
wrong?
-Paul Edmon-
Do you have all the ports open between all the compute nodes as well?
Since slurm builds a tree to communicate all the nodes need to talk to
every other node on those ports and do so without a huge amount of
latency. You might want to try to up your timeouts.
-Paul Edmon-
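A sketch of the relevant slurm.conf entries; 6817/6818 are the defaults and the timeout value is illustrative:

```
SlurmctldPort=6817
SlurmdPort=6818
# Raise if nodes time out while the communication tree fans out:
MessageTimeout=30
```

Every compute node needs to reach every other node on the slurmd port for the tree to work, not just the master.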
On 03/31/2015
to
unexpected reboot. Is there a way to suppress this when the node is
rebooted by this flag? Obviously the reboot wasn't unexpected as slurm
was aware of it due to the flag.
-Paul Edmon-
at defer or max_rpc_cnt.
-Paul Edmon-
On 03/27/2015 07:32 AM, Mehdi Denou wrote:
Hi,
Using gdb you can retrieve which thread own the locks on the slurmctld
internal structures (and block all the others).
Then it will be easier to understand what is happening.
Le 27/03/2015 12:24, Stuart Rankin
Yeah, we use puppet and yum to manage our stack. Works pretty well and
scales nicely.
-Paul Edmon-
On 03/26/2015 11:46 AM, Jason Bacon wrote:
+1 for using package managers in general.
On our CentOS clusters, I do the munge and slurm installs using pkgsrc
(+ pkgsrc-wip).
http
All of them should be owned by munge. Furthermore, for security's sake
I would make them all only accessible to munge, at least the etc one.
-Paul Edmon-
On 03/25/2015 10:29 AM, Jeff Layton wrote:
I assume the same is true for /var/log/munge and /var/run/munge?
How about /etc/munge?
Thanks
Yea, that folder and the files inside need to be owned by munge.
-Paul Edmon-
On 03/25/2015 09:54 AM, Jeff Layton wrote:
Good morning,
Thanks for all of the advice in regard to slurm on NFS. I've
started on my slurm quest by installing munge but I'm
having some trouble. I'm not sure
I have tried building slurm 14.11.4 on CentOS7 but it never quite worked
right. I'm not sure if it has been vetted for RHEL7 yet. I didn't dig
too deeply though when I did build it as I just figured it wasn't ready
for RHEL7.
-Paul Edmon-
On 03/24/2015 10:32 AM, Fred Liu wrote:
Hi
then control the version via RPM installs.
-Paul Edmon-
On 3/24/2015 4:22 PM, Jason Bacon wrote:
I ran one of our CentOS clusters this way for about a year and found
it to be more trouble than it was worth.
I recently reconfigured it to run all system services from local disks
so that nodes
Yup, that's exactly what we do. We make sure to export it read only and
make sure that it is synced and hard mounted. Not much else to it.
-Paul Edmon-
On 03/24/2015 03:43 PM, Jeff Layton wrote:
Good afternoon,
I apologize for the newb question but I'm setting up slurm
for the first
Interesting. Yeah we use v3 here. Hadn't tried out v4, and good thing
we didn't then.
-Paul Edmon-
On 03/24/2015 04:05 PM, Uwe Sauter wrote:
And if you are planning on using cgroups, don't use NFSv4. There are problems
that cause the NFS client process to freeze (and
with that freeze
Oh and for the record we are running 14.11.4
-Paul Edmon-
On 03/10/2015 09:26 AM, Paul Edmon wrote:
So when I tried to do an archive dump I got the following error. What
does this mean?
[root@holy-slurm01 slurm]# sacctmgr -i archive dump
sacctmgr: error: slurmdbd: Getting response
Is it safe to try again?
-Paul Edmon-
On 03/06/2015 03:07 PM, Paul Edmon wrote:
Ah, okay, that was the command I was looking for. I wasn't sure how
to force it. Thanks.
-Paul Edmon-
On 03/06/2015 01:43 PM, Danny Auble wrote:
It looks like I might stand corrected though. It looks like you
as a feature in a future release?
-Paul Edmon-
On 03/10/2015 11:18 AM, Danny Auble wrote:
The fatal you received means your query lasted more than 15 minutes,
mysql deemed it hung and aborted. You can increase the timeout for
innodb_lock_wait_timeout in your my.cnf and try again
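A sketch of the my.cnf tuning this suggests; the timeout is raised past the 15-minute default behaviour described above, and the other values are illustrative:

```
[mysqld]
# Raise so long slurmdbd upgrade/archive queries are not aborted:
innodb_lock_wait_timeout=3600
# A larger buffer pool also helps big slurm_acct_db operations:
innodb_buffer_pool_size=4096M
```

Restart mysqld after changing these and re-run the archive dump.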
Okay, that's what I suspected. We set it to 6 months. So I guess then
the purge will happen on April 1st.
-Paul Edmon-
On 03/06/2015 12:33 PM, Danny Auble wrote:
Paul, do you have Purge* set up in the slurmdbd.conf? Archiving takes
place during the Purge process. If no Purge values
Ah, okay, that was the command I was looking for. I wasn't sure how to
force it. Thanks.
-Paul Edmon-
On 03/06/2015 01:43 PM, Danny Auble wrote:
It looks like I might stand corrected though. It looks like you will
have to wait for the month to go by before the purge starts
into pending state in hold,
meaning their priority is zero. Separate multiple exit code by a
comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state.
Restarted jobs will have the environment variable
*SLURM_RESTART_COUNT* set to the number of times the job has been
restarted.
-Paul
Basically the node cuts out due to hardware issues and the job is
requeued. I'm just trying to figure out why it sent them into a held
state as opposed to just simply requeueing as normal. Thoughts?
-Paul Edmon-
On 03/03/2015 12:11 PM, David Bigagli wrote:
There are no default values
We are definitely using the default for that one. So it should be
requeueing just fine.
-Paul Edmon-
On 03/03/2015 01:05 PM, Lipari, Don wrote:
It looks like the governing config parameter would be:
JobRequeue
This option controls what to do by default after a node failure
In this case the Node was in a funny state where it couldn't resolve
user id's. So right after the job tried to launch it failed and
requeued. We just let the scheduler do what it will when it lists
Node_fail.
-Paul Edmon-
On 03/03/2015 01:20 PM, David Bigagli wrote:
How do you set
Ah, good to know. I do prefer that behavior, just didn't expect it.
Thanks.
-Paul Edmon-
On 03/03/2015 02:00 PM, David Bigagli wrote:
Ah ok, the job failed to launch; in this case Slurm requeues the job in
held state. The previous behaviour was to terminate the job.
The reason
\
ThreadsPerCore=1 Feature=intel Gres=gpu:2
-Paul Edmon-
On 2/2/2015 1:09 PM, Bruce Roberts wrote:
Yes. All nodes and their resources need to be defined in the
slurm.conf on each node, not a different .conf on each node.
On 02/02/2015 10:04 AM, Slurm User wrote:
slurm.conf consistent
Yeah, that's good to get started for a conf, but then following the man
page is the next step.
-Paul Edmon-
On 2/2/2015 1:29 PM, Slurm User wrote:
Re: [slurm-dev] Re: slurm.conf consistent across all nodes
Ian, Paul
Thanks for your replies, that makes sense!!!
I was using
\
AllowGroups=important_people \
Nodes=blah
# JOB PREEMPTION
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
Since serial_requeue is the lowest priority it gets scheduled last and
if any jobs come in from the higher priority queue it requeues the lower
priority jobs.
-Paul Edmon
These parameters work well for a cluster of 50,000 cores, 57 queues, and
about 40,000 jobs per day. We are running 14.03.8
-Paul Edmon-
On 10/20/2014 02:19 PM, Mikael Johansson wrote:
Hello,
Yeah, I looked at that, and have now four partitions defined like this:
PartitionName=short
If memory serves I thought that 14.03 was supposed to support hooking
into FlexLM licensing. However, I can't find any documentation on
that. Was that pushed off to a future release?
-Paul Edmon-
I don't know if this has been done in the newer versions of slurm but it
would be good to have sacct be able to list both the JobID and the index
of the Job Array if it is a job array.
Thanks.
-Paul Edmon-
into QoS?
-Paul Edmon-
On 5/21/2014 6:52 PM, je...@schedmd.com wrote:
Quoting Paul Edmon ped...@cfa.harvard.edu:
We have just started using QoS here and I was curious about a few
features which would make our lives easier.
1. Spillover/overflow: Essentially if you use up one QoS you would
Well more like the naive ones namely:
sacctmgr delete job JobID
How do you set the endtime? Do you do that via scontrol?
-Paul Edmon-
On 04/21/2014 10:14 PM, Danny Auble wrote:
What are the obvious ones?
I would expect setting the end time to the start time and state to 4
(I think
Thanks. Sorry forgot about that thread.
I'm wagering that the jobs got orphaned due to timing out. Essentially
they actually launched but they didn't successfully update the database
because it was busy.
-Paul Edmon-
On 04/22/2014 12:15 PM, Danny Auble wrote:
Paul I think this was covered
sbatch die with
an error rather than have srun just hang up?
Thanks for any insight.
-Paul Edmon-
Is there a way to delete a JobID and its relevant data from the slurm
database? I have a user that I want to remove but there is a job which
slurm thinks is not complete that is preventing me. I want slurm to
just remove that job data as it shouldn't impact anything.
-Paul Edmon-
is looping over mpiruns like this:
do i=1,1000
mpirun -np 64 ./executable
enddo
Each run lasts about 5 minutes. If one of the mpiruns fails to launch
the entire thing hangs. It would be better if srun kept trying instead
of just failing.
-Paul Edmon-
On 4/16/2014 11:16 PM, Paul Edmon