Sure, I can hunt that info down. So what would be the command to remove
the job from the DB? I tried the obvious ones I could think of, but with
no effect.
-Paul Edmon-
On 4/21/2014 4:31 PM, Danny Auble wrote:
Paul, you should be able to remove the job with no issue. The real
question
So if you are running in defer mode for the scheduler, what determines
the frequency of the main loop for the scheduler? Can this be changed?
-Paul Edmon-
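For reference, the main loop's cadence is controlled through
SchedulerParameters in slurm.conf; a minimal sketch, assuming your
version supports the sched_interval option (the 30-second value is only
illustrative):

SchedulerParameters=defer,sched_interval=30

A scontrol reconfigure should pick the new value up; if your version
also has max_sched_time, that caps how long a single pass of the main
scheduler may run.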
Thanks. That's helpful.
-Paul Edmon-
On 04/11/2014 03:00 PM, je...@schedmd.com wrote:
In defer mode, the main scheduling loop runs once per minute, but most
of your jobs will typically be scheduled by the backfill scheduler
instead (although that depends upon your configuration
One more question: what controls the maximum runtime for the main scheduler?
-Paul Edmon-
On 04/11/2014 03:02 PM, Paul Edmon wrote:
Thanks. That's helpful.
-Paul Edmon-
On 04/11/2014 03:00 PM, je...@schedmd.com wrote:
In defer mode, the main scheduling loop runs once per minute
, and
allow it to breathe and catch up.
Is there a way to automate this?
-Paul Edmon-
updated. I did a scontrol
reconfigure but that didn't help. Only a full slurm restart fixed it.
So is this a known feature? Is there a way to force it to update its
cache without a full restart? I would hate to have to restart every
time I did this.
-Paul Edmon-
The DB itself is on the same machine as the CTLD, so it should be blocking.
I will amp up the debug and see what I find.
-Paul Edmon-
On 3/20/2014 4:26 PM, Danny Auble wrote:
Paul, I would check your slurmdbd log about not being able to talk to
your slurmctld on the cluster.
What you
Sorry, I meant it shouldn't be blocking.
-Paul Edmon-
On 3/20/2014 9:41 PM, Paul Edmon wrote:
The DB itself is on the same machine as the CTLD, so it should be
blocking.
I will amp up the debug and see what I find.
-Paul Edmon-
On 3/20/2014 4:26 PM, Danny Auble wrote:
Paul I would check
. This doesn't happen on reconfigures, only on restarts.
Is there a way to prevent it from doing this query or at least make this
query nonblocking for slurm? Thanks.
-Paul Edmon-
Okay, that would be great.
-Paul Edmon-
On 02/21/2014 02:28 PM, Danny Auble wrote:
At the moment, no. Perhaps it could be looked at for future versions
though.
On 02/21/14 11:20, Paul Edmon wrote:
Whenever we do a:
service slurm restart
on our master it ends up initiating a massive
:
User=user1, Account=account1
I want to set it to:
User=user1, Account=account2
How would I do that? I tried the obvious and naive methods but no such
luck.
-Paul Edmon-
So could you instead add them to a different account and then
remove the old account association?
-Paul Edmon-
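For what it's worth, that add-then-remove sequence would look roughly
like this with sacctmgr (using the placeholder user and account names
from above; treat it as a sketch, not a tested recipe):

sacctmgr add user user1 account=account2
sacctmgr delete user user1 where account=account1

Historical usage stays tied to the old association, which is presumably
why changing the account in place is not supported.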
On 02/20/2014 12:14 PM, Danny Auble wrote:
Sorry Paul, there is no way to change a user's account. It doesn't
work well in accounting.
You would have to add a new
get any cross talk. Can this be done? It would be incredibly
helpful for our environment.
-Paul Edmon-
thousands of jobs in the queue. I think we would take the
hit for having to spin through all the partitions in order to make sure
every partition is treated properly.
-Paul Edmon-
On 02/10/2014 11:12 AM, Alejandro Lucero Palau wrote:
Hi Paul,
What's the max cycle latency for main scheduling
Is there an option for limiting the number of jobs a user can have
running on a given partition? We have an interactive queue that I want
to limit to 5 jobs per user.
-Paul Edmon-
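One way to get that effect, assuming a Slurm release new enough to let
you attach a QOS to a partition (the QOS name, partition definition, and
node list below are made up for illustration):

sacctmgr add qos interactive_limit
sacctmgr modify qos interactive_limit set MaxJobsPerUser=5

and in slurm.conf:

PartitionName=interactive Nodes=node[01-04] MaxTime=08:00:00 State=UP QOS=interactive_limit

Older releases would need something like the per-partition limit patch
mentioned elsewhere in this thread.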
Thanks. We will check it out.
-Paul Edmon-
On 02/10/2014 02:19 PM, je...@schedmd.com wrote:
Hi Paul,
This should achieve the results that you are looking for using a new
configuration parameter. The attached patch, including documentation
changes, is built against Slurm version 2.6. You
We are on the latest release, so that shouldn't be the issue.
-Paul Edmon-
On 01/27/2014 07:30 AM, Moe Jette wrote:
There were changes in Slurm version 2.6 with respect to lock handling
which may affect this. If you are using an earlier version of Slurm,
that would be a reason to upgrade
So I've found that if someone submits a ton of jobs that have a very
short runtime, Slurm tends to thrash as jobs are launching and exiting
pretty much constantly. Is there an easy way to enforce a minimum runtime?
-Paul Edmon-
stuff. It should
be helpful.
-Paul Edmon-
On 1/26/2014 7:21 PM, Moe Jette wrote:
A great deal depends upon your hardware and configuration. Slurm
should be able to handle a few hundred jobs per second when tuned for
high throughput as described here:
http://slurm.schedmd.com
into the
next version of SLURM as it would be handy in our environment.
-Paul Edmon-
would be good.
-Paul Edmon-
On 1/15/2014 5:51 PM, Christopher Samuel wrote:
Hi Paul,
On 16/01/14 03:29, Paul Edmon wrote:
Thoughts? If this doesn't exist it may be a good thing to add into
the next version of SLURM as it would be handy in our
For reference, we are coming from an LSF environment where our users are
used to:
bsub -q test_queue my_program
Where my_program could be anything from simple bash commands to an
actual program to run.
-Paul Edmon-
On 12/13/2013 04:59 PM, Silva, Luis wrote:
sbatch as a one liner
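The rough Slurm equivalent of that bsub one-liner would be something
like the following (partition name taken from the example above; --wrap
just turns the command string into a throwaway batch script):

sbatch -p test_queue --wrap="my_program"

or, for an interactive-style run, srun -p test_queue my_program.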
Include the ignore_jobs flag. That will force the reservation.
-Paul Edmon-
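For example, a hedged sketch of such a reservation (the name, node
list, and duration are placeholders):

scontrol create reservation reservationname=repair starttime=now duration=240 users=root flags=maint,ignore_jobs nodes=node[01-04]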
On 11/12/2013 12:11 PM, Jacqueline Scoggins wrote:
Admin reservation on busy nodes
Running slurm 2.5.7 and tried to reserve the nodes of the cluster
because of hardware issues that needed to be repaired. Some
remains to be rebooted somewhere. Is there a way to access it?
-Paul Edmon-
What about those that have yet to be hit with maint?
-Paul Edmon-
On 10/31/2013 5:37 PM, Moe Jette wrote:
sinfo will show the node state as maint. sinfo can filter on that
node state too:
sinfo -N --state=maint
Quoting Paul Edmon ped...@cfa.harvard.edu:
So we recently used scontrol
Thanks.
-Paul Edmon-
On 10/31/2013 9:53 PM, Morris Jette wrote:
Maint means reboot pending. Cleared after reboot.
Paul Edmon ped...@cfa.harvard.edu wrote:
What about those that have yet to be hit with maint?
-Paul Edmon-
On 10/31/2013 5:37 PM, Moe Jette wrote:
sinfo
Thanks. That should work nicely.
-Paul Edmon-
On 10/28/2013 8:37 PM, Moe Jette wrote:
Paul,
I'm working on getting that into the next release v2.6.4, which should
be available within days. To wait for all job array elements to
complete, you would just specify the primary job id. You can
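For instance (the job id and script name below are hypothetical), a
follow-up job that waits on every element of array job 12345 would be
submitted as:

sbatch --dependency=afterany:12345 postprocess.sh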
diagnostic
checks like this should respond in a timely manner, even if the data
that is contained there is a little out of date.
-Paul Edmon-
On 10/17/2013 11:57 AM, Moe Jette wrote:
There is no faster way to get job counts, but you might find the sdiag
command helpful.
Quoting Damien
True. I was just contemplating ways to make it more responsive.
Multiple copies of the data would do that; I just wasn't sure whether
keeping that in sync would be a headache.
-Paul Edmon-
On 10/17/2013 1:01 PM, Moe Jette wrote:
Sending old data quickly seems very dangerous, especially
is 2 seconds. What is your MessageTimeout set to?
Danny
On 10/17/13 10:21, Paul Edmon wrote:
True. I was just contemplating ways to make it more responsive.
Multiple copies of the data would do that; I just wasn't sure whether
keeping that in sync would be a headache.
-Paul Edmon-
On 10
jobs such as squeue and scancel are not
working. So I can't tell who sent in this many jobs.
-Paul Edmon-
That's good to hear. Is there an option to do it per user? I didn't
see one in the slurm.conf. I may have missed it.
-Paul Edmon-
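A per-user ceiling can also be set on the accounting side; a minimal
sketch, assuming limits enforcement is enabled and using an illustrative
user name and value:

sacctmgr modify user user1 set MaxSubmitJobs=10000

That caps how many jobs the user can have queued at once under each of
their associations.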
On 9/29/2013 5:42 PM, Moe Jette wrote:
Quoting Paul Edmon ped...@cfa.harvard.edu:
Yeah, that's why we set the 500,000 job limit. Though I didn't
collapse the ridiculous number of queues we have, so that option is
right out. We also don't want to start splitting our environment
between multiple separate masters for each queue as we want centralized
accounting and fairshare, plus we want one system to rule them all.
-Paul Edmon-
everything. We have since reduced the backfill
frequency, which has helped greatly.
-Paul Edmon-
On 09/27/2013 12:06 PM, Moe Jette wrote:
Responses in-line below.
You should also be aware that Don Lipari of LLNL presented a tutorial
about Slurm's scheduler at the 2012 Slurm User Group meeting
2.6.1
-Paul Edmon-
On 09/27/2013 12:15 PM, Danny Auble wrote:
Paul, you are running 2.6, correct?
On 09/27/2013 09:11 AM, Paul Edmon wrote:
Thanks Moe. I will try a few of those things and look at that
presentation. I did try bf_continue at one point, but that caused our
entire system
stride and array number you can go to? As you can see, we have users
that do some crazy things.
-Paul Edmon-
Ah, that would explain it:
[root@itc011 ~]# scontrol show config | grep MaxArraySize
MaxArraySize= 1001
Will have to boost that one. Is there any limit to the stride? I'm
guessing no, but I just want to check.
-Paul Edmon-
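For the record, raising it is a one-line slurm.conf change (the value
below is just an example), followed by updating the controller:

MaxArraySize=10001

Note the maximum array task index has to stay one less than
MaxArraySize.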
On 9/9/2013 11:41 AM, Moe Jette wrote:
These both
Thanks.
-Paul Edmon-
On 9/9/2013 11:49 AM, Moe Jette wrote:
There is no stride limit.
Quoting Paul Edmon ped...@cfa.harvard.edu:
Ah, that would explain it:
[root@itc011 ~]# scontrol show config | grep MaxArraySize
MaxArraySize= 1001
Will have to boost that one
if it can resolve all the hosts as well and fail if it can't.
Thanks.
-Paul Edmon-
That would be great. I can send in my version as well. It has a few
additional features such as sorting by partition and ordering the
pending queue by job priority.
-Paul Edmon-
On 8/11/2013 5:49 PM, Danny Auble wrote:
Karl, if you send us a copy we may be able to put it in the contribs
insight would be appreciated.
-Paul Edmon-
On 08/06/2013 09:30 PM, Paul Edmon wrote:
So our SLURM 2.5.7 install went down this evening with a massive bout of:
[2013-08-06T17:12:20-04:00] sched: Allocate JobId=113950
NodeList=holy2b09103 #CPUs=4
[2013-08-06T17:12:20-04:00] sched: Allocate JobId
. Is
there something I am missing? Some way of making it more robust? We've
tried the HA failover, but that didn't work when this happened and
caused other problems when the install went split-brained.
-Paul Edmon-
Haven't seen a response on this, so I thought I would re-ping. Has anyone
ever seen the below error?
-Paul Edmon-
On 07/04/2013 01:36 PM, Paul Edmon wrote:
So we are running slurm-2.5.7 on our cluster with a master and a
backup. This morning our primary suffered from this error:
[2013-07
files on the shared
filesystem was supposed to prevent this, as all the currently running jobs
are written there. Or did I misunderstand, and those files are only
updated when the master goes down? I would like to understand why we
lost jobs so we can prevent it from happening again.
-Paul Edmon-
list first before
putting a nail in it. From my look at the documentation, I don't see
any way to do this other than what I stated above.
-Paul Edmon-
Thanks for the input. Can GrpJobs be modified from the user side?
-Paul Edmon-
On 06/19/2013 12:15 PM, Ryan Cox wrote:
Paul,
We were discussing this yesterday due to a user not limiting the number
of jobs hammering our storage. A QOS with a GrpJobs limit sounds like
the best approach
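As a concrete sketch of that approach (the QOS name, user name, and
limit are illustrative; the QOS also has to actually be used by the
jobs, e.g. by making it the user's default):

sacctmgr add qos storage_limited
sacctmgr modify qos storage_limited set GrpJobs=50
sacctmgr modify user user1 set QOS+=storage_limited
sacctmgr modify user user1 set DefaultQOS=storage_limited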
Okay, thanks.
-Paul Edmon-
On 06/19/2013 04:32 PM, Ryan Cox wrote:
Not that I'm aware of. I don't know of a way to give users control over
a QOS like you can do with account coordinators for accounts.
Ryan
On 06/19/2013 10:55 AM, Paul Edmon wrote:
Thanks for the input. Can GrpJobs
.
-Paul Edmon-
On 06/12/2013 12:30 PM, Alan V. Cowles wrote:
Hey Guys,
I've seen a few references to the slurmctld as a multithreaded process
but it doesn't seem that way.
We had a user submit 18000 jobs to our cluster (512 slots) and it shows
512 fully loaded, shows those jobs running, shows
. However, I've done this before and hadn't seen this issue
crop up. Is there a way to remove this job from sacct? scancel does not
work on it.
-Paul Edmon-
Do you mean the node that hosts the slurmdb? Or the node that runs
slurmctld? Or are you speaking of the nodes on which that job ran?
-Paul Edmon-
On 06/05/2013 10:45 AM, Sefa Arslan wrote:
if possible, rebooting the worker node is the fastest solution.
On 06/05/2013 05:10 PM, Paul
Hmm, maybe it's the ThreadsPerCore? Perhaps it thinks there are half as
many cores as there really are due to the ThreadsPerCore. Thus if you do
the --mem-per-cpu it will only give you half, as it only counts cores,
not threads*cores?
-Paul Edmon-
On 05/23/2013 01:31 PM, S. Aravindan wrote:
Thanks.
-Paul Edmon-
On 4/29/2013 10:34 AM, Carles Fenoy wrote:
Re: [slurm-dev] Re: JobHeldAdmin
Dear Paul,
You should be able to release the job with the command:
scontrol release JOBID
Regards,
Carles Fenoy
Barcelona Supercomputing Center
On Mon, Apr 29, 2013 at 3:49 PM, Carl Schmidtmann
Available to this Queue
1328 of 1992 Cores Used (66.67%)
247 of 250 Nodes Used (98.80%), 0 Nodes Closed by Admin, 1 Nodes
Unavailable
Other Queues Which Submit To The Hosts For This Queue: priority
PENDING JOBS-
No matching jobs found
-Paul Edmon-
Thanks. This should be helpful.
-Paul Edmon-
On 4/17/2013 5:29 PM, Karl Schulz wrote:
Paul,
I too have done showq variants for LSF in the past and have ported a version
to use the C API in Slurm (you do need slurm-devel installed to build,
though). I sent you a dist tarball offline
Excellent, thanks for letting me know. Just for my own information: I
would only need to restart Slurm for this change to take effect, right,
and not restart the full machine?
-Paul Edmon-
On 3/3/2013 10:48 PM, Andy Riebs wrote:
Paul,
Assuming that you are using a recent (2.5.x or later) version
a NodeAddr
list. I would expect it to refuse the new conf and spit out an error
message.
-Paul Edmon-
On 01/30/2013 01:03 PM, David Bigagli wrote:
Re: [slurm-dev] Adding new nodes to slurm.conf
Do you have the slurmctld log when the master failed? It should be
enough to add the hostname
So during that period the master would cease managing everything and you
wouldn't be able to submit? Are those the only dangers for shutting
down the master?
We tend to be in an environment where things are in production but also
in flux.
-Paul Edmon-
On 01/30/2013 03:58 PM, Moe Jette
We are also interested in this as we are migrating from LSF to SLURM.
We will likely cook up some of our own as we migrate, but we haven't
gotten there yet.
-Paul Edmon-
On 1/20/2013 12:32 PM, Fred Liu wrote:
Does anyone have LSF command wrappers for Slurm or is interested in
working
Cool. Thanks for the info.
-Paul Edmon-
On 01/16/2013 03:05 AM, Ole Holm Nielsen wrote:
On 01/15/2013 11:36 PM, Paul Edmon wrote:
So does anyone have an example node health check script for SLURM? One
that would be run by HealthCheckProgram defined in slurm.conf. I'd
rather not reinvent
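As a starting point, here is a minimal sketch of a HealthCheckProgram
script (the checked mount point and reason string are just examples;
real sites often use something like the LBNL Node Health Check project
instead):

#!/bin/bash
# Drain this node if its scratch filesystem is not mounted.
if ! mountpoint -q /scratch; then
    scontrol update nodename=$(hostname -s) state=drain reason="scratch not mounted"
fi
exit 0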
of hostnames that don't
actually exist in DNS due to a typo, it causes the slurmctld to freak out
and die.
-Paul Edmon-
On 12/05/2012 01:56 PM, Danny Auble wrote:
This shouldn't happen. If you have an example of what you are talking
about, it would be interesting to fix. Typos/bad info
Thanks for the info. Could this be put on the docket for the next
update? In the meantime we can work around it.
-Paul Edmon-
On 12/4/2012 3:17 PM, je...@schedmd.com wrote:
it would be pretty simple to add, but there is no mechanism to do this
today