[slurm-dev] Re: Removing Job from Slurm Database

2014-04-21 Thread Paul Edmon
Sure I can hunt that info down. So what would be the command to remove the job from the DB? I tried the obvious ones I could think of but with no effect. -Paul Edmon- On 4/21/2014 4:31 PM, Danny Auble wrote: Paul, you should be able to remove the job with no issue. The real question

[slurm-dev] Primary Loop Frequency in defer

2014-04-11 Thread Paul Edmon
So if you are running in defer mode for the scheduler what determines the frequency of the main loop for the scheduler? Can this be changed? -Paul Edmon-

[slurm-dev] Re: Primary Loop Frequency in defer

2014-04-11 Thread Paul Edmon
Thanks. That's helpful. -Paul Edmon- On 04/11/2014 03:00 PM, je...@schedmd.com wrote: In defer mode, the main scheduling loop runs once per minute, but most of your jobs will typically be scheduled by the backfill scheduler instead (although that depends upon your configuration

[slurm-dev] Re: Primary Loop Frequency in defer

2014-04-11 Thread Paul Edmon
One more question, what controls the maximum runtime for the Main Scheduler? -Paul Edmon- On 04/11/2014 03:02 PM, Paul Edmon wrote: Thanks. That's helpful. -Paul Edmon- On 04/11/2014 03:00 PM, je...@schedmd.com wrote: In defer mode, the main scheduling loop runs once per minute
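
For reference, both knobs asked about in this thread map to SchedulerParameters options in slurm.conf; a minimal sketch with illustrative values (names per the slurm.conf man page, so check your release):

    # slurm.conf (excerpt)
    # sched_interval:  how often, in seconds, the main scheduling loop runs (default 60)
    # max_sched_time:  how long, in seconds, the main scheduling loop may run before exiting
    SchedulerType=sched/backfill
    SchedulerParameters=defer,sched_interval=60,max_sched_time=4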

[slurm-dev] auto-defer

2014-04-04 Thread Paul Edmon
, and allow it to breathe and catch up. Is there a way to automate this? -Paul Edmon-

[slurm-dev] DB Cache

2014-03-20 Thread Paul Edmon
updated. I did a scontrol reconfigure but that didn't help. Only a full slurm restart fixed it. So is this a known feature? Is there a way to force it to update its cache without a full restart? I would hate to have to restart every time I did this. -Paul Edmon-

[slurm-dev] Re: DB Cache

2014-03-20 Thread Paul Edmon
The DB itself is on the same machine as the CTLD, so it should be blocking. I will amp up the debug and see what I find. -Paul Edmon- On 3/20/2014 4:26 PM, Danny Auble wrote: Paul I would check your slurmdbd log about not being able to talk to your slurmctld on the cluster. What you

[slurm-dev] Re: DB Cache

2014-03-20 Thread Paul Edmon
Sorry I meant shouldn't be blocking. -Paul Edmon- On 3/20/2014 9:41 PM, Paul Edmon wrote: The DB itself is on the same machine as the CTLD, so it should be blocking. I will amp up the debug and see what I find. -Paul Edmon- On 3/20/2014 4:26 PM, Danny Auble wrote: Paul I would check

[slurm-dev] MySQL query blocking slurmctld

2014-02-21 Thread Paul Edmon
. This doesn't happen on reconfigures, only on restarts. Is there a way to prevent it from doing this query or at least make this query nonblocking for slurm? Thanks. -Paul Edmon-

[slurm-dev] Re: MySQL query blocking slurmctld

2014-02-21 Thread Paul Edmon
Okay, that would be great. -Paul Edmon- On 02/21/2014 02:28 PM, Danny Auble wrote: At the moment, no. Perhaps it could be looked at for future versions though. On 02/21/14 11:20, Paul Edmon wrote: Whenever we do a: service slurm restart on our master it ends up initiating a massive

[slurm-dev] Account Associate Change

2014-02-20 Thread Paul Edmon
: User= user1, Account=account1 I want to set it to: User= user1, Account=account2 How would I do that? I tried the obvious and naive methods but no such luck. -Paul Edmon-

[slurm-dev] Re: Account Associate Change

2014-02-20 Thread Paul Edmon
So could you instead add them to a different account and then after that remove the old account association? -Paul Edmon- On 02/20/2014 12:14 PM, Danny Auble wrote: Sorry Paul, there is no way to change a users account. It doesn't work well in accounting. You would have to add a new
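
A hedged sketch of the add-then-remove approach raised above, using standard sacctmgr syntax (user and account names are placeholders):

    # create the new association first
    sacctmgr add user name=user1 account=account2
    # then drop the old association once no jobs reference it
    sacctmgr remove user name=user1 account=account1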

[slurm-dev] SLURM Partitions

2014-02-10 Thread Paul Edmon
get any cross talk. Can this be done? It would be incredibly helpful for our environment. -Paul Edmon-

[slurm-dev] Re: SLURM Partitions

2014-02-10 Thread Paul Edmon
thousands of jobs in the queue. I think we would take the hit for having to spin through all the partitions in order to make sure every partition is treated properly. -Paul Edmon- On 02/10/2014 11:12 AM, Alejandro Lucero Palau wrote: Hi Paul, What's the max cycle latency for main scheduling

[slurm-dev] Number of Jobs per User per Partition

2014-02-10 Thread Paul Edmon
Is there an option for limiting the number of jobs a user can have running on a given partition? We have an interactive queue that I want to limit to 5 jobs per user. -Paul Edmon-

[slurm-dev] Re: SLURM Partitions

2014-02-10 Thread Paul Edmon
Thanks. We will check it out. -Paul Edmon- On 02/10/2014 02:19 PM, je...@schedmd.com wrote: Hi Paul, This should achieve the results that you are looking for using a new configuration parameter. The attached patch, including documentation changes, is built against Slurm version 2.6. You
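
The patch referenced above predates it, but later Slurm releases can express this limit with a partition QOS; a sketch under that assumption (QOS, partition, and node names are placeholders):

    # create a QOS that caps running jobs per user
    sacctmgr add qos interactive
    sacctmgr modify qos interactive set MaxJobsPerUser=5

    # slurm.conf: attach the QOS to the interactive partition
    PartitionName=interactive Nodes=inode[01-04] QOS=interactive State=UP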

[slurm-dev] Re: Minimum Runtime

2014-01-27 Thread Paul Edmon
We are on the latest release so that shouldn't be the issue. -Paul Edmon- On 01/27/2014 07:30 AM, Moe Jette wrote: There were changes in Slurm version 2.6 with respect to lock handling which may affect this. If you are using an earlier version of slurm, that would be a reason to upgrade

[slurm-dev] Minimum Runtime

2014-01-26 Thread Paul Edmon
So I've found that if someone submits a ton of jobs that have a very short runtime, slurm tends to thrash as jobs are launching and exiting pretty much constantly. Is there an easy way to enforce a minimum runtime? -Paul Edmon-

[slurm-dev] Re: Minimum Runtime

2014-01-26 Thread Paul Edmon
stuff. It should be helpful. -Paul Edmon- On 1/26/2014 7:21 PM, Moe Jette wrote: A great deal depends upon your hardware and configuration. Slurm should be able to handle a few hundred jobs per second when tuned for high throughput as described here: http://slurm.schedmd.com
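
A few slurm.conf knobs commonly touched when tuning for high throughput; values are purely illustrative and the guide linked above is the authority:

    # minimum seconds a completed job record stays in slurmctld memory before purge
    MinJobAge=300
    # allow more time for RPCs to complete under heavy submit load
    MessageTimeout=30
    # defer scheduling at submit time; cap how many jobs each loop pass considers
    SchedulerParameters=defer,default_queue_depth=100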

[slurm-dev] Memory Usage Fairshare

2014-01-15 Thread Paul Edmon
into the next version of SLURM as it would be handy in our environment. -Paul Edmon-

[slurm-dev] Re: Memory Usage Fairshare

2014-01-15 Thread Paul Edmon
would be good. -Paul Edmon- On 1/15/2014 5:51 PM, Christopher Samuel wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi Paul, On 16/01/14 03:29, Paul Edmon wrote: Thoughts? If this doesn't exist it may be a good thing to add into the next version of SLURM as it would be handy in our

[slurm-dev] Re: sbatch as a one liner on the command line

2013-12-13 Thread Paul Edmon
For reference we are coming from a LSF environment where our users are used to: bsub -q test_queue my_program Where my_program could be anything from simple bash commands to an actual program to run. -Paul Edmon- On 12/13/2013 04:59 PM, Silva, Luis wrote: sbatch as a one liner
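
For the LSF-style one-liner, sbatch's --wrap option is the usual equivalent; a sketch (partition and program names are placeholders):

    # LSF:   bsub -q test_queue my_program
    # Slurm: wrap the command in a generated batch script
    sbatch -p test_queue --wrap="my_program"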

[slurm-dev] Re: Admin reservation on busy nodes

2013-11-12 Thread Paul Edmon
Include the ignore_jobs flag. That will force the reservation. -Paul Edmon- On 11/12/2013 12:11 PM, Jacqueline Scoggins wrote: Admin reservation on busy nodes Running slurm 2.5.7 and tried to reserve the nodes of the cluster because of hardware issues that needed to be repaired. Some
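
A hedged example of creating such a reservation with the ignore_jobs flag (reservation name, times, and node list are placeholders):

    scontrol create reservation reservationname=repair \
        starttime=now duration=240 nodes=node[01-08] \
        users=root flags=maint,ignore_jobs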

[slurm-dev] Reboot_Nodes List

2013-10-31 Thread Paul Edmon
remains to be rebooted somewhere. Is there a way to access it? -Paul Edmon-

[slurm-dev] Re: Reboot_Nodes List

2013-10-31 Thread Paul Edmon
What about those that have yet to be hit with maint? -Paul Edmon- On 10/31/2013 5:37 PM, Moe Jette wrote: sinfo will show the node state as maint. sinfo can filter on that node state too: sinfo -N --state=maint Quoting Paul Edmon ped...@cfa.harvard.edu: So we recently used scontrol

[slurm-dev] Re: Reboot_Nodes List

2013-10-31 Thread Paul Edmon
Thanks. -Paul Edmon- On 10/31/2013 9:53 PM, Morris Jette wrote: Maint means reboot pending. Cleared after reboot. Paul Edmon ped...@cfa.harvard.edu wrote: What about those that have yet to be hit with maint? -Paul Edmon- On 10/31/2013 5:37 PM, Moe Jette wrote: sinfo

[slurm-dev] Re: Job Dependencies on Arrays

2013-10-28 Thread Paul Edmon
Thanks. That should work nicely. -Paul Edmon- On 10/28/2013 8:37 PM, Moe Jette wrote: Paul, I'm working on getting that into the next release v2.6.4, which should be available within days. To wait for all job array elements to complete, you would just specify the primary job id. You can
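
A sketch of the pattern Moe describes, waiting on the whole array via its primary job id (script names are placeholders):

    # capture the primary job id from "Submitted batch job <id>"
    jobid=$(sbatch --array=1-100 array_job.sh | awk '{print $4}')
    # runs only after every array element completes successfully
    sbatch --dependency=afterok:$jobid post_process.sh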

[slurm-dev] Re: Proper way of finding how many jobs are currently running/pending ?

2013-10-17 Thread Paul Edmon
diagnostic checks like this should respond in a timely manner, even if the data that is contained there is a little out of date. -Paul Edmon- On 10/17/2013 11:57 AM, Moe Jette wrote: There is no faster way to get job counts, but you might find the sdiag command helpful. Quoting Damien

[slurm-dev] Re: Proper way of finding how many jobs are currently running/pending ?

2013-10-17 Thread Paul Edmon
True. I was just contemplating ways to make it more responsive. Multiple copies of the data would do that, I just wasn't sure whether keeping that in sync would be a headache. -Paul Edmon- On 10/17/2013 1:01 PM, Moe Jette wrote: Sending old data quickly seems very dangerous, especially

[slurm-dev] Re: Proper way of finding how many jobs are currently running/pending ?

2013-10-17 Thread Paul Edmon
is 2 seconds. What is your MessageTimeout set to? Danny On 10/17/13 10:21, Paul Edmon wrote: True. I was just contemplating ways to make it more responsive. Multiple copies of the data would do that, I just wasn't sure whether keeping that in sync would be a headache. -Paul Edmon- On 10

[slurm-dev] Insane message length

2013-09-29 Thread Paul Edmon
jobs such as squeue and scancel are not working. So I can't tell who sent in this many jobs. -Paul Edmon-

[slurm-dev] Re: Insane message length

2013-09-29 Thread Paul Edmon
That's good to hear. Is there an option to do it per user? I didn't see one in the slurm.conf. I may have missed it. -Paul Edmon- On 9/29/2013 5:42 PM, Moe Jette wrote: Quoting Paul Edmon ped...@cfa.harvard.edu: Yeah, that's why we set the 500,000 job limit. Though I didn't
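
If a per-user cap on submitted jobs is wanted, a QOS limit is one place to hang it; a hedged sketch (QOS name and value are placeholders):

    sacctmgr modify qos normal set MaxSubmitJobsPerUser=10000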

[slurm-dev] Scheduling with Many Partitions

2013-09-27 Thread Paul Edmon
collapse the ridiculous number of queues we have, so that option is right out. We also don't want to start splitting our environment between multiple separate masters for each queue as we want centralized accounting and fairshare, plus we want one system to rule them all. -Paul Edmon-

[slurm-dev] Re: Scheduling with Many Partitions

2013-09-27 Thread Paul Edmon
everything. We have since reduced the backfill frequency which has helped greatly. -Paul Edmon- On 09/27/2013 12:06 PM, Moe Jette wrote: Responses in-line below. You should also be aware that Don Lipari of LLNL presented a tutorial about Slurm's scheduler at the 2012 Slurm User Group meeting

[slurm-dev] Re: Scheduling with Many Partitions

2013-09-27 Thread Paul Edmon
2.6.1 -Paul Edmon- On 09/27/2013 12:15 PM, Danny Auble wrote: Paul you are running 2.6 correct? On 09/27/2013 09:11 AM, Paul Edmon wrote: Thanks Moe. I will try a few of those things and look at that presentation. I did try bf_continue at one point, but that caused our entire system
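
The backfill knobs mentioned in this thread are SchedulerParameters options; a sketch with illustrative values:

    # bf_interval:     seconds between backfill passes (raising it reduces load)
    # bf_continue:     let backfill resume where it left off after releasing locks
    # bf_max_job_test: cap on jobs considered per backfill pass
    SchedulerParameters=bf_interval=60,bf_continue,bf_max_job_test=500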

[slurm-dev] Job Arrays

2013-09-09 Thread Paul Edmon
stride and array number you can go to? As you can see we have users that do some crazy things. -Paul Edmon-

[slurm-dev] Re: Job Arrays

2013-09-09 Thread Paul Edmon
Ah, that would explain it: [root@itc011 ~]# scontrol show config | grep MaxArraySize MaxArraySize= 1001 Will have to boost that one. Is there any limit to the stride? I'm guessing no but I just want to check. -Paul Edmon- On 9/9/2013 11:41 AM, Moe Jette wrote: These both

[slurm-dev] Re: Job Arrays

2013-09-09 Thread Paul Edmon
Thanks. -Paul Edmon- On 9/9/2013 11:49 AM, Moe Jette wrote: There is no stride limit. Quoting Paul Edmon ped...@cfa.harvard.edu: Ah, that would explain it: [root@itc011 ~]# scontrol show config | grep MaxArraySize MaxArraySize= 1001 Will have to boost that one
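
Raising the array limit is a slurm.conf change; a sketch (the value is illustrative, and slurmctld needs a restart or reconfigure afterwards):

    # slurm.conf
    MaxArraySize=10001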

[slurm-dev] Feature Request

2013-08-27 Thread Paul Edmon
if it can resolve all the hosts as well and fail if it can't. Thanks. -Paul Edmon-

[slurm-dev] Re: showq wrapper for Slurm?

2013-08-11 Thread Paul Edmon
That would be great. I can send in my version as well. It has a few additional features such as sorting by partition and ordering the pending queue by job priority. -Paul Edmon- On 8/11/2013 5:49 PM, Danny Auble wrote: Karl, if you send us a copy we may be able to put it in the contribs

[slurm-dev] Re: oversubscribe

2013-08-08 Thread Paul Edmon
insight would be appreciated. -Paul Edmon- On 08/06/2013 09:30 PM, Paul Edmon wrote: So our SLURM 2.5.7 install went down this evening with a massive bout of: [2013-08-06T17:12:20-04:00] sched: Allocate JobId=113950 NodeList=holy2b09103 #CPUs=4 [2013-08-06T17:12:20-04:00] sched: Allocate JobId

[slurm-dev] Massive SLURM failure

2013-08-06 Thread Paul Edmon
. Is there something I am missing? Some way of making it more robust? We've tried the HA failover but that didn't work when this happened and caused other problems when the install went split-brained. -Paul Edmon-

[slurm-dev] Re: Master Failure

2013-07-09 Thread Paul Edmon
Haven't seen a response on this so I thought I would re-ping. Has anyone ever seen the below error? -Paul Edmon- On 07/04/2013 01:36 PM, Paul Edmon wrote: So we are running slurm-2.5.7 on our cluster with a master and a backup. This morning our primary suffered from this error: [2013-07

[slurm-dev] Master Failure

2013-07-04 Thread Paul Edmon
files on the shared filesystem was supposed to prevent this as all the current running jobs are written there. Or did I misunderstand and those files are only updated when the master goes down? I would like to understand why we lost jobs so we can prevent it from happening again. -Paul Edmon-

[slurm-dev] Job Groups

2013-06-19 Thread Paul Edmon
list first before putting a nail in it. From my look at the documentation I don't see any way to do this other than what I stated above. -Paul Edmon-

[slurm-dev] Re: Job Groups

2013-06-19 Thread Paul Edmon
Thanks for the input. Can GrpJobs be modified from the user side? -Paul Edmon- On 06/19/2013 12:15 PM, Ryan Cox wrote: Paul, We were discussing this yesterday due to a user not limiting the amount of jobs hammering our storage. A QOS with a GrpJobs limit sounds like the best approach

[slurm-dev] Re: Job Groups

2013-06-19 Thread Paul Edmon
Okay, thanks. -Paul Edmon- On 06/19/2013 04:32 PM, Ryan Cox wrote: Not that I'm aware of. I don't know of a way to give users control over a QOS like you can do with account coordinators for accounts. Ryan On 06/19/2013 10:55 AM, Paul Edmon wrote: Thanks for the input. Can GrpJobs
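
A hedged sketch of the GrpJobs QOS approach Ryan describes (QOS and user names are placeholders; only an admin can attach it):

    # QOS allowing at most 50 jobs running at once across everyone using it
    sacctmgr add qos storagelimit
    sacctmgr modify qos storagelimit set GrpJobs=50
    # make it available to the user; jobs then request it with --qos=storagelimit
    sacctmgr modify user name=user1 set qos+=storagelimit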

[slurm-dev] Re: Slurmctld multithreaded?

2013-06-12 Thread Paul Edmon
. -Paul Edmon- On 06/12/2013 12:30 PM, Alan V. Cowles wrote: Hey Guys, I've seen a few references to the slurmctld as a multithreaded process but it doesn't seem that way. We had a user submit 18000 jobs to our cluster (512 slots) and it shows 512 fully loaded, shows those jobs running, shows

[slurm-dev] Orphaned Jobs

2013-06-05 Thread Paul Edmon
. However, I've done this before and hadn't seen this issue crop up. Is there a way to remove this job from sacct? scancel does not work on it. -Paul Edmon-

[slurm-dev] Re: Orphaned Jobs

2013-06-05 Thread Paul Edmon
Do you mean the node that hosts the slurmdb? Or the node that runs slurmctld? Or are you speaking of the nodes on which that job ran? -Paul Edmon- On 06/05/2013 10:45 AM, Sefa Arslan wrote: if possible, rebooting the worker node is the fastest solution. On 06/05/2013 05:10 PM, Paul

[slurm-dev] Re: Memory Issues

2013-05-23 Thread Paul Edmon
Hmm, maybe it's the ThreadsPerCore? Perhaps it thinks there are half as many cores as there really are due to the ThreadsPerCore. Thus if you do --mem-per-cpu it will only give you half, as it only counts cores, not threads*cores? -Paul Edmon- On 05/23/2013 01:31 PM, S. Aravindan wrote

[slurm-dev] Re: JobHeldAdmin

2013-04-29 Thread Paul Edmon
Thanks. -Paul Edmon- On 4/29/2013 10:34 AM, Carles Fenoy wrote: Re: [slurm-dev] Re: JobHeldAdmin Dear Paul, You should be able to release the job with the command: scontrol release JOBID Regards, Carles Fenoy Barcelona Supercomputing Center On Mon, Apr 29, 2013 at 3:49 PM, Carl Schmidtmann
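
For completeness, the matching hold/release pair (the job id is a placeholder):

    scontrol hold 12345      # place a job on hold
    scontrol release 12345   # release it again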

[slurm-dev] showq for SLURM

2013-04-17 Thread Paul Edmon
Available to this Queue 1328 of 1992 Cores Used (66.67%) 247 of 250 Nodes Used (98.80%), 0 Nodes Closed by Admin, 1 Nodes Unavailable Other Queues Which Submit To The Hosts For This Queue: priority PENDING JOBS- No matching jobs found -Paul Edmon-

[slurm-dev] Re: showq for SLURM

2013-04-17 Thread Paul Edmon
Thanks. This should be helpful. -Paul Edmon- On 4/17/2013 5:29 PM, Karl Schulz wrote: Paul, I too have done showq variants for LSF in the past and have ported a version to use the C api in slurm (you do need slurm-devel installed to build though). I sent you a dist tarball offline

[slurm-dev] Re: SLURM and locked pages

2013-03-04 Thread Paul Edmon
Excellent, thanks for letting me know. Just for my own information: I would only need to restart slurm, right, to have this change take effect, not restart the full machine? -Paul Edmon- On 3/3/2013 10:48 PM, Andy Riebs wrote: Paul, Assuming that you are using a recent (2.5.x or later) version

[slurm-dev] Re: Adding new nodes to slurm.conf

2013-01-30 Thread Paul Edmon
a NodeAddr list. I would expect it to refuse the new conf and spit out an error message. -Paul Edmon- On 01/30/2013 01:03 PM, David Bigagli wrote: Re: [slurm-dev] Adding new nodes to slurm.conf Do you have the slurmctld log when the master failed? It should be enough to add the hostname

[slurm-dev] Re: Adding new nodes to slurm.conf

2013-01-30 Thread Paul Edmon
So during that period the master would cease managing everything and you wouldn't be able to submit? Are those the only dangers for shutting down the master? We tend to be in an environment where things are in production but also in flux. -Paul Edmon- On 01/30/2013 03:58 PM, Moe Jette

[slurm-dev] RE: LSF command wrappers for Slurm?

2013-01-20 Thread Paul Edmon
We are also interested in this as we are migrating from LSF to SLURM. We will likely cook up some of our own as we migrate, but we haven't gotten there yet. -Paul Edmon- On 1/20/2013 12:32 PM, Fred Liu wrote: Does anyone have LSF command wrappers for Slurm or is interested in working

[slurm-dev] Re: Health Check Program

2013-01-16 Thread Paul Edmon
Cool. Thanks for the info. -Paul Edmon- On 01/16/2013 03:05 AM, Ole Holm Nielsen wrote: On 01/15/2013 11:36 PM, Paul Edmon wrote: So does anyone have an example node health check script for SLURM? One that would be run by HealthCheckProgram defined in slurm.conf. I'd rather not reinvent
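
The relevant slurm.conf hooks, with an assumed path to a site health-check script such as LBNL's NHC:

    # run the health check on every compute node every 5 minutes
    HealthCheckProgram=/usr/sbin/nhc
    HealthCheckInterval=300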

[slurm-dev] Re: slurm.conf syntax checker

2012-12-05 Thread Paul Edmon
of hostnames that don't actually exist in DNS due to a typo, it causes the slurmctld to freak out and die. -Paul Edmon- On 12/05/2012 01:56 PM, Danny Auble wrote: This shouldn't happen. If you have an example of what you are talking about it would be interesting to fix. Typos/bad info

[slurm-dev] Re: Port Restriction for srun

2012-12-04 Thread Paul Edmon
Thanks for the info. Could this be put on the docket for the next update? In the meantime we can work around it. -Paul Edmon- On 12/4/2012 3:17 PM, je...@schedmd.com wrote: it would be pretty simple to add, but there is no mechanism to do this today -- Sent from my Android phone. Please
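
For reference, later Slurm releases added a slurm.conf parameter for exactly this; a sketch (the range is illustrative):

    # restrict the ephemeral ports srun listens on, e.g. for firewall rules
    SrunPortRange=60001-63000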
