[slurm-dev] Jobs at Harvard Research Computing

2017-09-11 Thread Paul Edmon
We have a number of openings here at Harvard FAS RC. If you are interested please check out our employment page for details: https://www.rc.fas.harvard.edu/about/employment/ -Paul Edmon-

[slurm-dev] Re: multiple MPI versions with slurm

2017-07-21 Thread Paul Edmon
I would build MPI using the pmi libraries and slurm with pmi support. Then you can launch any version of MPI using the same srun command and leveraging pmi. -Paul Edmon- On 07/21/2017 10:03 AM, Manuel Rodríguez Pascual wrote: Hi all, I'm trying to provide support to users demanding
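A minimal sketch of that setup, assuming Open MPI and a Slurm built with PMI support (prefix, paths, and versions are illustrative):

    # Build Open MPI against Slurm's PMI library (illustrative paths)
    ./configure --prefix=/opt/openmpi-1.10.2 --with-slurm --with-pmi=/usr
    make install

    # Any MPI version built this way can then be launched the same way via PMI
    srun --mpi=pmi2 -n 64 ./my_mpi_program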

[slurm-dev] Re: SLURM automatically use idle nodes in reserved partitions?

2017-07-11 Thread Paul Edmon
idle resources, the lower priority uses it. If the high priority needs the resources it will requeue the jobs from the lower priority partition. -Paul Edmon- On 7/11/2017 4:29 PM, David Perel wrote: Hello -- Say on a cluster researcher X has his own reserved partition, XP, where

[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Paul Edmon
Yeah, we keep around a test cluster environment for that purpose to vet slurm upgrades before we roll them on the production cluster. Thus far no problems. However, paranoia is usually a good thing for cases like this. -Paul Edmon- On 06/26/2017 07:30 AM, Ole Holm Nielsen wrote: On 06

[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Paul Edmon
all the user processes on the cluster? Cheers, Loris Paul Edmon <ped...@cfa.harvard.edu> writes: If you follow the guide on the Slurm website you shouldn't have many problems. We've made it standard practice here to set all partitions to DOWN and suspend all the jobs when we do up

[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Paul Edmon
are being done. -Paul Edmon- On 06/20/2017 09:37 AM, Nicholas McCollum wrote: I'm about to update 15.08 to the latest SLURM in August and would appreciate any notes you have on the process. I'm especially interested in maintaining the DB as well as associations. I'd also like to keep

[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Paul Edmon
That's a really good point. Pruning your DB can also help with this. -Paul Edmon- On 06/20/2017 09:19 AM, Loris Bennett wrote: Hi Tim, Tim Fora <tf...@riseup.net> writes: Hi, Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to start. Logs show most of the time was

[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02

2017-06-20 Thread Paul Edmon
to about 30 min or so. Beyond that I imagine some more specific DB optimization tricks could be done, but I'm not a DB admin so I won't venture to say. -Paul Edmon- On 06/20/2017 08:42 AM, Tim Fora wrote: Hi, Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to start. Logs show

[slurm-dev] Re: understanding of Purge in Slurmdb.conf

2017-06-08 Thread Paul Edmon
to walk it up to the level you need. You can also have the purge archive to disk which can be handy if you want to maintain historical info. The purge itself runs monthly at midnight on the 1st of the month. -Paul Edmon- On 06/08/2017 11:36 AM, Rohan Gadalkar wrote: Re: [slurm-dev] Re
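An illustrative slurmdbd.conf fragment for purging with archiving to disk (retention values are examples only):

    # slurmdbd.conf (fragment)
    ArchiveDir=/var/spool/slurmdbd/archive
    ArchiveJobs=yes
    ArchiveSteps=yes
    PurgeJobAfter=6months      # keep 6 months of job records in the database
    PurgeStepAfter=6months
    PurgeEventAfter=12months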

[slurm-dev] Re: Partition default job time limit

2017-05-09 Thread Paul Edmon
Yes. Use the DefaultTime option. *DefaultTime* Run time limit used for jobs that don't specify a value. If not set then MaxTime will be used. Format is the same as for MaxTime. https://slurm.schedmd.com/slurm.conf.html -Paul Edmon- On 05/09/2017 05:35 AM, Georg Hildebrand wrote
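For example, a partition definition in slurm.conf might look like this (names and times are illustrative):

    # slurm.conf (fragment)
    PartitionName=general Nodes=node[01-64] DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP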

[slurm-dev] Re: Specify node name for a job

2017-05-04 Thread Paul Edmon
*-w*, *--nodelist*= Request a specific list of hosts. The job will contain /all/ of these hosts and possibly additional hosts as needed to satisfy resource requirements. The list may be specified as a comma-separated list of hosts, a range of hosts (host[1-5,7,...] for example), or
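Illustrative usage (host names are made up):

    # Require specific hosts; Slurm may add nodes to satisfy the resource request
    sbatch -w node[01-04] --ntasks=64 job.sh
    srun --nodelist=node07 -n 1 hostname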

[slurm-dev] Re: User accounting

2017-05-04 Thread Paul Edmon
sacct is where you want to look: https://slurm.schedmd.com/sacct.html -Paul Edmon- On 5/4/17 9:09 AM, Mahmood Naderan wrote: User accounting Hi, I read the accounting page https://slurm.schedmd.com/accounting.html however since it is quite large, I didn't get my answer! I want to know
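A couple of illustrative sacct queries (user name, job id, and date are placeholders):

    # All jobs for one user since a given date, with a few useful fields
    sacct -u jdoe --starttime=2017-04-01 --format=JobID,JobName,Partition,Elapsed,State,MaxRSS

    # A single job, including its steps
    sacct -j 123456 --format=JobID,AllocCPUS,Elapsed,CPUTime,State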

[slurm-dev] Re: GUI Job submission portal

2017-04-20 Thread Paul Edmon
the normal slurm commands to effectively do the same thing. -Paul Edmon- On 04/20/2017 06:23 AM, Parag Khuraswar wrote: Hi All, Does SLURM support below features:- Job Schedulers • Workload cum resource manager with policy-aware, resource-aware and topology-aware scheduling

[slurm-dev] Re: Deleting jobs in Completing state on hung nodes

2017-04-10 Thread Paul Edmon
Sometimes restarting slurm on the node and the master can purge the jobs as well. -Paul Edmon- On 04/10/2017 03:59 PM, Douglas Meyer wrote: Set node to drain if other jobs running. Then down and then resume. Down will kill and clear any jobs. scontrol update nodename= state

[slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

2017-03-16 Thread Paul Edmon
You should look at LLN (least loaded nodes): https://slurm.schedmd.com/slurm.conf.html That should do what you want. -Paul Edmon- On 03/16/2017 12:54 PM, kesim wrote: Fwd: Scheduling jobs according to the CPU load -- Forwarded message -- From: *kesim* <ketiw...@gmail.
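As a sketch, LLN can be enabled globally via the select plugin or per partition in slurm.conf (values are illustrative):

    # slurm.conf (fragment) -- global, with select/cons_res
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core,CR_LLN

    # or per partition
    PartitionName=general Nodes=node[01-64] LLN=yes State=UP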

[slurm-dev] Re: Federated Cluster scheduling

2017-02-12 Thread Paul Edmon
I'm fairly certain that is coming in the 17.02 release. According to this presentation it was supposed to be in 16.05: https://slurm.schedmd.com/SLUG15/MultiCluster.pdf but from what I recall of the SC BoF full functionality was not going to be available until 17.02. -Paul Edmon- On 2/12/2017

[slurm-dev] Re: distributing fair share allocation

2017-02-07 Thread Paul Edmon
I would probably look at qos's for this: https://slurm.schedmd.com/qos.html https://slurm.schedmd.com/resource_limits.html You can attach them to partitions as well which can be handy. You probably want to use things like MaxJobs, MaxWallDurationPerJob, MaxTRESperJob. -Paul Edmon- On 2/7
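A sketch of that kind of QOS setup (names and limits are illustrative):

    # Create a QOS with per-user/per-job limits, then attach it to a partition
    sacctmgr add qos restricted
    sacctmgr modify qos restricted set MaxJobsPerUser=10 MaxWallDurationPerJob=2-00:00:00 MaxTRESPerJob=cpu=64

    # slurm.conf (fragment)
    PartitionName=shared Nodes=node[01-32] QOS=restricted State=UP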

[slurm-dev] Re: Priority issue

2017-01-25 Thread Paul Edmon
Are there other jobs in the queue? If so then those higher priority jobs are likely waiting for resources before they run. Also the lower priority jobs probably can't fit in before the higher priority jobs run, thus they won't get backfilled. That's at least my guess. -Paul Edmon- On 01

[slurm-dev] Re: preemptive fair share scheduling

2017-01-06 Thread Paul Edmon
not do nicing for suspends. You could theoretically modify the slurm code to do that as the whole thing is open source and contribute it back to the community, but the feature doesn't exist currently. -Paul Edmon- On 01/06/2017 07:27 AM, Sophana Kok wrote: Re: [slurm-dev] Re: preemptive fair

[slurm-dev] Re: where to find completed job execution command

2017-01-06 Thread Paul Edmon
We do the same thing except with a prolog script which dumps the job info to a flat file so we can look up historical jobs. Sadly the slurmdbd does not store this info, so you have to do it yourself. -Paul Edmon- On 01/06/2017 04:31 AM, Loris Bennett wrote: Sean McGrath <s

[slurm-dev] Re: Proper way to clean up ghost jobs (slurm 15.08.12-1)

2017-01-04 Thread Paul Edmon
I know that a query like this was added to 16.x series. Namely sacctmgr show runawayjobs I don't think it is in 15.x series though. -Paul Edmon- On 01/04/2017 10:00 AM, Chris Rutledge wrote: Hello all, We recently discovered ghost jobs in the slurm_acct_db._job_table that was the result

[slurm-dev] Re: Best practice for adding & maintaining lots of user accounts?

2016-12-09 Thread Paul Edmon
with the proper associations. We key off the user's primary group id for their default association. Thus the user's account is created when they submit their first job. This should spread out the load, unless of course you have all your users submitting simultaneously. -Paul Edmon- On 12/08/2016

[slurm-dev] Re: Are Backfill parameters supported per partition?

2016-11-29 Thread Paul Edmon
Sadly all backfill parameters are global. You can tell slurm to schedule per partition using the bf_max_job_part flag. That's about it though. -Paul Edmon- On 11/29/2016 01:09 PM, Kumar, Amit wrote: Dear SLURM, Can I specify partition specific backfill parameters? Thank you, Amit
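An illustrative slurm.conf line (the numbers are examples):

    # slurm.conf (fragment) -- cap how many jobs backfill considers per partition
    SchedulerType=sched/backfill
    SchedulerParameters=bf_max_job_part=100,bf_max_job_user=20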

[slurm-dev] Re: Restrict users to see only jobs of their groups

2016-11-01 Thread Paul Edmon
associations of all users they are coordinator of, but can only see themselves when listing users. http://slurm.schedmd.com/slurm.conf.html That should do what you want. -Paul Edmon- On 11/01/2016 07:53 AM, Nathan Harper wrote: Re: [slurm-dev] Restrict users to see only jobs

[slurm-dev] Re: How to set the maximum priority to a Slurm job? (from StackOverflow.com)

2016-09-30 Thread Paul Edmon
is. -Paul Edmon- On 09/30/2016 10:49 AM, Sergio Iserte wrote: Re: [slurm-dev] Re: How to set the maximum priority to a Slurm job? (from StackOverflow.com) Thanks, however, I would like to give the maximum priority. Should I give a large number to the parameter? Thank you. 2016-09-30 16:44 GMT+02

[slurm-dev] Re: How to set the maximum priority to a Slurm job? (from StackOverflow.com)

2016-09-30 Thread Paul Edmon
scontrol update jobid=jobid priority=blah That works at least on a per job basis. -Paul Edmon- On 09/30/2016 10:34 AM, Sergio Iserte wrote: How to set the maximum priority to a Slurm job? (from StackOverflow.com) Hello, this is a copy of my own StackOverflow post: http://stackoverflow.com/q

[slurm-dev] Re: Slurm web dashboards

2016-09-27 Thread Paul Edmon
for presenting data, you can also just take the graphs it generates and embed them elsewhere. -Paul Edmon- On 09/27/2016 08:21 AM, John Hearns wrote: Hello all. What are the thoughts on a Slurm ‘dashboard’. The purpose being to display cluster status on a large screen monitor. I rather

[slurm-dev] Re: Passing MCA parameters via srun

2016-09-26 Thread Paul Edmon
Excellent. Thanks for the info. -Paul Edmon- On 09/26/2016 04:03 PM, Paul Hargrove wrote: Re: [slurm-dev] Passing MCA parameters via srun Paul, If the user always wants a specific set of MCA options, then they should be placed in $HOME/.openmpi/mca-params.conf Otherwise, one should use

[slurm-dev] Passing MCA parameters via srun

2016-09-26 Thread Paul Edmon
(specifically OpenMPI 1.10.2) such as to standard MPI options: mpirun -mca btl self,openib My question is how do I get slurm to pass these options through when it invokes MPI. -Paul Edmon-

[slurm-dev] Slurm Diamond Collectors

2016-07-07 Thread Paul Edmon
For those using graphite and diamond, check this out as it may be useful. https://github.com/fasrc/slurm-diamond-collector -Paul Edmon-

[slurm-dev] Per Job Usage

2016-06-14 Thread Paul Edmon
would be a break down of memory and cpu charges. -Paul Edmon-

[slurm-dev] Per Job Usage

2016-06-14 Thread Paul Edmon
for a specific user to see if there was a job that it charged inordinately more for and why. My first logical step was to look at sacct but I didn't see an entry that simply listed RawUsage for the job in terms of TRES. Even better would be a breakdown of memory and cpu charges. -Paul

[slurm-dev] Re: Preemption order?

2016-06-10 Thread Paul Edmon
the big ones or have been running for a long time. -Paul Edmon- On 06/10/2016 02:36 AM, Steffen Grunewald wrote: Good morning everyone, is there a way to control the order in which jobs get preempted? That is, for a queue with PreemptMode=REQUEUE, it would make sense to preempt jobs first

[slurm-dev] Re: Architectural constraints of Slurm

2016-05-24 Thread Paul Edmon
have a pam module for handling who gets access. That plus cgroups should take care of your security problems. Anyways, suffice it to say slurm can work for your environment as your environment is fairly similar to ours. -Paul Edmon- On 05/24/2016 07:33 AM, Šimon Tóth wrote: Architectural

[slurm-dev] Print Out Slurm Network Hierarchy

2016-05-09 Thread Paul Edmon
much internal slurm communications. Thanks. -Paul Edmon-

[slurm-dev] Re: Slurm Upgrade Instructions needed

2016-05-04 Thread Paul Edmon
and slurmds. -Paul Edmon- On 5/4/2016 10:10 PM, Paul Edmon wrote: Specifically the upgrade instructions are here: http://slurm.schedmd.com/quickstart_admin.html Look at the bottom of the page. If you follow the instructions you should be fine. Though I would recommend pausing the scheduler

[slurm-dev] Re: Slurm Upgrade Instructions needed

2016-05-04 Thread Paul Edmon
do a major upgrade like this there is no rolling back due to the database and job structure changes. -Paul Edmon- On 5/4/2016 6:23 PM, Lachlan Musicman wrote: Re: [slurm-dev] Slurm Upgrade Instructions needed I would backup /etc/slurm. That's about it. Cheers L. -- The most dangerous

[slurm-dev] Zero Bytes Received

2016-05-02 Thread Paul Edmon
be caused by a node or client that is in a bad state, but I can't figure out how to trace it back to which one. Does anyone have any tricks for tracing this sort of error back? I turned on the Protocol Debug Flag but none of the additional debug statements lead to the culprit. -Paul Edmon-

[slurm-dev] Re: Saving job submissions

2016-04-27 Thread Paul Edmon
We use this python script as a slurmctld prolog to save ours. Basically it pulls all the info from the slurm hash files and copies to a separate filesystem. We used to do it via mysql but the database got too large. We then use the get_jobscript to actually query the job scripts. -Paul Edmon

[slurm-dev] TRES in QoS

2016-04-11 Thread Paul Edmon
on this point. -Paul Edmon-

[slurm-dev] Re: User education tools for fair share

2016-03-01 Thread Paul Edmon
to set up a data repository for historic fairshare data. Still it is on our docket to do. sshare is a great way of seeing things too, though it can be a bit much for your average user. -Paul Edmon- On 03/01/2016 04:52 AM, Chris Samuel wrote: Hi Loris, On Tue, 1 Mar 2016 12:29:12 AM Loris

[slurm-dev] Re: job_submit.lua slurm.log_user

2016-01-29 Thread Paul Edmon
Ah, okay. In this case I wanted to print out something to the user and still succeed, as it was just a warning to the user that their script had been modified. Thanks. -Paul Edmon- On 01/29/2016 12:36 PM, je...@schedmd.com wrote: It works for me, but only for the job_submit function

[slurm-dev] Re: Pause all new submissions

2016-01-20 Thread Paul Edmon
You can also set all the partitions to down. All pending jobs will pend, new jobs can be submitted and existing jobs will finish. However no new jobs will be scheduled. -Paul Edmon- On 01/20/2016 03:41 PM, Trey Dockendorf wrote: Re: [slurm-dev] Pause all new submissions Can likely use
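A sketch of that approach (the partition name is a placeholder):

    # Stop scheduling new work; submissions still queue and running jobs finish
    scontrol update partitionname=general state=down

    # Re-enable scheduling afterwards
    scontrol update partitionname=general state=up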

[slurm-dev] Re: slurmdbd upgrade

2016-01-12 Thread Paul Edmon
That sounds about right. We have about the same order of magnitude and our last major upgrade took about an hour for the DB to update itself. -Paul Edmon- On 1/12/2016 5:04 PM, Andrew E. Bruno wrote: We're planning an upgrade from 14.11.10 to 15.08.6 and in the past the slurmdbd upgrades

[slurm-dev] Grp Limits for Partition QoS

2016-01-06 Thread Paul Edmon
for the info. -Paul Edmon-

[slurm-dev] MsgAggregation Parameters

2015-12-08 Thread Paul Edmon
but it is about 10 times smaller in terms of number of nodes than our main one. I'm guessing this is a scaling problem. Thoughts? Anyone else using MsgAggregation? -Paul Edmon-

[slurm-dev] Re: MsgAggregation Parameters

2015-12-08 Thread Paul Edmon
I will have to try that out. Thanks for the info. -Paul Edmon- On 12/08/2015 01:54 PM, Danny Auble wrote: Hey Paul, Unless you have a very busy cluster (100s of jobs a second) or are running very large jobs (>2000 nodes) I don't think this will be very useful. But I would exp

[slurm-dev] Re: A floating exclusive partition

2015-11-19 Thread Paul Edmon
at least my thinking, but it's less seamless to the users as they will have to consciously monitor what is going on. -Paul Edmon- On 11/19/2015 10:50 AM, Daniel Letai wrote: Can you elaborate a little? I'm not sure what kind of QoS will help, nor how to implement one that will satisfy

[slurm-dev] Re: Partition QoS

2015-11-12 Thread Paul Edmon
Okay, it's working now. I just had to be more patient for slurmctld to pick up the DB change of adding the QOS. Thanks for the help. -Paul Edmon- On 11/11/2015 11:54 AM, Bruce Roberts wrote: I believe the method James describes only allows that one qos to be used by jobs on the partition

[slurm-dev] Re: Partition QoS

2015-11-11 Thread Paul Edmon
Thanks for the insight. I will try it out on my end. -Paul Edmon- On 11/11/2015 3:31 AM, Dennis Mungai wrote: Re: [slurm-dev] Re: Partition QoS Thanks a lot, James. Helped a bunch ;-) *From:*James Oguya [mailto:oguyaja...@gmail.com] *Sent:* Wednesday, November 11, 2015 11:21 AM *To:* slurm

[slurm-dev] Partition QoS

2015-11-10 Thread Paul Edmon
=5 MaxSubmitJobsPerUser=5 MaxCPUsPerUser=128 Thanks for the info. -Paul Edmon-

[slurm-dev] Re: Partition QoS

2015-11-10 Thread Paul Edmon
I did that but it didn't pick it up. I must need to reconfigure again after I made the qos. I will have to try it again. I'll let you know how it goes. -Paul Edmon- On 11/10/2015 5:40 PM, Douglas Jacobsen wrote: Re: [slurm-dev] Partition QoS Hi Paul, I did this by creating the qos, e.g
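A hedged sketch of the sequence being described (QOS name and limits are illustrative):

    # 1. Create the QOS in the accounting database and set its limits
    sacctmgr add qos part_qos
    sacctmgr modify qos part_qos set MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 MaxTRESPerUser=cpu=128

    # 2. Reference it from the partition in slurm.conf, then reconfigure
    #    PartitionName=myqueue Nodes=node[01-16] QOS=part_qos State=UP
    scontrol reconfigure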

[slurm-dev] Re: Interpreting

2015-10-27 Thread Paul Edmon
that are in a problematic state will have their Reason field filled with something. -Paul Edmon- On 10/27/2015 05:03 AM, Всеволод Никоноров wrote: Hello, if a node is in a MIXED state, is it possible that there are "good" and "bad" states mixed? I mean, if a node is MIXED when it

[slurm-dev] Re: login node configuration?

2015-10-26 Thread Paul Edmon
For clarity, they should not need to talk to the compute nodes unless you intend to do interactive work. You should only need to talk to the master to submit jobs. -Paul Edmon- On 10/26/2015 9:45 PM, Paul Edmon wrote: What we did was that we just opened up port 6817 between the two

[slurm-dev] Re: login node configuration?

2015-10-26 Thread Paul Edmon
service to submit as all you need is the ability of the login node to talk to the master. -Paul Edmon- On 10/26/2015 8:02 PM, Liam Forbes wrote: Hello, I’m in the process of setting up SLURM 15.08.X for the first time. I’ve got a head node and ten compute nodes working fine for serial and parallel

[slurm-dev] Re: Slurmd restart without losing jobs?

2015-10-12 Thread Paul Edmon
issue I have ever seen where this becomes a problem is in fringe cases during major version upgrades, even then it is rare. -Paul Edmon- On 10/12/2015 3:57 AM, Robbert Eggermont wrote: Hello, Some modifications to the slurm.conf require me to restart the slurmd daemons on all nodes

[slurm-dev] Re: Slurmd restart without losing jobs?

2015-10-12 Thread Paul Edmon
commit hook uses scontrol to run a test on the conf before pushing. This typically catches most errors. Not all though. -Paul Edmon- On 10/12/2015 12:41 PM, Antony Cleave wrote: While this is true be very, very careful when restarting the slurmd on the controller node. it's quite easy to miss

[slurm-dev] Re: the nodes in state down*

2015-10-08 Thread Paul Edmon
Typically that means that the master is having problems communicating with the nodes. I would check your networking, especially your ACLs. -Paul Edmon- On 10/08/2015 10:15 AM, Fany Pagés Díaz wrote: I have a cluster with 3 nodes, and yesterday it was incorrectly turned off by electrical

[slurm-dev] Re: the nodes in state down*

2015-10-08 Thread Paul Edmon
The only other thing I can think of is to check that the node daemons are up and okay. -Paul Edmon- On 10/08/2015 11:04 AM, Fany Pagés Díaz wrote: My network looks fine, and the communication with the nodes too; the problem is with slurm, the slurmd daemon has failed, I never made any

[slurm-dev] RE: slurm job priorities depending on the number of each user's jobs

2015-10-06 Thread Paul Edmon
effect to what you are asking for. That said they aren't exactly analogous and I can see situations where one would want to do this sort of thing so that no one person can monopolize the queue. -Paul Edmon- On 10/6/2015 6:30 PM, Kumar, Amit wrote: Just wanted to jump the wagon, this feature

[slurm-dev] cgroups, julia, and threads

2015-09-24 Thread Paul Edmon
and --cpu_bind=none although none have these have prevented the Fortran programs from clumping on at least a few cores. I would greatly appreciate your help in trying to figure out how to prevent this behavior under the newly changed core permissions." Thanks for the help. -Paul Edmon-

[slurm-dev] Slurmctld Thread Count

2015-07-09 Thread Paul Edmon
of the scheduler by bumping this up if we can to 1024, or at least test to see what happens when we do. Thanks. -Paul Edmon-

[slurm-dev] Re: Slurm integration scripts with Matlab

2015-06-16 Thread Paul Edmon
So the new scripts are in the R2015b prerelease http://www.mathworks.com/downloads/web_downloads The ones we have are customized for our site. Mathworks recommends contacting them for assistance with customizing them for your own. -Paul Edmon- On 06/10/2015 09:53 AM, Sean McGrath wrote

[slurm-dev] Re: Slurm integration scripts with Matlab

2015-06-10 Thread Paul Edmon
Huh. Let me ask around here and see if we can share what they gave us with the community. -Paul Edmon- On 6/10/2015 9:04 AM, Hadrian Djohari wrote: Re: [slurm-dev] Re: Slurm integration scripts with Matlab Hi Paul and others, We have contacted Mathworks and they only came back with the 2011

[slurm-dev] Re: Slurm integration scripts with Matlab

2015-06-09 Thread Paul Edmon
We've been working with Mathworks to get this sorted so that DCS and Matlab can work with Slurm. I think if you contact them you should be able to get the scripts they gave us. -Paul Edmon- On 6/9/2015 5:24 PM, Hadrian Djohari wrote: Slurm integration scripts with Matlab Hi, We

[slurm-dev] Re: Slurm versions 14.11.6 is now available

2015-04-24 Thread Paul Edmon
happened to us a couple of times when trying to debug the scheduler. So in reality the optimization did nothing in the first place. So any concerns about speed should be alleviated. -Paul Edmon- On 04/24/2015 03:31 PM, Chris Read wrote: Re: [slurm-dev] Re: Slurm versions 14.11.6 is now available

[slurm-dev] Re: Slurm versions 14.11.6 is now available

2015-04-24 Thread Paul Edmon
Correction 50,000 jobs per day. That 1.5 million is per month. My bad. Still no degradation in performance seen in our environment. -Paul Edmon- On 04/24/2015 03:31 PM, Chris Read wrote: Re: [slurm-dev] Re: Slurm versions 14.11.6 is now available On Fri, Apr 24, 2015 at 4:41 AM, Janne

[slurm-dev] Upgrade Rollbacks

2015-04-02 Thread Paul Edmon
I've been curious about this for a bit. What is the procedure for rolling back a minor and major release of slurm in case something goes wrong? -Paul Edmon-

[slurm-dev] Re: Problems running job

2015-03-31 Thread Paul Edmon
Do you have all the ports open between all the compute nodes as well? Since slurm builds a tree to communicate, all the nodes need to talk to every other node on those ports and do so without a huge amount of latency. You might want to try to up your timeouts. -Paul Edmon- On 03/31/2015

[slurm-dev] --reboot

2015-03-30 Thread Paul Edmon
to unexpected reboot. Is there a way to suppress this when the node is rebooted by this flag? Obviously the reboot wasn't unexpected as slurm was aware of it due to the flag. -Paul Edmon-

[slurm-dev] Re: slurmctld thread number blowups leading to deadlock in 14.11.4

2015-03-27 Thread Paul Edmon
at defer or max_rpc_cnt. -Paul Edmon- On 03/27/2015 07:32 AM, Mehdi Denou wrote: Hi, Using gdb you can retrieve which thread own the locks on the slurmctld internal structures (and block all the others). Then it will be easier to understand what is happening. Le 27/03/2015 12:24, Stuart Rankin

[slurm-dev] Re: slurm on NFS for a cluster?

2015-03-26 Thread Paul Edmon
Yeah, we use puppet and yum to manage our stack. Works pretty well and scales nicely. -Paul Edmon- On 03/26/2015 11:46 AM, Jason Bacon wrote: +1 for using package managers in general. On our CentOS clusters, I do the munge and slurm installs using pkgsrc (+ pkgsrc-wip). http

[slurm-dev] Re: slurm on NFS for a cluster - Part II

2015-03-25 Thread Paul Edmon
All of them should be owned by munge. Furthermore, for security's sake I would make them all accessible only to munge, at least the etc one. -Paul Edmon- On 03/25/2015 10:29 AM, Jeff Layton wrote: I assume the same is true for /var/log/munge and /var/run/munge? How about /etc/munge? Thanks
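Illustrative commands for that ownership and permission scheme:

    # Munge directories owned by the munge user; keep the key directory private
    chown -R munge:munge /etc/munge /var/log/munge /var/run/munge
    chmod 700 /etc/munge
    chmod 400 /etc/munge/munge.key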

[slurm-dev] Re: slurm on NFS for a cluster - Part II

2015-03-25 Thread Paul Edmon
Yea, that folder and the files inside need to be owned by munge. -Paul Edmon- On 03/25/2015 09:54 AM, Jeff Layton wrote: Good morning, Thanks for all of the advice in regard to slurm on NFS. I've started on my slurm quest by installing munge but I'm having some trouble. I'm not sure

[slurm-dev] Re: successful systemd service start on RHEL7?

2015-03-24 Thread Paul Edmon
I have tried building slurm 14.11.4 on CentOS7 but it never quite worked right. I'm not sure if it has been vetted for RHEL7 yet. I didn't dig too deeply though when I did build it as I just figured it wasn't ready for RHEL7. -Paul Edmon- On 03/24/2015 10:32 AM, Fred Liu wrote: Hi

[slurm-dev] Re: slurm on NFS for a cluster?

2015-03-24 Thread Paul Edmon
then control the version via RPM installs. -Paul Edmon- On 3/24/2015 4:22 PM, Jason Bacon wrote: I ran one of our CentOS clusters this way for about a year and found it to be more trouble than it was worth. I recently reconfigured it to run all system services from local disks so that nodes

[slurm-dev] Re: slurm on NFS for a cluster?

2015-03-24 Thread Paul Edmon
Yup, that's exactly what we do. We make sure to export it read-only and make sure that it is synced and hard mounted. Not much else to it. -Paul Edmon- On 03/24/2015 03:43 PM, Jeff Layton wrote: Good afternoon, I apologize for the newb question but I'm setting up slurm for the first
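As a sketch, the export and mount options being described would look roughly like this (hosts and paths are placeholders):

    # /etc/exports on the NFS server (read-only, synchronous)
    /export/slurm  node*(ro,sync)

    # /etc/fstab entry on the clients (hard mount)
    nfsserver:/export/slurm  /nfs/slurm  nfs  ro,hard  0 0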

[slurm-dev] Re: slurm on NFS for a cluster?

2015-03-24 Thread Paul Edmon
Interesting. Yeah we use v3 here. Hadn't tried out v4, and good thing we didn't then. -Paul Edmon- On 03/24/2015 04:05 PM, Uwe Sauter wrote: And if you are planning on using cgroups, don't use NFSv4. There are problems that cause the NFS client process to freeze (and with that freeze

[slurm-dev] Re: SlurmDBD Archiving

2015-03-10 Thread Paul Edmon
Oh and for the record we are running 14.11.4 -Paul Edmon- On 03/10/2015 09:26 AM, Paul Edmon wrote: So when I tried to do an archive dump I got the following error. What does this mean? [root@holy-slurm01 slurm]# sacctmgr -i archive dump sacctmgr: error: slurmdbd: Getting response

[slurm-dev] Re: SlurmDBD Archiving

2015-03-10 Thread Paul Edmon
Is it safe to try again? -Paul Edmon- On 03/06/2015 03:07 PM, Paul Edmon wrote: Ah, okay, that was the command I was looking for. I wasn't sure how to force it. Thanks. -Paul Edmon- On 03/06/2015 01:43 PM, Danny Auble wrote: It looks like I might stand corrected though. It looks like you

[slurm-dev] Re: SlurmDBD Archiving

2015-03-10 Thread Paul Edmon
as a feature in a future release? -Paul Edmon- On 03/10/2015 11:18 AM, Danny Auble wrote: The fatal you received means your query lasted more than 15 minutes, mysql deemed it hung and aborted. You can increase the timeout for innodb_lock_wait_timeout in your my.cnf and try again
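An illustrative my.cnf fragment for the timeout being discussed (values are examples only):

    # /etc/my.cnf (fragment)
    [mysqld]
    innodb_lock_wait_timeout=900
    innodb_buffer_pool_size=1024M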

[slurm-dev] Re: SlurmDBD Archiving

2015-03-06 Thread Paul Edmon
Okay, that's what I suspected. We set it to 6 months. So I guess then the purge will happen on April 1st. -Paul Edmon- On 03/06/2015 12:33 PM, Danny Auble wrote: Paul, do you have Purge* set up in the slurmdbd.conf? Archiving takes place during the Purge process. If no Purge values

[slurm-dev] Re: SlurmDBD Archiving

2015-03-06 Thread Paul Edmon
Ah, okay, that was the command I was looking for. I wasn't sure how to force it. Thanks. -Paul Edmon- On 03/06/2015 01:43 PM, Danny Auble wrote: It looks like I might stand corrected though. It looks like you will have to wait for the month to go by before the purge starts

[slurm-dev] Requeue Exit

2015-03-03 Thread Paul Edmon
into pending state in hold, meaning their priority is zero. Separate multiple exit code by a comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. -Paul
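The text above appears to describe the RequeueExit/RequeueExitHold slurm.conf options; an illustrative fragment (exit codes are examples):

    # slurm.conf (fragment)
    RequeueExit=142,143        # requeue and run again normally
    RequeueExitHold=64,65      # requeue held (priority 0, JOB_SPECIAL_EXIT state)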

[slurm-dev] Re: Requeue Exit

2015-03-03 Thread Paul Edmon
Basically the node cuts out due to hardware issues and the jobs are requeued. I'm just trying to figure out why it sent them into a held state as opposed to just simply requeueing as normal. Thoughts? -Paul Edmon- On 03/03/2015 12:11 PM, David Bigagli wrote: There are no default values

[slurm-dev] Re: Requeue Exit

2015-03-03 Thread Paul Edmon
We are definitely using the default for that one. So it should be requeueing just fine. -Paul Edmon- On 03/03/2015 01:05 PM, Lipari, Don wrote: It looks like the governing config parameter would be: JobRequeue This option controls what to do by default after a node failure

[slurm-dev] Re: Requeue Exit

2015-03-03 Thread Paul Edmon
In this case the Node was in a funny state where it couldn't resolve user id's. So right after the job tried to launch it failed and requeued. We just let the scheduler do what it will when it lists Node_fail. -Paul Edmon- On 03/03/2015 01:20 PM, David Bigagli wrote: How do you set

[slurm-dev] Re: Requeue Exit

2015-03-03 Thread Paul Edmon
Ah, good to know. I do prefer that behavior, just didn't expect it. Thanks. -Paul Edmon- On 03/03/2015 02:00 PM, David Bigagli wrote: Ah ok, the job failed to launch in this case Slurm requeue the job in held state, the previous behaviour was to terminate the job. The reason

[slurm-dev] Re: slurm.conf consistent across all nodes

2015-02-02 Thread Paul Edmon
\ ThreadsPerCore=1 Feature=intel Gres=gpu:2 -Paul Edmon- On 2/2/2015 1:09 PM, Bruce Roberts wrote: Yes. All nodes and their resources need to be defined in the slurm.conf on each node, not a different .conf on each node. On 02/02/2015 10:04 AM, Slurm User wrote: slurm.conf consistent

[slurm-dev] Re: slurm.conf consistent across all nodes

2015-02-02 Thread Paul Edmon
Yeah, that's good to get started for a conf, but then following the man page is the next step. -Paul Edmon- On 2/2/2015 1:29 PM, Slurm User wrote: Re: [slurm-dev] Re: slurm.conf consistent across all nodes Ian, Paul Thanks for your replies, that makes sense!!! I was using

[slurm-dev] Re: Partition for unused resources until needed by any other partition

2014-10-20 Thread Paul Edmon
\ AllowGroups=important_people \ Nodes=blah # JOB PREEMPTION PreemptType=preempt/partition_prio PreemptMode=REQUEUE Since serial_requeue is the lowest priority it gets scheduled last and if any jobs come in from the higher priority queue it requeues the lower priority jobs. -Paul Edmon
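A hedged reconstruction of the kind of configuration being quoted (partition names, priorities, and node lists are illustrative):

    # slurm.conf (fragment)
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
    PartitionName=priority       Nodes=node[01-64] Priority=10 PreemptMode=OFF     AllowGroups=important_people
    PartitionName=serial_requeue Nodes=node[01-64] Priority=1  PreemptMode=REQUEUE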

[slurm-dev] Re: Partition for unused resources until needed by any other partition

2014-10-20 Thread Paul Edmon
These parameters work well for a cluster of 50,000 cores, 57 queues, and about 40,000 jobs per day. We are running 14.03.8 -Paul Edmon- On 10/20/2014 02:19 PM, Mikael Johansson wrote: Hello, Yeah, I looked at that, and have now four partitions defined like this: PartitionName=short

[slurm-dev] 14.03 FlexLM

2014-07-02 Thread Paul Edmon
If memory serves I thought that 14.03 was supposed to support hooking into FlexLM licensing. However, I can't find any documentation on that. Was that pushed off to a future release? -Paul Edmon-

[slurm-dev] Job Array sacct feature request

2014-06-05 Thread Paul Edmon
I don't know if this has been done in the newer versions of slurm but it would be good to have sacct be able to list both the JobID and the index of the Job Array if it is a job array. Thanks. -Paul Edmon-

[slurm-dev] Re: QoS Feature Requests

2014-05-21 Thread Paul Edmon
into QoS? -Paul Edmon- On 5/21/2014 6:52 PM, je...@schedmd.com wrote: Quoting Paul Edmon ped...@cfa.harvard.edu: We have just started using QoS here and I was curious about a few features which would make our lives easier. 1. Spillover/overflow: Essentially if you use up one QoS you would

[slurm-dev] Re: Removing Job from Slurm Database

2014-04-22 Thread Paul Edmon
Well more like the naive ones namely: sacctmgr delete job JobID How do you set the endtime? Do you do that via scontrol? -Paul Edmon- On 04/21/2014 10:14 PM, Danny Auble wrote: What are the obvious ones? I would expect setting the end time to the start time and state to 4 (I think

[slurm-dev] Re: Removing Job from Slurm Database

2014-04-22 Thread Paul Edmon
Thanks. Sorry, I forgot about that thread. I'm wagering that the jobs got orphaned due to timing out. Essentially they actually launched but they didn't successfully update the database because it was busy. -Paul Edmon- On 04/22/2014 12:15 PM, Danny Auble wrote: Paul I think this was covered

[slurm-dev] srun and node unknown state

2014-04-21 Thread Paul Edmon
sbatch die with an error rather than have srun just hang up? Thanks for any insight. -Paul Edmon-

[slurm-dev] Removing Job from Slurm Database

2014-04-21 Thread Paul Edmon
Is there a way to delete a JobID and its relevant data from the slurm database? I have a user that I want to remove but there is a job which slurm thinks is not complete that is preventing me. I want slurm to just remove that job data as it shouldn't impact anything. -Paul Edmon-

[slurm-dev] Re: srun and node unknown state

2014-04-21 Thread Paul Edmon
is looping over mpirun's like this: do i=1,1000 mpirun -np 64 ./executable enddo Each run lasts about 5 minutes. If one of the mpirun's fails to launch the entire thing hangs. It would be better if srun kept trying instead of just failing. -Paul Edmon- On 4/16/2014 11:16 PM, Paul Edmon
