[slurm-dev] Jobs at Harvard Research Computing
We have a number of openings here at Harvard FAS RC. If you are interested please check out our employment page for details: https://www.rc.fas.harvard.edu/about/employment/ -Paul Edmon-
[slurm-dev] Re: multiple MPI versions with slurm
I would build MPI using the pmi libraries and slurm with pmi support. Then you can launch any version of MPI using the same srun command and leveraging pmi. -Paul Edmon- On 07/21/2017 10:03 AM, Manuel Rodríguez Pascual wrote: Hi all, I'm trying to provide support to users demanding different versions of MPI, namely mvapich2 and OpenMPI. In particular, I'd like to support "srun -n X ./my_parallel_app", maybe with a "--with-mpi=X" flag. Has anybody got any suggestion on how to do that? I am kind of new to cluster administration, so probably there is some obvious solution that I am not aware of. Cheers, Manuel
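A sketch of what Paul's suggestion can look like in practice (install prefixes and module names here are made up; check your site's paths): build each MPI against Slurm's PMI library, then launch with plain srun regardless of which MPI the user loaded.

```shell
# Build OpenMPI against Slurm's PMI library (hypothetical prefixes)
./configure --prefix=/opt/openmpi-2.1 --with-pmi=/usr/local/slurm
make && make install

# Users then pick an MPI via their environment and launch with the same srun:
module load openmpi/2.1      # or mvapich2/2.2, built the same way
srun -n 16 --mpi=pmi2 ./my_parallel_app
```

mvapich2 takes an analogous `--with-pmi` style configure option; the point is that both stacks speak PMI, so srun does the process launch and no mpirun wrapper is needed.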
[slurm-dev] Re: SLURM automatically use idle nodes in reserved partitions?
You can use the REQUEUE partition functionality in slurm to accomplish this. We do that here. Basically we have a high priority partition that is what the hardware owners use and then a lower priority partition that backfills it with the requeue option. If the high priority partition has idle resources, the lower priority uses it. If the high priority needs the resources it will requeue the jobs from the lower priority partition. -Paul Edmon- On 7/11/2017 4:29 PM, David Perel wrote: Hello -- Say on a cluster researcher X has his own reserved partition, XP, where normally only X can run jobs. Can SLURM be configured so that if some resource metric(s) (e.g., load average, memory usage) on XP nodes is below a given level for a given time period (such nodes are labelled "XP-L"), then: - if researcher Y is authorized to use XP, Y's jobs can automatically use the XP-L nodes; and - as soon as X submits jobs to XP for which XP-L are needed, Y's jobs on the XP-L nodes are moved off of those nodes to wherever they can run, or queued Thanks. ___ David Perel IST Academic and Research Computing Systems (ARCS) New Jersey Institute of Technology dav...@njit.edu
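A minimal slurm.conf sketch of the setup Paul describes (partition, node, and group names are invented): the owner partition gets a higher PriorityTier, and the overlapping backfill partition's jobs are requeued when the owner needs the nodes.

```conf
# slurm.conf (hypothetical names)
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=xp         Nodes=node[01-16] PriorityTier=10 AllowGroups=xp_owners
PartitionName=xp_requeue Nodes=node[01-16] PriorityTier=1  PreemptMode=REQUEUE
```

Jobs submitted to xp_requeue run on idle xp nodes and are requeued (not killed outright) when jobs in xp need the resources.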
[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database
Yeah, we keep around a test cluster environment for that purpose to vet slurm upgrades before we roll them on the production cluster. Thus far no problems. However, paranoia is usually a good thing for cases like this. -Paul Edmon- On 06/26/2017 07:30 AM, Ole Holm Nielsen wrote: On 06/26/2017 01:24 PM, Loris Bennett wrote: We have also upgraded in place continuously from 2.2.4 to currently 16.05.10 without any problems. As I mentioned previously, it can be handy to make a copy of the statesave directory, once the daemons have been stopped. However, if you want to know how long the upgrade might take, then yours is a good approach. What is your use case here? Do you want to inform the users about the length of the outage with regard to job submission? I want to be 99.9% sure that upgrading (my first one) will actually work. I also want to know roughly how long the slurmdbd will be down so that the cluster doesn't kill all jobs due to timeouts. Better to be safe than sorry. I don't expect to inform the users, since the operation is expected to run smoothly without troubles for user jobs. Thanks, Ole Re: [slurm-dev] Dry run upgrade procedure for the slurmdbd database We did it in place, worked as noted on the tin. It was less painful than I expected. TBH, your procedures are admirable, but you shouldn't worry - it's a relatively smooth process. cheers L. -- "Mission Statement: To provide hope and inspiration for collective action, to build collective power, to achieve collective transformation, rooted in grief and rage but pointed towards vision and dreams." - Patrisse Cullors, Black Lives Matter founder On 26 June 2017 at 20:04, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote: We're planning to upgrade Slurm 16.05 to 17.02 soon. The most critical step seems to me to be the upgrade of the slurmdbd database, which may also take tens of minutes.
I thought it's a good idea to test the slurmdbd database upgrade locally on a drained compute node in order to verify both correctness and the time required. I've developed the dry run upgrade procedure documented in the Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm Question 1: Would people who have real-world Slurm upgrade experience kindly offer comments on this procedure? My testing was actually successful, and the database conversion took less than 5 minutes in our case. A crucial step is starting the slurmdbd manually after the upgrade. But how can we be sure that the database conversion has been 100% completed? Question 2: Can anyone confirm that the output "slurmdbd: debug2: Everything rolled up" indeed signifies that conversion is complete? Thanks, Ole
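Ole's dry-run procedure boils down to roughly the following (database names and paths are examples; the linked Wiki page is the authoritative version):

```shell
# On the production DB host: dump the accounting database
mysqldump slurm_acct_db > slurm_acct_db.sql

# On a drained test node running a scratch MySQL/MariaDB instance:
mysql -e 'CREATE DATABASE slurm_acct_db'
mysql slurm_acct_db < slurm_acct_db.sql

# Install the NEW slurmdbd on the test node, then run it in the
# foreground with verbose logging and time the schema conversion
time slurmdbd -D -vvv
```

Watching the foreground log is also how you answer Question 2: the conversion messages appear before slurmdbd settles into its normal rollup cycle.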
[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02
We actually use: scontrol suspend I will note though that with the suspend behavior that slurm has, it is best to do this while the partitions are set to DOWN so new jobs don't try to schedule. Otherwise you will end up with oversubscribed nodes. Alternatively, SIGSTOP is a good route to go. -Paul Edmon- On 06/20/2017 10:32 AM, Loris Bennett wrote: Hi Nick, We do our upgrades while full production is up and running. We just stop the Slurm daemons, dump the database and copy the statesave directory just in case. We then do the update, and finally restart the Slurm daemons. We only lost jobs once during an upgrade back around 2.2.6 or so, but that was due to a rather brittle configuration provided by our vendor (the statesave path contained the Slurm version), rather than Slurm itself, and was before we had acquired any Slurm expertise ourselves. Paul: How do you pause the jobs? SIGSTOP all the user processes on the cluster? Cheers, Loris Paul Edmon <ped...@cfa.harvard.edu> writes: If you follow the guide on the Slurm website you shouldn't have many problems. We've made it standard practice here to set all partitions to DOWN and suspend all the jobs when we do upgrades. This has led to far greater stability. So we haven't lost any jobs in an upgrade. The only weirdness we have seen is if jobs exit while the DB upgrade is going. Sometimes it can leave residual jobs in the DB that weren't properly closed out. This is why we pause all the jobs, as it makes it such that we don't end up with jobs exiting before the DB is back. In 16.05+ you have the sacctmgr show runawayjobs feature, which can clean up all those orphan jobs. So it's not as much of a concern anymore. Beyond that we follow the guide at the bottom of this page: https://slurm.schedmd.com/quickstart_admin.html I haven't tried going two major versions at once though. The docs indicate that it should work fine. We generally try to keep pace with current stable.
Given that you only have 100,000 jobs your upgrade should probably go fairly quickly. I could imagine around 10-15 minutes. Our DB has several million jobs and it takes about 30 min to an hour depending on what operations are being done. -Paul Edmon- On 06/20/2017 09:37 AM, Nicholas McCollum wrote: I'm about to update 15.08 to the latest SLURM in August and would appreciate any notes you have on the process. I'm especially interested in maintaining the DB as well as associations. I'd also like to keep the pending job list if possible. I've only got around 100,000 jobs in the DB so far, since January. Thanks Nick McCollum Alabama Supercomputer Authority On Jun 20, 2017 8:07 AM, Paul Edmon <ped...@cfa.harvard.edu> wrote: Yeah, that sounds about right. Changes between major versions can take quite a bit of time. In the past I've seen upgrades take 2-3 hours for the DB. As for ways to speed it up. Putting the DB on newer hardware if you haven't already helps quite a bit (depends on architecture as to how much gain you will get, we went from AMD Abu Dhabi to Intel Broadwell and saw a factor of 3-4 speed improvement). Upgrading to the latest version of MariaDB if you are on an old version of MySQL can get you about 30-40%. Doing all of these whittled our DB upgrade times for major upgrades to about 30 min or so. Beyond that I imagine some more specific DB optimization tricks could be done, but I'm not a DB admin so I won't venture to say. -Paul Edmon- On 06/20/2017 08:42 AM, Tim Fora wrote: > Hi, > > Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to > start. Logs show most of the time was spent on this step and other table > changes: > > adding column admin_comment after account in table > > Does this sound right? Any ideas to help things speed up. > > Thanks, > Tim > >
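Paul's pause-everything procedure can be sketched roughly like this (an untested outline; try it on a test cluster before a real upgrade):

```shell
# Mark every partition DOWN so nothing new gets scheduled
for p in $(sinfo -h -o '%P' | tr -d '*'); do
    scontrol update PartitionName=$p State=DOWN
done

# Suspend every running job
squeue -h -t RUNNING -o '%i' | xargs -r -n 1 scontrol suspend

# ... stop daemons, upgrade slurmdbd/slurmctld/slurmd, restart ...

# Resume jobs and reopen the partitions
squeue -h -t SUSPENDED -o '%i' | xargs -r -n 1 scontrol resume
for p in $(sinfo -h -o '%P' | tr -d '*'); do
    scontrol update PartitionName=$p State=UP
done
```

The DOWN-before-suspend ordering matters, per the note above: suspended jobs release their CPUs to the scheduler, so without DOWN partitions new jobs would land on those nodes and oversubscribe them on resume.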
[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02
If you follow the guide on the Slurm website you shouldn't have many problems. We've made it standard practice here to set all partitions to DOWN and suspend all the jobs when we do upgrades. This has led to far greater stability. So we haven't lost any jobs in an upgrade. The only weirdness we have seen is if jobs exit while the DB upgrade is going. Sometimes it can leave residual jobs in the DB that weren't properly closed out. This is why we pause all the jobs, as it makes it such that we don't end up with jobs exiting before the DB is back. In 16.05+ you have the sacctmgr show runawayjobs feature, which can clean up all those orphan jobs. So it's not as much of a concern anymore. Beyond that we follow the guide at the bottom of this page: https://slurm.schedmd.com/quickstart_admin.html I haven't tried going two major versions at once though. The docs indicate that it should work fine. We generally try to keep pace with current stable. Given that you only have 100,000 jobs your upgrade should probably go fairly quickly. I could imagine around 10-15 minutes. Our DB has several million jobs and it takes about 30 min to an hour depending on what operations are being done. -Paul Edmon- On 06/20/2017 09:37 AM, Nicholas McCollum wrote: I'm about to update 15.08 to the latest SLURM in August and would appreciate any notes you have on the process. I'm especially interested in maintaining the DB as well as associations. I'd also like to keep the pending job list if possible. I've only got around 100,000 jobs in the DB so far, since January. Thanks Nick McCollum Alabama Supercomputer Authority On Jun 20, 2017 8:07 AM, Paul Edmon <ped...@cfa.harvard.edu> wrote: Yeah, that sounds about right. Changes between major versions can take quite a bit of time. In the past I've seen upgrades take 2-3 hours for the DB. As for ways to speed it up.
Putting the DB on newer hardware if you haven't already helps quite a bit (depends on architecture as to how much gain you will get, we went from AMD Abu Dhabi to Intel Broadwell and saw a factor of 3-4 speed improvement). Upgrading to the latest version of MariaDB if you are on an old version of MySQL can get you about 30-40%. Doing all of these whittled our DB upgrade times for major upgrades to about 30 min or so. Beyond that I imagine some more specific DB optimization tricks could be done, but I'm not a DB admin so I won't venture to say. -Paul Edmon- On 06/20/2017 08:42 AM, Tim Fora wrote: > Hi, > > Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to > start. Logs show most of the time was spent on this step and other table > changes: > > adding column admin_comment after account in table > > Does this sound right? Any ideas to help things speed up. > > Thanks, > Tim > >
[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02
That's a really good point. Pruning your DB can also help with this. -Paul Edmon- On 06/20/2017 09:19 AM, Loris Bennett wrote: Hi Tim, Tim Fora <tf...@riseup.net> writes: Hi, Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to start. Logs show most of the time was spent on this step and other table changes: adding column admin_comment after account in table Does this sound right? Any ideas to help things speed up. It probably depends a great deal on how many entries you have in your database and what sort of hardware you have. We are up to around 1.6 million jobs and have never purged anything. I seem to remember the last update between major releases taking long enough to allow me to get slightly uneasy, but not long enough for me to really worry, so I guess it was probably around 10-15 minutes. Our CPUs are around 6 years old, but the DB is on an SSD. HTH Loris
[slurm-dev] Re: Long delay starting slurmdbd after upgrade to 17.02
Yeah, that sounds about right. Changes between major versions can take quite a bit of time. In the past I've seen upgrades take 2-3 hours for the DB. As for ways to speed it up. Putting the DB on newer hardware if you haven't already helps quite a bit (depends on architecture as to how much gain you will get, we went from AMD Abu Dhabi to Intel Broadwell and saw a factor of 3-4 speed improvement). Upgrading to the latest version of MariaDB if you are on an old version of MySQL can get you about 30-40%. Doing all of these whittled our DB upgrade times for major upgrades to about 30 min or so. Beyond that I imagine some more specific DB optimization tricks could be done, but I'm not a DB admin so I won't venture to say. -Paul Edmon- On 06/20/2017 08:42 AM, Tim Fora wrote: Hi, Upgraded from 15.08 to 17.02. It took about one hour for slurmdbd to start. Logs show most of the time was spent on this step and other table changes: adding column admin_comment after account in table Does this sound right? Any ideas to help things speed up. Thanks, Tim
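Beyond new hardware and MariaDB, the InnoDB settings commonly recommended for slurmdbd make a real difference to conversion time (values below are illustrative; size the buffer pool to your RAM):

```conf
# /etc/my.cnf.d/slurmdbd.cnf (example values)
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
```

A too-small buffer pool forces the ALTER TABLE work during a schema upgrade to thrash on disk, which is where the hours go on large job tables.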
[slurm-dev] Re: understanding of Purge in Slurmdb.conf
We use purge here without much of a problem. Though I will caution that if you have a large database that you have never purged before (when we initially did it we had 3 years of data accrued and we purged down to 6 months), you should do it in stages. So do maybe a few months at a time to walk it up to the level you need. You can also have the purge archive to disk, which can be handy if you want to maintain historical info. The purge itself runs monthly at midnight on the 1st of the month. -Paul Edmon- On 06/08/2017 11:36 AM, Rohan Gadalkar wrote: Re: [slurm-dev] Re: understanding of Purge in Slurmdb.conf Hello Dr. Loris, I will go through the document and will try to work on it. I'll reply to you on the same thread as soon as I work on this. Thanks for the example. Regards, Rohan On Thu, Jun 8, 2017 at 3:17 PM, Loris Bennett <loris.benn...@fu-berlin.de <mailto:loris.benn...@fu-berlin.de>> wrote: Rohan Gadalkar <rohangadal...@gmail.com <mailto:rohangadal...@gmail.com>> writes: > Re: [slurm-dev] Re: understanding of Purge in Slurmdb.conf > > Hello Dr. Loris, > > I want to understand how the purge works. > > I mentioned the points like PurgeJobAfter etc., which I did not understand. > > How do I use these parameters in slurmdb.conf? > > Is there any way that you can make me understand this? I'm not sure about that, but if, for example, you write PurgeJobAfter=6 in your slurmdb.conf, all database entries referring to jobs which are older than 6 months will be deleted at the beginning of each month. Disclaimer: I haven't used these settings - I am repeating what it says in the documentation. Cheers, Loris -- Dr. Loris Bennett (Mr.) ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de <mailto:loris.benn...@fu-berlin.de>
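A slurmdbd.conf sketch of staged purging with archiving, as Paul describes (values are examples; on a large never-purged database, start PurgeJobAfter high and walk it down over several months):

```conf
# slurmdbd.conf (example values)
ArchiveJobs=yes
ArchiveSteps=yes
ArchiveDir=/var/spool/slurm/archive
PurgeJobAfter=6months      # start higher (e.g. 30months) and step down
PurgeStepAfter=6months
PurgeEventAfter=12months
PurgeSuspendAfter=12months
```

With the Archive* options set, purged records are written to files under ArchiveDir before deletion, preserving the historical info.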
[slurm-dev] Re: Partition default job time limit
Yes. Use the DefaultTime option. *DefaultTime* Run time limit used for jobs that don't specify a value. If not set then MaxTime will be used. Format is the same as for MaxTime. https://slurm.schedmd.com/slurm.conf.html -Paul Edmon- On 05/09/2017 05:35 AM, Georg Hildebrand wrote: Partition default job time limit Hi @here, Is it possible to have a default job time limit for an slurm partition that is lower than the MaxTime? Viele Grüße / kind regards Georg
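For example, a partition line along these lines (name, nodes, and limits are made up) gives jobs a 1-hour default while still allowing up to 24 hours on request:

```conf
# slurm.conf (hypothetical partition)
PartitionName=general Nodes=node[01-32] DefaultTime=01:00:00 MaxTime=24:00:00 State=UP
```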
[slurm-dev] Re: Specify node name for a job
*-w*, *--nodelist*= Request a specific list of hosts. The job will contain /all/ of these hosts and possibly additional hosts as needed to satisfy resource requirements. The list may be specified as a comma-separated list of hosts, a range of hosts (host[1-5,7,...] for example), or a filename. The host list will be assumed to be a filename if it contains a "/" character. If you specify a minimum node or processor count larger than can be satisfied by the supplied host list, additional resources will be allocated on other nodes as needed. Duplicate node names in the list will be ignored. The order of the node names in the list is not important; the node names will be sorted by Slurm. On 5/4/2017 12:49 PM, Mahmood Naderan wrote: Specify node name for a job Hi, Sorry for this simple question but I didn't find that in the verbose document pages. I want to specify a node name for a job. If I change #SBATCH --nodes=1 to #SBATCH --nodes=compute-0-1 Then the sbatch command returns an error where the node count is wrong. How can I specify node name in the sbatch script? Regards, Mahmood
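So for Mahmood's case, --nodes stays a count and the host goes in --nodelist (node name taken from his example):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --nodelist=compute-0-1   # request this specific host
```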
[slurm-dev] Re: User accounting
sacct is where you want to look: https://slurm.schedmd.com/sacct.html -Paul Edmon- On 5/4/17 9:09 AM, Mahmood Naderan wrote: User accounting Hi, I read the accounting page https://slurm.schedmd.com/accounting.html however since it is quite large, I didn't get my answer! I want to know the user stats for their jobs. For example, something like: <total clock time for all jobs, including successful and not> <cores used> ... Assume a user has submitted 2 jobs with the following specs: job1: 10 minutes, 2.4GB memory, 4 cores, 1GB disk, success job2: 15 minutes, 6GB memory, 2 cores, 2GB disk, failed (due to his code and not a system error) So the report looks like: user, 2, 1, 25 min, 6, 8.4GB, 3GB, ... How can I get that? Regards, Mahmood
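As a sketch, sacct's parsable output can be aggregated per user. The sacct invocation itself needs a live cluster, so the aggregation step below is demonstrated on canned sample data (field layout follows the --format option shown):

```shell
# On a real cluster (not runnable here):
#   sacct -a --starttime=2017-01-01 --parsable2 --noheader \
#         --format=User,Elapsed,AllocCPUS,State

# Demonstrated on made-up sample data: jobs and failed jobs per user
sample='mahmood|00:10:00|4|COMPLETED
mahmood|00:15:00|2|FAILED
alice|00:05:00|1|COMPLETED'

out=$(printf '%s\n' "$sample" |
  awk -F'|' '{jobs[$1]++; if ($4 == "FAILED") fail[$1]++}
             END {for (u in jobs) printf "%s %d %d\n", u, jobs[u], fail[u] + 0}' |
  sort)
printf '%s\n' "$out"
```

The same awk pattern extends to summing Elapsed times and memory columns for the full report Mahmood sketches.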
[slurm-dev] Re: GUI Job submission portal
There is no webportal that SchedMD provides that I am aware of. That said the API exists such that one could write one if they were so inclined. The rest of the features that you like slurm does support, I'm not sure how much is exposed via the API though. One could write wrappers around the normal slurm commands to effectively do the same thing. -Paul Edmon- On 04/20/2017 06:23 AM, Parag Khuraswar wrote: Hi All, Does SLURM support below features:- Job Schedulers • Workload cum resource manager with policy-aware, resource-aware and topology-aware scheduling • Advance reservation support • Support of job submission through CLI, Web-services and APIs • Heterogeneous and multi- cluster support • Preemptive and backfill scheduling support • Application integration support • Live reconfiguration capability • SLA/Equivalent • GPU Aware scheduling Intuitive web interface to submit and monitor jobs Regards, Parag
[slurm-dev] Re: Deleting jobs in Completing state on hung nodes
Sometimes restarting slurm on the node and the master can purge the jobs as well. -Paul Edmon- On 04/10/2017 03:59 PM, Douglas Meyer wrote: Set node to drain if other jobs are running. Then down and then resume. Down will kill and clear any jobs. scontrol update nodename= state=drain reason=job_sux scontrol update nodename= state=down reason=job_sux scontrol update nodename= state=resume If it happens again either reboot or stop and restart slurm. Make sure you verify it has stopped. Doug -Original Message- From: Gene Soudlenkov [mailto:g.soudlen...@auckland.ac.nz] Sent: Monday, April 10, 2017 12:56 PM To: slurm-dev <slurm-dev@schedmd.com> Subject: [slurm-dev] Re: Deleting jobs in Completing state on hung nodes It happens sometimes - in our case epilogue code got stuck. Either check the processes and kill whichever ones belong to the user or simply reboot the nodes. Cheers, Gene -- New Zealand eScience Infrastructure Centre for eResearch The University of Auckland e: g.soudlen...@auckland.ac.nz p: +64 9 3737599 ext 89834 c: +64 21 840 825 f: +64 9 373 7453 w: www.nesi.org.nz On 11/04/17 07:52, Tus wrote: I have 2 nodes that have hardware issues and died with jobs running on them. I am not able to fix the nodes at the moment but want to delete the jobs that are stuck in completing state from slurm. I have set the nodes to DRAIN and tried scancel which did not work. How do I remove these jobs?
[slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load
You should look at LLN (least loaded nodes): https://slurm.schedmd.com/slurm.conf.html That should do what you want. -Paul Edmon- On 03/16/2017 12:54 PM, kesim wrote: Fwd: Scheduling jobs according to the CPU load -- Forwarded message -- From: *kesim* <ketiw...@gmail.com <mailto:ketiw...@gmail.com>> Date: Thu, Mar 16, 2017 at 5:50 PM Subject: Scheduling jobs according to the CPU load To: slurm-dev@schedmd.com <mailto:slurm-dev@schedmd.com> Hi all, I am a new user and I created a small network of 11 nodes, 7 CPUs per node, out of users' desktops. I configured slurm as: SelectType=select/cons_res SelectTypeParameters=CR_CPU When I submit a task with "srun -n70 task", it will fill 10 nodes with 7 tasks/node. However, I have no clue what the algorithm for choosing the nodes is. Users run programs on the nodes and some nodes are busier than others. It seems logical that the scheduler should submit the tasks to the less busy nodes, but that is not the case. In the output of sinfo -N -o '%N %O %C' I can see that jobs are allocated to node11 with a load of 2.06, leaving node4, which is totally idle. That somehow makes no sense to me.
node1 0.00 7/0/0/7
node2 0.26 7/0/0/7
node3 0.54 7/0/0/7
node4 0.07 0/7/0/7
node5 0.00 7/0/0/7
node6 0.01 7/0/0/7
node7 0.00 7/0/0/7
node8 0.01 7/0/0/7
node9 0.06 7/0/0/7
node10 0.11 7/0/0/7
node11 2.06 7/0/0/7
How can I configure slurm to fill the node with the minimum load first?
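With kesim's existing settings, least-loaded-node scheduling is a small config change; a sketch of both forms documented in slurm.conf (partition and node names are from his example):

```conf
# slurm.conf: either globally via the select plugin ...
SelectType=select/cons_res
SelectTypeParameters=CR_CPU,CR_LLN

# ... or per partition
PartitionName=desktops Nodes=node[1-11] LLN=yes
```

Run `scontrol reconfigure` (or restart slurmctld) after changing these.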
[slurm-dev] Re: Federated Cluster scheduling
I'm fairly certain that is coming in the 17.02 release. According to this presentation it was supposed to be in 16.05: https://slurm.schedmd.com/SLUG15/MultiCluster.pdf but from what I recall of the SC BoF, full functionality was not going to be available until 17.02. -Paul Edmon- On 2/12/2017 11:05 AM, Anastasia Angelopoulou wrote: Federated Cluster scheduling Hi, I would like to ask if federated cluster scheduling is available in the currently released Slurm versions. If yes, is there any documentation about it? Has anyone tried it? I would appreciate any information. Thank you. Regards, Anastasia
[slurm-dev] Re: distributing fair share allocation
I would probably look at qos's for this: https://slurm.schedmd.com/qos.html https://slurm.schedmd.com/resource_limits.html You can attach them to partitions as well which can be handy. You probably want to use things like MaxJobs, MaxWallDurationPerJob, MaxTRESperJob. -Paul Edmon- On 2/7/2017 10:11 AM, Hossein Pourreza wrote: Greetings, We have a cluster with SLURM 16.05.4 as its workload manager with fair share policy. From time to time, we need to create accounts for courses where students' usage gets charged against that account. I am wondering what the best approach would be to limit the number of cores and/or walltime of each job asked by each student without creating a dedicated allocation for a course. For example, an instructor may want to limit the walltime of student jobs to 2 hours and not more than 2 cores. She/he may also want to put a cap on the number of SUs that a student can use so a single student cannot burn all the course allocation. Thanks Hossein
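A sketch of the QOS approach for a course (names and limit values are invented; -i skips sacctmgr's confirmation prompt):

```shell
# QOS capping each job at 2 cores and 2 hours, and the whole
# course at a CPU-minutes budget so no one student burns it all
sacctmgr -i add qos course101 MaxTRESPerJob=cpu=2 \
    MaxWallDurationPerJob=02:00:00 GrpTRESMins=cpu=100000

# Attach it to the course account so student jobs inherit the limits
sacctmgr -i modify account course101 set QOS=course101
```

Per-student (rather than whole-course) SU caps would instead go on the individual user associations under the account.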
[slurm-dev] Re: Priority issue
Are there other jobs in the queue? If so then those higher priority jobs are likely waiting for resources before they run. Also the lower priority jobs probably can't fit in before the higher priority jobs run, thus they won't get backfilled. That's at least my guess. -Paul Edmon- On 01/25/2017 12:14 AM, Roe Zohar wrote: Priority issue Hello, I have 1000 jobs in the queue. The queue has 4 servers, yet only two servers have running jobs and all other jobs are pending with reason "priority". I think that only when the current running jobs finish will the other jobs start running on the other servers. Any idea why they are waiting with reason "priority" now instead of running on the idle servers? Thanks
[slurm-dev] Re: preemptive fair share scheduling
I might email them directly then or hit up their bugzilla page. They noted late last year around SC that they no longer have the personnel necessary to monitor the dev list. They are hiring though, so hopefully they will get sufficient people to watch this again, as it was good to have them chiming in occasionally to answer general questions. As to the specific slurm question, I don't think it can do that specifically. Or at least I can't see an obvious solution in slurm to do what you want. That said, I haven't put a whole ton of thought into how I might pull this off in slurm. You would probably have to do something crafty with settings and environment in slurm. Regardless, it would be nontrivial to do. You likely would have to write some external scripts or perhaps modify the slurm code to pull it off. As to your second question, as it stands the code does not do nicing for suspends. You could theoretically modify the slurm code to do that, as the whole thing is open source, and contribute it back to the community, but the feature doesn't exist currently. -Paul Edmon- On 01/06/2017 07:27 AM, Sophana Kok wrote: Re: [slurm-dev] Re: preemptive fair share scheduling On Fri, Jan 6, 2017 at 1:15 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk <mailto:ole.h.niel...@fysik.dtu.dk>> wrote: On 01/06/2017 01:07 PM, Sophana Kok wrote: Hi again, is there someone from schedmd to respond? On the contact form, it is written to contact the mailing list for technical questions. If the Slurm community doesn't get you going, you need to buy commercial support, see https://www.schedmd.com/support.php <https://www.schedmd.com/support.php>, if you want SchedMD to support you. /Ole My question is about slurm features. We already use SGE, and are wondering if slurm fits our needs. We can't buy support if we don't know whether slurm is what we are looking for...
[slurm-dev] Re: where to find completed job execution command
We do the same thing except with a prolog script which dumps the job info to a flat file so we can look up historical jobs. Sadly the slurmdbd does not store this info, so you have to do it yourself. -Paul Edmon- On 01/06/2017 04:31 AM, Loris Bennett wrote: Sean McGrath <smcg...@tchpc.tcd.ie> writes: Hi, On Thu, Jan 05, 2017 at 02:29:11PM -0800, Prasad, Bhanu wrote: Hi, Is there a convenient command like `scontrol show job id` to check more info of jobs that are completed Not to my knowledge. or any command to check the sbatch command run in that particular job How we do this is with the slurmctld epilog script: EpilogSlurmctld=/etc/slurm/slurm.epilogslurmctld Which does the following: /usr/bin/scontrol show job=$SLURM_JOB_ID > $recordsdir/$SLURM_JOBID.record The `scontrol show jobid=` record is saved to the file system for future reference if it is needed. It might be worth using the option '--oneliner' to print out the record in a single line. You could then parse it more easily for, say, then inserting the data into a table in a database. Cheers, Loris
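Loris's --oneliner suggestion makes the saved record easy to split into key=value pairs; for example, pulling one field out with awk (the record below is a shortened, made-up sample, not real scontrol output):

```shell
# A truncated, invented record as written by `scontrol show job=$ID --oneliner`
record='JobId=12345 JobName=blast UserId=alice(1000) Partition=general Command=/home/alice/run.sh'

# Extract a single key (here, the submitted command)
cmd=$(printf '%s\n' "$record" | tr ' ' '\n' | awk -F= '$1 == "Command" {print $2}')
printf '%s\n' "$cmd"
```

Note that some real fields (e.g. a Command with arguments) contain spaces, so a production parser needs to be a bit more careful than this word-splitting sketch.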
[slurm-dev] Re: Proper way to clean up ghost jobs (slurm 15.08.12-1)
I know that a query like this was added to the 16.x series. Namely: sacctmgr show runawayjobs I don't think it is in the 15.x series though. -Paul Edmon- On 01/04/2017 10:00 AM, Chris Rutledge wrote: Hello all, We recently discovered ghost jobs in the slurm_acct_db._job_table that were the result of a process crash back in July that was not corrected soon enough. Should this be corrected simply by updating the state and time_end fields for each of these records, or is there something other than raw SQL we should be using in this case? I ran the following query to find potential ghost jobs and validated with squeue that each id_job is currently invalid, so I'm almost 100% sure I have a valid list of job_db_inx to update. Query -- select job_db_inx,id_job,time_start,time_end,state,partition,id_user from hatteras_job_table where time_start <= 1480339901 and state = 1; Thanks in advance, Chris Rutledge
[slurm-dev] Re: Best practice for adding & maintaining lots of user accounts?
We manage about 800 accounts in ours, so an order of magnitude smaller. We use the lua plugin at submission time. In it, the lua script checks if the user has an account in slurm (the script itself keeps a cache so as not to pound the database with this query). If not, it will make one with the proper associations. We key off the user's primary group ID for their default association. Thus the user's account is created when they submit their first job. This should spread out the load, unless of course you have all your users submitting simultaneously. -Paul Edmon- On 12/08/2016 09:05 PM, Tuo Chen Peng wrote: Hello, Is there a guideline / best practice for (1) maintaining a huge number of user accounts in slurm? (2) adding thousands of users in batch? We are maintaining a slurm cluster with 10k+ user accounts. But recently we needed to add thousands more users and noticed that adding new users could sometimes cause slurmctld to hang or ignore user requests for several minutes (users are currently added one-by-one using sacctmgr add user). By enabling verbose logging in slurmctld ('debug5') I can see slurmctld does 2 things when a new user is added: (1) walk the account/user tree, (2) recalculate normalized shares for every other user. I'm guessing these steps take time, and together with the usual scheduling work they hung slurmctld in our case. This might be ok if we just needed to add a handful of users, but we do need to add thousands of users to slurm. So a few minutes of hang in slurm for each user added means we can only add 1 user every 10 minutes to avoid blocking users from accessing slurm, and it will therefore take 1 day to add 120 users, and 8+ days to add 1000 users, for example. This works, but seems very inefficient. And it doesn't seem right, because with more users added, the normalized shares calculation will only take more time. So, does anyone know what might be an efficient way to add a huge number of users to slurm and maintain them?
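The effect of the lua hook is equivalent to the site running something like this at the user's first submission (account and user names here are hypothetical; -i makes sacctmgr non-interactive):

```shell
# Account keyed off the user's primary group, then the user under it
sacctmgr -i add account alice_lab Description="alice_lab group account"
sacctmgr -i add user alice Account=alice_lab DefaultAccount=alice_lab
```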
TuoChen Peng
[slurm-dev] Re: Are Backfill parameters supported per partition?
Sadly all backfill parameters are global. You can tell slurm to schedule per partition using the bf_max_job_part flag. That's about it though. -Paul Edmon- On 11/29/2016 01:09 PM, Kumar, Amit wrote: Dear SLURM, Can I specify partition specific backfill parameters? Thank you, Amit
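For reference, bf_max_job_part and the other backfill knobs are set globally via SchedulerParameters (value below is only an example):

```conf
# slurm.conf
SchedulerParameters=bf_max_job_part=50
```

This caps how many jobs the backfill scheduler will consider from any one partition per cycle, which is the closest thing to per-partition backfill tuning.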
[slurm-dev] Re: Restrict users to see only jobs of their groups
For slurmdbd you can set this: *PrivateData* This controls what type of information is hidden from regular users. By default, all information is visible to all users. User *SlurmUser*, *root*, and users with AdminLevel=Admin can always view all information. Multiple values may be specified with a comma separator. Acceptable values include: *accounts* prevents users from viewing any account definitions unless they are coordinators of them. *jobs* prevents users from viewing job records belonging to other users unless they are coordinators of the association running the job when using sacct. *reservations* restricts getting reservation information to users with operator status and above. *usage* prevents users from viewing usage of any other user. This applies to sreport. *users* prevents users from viewing information of any user other than themselves; this also makes it so users can only see associations they deal with. Coordinators can see associations of all users they are coordinator of, but can only see themselves when listing users. http://slurm.schedmd.com/slurmdbd.conf.html For the slurm.conf you can do the same: *PrivateData* This controls what type of information is hidden from regular users. By default, all information is visible to all users. User *SlurmUser* and *root* can always view all information. Multiple values may be specified with a comma separator. Acceptable values include: *accounts* (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from viewing any account definitions unless they are coordinators of them. *cloud* Powered down nodes in the cloud are visible. *jobs* Prevents users from viewing jobs or job steps belonging to other users. (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from viewing job records belonging to other users unless they are coordinators of the association running the job when using sacct. *nodes* Prevents users from viewing node state information. *partitions* Prevents users from viewing partition state information.
*reservations* Prevents regular users from viewing reservations which they can not use. *usage* Prevents users from viewing usage of any other user, this applies to sshare. (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from viewing usage of any other user, this applies to sreport. *users* (NON-SlurmDBD ACCOUNTING ONLY) Prevents users from viewing information of any user other than themselves, this also makes it so users can only see associations they deal with. Coordinators can see associations of all users they are coordinator of, but can only see themselves when listing users. http://slurm.schedmd.com/slurm.conf.html That should do what you want. -Paul Edmon- On 11/01/2016 07:53 AM, Nathan Harper wrote: Re: [slurm-dev] Restrict users to see only jobs of their groups Hi, No solution, but a 'me too'. -- *Nathan Harper*// IT Systems Lead *e: * nathan.har...@cfms.org.uk <mailto:nathan.har...@cfms.org.uk> // *t: * 0117 906 1104 // *m: * 07875 510891 // *w: * www.cfms.org.uk <http://www.cfms.org.uk%22> // Linkedin grey icon scaled <http://uk.linkedin.com/pub/nathan-harper/21/696/b81> CFMS Services Ltd// Bristol & Bath Science Park // Dirac Crescent // Emersons Green // Bristol // BS16 7FR CFMS Services Ltd is registered in England and Wales No 05742022 - a subsidiary of CFMS Ltd CFMS Services Ltd registered office // Victoria House // 51 Victoria Street // Bristol // BS1 6AD On 1 November 2016 at 11:49, Taras Shapovalov <taras.shapova...@brightcomputing.com <mailto:taras.shapova...@brightcomputing.com>> wrote: Hey guys, Could you tell me how I can configure slurm that users will only be able to see the jobs submitted by members of their own user group? Similar restriction by partition or by account is also would work for me. I cannot find a solution anywhere. Best regards, Taras
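Concretely, the settings above go into the two config files as comma-separated lists. A minimal illustrative sketch (the particular selection of values below is just an example; pick whichever you need):

```
# slurmdbd.conf -- accounting-side visibility
PrivateData=jobs,usage,users

# slurm.conf -- scheduler-side visibility (same keyword, its own value set)
PrivateData=jobs,usage,users
```

Note that slurmdbd generally needs a restart to pick up slurmdbd.conf changes, while slurmctld can reread slurm.conf via scontrol reconfigure.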
[slurm-dev] Re: How to set the maximum priority to a Slurm job? (from StackOverflow.com)
Yeah, that's usually what I do. Priority is a 32-bit int. So I would just give it something like 9. That should max out most priority, unless of course you are using that last digit in which case you may have to do something like 20. It all really depends on how you set the priority scaling in your slurm.conf. You just need to give the job a higher priority than any other job can possibly have in the queue, which is determined by your slurm.conf settings. You can use sprio to see what the priorities are for jobs in the queue to get an idea of what the max is. -Paul Edmon- On 09/30/2016 10:49 AM, Sergio Iserte wrote: Re: [slurm-dev] Re: How to set the maximum priority to a Slurm job? (from StackOverflow.com) Thanks, however, I would like to give the maximum priority. Should I give a large number to the parameter? Thank you. 2016-09-30 16:44 GMT+02:00 Paul Edmon <ped...@cfa.harvard.edu <mailto:ped...@cfa.harvard.edu>>: scontrol update jobid=jobid priority=blah That works at least on a per job basis. -Paul Edmon- On 09/30/2016 10:34 AM, Sergio Iserte wrote: Hello, this is a copy of my own StackOverflow post: http://stackoverflow.com/q/39787477/4286662 <http://stackoverflow.com/q/39787477/4286662> Maybe, here, it can be solved faster: I am using /sched/backfill/ policy and the default job priority policy (FIFO) and as administrator I need to give the maximum priority to a given job. I have found that submission options like: |--priority=| or |--nice[=adjustment]| could be useful, but I do not know which values I should assign them in order to provide the job with the highest priority. Another approach could be to set a low priority by default to all the jobs and to the special ones increase it. Any idea of how I could carry it out? Best regards, Sergio. 
-- Sergio Iserte High Performance Computing & Architectures (HPCA) Department of Computer Science and Engineering (DICC) Universitat Jaume I (UJI)
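For context on what "maxing out" means here: the multifactor plugin computes each job's priority as a weighted sum of normalized factors and stores it as a 32-bit integer. A simplified Python sketch of that arithmetic — not Slurm's actual code, just an illustration of why the ceiling depends on your PriorityWeight* settings in slurm.conf:

```python
# Simplified sketch of how the multifactor plugin combines priority
# factors: each factor is normalized to [0.0, 1.0] and multiplied by
# its PriorityWeight* value from slurm.conf. The real plugin has more
# factors (TRES, site, nice, ...) -- this just illustrates the arithmetic.

def job_priority(factors, weights):
    """factors: dict of name -> float in [0, 1]; weights: name -> int."""
    total = sum(int(factors.get(name, 0.0) * weight)
                for name, weight in weights.items())
    # Priority is stored as a 32-bit unsigned int, so it saturates.
    return min(total, 2**32 - 1)

weights = {"age": 1000, "fairshare": 10000, "partition": 1000, "qos": 1000}
print(job_priority({"age": 0.5, "fairshare": 0.25, "qos": 1.0}, weights))
# -> 4000
```

With weights like these, no job can ever score above the sum of the weights, which is why sprio on queued jobs gives you a practical upper bound to beat.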
[slurm-dev] Re: Slurm web dashboards
Nice. I might recommend Grafana. There is a nice dashboard for that here: http://giovannitorres.me/graphing-sdiag-with-graphite.html Then there are other Diamond collectors you can use to gather other statistics: https://github.com/fasrc/slurm-diamond-collector Grafana is pretty flexible for presenting data; you can also just take the graphs it generates and embed them elsewhere. -Paul Edmon- On 09/27/2016 08:21 AM, John Hearns wrote: Hello all. What are the thoughts on a Slurm 'dashboard'? The purpose being to display cluster status on a large screen monitor. I rather liked the look of this, based on dashing.io: https://github.com/julcollas/dashing-slurm/blob/master/README.md Sadly dashing.io is not being supported, and this looks two years old now. There is the slurm-web from EDF. Also a containerised dashboard by Christian Kniep (hello Christian). Any other packages out there please?
[slurm-dev] Re: Passing MCA parameters via srun
Excellent. Thanks for the info. -Paul Edmon- On 09/26/2016 04:03 PM, Paul Hargrove wrote: Re: [slurm-dev] Passing MCA parameters via srun Paul, If the user always wants a specific set of MCA options, then they should be placed in $HOME/.openmpi/mca-params.conf Otherwise, one should use environment variables with the prefix OMPI_MCA_ A far more complete answer is here: https://www.open-mpi.org/faq/?category=tuning#setting-mca-params -[another] Paul On Mon, Sep 26, 2016 at 12:52 PM, Paul Edmon <ped...@cfa.harvard.edu> wrote: Typically on our cluster we use: srun --mpi=pmi2 To launch MPI applications, as we have task affinity turned on and we want to ensure that the binding works properly (if we just invoke mpirun it doesn't bind properly). One of our users wishes to pass more options to srun to tune MPI (specifically OpenMPI 1.10.2), such as the standard MPI options: mpirun -mca btl self,openib My question is how do I get slurm to pass these options through when it invokes MPI. -Paul Edmon- -- Paul H. Hargrove phhargr...@lbl.gov Computer Languages & Systems Software (CLaSS) Group Computer Science Department Tel: +1-510-495-2352 Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
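To summarize the answer above with a concrete (hypothetical) job-script fragment: Open MPI picks up any environment variable prefixed with OMPI_MCA_, so the tuning survives an srun launch without touching mpirun. The application name and rank count here are illustrative:

```shell
# Open MPI reads MCA parameters from OMPI_MCA_-prefixed environment
# variables, so they apply even when srun launches the ranks directly.
# This is the environment-variable form of: mpirun -mca btl self,openib
export OMPI_MCA_btl=self,openib

# Then launch as usual (cluster-specific; assumes PMI2 support):
#   srun --mpi=pmi2 -n 16 ./my_mpi_app
echo "btl=$OMPI_MCA_btl"
```

The per-user alternative, as noted, is putting the same parameters in $HOME/.openmpi/mca-params.conf.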
[slurm-dev] Slurm Diamond Collectors
For those using Graphite and Diamond, check these out; they may be useful. https://github.com/fasrc/slurm-diamond-collector -Paul Edmon-
[slurm-dev] Per Job Usage
So I don't thread-hijack, here is my post in a new thread. Is there a way in sacct (or some other command) to get the RawUsage of a job, rather than the aggregate calculated by sshare? I want to see the breakdown of which job charged how much against a user's total usage and hence fairshare score. Essentially I want to reconstruct the math it used for a specific user, to see if there was a job that it charged inordinately more for, and why. My first logical step was to look at sacct, but I didn't see an entry that simply listed RawUsage for the job in terms of TRES. Even better would be a breakdown of memory and CPU charges. -Paul Edmon-
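Since sacct doesn't expose RawUsage per job, the closest approximation I know of is AllocTRES multiplied by Elapsed. A rough sketch, assuming sacct's `cpu=8,mem=16G,node=1` style AllocTRES string and `[D-]HH:MM:SS` Elapsed format; note the real fairshare charge additionally applies TRESBillingWeights and the PriorityDecayHalfLife decay, which this ignores:

```python
# Rough reconstruction of a job's raw usage from sacct output:
# TRES-seconds = allocated TRES * elapsed wall time. The actual
# fair-share charge also applies TRESBillingWeights and usage decay
# (PriorityDecayHalfLife), which this sketch deliberately ignores.

def parse_elapsed(elapsed):
    """Convert sacct Elapsed ('[D-]HH:MM:SS') to seconds."""
    days = 0
    if "-" in elapsed:
        d, elapsed = elapsed.split("-", 1)
        days = int(d)
    h, m, s = (int(x) for x in elapsed.split(":"))
    return ((days * 24 + h) * 60 + m) * 60 + s

def tres_seconds(alloc_tres, elapsed):
    """alloc_tres: sacct AllocTRES string like 'cpu=8,mem=16G,node=1'."""
    secs = parse_elapsed(elapsed)
    usage = {}
    for item in alloc_tres.split(","):
        name, value = item.split("=", 1)
        if value[-1] in "KMGT":          # memory carries a unit suffix
            scale = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}[value[-1]]
            value = float(value[:-1]) * scale
        usage[name] = float(value) * secs
    return usage

u = tres_seconds("cpu=8,mem=16G,node=1", "01:00:00")
print(u["cpu"])   # 8 CPUs for 3600 s of wall time
```

Feeding it `sacct -j <jobid> -o AllocTRES,Elapsed -P` output per job would give the per-job breakdown of CPU-seconds and memory-byte-seconds.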
[slurm-dev] Re: Preemption order?
I'm interested in this as well. It would be good to have a few knobs to fine-tune which jobs preferentially get requeued. I know it adds overhead to the scheduling, but the benefit to us outweighs that. Essentially it would be good to preempt jobs that are small and short before the big ones or those that have been running for a long time. -Paul Edmon- On 06/10/2016 02:36 AM, Steffen Grunewald wrote: Good morning everyone, is there a way to control the order in which jobs get preempted? That is, for a queue with PreemptMode=REQUEUE, it would make sense to preempt jobs first that haven't run for long; for CANCEL with some GraceTime, perhaps addressing the longest-running jobs first would be better - can Slurm's strategy be hinted towards the "right" way? Thanks, - S
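For reference, the requeue-based setup this thread is discussing hinges on a couple of slurm.conf knobs; a sketch with hypothetical partition names (the ordering of which job gets preempted first is, as the question notes, not directly tunable):

```
# slurm.conf -- partition-priority preemption with requeue (sketch;
# node ranges and partition names are hypothetical)
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

PartitionName=owner    Nodes=node[01-16] Priority=10 PreemptMode=off
PartitionName=backfill Nodes=node[01-16] Priority=1  PreemptMode=requeue
```

Jobs in the low-priority backfill partition run on idle owner nodes and are requeued when owner jobs need the resources.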
[slurm-dev] Re: Architectural constraints of Slurm
This email actually reminded me of what we run here. We've been operating with slurm for the past 3 years in an environment similar to your own. We schedule about 50k cores, 90 partitions, and a heterogeneous architecture of AMD, Intel, high-memory versus low-memory, and GPU nodes. Slurm has handled it all, and a lot of the lessons we learned have been pushed back to the community. Thus the current version of slurm should be able to handle all that you have defined there. The only tricky part that I am aware of may be the Kerberos part. We use ldap for our auth on the boxes, and slurm does have a pam module for handling who gets access. That plus cgroups should take care of your security problems. Anyways, suffice it to say slurm can work for your environment, as your environment is fairly similar to ours. -Paul Edmon- On 05/24/2016 07:33 AM, Šimon Tóth wrote: Architectural constraints of Slurm Hello, I'm looking for information on Slurm architectural constraints as we are considering a switch to Slurm. We are currently running a heavily modified version of Torque with a custom scheduler. Our system (~13k CPU cores) is heavily heterogeneous (~40 clusters) with complex operational constraints. The system generally handles 10k-50k enqueued jobs. We are currently scheduling CPU cores, memory, GPU cards, scratch space (local, SSD and NFS, with different machines having access to different combinations of these) and software licenses. Machines are described by a set of physical and software properties (which can be requested by users) and their speed (users can request ranges of machine performance). Jobs carry complex requests. Each job can request sets of machines, where each set carries a different specification. Each set is described by the amount of resources requested, machine properties (negative specification is supported for specifying nodes that do not have a specific property) and the number of nodes with such specification.
Nodes can be allocated exclusively (in which case, the specification describes the minimum amount) or shared, with each resource still being allocated exclusively (jobs cannot overlap in cores, memory, gpu cards). We are relying on Kerberos which is used for both identification and authentication. Each running task inside of a job has a nanny process that periodically refreshes the kerberos ticket for that particular process. For scalability reasons, our scheduler relies on the server to keep an up-to-date and full state of the system. The server therefore counts the current resource allocation state for each of the resources on each of the nodes. Please let me know if this sounds like something Slurm could handle, or if there are any limitations in Slurm that would make this impossible to support. Sincerely, Simon Toth
[slurm-dev] Print Out Slurm Network Hierarchy
We've been having some node communication timeouts on our network. I know that slurm communicates in a hierarchical fashion. Is it possible to have slurm spit out its communication tree? This would be helpful for isolating nodes that could be causing problems, especially since we have nodes on various different networks and locations. In addition, it would be great if we could specify our own topology, or at least give hints to slurm about what would be a good layout for its communications. From what I know, the topology plugin is really more for compute topology and not so much internal slurm communications. Thanks. -Paul Edmon-
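For what it's worth, while there is no obvious way to print the tree, the shape of the fan-out is configurable; an illustrative slurm.conf sketch (values and node/switch names hypothetical):

```
# slurm.conf -- knobs that shape slurmctld/slurmd message fan-out
# (this configures the tree rather than printing it)
TreeWidth=50                  # fan-out of the hierarchical communication tree
RoutePlugin=route/topology    # forward messages along topology.conf lines

# topology.conf -- switch layout the route plugin will follow
#   SwitchName=sw1 Nodes=node[01-32]
#   SwitchName=sw2 Nodes=node[33-64]
#   SwitchName=top Switches=sw[1-2]
```

With route/topology, the communication hierarchy follows the switch layout you declare, which at least lets you align the tree with your actual networks.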
[slurm-dev] Re: Slurm Upgrade Instructions needed
Oh and the longest part of the upgrade is the database upgrade. It varies depending on how big it is. We have about 20 million jobs in ours and it takes about 30-40 minutes. The rest of the upgrade is pretty quick, as it just involves updating the rpms and restarting the scheduler and slurmds. -Paul Edmon- On 5/4/2016 10:10 PM, Paul Edmon wrote: Specifically the upgrade instructions are here: http://slurm.schedmd.com/quickstart_admin.html Look at the bottom of the page. If you follow the instructions you should be fine. Though I would recommend pausing the scheduler by suspending all the jobs and setting all the partitions to down. This helps with keeping the database in lock step as jobs won't be exiting or launching while you are upgrading the database. You can do it live though, but you may end up with a few jobs in the database that are inconsistent. The rest is very smooth and works well. Just note that once you do a major upgrade like this there is no rolling back due to the database and job structure changes. -Paul Edmon- On 5/4/2016 6:23 PM, Lachlan Musicman wrote: Re: [slurm-dev] Slurm Upgrade Instructions needed I would backup /etc/slurm. That's about it. Cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 5 May 2016 at 07:36, Balaji Deivam <balaji.dei...@seagate.com <mailto:balaji.dei...@seagate.com>> wrote: Hello, Right now we are using Slurm 14.11.3 and planning to upgrade to the latest version 15.8.7. Could you please share the latest upgrade steps document link? Also please let me know what are the stuffs I have to take backup in the existing slurm version before installing new one. Thanks & Regards, Balaji Deivam Staff Analyst - Business Data Center Seagate Technology - 389 Disc Drive, Longmont, CO 80503 | 720-684- _3395_
[slurm-dev] Zero Bytes Received
I'm seeing quite a few of these errors: May 2 11:33:29 holy-slurm01 slurmctld[47253]: error: slurm_receive_msg: Zero Bytes were transmitted or received May 2 11:33:29 holy-slurm01 slurmctld[47253]: error: slurm_receive_msg: Zero Bytes were transmitted or received I know that this can be caused by a node or client that is in a bad state, but I can't figure out how to trace it back to which one. Does anyone have any tricks for tracing this sort of error back? I turned on the Protocol Debug Flag but none of the additional debug statements lead to the culprit. -Paul Edmon-
[slurm-dev] Re: Saving job submissions
We use this python script as a slurmctld prolog to save ours. Basically it pulls all the info from the slurm hash files and copies it to a separate filesystem. We used to do it via MySQL but the database got too large. We then use get_jobscript to actually query the job scripts. -Paul Edmon- On 04/27/2016 03:41 AM, Lennart Karlsson wrote: On 04/27/2016 08:46 AM, Miguel Gila wrote: Another option is to run (as SlurmUser, or root) when the job is still in the system (R or PD): # scontrol show jobid= -dd and dump its output somewhere. Miguel Hi, We are using this one, run as slurmctld.prolog, saving the output for 30 days. Cheers, -- Lennart Karlsson, UPPMAX, Uppsala University, Sweden http://www.uppmax.uu.se/

#!/usr/bin/python -tt
import logging
import os
import sys
import syslog
import traceback
from glob import glob

def get_env_vars(job_id):
    #if not os.path.exists('/slurm/spool/job.%s/environment' % job_id):
    #    return []
    paths = glob('/slurm/spool/hash.*/job.%s/environment' % job_id)
    if len(paths) == 0:
        return []
    # Not sure how this would happen, but it would be weird
    if len(paths) > 1:
        return []
    # Skip the first four bytes, which are a uint32 indicating the
    # length of the following data.
    env_raw = open(paths[0]).read()[4:]
    env_vars = []
    for env in env_raw.split('\0'):
        if env in ['', ';']:
            continue
        name, value = env.split('=', 1)
        env_vars.append("%s='%s'" % (name, value.replace("'", r"'\''")))
    return env_vars

def get_job_script(job_id):
    #if not os.path.exists('/slurm/spool/job.%s/script' % job_id):
    #    return ''
    paths = glob('/slurm/spool/hash.*/job.%s/script' % job_id)
    if len(paths) == 0:
        return ''
    # Not sure how this would happen, but it would be weird
    if len(paths) > 1:
        return ''
    # The job script has a trailing NULL. o_O
    script_raw = open(paths[0]).read().rstrip('\0')
    shebang = rest = ''
    try:
        shebang, rest = script_raw.split('\n', 1)
    except ValueError, e:
        raise Exception('No lines in script file /slurm/spool/hash.*/job.%s/script' % job_id)
    sbatch_lines = []
    rest_lines = []
    in_sbatch = True
    for line in rest.split('\n'):
        if in_sbatch:
            if line.startswith('#SBATCH') or line.strip() == '':
                sbatch_lines.append(line)
            else:
                in_sbatch = False
                rest_lines.append(line)
        else:
            rest_lines.append(line)
    return '%s\n%s\n\n%s\n BEGIN RUNTIME ENV \n%s' % (
        shebang,
        '\n'.join(sbatch_lines).strip(),
        '\n'.join(rest_lines),
        '\n'.join(get_env_vars(job_id)),
    )

def save_job_script(job_id):
    # The number of subdirectories the scripts will be distributed into;
    # job_id modulo SUBDIRCOUNT determines the parent directory for the job id
    SLURM_JOBSCRIPT_SUBDIRCOUNT = int(os.environ.get("SLURM_JOBSCRIPT_SUBDIRCOUNT", "1000"))
    # The root path of the jobscripts
    SLURM_JOBSCRIPT_HOME = os.environ.get("SLURM_JOBSCRIPT_HOME", "/jobscripts")
    subdir = str(int(job_id) % SLURM_JOBSCRIPT_SUBDIRCOUNT)
    scriptdir = os.path.join(SLURM_JOBSCRIPT_HOME, subdir)
    if not os.path.exists(scriptdir):
        os.mkdir(scriptdir)
    scriptfile = os.path.join(scriptdir, job_id)
    with open(scriptfile, 'w') as f:
        f.write(get_job_script(job_id))

if __name__ == '__main__':
    try:
        save_job_script(os.environ['SLURM_JOB_ID'])
    except Exception as e:
        syslog.openlog(os.path.basename(sys.argv[0]), syslog.LOG_PID)
        for line in traceback.format_exc().split('\n'):
            syslog.syslog(syslog.LOG_ERR, line)

(attachment: get_jobscript.sh - Bourne shell script)
[slurm-dev] TRES in QoS
So when you list a TRES in a QoS limit, do you use the billing-weight-modified TRES or the actual raw number? Say for instance I want to set a memory limit of 256 GB and I am billing at 0.25 per GB. Do I then do: GrpTRES=mem=256 or GrpTRES=mem=64? The docs seem a bit ambiguous on this point. -Paul Edmon-
[slurm-dev] Re: User education tools for fair share
We use the showq tool. We are looking into fairshare plotting over time. I've been doing it on our test cluster, and seeing the evolution of fairshare over time really gives you a good feel for what your usage is and what is going on. We haven't implemented it yet though, as we would need to set up a data repository for historic fairshare data. Still, it is on our docket to do. sshare is a great way of seeing things too, though it can be a bit much for your average user. -Paul Edmon- On 03/01/2016 04:52 AM, Chris Samuel wrote: Hi Loris, On Tue, 1 Mar 2016 12:29:12 AM Loris Bennett wrote: To help the user understand their current fairshare/priority status, I usually point them to 'sprio', generally in the following incantation: sprio -l | sort -nk3 to get the jobs sorted by priority. Thanks for that, our local version of "showq" already orders waiting jobs by priority, so I'll direct users to that first, and then to sprio to get a breakdown of priority. All the best, Chris
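For anyone attempting the plotting described above: the classic (pre-Fair-Tree) fairshare factor is, roughly, F = 2^(-EffectiveUsage / NormalizedShares). A small sketch for turning sshare's numbers into plottable values — a simplification of the documented formula, not Slurm's actual code:

```python
# Classic fair-share factor, roughly F = 2 ** (-EffectiveUsage / NormShares):
# a user consuming exactly their share sits at 0.5, an idle user drifts
# toward 1.0, and a heavy user decays toward 0. Feeding sshare's
# EffectvUsage and NormShares columns through this over time gives the
# kind of evolution plot discussed above.

def fairshare_factor(effective_usage, norm_shares):
    if norm_shares <= 0:
        return 0.0
    return 2.0 ** (-effective_usage / norm_shares)

print(fairshare_factor(0.10, 0.10))  # exactly at your share -> 0.5
```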
[slurm-dev] Re: job_submit.lua slurm.log_user
Ah, okay. In this case I wanted to print out something to the user and still succeed, as it was just a warning to the user that their script had been modified. Thanks. -Paul Edmon- On 01/29/2016 12:36 PM, je...@schedmd.com wrote: It works for me, but only for the job_submit function (not job_modify) and requires that the function return an error code (not "SUCCESS"). Here's a trivial example:

function slurm_job_submit(job_desc, part_list, submit_uid)
    slurm.log_user("TEST");
    return slurm.ERROR
end

$ srun hostname
srun: error: TEST
srun: error: Unable to allocate resources: Unspecified error

Quoting Paul Edmon <ped...@cfa.harvard.edu>: So I've been working with the job_submit.lua part of our config, we are running 15.08.7, and while slurm.log_info works, slurm.log_user does not. My understanding is that slurm.log_user will print to the user's console. My code looks like the following:

slurm.log_user("Your current fairshare is not high enough to use the priority partition. Your job has been automatically reset to the following: %s", job_desc.partition)

Does anyone see a problem with this? Does anyone have a version of job_submit.lua that uses slurm.log_user and it works? If so, what is the trick? Or am I misunderstanding what slurm.log_user does? -Paul Edmon-
[slurm-dev] Re: Pause all new submissions
You can also set all the partitions to down. All pending jobs will pend, new jobs can be submitted, and existing jobs will finish. However, no new jobs will be scheduled. -Paul Edmon- On 01/20/2016 03:41 PM, Trey Dockendorf wrote: Re: [slurm-dev] Pause all new submissions Can likely use a reservation: http://slurm.schedmd.com/reservations.html. I use something like this to put the entire cluster into maintenance: scontrol create reservation starttime=now duration=infinite accounts=mgmt flags=maint,ignore_jobs nodes=ALL The above will ignore running jobs and prevent new jobs from starting. It also allows people in the mgmt account to run jobs by using --reservation= with sbatch. - Trey = Trey Dockendorf Systems Analyst I Texas A&M University Academy for Advanced Telecommunications and Learning Technologies Phone: (979)458-2396 Email: treyd...@tamu.edu Jabber: treyd...@tamu.edu On Wed, Jan 20, 2016 at 2:06 PM, Jordan Willis <jwillis0...@gmail.com> wrote: Hi, How can I suspend new submissions to slurm without killing or suspending everything currently scheduled? Also, can I do this on the cluster level or does it have to be by partition? Jordan
[slurm-dev] Re: slurmdbd upgrade
That sounds about right. We have about the same order of magnitude and our last major upgrade took about an hour for the DB to update itself. -Paul Edmon- On 1/12/2016 5:04 PM, Andrew E. Bruno wrote: We're planning an upgrade from 14.11.10 to 15.08.6 and in the past the slurmdbd upgrades can take a while (~20-30 minutes). Curious if there's any way to gauge what we can expect this time? To give an idea, we have upwards of ~4.5 million records in our job tables. We frequently get these warnings in our logs: Warning: Note very large processing time from hourly_rollup ... Also, any "best practices" for upgrading the slurmdbd other than: yum update; systemctl restart slurmdbd (this usually hangs until the database migrations are complete) Thanks in advance for any pointers. Cheers, --Andrew
[slurm-dev] Grp Limits for Partition QoS
So I want to set up a partition so that each account gets up to a certain TRES limit and no more. I don't want to do it per user, as a user could be part of multiple accounts, and I don't want to do it globally per account, as we have backfill partitions that I still want the account to have access to. I just don't want any given account to dominate the partition, especially if that account has perfect fairshare and submits a ton of jobs at once. I was hoping that the Partition QoS could help me do this. However, I'm not sure how Grp limits apply here. I would use something like GrpTRES, but from my reading of the description it would apply to the whole QoS and hence the whole partition, rather than on a per-account basis. There is MaxTRESPerUser, but I want something more like MaxTRESPerAccount (I don't know if this exists and I'm just missing it). Thanks for the info. -Paul Edmon-
[slurm-dev] Re: MsgAggregation Parameters
I will have to try that out. Thanks for the info. -Paul Edmon- On 12/08/2015 01:54 PM, Danny Auble wrote: Hey Paul, Unless you have a very busy cluster (100s of jobs a second) or are running very large jobs (>2000 nodes) I don't think this will be very useful. But I would expect MsgAggregationParams=WindowMsgs=10,WindowTime=10 to be more what you would want. WindowTime=100 may be too long of a wait. I am surprised at the threading of your slurmctld though, I would expect it to have much less threading. Be sure to restart all your slurmd's as well as the slurmctld when you change the parameter. Try again and see if lowering the WindowTime down improves your situation. Danny On 12/08/15 10:25, Paul Edmon wrote: We recently upgraded to 15.08.4 from 14.11 and we wanted to try out the MsgAggregation to see if that would improve cluster throughput and responsiveness. However when we turned it on with the settings of WindowMsgs=10 and WindowTime=100 everything slowed to a crawl and it looked like the slurmctld was threading like crazy. When we turned it off everything returned to normal. Does any one have any suggestions or guidelines for what to set the MsgAggregationParam to? I'm guessing it depends on the size of the cluster as we have the same settings on our test cluster but it is about 10 times smaller in terms of number of nodes than our main one. I'm guessing this is a scaling problem. Thoughts? Anyone else using MsgAggregation? -Paul Edmon-
[slurm-dev] Re: A floating exclusive partition
Yeah, I guess QoS won't really work for overflow. I was more thinking of the QoS as a way to create a floating partition of 5 nodes, with the rest being in the public queue. They would send jobs to the QoS to hit that, and then when it is full they would submit to public as normal. That's at least my thinking, but it's less seamless to the users as they will have to consciously monitor what is going on. -Paul Edmon- On 11/19/2015 10:50 AM, Daniel Letai wrote: Can you elaborate a little? I'm not sure what kind of QoS will help, nor how to implement one that will satisfy the requirements. On 11/19/2015 04:52 PM, Paul Edmon wrote: You might consider a QoS for this. It may not do everything you want but it will give you the flexibility. -Paul Edmon- On 11/19/2015 04:49 AM, Daniel Letai wrote: Hi, Suppose I have a 100 node cluster with ~5% of nodes down at any given time (maintenance/hw failure/...). One of the projects requires exclusive use of 5 nodes, and must be able to use the entire cluster when available (when other projects aren't running). I can do this easily if I maintain a static list of the exclusive nodes in slurm.conf: PartitionName=public Nodes=tux0[01-95] Default=YES PartitionName=special Nodes=tux[001-100] Default=NO And allowing only that project to use partition special. However, due to the downtime of 5%, I'd like to maintain a dynamic exclusive 5 nodes. Any suggestions? The project is serial and deployed as an array of single-node jobs, so I can run it even when the other 95 nodes are full. Thanks, --Dani_L.
[slurm-dev] Re: Partition QoS
Okay, it's working now. I just had to be more patient for slurmctld to pick up the DB change of adding the QOS. Thanks for the help. -Paul Edmon- On 11/11/2015 11:54 AM, Bruce Roberts wrote: I believe the method James describes only allows that one qos to be used by jobs on the partition, it does not set up a Partition QOS as Paul is looking for. Create the QOS as James describes, but instead of AllowQOS use the QOS option as Danny alluded to, and as documented here http://slurm.schedmd.com/slurm.conf.html#OPT_QOS. PartitionName=batch Nodes=compute[01-05] *QOS*=vip Default=NO MaxTime=UNLIMITED State=UP This will allow for a partition QOS that isn't associated with the job, as talked about in the SLUG15 presentation http://slurm.schedmd.com/SLUG15/Partition_QOS.pdf On 11/11/15 08:09, Paul Edmon wrote: Thanks for the insight. I will try it out on my end. -Paul Edmon- On 11/11/2015 3:31 AM, Dennis Mungai wrote: Re: [slurm-dev] Re: Partition QoS Thanks a lot, James. Helped a bunch ;-) *From:*James Oguya [mailto:oguyaja...@gmail.com] *Sent:* Wednesday, November 11, 2015 11:21 AM *To:* slurm-dev <slurm-dev@schedmd.com> *Subject:* [slurm-dev] Re: Partition QoS I was able to do this on slurm 15.08.3 on a test environment. Here are my notes: - create qos sacctmgr add qos Name=vip MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 MaxTRESPerUser=cpu=128 - create partition in slurm.conf PartitionName=batch Nodes=compute[01-05] AllowQOS=vip Default=NO MaxTime=UNLIMITED State=UP I believe the keyword here is to specify the QoS name when creating the partition; for my case, I used AllowQOS=vip James On Wed, Nov 11, 2015 at 4:16 AM, Danny Auble <d...@schedmd.com> wrote: I'm guessing you also added the qos to your partition line in your slurm.conf as well. On November 10, 2015 5:08:24 PM PST, Paul Edmon <ped...@cfa.harvard.edu> wrote: I did that but it didn't pick it up. I must need to reconfigure again after I made the qos. I will have to try it again.
Let you know how it goes. -Paul Edmon- On 11/10/2015 5:40 PM, Douglas Jacobsen wrote: Hi Paul, I did this by creating the qos, e.g. sacctmgr create qos part_whatever Then in slurm.conf setting qos=part_whatever in the "whatever" partition definition. scontrol reconfigure finally, set the limits on the qos: sacctmgr modify qos set MaxJobsPerUser=5 where name=part_whatever ... -Doug Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center <http://www.nersc.gov> On Tue, Nov 10, 2015 at 2:29 PM, Paul Edmon <ped...@cfa.harvard.edu> wrote: In 15.08 you are able to set QoS limits directly on a partition. So how do you actually accomplish this? I've tried a couple of ways, but no luck. I haven't seen a demo of how to do this anywhere either. My goal is to set up a partition with the following QoS parameters: MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 MaxCPUsPerUser=128 Thanks for the info. -Paul Edmon- -- /James Oguya/
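A condensed sketch of the working recipe from this thread (commands as given by James, Doug, and Bruce; names and limits are just the examples used above, adjust to taste):

```shell
# 1. Create the QOS carrying the per-user limits:
sacctmgr add qos Name=vip MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 MaxTRESPerUser=cpu=128

# 2. In slurm.conf, attach it as a *Partition* QOS with QOS=, not AllowQOS=:
#    PartitionName=batch Nodes=compute[01-05] QOS=vip Default=NO MaxTime=UNLIMITED State=UP

# 3. Re-read the config, then give slurmctld a moment to pick up the new
#    QOS from the database (the delay Paul ran into above):
scontrol reconfigure
```

The key distinction in the thread is that AllowQOS= only restricts which job QOSes may run in the partition, while QOS= makes the QOS's limits apply to the partition itself.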
[slurm-dev] Re: Partition QoS
Thanks for the insight. I will try it out on my end. -Paul Edmon- On 11/11/2015 3:31 AM, Dennis Mungai wrote: Re: [slurm-dev] Re: Partition QoS Thanks a lot, James. Helped a bunch ;-) *From:*James Oguya [mailto:oguyaja...@gmail.com] *Sent:* Wednesday, November 11, 2015 11:21 AM *To:* slurm-dev <slurm-dev@schedmd.com> *Subject:* [slurm-dev] Re: Partition QoS I was able to do this on slurm 15.08.3 on a test environment. Here are my notes: - create qos sacctmgr add qos Name=vip MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 MaxTRESPerUser=cpu=128 - create partition in slurm.conf PartitionName=batch Nodes=compute[01-05] AllowQOS=vip Default=NO MaxTime=UNLIMITED State=UP I believe the keyword here is to specify the QoS name when creating the partition, for my case, I used AllowQOS=vip James On Wed, Nov 11, 2015 at 4:16 AM, Danny Auble <d...@schedmd.com <mailto:d...@schedmd.com>> wrote: I'm guessing you also added the qos to your partition line in your slurm.conf as well. On November 10, 2015 5:08:24 PM PST, Paul Edmon <ped...@cfa.harvard.edu <mailto:ped...@cfa.harvard.edu>> wrote: I did that but it didn't pick it up. I must need to reconfigure again after I made the qos. I will have to try it again. Let you know how it goes. -Paul Edmon- On 11/10/2015 5:40 PM, Douglas Jacobsen wrote: Hi Paul, I did this by creating the qos, e.g. sacctmgr create qos part_whatever Then in slurm.conf setting qos=part_whatever in the "whatever" partition definition. scontrol reconfigure finally, set the limits on the qos: sacctmgr modify qos set MaxJobsPerUser=5 where name=part_whatever ... -Doug Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center <http://www.nersc.gov> On Tue, Nov 10, 2015 at 2:29 PM, Paul Edmon <ped...@cfa.harvard.edu <mailto:ped...@cfa.harvard.edu>> wrote: In 15.08 you are able to set QoS limits directly on a partition. So how do you actually accomplish this?
I've tried a couple of ways, but no luck. I haven't seen a demo of how to do this anywhere either. My goal is to set up a partition with the following QoS parameters: MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 MaxCPUsPerUser=128 Thanks for the info. -Paul Edmon- -- /James Oguya/
[slurm-dev] Partition QoS
In 15.08 you are able to set QoS limits directly on a partition. So how do you actually accomplish this? I've tried a couple of ways, but no luck. I haven't seen a demo of how to do this anywhere either. My goal is to set up a partition with the following QoS parameters: MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 MaxCPUsPerUser=128 Thanks for the info. -Paul Edmon-
[slurm-dev] Re: Partition QoS
I did that but it didn't pick it up. I must need to reconfigure again after I made the qos. I will have to try it again. Let you know how it goes. -Paul Edmon- On 11/10/2015 5:40 PM, Douglas Jacobsen wrote: Re: [slurm-dev] Partition QoS Hi Paul, I did this by creating the qos, e.g. sacctmgr create qos part_whatever Then in slurm.conf setting qos=part_whatever in the "whatever" partition definition. scontrol reconfigure finally, set the limits on the qos: sacctmgr modify qos set MaxJobsPerUser=5 where name=part_whatever ... -Doug Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center <http://www.nersc.gov> On Tue, Nov 10, 2015 at 2:29 PM, Paul Edmon <ped...@cfa.harvard.edu <mailto:ped...@cfa.harvard.edu>> wrote: In 15.08 you are able to set QoS limits directly on a partition. So how do you actually accomplish this? I've tried a couple of ways, but no luck. I haven't seen a demo of how to do this anywhere either. My goal is to set up a partition with the following QoS parameters: MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 MaxCPUsPerUser=128 Thanks for the info. -Paul Edmon-
[slurm-dev] Re: Interpreting
Mixed just means the node is partially full. It has nothing to do with the health of a node. DOWN, DRAINED, a node name followed by a * (not responding), and unavailable are all bad states. You will see compound states as well, such as DOWN+DRAIN, IDLE+DRAIN, or ALLOCATED+DRAIN. They indicate exactly what they say: IDLE+DRAIN means that the node has been drained and is idle; ALLOCATED+DRAIN means the node is set to drain but is currently fully allocated; MIXED+DRAIN means the node is draining but is currently partially allocated. Generally nodes in a problematic state will have their Reason field filled in with something. Hope this helps. -Paul Edmon- On 10/27/2015 05:03 AM, Всеволод Никоноров wrote: Hello, if a node is in a MIXED state, is it possible that there are "good" and "bad" states mixed? I mean, if a node is MIXED when it is partly allocated, this is OK. But is it possible that a node is, for example, partly DOWN or partly DRAINED, or partly in some other state that is not normal? I maintain a script in a monitoring system which checks whether a node is available to SLURM, and it considers the IDLE and ALLOCATED states as "good". I am going to add the MIXED state also, but I don't want to miss a problem in the way I described. Thanks in advance! -- Vsevolod Nikonorov, JSC NIKIET
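For a monitoring check like the one described, sinfo's format options expose both the state and the Reason field; a sketch along those lines (the flags are documented sinfo options, the grep pattern is my own heuristic):

```shell
# node-oriented listing: name, extended state, and Reason (if any)
sinfo -N -h -o '%N %T %E'

# flag anything containing DOWN, DRAIN, or FAIL, including compound
# states like MIXED+DRAIN; IDLE, ALLOCATED, and MIXED alone are healthy
if sinfo -h -o '%T' | grep -qiE 'down|drain|fail'; then
    echo "problem nodes present"
fi
```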
[slurm-dev] Re: login node configuration?
For clarity, they should not need to talk to the compute nodes unless you intend to do interactive work. You should only need to talk to the master to submit jobs. -Paul Edmon- On 10/26/2015 9:45 PM, Paul Edmon wrote: What we did was just open up port 6817 between the two VLANs. So long as the traffic is routeable and they see the same slurm.conf, that should work. All the login node needs is slurm.conf and the slurm, slurm-munge, and slurm-plugin rpms. You don't need to run the slurm service to submit, as all you need is the ability of the login node to talk to the master. -Paul Edmon- On 10/26/2015 8:02 PM, Liam Forbes wrote: Hello, I'm in the process of setting up SLURM 15.08.X for the first time. I've got a head node and ten compute nodes working fine for serial and parallel jobs. I'm struggling with figuring out how to configure a pair of login nodes to be able to get status and submit jobs. I've been searching through the documentation and googling, but I've not found a reference other than brief mentions of login nodes or submit hosts. Our pair of login nodes share an ethernet network with the head node. They do not, currently, share a network, ethernet or IB, with the compute nodes. When I execute SLURM commands like sinfo or squeue, using strace I can see them attempting to communicate with port 6817 on the head node. However, the head node doesn't respond. lsof on the head node seems to indicate slurmctld is only listening on the private network shared with the compute nodes. Is there a reference for configuring login nodes to be able to query slurmctld and submit jobs? If not, would somebody with a similar configuration be willing to share some guidance? Do the login nodes need to be on the same network as the compute nodes, or at least the same network as slurmctld is listening on? Apologies if these questions have been answered somewhere else, I just haven't found that documentation.
Regards, -liam -There are uncountably more irrational fears than rational ones. -P. Dolan Liam Forbes lofor...@alaska.edu ph: 907-450-8618 fax: 907-450-8601 UAF Research Computing Systems Senior HPC Engineer LPIC1, CISSP
[slurm-dev] Re: login node configuration?
What we did was just open up port 6817 between the two VLANs. So long as the traffic is routeable and they see the same slurm.conf, that should work. All the login node needs is slurm.conf and the slurm, slurm-munge, and slurm-plugin rpms. You don't need to run the slurm service to submit, as all you need is the ability of the login node to talk to the master. -Paul Edmon- On 10/26/2015 8:02 PM, Liam Forbes wrote: Hello, I'm in the process of setting up SLURM 15.08.X for the first time. I've got a head node and ten compute nodes working fine for serial and parallel jobs. I'm struggling with figuring out how to configure a pair of login nodes to be able to get status and submit jobs. I've been searching through the documentation and googling, but I've not found a reference other than brief mentions of login nodes or submit hosts. Our pair of login nodes share an ethernet network with the head node. They do not, currently, share a network, ethernet or IB, with the compute nodes. When I execute SLURM commands like sinfo or squeue, using strace I can see them attempting to communicate with port 6817 on the head node. However, the head node doesn't respond. lsof on the head node seems to indicate slurmctld is only listening on the private network shared with the compute nodes. Is there a reference for configuring login nodes to be able to query slurmctld and submit jobs? If not, would somebody with a similar configuration be willing to share some guidance? Do the login nodes need to be on the same network as the compute nodes, or at least the same network as slurmctld is listening on? Apologies if these questions have been answered somewhere else, I just haven't found that documentation. Regards, -liam -There are uncountably more irrational fears than rational ones. -P. Dolan Liam Forbes lofor...@alaska.edu ph: 907-450-8618 fax: 907-450-8601 UAF Research Computing Systems Senior HPC Engineer LPIC1, CISSP
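A quick way to verify from a login node that the controller's slurmctld port is actually reachable before digging further (pure bash; "headnode" is a placeholder for your controller's hostname, and 6817 is the default SlurmctldPort):

```shell
# probe the controller's slurmctld port from a login node;
# /dev/tcp is a bash built-in pseudo-device, no nc/telnet required
if timeout 3 bash -c 'exec 3<>/dev/tcp/headnode/6817' 2>/dev/null; then
    echo "slurmctld port reachable"
else
    echo "cannot reach slurmctld on 6817"
fi
```

If this fails, it points at routing or firewalling between the VLANs (or slurmctld binding only to the private interface), rather than at the Slurm configuration itself.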
[slurm-dev] Re: Slurmd restart without losing jobs?
You should be able to do this without losing any jobs (at least I've never lost any on any version of Slurm I have run). I do it all the time in our environment (about once a day) as our slurm.conf is in flux quite a bit. It should always preserve the running and pending state. The only issue I have ever seen where this becomes a problem is in fringe cases during major version upgrades, and even then it is rare. -Paul Edmon- On 10/12/2015 3:57 AM, Robbert Eggermont wrote: Hello, Some modifications to slurm.conf require me to restart the slurmd daemons on all nodes. Is there a way to do this without losing any running jobs (and not having to drain the cluster)? Thanks, Robbert
[slurm-dev] Re: Slurmd restart without losing jobs?
I've had this happen several times, but have never lost jobs due to it. Still, one should always watch the logs on the master when restarting so you can catch typos immediately. We run a sanity check on our confs before we push them (we use puppet for configuration control). Our post-commit hook uses scontrol to run a test on the conf before pushing. This typically catches most errors, though not all. -Paul Edmon- On 10/12/2015 12:41 PM, Antony Cleave wrote: While this is true, be very, very careful when restarting the slurmd on the controller node. It's quite easy to miss a typo in one of the config files, e.g. an unexpected comma in topology.conf, which can cause slurm to segfault or otherwise shut down uncleanly. If this happens then the state of queued jobs is not guaranteed, and if not spotted this can end up draining the system. Antony On 12/10/2015 16:12, Paul Edmon wrote: You should be able to do this without losing any jobs (at least I've never lost any on any version of Slurm I have run). I do it all the time in our environment (about once a day) as our slurm.conf is in flux quite a bit. It should always preserve the running and pending state. The only issue I have ever seen where this becomes a problem is in fringe cases during major version upgrades, and even then it is rare. -Paul Edmon- On 10/12/2015 3:57 AM, Robbert Eggermont wrote: Hello, Some modifications to slurm.conf require me to restart the slurmd daemons on all nodes. Is there a way to do this without losing any running jobs (and not having to drain the cluster)? Thanks, Robbert
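A sketch of a pre-push sanity check in the same spirit as the hook described above. This is a hypothetical illustration (the check_conf helper and its grep heuristic are mine, not the actual hook), aimed at the stray-comma class of topology.conf typo Antony mentions:

```shell
#!/bin/bash
# Flag doubled commas or empty values, which are almost always typos
# in Node/Switch/Partition definition lines.
check_conf() {
    if grep -qE ',,|=( |$)' "$1"; then
        echo "suspicious syntax in $1"
        return 1
    fi
    echo "$1 looks sane"
}

# demonstrate on a clean and a broken topology line
printf 'SwitchName=s1 Nodes=compute[01-05]\n'   > /tmp/topology_ok.conf
printf 'SwitchName=s1 Nodes=compute[01-05],,\n' > /tmp/topology_bad.conf

check_conf /tmp/topology_ok.conf
check_conf /tmp/topology_bad.conf || echo "would block the push"
```

A hook like this only catches gross syntax damage; as Paul says, nothing substitutes for watching the master's logs during the restart.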
[slurm-dev] Re: the nodes in state down*
Typically that means that the master is having problems communicating with the nodes. I would check your networking, especially your ACLs. -Paul Edmon- On 10/08/2015 10:15 AM, Fany Pagés Díaz wrote: I have a cluster with 3 nodes, and yesterday it was incorrectly turned off by electrical problems. When I started the cluster, slurm didn't work correctly; the nodes appear in state down*. I set them to the idle state but they went back to down*. Can anyone help me? Thanks, Ing. Fany Pages Diaz
[slurm-dev] Re: the nodes in state down*
The only other thing I can think of is to check that the node daemons are up and okay. -Paul Edmon- On 10/08/2015 11:04 AM, Fany Pagés Díaz wrote: My network looks fine, and the communication with the nodes too; the problem is with slurm, the slurmd daemon is failing. I never made any configuration with ACLs, but I am going to check. If you have another idea, please let me know. Thanks. From: Paul Edmon [mailto:ped...@cfa.harvard.edu] Sent: Thursday, 08 October 2015 10:20 To: slurm-dev Subject: [slurm-dev] Re: the nodes in state down* Typically that means that the master is having problems communicating with the nodes. I would check your networking, especially your ACLs. -Paul Edmon- On 10/08/2015 10:15 AM, Fany Pagés Díaz wrote: I have a cluster with 3 nodes, and yesterday it was incorrectly turned off by electrical problems. When I started the cluster, slurm didn't work correctly; the nodes appear in state down*. I set them to the idle state but they went back to down*. Can anyone help me? Thanks, Ing. Fany Pages Diaz
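When a node is stuck in down*, the usual sequence is to confirm slurmd itself is healthy on the node, then clear the state from the master. A sketch (node01 is a placeholder node name):

```shell
# on the compute node: is slurmd running at all?
systemctl status slurmd

# if it keeps dying, run it in the foreground with verbosity to see why
# (munge problems, clock skew, and bad slurm.conf are common causes)
slurmd -D -vvv

# on the master, once slurmd is healthy again, clear the down state:
scontrol update NodeName=node01 State=RESUME
```

Setting a node to idle by hand does not help while slurmd is dead; the controller will mark it down* again at the next failed ping, which matches the behavior described above.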
[slurm-dev] RE: slurm job priorities depending on the number of each user's jobs
There was some discussion of an analogous feature at the Slurm User Group meeting: http://bugs.schedmd.com/show_bug.cgi?id=1951 Basically, jobs would have their fairshare penalized immediately at launch for the full requested runtime, rather than the actual time used, which would have a similar effect to what you are asking for. That said, the two aren't exactly analogous, and I can see situations where one would want to do this sort of thing so that no one person can monopolize the queue. -Paul Edmon- On 10/6/2015 6:30 PM, Kumar, Amit wrote: Just wanted to jump on the wagon; this feature would help us as well...!! Thank you, Amit -Original Message- From: gareth.willi...@csiro.au [mailto:gareth.willi...@csiro.au] Sent: Tuesday, October 06, 2015 5:23 PM To: slurm-dev Subject: [slurm-dev] RE: slurm job priorities depending on the number of each user's jobs Hi Markus, We would also like such a feature. We have a clunky but practical alternative where a cron job repeatedly runs to reset the 'nice' factor for the top few pending jobs for each user to spread priority, encourage interleaving of jobs starting and help each user to have at least one job running quickly. I can provide our script if you are interested. It requires allowing negative nice. Gareth -Original Message- From: Dr. Markus Stöhr [mailto:markus.sto...@tuwien.ac.at] Sent: Tuesday, 6 October 2015 12:18 AM To: slurm-dev <slurm-dev@schedmd.com> Subject: [slurm-dev] slurm job priorities depending on the number of each user's jobs Dear all, we are currently looking at the details of the slurm priority multifactor plugin. Basically it is doing what it should, but there are, at least from our point of view, some limitations. The five contributions to priority (fairshare, qos, partition, job size and job age) are fixed at a time. So if a user has submitted a huge number of equal sized jobs, all jobs have nearly the same priority.
If such a bunch of jobs has the highest priority, nearly all of them might start simultaneously, not allowing the start of jobs of other users. The situation described above looks like this:

prio / user / job_nr_of_user
500 X 1
500 X 2
500 X 3
500 X 4
500 X 5
500 X 6
400 Z 1
400 Z 2
400 Z 3
300 A 1
300 A 2
300 A 3

What we want to achieve is something like the following list, i.e. only a few jobs of each user get the highest possible priority; all others are reduced in priority, relative to the other users' jobs:

prio / user / job_nr_of_user
500 X 1
450 X 2
400 Z 1
300 A 1
250 X 3
250 X 4
250 X 5
250 X 6
220 Z 2
220 Z 3
200 A 2
200 A 3

Basically this would use the number of jobs per user and weight the job index (of every user) with a damping function. The questions are: 1) Can this be done with existing parameters (e.g. limits of QOS/Partitions)? 2) If a multifactor priority plugin of our own is the method of choice, has anyone maybe programmed one themselves? best regards, Markus -- = Dr. Markus Stöhr Zentraler Informatikdienst BOKU Wien / TU Wien Wiedner Hauptstraße 8-10 1040 Wien Tel. +43-1-58801-420754 Fax +43-1-58801-9420754 Email: markus.sto...@tuwien.ac.at =
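Gareth's cron-based workaround can be sketched roughly as follows. The squeue and scontrol options are real, but the one-job-per-user policy and the Nice value here are illustrative, and setting negative nice requires operator privileges (plus a configuration that allows it, as he notes):

```shell
# list pending jobs as "user jobid", keep only the first job seen for
# each user, and bump that job's priority via a negative nice value
squeue -h -t PENDING -o '%u %i' |
awk '!seen[$1]++ {print $2}' |
while read -r jobid; do
    scontrol update JobId="$jobid" Nice=-1000
done
```

Run periodically from cron, this approximates the damped-priority list Markus describes: each user gets one job pushed toward the front, while the bulk of any one user's backlog stays behind other users' lead jobs.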
[slurm-dev] cgroups, julia, and threads
We recently made the change in our slurm conf to enabling: TaskPlugin=task/cgroup Previously we were using task/none. We are using this in our cgroup.conf: CgroupAutomount=yes CgroupReleaseAgentDir="/etc/slurm/cgroup" ConstrainCores=yes TaskAffinity=no ConstrainRAMSpace=yes ConstrainSwapSpace=yes ConstrainDevices=no The constraining of cores has gone pretty well but I got stumped by this one from one of our users. Any thoughts? "I am running a parallel optimization program that optimizes a function with several parameters. I am using Julia to parallelize the evaluation of this function using parallel map, although I believe as concerns my problem, the implementation details would be near similar to Python. Essentially, Julia is started with multiple other worker processes that can be communicated with via a single "brain" process. Typically, I run this script with 23 processes on 23 cores, with one "brain" process and 22 worker processes. The brain uses parallel map to distribute the function and parameter set to each worker, the worker evaluates the function and returns the result to the brain. The brain must wait for the return of all 22 workers before it can propose new parameter combinations. Within the function itself, I run a command-line Fortran program (dispatched using Julia's `run` command, which essentially uses the operating system `fork` and `exec` to spawn a child process) that takes in some of the parameters and then writes the result to file. This Fortran program is run in a blocking manner (synchronously), so each Julia worker idles while the Fortran program runs. When the Fortran program is finished the Julia script reads back in the result and then returns it to the brain process. This framework works very well on my local 24 core machine and up until last week when the cpu permissions were changed on Odyssey, this worked very well there as well. 
Prior to this change, the 22 Fortran programs would naturally spread evenly across 22 cores, utilizing 100% of each core. However, after the cpu permissions change on Odyssey, the 22 Fortran programs sometimes tend to clump onto the same core (examined using htop). For example, it is now common for at least one core to host up to 3 Fortran processes, each running at 33%, while several of my otherwise allocated cores idle at 0%. Most of the other ~15 Fortran processes are assigned one per core and have 100% utilization. When it comes time to run the Julia portion of the script, the 0% cores do return to full 100% utilization. However, because the Fortran program is my dominant time cost, and the brain must wait for all workers to finish, the 1/3 efficiency for a few workers means that my optimizer takes three times as long for the full ensemble. I am unclear why the OS will not reallocate these clumped processes to run on my otherwise available open cores. Paul was thinking that this might be related to the way SLURM is allocating threads to cores, and somehow my forked processes are being prevented from moving to the open cores. A similar example of what I think is happening is here: http://stackoverflow.com/questions/10019096/multiple-processes-stuck-on-same-cpu-after-multiple-forks-linux-c My run script is typically the following:

#!/bin/bash
#SBATCH -p general   #partition
#SBATCH -t 78:00:00  #running time
#SBATCH --mem 2      #memory request per node
#SBATCH -N 1         #ensure all jobs are on the same node
#SBATCH -n 23
venus.jl -p 22 -r $SLURM_ARRAY_TASK_ID

Where the venus.jl script is part of my code package here: https://github.com/iancze/JudithExcalibur/blob/master/scripts/venus.jl I have tried alternate sbatch submission options:
1. running with way more cores than I require (e.g. -n 30), so that some are definitely left idle
2. replacing -n 23 with -c 23
3. using srun and --cpu_bind=cores
4. using srun and --cpu_bind=none
although none of these has prevented the Fortran programs from clumping on at least a few cores. I would greatly appreciate your help in trying to figure out how to prevent this behavior under the newly changed core permissions." Thanks for the help. -Paul Edmon-
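A first diagnostic for a clumping problem like this is to look at the affinity mask each forked child actually received: with ConstrainCores=yes and TaskAffinity=no, processes should be free to float across all cores in the job's cgroup, so identical wide masks plus clumping points at the scheduler, while narrow masks point at binding. A sketch using /proc (fortran_prog is a placeholder for the binary name):

```shell
# print the allowed-CPU list of each Fortran child (Linux /proc)
for pid in $(pgrep fortran_prog); do
    grep Cpus_allowed_list "/proc/$pid/status"
done

# the current shell's own mask, for comparison
grep Cpus_allowed_list /proc/self/status
```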
[slurm-dev] Slurmctld Thread Count
So by default the slurm thread count is 256. We were wondering how one would go about increasing this (i.e. what variables to change), what issues one might expect if we did, and also what the rationale was for 256 being the thread limit. We are hoping to improve the responsiveness of the scheduler by bumping this up if we can to 1024, or at least test to see what happens when we do. Thanks. -Paul Edmon-
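As far as I know, the 256 ceiling is a compile-time constant rather than a slurm.conf option; my understanding (verify against your own source tree before editing) is that it is the MAX_SERVER_THREADS define in the slurmctld sources:

```shell
# locate the limit in the source tree before rebuilding with a new value
grep -rn 'MAX_SERVER_THREADS' src/slurmctld/
# expected to show something like: #define MAX_SERVER_THREADS 256
```

Raising it means a rebuild of slurmctld, and the practical limits (file descriptors, munge throughput, RPC contention) may bite before the thread count does, so testing a bump to 1024 on a non-production controller first seems prudent.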
[slurm-dev] Re: Slurm integration scripts with Matlab
So the new scripts are in the R2015b prerelease http://www.mathworks.com/downloads/web_downloads The ones we have are customized for our site. Mathworks recommends contacting them for assistance with customizing them for your own. -Paul Edmon- On 06/10/2015 09:53 AM, Sean McGrath wrote: Hi, +1 for running MATLAB on our cluster. Our notes on it are here: http://www.tchpc.tcd.ie/node/132 in the unlikely event they help. The module file just sets the following environment variables: root, MATLAB, MLM_LICENSE_FILE Regards Sean McGrath Trinity Centre for High Performance and Research Computing https://www.tchpc.tcd.ie/ On Wed, Jun 10, 2015 at 06:36:53AM -0700, John Desantis wrote: Hadrian, This reply will most likely not be of any help since it seems as if you're referring to the MATLAB distributed computing server, but I wanted to chime in and state that we are running MATLAB on our cluster. Users can either submit serial MATLAB jobs (with HPC using array tasks) or they can use a matlab pool up to either 12 to 16 cores on 1 node for SMP jobs. Integration scripts aren't needed in this set-up. All that is required is a normal submission script. John DeSantis 2015-06-09 17:24 GMT-04:00 Hadrian Djohari hx...@case.edu: Hi, We are in the process of adapting Slurm as our scheduler and we only found Matlab-Slurm integration scripts back from 2011 that seem to need some additional tweaking. Can someone with a running Matlab on their cluster help provide us with the information about this or also share the set of m-scripts that would work with Slurm both on the cluster and remotely? Thank you, -- Hadrian Djohari HPCC Manager Case Western Reserve University (W): 216-368-0395 (M): 216-798-7490
[slurm-dev] Re: Slurm integration scripts with Matlab
Huh. Let me ask around here and see if we can share what they gave us with the community. -Paul Edmon- On 6/10/2015 9:04 AM, Hadrian Djohari wrote: Re: [slurm-dev] Re: Slurm integration scripts with Matlab Hi Paul and others, We have contacted Mathworks and they only came back with the 2011 integration scripts. It seems they have a newer set now that they gave to Harvard? Does anyone else currently running Slurm not use Matlab at all? Hadrian On Tue, Jun 9, 2015 at 6:45 PM, Paul Edmon ped...@cfa.harvard.edu mailto:ped...@cfa.harvard.edu wrote: We've been working with Mathworks to get this sorted so that DCS and Matlab can work with Slurm. I think if you contact them you should be able to get the scripts they gave us. -Paul Edmon- On 6/9/2015 5:24 PM, Hadrian Djohari wrote: Hi, We are in the process of adapting Slurm as our scheduler and we only found Matlab-Slurm integration scripts back from 2011 that seem to need some additional tweaking. Can someone with a running Matlab on their cluster help provide us with the information about this or also share the set of m-scripts that would work with Slurm both on the cluster and remotely? Thank you, -- Hadrian Djohari HPCC Manager Case Western Reserve University (W): 216-368-0395 tel:216-368-0395 (M): 216-798-7490 tel:216-798-7490 -- Hadrian Djohari HPCC Manager Case Western Reserve University (W): 216-368-0395 (M): 216-798-7490
[slurm-dev] Re: Slurm integration scripts with Matlab
We've been working with Mathworks to get this sorted so that DCS and Matlab can work with Slurm. I think if you contact them you should be able to get the scripts they gave us. -Paul Edmon- On 6/9/2015 5:24 PM, Hadrian Djohari wrote: Slurm integration scripts with Matlab Hi, We are in the process of adapting Slurm as our scheduler and we only found Matlab-Slurm integration scripts back from 2011 that seem to need some additional tweaking. Can someone with a running Matlab on their cluster help provide us with the information about this or also share the set of m-scripts that would work with Slurm both on the cluster and remotely? Thank you, -- Hadrian Djohari HPCC Manager Case Western Reserve University (W): 216-368-0395 (M): 216-798-7490
[slurm-dev] Re: Slurm versions 14.11.6 is now available
As to performance concerns. We are running our production cluster which handles 1.5 million jobs per day with -O0 and gdb flags. No degraded performance. The reason for the -O0 is that the optimization can make debugging harder as it optimizes out some values. At least that's what has happened to us a couple of times when trying to debug the scheduler. So in reality the optimization did nothing in the first place. So any concerns about speed should be alleviated. -Paul Edmon- On 04/24/2015 03:31 PM, Chris Read wrote: Re: [slurm-dev] Re: Slurm versions 14.11.6 is now available On Fri, Apr 24, 2015 at 4:41 AM, Janne Blomqvist janne.blomqv...@aalto.fi mailto:janne.blomqv...@aalto.fi wrote: On 2015-04-24 02:03, Moe Jette wrote: Slurm version 14.11.6 is now available with quite a few bug fixes as listed below. Slurm downloads are available from http://slurm.schedmd.com/download.html * Changes in Slurm 14.11.6 == [snip] -- Enable compiling without optimizations and with debugging symbols by default. Disable this by configuring with --disable-debug. Always including debug symbols is good (the only cost is a little bit of disk space, should never really be a problem), but disabling optimization by default?? In our environment, slurmctld consumes a decent chunk of cpu time, I would loathe to see it getting a lot (?) slower. Typically, problems which are fixed by disabling optimization are due to violations of the C standard or such which for some reason just doesn't happen to trigger with -O0. Perhaps I'm being needlessly harsh here, but I'd prefer if the bugs were fixed properly rather than being papered over like this. +1 from me... -- Use standard statvfs(2) syscall if available, in preference to non-standard statfs. This is not actually such a good idea. Prior to Linux kernel 2.6.36 and glibc 2.13, the implementation of statvfs required checking all entries in /proc/mounts. If any of those other filesystems are not available (e.g. 
a hung NFS mount), the statvfs call would thus hang. See e.g. http://man7.org/linux/man-pages/man2/statvfs.2.html Not directly related to this change, there is also a bit of silliness in the statfs() code for get_tmp_disk(), namely that it assumes that the fs record size is the same as the memory page size. As of Linux 2.6 the struct statfs contains a field f_frsize which contains the correct record size. I suggest the attached patch which should fix both of these issues. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS NBE +358503841576 tel:%2B358503841576 || janne.blomqv...@aalto.fi mailto:janne.blomqv...@aalto.fi
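Janne's f_frsize point can be seen from userspace: GNU stat exposes the statvfs fields, with %S the fundamental block size (f_frsize) and %b the total data block count, and it is their product, not blocks times the memory page size, that gives the filesystem's real capacity:

```shell
# total capacity of /tmp's filesystem, computed from its own record size
# (GNU coreutils stat; %S = fundamental block size, %b = total blocks)
read -r frsize blocks <<<"$(stat -f -c '%S %b' /tmp)"
echo "$(( frsize * blocks / 1024 / 1024 )) MB"
```

On filesystems where the record size differs from the page size, the page-size assumption in get_tmp_disk() would misreport TmpDisk by exactly that ratio.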
[slurm-dev] Re: Slurm versions 14.11.6 is now available
Correction 50,000 jobs per day. That 1.5 million is per month. My bad. Still no degradation in performance seen in our environment. -Paul Edmon- On 04/24/2015 03:31 PM, Chris Read wrote: Re: [slurm-dev] Re: Slurm versions 14.11.6 is now available On Fri, Apr 24, 2015 at 4:41 AM, Janne Blomqvist janne.blomqv...@aalto.fi mailto:janne.blomqv...@aalto.fi wrote: On 2015-04-24 02:03, Moe Jette wrote: Slurm version 14.11.6 is now available with quite a few bug fixes as listed below. Slurm downloads are available from http://slurm.schedmd.com/download.html * Changes in Slurm 14.11.6 == [snip] -- Enable compiling without optimizations and with debugging symbols by default. Disable this by configuring with --disable-debug. Always including debug symbols is good (the only cost is a little bit of disk space, should never really be a problem), but disabling optimization by default?? In our environment, slurmctld consumes a decent chunk of cpu time, I would loathe to see it getting a lot (?) slower. Typically, problems which are fixed by disabling optimization are due to violations of the C standard or such which for some reason just doesn't happen to trigger with -O0. Perhaps I'm being needlessly harsh here, but I'd prefer if the bugs were fixed properly rather than being papered over like this. +1 from me... -- Use standard statvfs(2) syscall if available, in preference to non-standard statfs. This is not actually such a good idea. Prior to Linux kernel 2.6.36 and glibc 2.13, the implementation of statvfs required checking all entries in /proc/mounts. If any of those other filesystems are not available (e.g. a hung NFS mount), the statvfs call would thus hang. See e.g. http://man7.org/linux/man-pages/man2/statvfs.2.html Not directly related to this change, there is also a bit of silliness in the statfs() code for get_tmp_disk(), namely that it assumes that the fs record size is the same as the memory page size. 
As of Linux 2.6 the struct statfs contains a field f_frsize which contains the correct record size. I suggest the attached patch, which should fix both of these issues. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS NBE +358503841576 || janne.blomqv...@aalto.fi
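Janne's point about f_frsize can be illustrated from the shell: GNU stat's filesystem mode reports both the fundamental block size (statvfs's f_frsize) and the available block count, and free space is the product of the two; the memory page size never enters into it. A minimal sketch, assuming GNU coreutils and a mounted /tmp (the path is illustrative):

```shell
# Free space on /tmp computed from the fundamental block size (f_frsize),
# not the memory page size -- mirroring the fix in the proposed patch.
frsize=$(stat -f -c '%S' /tmp)   # fundamental block size, i.e. f_frsize
avail=$(stat -f -c '%a' /tmp)    # blocks available to unprivileged users
echo "$(( frsize * avail / 1024 / 1024 )) MiB free on /tmp"
```

Note that on a kernel/glibc older than the versions cited above, a statvfs-based query could block on a hung mount listed in /proc/mounts, which is exactly the hazard described in the message.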
[slurm-dev] Upgrade Rollbacks
I've been curious about this for a bit. What is the procedure for rolling back a minor or major release of slurm in case something goes wrong? -Paul Edmon-
[slurm-dev] Re: Problems running job
Do you have all the ports open between all the compute nodes as well? Since Slurm builds a tree to communicate, all the nodes need to talk to every other node on those ports, and do so without a huge amount of latency. You might also want to try upping your timeouts. -Paul Edmon- On 03/31/2015 10:28 AM, Jeff Layton wrote: Chris and David, Thanks for the help! I'm still trying to find out why the compute nodes are down or not responding. Any tips on where to start? How about open ports? Right now I have 6817 and 6818 open as per my slurm.conf. I also have 22 and 80 open as well as 111, 2049, and 32806. I'm using NFSv4 but don't know if that is causing the problem or not (I REALLY want to stick to NFSv4). Thanks! Jeff On 31/03/15 07:31, Jeff Layton wrote: Good afternoon! Hiya Jeff, [...] But it doesn't seem to run. Here is the output of sinfo and squeue: [...] Actually it does appear to get started (at least), but.. [ec2-user@ip-10-0-1-72 ec2-user]$ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 2 debug slurmtes ec2-user CG 0:00 1 ip-10-0-2-101 ...the CG state you see there is the completing state, i.e. the state when a job is finishing up. The system logs on the master node (controller node) don't show too much: Mar 30 20:20:43 ip-10-0-1-72 slurmctld[7524]: _slurm_rpc_submit_batch_job JobId=2 usec=239 Mar 30 20:20:44 ip-10-0-1-72 slurmctld[7524]: sched: Allocate JobId=2 NodeList=ip-10-0-2-101 #CPUs=1 OK, node allocated. Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0 Job finishes. Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: requeue JobID=2 State=0x8000 NodeCnt=1 per user/system request Mar 30 20:20:49 ip-10-0-1-72 slurmctld[7524]: job_complete: JobID=2 State=0x8000 NodeCnt=1 done Not sure of the implication of that requeue there, unless it's the transition to the CG state?
Mar 30 20:22:30 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-[101-102] not responding Mar 30 20:22:33 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-102 not responding, setting DOWN Mar 30 20:25:53 ip-10-0-1-72 slurmctld[7524]: error: Nodes ip-10-0-2-101 not responding, setting DOWN Now the nodes stop responding (not before). From these logs, it looks like the compute nodes are not responding to the control node (master node). Not sure how to debug this - any tips? I would suggest looking at the slurmd logs on the compute nodes to see if they report any problems, and check to see what state the processes are in - especially if they're stuck in a 'D' state waiting on some form of device I/O. I know some people have reported strange interactions with Slurm being on an NFSv4 mount (NFSv3 is fine). Good luck! Chris
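For anyone checking their firewall against this thread: the two ports Jeff mentions are set in slurm.conf, and the timeout Paul suggests raising is MessageTimeout. An illustrative fragment (the values are examples, not Jeff's actual configuration):

```
# slurm.conf (illustrative values)
SlurmctldPort=6817   # slurmd/srun -> slurmctld
SlurmdPort=6818      # slurmctld and node-to-node traffic -> slurmd
MessageTimeout=30    # seconds; default is 10 -- raise this if nodes are
                     # intermittently flagged "not responding"
TreeWidth=50         # fanout of the communication tree Paul describes
```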
[slurm-dev] --reboot
So, a point of information: my impression is that when you use this flag, the node(s) are rebooted (or the defined script is run) and then the job runs as normal. Is this correct? I ask because in our case it is rebooting the nodes, but then the node is marked as closed due to an unexpected reboot. Is there a way to suppress this when the node is rebooted by this flag? Obviously the reboot wasn't unexpected, since Slurm was aware of it via the flag. -Paul Edmon-
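A hedged guess at the knob involved: whether a rebooted node is returned to service automatically is governed by ReturnToService in slurm.conf, so a fragment like the one below (values illustrative, check your version's man page) should keep the node from being held down after a --reboot:

```
# slurm.conf (illustrative)
ReturnToService=2            # return a DOWN node to service when it registers
                             # with a valid configuration, even after a reboot
RebootProgram="/sbin/reboot"
ResumeTimeout=300            # seconds to wait for the node to come back
```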
[slurm-dev] Re: slurmctld thread number blowups leading to deadlock in 14.11.4
So we have had the same problem, usually due to the scheduler receiving tons of requests. Usually this is fixed by having the scheduler slow itself down using the defer or max_rpc_cnt options. We in particular use max_rpc_cnt=16. I actually did a test yesterday where I removed this and let it go without defer, as I was hoping that the newer version of slurm would be able to run without it. In our environment the thread count saturated and slurm started to drop requests all over the place. This even caused some node_fail messages, as the nodes themselves couldn't talk to the master because it was so busy. It sounds like your problem is similar to ours in that we just have too much traffic and demand hitting the master (we run with about 1000 cores and 800 users hitting the scheduler). So I advise using defer or max_rpc_cnt. max_rpc_cnt is good because when things are less busy the scheduler automatically switches out of the defer state and starts to schedule things more quickly, so it's adaptive. However, an even better thing would be for slurm to more intelligently auto-throttle itself (which max_rpc_cnt is a somewhat simple way of doing) when it receives tons of traffic due to completions, user queries, or just general busyness. The threads tend to stack up when you have a bunch of blocking requests. sdiag can help you track some of that back, but some of it can't be prevented. So it would be good, at least from the developer end, if the number of blocking requests could be reduced. In particular, in our case it would be good if user requests for information were responded to with an intelligent busy signal and an estimate of return time, as our users get a bit punchy if requests for information don't return immediately. Regardless, I would look at defer or max_rpc_cnt. -Paul Edmon- On 03/27/2015 07:32 AM, Mehdi Denou wrote: Hi, Using gdb you can retrieve which thread owns the locks on the slurmctld internal structures (and blocks all the others).
Then it will be easier to understand what is happening. On 27/03/2015 12:24, Stuart Rankin wrote: Hi, I am running slurm 14.11.4 on an 800-node RHEL6.6 general-purpose University cluster. Since upgrading from 14.03.3 we have been seeing the following problem and I'd appreciate any advice (maybe it's a bug, but maybe I'm missing something obvious). Occasionally the number of slurmctld threads starts to rise rapidly until it hits the hard-coded 256 limit and stays there. The threads are in the futex_ state according to ps, and logging stops (nothing out of the ordinary leaps out in the log before this happens). Naturally slurm clients then start failing with timeout messages (which isn't trivial, since it causes some not very resilient user pipelines to fail). This condition has persisted for several hours during the night without being detected. However there is a simple workaround, which is to send a STOP signal to the slurmctld process, wait a few seconds, then resume it - this clears the logjam. Merely attaching a debugger has the same effect! I feel this must be a clue as to the root cause. I have already tried setting CommitDelay in slurmdbd.conf, increasing MessageTimeout, setting HealthCheckNodeState=CYCLE and decreasing/increasing bf_yield_interval/bf_yield_sleep without any apparent impact (please see slurm.conf attached). Any advice would be gratefully received. Many thanks - Stuart
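For concreteness, the throttling Paul describes is configured through SchedulerParameters in slurm.conf; a sketch matching the settings discussed in this thread (values illustrative):

```
# slurm.conf (illustrative)
SchedulerParameters=max_rpc_cnt=16,default_queue_depth=100
# or, to always defer scheduling decisions at submit time:
# SchedulerParameters=defer
```

With max_rpc_cnt set, slurmctld postpones scheduling passes whenever more than that many RPCs are pending, which is the adaptive behavior Paul mentions.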
[slurm-dev] Re: slurm on NFS for a cluster?
Yeah, we use puppet and yum to manage our stack. Works pretty well and scales nicely. -Paul Edmon- On 03/26/2015 11:46 AM, Jason Bacon wrote: +1 for using package managers in general. On our CentOS clusters, I do the munge and slurm installs using pkgsrc (+ pkgsrc-wip). http://acadix.biz/pkgsrc.php I use Yum for most system services and for libraries required by commercial software, and pkgsrc for all the latest open source software. I rsync a small pkgsrc tree to the local drive of each node for munge, slurm, and a few basic tools, and keep a separate, more extensive tree on the NFS share for scientific software. The current slurm package is pretty outdated, but we'll bring it up to date soon. Regards, Jason On 3/26/15 4:23 AM, Paddy Doyle wrote: +1 for local installs. We build the RPMs and put them in a local repo (Scientific Linux 6), so installing/upgrading via Salt/Puppet/Ansible etc. is quite scalable. It works for us, but of course YMMV. Paddy On Tue, Mar 24, 2015 at 06:46:46PM -0700, Paul Edmon wrote: Yeah, we've been running CentOS 6 and slurm in this fashion for about a year and a half on about a thousand machines and haven't really had a problem with this. Though I don't know if this method scales indefinitely. We just have a symlink back to our conf from /etc/slurm/slurm.conf. We then control the version via RPM installs. -Paul Edmon- On 3/24/2015 4:22 PM, Jason Bacon wrote: I ran one of our CentOS clusters this way for about a year and found it to be more trouble than it was worth. I recently reconfigured it to run all system services from local disks so that nodes are as independent of each other as possible. Assuming you have ssh keys on all the nodes, syncing slurm.conf and other files is a snap using a simple shell script. We only use NFS for data files and user applications at this point. Of course, if your compute nodes don't have local disks, that's another story.
Jason On 03/24/15 14:42, Jeff Layton wrote: Good afternoon, I apologize for the newb question, but I'm setting up slurm for the first time in a very long time. I've got a small cluster of a master node and 4 compute nodes. I'd like to install slurm on an NFS file system that is exported from the master node and mounted on the compute nodes. I've been reading a bit about this, but does anyone have recommendations on what to watch out for? Thanks! Jeff
[slurm-dev] Re: slurm on NFS for a cluster - Part II
All of them should be owned by munge. Furthermore, for security's sake I would make them all accessible only to munge, at least the etc one. -Paul Edmon- On 03/25/2015 10:29 AM, Jeff Layton wrote: I assume the same is true for /var/log/munge and /var/run/munge? How about /etc/munge? Thanks! Jeff Yea, that folder and the files inside need to be owned by munge. -Paul Edmon- On 03/25/2015 09:54 AM, Jeff Layton wrote: Good morning, Thanks for all of the advice in regard to slurm on NFS. I've started on my slurm quest by installing munge, but I'm having some trouble. I'm not sure this is the right place to ask about munge, but here goes. I'm building and installing munge with the following options: ./configure --exec-prefix=/share/ec2-user --prefix= \ --sysconfdir=/etc/ --sharedstatedir=/var/ This allows me to store the local state information in /var and the configuration information in /etc, but everything else is stored in /share/ec2-user/ (NFS shared directory). The build goes fine, but when I try to start the munge service I get the following: [ec2-user@ip-10-0-1-72 munge-0.5.11]$ sudo service munge start Starting MUNGE: munged (failed). munged: Error: Failed to check logfile /var/log/munge/munged.log: Permission denied The permissions on /var/log/munge are: [ec2-user@ip-10-0-1-72 munge-0.5.11]$ sudo ls -lstar /var/log/munge total 8 4 drwxr-xr-x 6 root root 4096 Mar 25 13:39 .. 4 drwx------ 2 root root 4096 Mar 25 13:39 . I'm not sure, but should this directory be owned by user munge? TIA! Jeff
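The directory modes munged checks for can be staged ahead of time. The sketch below uses a scratch prefix purely for illustration; a real install would use /etc/munge, /var/log/munge, /var/lib/munge, and /var/run/munge, all chowned to the munge user as Paul says:

```shell
# Create munge directories with the restrictive modes munged expects.
# $prefix stands in for / here; on a real node you would also run
# chown -R munge:munge on these paths.
prefix=$(mktemp -d)
install -d -m 0700 "$prefix/etc/munge" "$prefix/var/log/munge" "$prefix/var/lib/munge"
install -d -m 0755 "$prefix/var/run/munge"   # pid-file dir can stay world-readable
stat -c '%a %n' "$prefix/etc/munge" "$prefix/var/log/munge"
```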
[slurm-dev] Re: slurm on NFS for a cluster - Part II
Yea, that folder and the files inside need to be owned by munge. -Paul Edmon- On 03/25/2015 09:54 AM, Jeff Layton wrote: Good morning, Thanks for all of the advice in regard to slurm on NFS. I've started on my slurm quest by installing munge, but I'm having some trouble. I'm not sure this is the right place to ask about munge, but here goes. I'm building and installing munge with the following options: ./configure --exec-prefix=/share/ec2-user --prefix= \ --sysconfdir=/etc/ --sharedstatedir=/var/ This allows me to store the local state information in /var and the configuration information in /etc, but everything else is stored in /share/ec2-user/ (NFS shared directory). The build goes fine, but when I try to start the munge service I get the following: [ec2-user@ip-10-0-1-72 munge-0.5.11]$ sudo service munge start Starting MUNGE: munged (failed). munged: Error: Failed to check logfile /var/log/munge/munged.log: Permission denied The permissions on /var/log/munge are: [ec2-user@ip-10-0-1-72 munge-0.5.11]$ sudo ls -lstar /var/log/munge total 8 4 drwxr-xr-x 6 root root 4096 Mar 25 13:39 .. 4 drwx------ 2 root root 4096 Mar 25 13:39 . I'm not sure, but should this directory be owned by user munge? TIA! Jeff
[slurm-dev] Re: successful systemd service start on RHEL7?
I have tried building slurm 14.11.4 on CentOS7 but it never quite worked right. I'm not sure if it has been vetted for RHEL7 yet. I didn't dig too deeply though when I did build it as I just figured it wasn't ready for RHEL7. -Paul Edmon- On 03/24/2015 10:32 AM, Fred Liu wrote: Hi, Anyone successfully started systemd service on RHEL7? I failed like following: [root@cnlnx03 system]# systemctl start slurmctld Job for slurmctld.service failed. See 'systemctl status slurmctld.service' and 'journalctl -xn' for details. [root@cnlnx03 system]# systemctl status slurmctld.service slurmctld.service - Slurm controller daemon Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled) Active: failed (Result: timeout) since Tue 2015-03-24 22:22:46 CST; 4min 32s ago Mar 24 22:21:05 cnlnx03 slurmctld[20561]: init_requeue_policy: kill_invalid_depend is set to 0 Mar 24 22:21:05 cnlnx03 slurmctld[20561]: Recovered state of 0 reservations Mar 24 22:21:05 cnlnx03 slurmctld[20561]: read_slurm_conf: backup_controller not specified. Mar 24 22:21:05 cnlnx03 slurmctld[20561]: Running as primary controller Mar 24 22:22:05 cnlnx03 slurmctld[20561]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_tim...pth=0 Mar 24 22:22:45 cnlnx03 systemd[1]: slurmctld.service operation timed out. Terminating. Mar 24 22:22:45 cnlnx03 slurmctld[20561]: Terminate signal (SIGINT or SIGTERM) received Mar 24 22:22:45 cnlnx03 slurmctld[20561]: Saving all slurm state Mar 24 22:22:46 cnlnx03 systemd[1]: Failed to start Slurm controller daemon. Mar 24 22:22:46 cnlnx03 systemd[1]: Unit slurmctld.service entered failed state. Hint: Some lines were ellipsized, use -l to show in full. [root@cnlnx03 system]# journalctl -xn -- Logs begin at Wed 2015-03-11 17:23:37 CST, end at Tue 2015-03-24 22:25:02 CST. -- Mar 24 22:22:45 cnlnx03 systemd[1]: slurmctld.service operation timed out. Terminating. 
Mar 24 22:22:45 cnlnx03 slurmctld[20561]: Terminate signal (SIGINT or SIGTERM) received Mar 24 22:22:45 cnlnx03 slurmctld[20561]: Saving all slurm state Mar 24 22:22:46 cnlnx03 slurmctld[20561]: layouts: all layouts are now unloaded. Mar 24 22:22:46 cnlnx03 systemd[1]: Failed to start Slurm controller daemon. -- Subject: Unit slurmctld.service has failed -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
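One plausible cause of the "operation timed out" failure (an educated guess, not a confirmed diagnosis): with a Type=forking unit, systemd waits on the PID file, so if PIDFile in the unit doesn't match SlurmctldPidFile in slurm.conf the start will time out even though slurmctld came up and ran. An illustrative unit fragment:

```
# /usr/lib/systemd/system/slurmctld.service (fragment, illustrative)
[Service]
Type=forking
ExecStart=/usr/sbin/slurmctld
PIDFile=/var/run/slurmctld.pid   # must match SlurmctldPidFile in slurm.conf
```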
[slurm-dev] Re: slurm on NFS for a cluster?
Yeah, we've been running CentOS 6 and slurm in this fashion for about a year and a half on about a thousand machines and haven't really had a problem with this. Though I don't know if this method scales indefinitely. We just have a symlink back to our conf from /etc/slurm/slurm.conf. We then control the version via RPM installs. -Paul Edmon- On 3/24/2015 4:22 PM, Jason Bacon wrote: I ran one of our CentOS clusters this way for about a year and found it to be more trouble than it was worth. I recently reconfigured it to run all system services from local disks so that nodes are as independent of each other as possible. Assuming you have ssh keys on all the nodes, syncing slurm.conf and other files is a snap using a simple shell script. We only use NFS for data files and user applications at this point. Of course, if your compute nodes don't have local disks, that's another story. Jason On 03/24/15 14:42, Jeff Layton wrote: Good afternoon, I apologize for the newb question, but I'm setting up slurm for the first time in a very long time. I've got a small cluster of a master node and 4 compute nodes. I'd like to install slurm on an NFS file system that is exported from the master node and mounted on the compute nodes. I've been reading a bit about this, but does anyone have recommendations on what to watch out for? Thanks! Jeff
[slurm-dev] Re: slurm on NFS for a cluster?
Yup, that's exactly what we do. We make sure to export it read-only, and make sure that it is synced and hard-mounted. Not much else to it. -Paul Edmon- On 03/24/2015 03:43 PM, Jeff Layton wrote: Good afternoon, I apologize for the newb question, but I'm setting up slurm for the first time in a very long time. I've got a small cluster of a master node and 4 compute nodes. I'd like to install slurm on an NFS file system that is exported from the master node and mounted on the compute nodes. I've been reading a bit about this, but does anyone have recommendations on what to watch out for? Thanks! Jeff
[slurm-dev] Re: slurm on NFS for a cluster?
Interesting. Yeah, we use v3 here. We hadn't tried out v4 - good thing we didn't, then. -Paul Edmon- On 03/24/2015 04:05 PM, Uwe Sauter wrote: And if you are planning on using cgroups, don't use NFSv4. There are problems that cause the NFS client process to freeze (and with that, freeze the node) when the cgroup removal script is called. Regards, Uwe Sauter On 24.03.2015 at 20:50, Paul Edmon wrote: Yup, that's exactly what we do. We make sure to export it read-only, and make sure that it is synced and hard-mounted. Not much else to it. -Paul Edmon- On 03/24/2015 03:43 PM, Jeff Layton wrote: Good afternoon, I apologize for the newb question, but I'm setting up slurm for the first time in a very long time. I've got a small cluster of a master node and 4 compute nodes. I'd like to install slurm on an NFS file system that is exported from the master node and mounted on the compute nodes. I've been reading a bit about this, but does anyone have recommendations on what to watch out for? Thanks! Jeff
[slurm-dev] Re: SlurmDBD Archiving
Oh, and for the record, we are running 14.11.4. -Paul Edmon- On 03/10/2015 09:26 AM, Paul Edmon wrote: So when I tried to do an archive dump I got the following error. What does this mean? [root@holy-slurm01 slurm]# sacctmgr -i archive dump sacctmgr: error: slurmdbd: Getting response to message type 1459 sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error Problem dumping archive: Unspecified error It also caused the slurmdbd to crash, so I had to restart it. Here is the log: Mar 10 08:18:21 holy-slurm01 slurmctld[20144]: slurmdbd: agent queue size 50 Mar 10 09:20:57 holy-slurm01 slurmdbd[47429]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction Mar 10 09:20:57 holy-slurm01 slurmdbd[47429]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program Mar 10 09:20:57 holy-slurm01 slurmctld[20144]: error: slurmdbd: Getting response to message type 1407 Mar 10 09:20:57 holy-slurm01 slurmctld[20144]: slurmdbd: reopening connection Mar 10 09:21:02 holy-slurm01 slurmctld[20144]: error: slurmdbd: Sending message type 1472: 11: Connection refused Mar 10 09:21:02 holy-slurm01 slurmctld[20144]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused It did manage to dump: [root@holy-slurm01 archive]# ls -ltr total 160260 -rw------- 1 slurm slurm_users 163847476 Mar 10 09:16 odyssey_event_archive_2013-08-01T00:00:00_2014-08-31T23:59:59 -rw------- 1 slurm slurm_users    253639 Mar 10 09:17 odyssey_suspend_archive_2013-08-01T00:00:00_2014-08-31T23:59:59 Is it safe to try again? -Paul Edmon- On 03/06/2015 03:07 PM, Paul Edmon wrote: Ah, okay, that was the command I was looking for. I wasn't sure how to force it. Thanks. -Paul Edmon- On 03/06/2015 01:43 PM, Danny Auble wrote: It looks like I might stand corrected though. It looks like you will have to wait for the month to go by before the purge starts. With a lot of jobs it may take a while depending on the speed of your disk and such.
If you have the debugflag=DB_USAGE you should see the sql statements go by. This will be quite verbose, so it might not be that great of an idea. You can edit the slurmdbd.conf file with the flag and run sacctmgr reconfig to add or remove the flag. You can force a purge which in turn will force an archive with sacctmgr archive dump see http://slurm.schedmd.com/sacctmgr.html Danny On 03/06/2015 10:12 AM, Paul Edmon wrote: How long does that typically take? Because I have done it on our large job database and I have seen nothing yet. -Paul Edmon On 03/06/2015 01:07 PM, Danny Auble wrote: If you had older than 6 month data I would expect it to purge on restart of the slurmdbd. You will see a message in the log when the archive file is created. On 03/06/2015 09:58 AM, Paul Edmon wrote: Okay, that's what I suspected. We set it to 6 months. So I guess then the purge will happen on April 1st. -Paul Edmon- On 03/06/2015 12:33 PM, Danny Auble wrote: Paul, do you have Purge* set up in the slurmdbd.conf? Archiving takes place during the Purge process. If no Purge values are set archiving will never take place since nothing is ever purged. When Purge values are set the related archivings take place on an hourly, daily, and monthly basis depending on the units your purge values are set to. If PurgeJobs=2months the archive would take place at the beginning of each month. If it were set to 2hours it would happen each hour. The purge will also happen on slurmdbd startup as well if running things for the first time. Danny On 03/06/2015 08:20 AM, Paul Edmon wrote: So we recently turned this on to archive jobs older than 6 months. However when we restarted slurmdbd nothing happened, at least no file was deposited at the specified archive location. Is there a way to force it to purge? When is the archiving scheduled to be done? We definitely have jobs older than 6 months in the database, I'm just curious about the schedule of when the archiving is done. -Paul Edmon-
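To summarize the thread: archiving only happens as a side effect of purging, so both families of options need to be set in slurmdbd.conf. An illustrative fragment matching the 6-month policy discussed here (the path is made up; check option spellings against your version's slurmdbd.conf man page):

```
# slurmdbd.conf (illustrative)
PurgeJobAfter=6months       # purge -- and therefore archive -- old job records
PurgeEventAfter=6months
PurgeSuspendAfter=6months
ArchiveJobs=yes
ArchiveEvents=yes
ArchiveSuspend=yes
ArchiveDir=/var/slurm/archive
```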
[slurm-dev] Re: SlurmDBD Archiving
So when I tried to do an archive dump I got the following error. What does this mean? [root@holy-slurm01 slurm]# sacctmgr -i archive dump sacctmgr: error: slurmdbd: Getting response to message type 1459 sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error Problem dumping archive: Unspecified error It also caused the slurmdbd to crash, so I had to restart it. Here is the log: Mar 10 08:18:21 holy-slurm01 slurmctld[20144]: slurmdbd: agent queue size 50 Mar 10 09:20:57 holy-slurm01 slurmdbd[47429]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction Mar 10 09:20:57 holy-slurm01 slurmdbd[47429]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program Mar 10 09:20:57 holy-slurm01 slurmctld[20144]: error: slurmdbd: Getting response to message type 1407 Mar 10 09:20:57 holy-slurm01 slurmctld[20144]: slurmdbd: reopening connection Mar 10 09:21:02 holy-slurm01 slurmctld[20144]: error: slurmdbd: Sending message type 1472: 11: Connection refused Mar 10 09:21:02 holy-slurm01 slurmctld[20144]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused It did manage to dump: [root@holy-slurm01 archive]# ls -ltr total 160260 -rw------- 1 slurm slurm_users 163847476 Mar 10 09:16 odyssey_event_archive_2013-08-01T00:00:00_2014-08-31T23:59:59 -rw------- 1 slurm slurm_users    253639 Mar 10 09:17 odyssey_suspend_archive_2013-08-01T00:00:00_2014-08-31T23:59:59 Is it safe to try again? -Paul Edmon- On 03/06/2015 03:07 PM, Paul Edmon wrote: Ah, okay, that was the command I was looking for. I wasn't sure how to force it. Thanks. -Paul Edmon- On 03/06/2015 01:43 PM, Danny Auble wrote: It looks like I might stand corrected though. It looks like you will have to wait for the month to go by before the purge starts. With a lot of jobs it may take a while depending on the speed of your disk and such. If you have the debugflag=DB_USAGE you should see the sql statements go by.
This will be quite verbose, so it might not be that great of an idea. You can edit the slurmdbd.conf file with the flag and run sacctmgr reconfig to add or remove the flag. You can force a purge which in turn will force an archive with sacctmgr archive dump see http://slurm.schedmd.com/sacctmgr.html Danny On 03/06/2015 10:12 AM, Paul Edmon wrote: How long does that typically take? Because I have done it on our large job database and I have seen nothing yet. -Paul Edmon On 03/06/2015 01:07 PM, Danny Auble wrote: If you had older than 6 month data I would expect it to purge on restart of the slurmdbd. You will see a message in the log when the archive file is created. On 03/06/2015 09:58 AM, Paul Edmon wrote: Okay, that's what I suspected. We set it to 6 months. So I guess then the purge will happen on April 1st. -Paul Edmon- On 03/06/2015 12:33 PM, Danny Auble wrote: Paul, do you have Purge* set up in the slurmdbd.conf? Archiving takes place during the Purge process. If no Purge values are set archiving will never take place since nothing is ever purged. When Purge values are set the related archivings take place on an hourly, daily, and monthly basis depending on the units your purge values are set to. If PurgeJobs=2months the archive would take place at the beginning of each month. If it were set to 2hours it would happen each hour. The purge will also happen on slurmdbd startup as well if running things for the first time. Danny On 03/06/2015 08:20 AM, Paul Edmon wrote: So we recently turned this on to archive jobs older than 6 months. However when we restarted slurmdbd nothing happened, at least no file was deposited at the specified archive location. Is there a way to force it to purge? When is the archiving scheduled to be done? We definitely have jobs older than 6 months in the database, I'm just curious about the schedule of when the archiving is done. -Paul Edmon-
[slurm-dev] Re: SlurmDBD Archiving
Ok, good to know. We've never purged this database and it has well over 34 million jobs in it. I was hoping to do so in a controlled way, but I guess I will have to wait for the 1st. Out of curiosity, is there a way to change which day of the month it does the archive on? Could that be added as a feature in a future release? -Paul Edmon- On 03/10/2015 11:18 AM, Danny Auble wrote: The fatal you received means your query lasted more than 15 minutes; mysql deemed it hung and aborted. You can increase the timeout for innodb_lock_wait_timeout in your my.cnf and try again, but that generally isn't a good idea. You can safely try again as many times as you would like and perhaps it will finish, or just wait for the normal purge. This timeout should only occur when running the archive dump with sacctmgr; under a normal purge with the dbd this wouldn't happen. On March 10, 2015 6:26:19 AM PDT, Paul Edmon ped...@cfa.harvard.edu wrote: So when I tried to do an archive dump I got the following error. What does this mean? [root@holy-slurm01 slurm]# sacctmgr -i archive dump sacctmgr: error: slurmdbd: Getting response to message type 1459 sacctmgr: error: slurmdbd: DBD_ARCHIVE_DUMP failure: No error Problem dumping archive: Unspecified error It also caused the slurmdbd to crash, so I had to restart it. Here is the log: Mar 10 08:18:21 holy-slurm01 slurmctld[20144]: slurmdbd: agent queue size 50 Mar 10 09:20:57 holy-slurm01 slurmdbd[47429]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction Mar 10 09:20:57 holy-slurm01 slurmdbd[47429]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error.
The only way to fix this is restart the calling program Mar 10 09:20:57 holy-slurm01 slurmctld[20144]: error: slurmdbd: Getting response to message type 1407 Mar 10 09:20:57 holy-slurm01 slurmctld[20144]: slurmdbd: reopening connection Mar 10 09:21:02 holy-slurm01 slurmctld[20144]: error: slurmdbd: Sending message type 1472: 11: Connection refused Mar 10 09:21:02 holy-slurm01 slurmctld[20144]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused It did manage to dump: [root@holy-slurm01 archive]# ls -ltr total 160260 -rw------- 1 slurm slurm_users 163847476 Mar 10 09:16 odyssey_event_archive_2013-08-01T00:00:00_2014-08-31T23:59:59 -rw------- 1 slurm slurm_users    253639 Mar 10 09:17 odyssey_suspend_archive_2013-08-01T00:00:00_2014-08-31T23:59:59 Is it safe to try again? -Paul Edmon- On 03/06/2015 03:07 PM, Paul Edmon wrote: Ah, okay, that was the command I was looking for. I wasn't sure how to force it. Thanks. -Paul Edmon- On 03/06/2015 01:43 PM, Danny Auble wrote: It looks like I might stand corrected though. It looks like you will have to wait for the month to go by before the purge starts. With a lot of jobs it may take a while depending on the speed of your disk and such. If you have the debugflag=DB_USAGE you should see the sql statements go by. This will be quite verbose, so it might not be that great of an idea. You can edit the slurmdbd.conf file with the flag and run sacctmgr reconfig to add or remove the flag. You can force a purge which in turn will force an archive with sacctmgr archive dump see http://slurm.schedmd.com/sacctmgr.html Danny On 03/06/2015 10:12 AM, Paul Edmon wrote: How long does that typically take? Because I have done it on our large job database and I have seen nothing yet. -Paul Edmon On 03/06/2015 01:07 PM, Danny Auble wrote: If you had older than 6 month data I would expect it to purge on restart of the slurmdbd. You will see a message in the log when the archive file is created.
On 03/06/2015 09:58 AM, Paul Edmon wrote: Okay, that's what I suspected. We set it to 6 months. So I guess then the purge will happen on April 1st. -Paul Edmon- On 03/06/2015 12:33 PM, Danny Auble wrote: Paul, do you have Purge* set up in the slurmdbd.conf? Archiving takes place during the Purge process. If no Purge values are set archiving will never take place since nothing is ever purged. When Purge values are set the related archivings take place on an hourly, daily, and monthly basis depending on the units your purge values are set to.
[slurm-dev] Re: SlurmDBD Archiving
Okay, that's what I suspected. We set it to 6 months. So I guess then the purge will happen on April 1st. -Paul Edmon- On 03/06/2015 12:33 PM, Danny Auble wrote: Paul, do you have Purge* set up in the slurmdbd.conf? Archiving takes place during the Purge process. If no Purge values are set archiving will never take place since nothing is ever purged. When Purge values are set the related archivings take place on an hourly, daily, and monthly basis depending on the units your purge values are set to. If PurgeJobs=2months the archive would take place at the beginning of each month. If it were set to 2hours it would happen each hour. The purge will also happen on slurmdbd startup as well if running things for the first time. Danny On 03/06/2015 08:20 AM, Paul Edmon wrote: So we recently turned this on to archive jobs older than 6 months. However when we restarted slurmdbd nothing happened, at least no file was deposited at the specified archive location. Is there a way to force it to purge? When is the archiving scheduled to be done? We definitely have jobs older than 6 months in the database, I'm just curious about the schedule of when the archiving is done. -Paul Edmon-
[slurm-dev] Re: SlurmDBD Archiving
Ah, okay, that was the command I was looking for. I wasn't sure how to force it. Thanks. -Paul Edmon- On 03/06/2015 01:43 PM, Danny Auble wrote: It looks like I might stand corrected though. It looks like you will have to wait for the month to go by before the purge starts. With a lot of jobs it may take a while depending on the speed of your disk and such. If you have the debugflag=DB_USAGE you should see the sql statements go by. This will be quite verbose, so it might not be that great of an idea. You can edit the slurmdbd.conf file with the flag and run sacctmgr reconfig to add or remove the flag. You can force a purge which in turn will force an archive with sacctmgr archive dump see http://slurm.schedmd.com/sacctmgr.html Danny On 03/06/2015 10:12 AM, Paul Edmon wrote: How long does that typically take? Because I have done it on our large job database and I have seen nothing yet. -Paul Edmon On 03/06/2015 01:07 PM, Danny Auble wrote: If you had older than 6 month data I would expect it to purge on restart of the slurmdbd. You will see a message in the log when the archive file is created. On 03/06/2015 09:58 AM, Paul Edmon wrote: Okay, that's what I suspected. We set it to 6 months. So I guess then the purge will happen on April 1st. -Paul Edmon- On 03/06/2015 12:33 PM, Danny Auble wrote: Paul, do you have Purge* set up in the slurmdbd.conf? Archiving takes place during the Purge process. If no Purge values are set archiving will never take place since nothing is ever purged. When Purge values are set the related archivings take place on an hourly, daily, and monthly basis depending on the units your purge values are set to. If PurgeJobs=2months the archive would take place at the beginning of each month. If it were set to 2hours it would happen each hour. The purge will also happen on slurmdbd startup as well if running things for the first time. Danny On 03/06/2015 08:20 AM, Paul Edmon wrote: So we recently turned this on to archive jobs older than 6 months. 
However when we restarted slurmdbd nothing happened, at least no file was deposited at the specified archive location. Is there a way to force it to purge? When is the archiving scheduled to be done? We definitely have jobs older than 6 months in the database, I'm just curious about the schedule of when the archiving is done. -Paul Edmon-
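Pulling the thread's advice together, a minimal sketch of the relevant slurmdbd.conf settings (the six-month values and directory path are illustrative assumptions, not a production config):

```
# slurmdbd.conf -- illustrative purge/archive settings (values are examples)
ArchiveDir=/var/spool/slurm/archive   # where archive dump files get written
ArchiveJobs=yes
ArchiveSteps=yes
PurgeJobAfter=6months                 # archiving happens as part of the purge,
PurgeStepAfter=6months                # so without Purge* values nothing archives

# Very verbose: watch the purge SQL go by while debugging, then remove
#DebugFlags=DB_USAGE

# To force an immediate purge (and thus an archive) instead of waiting
# for the monthly run:
#   sacctmgr archive dump
```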
[slurm-dev] Requeue Exit
So what are the default values for these two options? We recently updated to 14.11 and jobs that previously would have just requeued due to node failure are now going into a held state. *RequeueExit* Enables automatic job requeue for jobs which exit with the specified values. Separate multiple exit codes by a comma. Jobs will be put back into pending state and later scheduled again. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. *RequeueExitHold* Enables automatic requeue of jobs into a held pending state, meaning their priority is zero. Separate multiple exit codes by a comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. -Paul Edmon-
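For concreteness, a sketch of how these two options would look in slurm.conf (the exit-code values here are arbitrary examples, not recommendations):

```
# slurm.conf -- example only; exit codes are made up
RequeueExit=100,101       # jobs exiting 100 or 101 go back to pending
RequeueExitHold=102,103   # jobs exiting 102 or 103 are requeued held
                          # (priority 0, JOB_SPECIAL_EXIT state)
```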
[slurm-dev] Re: Requeue Exit
Basically the node cuts out due to hardware issues and the job is requeued. I'm just trying to figure out why it sent them into a held state as opposed to simply requeueing as normal. Thoughts? -Paul Edmon- On 03/03/2015 12:11 PM, David Bigagli wrote: There are no default values for these parameters, you have to configure your own. In your case does the prolog fail or does the node change state as the jobs are running? On 03/03/2015 08:30 AM, Paul Edmon wrote: So what are the default values for these two options? We recently updated to 14.11 and jobs that previously would have just requeued due to node failure are now going into a held state. *RequeueExit* Enables automatic job requeue for jobs which exit with the specified values. Separate multiple exit codes by a comma. Jobs will be put back into pending state and later scheduled again. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. *RequeueExitHold* Enables automatic requeue of jobs into a held pending state, meaning their priority is zero. Separate multiple exit codes by a comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. -Paul Edmon-
[slurm-dev] Re: Requeue Exit
We are definitely using the default for that one. So it should be requeueing just fine. -Paul Edmon- On 03/03/2015 01:05 PM, Lipari, Don wrote: It looks like the governing config parameter would be: JobRequeue This option controls what to do by default after a node failure. If JobRequeue is set to a value of 1, then any batch job running on the failed node will be requeued for execution on different nodes. If JobRequeue is set to a value of 0, then any job running on the failed node will be terminated. Use the sbatch --no-requeue or --requeue option to change the default behavior for individual jobs. The default value is 1. According to this, the job should be requeued and not held. -Original Message- From: Paul Edmon [mailto:ped...@cfa.harvard.edu] Sent: Tuesday, March 03, 2015 9:14 AM To: slurm-dev Subject: [slurm-dev] Re: Requeue Exit Basically the node cuts out due to hardware issues and the job is requeued. I'm just trying to figure out why it sent them into a held state as opposed to simply requeueing as normal. Thoughts? -Paul Edmon- On 03/03/2015 12:11 PM, David Bigagli wrote: There are no default values for these parameters, you have to configure your own. In your case does the prolog fail or does the node change state as the jobs are running? On 03/03/2015 08:30 AM, Paul Edmon wrote: So what are the default values for these two options? We recently updated to 14.11 and jobs that previously would have just requeued due to node failure are now going into a held state. *RequeueExit* Enables automatic job requeue for jobs which exit with the specified values. Separate multiple exit codes by a comma. Jobs will be put back into pending state and later scheduled again. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. *RequeueExitHold* Enables automatic requeue of jobs into a held pending state, meaning their priority is zero. Separate multiple exit codes by a comma.
These jobs are put in the *JOB_SPECIAL_EXIT* exit state. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. -Paul Edmon-
[slurm-dev] Re: Requeue Exit
In this case the node was in a funny state where it couldn't resolve user IDs. So right after the job tried to launch it failed and requeued. We just let the scheduler do what it will when it lists NODE_FAIL. -Paul Edmon- On 03/03/2015 01:20 PM, David Bigagli wrote: How do you set your node down? If I run a job and then issue 'scontrol update node=prometeo state=down' the job is requeued in pend. Do you have an epilog? On 03/03/2015 10:12 AM, Paul Edmon wrote: We are definitely using the default for that one. So it should be requeueing just fine. -Paul Edmon- On 03/03/2015 01:05 PM, Lipari, Don wrote: It looks like the governing config parameter would be: JobRequeue This option controls what to do by default after a node failure. If JobRequeue is set to a value of 1, then any batch job running on the failed node will be requeued for execution on different nodes. If JobRequeue is set to a value of 0, then any job running on the failed node will be terminated. Use the sbatch --no-requeue or --requeue option to change the default behavior for individual jobs. The default value is 1. According to this, the job should be requeued and not held. -Original Message- From: Paul Edmon [mailto:ped...@cfa.harvard.edu] Sent: Tuesday, March 03, 2015 9:14 AM To: slurm-dev Subject: [slurm-dev] Re: Requeue Exit Basically the node cuts out due to hardware issues and the job is requeued. I'm just trying to figure out why it sent them into a held state as opposed to simply requeueing as normal. Thoughts? -Paul Edmon- On 03/03/2015 12:11 PM, David Bigagli wrote: There are no default values for these parameters, you have to configure your own. In your case does the prolog fail or does the node change state as the jobs are running? On 03/03/2015 08:30 AM, Paul Edmon wrote: So what are the default values for these two options? We recently updated to 14.11 and jobs that previously would have just requeued due to node failure are now going into a held state.
*RequeueExit* Enables automatic job requeue for jobs which exit with the specified values. Separate multiple exit codes by a comma. Jobs will be put back into pending state and later scheduled again. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. *RequeueExitHold* Enables automatic requeue of jobs into a held pending state, meaning their priority is zero. Separate multiple exit codes by a comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. -Paul Edmon-
[slurm-dev] Re: Requeue Exit
Ah, good to know. I do prefer that behavior, just didn't expect it. Thanks. -Paul Edmon- On 03/03/2015 02:00 PM, David Bigagli wrote: Ah ok, the job failed to launch; in this case Slurm requeues the job in a held state, where the previous behaviour was to terminate the job. The reason for this is to avoid the job dispatch failing over and over. On 03/03/2015 10:53 AM, Paul Edmon wrote: In this case the node was in a funny state where it couldn't resolve user IDs. So right after the job tried to launch it failed and requeued. We just let the scheduler do what it will when it lists NODE_FAIL. -Paul Edmon- On 03/03/2015 01:20 PM, David Bigagli wrote: How do you set your node down? If I run a job and then issue 'scontrol update node=prometeo state=down' the job is requeued in pend. Do you have an epilog? On 03/03/2015 10:12 AM, Paul Edmon wrote: We are definitely using the default for that one. So it should be requeueing just fine. -Paul Edmon- On 03/03/2015 01:05 PM, Lipari, Don wrote: It looks like the governing config parameter would be: JobRequeue This option controls what to do by default after a node failure. If JobRequeue is set to a value of 1, then any batch job running on the failed node will be requeued for execution on different nodes. If JobRequeue is set to a value of 0, then any job running on the failed node will be terminated. Use the sbatch --no-requeue or --requeue option to change the default behavior for individual jobs. The default value is 1. According to this, the job should be requeued and not held. -Original Message- From: Paul Edmon [mailto:ped...@cfa.harvard.edu] Sent: Tuesday, March 03, 2015 9:14 AM To: slurm-dev Subject: [slurm-dev] Re: Requeue Exit Basically the node cuts out due to hardware issues and the job is requeued. I'm just trying to figure out why it sent them into a held state as opposed to simply requeueing as normal. Thoughts?
-Paul Edmon- On 03/03/2015 12:11 PM, David Bigagli wrote: There are no default values for these parameters, you have to configure your own. In your case does the prolog fail or does the node change state as the jobs are running? On 03/03/2015 08:30 AM, Paul Edmon wrote: So what are the default values for these two options? We recently updated to 14.11 and jobs that previously would have just requeued due to node failure are now going into a held state. *RequeueExit* Enables automatic job requeue for jobs which exit with the specified values. Separate multiple exit codes by a comma. Jobs will be put back into pending state and later scheduled again. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. *RequeueExitHold* Enables automatic requeue of jobs into a held pending state, meaning their priority is zero. Separate multiple exit codes by a comma. These jobs are put in the *JOB_SPECIAL_EXIT* exit state. Restarted jobs will have the environment variable *SLURM_RESTART_COUNT* set to the number of times the job has been restarted. -Paul Edmon-
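As an aside, jobs parked in the JOB_SPECIAL_EXIT held state have to be released by hand once the underlying problem is fixed; something along these lines should work (the job ID is a placeholder):

```
# List jobs in the special-exit state (squeue code SE), then release one
squeue -t SE
scontrol release 12345
```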
[slurm-dev] Re: slurm.conf consistent across all nodes
You define blocks that have different designs. Such as: # Node1 NodeName=nodes[1-9],nodes[0-6] \ CPUs=64 RealMemory=258302 Sockets=4 CoresPerSocket=16 \ ThreadsPerCore=1 Feature=amd # Node2 NodeName=gpu0[1-8] \ CPUs=32 RealMemory=129013 Sockets=2 CoresPerSocket=16 \ ThreadsPerCore=1 Feature=intel Gres=gpu:2 -Paul Edmon- On 2/2/2015 1:09 PM, Bruce Roberts wrote: Yes. All nodes and their resources need to be defined in the slurm.conf on each node, not a different .conf on each node. On 02/02/2015 10:04 AM, Slurm User wrote: slurm.conf consistent across all nodes Hello The documentation says that slurm.conf must be consistent across all nodes. How do we handle parameters that are different on a per node basis? For example, RealMemory may be X on node 1, but Y on node 2. Am I missing something?
[slurm-dev] Re: slurm.conf consistent across all nodes
Yeah, that's good to get started for a conf, but then following the man page is the next step. -Paul Edmon- On 2/2/2015 1:29 PM, Slurm User wrote: Re: [slurm-dev] Re: slurm.conf consistent across all nodes Ian, Paul Thanks for your replies, that makes sense!!! I was using the configurator.html and it didn't allow multiple definitions. On Mon, Feb 2, 2015 at 10:23 AM, Paul Edmon ped...@cfa.harvard.edu mailto:ped...@cfa.harvard.edu wrote: You define blocks that have different designs. Such as: # Node1 NodeName=nodes[1-9],nodes[0-6] \ CPUs=64 RealMemory=258302 Sockets=4 CoresPerSocket=16 \ ThreadsPerCore=1 Feature=amd # Node2 NodeName=gpu0[1-8] \ CPUs=32 RealMemory=129013 Sockets=2 CoresPerSocket=16 \ ThreadsPerCore=1 Feature=intel Gres=gpu:2 -Paul Edmon- On 2/2/2015 1:09 PM, Bruce Roberts wrote: Yes. All nodes and their resources need to be defined in the slurm.conf on each node, not a different .conf on each node. On 02/02/2015 10:04 AM, Slurm User wrote: Hello The documentation says that slurm.conf must be consistent across all nodes. How do we handle parameters that are different on a per node basis? For example, RealMemory may be X on node 1, but Y on node 2. Am I missing something?
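To make the answer to the original question concrete: per-node differences such as RealMemory are expressed as separate NodeName definitions inside the one slurm.conf that is kept identical everywhere. A sketch (hostnames, core counts, and memory sizes invented):

```
# One slurm.conf, identical on every node; per-node differences live in
# separate NodeName lines (all values here are examples)
NodeName=node01 CPUs=16 RealMemory=64000
NodeName=node02 CPUs=16 RealMemory=128000
```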
[slurm-dev] Re: Partition for unused resources until needed by any other partition
Yes, we have this set up here. Here is an example: # Serial Requeue PartitionName=serial_requeue Priority=1 \ PreemptMode=REQUEUE MaxTime=7-0 Default=YES MaxNodes=1 \ AllowGroups=cluster_users \ Nodes=blah # Priority PartitionName=priority Priority=10 \ AllowGroups=important_people \ Nodes=blah # JOB PREEMPTION PreemptType=preempt/partition_prio PreemptMode=REQUEUE Since serial_requeue is the lowest priority it gets scheduled last and if any jobs come in from the higher priority queue it requeues the lower priority jobs. -Paul Edmon- On 10/20/2014 02:03 PM, je...@schedmd.com wrote: This should help: http://slurm.schedmd.com/preempt.html Quoting Mikael Johansson mikael.johans...@iki.fi: Hello All, I've been scratching my head for a while now trying to figure this one out, which I would think would be a rather common setup. I would need to set up a partition (or whatever, maybe a partition is actually not the way to go) with the following properties: 1. If there are any unused cores on the cluster, jobs submitted to this one would use them, and immediately have access to them. 2. The jobs should only use these resources until _any_ other job in another partition needs them. In this case, the jobs should be preempted and requeued. So this should be some sort of shadow queue/partition, that shouldn't affect the scheduling of other jobs on the cluster, but just use up any free resources that momentarily happen to be available. So SLURM should just continue scheduling everything else normally, and treat the cores used by this shadow queue as free resources, and then just immediately cancel and requeue any jobs there, when a real job starts. If anyone has something like this set up, example configs would be very welcome, as of course all other suggestions and ideas. Cheers, Mikael J. http://www.iki.fi/~mpjohans/
[slurm-dev] Re: Partition for unused resources until needed by any other partition
I advise using the SchedulerParameters options partition_job_depth and bf_max_job_part. These will force the scheduler to schedule jobs for each partition; otherwise it will take a strictly top-down approach. This is what we run: # default_queue_depth should be some multiple of the partition_job_depth, # ideally number_of_partitions * partition_job_depth. SchedulerParameters=default_queue_depth=5700,partition_job_depth=100,bf_interval=1,bf_continue,bf_window=2880,bf_resolution=3600,bf_max_job_test=5,bf_max_job_part=5,bf_max_job_user=1,bf_max_job_start=100,max_rpc_cnt=8 These parameters work well for a cluster of 50,000 cores, 57 queues, and about 40,000 jobs per day. We are running 14.03.8 -Paul Edmon- On 10/20/2014 02:19 PM, Mikael Johansson wrote: Hello, Yeah, I looked at that, and have now four partitions defined like this: PartitionName=short Nodes=node[005-026] Default=YES MaxNodes=6 MaxTime=02:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=no PreemptMode=off PartitionName=medium Nodes=node[009-026] Default=NO MaxNodes=4 MaxTime=168:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=no PreemptMode=off PartitionName=long Nodes=node[001-004] Default=NO MaxNodes=4 MaxTime=744:00:00 AllowGroups=ALL Priority=2 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=no PreemptMode=off PartitionName=backfill Nodes=node[001-026] Default=NO MaxNodes=10 MaxTime=168:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=no PreemptMode=requeue And I've set up: PreemptType=preempt/partition_prio PreemptMode=requeue PriorityType=priority/multifactor PriorityWeightFairshare=1 PriorityWeightAge=2000 SelectType=select/cons_res It works so far as when a job in backfill gets running, it will be requeued when a job in one of the other partitions starts. The problem is that there are plenty of free cores on the cluster that don't get assigned jobs from backfill.
If I understand things correctly, this is because there are jobs with a higher priority queueing in the other partitions. So I would maybe need a mechanism that increases the priority of the backfill jobs while queueing, but then immediately decreases it when the jobs start? Cheers, Mikael J. http://www.iki.fi/~mpjohans/ On Mon, 20 Oct 2014, je...@schedmd.com wrote: This should help: http://slurm.schedmd.com/preempt.html Quoting Mikael Johansson mikael.johans...@iki.fi: Hello All, I've been scratching my head for a while now trying to figure this one out, which I would think would be a rather common setup. I would need to set up a partition (or whatever, maybe a partition is actually not the way to go) with the following properties: 1. If there are any unused cores on the cluster, jobs submitted to this one would use them, and immediately have access to them. 2. The jobs should only use these resources until _any_ other job in another partition needs them. In this case, the jobs should be preempted and requeued. So this should be some sort of shadow queue/partition, that shouldn't affect the scheduling of other jobs on the cluster, but just use up any free resources that momentarily happen to be available. So SLURM should just continue scheduling everything else normally, and treat the cores used by this shadow queue as free resources, and then just immediately cancel and requeue any jobs there, when a real job starts. If anyone has something like this set up, example configs would be very welcome, as of course all other suggestions and ideas. Cheers, Mikael J. http://www.iki.fi/~mpjohans/ -- Morris Moe Jette CTO, SchedMD LLC
[slurm-dev] 14.03 FlexLM
If memory serves I thought that 14.03 was supposed to support hooking into FlexLM licensing. However, I can't find any documentation on that. Was that pushed off to a future release? -Paul Edmon-
[slurm-dev] Job Array sacct feature request
I don't know if this has been done in the newer versions of slurm but it would be good to have sacct be able to list both the JobID and the index of the Job Array if it is a job array. Thanks. -Paul Edmon-
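For what it's worth, later sacct output does show array jobs with a JobID of the form ArrayJobID_TaskIndex, so a pipeline can recover both pieces itself. A purely hypothetical post-processing one-liner, assuming that "base_underscore_index" format:

```shell
# Split a sacct-style array JobID like "12345_7" into the base job ID and
# the array task index; print "-" for a non-array job ID like "12345".
echo "12345_7" | awk -F'_' '{ print $1, ($2 == "" ? "-" : $2) }'
```

The same awk snippet can sit at the end of a pipe from `sacct -n -o JobID`.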
[slurm-dev] Re: QoS Feature Requests
Thanks for the info. The spillover stuff would be handy, but I can definitely see the difficulties with the coding of it. Though a similar mechanism exists for partitions, where you can list multiple partitions and it will execute on the one that will go first. Could this be imported into QoS? -Paul Edmon- On 5/21/2014 6:52 PM, je...@schedmd.com wrote: Quoting Paul Edmon ped...@cfa.harvard.edu: We have just started using QoS here and I was curious about a few features which would make our lives easier. 1. Spillover/overflow: Essentially if you use up one QoS you would spill over into your next lower priority QoS. For instance if you used up your group's QoS but still had jobs and there were idle cycles, your jobs that were pending for your high priority QoS would go to the low priority normal QoS. There isn't a great way to do this today. Each job is associated with a single QOS. One possibility would be to submit one job to each QOS and then whichever job started first would kill the others. A job submit plugin could probably handle the multiple submissions (e.g. if the --qos option has multiple comma-separated names, then submit one job for each QOS). Offhand I'm not sure what would be a good way to identify and purge the extra jobs. Some variation of the --depend=singleton logic would probably do the trick. 2. Gres: Adding number of GPUs or other Gres quantities to the QoS that can be used. This has been discussed, but not implemented yet. 3. Requeue/No Requeue: There are some partitions we want to allow QoS to requeue, others we don't. For instance we have a general queue which we don't want requeue on, but we also have a backfill queue that we do permit it on. If the QoS could kill the backfill jobs first to find space, and just wait on the general queue, that would be great. We haven't experimented with QoS Requeue but we may in the future, so this is just looking forward.
You can configure different preemption mechanisms and preempt by either QoS or partition. Take a look at: http://slurm.schedmd.com/preempt.html For example, you might enable QoS high to requeue jobs in QoS low, but wait for jobs in QoS medium. There is no mechanism to configure QoS high to preempt jobs by partition. We were also wondering if jobs asking for Gres could get higher priority on those nodes, such that they can grab the GPUs and leave the CPUs for everyone else. After all, the Gres resources are usually scarcer than the CPU resources and we would hate for a Gres resource to idle just because all the CPU jobs took up the slots. -Paul Edmon- This has also been discussed, but not implemented yet. One option might be to use a job_submit plugin to adjust a job's nice option based upon GRES. There is a partition parameter MaxCPUsPerNode that might be useful to limit the number of CPUs that are consumed on each node by each partition. You would probably require a separate partition/queue for GPU jobs for that to work well, so that would probably not work for you. Let me know if you need help pursuing these options. Moe Jette
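The partition-side analogue of the spillover idea mentioned above already works from the command line: sbatch accepts a comma-separated partition list and the job runs in whichever partition can start it first. Partition names here are placeholders:

```
# Submit to several partitions at once; the job starts in whichever
# partition can run it earliest (partition names are examples)
sbatch --partition=high,normal job.sh
```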
[slurm-dev] Re: Removing Job from Slurm Database
Well, more like the naive ones, namely: sacctmgr delete job JobID How do you set the endtime? Do you do that via scontrol? -Paul Edmon- On 04/21/2014 10:14 PM, Danny Auble wrote: What are the obvious ones? I would expect setting the end time to the start time and state to 4 (I think that is a completed state) should do it. On April 21, 2014 6:54:22 PM PDT, Paul Edmon ped...@cfa.harvard.edu wrote: Sure, I can hunt that info down. So what would be the command to remove the job from the DB? I tried the obvious ones I could think of but with no effect. -Paul Edmon- On 4/21/2014 4:31 PM, Danny Auble wrote: Paul, you should be able to remove the job with no issue. The real question is why it is still running in the database instead of completed. If you happen to have any logs on the job and the information from the database it would be nice to look at, since what you are describing shouldn't be possible. I know others have seen this before but no one has found a reproducer yet or any evidence on how the state was achieved. Let me know if you have anything like this. Thanks, Danny On 04/21/14 13:05, Paul Edmon wrote: Is there a way to delete a JobID and its relevant data from the slurm database? I have a user that I want to remove but there is a job which slurm thinks is not complete that is preventing me. I want slurm to just remove that job data as it shouldn't impact anything. -Paul Edmon-
[slurm-dev] Re: Removing Job from Slurm Database
Thanks. Sorry, I forgot about that thread. I'm wagering that the jobs got orphaned due to timing out. Essentially they actually launched but they didn't successfully update the database because it was busy. -Paul Edmon- On 04/22/2014 12:15 PM, Danny Auble wrote: Paul, I think this was covered in this thread https://groups.google.com/forum/#!searchin/slurm-devel/time_start/slurm-devel/nf7JxV91F40/KUsS1AmyWRYJ The gist of it is you have to go into the database and manually update the record. If you know the jobid or the db_inx you can do something like this update $CLUSTER_job_table set state=3, time_end=time_start where time_end=0 and id_job=$JOBID; That should make it go away from the check. Knowing why the job didn't finish in the database would be very good to know though, as this shouldn't happen. Danny On 04/22/14 06:48, Paul Edmon wrote: Well, more like the naive ones, namely: sacctmgr delete job JobID How do you set the endtime? Do you do that via scontrol? -Paul Edmon- On 04/21/2014 10:14 PM, Danny Auble wrote: What are the obvious ones? I would expect setting the end time to the start time and state to 4 (I think that is a completed state) should do it. On April 21, 2014 6:54:22 PM PDT, Paul Edmon ped...@cfa.harvard.edu wrote: Sure, I can hunt that info down. So what would be the command to remove the job from the DB? I tried the obvious ones I could think of but with no effect. -Paul Edmon- On 4/21/2014 4:31 PM, Danny Auble wrote: Paul, you should be able to remove the job with no issue. The real question is why it is still running in the database instead of completed. If you happen to have any logs on the job and the information from the database it would be nice to look at, since what you are describing shouldn't be possible. I know others have seen this before but no one has found a reproducer yet or any evidence on how the state was achieved. Let me know if you have anything like this.
Thanks, Danny On 04/21/14 13:05, Paul Edmon wrote: Is there a way to delete a JobID and its relevant data from the slurm database? I have a user that I want to remove but there is a job which slurm thinks is not complete that is preventing me. I want slurm to just remove that job data as it shouldn't impact anything. -Paul Edmon-
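Cleaned up, the manual fix from that thread looks like this (note the comma the original post dropped between the two SET clauses; $CLUSTER and $JOBID are placeholders for your cluster name and the stuck job's ID):

```sql
-- Mark an orphaned "running" job as completed so it stops blocking
-- user removal. Back up the database first; this edits slurmdbd's
-- tables directly, bypassing sacctmgr.
UPDATE $CLUSTER_job_table
SET state = 3, time_end = time_start
WHERE time_end = 0 AND id_job = $JOBID;
```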
[slurm-dev] srun and node unknown state
Occasionally when we reset the master, some of our nodes go into an unknown state or take a bit to get back in contact with the master. If srun is being launched on the nodes at that time it tends to hang, which causes the mpirun dependent on the srun being launched to fail. Even stranger, the sbatch that originally launched the srun keeps running and doesn't fail outright. Is there a way to prevent srun from failing and instead have it wait until the master comes back? Or is the timeout the only way to set this? Or if this isn't possible, can we have the parent sbatch die with an error rather than have srun just hang? Thanks for any insight. -Paul Edmon-
[slurm-dev] Removing Job from Slurm Database
Is there a way to delete a JobID and its relevant data from the slurm database? I have a user that I want to remove but there is a job which slurm thinks is not complete that is preventing me. I want slurm to just remove that job data as it shouldn't impact anything. -Paul Edmon-
[slurm-dev] Re: srun and node unknown state
Here is a relevant error: Apr 20 17:45:19 itc041 slurmd[50099]: launch task 9285091.51 request from 56441.33234@10.242.58.34 (port 704) Apr 20 17:45:19 itc041 slurmstepd[9850]: switch NONE plugin loaded Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherProfile NONE plugin loaded Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherEnergy NONE plugin loaded Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherInfiniband NONE plugin loaded Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherFilesystem NONE plugin loaded Apr 20 17:45:19 itc041 slurmstepd[9850]: Job accounting gather LINUX plugin loaded Apr 20 17:45:19 itc041 slurmstepd[9850]: task NONE plugin loaded Apr 20 17:45:19 itc041 slurmstepd[9850]: Checkpoint plugin loaded: checkpoint/none Apr 20 17:45:19 itc041 slurmd[itc041][9850]: debug level = 2 Apr 20 17:45:19 itc041 slurmd[itc041][9850]: task 0 (9855) started 2014-04-20T17:45:19 Apr 20 17:45:19 itc041 slurmd[itc041][9850]: auth plugin for Munge (http://code.google.com/p/munge/) loaded Apr 20 17:51:54 itc041 slurmd[itc041][9850]: task 0 (9855) exited with exit code 0. Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: slurm_receive_msg: Socket timed out on send/recv operation Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: Rank 0 failed sending step completion message directly to slurmctld (0.0.0.0:0), retrying Apr 20 17:53:59 itc041 slurmd[itc041][9850]: error: Abandoning IO 60 secs after job shutdown initiated Apr 20 17:54:35 itc041 slurmd[itc041][9850]: Rank 0 sent step completion message directly to slurmctld (0.0.0.0:0) Apr 20 17:54:35 itc041 slurmd[itc041][9850]: done with job Is there any way to prevent this? When this fails it creates a zombie task that holds the job open. I think part of the reason is that the user is looping over mpiruns like this: do i=1,1000 mpirun -np 64 ./executable enddo Each run lasts about 5 minutes. If one of the mpiruns fails to launch, the entire thing hangs.
It would be better if srun kept trying instead of just failing. -Paul Edmon- On 4/16/2014 11:16 PM, Paul Edmon wrote: Occasionally when we reset the master, some of our nodes go into an unknown state or take a bit to get back in contact with the master. If srun is being launched on the nodes at that time it tends to hang, which causes the mpirun dependent on the srun being launched to fail. Even stranger, the sbatch that originally launched the srun keeps running and doesn't fail outright. Is there a way to prevent srun from failing and instead have it wait until the master comes back? Or is the timeout the only way to set this? Or if this isn't possible, can we have the parent sbatch die with an error rather than have srun just hang? Thanks for any insight. -Paul Edmon-
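In the meantime, a workaround on the job-script side is a retry wrapper around each mpirun, so a single transient launch failure does not kill the whole loop. This is a sketch, not a Slurm feature; the attempt count, sleep, and the mpirun command itself are illustrative:

```shell
# Minimal retry wrapper: re-attempt a command a few times before giving up.
retry() {
    local attempts=$1; shift
    local n=0
    until "$@"; do
        n=$((n + 1))
        if [ "$n" -ge "$attempts" ]; then
            echo "failed after $attempts attempts: $*" >&2
            return 1
        fi
        sleep 1  # brief back-off before the next attempt
    done
}

# The user's loop would then become something like:
#   for i in $(seq 1 1000); do
#       retry 3 mpirun -np 64 ./executable || exit 1
#   done
```

This only papers over transient launch failures; a step that hangs rather than exits would still need a timeout around it.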