[slurm-users] Re: Performance Discrepancy between Slurm and Direct mpirun for VASP Jobs.
Ole Holm Nielsen via slurm-users writes:

> Whether or not to enable Hyper-Threading (HT) on your compute nodes
> depends entirely on the properties of applications that you wish to
> run on the nodes. Some applications are faster without HT, others are
> faster with HT. When HT is enabled, the "virtual CPU cores" obviously
> will have only half the memory available per core.

Another consideration is, if you keep HT enabled, do you want Slurm to hand out physical cores to jobs, or logical CPUs (hyperthreads)? Again, what is best depends on your workload. On our systems, we tend to either turn off HT, or hand out cores.

-- 
B/H

signature.asc
Description: PGP signature

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
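For reference, the cores-vs-threads choice is made in slurm.conf. A minimal sketch, assuming a hypothetical two-socket node with HT left on (node name and counts are made up, not from the thread):

```shell
# Hypothetical slurm.conf fragment. With CR_Core* parameters Slurm hands
# out whole physical cores (both hyperthreads together); with CR_CPU* it
# hands out individual hyperthreads.
NodeName=c[001-016] Sockets=2 CoresPerSocket=24 ThreadsPerCore=2
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
```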
[slurm-users] Re: scrontab question
Sandor via slurm-users writes:

> I am working out the details of scrontab. My initial testing is giving me
> an unsolvable question

If you have an unsolvable problem, you don't have a problem, you have a fact of life. :)

> Within scrontab editor I have the following example from the slurm
> documentation:
>
> 0,5,10,15,20,25,30,35,40,45,50,55 * * * * /directory/subdirectory/crontest.sh

- The command (/directory/...) should be on the same line as the time spec (0,5,...), but that was perhaps just the email formatting.
- Check for any UTF-8 characters that look like ordinary ASCII, for instance a non-breaking space. I tend to just pipe the text through "od -a".

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
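The od trick can be demonstrated without scrontab. A hedged sketch (the crontab line and script path are made up for the demo):

```shell
# Build a crontab-like line containing a UTF-8 non-breaking space (\xc2\xa0)
# where an ordinary space should be, then inspect it with od:
line=$(printf '0,5 * * * *\xc2\xa0/tmp/crontest.sh')
echo "$line" | od -c | head -n 2      # the NBSP shows up as octal 302 240

# Count the offending bytes:
nbad=$(printf '%s' "$line" | LC_ALL=C grep -c $'\xc2\xa0')
echo "non-breaking spaces found: $nbad"
```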
[slurm-users] Re: Convergence of Kube and Slurm?
Tim Wickberg via slurm-users writes:

> [1] Slinky is not an acronym (neither is Slurm [2]), but loosely
> stands for "Slurm in Kubernetes".

And not at all inspired by Slinky Dog in Toy Story, I guess. :D

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
[slurm-users] Re: Munge log-file fills up the file system to 100%
Jeffrey T Frey via slurm-users writes:

>> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
>> is per user.
>
> The ulimit is a frontend to rusage limits, which are per-process
> restrictions (not per-user).

You are right; I sit corrected. :) (Except for number of procs and number of pending signals, according to "man setrlimit".) Then 1024 might not be so low for ulimit -n after all.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
[slurm-users] Re: Munge log-file fills up the file system to 100%
Ole Holm Nielsen writes:

> Hi Bjørn-Helge,
>
> That sounds interesting, but which limit might affect the kernel's
> fs.file-max? For example, a user already has a narrow limit:
>
> ulimit -n
> 1024

AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n" is per user.

Now that I think of it, an fs.file-max of 65536 seems *very* low. On our CentOS-7-based clusters, we have on the order of tens of millions, and on our Rocky-9-based clusters, we have 9223372036854775807(!) Also, a per-user limit of 1024 seems low to me; I think we have on the order of 200K files per user on most clusters.

But if you have ulimit -n == 1024, then no single user should be able to hit the fs.file-max limit, even if it is 65536. (Technically, 96 jobs from 96 users each trying to open 1024 files would do it, though.)

> whereas the permitted number of user processes is a lot higher:
>
> ulimit -u
> 3092846

I guess any process will have a few open files, which I believe count against the ulimit -n for each user (and fs.file-max).

> I'm not sure how the number 3092846 got set, since it's not defined in
> /etc/security/limits.conf. The "ulimit -u" varies quite a bit among
> our compute nodes, so which dynamic service might affect the limits?

There is a vague thing in my head saying that I've looked for this before, and found that the default value depended on the size of the RAM of the machine. But the vague thing might of course be lying to me. :)

-- 
Bjørn-Helge
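A quick way to compare the two limits on a node (standard Linux procfs paths, nothing Slurm-specific):

```shell
# /proc/sys/fs/file-nr reports three numbers for the whole node:
# allocated file handles, unused handles, and the fs.file-max limit.
read -r allocated unused fmax < /proc/sys/fs/file-nr
echo "node-wide: allocated=$allocated max=$fmax"

# The per-process (soft) limit for the current shell:
echo "this shell: ulimit -n = $(ulimit -n)"
```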
[slurm-users] Re: Munge log-file fills up the file system to 100%
Ole Holm Nielsen via slurm-users writes:

> Therefore I believe that the root cause of the present issue is user
> applications opening a lot of files on our 96-core nodes, and we need
> to increase fs.file-max.

You could also set a limit per user, for instance in /etc/security/limits.d/. Then users would be blocked from opening unreasonably many files. One could use this to find which applications are responsible, and try to get them fixed.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
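A sketch of such a per-user cap; the file name and numbers below are examples, not recommendations from the thread:

```shell
# Hypothetical /etc/security/limits.d/90-nofile.conf
# (domain  type  item    value)
*        soft   nofile  65536
*        hard   nofile  131072
```

pam_limits applies these at login/job start, so users hitting the cap get a clear EMFILE error instead of exhausting the node-wide fs.file-max.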
[slurm-users] Re: Increasing SlurmdTimeout beyond 300 Seconds
We've been running one cluster with SlurmdTimeout = 1200 sec for a couple of years now, and I haven't seen any problems due to that.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
[slurm-users] Re: Starting a job after a file is created in previous job (dependency looking for soluton)
Amjad Syed via slurm-users writes:

> I need to submit a sequence of up to 400 jobs where the even jobs depend on
> the preceding odd job to finish, and every odd job depends on the presence
> of a file generated by the preceding even job (availability of the file for
> the first of those 400 jobs is guaranteed).

How about letting each even job submit the next odd job after it has created the file, and also the following even job, with a dependency on the odd job? You would obviously have to keep track of how many jobs you've submitted so you can stop after 400 jobs. :)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
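The chaining idea can be sketched in bash. The helper function and the job script names are hypothetical; only `sbatch --parsable` and `--dependency=afterok:` are real Slurm options:

```shell
MAX_JOBS=400    # total chain length from the question

# Called at the end of even job number $1, after its output file exists.
# Submits the next odd job, and the next even job depending on it.
submit_next() {
    local i=$1 odd even odd_id even_id
    odd=$((i + 1)); even=$((i + 2))
    if [ "$odd" -le "$MAX_JOBS" ]; then
        odd_id=$(sbatch --parsable "job_${odd}.sh")
        echo "submitted odd job $odd as $odd_id"
    fi
    if [ "$even" -le "$MAX_JOBS" ]; then
        even_id=$(sbatch --parsable --dependency=afterok:"$odd_id" "job_${even}.sh")
        echo "submitted even job $even as $even_id (afterok:$odd_id)"
    fi
}
```

The last line of each even job's script would call submit_next with its own index; the chain stops itself once MAX_JOBS is reached.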
[slurm-users] Re: Why is Slurm 20 the latest RPM in RHEL 8/Fedora repo?
This isn't answering your question, but I strongly suggest you build Slurm from source. You can use the provided slurm.spec file to make rpms (we do) or use "configure + make". Apart from being able to upgrade whenever a new version is out (especially important for security!), you can tailor the rpms/build to your needs (IB? SlingShot? Nvidia? etc.).

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] propose environment variables SLURM_STDOUT, SLURM_STDERR, SLURM_STDIN
I would find that useful, yes. Especially if the variables were made available for the Prolog and Epilog scripts.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] slurm.conf
LEROY Christine 208562 writes:

> Is there an env variable in SLURM to tell where the slurm.conf is?
> We would like to have, on the same client node, 2 types of possible
> submissions to address 2 different clusters.

According to man sbatch:

SLURM_CONF   The location of the Slurm configuration file.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
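In practice that means keeping one config per cluster and pointing the client commands at the right one. A sketch, with hypothetical file locations:

```shell
# Point the Slurm commands in this shell at one cluster's config:
export SLURM_CONF=/etc/slurm/clusterA/slurm.conf
# sbatch job.sh                # would now submit to cluster A

# One-off submission to the other cluster without changing the shell:
# SLURM_CONF=/etc/slurm/clusterB/slurm.conf sbatch job.sh

echo "SLURM_CONF is now: $SLURM_CONF"
```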
Re: [slurm-users] SLURM Reservation for GPU
Bjørn-Helge Mevik writes:

> (Unfortunately, the page is so "wisely" created that it is impossible
> to cut'n'paste from it.)

That turned out to be a PEBKAC. :) Cut'n'paste *is* possible. :)

-- 
B/H
Re: [slurm-users] SLURM Reservation for GPU
Minulakshmi S writes:

> I am not able to find any supporting statements in Release Notes ... could
> you please point.

https://www.schedmd.com/news.php, the "Slurm version 23.11.0 is now available" section, the seventh bullet point. (Unfortunately, the page is so "wisely" created that it is impossible to cut'n'paste from it.):

"Notably, this permits reservations to now reserve GRES directly."

-- 
Cheers,
B/H
Re: [slurm-users] SLURM Reservation for GPU
I believe support for this was implemented in 23.11.0.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] Releasing stale allocated TRES
"Schneider, Gerald" writes:

> Is there any way to release the allocation manually?

I've only seen this once on our clusters, and that time it helped just restarting slurmctld. If this is a recurring problem, perhaps it will help to upgrade Slurm. You are running quite an old version.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] --partition requests ignored in scripts
"Bunis, Dan" writes:

> My colleagues and I have noticed that our compute cluster seems to
> ignore '--partition' requests when we give them as '#SBATCH
> --partition=<partition_name>' inside of our scripts, but it respects
> them when given in-line within our sbatch calls as 'sbatch
> --partition=<partition_name> script.sh'. Based on some googling, it
> seems that both methods are meant to work, so I'm wondering if it's
> known what can cause the in-script methodology to NOT work for
> schedulers where the in-line methodology DOES work?

My suspicion is that there is an environment variable SBATCH_PARTITION set in your shells. Such a variable will override the #SBATCH directive, but not the command line switch. From man sbatch:

INPUT ENVIRONMENT VARIABLES
[...]
NOTE: Environment variables will override any options set in a batch script, and command line options will override any environment variables.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
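A quick way to check (and clear) the suspected culprit; the partition name "debug" is only an example used to simulate the problem:

```shell
# Simulate the situation: an SBATCH_* input environment variable that
# would silently override #SBATCH directives in job scripts.
export SBATCH_PARTITION=debug

# Check for any such variables:
found=$(env | grep -c '^SBATCH_')
echo "SBATCH_ variables set: $found"

# The fix: clear it before submitting.
unset SBATCH_PARTITION
```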
Re: [slurm-users] RES: multiple srun commands in the same SLURM script
Paulo Jose Braga Estrela writes:

> Hi,
>
> I think that you have a syntax error in your bash script. The "&"
> means that you want to send a process to background, not that you want
> to run many commands in parallel. To run commands in a serial fashion
> you should use cmd && cmd2; then cmd2 will only be executed if
> command 1 returns 0 as exit code.
>
> To run commands in parallel with srun you should set the number of
> tasks to 4, so srun will spawn 4 tasks of the same command. Take a
> look at the examples section in the srun
> docs (https://slurm.schedmd.com/srun.html).

Well, if you look at Example 7 in that section:

    Example 7: This example shows a script in which Slurm is used to
    provide resource management for a job by executing the various job
    steps as processors become available for their dedicated use.

    $ cat my.script
    #!/bin/bash
    srun -n4 prog1 &
    srun -n3 prog2 &
    srun -n1 prog3 &
    srun -n1 prog4 &
    wait

which is what OP tries to do. It is mainly for running *different* programs in parallel inside a job. If one wants to run *the same* program in parallel, then a single srun is indeed the recommended way.

I think the main problem is that the original job script only asks for a single CPU, so the sruns will only run one at a time. Try adding --ntasks-per-node=4 or similar.

Note that exactly how to run different programs in parallel with srun has changed quite a bit in the recent versions, and the example above is for the latest version, so check the srun man page for your version. (And unfortunately, the documentation in the srun man page has not always been correct, so you might need to experiment. For instance, I believe Example 7 above is missing `--exact` or `SLURM_EXACT`. :) )

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
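A hedged rewrite of Example 7 along those lines, with enough tasks requested and `--exact` added as suggested; the program names, task counts and the #SBATCH request are placeholders:

```shell
#!/bin/bash
#SBATCH --ntasks=4     # hypothetical: enough tasks for all steps combined

# --exact restricts each step to the resources it requests, so the three
# steps can run side by side instead of serializing.
srun --exact -n2 ./prog1 &> log.1 &
srun --exact -n1 ./prog2 &> log.2 &
srun --exact -n1 ./prog3 &> log.3 &
wait
```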
Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)
Taras Shapovalov writes:

> Oh, does this mean that no one should use Slurm versions <= 21.08 any more?

That of course depends on your security requirements, but I wouldn't have used those older versions in production any more, at least. (We actually did upgrade from 21.08 to 23.02 on a couple of our clusters due to this.)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)
Taras Shapovalov writes:

> Are the older versions affected as well?

Yes, all older versions are affected.

-- 
B/H
Re: [slurm-users] New member , introduction
Welcome! :)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] question about configuration in slurm.conf
Felix writes:

> I have at my site the following work nodes
>
> awn001 ... awn099
>
> and then it continues awn100 ... awn199

I presume you meant awn-001 etc., not awn001. If not, replace "awn-" with "awn" below.

> How can I configure this line
>
> PartitionName=debug Nodes=awn-0[01-32,46-77,95-99] Default=YES
> MaxTime=INFINITE State=UP
>
> so that it can contain the nodes from 001 to 199

PartitionName=debug Nodes=awn-0[01-32,46-77,95-99],awn-1[00-99] ...

or

PartitionName=debug Nodes=awn-[001-032,046-077,095-199] ...

should work. I'd personally use the second one.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
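You can preview what the second bracketed hostlist expands to with plain bash (on a machine with Slurm installed, `scontrol show hostnames 'awn-[001-032,046-077,095-199]'` does the same):

```shell
# Expand awn-[001-032,046-077,095-199] by hand: printf repeats the
# format once per argument, zero-padding to three digits.
nodes=$(printf 'awn-%03d\n' $(seq 1 32) $(seq 46 77) $(seq 95 199))
echo "$nodes" | sed -n '1p;$p'   # first and last node name
echo "$nodes" | wc -l            # 32 + 32 + 105 = 169 nodes
```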
[slurm-users] Transport from SLC to Provo?
Dear all,

I'm going to SLUG in Provo in September. My flight lands at Salt Lake City Airport (SLC) at 7 pm on Sunday the 10th. I was planning to go by bus or train from SLC to Provo, but apparently both bus and train have stopped running by that time on Sundays.

Does anyone know about any alternative way to get to Provo on a Sunday night?

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
[slurm-users] No coffee allowed on BYU campus(!) Suggestions for alternatives?
I've signed up for SLUG 2023, which is at Brigham Young University. I noticed on the agenda (https://slurm.schedmd.com/slurm_ug_agenda.html) that "coffee is not provided on campus, so be sure to get your morning caffeine before arriving."

Following a whole day of lectures without coffee when you're jet-lagged 8 hours and have spent 15 hours travelling is not going to be easy, so I thought I'd bring a thermos flask and get it filled with coffee in the hotel or in a coffee shop. But then I discovered https://dancecamps.byu.edu/content/byu-honor-code, which says "no smoking or drinking of alcohol, coffee, or tea is permitted on the BYU campus, though other caffeinated beverages are allowed."

So, any suggestions for "other caffeinated beverages" I'd be able to buy and bring with me to the sessions?

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] Job step do not take the hole allocation
Hei, Ole! :)

Ole Holm Nielsen writes:

> Can anyone shed light on the relationship between Tommi's
> slurm_cli_pre_submit function and the ones defined in the
> cli_filter_plugins page?

I think the *_p_* functions are functions you need to implement if you write a cli filter plugin in C. When you write a plugin script in Lua, you write Lua functions called slurm_cli_setup_defaults, slurm_cli_pre_submit, etc. in the Lua code, and then the C code of the Lua plugin itself implements the *_p_* functions (I believe).

That said, I too found it hard to find any documentation of the Lua plugin. Eventually, I found an example script in the Slurm source code (etc/cli_filter.lua.example), which I've taken as a starting point for my cli filter plugin scripts.

-- 
B/H
Re: [slurm-users] Limit run time of interactive jobs
Ole Holm Nielsen writes:

> On 5/8/23 08:39, Bjørn-Helge Mevik wrote:
>> Angel de Vicente writes:
>>
>>> But one possible way to do something similar is to have a partition only
>>> for interactive jobs and a different partition for batch jobs, and then
>>> enforce that each job uses the right partition. In order to do this, I
>>> think we can use the Lua contrib module (check the job_submit.lua
>>> example).
>>
>> Wouldn't it be simpler to just refuse too long interactive jobs in
>> job_submit.lua?
>
> This sounds like a good idea, but how would one identify an
> interactive job in the job_submit.lua script?

Good question. :) I merely guessed it is possible. :)

> A solution was suggested in
> https://serverfault.com/questions/1090689/how-can-i-set-up-interactive-job-only-or-batch-job-only-partition-on-a-slurm-clu
>
>> Interactive jobs have no script and job_desc.script will be empty /
>> not set.
>
> So maybe something like this code snippet?
>
> if job_desc.script == nil then

That sounds like it should work, yes. (But perhaps double check that jobs submitted with "sbatch --wrap" or taking the job script from stdin (if that is still possible) get job_desc.script set.)

-- 
B/H
Re: [slurm-users] Limit run time of interactive jobs
Angel de Vicente writes:

> But one possible way to do something similar is to have a partition only
> for interactive jobs and a different partition for batch jobs, and then
> enforce that each job uses the right partition. In order to do this, I
> think we can use the Lua contrib module (check the job_submit.lua
> example).

Wouldn't it be simpler to just refuse too long interactive jobs in job_submit.lua?

-- 
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] [EXT] Submit sbatch to multiple partitions
"Ozeryan, Vladimir" writes:

> You should be able to specify both partitions in your sbatch submission
> script, unless there is some other configuration preventing this.

But Slurm will still only run the job in *one* of the partitions - it will never "pool" two partitions and let the job run on all nodes. All nodes of a job must belong to the same partition.

(Another thing I found out recently is that if you specify multiple partitions for an array job, then all array subjobs will run in the same partition.)

As Ole suggests: creating a "super partition" containing all nodes will work.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] Preventing --exclusive on a per-partition basis
I'd simply add a test like

and job_desc.partition == "the_partition"

to the test for exclusiveness.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] srun jobfarming hassle question
"Ohlerich, Martin" writes:

> Hello Björn-Helge.
>
> Sigh ...
>
> First of all, of course, many thanks! This indeed helped a lot!

Good!

> b) This only works if I have to specify --mem for a task. Although
> manageable, I wonder why one needs to be that restrictive. In
> principle, in the use case outlined, one task could use a bit less
> memory, and the other may require a bit more than half of the node's
> available memory. (So clearly this isn't always predictable.) I only
> hope that in such cases the second task does not die from OOM ... (I
> will know soon, I guess.)

As I understand it, Slurm (at least with cgroups) will only kill a step if it uses more memory *in total* on a node than the job has allocated on that node. So if a job has 10 GiB allocated on a node, and a step runs two tasks there, one task could use 9 GiB and the other 1 GiB without the step being killed.

You can inspect the memory limits that are in effect in cgroups (v1) in /sys/fs/cgroup/memory/slurm/uid_/job_ (usual location, at least).

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] srun jobfarming hassle question
"Ohlerich, Martin" writes:

> Dear Colleagues,
>
> already for quite some years now are we again and again facing issues
> on our clusters with so-called job-farming (or task-farming) concepts
> in Slurm jobs using srun. And it bothers me that we can hardly help
> users with requests in this regard.
>
> From the documentation
> (https://slurm.schedmd.com/srun.html#SECTION_EXAMPLES), it reads like
> this:
>
> ...
> #SBATCH --nodes=??
> ...
> srun -N 1 -n 2 ... prog1 &> log.1 &
> srun -N 1 -n 1 ... prog2 &> log.2 &

Unfortunately, that part of the documentation is not quite up to date. The semantics of srun have changed a little over the last couple of years/Slurm versions, so today you have to use "srun --exact ...". From "man srun" (version 21.08):

--exact
    Allow a step access to only the resources requested for the step.
    By default, all non-GRES resources on each node in the step
    allocation will be used. This option only applies to step
    allocations.
    NOTE: Parallel steps will either be blocked or rejected until
    requested step resources are available unless --overlap is
    specified. Job resources can be held after the completion of an
    srun command while Slurm does job cleanup. Step epilogs and/or
    SPANK plugins can further delay the release of step resources.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] job_container/tmpfs and autofs
In my opinion, the problem is with autofs, not with tmpfs. Autofs simply doesn't work well when you are using detached fs namespaces and bind mounting. We ran into this problem years ago (with an in-house spank plugin doing more or less what tmpfs does), and ended up simply not using autofs.

I guess you could try using systemd's auto-mounting features, but I have no idea if they work better than autofs in situations like this.

We ended up using a system where the prolog script mounts any needed file systems, and then the healthcheck script unmounts file systems that are no longer needed.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] How to read job accounting data long output? `sacct -l`
Marcus Wagner writes:

> That depends on what is meant by formatting argument.

Yes, they could surely have defined that.

> etc. And I would assume that -S, -E and -T are filtering options, not
> formatting options.

I'd describe -T as a formatting option:

-T, --truncate
    Truncate time. So if a job started before --starttime, the start
    time would be truncated to --starttime. The same for end time and
    --endtime.

As I read this, it changes how a job is written; it does not select jobs.

> But sometimes getting no steps for a job (in a larger JSON output
> with many jobs) and then getting the steps if one asks specifically
> for that jobid - that is something I would call broken.

That sounds worse, yes.

-- 
B/H
Re: [slurm-users] How to read job accounting data long output? `sacct -l`
Marcus Wagner writes:

> it is important to know that the json output seems to be broken.
>
> First of all, it does not (compared to the normal output) obey the
> truncate option -T.
> But more importantly, I saw a job where in a "day output" (-S -E)
> no steps were recorded.
> Using sacct -j --json instead showed that job WITH steps.

It is hard to call it "broken" when it is documented behaviour:

--json
    Dump job information as JSON. All other formatting arguments will
    be ignored.

-- 
Cheers,
Bjørn-Helge
Re: [slurm-users] How to read job accounting data long output? `sacct -l`
Chandler Sobel-Sorenson writes:

> Perhaps there is a way to import it into a spreadsheet?

You can use `sacct -P -l`, which gives you '|'-separated output, which should be possible to import into a spreadsheet. (Personally, I only use `-l` when I'm looking for the name of an attribute and am too lazy to read the man page. Then I use -o to specify what I want returned.)

Also, in newer versions at least, there are --json and --yaml to give you output which you can parse with other tools (or read, if you really want :).

-- 
Cheers,
Bjørn-Helge Mevik
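The pipe-separated output is also easy to post-process with awk. The sample lines below are made up, standing in for real `sacct -P -o JobID,State,Elapsed` output:

```shell
# Hypothetical sacct -P output; on a real cluster you would pipe
# `sacct -P -o JobID,State,Elapsed` straight into the awk command.
sample='JobID|State|Elapsed
1001|COMPLETED|01:02:03
1002|FAILED|00:00:10'

# Print job id and elapsed time for completed jobs, skipping the header:
echo "$sample" | awk -F'|' 'NR > 1 && $2 == "COMPLETED" { print $1, $3 }'
```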
Re: [slurm-users] Test Suite problems related to requesting tasks
"Groner, Rob" writes:

> For your "special testing config", do you just mean the
> slurm.conf/gres.conf/*.conf files?

Yes.

> So when you want to test a new version of slurm, you replace the conf
> files and then restart all of the daemons?

Exactly. (We usually don't do this on our production cluster, but on test clusters, where we can change setups as we please. :) )

-- 
B/H
Re: [slurm-users] Test Suite problems related to requesting tasks
"Groner, Rob" writes:

> I'm wondering OVERALL if the test suite is supposed to work on ANY
> working slurm system. I could not find any documentation on how the
> slurm configuration and nodes were required to be set up in order for
> the tests to work; no indication that the test suite requires a
> particular configuration in order to run successfully.

My experience is that the test suite makes some assumptions about the setup, so it will not work with just any config. And like you, I haven't found any documentation about what it expects.

> So in other words, I can't tell if these failing tests are a result of
> an actual problem, or a result of the way our cluster is configured,

I tend to look at what the failing tests do, and try to figure out what they expect in terms of the config. It's a bit of work, but (at least for our setups) there haven't been too many cases. (And there are a couple of tests I've never understood why they fail. :) )

We have a special config for running the test suite, modified so that most of the assumptions are met. For instance, we turn off the job submit plugin, any prologs/epilogs, the healthcheck script and configless mode, and keep things like SchedulerParameters at their default values.

> and if it's because of how our cluster is configured, then is it
> unreasonable to think I can make use of the test suite?

If a failing test is for some feature that we don't use, a different setup than what we have in production, or something that is not essential for our clusters, we just ignore the test. We have a small list of known "please ignore" test failures. :)

-- 
B/H
Re: [slurm-users] Accounting core-hours usages
Sushil Mishra writes:

> Dear all,
>
> I am pretty new to system administration and looking for some help
> setting up slurmdbd or MariaDB in a GPU cluster. We bought a machine but
> the vendor simply installed Slurm and did not install any database for
> accounting. I tried installing MariaDB and then slurmdbd as described in
> the manual, but it looks like I am missing something. I wonder if someone
> can help us with this off the list?

Perhaps Ole Nielsen's excellent guide can help you:
https://wiki.fysik.dtu.dk/niflheim/SLURM

-- 
Regards,
Bjørn-Helge Mevik
Re: [slurm-users] Use cases for "include" in slurm.conf?
Ole Holm Nielsen writes:

> Can anyone shed light on the use cases for "include" in slurm.conf?

Until we switched to configless mode, we used to have all partition and node definitions in a separate file, with an Include in slurm.conf. The idea was to keep the things that were most frequently changed in a separate file. Also, it made it easier to keep the config of several clusters in sync (running diff on their slurm.conf files wouldn't be cluttered with node definition differences).

Similarly, I also saw that on our BULL Atos cluster, Atos had separated out information generated from their database into separate files. That way those parts could be updated when the database changed.

Somewhat related: for slurmdbd, we have a separate slurmdbd_auth.conf file with the username and password for the Slurm SQL DB, so that we can keep the slurmdbd.conf file in a git repo without spreading the password around.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
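A minimal sketch of the first layout described above; the file names are examples, not from the thread:

```shell
# Hypothetical /etc/slurm/slurm.conf (stable part, kept in git):
ClusterName=mycluster
# ... scheduler, accounting, etc. ...

# Pull in the frequently changed node/partition definitions:
Include /etc/slurm/nodes.conf

# Hypothetical /etc/slurm/nodes.conf:
# NodeName=c[001-099] ...
# PartitionName=normal Nodes=c[001-099] Default=YES ...
```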
Re: [slurm-users] How to debug a prolog script?
Davide DelVento writes:

>> I'm curious: What kind of disruption did it cause for your production
>> jobs?
>
> All jobs failed and went in pending/held with "launch failed requeued
> held" status, and all nodes where the jobs were scheduled went draining.
>
> The logs only said "error: validate_node_specs: Prolog or job env
> setup failure on node , draining the node". I guess if they said
> "-bash: /path/to/prolog: Permission denied" I would have caught the
> problem myself.

But that is not a problem caused by having things like

exec &> /root/prolog_slurmd.$$

in the script, as you indicated. It is a problem caused by the prolog script file not being executable.

> In hindsight it is obvious, but I don't think even the documentation
> mentions that, does it? After all, you can execute a non-executable
> file with "sh filename", so I made the incorrect assumption that slurm
> would have invoked the prolog that way.

Slurm prologs can be written in any language - we used to have perl prolog scripts. :)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] How to debug a prolog script?
Davide DelVento writes: > Does it need the execution permission? For root alone sufficient? slurmd runs as root, so it only need exec perms for root. >> > 2. How to debug the issue? >> I'd try capturing all stdout and stderr from the script into a file on the >> compute >> node, for instance like this: >> >> exec &> /root/prolog_slurmd.$$ >> set -x # To print out all commands > > Do you mean INSIDE the prologue script itself? Yes, inside the prolog script itself. > Yes, this is what I'd have done, if it weren't so disruptive of all my > production jobs, hence I had to turn it off before wrecking havoc too > much. I'm curious: What kind of disruption did it cause for your production jobs? We use this in our slurmd prologs (and similar in epilogs) on all our production clusters, and have not seen any disruption due to it. (We do have things like ## Remove log file if we got this far: rm -f /root/prolog_slurmd.$$ at the bottom of the scripts, though, so as to remove the log file when the prolog succeeded.) > Sure, but even "just executing" there is stdout and stderr which could > be captured and logged rather than thrown away and force one to do the > above. True. But slurmd doesn't, so... > How do you "install the prolog scripts there"? Isn't the prolog > setting in slurm.conf global? I just overwrite the prolog script file itself on the node. We don't have them on a shared file system, though. If you have the prologs on a shared file system, you'd have to override the slurm config on the compute node itself. This can be done in several ways, for instance by starting slurmd with the "-f " option. >> (Otherwise one could always >> set up a small cluster of VMs and use that for simpler testing.) 
>
> Yes, but I need to request that cluster of VMs from IT, have the same
> OS installed and configured (and to be 100% identical, it needs to be
> RHEL, so license paid), and everything sync'ed with the actual
> cluster. I know it'd be very useful, but sadly we don't have the
> resources to do that, so unfortunately this is not an option for me.

I totally agree that VMs instead of a physical test cluster are never
going to be 100 % the same, but some things can be tested even though
the setups are not exactly the same (for instance, in my experience,
CentOS and Rocky are close enough to RHEL for most slurm-related
things). One takes what one has. :)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] How to debug a prolog script?
Davide DelVento writes:

> 2. How to debug the issue?

I'd try capturing all stdout and stderr from the script into a file on
the compute node, for instance like this:

exec &> /root/prolog_slurmd.$$
set -x  # To print out all commands

before any other commands in the script. The "prolog_slurmd." will then
contain a log of all commands executed in the script, along with all
output (stdout and stderr). If there is no "prolog_slurmd." file after
the job has been scheduled, then as has been pointed out by others,
slurm wasn't able to exec the prolog at all.

> Even increasing the debug level the
> slurmctld.log contains simply a "error: validate_node_specs: Prolog or
> job env setup failure on node xxx, draining the node" message, without
> even a line number or anything.

Slurm only executes the prolog script. It doesn't parse it or evaluate
it itself, so it has no way of knowing what fails inside the script.

> 3. And more generally, how to debug a prolog (and epilog) script
> without disrupting all production jobs? Unfortunately we can't have
> another slurm install for testing, is there a sbatch option to force
> utilizing a prolog script which would not be executed for all the
> other jobs? Or perhaps making a dedicated queue?

I tend to reserve a node, install the updated prolog scripts there, and
run test jobs asking for that reservation. (Otherwise one could always
set up a small cluster of VMs and use that for simpler testing.)

-- 
B/H
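[A self-contained sketch of the logging trick above. The paths and the
"demo_prolog.sh" name are illustrative only; a real prolog would log
under /root, since slurmd runs it as root.]

```shell
# Demo of the exec-&>/set -x debugging technique from the mail above.
# /tmp stands in for /root; demo_prolog.sh is a made-up file name.
log=/tmp/prolog_slurmd.$$
cat > /tmp/demo_prolog.sh <<'EOF'
#!/bin/bash
exec &> "$1"   # from here on, all stdout/stderr goes into the log file
set -x         # echo every command (prefixed with "+") into the log
hostname
echo "prolog work goes here"
EOF
chmod +x /tmp/demo_prolog.sh
/tmp/demo_prolog.sh "$log"
grep '^+' "$log"   # shows the trace of every command the prolog ran
```

Every command the script ran, plus its output, ends up in the log file,
which is exactly what you need when slurmd only reports "Prolog or job
env setup failure".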
Re: [slurm-users] Cgroup task plugin fails if ConstrainRAMSpace and ConstrainKmemSpace are enabled
This doesn't answer your question, but still: I'd be wary about using
ConstrainKmemSpace at all. At least in the kernels on RedHat/CentOS
<= 7.9, there is a bug that eventually prevents Slurm from starting new
job steps on a node, and the node has to be rebooted to be usable
again. See for instance https://bugs.schedmd.com/show_bug.cgi?id=5507.
(The bug report is old, but we got that result on a system with RHEL
7.7 earlier this year.)

-- 
B/H
Re: [slurm-users] "slurmd -C" reduce by xx GB or yy %
"Eg. Bo." writes:

> as far as I understand it's good practice to lower the RealMemory
> value reported by slurmd -C by a given amount of GB or by percentage.
> What's the best approach to calculate a given target value, for
> different HW types?

I tend to use a small C program that mallocs and fills a large array,
and see how big I can make the array before the node starts to swap.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage
Miguel Oliveira writes: > Hi Bjørn-Helge, > > Long time! Hi Miguel! Yes, definitely a long time! :D > Why not? You can have multiple QoSs and you have other techniques to change > priorities according to your policies. A job can only run in a single QoS, so if you submit a job with "sbatch --qos=devel ..." it will no longer be running in the account QoS and thus its usage will not be recorded in that QoS. If that is ok, then no problem, but if you want all jobs of an account to be limited by the TRESMins limit, then you cannot use other QoS'es than the account QoSes (except for partition QoSes). -- Bjørn-Helge signature.asc Description: PGP signature
Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage
Miguel Oliveira writes: > It is not exactly true that you have no solution to limit projects. If > you implement each project as an account then you can create an > account qos with the NoDecay flags. > This will not affect associations so priority and fair share are not impacted. Yes, that will work. But it has the drawback that you cannot use QoS'es for *anything else*, like a QoS for development jobs or similar. So either way it is a trade-off. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage
writes:

> TRESRaw cpu is lower than before as I'm alone on the system and no
> other job was submitted.
> Any explanation of this?

I'd guess you have turned on FairShare priorities. Unfortunately, in
Slurm the same internal variables are used for fairshare calculations
as for GrpTRESMins (and similar), so when fair share priorities are in
use, slurm will reduce accumulated GrpTRESMins over time. This means
that it is impossible(*) to use GrpTRESMins limits and fairshare
priorities at the same time.

(*) It is possible to tell slurm *not* to reduce the accumulated
TRESMins of a QoS, so you can technically use GrpTRESMins limits on a
qos, and fair share priorities on the accounts and/or users.

-- 
B/H
Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage
writes:

> I thought job's cpu TRESRaw = nb of reserved cores X walltime (mn)

It is the "TRES billing cost" x walltime. What the TRES billing cost of
a job is depends on how you've set up the TRESBillingWeights on the
partitions, and whether you've defined PriorityFlags=MAX_TRES or not.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] Need to restart slurmctld for gres jobs to start
tluchko writes: > Jobs only sit in the queue with RESOURCES as the REASON when we > include the flag --gres=bandwidth:ib. If we remove the flag, the jobs > run fine. But we need the flag to ensure that we don't get a mix of IB > and ethernet nodes because they fail in this case. This doesn't answer your real question, but couldn't you just use features for ib and ethernet. Jobs wanting nodes with ib would then specify --constraint=ib, etc. -- B/H signature.asc Description: PGP signature
Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"
Per Lönnborg writes:

> I "forgot" to tell our version because it's a bit embarrassing -
> 19.05.8...

Haha! :D

-- 
B/H
Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"
Per Lönnborg writes:

> Greetings,

God dag!

> is there a way to lower the log rate on error messages in slurmctld
> for nodes with hardware errors?

You don't say which version of Slurm you are running, but I think this
was changed in 21.08, so the node will only try to register once if it
has too little memory, thus only giving one such message. (The node
will then have state "inval" in sinfo.)

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] Strange memory limit behavior with --mem-per-gpu
Paul Raines writes: > Basically, it appears using --mem-per-gpu instead of just --mem gives > you unlimited memory for your job. > > $ srun --account=sysadm -p rtx8000 -N 1 --time=1-10:00:00 > --ntasks-per-node=1 --cpus-per-task=1 --gpus=1 --mem-per-gpu=8G > --mail-type=FAIL --pty /bin/bash > rtx-07[0]:~$ find /sys/fs/cgroup/memory/ -name job_$SLURM_JOBID > /sys/fs/cgroup/memory/slurm/uid_5829/job_1134067 > rtx-07[0]:~$ cat > /sys/fs/cgroup/memory/slurm/uid_5829/job_1134067/memory.limit_in_bytes > 1621419360256 > > That is a limit of 1.5TB which is all the memory on rtx-07, not > the 8G I effectively asked for at 1 GPU and 8G per GPU. Which version of Slurm is this? We noticed a behaviour similar to this on Slurm 20.11.8, but when we tested it on 21.08.1, we couldn't reproduce it. (We also noticed an issue with --gpus-per-task that appears to have been fixed in 21.08.) -- B/H signature.asc Description: PGP signature
Re: [slurm-users] srun and --cpus-per-task
Hermann Schwärzler writes:

> Do you happen to know if there is a difference between setting CPUs
> explicitly like you do it and not setting it but using
> "ThreadsPerCore=1"?
>
> My guess is that there is no difference and in both cases only the
> physical cores are "handed out to jobs". But maybe I am wrong?

I don't think we've ever tried that. But I'd be sceptical about "lying"
to Slurm about the actual hardware structure - it might confuse the cpu
binding if Slurm and the kernel have different pictures of the
hardware.

-- 
Bjørn-Helge
Re: [slurm-users] Disable exclusive flag for users
pankajd writes:

> We have slurm 21.08.6 and GPUs in our compute nodes. We want to
> restrict / disable the use of the "exclusive" flag in srun for users.
> How should we do it?

Two options would be to use the cli_filter plugin or the job_submit
plugin. If you want the enforcement to be guaranteed, then the
job_submit plugin is the place (cli_filter can be circumvented by the
user). For instance, in job_submit.lua:

if job_desc.shared == 0 or job_desc.shared == 2 or job_desc.shared == 3 then
   slurm.user_msg ("Warning! Please do not use --exclusive unless you really know what you are doing. Your job might be accounted for more CPUs than it actually uses, sometimes many times more. There are better ways to specify using whole nodes, for instance using all cpus on the node or all memory on the node.")
end

or in cli_filter.lua:

is_bad_exclusive = { exclusive = true, user = true, mcs = true }
if is_bad_exclusive[options["exclusive"]] then
   slurm.log_info("Warning! Please do not use --exclusive unless you really know what you are doing. Your job might be accounted for more CPUs than it actually uses, sometimes many times more. There are better ways to specify using whole nodes, for instance using all cpus on the node or all memory on the node.")
end

(both of these just warn, though, but should be easy to change into
rejecting the job.)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] srun and --cpus-per-task
For what it's worth, we have a similar setup, with one crucial difference: we are handing out physical cores to jobs, not hyperthreads, and we are *not* seeing this behaviour: $ srun --cpus-per-task=1 -t 10 --mem-per-cpu=1g -A nnk -q devel echo foo srun: job 5371678 queued and waiting for resources srun: job 5371678 has been allocated resources foo $ srun --cpus-per-task=3 -t 10 --mem-per-cpu=1g -A nnk -q devel echo foo srun: job 5371680 queued and waiting for resources srun: job 5371680 has been allocated resources foo We have SelectType=select/cons_tres SelectTypeParameters=CR_CPU_Memory and node definitions like NodeName=DEFAULT CPUs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=182784 Gres=localscratch:330G Weight=1000 (so we set CPUs to the number of *physical cores*, not *hyperthreads*). -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] monitoring and update regime for Power Saving nodes
David Simpson writes: > * When you want to make changes to slurm.conf (or anything else) to > a node which is down due to power saving (during a > maintenance/reservation) what is your approach? Do you end up with 2 > slurm.confs (one for power saving and one that keeps everything up, to > work on during the maintenance)? For the slurm.conf part, I'd suggest using the "configless" mode - that way at least the slurm config will always be up-to-date. See, e.g., https://slurm.schedmd.com/configless_slurm.html -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
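[A minimal sketch of what configless mode involves; the hostname and
port are placeholders, see the URL above for the authoritative details:
enable it on the controller, then point each slurmd at the controller
instead of at a local slurm.conf.]

```
# slurm.conf on the slurmctld host (rest of the config as usual):
SlurmctldParameters=enable_configless

# On the compute nodes, start slurmd without a local slurm.conf,
# fetching the configuration from the controller instead:
slurmd --conf-server slurmctl-host:6817
```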
Re: [slurm-users] Problems with sun and TaskProlog
"Putnam, Harry" writes:

> /opt/slurm/task_epilog
>
> #!/bin/bash
> mytmpdir=/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID
> rm -Rf $mytmpdir
> exit;

This might not be the reason for what you observe, but I believe
deleting the scratch dir in the task epilog is not a good idea. The
task epilog is run after every "srun" or "mpirun" inside a job, which
means that the scratch dir will be created and deleted for each job
step. On our systems, we create the scratch dir in the (slurmd) Prolog,
set the environment variable in the TaskProlog, and delete the dir in
the (slurmd) Epilog. That way the dir is just created and deleted once.

> I am not sure I understand what constitutes a job step.

In practice, every run of srun or mpirun creates a job step, and the
job script itself counts as a job step.

-- 
B/H
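[A runnable sketch of the Prolog/TaskProlog/Epilog scheme described
above, collapsed into one script for illustration. The /tmp base and
the default user/job values are stand-ins; on a real node slurmd sets
the SLURM_* variables and the base would be /scratch.]

```shell
# Stand-ins; slurmd provides these for real prolog/epilog scripts.
BASE=${BASE:-/tmp/scratch-demo}          # stand-in for /scratch
SLURM_JOB_USER=${SLURM_JOB_USER:-alice}
SLURM_JOB_ID=${SLURM_JOB_ID:-12345}

# --- (slurmd) Prolog: runs once per job per node, creates the dir ---
mytmpdir=$BASE/$SLURM_JOB_USER/$SLURM_JOB_ID
mkdir -p "$mytmpdir" && echo "created $mytmpdir"

# --- TaskProlog: stdout lines of the form "export VAR=value" are
# --- added to the job's environment by slurmd ---
echo "export SCRATCH=$mytmpdir"

# --- (slurmd) Epilog: runs once per job per node, removes the dir ---
rm -rf "$mytmpdir" && echo "removed $mytmpdir"
```

With this split, the directory is created and removed exactly once per
job, no matter how many srun/mpirun job steps the job launches.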
Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information
Ole Holm Nielsen writes: > As Brian Andrus said, you must upgrade Slurm by at most 2 major > versions, and that includes slurmd's as well! Don't do a "direct > upgrade" of slurmd by more than 2 versions! That should only be an issue if you have running jobs during the upgrade, shouldn't it? As I understand it, without any running jobs, you can do pretty much what you want on the compute nodes. Or am I missing something here? -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis ?
This might not apply to your setup, but historically when we've seen similar behaviour, it was often due to the affected compute nodes missing from /etc/hosts on some *other* compute nodes. -- B/H signature.asc Description: PGP signature
Re: [slurm-users] Questions about default_queue_depth
David Henkemeyer writes:

> 3) Is there a way to see the order of the jobs in the queue? Perhaps
> squeue lists the jobs in order?

squeue -S -p

Sort jobs in descending priority order.

-- 
B/H
Re: [slurm-users] Is this a known error?
Sean McGrath writes: > I'm seeing something similar. > > slurmdbd version is 21.08.4 > > All the slurmd's & slurmctld's are version 20.11.8 > > This is what is in the slurmdbd.log > > [2021-12-07T17:16:50.001] error: unpack_header: protocol_version 8704 not > supported I believe 8704 corresponds to 19.05.x, which is no longer accepted in 21.08.x. > Can anyone advise how to identify the clients that are generating those > errors please? I don't think slurmd connects directly to slurmdbd, so perhaps it is some frontend node or machine outside the cluster itself which has the slurm commands installed and is doing requests to slurmdbd (sacct, sacctmgr, etc.)? With SlurmdbdDebug set to debug or higher, new client connections will be logged with [2021-12-08T09:00:07.992] debug: REQUEST_PERSIST_INIT: CLUSTER:saga VERSION:9472 UID:51568 IP:10.2.3.185 CONN:8 in slurmdbd.log. But perhaps that will not happen if slurmdbd fails to unpack the header? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links
Adrian Sevcenco writes:

> On 01.12.2021 10:25, Bjørn-Helge Mevik wrote:
>
>> In the end we had to give up using automount, and implement a manual
>> procedure that mounts/umounts the needed nfs areas.
>
> Thanks a lot for info! manual as in "script" or as in "systemd.mount
> service"?

Script. We mount (if needed) in the prolog. Then in the healthcheck
(run every 5 mins), we check if a job is still running on the node that
needs the mount, and unmount if not. (We could have done it in the
epilog, but feared it could lead to a lot of mount/umount cycles if a
set of jobs failed immediately. Hence we put it in the healthcheck
script instead.) I don't have much experience with the systemd.mount
service, but it is possible it would work fine (and be less hackish
than our solution :).

> Also, the big and the only advantage that autofs had over static
> mounts was that whenever there was a problem with the server, after
> the passing of the glitch the autofs would re-mount the target...

That's in theory. :) Our experience in practice is that if the client
is actively using the nfs mounted area when the problem arises, you
will often have to reboot the client to resolve the disk waits. (I
*think* it has something to do with nfs using longer and longer
timeouts when it cannot reach the server, so eventually it will take
too long to time out and return an error to the running applications.)

> I'm not very sure that a static nfs mount has this capability ... did
> you bake into your manual procedure also a recovery part?

No, we simply pretend it will not happen. :) In fact, I think we've
only had this type of problem once or twice in the last four-five
years. But this might be because we only mount the homedirs with nfs,
so most of the time, the jobs are not actively using the nfs mounted
area. (The most activity happens in BeeGFS or GPFS mounted areas.)

-- 
Bjørn-Helge
Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links
Adrian Sevcenco writes:

> Hi! Does anyone know what could be the cause of such an error?
> I have a shared home, slurm 20.11.8, and i try a simple script in the
> submit directory, which is in the home that is nfs shared...

We had the "Too many levels of symbolic links" error some years ago,
while using a combination of automounting nfs areas and private fs name
spaces to get a private /tmp for each job. In the end we had to give up
using automount, and implement a manual procedure that mounts/umounts
the needed nfs areas.

-- 
B/H
Re: [slurm-users] Per-job TMPDIR: how to lookup gres allocation in prolog?
Mark Dixon writes: > Unfortunately, I've not found anything in the Prolog environment (or > stored on disk under /var/spool/slurmd) containing the gres > allocations for the job. [...] > Is there a better way to get the job's gres information from within > the prolog, please? We are using basically the same setup, and have not found any other way than running "scontrol show job ..." in the prolog (even though it is not recommended). I have yet to see any problems arising from it, but YMMW. If you find a different way, please share it with the list! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
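[An illustrative sketch (not from the original mail) of pulling the
GRES allocation out of "scontrol show job" inside a prolog. The sample
text below is a stand-in for the output of a real
`scontrol show job $SLURM_JOB_ID` call; the exact field name
(TresPerNode here) varies between Slurm versions.]

```shell
# Stand-in for: jobinfo=$(scontrol show job "$SLURM_JOB_ID")
jobinfo=$(cat <<'EOF'
JobId=1234 JobName=test
   TresPerNode=gres/gpu:2
EOF
)
# Extract the GRES specification from the job record.
gres=$(printf '%s\n' "$jobinfo" | sed -n 's/.*TresPerNode=\(gres[^ ]*\).*/\1/p')
echo "$gres"
```

The same grep/sed approach works for any other field of the job record,
which is why "scontrol show job" in the prolog is so tempting despite
not being recommended.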
Re: [slurm-users] Warning: can't honor --ntasks-per-node
Ginés Guerrero writes: > Hi, > > If I submit this script: > > #!/bin/bash > #SBATCH --get-user-env > #SBATCH -p slims > #SBATCH -N 2 > #SBATCH -n 40 > #SBATCH --ntasks-per-node=20 > #SBATCH -o log > #SBATCH -e log > > /bin/env > > srun hostname > > I get the warning: “can't honor --ntasks-per-node set to 20 which > doesn't match the requested tasks 40 with the number of requested > nodes 1. Ignoring –ntasks-per-node”. Are you using IntelMPI? I've seen this type of warning in some situations with IntelMPI. In all our cases, "srun hostname" or "mpirun hostname" shows that it *does* honor --ntasks-per-node. (So we generally just ask our users to check with "srun hostname", and ignore the warning if it works as expected.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] Bug when I run "sinfo --states=idle"
David Henkemeyer writes:

> I just noticed today that when I run "sinfo --states=idle", I get all
> the idle nodes, plus an additional node that is in the "DRAIN" state
> (notice how xavier6 is showing up below, even though its not in the
> idle state):

I *think* this could be because if you drain an idle node, it gets the
state "IDLE+DRAIN", and then "sinfo --states=idle" will include it.
For instance:

# sinfo --states=idle
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 7-00:00:00     33   resv c1-[21-22,24-28,47-52,56],c2-[15-18,21-23,36-39,43,54-56],c3-[9-10,13,15]
bigmem       up 14-00:00:0      1   idle c3-55
accel        up 14-00:00:0      0    n/a
optimist     up   infinite     33   resv c1-[21-22,24-28,47-52,56],c2-[15-18,21-23,36-39,43,54-56],c3-[9-10,13,15]
optimist     up   infinite      1   idle c3-55
# drain c3-55 bhmtest
# scontrol show node c3-55
NodeName=c3-55 Arch=x86_64 CoresPerSocket=20
[...]
   State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1000 Owner=N/A MCS_label=N/A
[...]
# sinfo --states=idle
normal*      up 7-00:00:00     33   resv c1-[21-22,24-28,47-52,56],c2-[15-18,21-23,36-39,43,54-56],c3-[9-10,13,15]
bigmem       up 14-00:00:0      1  drain c3-55
accel        up 14-00:00:0      0    n/a
optimist     up   infinite      1  drain c3-55
optimist     up   infinite     33   resv c1-[21-22,24-28,47-52,56],c2-[15-18,21-23,36-39,43,54-56],c3-[9-10,13,15]

-- 
B/H
Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command
Amjad Syed writes:

> We have users who have defined unix secondary group ids on our login
> nodes.
>
> vas20xhu@login01 ~]$ groups
> BIO_pg BIO_AFMAKAY_LAB_USERS
>
> But when we run interactive and go to a compute node, the user does
> not have the secondary group BIO_AFMAKAY_LAB_USERS:
>
> vas20xhu@c0077 ~]$ groups
> BIO_pg

[...]

> When we ssh directly into the node without using the interactive
> script there are no issues with groups.

Have you set up your Slurm to be an NSS provider for user and group
info? I believe that will only send the primary group to the job step
processes. See the enable_nss_slurm LaunchParameters in man slurm.conf,
and the URL in that description.

-- 
Regards,
Bjørn-Helge Mevik
Re: [slurm-users] Is this a known error?
Andreas Davour writes: > [2021-09-17T08:53:49.166] error: unpack_header: protocol_version 8448 > not supported > [2021-09-17T08:53:49.166] error: unpacking header > [2021-09-17T08:53:49.166] error: destroy_forward: no init > [2021-09-17T08:53:49.166] error: slurm_receive_msg_and_forward: > Message receive failure > [2021-09-17T08:53:49.176] error: service_connection: > slurm_receive_msg: Message receive failure > > Anyone seen that before, or immediately see that I did something wrong? Sounds a lot like you have a different version of Slurm installed on some compute node(s). -- B/H signature.asc Description: PGP signature
Re: [slurm-users] FreeMem is not equal to (RealMem - AllocMem)
Pavel Vashchenkov writes: > There is a line "RealMemory=257433 AllocMem=155648 FreeMem=37773 > Sockets=2 Boards=1" > > > My question is: Why there is so few FreeMem (37 GB instead of expected > 100 GB (RealMem - AllocMem))? If I recall correctly, RealMem is what you have configured in slurm.conf, and AllocMem is how much Slurm has allocated to jobs, while FreeMem is how much ram is unused on the machine. So RealMem and AllocMem do not necessarily correspond to what "free" or "top" reports. -- B/H signature.asc Description: PGP signature
Re: [slurm-users] draining nodes due to failed killing of task?
Adrian Sevcenco writes:

> Having just implemented some triggers i just noticed this:
>
> NODELIST   NODES PARTITION    STATE CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> alien-0-47     1    alien* draining   48 48:1:1 193324   214030      1 rack-0,4 Kill task failed
> alien-0-56     1    alien*  drained   48 48:1:1 193324   214030      1 rack-0,4 Kill task failed
>
> i was wondering why a node is drained when killing of task fails

I guess the heuristic is that something is wrong with the node, so it
should not run more jobs. Like disk waits or similar that might require
a reboot.

> and how can i disable it? (i use cgroups)

I don't know how to disable it, but it can be tuned with:

UnkillableStepTimeout
    The length of time, in seconds, that Slurm will wait before
    deciding that processes in a job step are unkillable (after they
    have been signaled with SIGKILL) and execute
    UnkillableStepProgram. The default timeout value is 60 seconds.
    If exceeded, the compute node will be drained to prevent future
    jobs from being scheduled on the node.

(Note though, that according to
https://bugs.schedmd.com/show_bug.cgi?id=11103 it should not be set
higher than 127 s.)

You might also want to look at this setting to find out what is going
on on the machine when Slurm cannot kill the job step:

UnkillableStepProgram
    If the processes in a job step are determined to be unkillable for
    a period of time specified by the UnkillableStepTimeout variable,
    the program specified by UnkillableStepProgram will be executed.
    By default no program is run. See the section UNKILLABLE STEP
    PROGRAM SCRIPT for more information.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
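[As a sketch (the file name and contents are made up, not from the
original mail), an UnkillableStepProgram could simply snapshot the node
state for later inspection - processes stuck in uninterruptible sleep
(D state) are the usual culprits behind unkillable steps:]

```shell
# Hypothetical UnkillableStepProgram: record what the node is doing
# when a job step cannot be killed, so the drain cause can be diagnosed.
log=/tmp/unkillable.$$.log
{
    date
    echo "--- processes in uninterruptible sleep (D state) ---"
    ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /D/'
} > "$log"
echo "wrote $log"
```

On a real cluster one would write to persistent storage and perhaps
include dmesg output; this is only the skeleton of the idea.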
Re: [slurm-users] Building SLURM with X11 support
Thekla Loizou writes: > Also, when compiling SLURM in the config.log I get: > > configure:22291: checking whether Slurm internal X11 support is enabled > configure:22306: result: > > The result is empty. I read that X11 is build by default so I don't > expect a special flag to be given during compilation time right? My guess is that some X development library is missing. Perhaps look in the configure script for how this test was done (typically it will try to compile something with those devel libraries, and fail). Then see which package contains that library, install it and try again. -- B/H signature.asc Description: PGP signature
Re: [slurm-users] schedule mixed nodes first
Durai Arasan writes:

> Is there a way of improving this situation? E.g. by not blocking IDLE
> nodes with jobs that only use a fraction of the 8 GPUs? Why are single
> GPU jobs not scheduled to fill already MIXED nodes before using IDLE
> ones?
>
> What parameters/configuration need to be adjusted for this to be
> enforced?

There are two SchedulerParameters you could experiment with (from man
slurm.conf):

bf_busy_nodes
    When selecting resources for pending jobs to reserve for future
    execution (i.e. the job can not be started immediately), then
    preferentially select nodes that are in use. This will tend to
    leave currently idle resources available for backfilling longer
    running jobs, but may result in allocations having less than
    optimal network topology. This option is currently only supported
    by the select/cons_res and select/cons_tres plugins (or
    select/cray_aries with SelectTypeParameters set to
    "OTHER_CONS_RES" or "OTHER_CONS_TRES", which layers the
    select/cray_aries plugin over the select/cons_res or
    select/cons_tres plugin respectively).

pack_serial_at_end
    If used with the select/cons_res or select/cons_tres plugin, then
    put serial jobs at the end of the available nodes rather than
    using a best fit algorithm. This may reduce resource fragmentation
    for some workloads.

-- 
B/H
Re: [slurm-users] How can I get complete field values with without specify the length
"xiaojingh...@163.com" writes:

> I am doing a parsing job on slurm fields. Sometimes when one field is
> too long, slurm will limit the length, marking the truncation with a
> "+".

You don't say which slurm command you are trying to parse the output
from, but if it is sacctmgr, it has an option --parsable2(*)
specifically designed for parsing output, and which does not truncate
long field values.

(*) There is also --parsable, but that puts an extra "|" at the end of
the line, so I prefer --parsable2.

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
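[For instance - the sample line below is a stand-in for a real
`sacctmgr --noheader --parsable2 list associations
format=cluster,account,user,qos` record; field names and values are
illustrative only:]

```shell
# --parsable2 emits "|"-separated records with no "+" truncation,
# so a record splits cleanly on the delimiter.
line='mycluster|myaccount|someuser|normal'   # stand-in sample record
IFS='|' read -r cluster account user qos <<< "$line"
echo "$account $qos"
```

Each field arrives in full, however long, which is what makes
--parsable2 the right choice for scripted parsing.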
Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository
Thanks for the heads-up, Ole! -- B/H signature.asc Description: PGP signature
Re: [slurm-users] Set a ramdom offset when starting node health check in SLURM
You can also check out HealthCheckNodeState=CYCLE man slurm.conf: "Rather than running the health check program on all nodes at the same time, cycle through running on all compute nodes through the course of the HealthCheckInterval. May be combined with the various node state options." -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] Slurm User Group Meeting (SLUG'20) Agenda Posted
Just wondering, will we get our t-shirts by email? :D -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] GrpMEMRunMins equivalent?
Corey Keasling writes:

> And thank you also for the solution, I hadn't tried that
> syntax. Interesting that GrpCPURunMins works while GrpMemRunMins does
> not.

Historic reasons: GrpCPURunMins has been there a long time. Instead of
adding GrpMemRunMins, GrpGPURunMins, etc. for all the TRES one can
specify, they chose to add GrpTRESRunMins instead.

> I also noticed that if the limit is specified as
> GrpTRESRunMins=Memory=1000,Cpu=2000 only the CPU portion takes effect --
> the Memory= portion is silently dropped. And, specifying Memory=1000
> by itself results in 'Unknown option: grptresrunmins=memory=1000'.
> Only Mem= works, and it works in both instances.

My bad. Mem is correct, Memory is my false memory. :)

> In fact, it looks
> like any unknown option is silently ignored so long as at least one
> correctly named TRES appears in the list.

Interesting to know!

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] GrpMEMRunMins equivalent?
Corey Keasling writes: > The documentation only refers to GrpGRESRunMins, but I can't figure > out what I might substitute for GRES that means Memory in the same way > that substituting CPU means, well, CPUs. Google turns up precisely > nothing for GrpMemRunMins... Am I missing something? GrpTRESRunMins For instance: GrpTRESRunMins=Memory=1000,Cpu=2000 See man sacctmgr for details. -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] How to get command from a finished job
Gestió Servidors writes: > For example, with "scontrol show jobid" I can know what command has > been submited, its workir, the stderr file and the stdout one. This > information, I think, cannot be obtained when the job is finished and > I run "sacct". The workdir is available with sacct, IIRC. For other types of information, I believe you can add code to your job_submit.lua that stores it in the job's AdminComment field, which sacct can display. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] How to trap a SIGINT signal in a child process of a batch ?
Jean-mathieu CHANTREIN writes: > But that is not enough, it is also necessary to use srun in > test.slurm, because the signals are sent to the child processes only > if they are also children in the JOB sense. Good to know! -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] How to trap a SIGINT signal in a child process of a batch ?
Jean-mathieu CHANTREIN writes: > test.sh: > > #!/bin/bash > > function sig_handler() > { > echo "Executable interrupted" > exit 2 > } > > trap 'sig_handler' SIGINT > > echo "BEGIN" > sleep 200 > echo "END" Note that bash does not interrupt any running command (except "wait") when it receives a trapped signal, so the "sleep 200" will not be interrupted. The "wait" command is special; it will be interrupted. From man bash: If bash is waiting for a command to complete and receives a signal for which a trap has been set, the trap will not be executed until the command completes. When bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed. So try using sleep 200 & wait instead. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
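[A quick way to convince yourself of this, with shortened timings; the
`kill -INT` helper stands in for what Ctrl-C or Slurm's signalling
would deliver:]

```shell
# Run the trap + "sleep & wait" pattern in a child bash; a background
# helper sends SIGINT after one second, interrupting the "wait".
out=$(bash -c '
  trap "echo Executable interrupted; exit 2" INT
  echo BEGIN
  ( sleep 1; kill -INT $$ ) &   # deliver SIGINT to this shell shortly
  sleep 5 &                     # backgrounded, so "wait" is interruptible
  wait
  echo END                      # never reached: the trap exits first
')
status=$?
echo "$out"
echo "exit status: $status"
```

Had the script run a plain foreground `sleep 5` instead, the trap would
not have fired until the sleep finished, which is exactly the pitfall
described above.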
Re: [slurm-users] log rotation for slurmctld.
Marcus Wagner writes: > by coincidence, I have stumbled today over the troubleshooting slides > from slug 2019. > > SchedMD there explicitly tells us to use SIGUSR2 instead of restart / > reload / reconfig / SIGHUP. Right, I forgot about that. :) SIGUSR2 will not even do a reconfigure, just reopen the log file. Thanks for the reminder! -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] log rotation for slurmctld.
navin srivastava writes: > can i move the log file to some other location and then restart/reload of > slurm service will start a new log file. Yes, restarting it will start a new log file if the old one is moved away. However, a reconfig will also do, and you can trigger that by sending the process a HUP signal. That way you don't have to restart the daemon. We have this in our logrotate file:

    postrotate
        ## Using the newer feature of reconfig when getting a SIGHUP.
        kill -hup $(ps -C slurmctld h -o pid)
        kill -hup $(ps -C slurmdbd h -o pid)
    endscript

(That is for both slurmctld.log and slurmdbd.log.) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
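For reference, a fuller logrotate stanza in the same spirit might look like this. It is a sketch: the log paths and systemd unit names are assumptions for your installation, and `systemctl kill -s HUP` is used instead of parsing ps output:

```
/var/log/slurm/slurmctld.log /var/log/slurm/slurmdbd.log {
    weekly
    missingok
    compress
    delaycompress
    postrotate
        ## Reconfigure-on-SIGHUP makes the daemons reopen their logs.
        systemctl kill -s HUP slurmctld slurmdbd
    endscript
}
```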
Re: [slurm-users] Question about SacctMgr....
Ole Holm Nielsen writes: > You may use the (undocumented) format=... option to select only the A while ago, after meticulous study of the man page, I discovered that the format option is not actually undocumented, it is just very well hidden. :) All that "man sacctmgr" says about it is

    GLOBAL FORMAT OPTION
        When using the format option for listing various fields you can
        put a %NUMBER afterwards to specify how many characters should
        be printed.

        e.g. format=name%30 will print 30 characters of field name
        right justified.  A -30 will print 30 characters left
        justified.

(in addition to using it in a couple of examples). :) -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] memory in job_submit.lua
Marcus Wagner writes:

> does anyone know how to detect in the lua submission script, if the
> user used --mem or --mem-per-cpu?
>
> And also, if it is possible to "unset" this setting?

Something like this should work:

    if job_desc.pn_min_memory ~= slurm.NO_VAL64 then
       -- --mem or --mem-per-cpu was used; unset it
       job_desc.pn_min_memory = slurm.NO_VAL64
    end

> The reason is, we want to remove all memory thingies set by the user
> for exclusive jobs.

We just reject jobs if they use a setting we don't allow -- that avoids jobs running differently than what the user believed. For instance:

    -- Bigmem jobs should specify memory, no other job should
    if job_desc.pn_min_memory == slurm.NO_VAL64 then
       -- If bigmem: fail
       if job_desc.partition == "bigmem" then
          slurm.log_info(
             "bigmem job from uid %d without memory specification: Denying.",
             job_desc.user_id)
          slurm.user_msg("--mem or --mem-per-cpu required for bigmem jobs")
          return 2044 -- Signal ESLURM_INVALID_TASK_MEMORY
       end
    else
       -- If not bigmem: fail
       if job_desc.partition ~= "bigmem" then
          slurm.log_info(
             "non-bigmem job from uid %d with memory specification: Denying.",
             job_desc.user_id)
          slurm.user_msg("Memory specification only allowed for bigmem jobs")
          return 2044 -- Signal ESLURM_INVALID_TASK_MEMORY
       end
    end

-- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] Question on how to make slurm aware of a CVMFS revision
"Klein, Dennis" writes: > * Can I (and if yes, how can I) update the GRES count dynamically > (The idea would be to monitor the revision changes on all cvmfs > mountpoints with a simple daemon process on each worker node which > then notifies slurm on a revision change)? Perhaps the daemon process could simply run "scontrol update node= ..." when it detects a change? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] Slurm 18.08.8 --mem-per-cpu + --exclusive = strange behavior
Beatrice Charton writes: > Hi, > > We have a strange behaviour of Slurm after updating from 18.08.7 to > 18.08.8, for jobs using --exclusive and --mem-per-cpu. > > Our nodes have 128GB of memory, 28 cores. > $ srun --mem-per-cpu=3 -n 1 --exclusive hostname > => works in 18.08.7 > => doesn’t work in 18.08.8 I'm actually surprised it _worked_ in 18.08.7. At one time - long before v 18.08 - the behaviour when using --exclusive was changed: in order to account the job for all cpus on the node, the number of cpus asked for with --ntasks would simply be multiplied by "#cpus-on-node / --ntasks" (so in your case: 28). Unfortunately, that also means that the memory the job requires per node is multiplied by the same factor (in your case 28 * 3 MiB = 84 MiB; with a larger --mem-per-cpu this can easily exceed the memory of the node). For this reason, we tend to ban --exclusive on our clusters (or at least warn about it). I haven't looked at the code for a long time, so I don't know whether this is still the current behaviour, but every time I've tested, I've seen the same problem. I believe I've tested on 19.05 (but I might be misremembering). -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] RHEL8 support
Taras Shapovalov writes: > Do I understand correctly that Slurm19 is not compatible with rhel8? It is > not in the list https://slurm.schedmd.com/platforms.html It says "RedHat Enterprise Linux 7 (RHEL7), CentOS 7, Scientific Linux 7 (and newer)" Perhaps "(and newer)" is meant to cover RHEL 8 and CentOS 8 as well, not only Scientific Linux 8? -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] Sacct selecting jobs outside range
Brian Andrus writes: > When running a report to try and get jobs that start during a particular > day, sacct is returning a number of jobs that show as starting/ending > outside the range. > What could cause this? sacct selects jobs that were eligible to run (including actually running) between --starttime and --endtime. (If you add --state, it will select jobs that were in that state between the times.) So _any_ job that was running between --starttime and --endtime will be listed, even if it started before --starttime and/or ended after --endtime. Basically, you can think of [--starttime, --endtime] as a window in time, and sacct will list the jobs that were in the requested state(s) sometime inside that window. It will not care which states the jobs were in outside the window. At least, this is how I have come to think of it. IMHO, the sacct manual is a bit difficult to understand sometimes. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
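The window semantics described above can be written as a tiny predicate: a job is reported iff its [start, end] interval overlaps the [--starttime, --endtime] window, no matter where it starts or ends. (Times are plain numbers here purely for illustration; this is not sacct's code, just the mental model.)

```python
def in_window(job_start, job_end, win_start, win_end):
    """True if the job's lifetime overlaps the sacct time window."""
    return job_start <= win_end and job_end >= win_start

# Started before the window, ended inside it: listed.
print(in_window(8, 11, 10, 20))   # -> True
# Ran entirely before the window: not listed.
print(in_window(1, 5, 10, 20))    # -> False
```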
Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?
Matthew BETTINGER writes: > Just curious if this option or oom setting (which we use) can leave > the nodes in CG "completing" state. I don't think so. As far as I know, jobs go into completing state when Slurm is cancelling them or when they exit on their own, and stay in that state until any epilogs are run. In my experience, the most typical reasons for jobs hanging in CG are disk system failures or other failures leading to either the job processes or the epilog processes hanging in "disk wait". -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?
Marcus Boden writes: > you're looking for KillOnBadExit in the slurm.conf: > KillOnBadExit [...] > this should terminate the job if a step or a process gets oom-killed. That is a good tip! But as I read the documentation (I haven't tested it), it will only kill the job step itself, it will not kill the whole job. Also, it will only have effect for things started with srun, mpirun or similar. However, in combination with "set -o errexit", I believe most OOM kills would get the job itself terminated. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?
Juergen Salk writes: > that is interesting. We have a very similar setup as well. However, in > our Slurm test cluster I have noticed that it is not the *job* that > gets killed. Instead, the OOM killer terminates one (or more) > *processes* Yes, that is how the kernel OOM killer works. This is why we always tell users to use "set -o errexit" in their job scripts. Then at least the job script exits as soon as one of its processes is killed. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
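A small self-contained demonstration of the "set -o errexit" advice (plain bash, nothing Slurm-specific; the self-KILLed subshell stands in for a process terminated by the OOM killer, which also dies with SIGKILL, exit status 137):

```shell
#!/bin/bash
# With errexit, the job script stops at the first failing command, so
# later steps never run on half-finished input.

cat > /tmp/errexit_demo.sh <<'EOF'
#!/bin/bash
set -o errexit            # abort the job script on the first failure
echo "step 1 ok"
bash -c 'kill -KILL $$'   # stand-in for a process killed by the OOM killer
echo "step 2"             # never reached: errexit stops the script first
EOF

rc=0
bash /tmp/errexit_demo.sh > /tmp/errexit_demo.out || rc=$?
cat /tmp/errexit_demo.out
echo "exit status: $rc"   # 137 = 128 + 9 (SIGKILL), like an OOM kill
```

Without the `set -o errexit` line, "step 2" would run anyway and the job would appear to complete normally.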
Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?
Jean-mathieu CHANTREIN writes:

> I tried using, in slurm.conf
>
> TaskPlugin=task/affinity, task/cgroup
> SelectTypeParameters=CR_CPU_Memory
> MemLimitEnforce=yes
>
> and in cgroup.conf:
>
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> MaxSwapPercent=10
> TaskAffinity=no

We have a very similar setup, the biggest difference being that we have MemLimitEnforce=no, and leave the killing to the kernel's cgroup. For us, jobs are killed as they should. Here are a couple of things you could check:

- Does it work if you remove the space in "TaskPlugin=task/affinity, task/cgroup"? (Slurm can be quite picky when reading slurm.conf.)

- See in slurmd.log on the node(s) of the job whether cgroup actually gets activated and starts limiting memory for the job, or whether there are any errors related to cgroup.

- While a job is running, look in the cgroup memory directory for the job (typically /sys/fs/cgroup/memory/slurm/uid_/job_) on the compute node. Do the values there, for instance memory.limit_in_bytes and memory.max_usage_in_bytes, make sense?

-- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] OverSubscribe parameter
Espen Tangen writes: > Hi all, I need a bullet proof way of checking the setting of the > OverSubscribe parameter from within a runscript. Perhaps

    squeue -o %h -j $SLURM_JOB_ID

is what you are looking for. According to squeue(1):

    %h  Can the compute resources allocated to the job be over
        subscribed by other jobs. The resources to be over subscribed
        can be nodes, sockets, cores, or hyperthreads depending upon
        configuration. The value will be "YES" if the job was submitted
        with the oversubscribe option or the partition is configured
        with OverSubscribe=Force, "NO" if the job requires exclusive
        node access, "USER" if the allocated compute nodes are
        dedicated to a single user, "MCS" if the allocated compute
        nodes are dedicated to a single security class (See MCSPlugin
        and MCSParameters configuration parameters for more
        information), "OK" otherwise (typically allocated dedicated
        CPUs). (Valid for jobs only)

-- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] slurm config :: set up a workdir for each job
Adrian Sevcenco writes: > Hi! Is there a method for setting up a work directory unique for each > job from a system setting? and then clean that up? > > can i use somehow the prologue and epilogue sections? slurmd prolog and epilog scripts are commonly used to do this, yes. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
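A minimal sketch of the idea: the slurmd Prolog creates a per-job workdir and the Epilog removes it. The base path and the fake job id are assumptions for illustration; on a real node Prolog/Epilog run as root with SLURM_JOB_ID set by slurmd, and you would typically hand the path to the job via a TaskProlog that prints an "export" line.

```shell
#!/bin/bash
# Prolog/Epilog sketch for per-job scratch directories.

BASE=/tmp/scratchdemo              # stand-in for e.g. /localscratch

prolog() {                         # runs at job start
    mkdir -p "$BASE/$SLURM_JOB_ID"
}

epilog() {                         # runs at job end
    rm -rf "$BASE/${SLURM_JOB_ID:?}"   # :? guards against an empty id
}

# Demonstration with a fake job id:
SLURM_JOB_ID=12345
prolog
ls -d "$BASE/$SLURM_JOB_ID" > /tmp/scratchdemo_created
epilog
```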
Re: [slurm-users] Up-to-date agenda for SLUG 2019?
Tim Wickberg writes: > Thanks for the reminder. The final version is online now. Thanks! > (The only important change is that the time for dinner has been filled > in, and the schedule is no longer marked as preliminary.) Hey, Squatters Pub! I was actually considering it for dinner tonight. :) ... hmm ... Does anyone know if the Porcupine Pub & Grill is an ok place? It is conveniently close to the Guest House. > See you folks tomorrow! Cheers! -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
[slurm-users] Up-to-date agenda for SLUG 2019?
The agenda on https://slurm.schedmd.com/slurm_ug_agenda.html is still called "Preliminary Schedule", and has not been updated since July 19. Is this the latest agenda, or is there a newer one somewhere? -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?
Juergen Salk writes: > We are also going to implement disk quotas for the amount of local > scratch space that has been allocated for the job by means of generic > resources (e.g. `--gres=scratch:100´ for 100GB). This is especially > important when several users share a node. Indeed. > This leads me to ask how you plan to determine the amount of local > scratch space allocated for the job from within its prolog and epilog > scripts. [...] > I already thought about running `scontrol show job $SLURM_JOB_ID´ from > within the prolog/epilog scripts in order to get that piece of > information. This is exactly what we do. :) > This line could eventually be parsed to get the amount of scratch > allocated for this job (and then further used to increase/decrease the > quota limits for the corresponding $SLURM_JOB_USER in the > prolog/epilog scripts). If you use separate directories for each job, and use "project" quotas (a.k.a. folder quotas), then you don't have to adjust the quota when a new job arrives, even if it is from the same user. > However, this still looks kind of clumsy to me and I wonder, if I > have just overlooked a more obvious, cleaner or more robust solution. Nope. (We could have done something more elegant like writing a small Perl utility that extracted just the needed parts, but never got around to it.) I _think_ another option would be to write a SPANK plugin for the gres, and let that create/remove the scratch directory and set the quota, but I haven't looked into that. That would probably count as a more elegant solution. > Since this is probably not an unusual requirement, I suppose this is > something that many other sites have already solved for > themselves. No? Yes, please, let us know how you've solved this! -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
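The parsing step discussed above could look like the following sketch (not the actual site script). The field names are assumptions: depending on Slurm version the relevant line in `scontrol show job` output may read "Gres=scratch:100" or "TresPerNode=gres:scratch:100", so the pattern accepts both.

```python
import re

def scratch_units(scontrol_output: str) -> int:
    """Return the requested 'scratch' gres count, or 0 if none."""
    m = re.search(r"(?:Gres=|gres[:/])scratch:(\d+)", scontrol_output)
    return int(m.group(1)) if m else 0

# Made-up example output fragment:
sample = """JobId=4242 JobName=quota-test
   UserId=alice(1000) GroupId=alice(1000)
   TresPerNode=gres:scratch:100
"""
print(scratch_units(sample))   # -> 100
```

In a prolog this value could then drive the quota commands for the job's scratch directory.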
Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?
Ole Holm Nielsen writes: > I figured that other sites need the free disk space feature as well > :-) :) > How do you dynamically update your gres=localtmp resource according to > the current disk free space? I mean, there is already a TmpFS disk > space size defined in slurm.conf, so how does your gres=localtmp > differ from TmpFS? We simply define the total "count" in the NodeLines for the compute nodes, like Nodename=c11-[1-36] Gres=localtmp:170 ... for nodes with 170 GB disk. Then Slurm will do the rest; it will keep track of these 170 localtmp "units" and not hand out more than that to jobs. The jobs just specify --gres=localtmp:50 for 50 "units". (Slurm doesn't know how much disk there is, or even that "localtmp" refers to disk space; it only keeps count of the units in the Gres definition, so we could have chosen MB as units (or multiples of Pi, if we really wanted :) ). So we don't use the TmpFS setting at all. In our prolog, when a job has asked for "localtmp", we create a directory for the job (/localscratch/$SLURM_JOB_ID), and set an environment variable $LOCALTMP to that directory, so the user can do "cp mydata $LOCALTMP" etc. in the job script. Then in the epilog, we delete the area. The new thing we are looking into, then, is to set a "project" quota (a.k.a. folder quota) for the $LOCALTMP directory, and clear the quota afterwards. xfs supports this, and ext4 with a recent enough version of the e2fsprogs toolkit. > With "scontrol show node xxx" we get the node memory values such as > "RealMemory=256000 AllocMem=24 FreeMem=160056". Similarly it > would be great to augment the TmpDisk with a FreeDisk parameter, for > example "TmpDisk=14 FreeDisk=9". That would have been nice, yes. > Would a Slurm modification be required to include a FreeDisk > parameter, and then change the meaning of "sbatch --tmp=xxx" to refer > to the FreeDisk in stead of TmpDisk size? I think it will, yes. -- Regards, Bjørn-Helge Mevik, dr. 
scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?
We are facing more or less the same problem. We have historically defined a Gres "localtmp" with the number of GB initially available on local disk, and then jobs ask for --gres=localtmp:50 or similar. That prevents slurm from allocating jobs on the cluster if they ask for more disk than is currently "free" -- in the sense of "not handed out to a job". But it doesn't prevent jobs from using more than they have asked for, so the disk might have less (real) free space than slurm thinks. As far as I can see, cgroups does not support limiting used disk space, only amount of IO/s and similar. We are currently considering using file system quotas for enforcing this. Our localtmp disk is a separate xfs partition, and the idea is to make the prolog set up a "project" disk quota for the job on the localtmp file system, and the epilog to remove it again. I'm not 100% sure we will make it work, but I'm hopeful. Fingers crossed! :) -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature
Re: [slurm-users] Slurm 19.05 --workdir non existent?
Christopher Benjamin Coffey writes: > It seems that --workdir= is no longer a valid option in batch jobs and > srun in 19.05, and has been replaced by --chdir. I didn't see a change > log about this, did I miss it? Going through the man pages it seems it > hasn't existed for some time now actually! Maybe not since before > 17.11 series. When did this happen?! From the NEWS file:

    * Changes in Slurm 17.11.0rc1
    =============================
    [...]
     -- Change --workdir in sbatch to be --chdir as in all other
        commands (salloc, srun).

> I guess I'll have to write a > jobsubmit rule to overwrite this in the meantime till we get users > trained differently. I think you would have to write a SPANK plugin that implements the --workdir switch. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo signature.asc Description: PGP signature