[slurm-users] Re: scrontab question

2024-05-08 Thread Bjørn-Helge Mevik via slurm-users
k like ordinary ASCII, for instance "unbreakable space". I tend to just pipe the text through "od -a". -- Cheers, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo
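A minimal Python sketch of the check that "od -a" does by eye: it lists every non-ASCII character in a piece of text with its position and Unicode name, so an invisible non-breaking space (U+00A0) stands out. The sample scrontab line and script path are made up for illustration.

```python
import unicodedata


def non_ascii_chars(text):
    """Return (line, column, char, code point, Unicode name) for each non-ASCII char."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # anything outside 7-bit ASCII
                hits.append((lineno, col, ch, f"U+{ord(ch):04X}",
                             unicodedata.name(ch, "UNKNOWN")))
    return hits


if __name__ == "__main__":
    # Hypothetical scrontab line with a non-breaking space after the minute field.
    sample = "0\u00a0* * * * /path/to/script.sh"
    for hit in non_ascii_chars(sample):
        print(hit)
```

Passing the contents of a saved copy of the crontab to non_ascii_chars() works the same way.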

[slurm-users] Re: Convergence of Kube and Slurm?

2024-05-07 Thread Bjørn-Helge Mevik via slurm-users
Tim Wickberg via slurm-users writes: > [1] Slinky is not an acronym (neither is Slurm [2]), but loosely > stands for "Slurm in Kubernetes". And not at all inspired by Slinky Dog in Toy Story, I guess. :D

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-17 Thread Bjørn-Helge Mevik via slurm-users
:) (Except for the number of processes and the number of pending signals, according to "man setrlimit".) Then 1024 might not be so low for ulimit -n after all.

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Bjørn-Helge Mevik via slurm-users
Ole Holm Nielsen writes: > Hi Bjørn-Helge, > > That sounds interesting, but which limit might affect the kernel's > fs.file-max? For example, a user already has a narrow limit: > > ulimit -n > 1024 AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n" is per user. Now that I
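To see the two limits side by side, here is a small Python sketch (Linux assumed). resource.getrlimit(RLIMIT_NOFILE) returns the soft and hard open-file limits of the calling process, which is what "ulimit -n" reports in a shell; setrlimit(2) documents these limits as per-process, apart from the exceptions mentioned in the thread. fs.file-max is read from /proc/sys/fs/file-max and applies to the whole node.

```python
import resource


def open_file_limits():
    """Soft and hard RLIMIT_NOFILE of this process (what "ulimit -n" shows)."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def node_file_max(path="/proc/sys/fs/file-max"):
    """Node-wide fs.file-max, or None if it cannot be read (e.g. not Linux)."""
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("per-process RLIMIT_NOFILE soft/hard:", soft, hard)
    print("node-wide fs.file-max:", node_file_max())
```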

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Bjørn-Helge Mevik via slurm-users
/limits.d/. Then users would be blocked from opening unreasonably many files. One could use this to find which applications are responsible, and try to get them fixed.

[slurm-users] Re: Increasing SlurmdTimeout beyond 300 Seconds

2024-02-12 Thread Bjørn-Helge Mevik via slurm-users
We've been running one cluster with SlurmdTimeout = 1200 sec for a couple of years now, and I haven't seen any problems due to that.

[slurm-users] Re: Starting a job after a file is created in previous job (dependency looking for soluton)

2024-02-06 Thread Bjørn-Helge Mevik via slurm-users
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: Why is Slurm 20 the latest RPM in RHEL 8/Fedora repo?

2024-01-31 Thread Bjørn-Helge Mevik via slurm-users
can tailor the rpms/build to your needs (IB? SlingShot? Nvidia? etc.).

Re: [slurm-users] propose environment variables SLURM_STDOUT, SLURM_STDERR, SLURM_STDIN

2024-01-21 Thread Bjørn-Helge Mevik
I would find that useful, yes. Especially if the variables were made available for the Prolog and Epilog scripts.

Re: [slurm-users] slurm.conf

2024-01-18 Thread Bjørn-Helge Mevik
the Slurm configuration file.

Re: [slurm-users] SLURM Reservation for GPU

2023-12-04 Thread Bjørn-Helge Mevik
Bjørn-Helge Mevik writes: > (Unfortunately, the page is so "wisely" created that it is impossible > to cut'n'paste from it.) That turned out to be a PEBKAC. :) cut'n'paste *is* possible. :)

Re: [slurm-users] SLURM Reservation for GPU

2023-12-04 Thread Bjørn-Helge Mevik
Minulakshmi S writes: > I am not able to find any supporting statements in Release Notes ... could > you please point. https://www.schedmd.com/news.php, the "Slurm version 23.11.0 is now available" section, the seventh bullet point. (Unfortunately, the page is so "wisely" created that it is

Re: [slurm-users] SLURM Reservation for GPU

2023-11-29 Thread Bjørn-Helge Mevik
I believe support for this was implemented in 23.11.0.

Re: [slurm-users] Releasing stale allocated TRES

2023-11-23 Thread Bjørn-Helge Mevik
n old version.

Re: [slurm-users] --partition requests ignored in scripts

2023-11-08 Thread Bjørn-Helge Mevik
ions will override any environment variables.

Re: [slurm-users] RES: multiple srun commands in the same SLURM script

2023-11-01 Thread Bjørn-Helge Mevik
srun has changed quite a bit in the recent versions, and the example above is for the latest version, so check the srun man page for your version. (And unfortunately, the documentation in the srun man page has not always been correct, so you might need to experiment. For instance, I believe Example 7 above i

Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-16 Thread Bjørn-Helge Mevik
couple of our clusters due to this.)

Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-12 Thread Bjørn-Helge Mevik
Taras Shapovalov writes: > Are the older versions affected as well? Yes, all older versions are affected.

Re: [slurm-users] New member , introduction

2023-09-30 Thread Bjørn-Helge Mevik
Welcome! :)

Re: [slurm-users] question about configuration in slurm.conf

2023-09-26 Thread Bjørn-Helge Mevik
d work. I'd personally use the second one.

[slurm-users] Transport from SLC to Provo?

2023-08-14 Thread Bjørn-Helge Mevik
alternative way to get to Provo on a Sunday night?

[slurm-users] No coffee allowed on BYU campus(!) Suggestions for alternatives?

2023-07-04 Thread Bjørn-Helge Mevik
ing or drinking of alcohol, coffee, or tea is permitted on the BYU campus, though other caffeinated beverages are allowed." So, any suggestions for "other caffeinated beverages" I'd be able to buy and bring with me to the sessions?

Re: [slurm-users] Job step do not take the hole allocation

2023-06-30 Thread Bjørn-Helge Mevik
Hi, Ole! :) Ole Holm Nielsen writes: > Can anyone she light on the relationship between Tommi's > slurm_cli_pre_submit function and the ones defined in the > cli_filter_plugins page? I think the *_p_* functions are functions you need to implement if you write a cli plugin in C. When you

Re: [slurm-users] Limit run time of interactive jobs

2023-05-08 Thread Bjørn-Helge Mevik
Ole Holm Nielsen writes: > On 5/8/23 08:39, Bjørn-Helge Mevik wrote: >> Angel de Vicente writes: >> >>> But one possible way to something similar is to have a partition only >>> for interactive jobs and a different partition for batch jobs, and then >>

Re: [slurm-users] Limit run time of interactive jobs

2023-05-08 Thread Bjørn-Helge Mevik
odule (check the job_submit.lua > example). Wouldn't it be simpler to just refuse too long interactive jobs in job_submit.lua?

Re: [slurm-users] [EXT] Submit sbatch to multiple partitions

2023-04-17 Thread Bjørn-Helge Mevik
" containing all nodes will work.

Re: [slurm-users] Preventing --exclusive on a per-partition basis

2023-03-22 Thread Bjørn-Helge Mevik
I'd simply add a test like 'and job_desc.partition == "the_partition"' to the test for exclusiveness.

Re: [slurm-users] srun jobfarming hassle question

2023-01-18 Thread Bjørn-Helge Mevik
her 1 GiB without the step being killed. You can inspect the memory limits that are in effect in cgroups (v1) in /sys/fs/cgroup/memory/slurm/uid_/job_ (usual location, at least).

Re: [slurm-users] srun jobfarming hassle question

2023-01-18 Thread Bjørn-Helge Mevik
ile Slurm does job cleanup. Step epilogs and/or SPANK plugins can further delay the release of step resources.

Re: [slurm-users] job_container/tmpfs and autofs

2023-01-12 Thread Bjørn-Helge Mevik

Re: [slurm-users] How to read job accounting data long output? `sacct -l`

2022-12-15 Thread Bjørn-Helge Mevik
Marcus Wagner writes: > That depends on what is meant with formatting argument. Yes, they could surely have defined that. > etc. And I would assume, that -S, -E and -T are filtering options, not > formatting options. I'd describe -T as a formatting option: -T, --truncate

Re: [slurm-users] How to read job accounting data long output? `sacct -l`

2022-12-14 Thread Bjørn-Helge Mevik
Marcus Wagner writes: > it it important to know, that the json output seems to be broken. > > First of all, it does not (compared to the normal output) obey to the > truncate option -T. > But more important, I saw a job, where in a "day output" (-S -E > ) no steps were recorded. > Using sacct

Re: [slurm-users] How to read job accounting data long output? `sacct -l`

2022-12-13 Thread Bjørn-Helge Mevik
am too lazy to read the man page. Then I use -o to specify what I want returned.) Also, in newer versions at least, there are --json and --yaml options that give you output which you can parse with other tools (or read, if you really want :).

Re: [slurm-users] Test Suite problems related to requesting tasks

2022-10-26 Thread Bjørn-Helge Mevik
"Groner, Rob" writes: > For your "special testing config", do you just mean the > slurm.conf/gres.conf/*.conf files? Yes. > So when you want to test a new > version of slurm, you replace the conf files and then restart all of > the daemons? Exactly. (We usually don't do this on our

Re: [slurm-users] Test Suite problems related to requesting tasks

2022-10-25 Thread Bjørn-Helge Mevik
"Groner, Rob" writes: > I'm wondering OVERALL if the test suite is supposed to work on ANY > working slurm system. I could not find any documentation on how the > slurm configuration and nodes were required to be setup in order for > the test to workno indication that the test suite requires

Re: [slurm-users] Accounting core-hours usages

2022-10-11 Thread Bjørn-Helge Mevik
installing MariaDB and then slurmdb as described in the manual but > looks like I am missing something. I wonder if someone can help us with > this off the list? Perhaps the eminent guide of Ole Nielsen can help you: https://wiki.fysik.dtu.dk/niflheim/SLURM

Re: [slurm-users] Use cases for "include" in slurm.conf?

2022-09-21 Thread Bjørn-Helge Mevik
repo without spreading the password around.

Re: [slurm-users] How to debug a prolog script?

2022-09-18 Thread Bjørn-Helge Mevik
> non-executable with with "sh filename", so I made the incorrect > assumption that slurm would have invoked the prolog that way. Slurm prologs can be written in any language - we used to have perl prolog scripts. :)

Re: [slurm-users] How to debug a prolog script?

2022-09-16 Thread Bjørn-Helge Mevik
e not exactly the same (for instance, in my experience, CentOS and Rocky are close enough to RHEL for most slurm-related things). One takes what one has. :)

Re: [slurm-users] How to debug a prolog script?

2022-09-16 Thread Bjørn-Helge Mevik
Davide DelVento writes: > 2. How to debug the issue? I'd try capturing all stdout and stderr from the script into a file on the compute node, for instance like this:

    exec &> /root/prolog_slurmd.$$
    set -x  # To print out all commands

before any other commands in the script. The

Re: [slurm-users] Cgroup task plugin fails if ConstrainRAMSpace and ConstrainKmemSpace are enabled

2022-08-22 Thread Bjørn-Helge Mevik
This doesn't answer your question, but still: I'd be wary about using ConstrainKmemSpace at all. At least in the kernels on RedHat/CentOS <= 7.9, there is a bug that eventually prevents Slurm from starting new job steps on a node, and the node has to be rebooted to be usable again. See for

Re: [slurm-users] "slurmd -C" reduce by xx GB or yy %

2022-08-11 Thread Bjørn-Helge Mevik
C program that mallocs and fills a large array, and see how big I can make the array before the node starts to swap.

Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-24 Thread Bjørn-Helge Mevik
Miguel Oliveira writes: > Hi Bjørn-Helge, > > Long time! Hi Miguel! Yes, definitely a long time! :D > Why not? You can have multiple QoSs and you have other techniques to change > priorities according to your policies. A job can only run in a single QoS, so if you submit a job with "sbatch

Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-24 Thread Bjørn-Helge Mevik
pacted. Yes, that will work. But it has the drawback that you cannot use QoS'es for *anything else*, like a QoS for development jobs or similar. So either way it is a trade-off.

Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-23 Thread Bjørn-Helge Mevik
writes: > TRESRaw cpu is lower than before as I'm alone on the system an no other job > was submitted. > Any explanation of this ? I'd guess you have turned on FairShare priorities. Unfortunately, in Slurm the same internal variables are used for fairshare calculations as for GrpTRESMins

Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-23 Thread Bjørn-Helge Mevik
lags=MAX_TRES or not.

Re: [slurm-users] Need to restart slurmctld for gres jobs to start

2022-06-03 Thread Bjørn-Helge Mevik
tluchko writes: > Jobs only sit in the queue with RESOURCES as the REASON when we > include the flag --gres=bandwidth:ib. If we remove the flag, the jobs > run fine. But we need the flag to ensure that we don't get a mix of IB > and ethernet nodes because they fail in this case. This doesn't

Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Bjørn-Helge Mevik
Per Lönnborg writes: > I "forgot" to tell our version because it´s a bit embarrising - 19.05.8... Haha! :D

Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Bjørn-Helge Mevik
ster once if it has too little memory, thus only giving one such message. (The node will then have state "inval" in sinfo.)

Re: [slurm-users] Strange memory limit behavior with --mem-per-gpu

2022-04-08 Thread Bjørn-Helge Mevik
Paul Raines writes: > Basically, it appears using --mem-per-gpu instead of just --mem gives > you unlimited memory for your job. > > $ srun --account=sysadm -p rtx8000 -N 1 --time=1-10:00:00 > --ntasks-per-node=1 --cpus-per-task=1 --gpus=1 --mem-per-gpu=8G > --mail-type=FAIL --pty /bin/bash >

Re: [slurm-users] srun and --cpus-per-task

2022-03-25 Thread Bjørn-Helge Mevik
Hermann Schwärzler writes: > Do you happen to know if there is a difference between setting CPUs > explicitely like you do it and not setting it but using > "ThreadsPerCore=1"? > > My guess is that there is no difference and in both cases only the > physical cores are "handed out to jobs". But

Re: [slurm-users] Disable exclusive flag for users

2022-03-25 Thread Bjørn-Helge Mevik
s many times more. There are better ways to specify using whole nodes, for instance using all cpus on the node or all memory on the node.") end (both of these just warn, though, but should be easy to change into rejecting the job.)

Re: [slurm-users] srun and --cpus-per-task

2022-03-25 Thread Bjørn-Helge Mevik
=CR_CPU_Memory and node definitions like NodeName=DEFAULT CPUs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=182784 Gres=localscratch:330G Weight=1000 (so we set CPUs to the number of *physical cores*, not *hyperthreads*).
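The arithmetic behind CPUs=40 in the node definition above, as a tiny Python sketch: with Sockets=2, CoresPerSocket=20 and ThreadsPerCore=2 the node has 40 physical cores and 80 hardware threads, so CPUs=40 matches the physical cores. How Slurm then hands these out also depends on SelectTypeParameters (CR_CPU_Memory in the quoted config), which this sketch does not model.

```python
def cpu_counts(sockets, cores_per_socket, threads_per_core):
    """Return (physical cores, hardware threads) for a node."""
    physical_cores = sockets * cores_per_socket
    hardware_threads = physical_cores * threads_per_core
    return physical_cores, hardware_threads


# Values from the NodeName=DEFAULT line above.
print(cpu_counts(sockets=2, cores_per_socket=20, threads_per_core=2))  # (40, 80)
```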

Re: [slurm-users] monitoring and update regime for Power Saving nodes

2022-02-23 Thread Bjørn-Helge Mevik
verything up, to > work on during the maintenance)? For the slurm.conf part, I'd suggest using the "configless" mode - that way at least the slurm config will always be up-to-date. See, e.g., https://slurm.schedmd.com/configless_slurm.html

Re: [slurm-users] Problems with sun and TaskProlog

2022-02-11 Thread Bjørn-Helge Mevik
"Putnam, Harry" writes: > /opt/slurm/task_epilog > > #!/bin/bash > mytmpdir=/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID > rm -Rf $mytmpdir > exit; This might not be the reason for what you observe, but I believe deleting the scratch dir in the task epilog is not a good idea. The task epilog is run

Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-04 Thread Bjørn-Helge Mevik
upgrade, shouldn't it? As I understand it, without any running jobs, you can do pretty much what you want on the compute nodes. Or am I missing something here?

Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis ?

2022-02-01 Thread Bjørn-Helge Mevik
This might not apply to your setup, but historically when we've seen similar behaviour, it was often due to the affected compute nodes missing from /etc/hosts on some *other* compute nodes.

Re: [slurm-users] Questions about default_queue_depth

2022-01-12 Thread Bjørn-Helge Mevik
David Henkemeyer writes: > 3) Is there a way to see the order of the jobs in the queue? Perhaps > squeue lists the jobs in order? squeue -S -p Sort jobs in descending priority order.

Re: [slurm-users] Is this a known error?

2021-12-08 Thread Bjørn-Helge Mevik
68 IP:10.2.3.185 CONN:8 in slurmdbd.log. But perhaps that will not happen if slurmdbd fails to unpack the header?

Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links

2021-12-03 Thread Bjørn-Helge Mevik
Adrian Sevcenco writes: > On 01.12.2021 10:25, Bjørn-Helge Mevik wrote: > >> In the end we had to give up >> using automount, and implement a manual procedure that mounts/umounts >> the needed nfs areas. > > Thanks a lot for info! manual as in "script" o

Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links

2021-12-01 Thread Bjørn-Helge Mevik
Adrian Sevcenco writes: > Hi! Does anyone know what could the the cause of such error? > I have a shared home, slurm 20.11.8 and i try a simple script in the submit > directory > which is in the home that is nfs shared... We had the "Too many levels of symbolic links" error some years ago,

Re: [slurm-users] Per-job TMPDIR: how to lookup gres allocation in prolog?

2021-11-17 Thread Bjørn-Helge Mevik
? We are using basically the same setup, and have not found any other way than running "scontrol show job ..." in the prolog (even though it is not recommended). I have yet to see any problems arising from it, but YMMV. If you find a different way, please share it with the list!
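If the prolog does call "scontrol show job" (or "scontrol -o show job", which prints each record on a single line), the Key=Value output can be split into a dictionary. This Python sketch is deliberately naive: it splits on whitespace, so values containing spaces are not handled, and the sample record and its field names are illustrative only; the field that carries a gres request differs between Slurm versions.

```python
def parse_scontrol_record(text):
    """Parse one "scontrol show job" style record into a dict.

    Naive: tokens are split on whitespace and then on the first "=",
    so values that contain spaces are not handled correctly.
    """
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens without "="
            fields[key] = value
    return fields


if __name__ == "__main__":
    # Illustrative record; real field names and values vary by Slurm version.
    sample = """JobId=1234 JobName=test
       UserId=alice(1000) GroupId=alice(1000)
       TRES=cpu=4,mem=8G,node=1,billing=4
       TresPerNode=gres:localscratch:100G"""
    job = parse_scontrol_record(sample)
    print(job["TRES"])         # cpu=4,mem=8G,node=1,billing=4
    print(job["TresPerNode"])  # gres:localscratch:100G
```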

Re: [slurm-users] Warning: can't honor --ntasks-per-node

2021-11-17 Thread Bjørn-Helge Mevik
situations with IntelMPI. In all our cases, "srun hostname" or "mpirun hostname" shows that it *does* honor --ntasks-per-node. (So we generally just ask our users to check with "srun hostname", and ignore the warning if it works as expected.)
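Checking with "srun hostname" boils down to counting how many output lines each node contributed. A minimal Python sketch, assuming the output of "srun hostname" has been captured as text; the node names in the example are hypothetical.

```python
from collections import Counter


def tasks_per_node(hostname_output):
    """Count lines per hostname in captured "srun hostname" output (one line per task)."""
    return Counter(line.strip() for line in hostname_output.splitlines() if line.strip())


def honors_ntasks_per_node(hostname_output, ntasks_per_node):
    """True if every node that produced output ran exactly ntasks_per_node tasks."""
    counts = tasks_per_node(hostname_output)
    return bool(counts) and all(n == ntasks_per_node for n in counts.values())


if __name__ == "__main__":
    # Hypothetical output for a job with --nodes=2 --ntasks-per-node=2.
    sample = "c1-1\nc1-2\nc1-1\nc1-2\n"
    print(tasks_per_node(sample))
    print(honors_ntasks_per_node(sample, 2))  # True
```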

Re: [slurm-users] Bug when I run "sinfo --states=idle"

2021-10-29 Thread Bjørn-Helge Mevik
David Henkemeyer writes: > I just noticed today that when I run "sinfo --states=idle", I get all the > idle nodes, plus an additional node that is in the "DRAIN" state (notice > how xavier6 is showing up below, even though its not in the idle state): I *think* this could be because if you drain

Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2021-09-21 Thread Bjørn-Helge Mevik
group to the job step processes. See the enable_nss_slurm option of LaunchParameters in man slurm.conf, and the URL in that description.

Re: [slurm-users] Is this a known error?

2021-09-17 Thread Bjørn-Helge Mevik
Andreas Davour writes: > [2021-09-17T08:53:49.166] error: unpack_header: protocol_version 8448 > not supported > [2021-09-17T08:53:49.166] error: unpacking header > [2021-09-17T08:53:49.166] error: destroy_forward: no init > [2021-09-17T08:53:49.166] error: slurm_receive_msg_and_forward: >

Re: [slurm-users] FreeMem is not equal to (RealMem - AllocMem)

2021-09-14 Thread Bjørn-Helge Mevik
Pavel Vashchenkov writes: > There is a line "RealMemory=257433 AllocMem=155648 FreeMem=37773 > Sockets=2 Boards=1" > > > My question is: Why there is so few FreeMem (37 GB instead of expected > 100 GB (RealMem - AllocMem))? If I recall correctly, RealMem is what you have configured in
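The numbers in the question, worked through in Python (Slurm reports these values in megabytes). RealMemory minus AllocMem leaves 101785 MB unallocated, roughly the 100 GB the poster expected, while FreeMem is 37773 MB, a gap of 64012 MB. This sketch assumes FreeMem is the free memory as seen by the node's operating system, which also shrinks for memory that Slurm does not allocate to jobs (the OS itself, daemons, file cache); the reply above is cut off before it says so.

```python
real_memory = 257433  # RealMemory (MB)
alloc_mem = 155648    # AllocMem (MB), memory allocated to jobs
free_mem = 37773      # FreeMem (MB)

unallocated = real_memory - alloc_mem
print(unallocated)             # 101785 MB not allocated to any job
print(unallocated - free_mem)  # 64012 MB gap between unallocated and free
```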

Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-09 Thread Bjørn-Helge Mevik
rogram is run. See section UNKILLABLE STEP PROGRAM SCRIPT for more information.

Re: [slurm-users] Building SLURM with X11 support

2021-05-28 Thread Bjørn-Helge Mevik
Thekla Loizou writes: > Also, when compiling SLURM in the config.log I get: > > configure:22291: checking whether Slurm internal X11 support is enabled > configure:22306: result: > > The result is empty. I read that X11 is build by default so I don't > expect a special flag to be given during

Re: [slurm-users] schedule mixed nodes first

2021-05-17 Thread Bjørn-Helge Mevik
Durai Arasan writes: > Is there a way of improving this situation? E.g. by not blocking IDLE nodes > with jobs that only use a fraction of the 8 GPUs? Why are single GPU jobs > not scheduled to fill already MIXED nodes before using IDLE ones? > > What parameters/configuration need to be adjusted

Re: [slurm-users] How can I get complete field values with without specify the length

2021-03-08 Thread Bjørn-Helge Mevik
tion --parsable2(*) specifically designed for parsing output, and which does not truncate long field values. (*) There is also --parsable, but that puts an extra "|" at the end of the line, so I prefer --parsable2.
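Output from "sacct --parsable2" is easy to consume programmatically. A minimal Python sketch that turns it into one dictionary per row, assuming a header line is present (the default unless --noheader is given) and that no field value itself contains "|"; the sample is invented and shaped like the output of "sacct --parsable2 -o JobID,JobName,State".

```python
def parse_parsable2(output):
    """Parse "sacct --parsable2" output (header line first) into a list of dicts.

    Fields are separated by "|" with no trailing "|" (unlike --parsable).
    Assumes no field value contains "|".
    """
    lines = [line for line in output.splitlines() if line]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


if __name__ == "__main__":
    # Invented sample, shaped like "sacct --parsable2 -o JobID,JobName,State".
    sample = ("JobID|JobName|State\n"
              "123|a_rather_long_job_name_that_is_not_truncated|COMPLETED\n"
              "123.batch|batch|COMPLETED\n")
    for row in parse_parsable2(sample):
        print(row["JobID"], row["JobName"], row["State"])
```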

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-24 Thread Bjørn-Helge Mevik
Thanks for the heads-up, Ole!

Re: [slurm-users] Set a ramdom offset when starting node health check in SLURM

2020-11-27 Thread Bjørn-Helge Mevik
tions."

Re: [slurm-users] Slurm User Group Meeting (SLUG'20) Agenda Posted

2020-08-31 Thread Bjørn-Helge Mevik
Just wondering, will we get our t-shirts by email? :D

Re: [slurm-users] GrpMEMRunMins equivalent?

2020-06-06 Thread Bjørn-Helge Mevik
Interesting to know!

Re: [slurm-users] GrpMEMRunMins equivalent?

2020-06-04 Thread Bjørn-Helge Mevik
ng something? GrpTRESRunMins For instance: GrpTRESRunMins=Memory=1000,Cpu=2000 See man sacctmgr for details.
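GrpTRESRunMins caps the TRES-minutes that running jobs can hold at once. A simplified Python model, assuming each running job counts as its allocated amount of a TRES times the minutes left of its time limit; the jobs and numbers are hypothetical, and Slurm's own accounting has more details than this sketch.

```python
def tres_run_minutes(jobs, tres):
    """Sum of allocated TRES amount times remaining time-limit minutes over jobs."""
    return sum(job["tres"].get(tres, 0) * job["remaining_minutes"] for job in jobs)


def would_exceed(running, new_job, tres, limit):
    """True if starting new_job would push the TRES-run-minutes above limit."""
    return tres_run_minutes(running + [new_job], tres) > limit


# Hypothetical running jobs in an association with a cpu limit of 2000 TRES-minutes.
running = [
    {"tres": {"cpu": 4}, "remaining_minutes": 300},  # 1200 cpu-minutes
    {"tres": {"cpu": 2}, "remaining_minutes": 250},  # 500 cpu-minutes
]
new_job = {"tres": {"cpu": 2}, "remaining_minutes": 120}  # 240 cpu-minutes

print(tres_run_minutes(running, "cpu"))             # 1700
print(would_exceed(running, new_job, "cpu", 2000))  # False: 1940 <= 2000
```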

Re: [slurm-users] How to get command from a finished job

2020-04-30 Thread Bjørn-Helge Mevik
ir is available with sacct, IIRC. For other types of information, I believe you can add code to your job_submit.lua that stores it in the job's AdminComment field, which sacct can display.

Re: [slurm-users] How to trap a SIGINT signal in a child process of a batch ?

2020-04-21 Thread Bjørn-Helge Mevik
Jean-mathieu CHANTREIN writes: > But that is not enough, it is also necessary to use srun in > test.slurm, because the signals are sent to the child processes only > if they are also children in the JOB sense. Good to know!

Re: [slurm-users] How to trap a SIGINT signal in a child process of a batch ?

2020-04-21 Thread Bjørn-Helge Mevik
ltin to return immediately with an exit status greater than 128, immediately after which the trap is executed. So try using sleep 200 & wait instead.

Re: [slurm-users] log rotation for slurmctld.

2020-03-16 Thread Bjørn-Helge Mevik
configure, just reopened the log file. Thanks for the reminder!

Re: [slurm-users] log rotation for slurmctld.

2020-03-13 Thread Bjørn-Helge Mevik
script (That is for both slurmctld.log and slurmdbd.log.)

Re: [slurm-users] Question about SacctMgr....

2020-02-28 Thread Bjørn-Helge Mevik
aracters left justified. (in addition to using it in a couple of examples). :)

Re: [slurm-users] memory in job_submit.lua

2020-02-27 Thread Bjørn-Helge Mevik
gmem" then slurm.log_info( "non-bigmem job from uid %d with memory specification: Denying.", job_desc.user_id) slurm.user_msg("Memory specification only allowed for bigmem jobs") return 2044 -- Signal ESLURM_INVALID_TASK_MEMO

Re: [slurm-users] Question on how to make slurm aware of a CVMFS revision

2020-02-27 Thread Bjørn-Helge Mevik
)? Perhaps the daemon process could simply run "scontrol update node= ..." when it detects a change?

Re: [slurm-users] Slurm 18.08.8 --mem-per-cpu + --exclusive = strange behavior

2019-12-12 Thread Bjørn-Helge Mevik
to ban --exclusive on our clusters (or at least warn about it). I haven't looked at the code for a long time, so I don't know whether this is still the current behaviour, but every time I've tested, I've seen the same problem. I believe I've tested on 19.05 (but I might remember wrong).

Re: [slurm-users] RHEL8 support

2019-10-28 Thread Bjørn-Helge Mevik
L8, and CentOS 8, not only Scientific Linux 8?

Re: [slurm-users] Sacct selecting jobs outside range

2019-10-17 Thread Bjørn-Helge Mevik
it difficult to understand sometimes.

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-10 Thread Bjørn-Helge Mevik
that state until any epilogs are run. In my experience, the most typical reasons for jobs hanging in CG are disk system failures or other failures leading to either the job processes or the epilog processes hanging in "disk wait".

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Bjørn-Helge Mevik
ep itself, it will not kill the whole job. Also, it will only have effect for things started with srun, mpirun or similar. However, in combination with "set -o errexit", I believe most OOM kills would get the job itself terminated.

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Bjørn-Helge Mevik
killer works. This is why we always tell users to use "set -o errexit" in their job scripts. Then at least the job script exits as soon as one of its processes is killed.

Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Bjørn-Helge Mevik
ated to cgroup. - While a job is running, look in the cgroup memory directory (typically /sys/fs/cgroup/memory/slurm/uid_/job_) for the job (on the compute node). Do the values there, for instance memory.limit_in_bytes and memory.max_usage_in_bytes, make sense?
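A small Python sketch for reading those files, assuming cgroup v1 and that the job's cgroup directory is given on the command line (the exact uid_/job_ path is site-specific and left to the caller; the script name cgmem.py is hypothetical). memory.limit_in_bytes, memory.usage_in_bytes and memory.max_usage_in_bytes are standard cgroup v1 memory-controller files; missing files are skipped.

```python
import os
import sys

FILES = ("memory.limit_in_bytes", "memory.usage_in_bytes", "memory.max_usage_in_bytes")


def read_cgroup_memory(job_dir):
    """Read cgroup v1 memory-controller values (bytes) from job_dir; skip missing files."""
    values = {}
    for name in FILES:
        try:
            with open(os.path.join(job_dir, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue
    return values


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 cgmem.py <job cgroup directory on the compute node>
    for name, value in read_cgroup_memory(sys.argv[1]).items():
        print(f"{name}: {value} bytes ({value / 2**30:.2f} GiB)")
```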

Re: [slurm-users] OverSubscribe parameter

2019-10-02 Thread Bjørn-Helge Mevik
formation), "OK" otherwise (typically allocated dedicated CPUs), (Valid for jobs only)

Re: [slurm-users] slurm config :: set up a workdir for each job

2019-09-20 Thread Bjørn-Helge Mevik

Re: [slurm-users] Up-to-date agenda for SLUG 2019?

2019-09-16 Thread Bjørn-Helge Mevik
r tonight. :) ... hmm ... Does anyone know if the Porcupine Pub & Grill is an ok place? It is conveniently close to the Guest House. > See you folks tomorrow! Cheers!

[slurm-users] Up-to-date agenda for SLUG 2019?

2019-09-16 Thread Bjørn-Helge Mevik
The agenda on https://slurm.schedmd.com/slurm_ug_agenda.html is still called "Preliminary Schedule", and has not been updated since July 19. Is this the latest agenda, or is there a newer one somewhere?

Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-03 Thread Bjørn-Helge Mevik
es, please, let us know how you've solved this!

Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-03 Thread Bjørn-Helge Mevik
> example "TmpDisk=14 FreeDisk=9". That would have been nice, yes. > Would a Slurm modification be required to include a FreeDisk > parameter, and then change the meaning of "sbatch --tmp=xxx" to refer > to the FreeDisk in stead of TmpDisk size? I think it will, yes.
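Until something like a FreeDisk parameter exists, a site could check the actual free space itself, for instance in a node health check. A minimal Python sketch using shutil.disk_usage; the path and the threshold are assumptions, and this does not change what "sbatch --tmp" is matched against.

```python
import shutil
import tempfile


def free_mib(path):
    """Free space (MiB) on the filesystem holding path, as reported by shutil.disk_usage."""
    return shutil.disk_usage(path).free // 2**20


def has_free_space(path, required_mib):
    """True if at least required_mib MiB are free at path."""
    return free_mib(path) >= required_mib


if __name__ == "__main__":
    tmp = tempfile.gettempdir()  # stand-in for the node's TmpFS directory
    print(tmp, free_mib(tmp), "MiB free")
```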

Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-03 Thread Bjørn-Helge Mevik
t again. I'm not 100% sure we will make it work, but I'm hopeful. Fingers crossed! :)

Re: [slurm-users] Slurm 19.05 --workdir non existent?

2019-08-15 Thread Bjørn-Helge Mevik
a > jobsubmit rule to overwrite this in the meantime till we get users > trained differently. I think you would have to write a SPANK plugin that implements the --workdir switch.

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-14 Thread Bjørn-Helge Mevik
omething there. I'm currently trying to get my hands on the logs from the servers themselves to see whether they actually get the requests at the time when the sssd backend claims to make it.
