[slurm-users] Re: Performance Discrepancy between Slurm and Direct mpirun for VASP Jobs.

2024-05-27 Thread Bjørn-Helge Mevik via slurm-users
Ole Holm Nielsen via slurm-users  writes:

> Whether or not to enable Hyper-Threading (HT) on your compute nodes
> depends entirely on the properties of applications that you wish to
> run on the nodes.  Some applications are faster without HT, others are
> faster with HT.  When HT is enabled, the "virtual CPU cores" obviously
> will have only half the memory available per core.

Another consideration is, if you keep HT enabled, do you want Slurm to
hand out physical cores to jobs, or logical cpus (hyperthreads)?  Again,
what is best depends on your workload.  On our systems, we tend to
either turn off HT, or hand out cores.
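
Roughly speaking, which of the two Slurm hands out is controlled by the
consumable resource settings in slurm.conf, together with how the nodes
are declared.  A minimal sketch (not a complete recipe):

  SelectType=select/cons_tres
  # Hand out whole physical cores (both hyperthreads of a core go to the
  # same job):
  SelectTypeParameters=CR_Core_Memory
  # ... or hand out individual logical cpus (hyperthreads):
  #SelectTypeParameters=CR_CPU_Memory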

-- 
B/H


signature.asc
Description: PGP signature

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: scrontab question

2024-05-08 Thread Bjørn-Helge Mevik via slurm-users
Sandor via slurm-users  writes:

> I am working out the details of scrontab. My initial testing is giving me
> an unsolvable question

If you have an unsolvable problem, you don't have a problem, you have a
fact of life. :)  

> Within scrontab editor I have the following example from the slurm
> documentation:
>
> 0,5,10,15,20,25,30,35,40,45,50,55 * * * *
> /directory/subdirectory/crontest.sh

- The command (/directory/...) should be on the same line as the time
spec (0,5,...) - but that was perhaps just the email formatting.

- Check for any UTF-8 characters that look like ordinary ASCII, for
instance a non-breaking space.  I tend to just pipe the text through "od
-a".
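
For example, a crontab line pasted with a non-breaking space (injected
here with printf, just to illustrate) shows up as the extra bytes 302 240
when piped through "od -c":

  $ printf '0,5,10 * * * *\302\240/directory/subdirectory/crontest.sh\n' | od -c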

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Convergence of Kube and Slurm?

2024-05-07 Thread Bjørn-Helge Mevik via slurm-users
Tim Wickberg via slurm-users  writes:

> [1] Slinky is not an acronym (neither is Slurm [2]), but loosely
> stands for "Slurm in Kubernetes".

And not at all inspired by Slinky Dog in Toy Story, I guess. :D

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-17 Thread Bjørn-Helge Mevik via slurm-users
Jeffrey T Frey via slurm-users  writes:

>> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
>> is per user.
>
> The ulimit is a frontend to rusage limits, which are per-process restrictions 
> (not per-user).

You are right; I sit corrected. :)

(Except for number of procs and number of pending signals, according to
"man setrlimit".)

Then 1024 might not be so low for ulimit -n after all.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Bjørn-Helge Mevik via slurm-users
Ole Holm Nielsen  writes:

> Hi Bjørn-Helge,
>
> That sounds interesting, but which limit might affect the kernel's
> fs.file-max?  For example, a user already has a narrow limit:
>
> ulimit -n
> 1024

AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
is per user.

Now that I think of it, fs.file-max of 65536 seems *very* low.  On our
CentOS-7-based clusters, we have in the order of tens of millions, and
on our Rocky 9 based clusters, we have 9223372036854775807(!)

Also a per-user limit of 1024 seems low to me; I think we have in the
order of 200K files per user on most clusters.

But if you have ulimit -n == 1024, then no user should be able to hit
the fs.file-max limit, even if it is 65536.  (Technically, 96 jobs from
96 users each trying to open 1024 files would do it, though.)
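
For reference, the node-wide limit and current node-wide usage, and the
limit for the current shell, can be checked like this:

  $ sysctl fs.file-max fs.file-nr
  $ ulimit -n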

> whereas the permitted number of user processes is a lot higher:
>
> ulimit -u
> 3092846

I guess any process will have a few open files, which I believe count
against the ulimit -n for each user (and fs.file-max).

> I'm not sure how the number 3092846 got set, since it's not defined in
> /etc/security/limits.conf.  The "ulimit -u" varies quite a bit among
> our compute nodes, so which dynamic service might affect the limits?

There is a vague thing in my head saying that I've looked for this
before, and found that the default value depended on the size of the
RAM of the machine.  But the vague thing might of course be lying to
me. :)

-- 
Bjørn-Helge


signature.asc
Description: PGP signature

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Bjørn-Helge Mevik via slurm-users
Ole Holm Nielsen via slurm-users  writes:

> Therefore I believe that the root cause of the present issue is user
> applications opening a lot of files on our 96-core nodes, and we need
> to increase fs.file-max.

You could also set a limit per user, for instance in
/etc/security/limits.d/.  Then users would be blocked from opening
unreasonably many files.  One could use this to find which applications
are responsible, and try to get them fixed.
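
For instance, something like this (example values, in a hypothetical
file /etc/security/limits.d/90-nofile.conf):

  *    soft    nofile    4096
  *    hard    nofile    8192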

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Increasing SlurmdTimeout beyond 300 Seconds

2024-02-12 Thread Bjørn-Helge Mevik via slurm-users
We've been running one cluster with SlurmdTimeout = 1200 sec for a
couple of years now, and I haven't seen any problems due to that.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Starting a job after a file is created in previous job (dependency looking for soluton)

2024-02-06 Thread Bjørn-Helge Mevik via slurm-users
Amjad Syed via slurm-users  writes:

> I need to submit a sequence of up to 400 jobs where the even jobs depend on
> the preceeding odd job to finish and every odd job depends on the presence
> of a file generated by the preceding even job (availability of the file for
> the first of those 400 jobs is guaranteed).

How about letting each even job submit the next odd job after it has
created the file, and also the following even job, with a dependency on
the odd job?  You would obviously have to keep track of how many jobs
you've submitted so you can stop after 400 jobs. :)
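
A rough, untested sketch of the tail end of each even job script
(odd_job.sh, even_job.sh and the counter file are just example names):

  n=$(cat pair_counter)
  if [ "$n" -lt 200 ]; then
      # the next odd job needs the file this job just created
      odd=$(sbatch --parsable odd_job.sh)
      # the next even job waits for that odd job
      sbatch --dependency=afterok:"$odd" even_job.sh
      echo $((n + 1)) > pair_counter
  fi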

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Why is Slurm 20 the latest RPM in RHEL 8/Fedora repo?

2024-01-31 Thread Bjørn-Helge Mevik via slurm-users
This isn't answering your question, but I strongly suggest you build
Slurm from source.  You can use the provided slurm.spec file to make
rpms (we do) or use "configure + make".  Apart from being able to
upgrade whenever a new version is out (especially important for
security!), you can tailor the rpms/build to your needs (IB? SlingShot?
Nvidia? etc.).
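
As a rough sketch (the version number is just an example; add
--with/--without options to rpmbuild as needed for your site):

  $ wget https://download.schedmd.com/slurm/slurm-23.02.1.tar.bz2
  $ rpmbuild -ta slurm-23.02.1.tar.bz2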

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


Re: [slurm-users] propose environment variables SLURM_STDOUT, SLURM_STDERR, SLURM_STDIN

2024-01-21 Thread Bjørn-Helge Mevik
I would find that useful, yes.  Especially if the variables were made
available for the Prolog and Epilog scripts.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] slurm.conf

2024-01-18 Thread Bjørn-Helge Mevik
LEROY Christine 208562  writes:

> Is there an env variable in SLURM to tell where the slurm.conf is?
> We would like to have on the same client node, 2 type of possible submissions 
> to address 2 different cluster.

According to man sbatch:

   SLURM_CONF    The location of the Slurm configuration file.
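
So something like this could work (paths are just examples):

  alias sbatch-clusterA='SLURM_CONF=/etc/slurm/clusterA/slurm.conf sbatch'
  alias sbatch-clusterB='SLURM_CONF=/etc/slurm/clusterB/slurm.conf sbatch'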

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] SLURM Reservation for GPU

2023-12-04 Thread Bjørn-Helge Mevik
Bjørn-Helge Mevik  writes:

> (Unfortunately, the page is so "wisely" created that it is impossible
> to cut'n'paste from it.)

That turned out to be a PEBKAC. :) cut'n'paste *is* possible. :)

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] SLURM Reservation for GPU

2023-12-04 Thread Bjørn-Helge Mevik
Minulakshmi S  writes:

> I am not able to find any supporting statements in Release Notes ... could
> you please point.

https://www.schedmd.com/news.php, the "Slurm version 23.11.0 is now
available" section, the seventh bullet point.  (Unfortunately, the page
is so "wisely" created that it is impossible to cut'n'paste from it.) :

  "Notably, this permits reservations to now reserve GRES directly."

-- 
Cheers,
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] SLURM Reservation for GPU

2023-11-29 Thread Bjørn-Helge Mevik
I believe support for this was implemented in 23.11.0.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] Releasing stale allocated TRES

2023-11-23 Thread Bjørn-Helge Mevik
"Schneider, Gerald"  writes:

> Is there any way to release the allocation manually?

I've only seen this once on our clusters, and that time it helped just
restarting slurmctld.

If this is a recurring problem, perhaps it will help to upgrade Slurm.
You are running quite an old version.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] --partition requests ignored in scripts

2023-11-08 Thread Bjørn-Helge Mevik
"Bunis, Dan"  writes:

> My colleagues and I have noticed that our compute cluster seems to
> ignore '--partition' requests when we give them as '#SBATCH
> --partition=' inside of our scripts, but it respects
> them when given in-line within our sbatch calls as 'sbatch
> --partition= script.sh'.  Based on some googling, it
> seems that both methods are meant to work, so I'm wondering if it's
> known what can cause the in-script methodology to NOT work for
> schedulers where the in-line methodology DOES work?

My suspicion is that there is an environment variable SBATCH_PARTITION
set in your shells.  Such a variable will override the #SBATCH directive,
but not the command line switch.

From man sbatch:

INPUT ENVIRONMENT VARIABLES
[...]
NOTE: Environment variables will override any options set in a batch
script, and command line options will override any environment
variables.
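
A quick way to check and clear it:

  $ env | grep '^SBATCH_'
  $ unset SBATCH_PARTITION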

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] RES: multiple srun commands in the same SLURM script

2023-11-01 Thread Bjørn-Helge Mevik
Paulo Jose Braga Estrela  writes:

> Hi,
>
> I think that you have a syntax error in your bash script. The "&"
> means that you want to send a process to background not that you want
> to run many commands in parallel. To run commands in a serial fashion
> you should use cmd && cmd2, then the cmd2 will only be executed if the
> command 1 return 0 as exit code.
>
> To run commands in parallel with srun you should set the number of
> tasks to 4, so srun will spawn 4 tasks of the same command. Take a
> look at the examples section in srun
> docs. (https://slurm.schedmd.com/srun.html)

Well, if you look at Example 7 in that section:

Example 7:
This example shows a script in which Slurm is used to provide resource 
management for a job by executing the various job steps as processors become 
available for their dedicated use. 

$ cat my.script
#!/bin/bash
srun -n4 prog1 &
srun -n3 prog2 &
srun -n1 prog3 &
srun -n1 prog4 &
wait

which is what OP tries to do.  It is mainly for running *different*
programs in parallel inside a job.  If one wants to run *the same*
program in parallel, then a single srun is indeed the recommended way.

I think the main problem is that the original job script only asks for a
single CPU, so the sruns will only run one at a time.  Try adding
--ntasks-per-node=4 or similar.

Note that exactly how to run different programs in parallel with srun
has changed quite a bit in the recent versions, and the example above is
for the latest version, so check the srun man page for your version.
(And unfortunately, the documentation in the srun man page has not
always been correct, so you might need to experiment.  For instance, I
believe Example 7 above is missing `--exact` or `SLURM_EXACT`. :) )
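
Just as an illustration, a sketch of a job script with enough tasks for
all the steps and --exact on each srun (untested; prog1-3 are just
placeholders, and the details depend on your Slurm version):

  #!/bin/bash
  #SBATCH --ntasks=4
  srun -n2 --exact prog1 &> log.1 &
  srun -n1 --exact prog2 &> log.2 &
  srun -n1 --exact prog3 &> log.3 &
  wait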

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-16 Thread Bjørn-Helge Mevik
Taras Shapovalov  writes:

> Oh, does this mean that no one should use Slurm versions <= 21.08 any more?

That of course depends on your security requirements, but I wouldn't
have used those older versions in production any more, at least.  (We
actually did upgrade from 21.08 to 23.02 on a couple of our clusters due
to this.)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-12 Thread Bjørn-Helge Mevik
Taras Shapovalov  writes:

> Are the older versions affected as well?

Yes, all older versions are affected.

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] New member , introduction

2023-09-30 Thread Bjørn-Helge Mevik
Welcome! :)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] question about configuration in slurm.conf

2023-09-26 Thread Bjørn-Helge Mevik
Felix  writes:

> I have at my site the following work nodes
>
> awn001 ... awn099
>
> and then it continues awn100 ... awn199

I presume you meant awn-001 etc, not awn001.  If not, replace "awn-"
with "awn" below.

> How can I configure this line
>
> PartitionName=debug Nodes=awn-0[01-32,46-77,95-99] Default=YES
> MaxTime=INFINITE State=UP
>
> so that it can contain the nodes from 001 to 199
>

PartitionName=debug Nodes=awn-0[01-32,46-77,95-99],awn-1[00-99] ...

or

PartitionName=debug Nodes=awn-[001-032,046-077,095-199] ...

should work.  I'd personally use the second one.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


[slurm-users] Transport from SLC to Provo?

2023-08-14 Thread Bjørn-Helge Mevik
Dear all,

I'm going to SLUG in Provo in September.  My flight lands in Salt Lake
City Airport (SLC) at 7 pm on Sunday 10.  I was planning to go by bus or
train from SLC to Provo, but apparently both bus and train have stopped
running by that time on Sundays.

Does anyone know about any alternative way to get to Provo on a Sunday
night?

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


[slurm-users] No coffee allowed on BYU campus(!) Suggestions for alternatives?

2023-07-04 Thread Bjørn-Helge Mevik
I've signed up for SLUG 2023, which is on Brigham Young University.  I
noticed on the Agenda (https://slurm.schedmd.com/slurm_ug_agenda.html)
that "coffee is not provided on campus, so be sure to get your morning
caffeine before arriving."

Following a whole day of lectures without coffee when you're jet-lagged
8 hours and have spent 15 hours travelling, is not going to be easy, so I
thought I'd bring a thermos flask and get it filled with coffee in the
hotel or in a coffeeshop.

But now I discovered https://dancecamps.byu.edu/content/byu-honor-code,
which says "no smoking or drinking of alcohol, coffee, or tea is
permitted on the BYU campus, though other caffeinated beverages are
allowed."

So, any suggestions for "other caffeinated beverages" I'd be able to buy
and bring with me to the sessions?

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] Job step do not take the hole allocation

2023-06-30 Thread Bjørn-Helge Mevik
Hei, Ole! :)

Ole Holm Nielsen  writes:

> Can anyone she light on the relationship between Tommi's
> slurm_cli_pre_submit function and the ones defined in the
> cli_filter_plugins page?

I think the *_p_* functions are functions you need to implement if you
write a cli plugin in C.  When you write a cli plugin-script in Lua, you write
Lua functions called slurm_cli_setup_defaults, slurm_cli_pre_submit,
etc in the Lua code, and then the C-code of the Lua plugin itself
implements the *_p_* functions (I believe).

That said, I too found it hard to find any documentation of the Lua
plugin.  Eventually, I found an example script in the Slurm source code
(etc/cli_filter.lua.example), which I've taken as a starting point for
my cli filter plugin scripts.

-- 
B/H



signature.asc
Description: PGP signature


Re: [slurm-users] Limit run time of interactive jobs

2023-05-08 Thread Bjørn-Helge Mevik
Ole Holm Nielsen  writes:

> On 5/8/23 08:39, Bjørn-Helge Mevik wrote:
>> Angel de Vicente  writes:
>> 
>>> But one possible way to something similar is to have a partition only
>>> for interactive jobs and a different partition for batch jobs, and then
>>> enforce that each job uses the right partition. In order to do this, I
>>> think we can use the Lua contrib module (check the job_submit.lua
>>> example).
>> Wouldn't it be simpler to just refuse too long interactive jobs in
>> job_submit.lua?
>
> This sounds like a good idea, but how would one identify an
> interactive job in the job_submit.lua script?

Good question. :)  I merely guessed it is possible. :)

> A solution was suggested in
> https://serverfault.com/questions/1090689/how-can-i-set-up-interactive-job-only-or-batch-job-only-partition-on-a-slurm-clu
>> Interactive jobs have no script and job_desc.script will be empty /
> not set.
>
> So maybe something like this code snippet?
>
> if job_desc.script == NIL then

That sounds like it should work, yes.  (But perhaps double check that jobs
submitted with "sbatch --wrap" or taking the job script from stdin (if
that is still possible) get job_desc.script set.)

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] Limit run time of interactive jobs

2023-05-08 Thread Bjørn-Helge Mevik
Angel de Vicente  writes:

> But one possible way to something similar is to have a partition only
> for interactive jobs and a different partition for batch jobs, and then
> enforce that each job uses the right partition. In order to do this, I
> think we can use the Lua contrib module (check the job_submit.lua
> example).

Wouldn't it be simpler to just refuse too long interactive jobs in
job_submit.lua?

-- 
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] [EXT] Submit sbatch to multiple partitions

2023-04-17 Thread Bjørn-Helge Mevik
"Ozeryan, Vladimir"  writes:

> You should be able to specify both partitions in your sbatch submission 
> script, unless there is some other configuration preventing this.

But Slurm will still only run the job in *one* of the partitions - it
will never "pool" two partitions and let the job run on all nodes.  All
nodes of a job must belong to the same partition.  (Another thing I
found out recently is that if you specify multiple partitions for an
array job, then all array subjobs will run in the same partition.)

As Ole suggests: creating a "super partition" containing all nodes will
work.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] Preventing --exclusive on a per-partition basis

2023-03-22 Thread Bjørn-Helge Mevik
I'd simply add a test like
  and job_desc.partition == "the_partition"
to the test for exclusiveness.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] srun jobfarming hassle question

2023-01-18 Thread Bjørn-Helge Mevik
"Ohlerich, Martin"  writes:

> Hello Björn-Helge.
>
>
> Sigh ...
>
> First of all, of course, many thanks! This indeed helped a lot!

Good!

> b) This only works if I have to specify --mem for a task. Although
> manageable, I wonder why one needs to be that restrictive. In
> principle, in the use case outlined, one task could use a bit less
> memory, and the other may require a bit more the half of the node's
> available memory. (So clearly this isn't always predictable.) I only
> hope that in such cases the second task does not die from OOM ... (I
> will know soon, I guess.)

As I understand it, Slurm (at least cgroups) will only kill a step if it
uses more memory *in total* on a node than the job got allocated to the
node.  So if a job has 10 GiB allocated on a node, and a step runs two
tasks there, one task could use 9 GiB and the other 1 GiB without the
step being killed.

You can inspect the memory limits that are in effect in cgroups (v1) in
/sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid> (usual location, at
least).
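
For example, from within the job (a sketch; the exact path depends on
the cgroup setup):

  $ cat /sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/memory.limit_in_bytes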

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] srun jobfarming hassle question

2023-01-18 Thread Bjørn-Helge Mevik
"Ohlerich, Martin"  writes:

> Dear Colleagues,
>
>
> already for quite some years now are we again and again facing issues on our 
> clusters with so-called job-farming (or task-farming) concepts in Slurm jobs 
> using srun. And it bothers me that we can hardly help users with requests in 
> this regard.
>
>
> From the documentation 
> (https://slurm.schedmd.com/srun.html#SECTION_EXAMPLES), it reads like this.
>
> --->
>
> ...
>
> #SBATCH --nodes=??
>
> ...
>
> srun -N 1 -n 2 ... prog1 &> log.1 &
>
> srun -N 1 -n 1 ... prog2 &> log.2 &


Unfortunately, that part of the documentation is not quite up-to-date.
The semantics of srun has changed a little over the last couple of
years/Slurm versions, so today, you have to use "srun --exact ...".  From
"man srun" (version 21.08):

   --exact
  Allow  a step access to only the resources requested for the
  step.  By default, all non-GRES resources on  each  node  in
  the  step  allocation will be used. This option only applies
  to step allocations.
  NOTE: Parallel steps will  either  be  blocked  or  rejected
  until  requested step resources are available unless --over‐
  lap is specified. Job resources can be held after  the  com‐
  pletion  of  an  srun  command while Slurm does job cleanup.
      Step epilogs and/or SPANK  plugins  can  further  delay  the
  release of step resources.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] job_container/tmpfs and autofs

2023-01-12 Thread Bjørn-Helge Mevik
In my opinion, the problem is with autofs, not with tmpfs.  Autofs
simply doesn't work well when you are using detached fs name spaces and
bind mounting.  We ran into this problem years ago (with an inhouse
spank plugin doing more or less what tmpfs does), and ended up simply
not using autofs.

I guess you could try using systemd's auto-mounting features, but I have
no idea if they work better than autofs in situations like this.

We ended up using a system where the prolog script mounts any needed
file systems, and then the healthcheck script unmounts file systems that
are no longer needed.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] How to read job accounting data long output? `sacct -l`

2022-12-15 Thread Bjørn-Helge Mevik
Marcus Wagner  writes:

> That depends on what is meant with formatting argument.

Yes, they could surely have defined that.

> etc. And I would assume, that -S, -E and -T are filtering options, not
> formatting options.

I'd describe -T as a formatting option:

   -T, --truncate
          Truncate time.  So if a job started before --starttime the start
          time would be truncated to --starttime.  The same for end time
          and --endtime.

As I read this, it changes how a job is written, it does not select
jobs.

> But getting sometimes no steps for a job (if in a larger JSON-output
> with many jobs) and then getting the steps, if one asks specifically
> for that jobid. That is something I would call broken.

That sounds worse, yes.

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] How to read job accounting data long output? `sacct -l`

2022-12-14 Thread Bjørn-Helge Mevik
Marcus Wagner  writes:

> it it important to know, that the json output seems to be broken.
>
> First of all, it does not (compared to the normal output) obey to the 
> truncate option -T.
> But more important, I saw a job, where in a "day output" (-S  -E 
> ) no steps were recorded.
> Using sacct -j  --json instead showed that job WITH steps.

It is hard to call it "broken" when it is documented behaviour:

   --json    Dump job information as JSON. All other formatting arguments
             will be ignored

-- 
Cheers,
Bjørn-Helge


signature.asc
Description: PGP signature


Re: [slurm-users] How to read job accounting data long output? `sacct -l`

2022-12-13 Thread Bjørn-Helge Mevik
Chandler Sobel-Sorenson  writes:

> Perhaps there is a way to import it into a spreadsheet?

You can use `sacct -P -l`, which gives you a '|' separated output, which
should be possible to import in a spread sheet.

(Personally I only use `-l` when I'm looking for the name of an
attribute and am to lazy to read the man page.  Then I use -o to specify
what I want returned.)

Also, in newer versions at least, there is --json and --yaml to give you
output which you can parse with other tools (or read, if you really want :).
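
For example (the field list and start time are just an illustration):

  $ sacct -P -S 2022-12-01 -o JobID,JobName,State,Elapsed,TotalCPU,MaxRSS > jobs.csv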

-- 
Cheers,
Bjørn-Helge Mevik


signature.asc
Description: PGP signature


Re: [slurm-users] Test Suite problems related to requesting tasks

2022-10-26 Thread Bjørn-Helge Mevik
"Groner, Rob"  writes:

> For your "special testing config", do you just mean the
> slurm.conf/gres.conf/*.conf files?

Yes.

> So when you want to test a new
> version of slurm, you replace the conf files and then restart all of
> the daemons?

Exactly.  (We usually don't do this on our production cluster, but on test
clusters, where we can change setups as we please. :) )

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] Test Suite problems related to requesting tasks

2022-10-25 Thread Bjørn-Helge Mevik
"Groner, Rob"  writes:

> I'm wondering OVERALL if the test suite is supposed to work on ANY
> working slurm system. I could not find any documentation on how the
> slurm configuration and nodes were required to be setup in order for
> the test to work ... no indication that the test suite requires a
> particular configuration in order to run successfully.

My experience is that the test suite makes some assumptions about the
setup, so it will not work with just any config.  And as you, I haven't
found any documentation about what it expects.

> So in other
> words, I can't tell if these failing tests are a result of an actual
> problem, or a result of the way our cluster is configured,

I tend to look at what the failing tests do, and try to figure out what
they expect in terms of the config.  It's a bit of work, but (at least
for our setups) there haven't been too many cases.  (And there are a
couple of tests I've never understood why they fail. :) )

We have a special config for running the test suite, modified so that
most of the assumptions are met.  For instance, we turn off the job
submit plugin, any prologs/epilogs, healthcheck script, configless mode;
and keep things like SchedulerParameters at their default value.

> and if it's because of how our cluster is configured, then is it
> unreasonable to think I can make use of the test suite?

If the failing test is for some feature that we don't use, a different
setup than what we have in production, or something that is not
essential for our clusters, we just ignore the test.  We have a small list
of known "please ignore" test failures. :)

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] Accounting core-hours usages

2022-10-11 Thread Bjørn-Helge Mevik
Sushil Mishra  writes:

> Dear all,
>
> I am pretty new to system administration and looking for some help
> setup slumdb or maridb in a GPU cluster. We bought a machine but the vendor
> simply installed slurm and did not install any database for accounting. I
> tried installing MariaDB and then slurmdb as described in the manual but
> looks like I am missing something. I wonder if someone can help us with
> this off the list?

Perhaps the eminent guide by Ole Nielsen can help you:
https://wiki.fysik.dtu.dk/niflheim/SLURM

-- 
Regards,
Bjørn-Helge Mevik


signature.asc
Description: PGP signature


Re: [slurm-users] Use cases for "include" in slurm.conf?

2022-09-21 Thread Bjørn-Helge Mevik
Ole Holm Nielsen  writes:

> Can anyone shed light on the use cases for "include" in slurm.conf?

Until we switched to configless mode, we used to have all partition and
node definitions in a separate file, with an include in slurm.conf.  The
idea was to keep the things that were changed most frequently in a
separate file.  Also, it made it easier to keep the config of several
clusters in sync (running diff on their slurm.conf files wouldn't be
cluttered with node definition differences).
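
That looked something like this (file names are just examples):

  # in slurm.conf:
  Include /etc/slurm/nodes.conf
  Include /etc/slurm/partitions.conf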

Similarly, I also saw that on our BULL Atos cluster, Atos had separated
out information generated from their database into separate files.  That
way those parts could be updated when the database changed.

Somewhat related, for Slurmdbd, we have a separate slurmdbd_auth.conf
file with the username and password for the slurm sql db, so that we can
keep the slurmdbd.conf file in a git repo without spreading the password
around.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] How to debug a prolog script?

2022-09-18 Thread Bjørn-Helge Mevik
Davide DelVento  writes:

>> I'm curious: What kind of disruption did it cause for your production
>> jobs?
>
> All jobs failed and went in pending/held with "launch failed requeued
> held" status, all nodes where the jobs were scheduled went draining.
>
> The logs only said "error: validate_node_specs: Prolog or job env
> setup failure on node , draining the node". I guess if they said
> "-bash: /path/to/prolog: Permission denied" I would have caught the
> problem myself.

But that is not a problem caused by having things like

exec &> /root/prolog_slurmd.$$

in the script, as you indicated.  It is a problem caused by the prolog
script file not being executable.

> In hindsight it is obvious, but I don't think even the documentation
> mentions that, does it? After all you can execute a file with a
> non-executable with with "sh filename", so I made the incorrect
> assumption that slurm would have invoked the prolog that way.

Slurm prologs can be written in any language - we used to have perl
prolog scripts. :)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] How to debug a prolog script?

2022-09-16 Thread Bjørn-Helge Mevik
Davide DelVento  writes:

> Does it need the execution permission? For root alone sufficient?

slurmd runs as root, so it only need exec perms for root.

>> > 2. How to debug the issue?
>> I'd try capturing all stdout and stderr from the script into a file on the 
>> compute
>> node, for instance like this:
>>
>> exec &> /root/prolog_slurmd.$$
>> set -x # To print out all commands
>
> Do you mean INSIDE the prologue script itself?

Yes, inside the prolog script itself.

> Yes, this is what I'd have done, if it weren't so disruptive of all my
> production jobs, hence I had to turn it off before wrecking havoc too
> much.

I'm curious: What kind of disruption did it cause for your production
jobs?

We use this in our slurmd prologs (and similar in epilogs) on all our
production clusters, and have not seen any disruption due to it.  (We do
have things like

## Remove log file if we got this far:
rm -f /root/prolog_slurmd.$$

at the bottom of the scripts, though, so as to remove the log file when
the prolog succeeded.)

> Sure, but even "just executing" there is stdout and stderr which could
> be captured and logged rather than thrown away and force one to do the
> above.

True.  But slurmd doesn't, so...

> How do you "install the prolog scripts there"? Isn't the prolog
> setting in slurm.conf global?

I just overwrite the prolog script file itself on the node.  We
don't have them on a shared file system, though.  If you have the
prologs on a shared file system, you'd have to override the slurm config
on the compute node itself.  This can be done in several ways, for
instance by starting slurmd with the "-f <config file>" option.

>> (Otherwise one could always
>> set up a small cluster of VMs and use that for simpler testing.)
>
> Yes, but I need to request that cluster of VM to IT, have the same OS
> installed and configured (and to be 100% identical, it needs to be
> RHEL so license paid), and everything sync'ed with the actual
> cluster I know it'd be very useful, but sadly we don't have the
> resources to do that, so unfortunately this is not an option for me.

I totally agree that VMs instead of a physical test cluster is never
going to be 100 % the same, but some things can be tested even though
the setups are not exactly the same (for instance, in my experience,
CentOS and Rocky are close enough to RHEL for most slurm-related
things).  One takes what one has. :)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] How to debug a prolog script?

2022-09-16 Thread Bjørn-Helge Mevik
Davide DelVento  writes:

> 2. How to debug the issue?

I'd try capturing all stdout and stderr from the script into a file on the 
compute
node, for instance like this:

exec &> /root/prolog_slurmd.$$
set -x # To print out all commands

before any other commands in the script.  The "prolog_slurmd.<pid>" file will
then contain a log of all commands executed in the script, along with
all output (stdout and stderr).  If there is no "prolog_slurmd.<pid>"
file after the job has been scheduled, then as has been pointed out by
others, slurm wasn't able to exec the prolog at all.

> Even increasing the debug level the
> slurmctld.log contains simply a "error: validate_node_specs: Prolog or
> job env setup failure on node xxx, draining the node" message, without
> even a line number or anything.

Slurm only executes the prolog script.  It doesn't parse it or evaluate
it itself, so it has no way of knowing what fails inside the script.

> 3. And more generally, how to debug a prolog (and epilog) script
> without disrupting all production jobs? Unfortunately we can't have
> another slurm install for testing, is there a sbatch option to force
> utilizing a prolog script which would not be executed for all the
> other jobs? Or perhaps making a dedicated queue?

I tend to reserve a node, install the updated prolog scripts there, and
run test jobs asking for that reservation.  (Otherwise one could always
set up a small cluster of VMs and use that for simpler testing.)
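
For example (reservation name, node and duration are just placeholders):

  # scontrol create reservation reservationname=prologtest users=myuser \
      nodes=c1-10 starttime=now duration=120
  $ sbatch --reservation=prologtest testjob.sh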

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] Cgroup task plugin fails if ConstrainRAMSpace and ConstrainKmemSpace are enabled

2022-08-22 Thread Bjørn-Helge Mevik
This doesn't answer your question, but still: I'd be wary about using
ConstrainKmemSpace at all.  At least in the kernels on RedHat/CentOS <=
7.9, there is a bug in that eventually prevents Slurm from starting new
job steps on a node, and the node has to be rebooted to be usable
again.  See for instance https://bugs.schedmd.com/show_bug.cgi?id=5507.
(The bug report is old, but we got that result on a system with RHEL 7.7
earlier this year.)

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] "slurmd -C" reduce by xx GB or yy %

2022-08-11 Thread Bjørn-Helge Mevik
"Eg. Bo."  writes:

> as far as I understand it's good practice to lower the RealMemory
> value reported by slurmd -C by a given amount of GB or by
> percentage.What's the best approach to calculate a given target value,
> for different HW types?

I tend to use a small C program that mallocs and fills a large array,
and see how big I can make the array before the node starts to swap.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-24 Thread Bjørn-Helge Mevik
Miguel Oliveira  writes:

> Hi Bjørn-Helge,
>
> Long time!

Hi Miguel!  Yes, definitely a long time!  :D

> Why not? You can have multiple QoSs and you have other techniques to change 
> priorities according to your policies.

A job can only run in a single QoS, so if you submit a job with "sbatch
--qos=devel ..." it will no longer be running in the account QoS and
thus its usage will not be recorded in that QoS.  If that is ok, then no
problem, but if you want all jobs of an account to be limited by the
TRESMins limit, then you cannot use other QoS'es than the account QoSes
(except for partition QoSes).

-- 
Bjørn-Helge


signature.asc
Description: PGP signature


Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-24 Thread Bjørn-Helge Mevik
Miguel Oliveira  writes:

> It is not exactly true that you have no solution to limit projects. If
> you implement each project as an account then you can create an
> account qos with the NoDecay flags.
> This will not affect associations so priority and fair share are not impacted.

Yes, that will work.  But it has the drawback that you cannot use QoS'es
for *anything else*, like a QoS for development jobs or similar.  So
either way it is a trade-off.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-23 Thread Bjørn-Helge Mevik
 writes:

> TRESRaw cpu is lower than before as I'm alone on the system an no other job 
> was submitted. 
> Any explanation of this ? 

I'd guess you have turned on FairShare priorities.  Unfortunately, in
Slurm the same internal variables are used for fairshare calculations as
for GrpTRESMins (and similar), so when fair share priorities are in use,
slurm will reduce accumulated GrpTRESMins over time.  This means that it
is impossible(*) to use GrpTRESMins limits and fairshare
priorities at the same time.

(*) It is possible to tell slurm *not* to reduce the accumulated
TRESMins of a QoS, so you can technically use GrpTRESMins limits on a
qos, and fair share priorities on the accounts and/or users.
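
A sketch of what that can look like (names and the limit are just
examples):

  $ sacctmgr add qos proj_a_quota Flags=NoDecay GrpTRESMins=cpu=100000
  $ sacctmgr modify account proj_a set QOS=proj_a_quota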

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-23 Thread Bjørn-Helge Mevik
 writes:

> I thought job's cpu TRESRaw = nb of reserved core X walltime (mn) 

It is the "TRES billing cost" x walltime.  What the TRES billing cost of
a job is depends on how you've set up the TRESBillingWeights on the
partitions, and whether you've defined PriorityFlags=MAX_TRES or not.
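
For example, something like this in slurm.conf (the weights are just an
illustration):

  PartitionName=normal ... TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=8.0"
  # optionally bill by the largest single TRES instead of the sum:
  PriorityFlags=MAX_TRES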

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] Need to restart slurmctld for gres jobs to start

2022-06-03 Thread Bjørn-Helge Mevik
tluchko  writes:

> Jobs only sit in the queue with RESOURCES as the REASON when we
> include the flag --gres=bandwidth:ib. If we remove the flag, the jobs
> run fine. But we need the flag to ensure that we don't get a mix of IB
> and ethernet nodes because they fail in this case.

This doesn't answer your real question, but couldn't you just use
features for ib and ethernet.  Jobs wanting nodes with ib would then
specify --constraint=ib, etc.
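
For example (node and feature names are just an illustration):

  # in slurm.conf:
  NodeName=ib[001-064]  ... Feature=ib
  NodeName=eth[001-064] ... Feature=eth

and then in the job script:

  #SBATCH --constraint=ib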

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Bjørn-Helge Mevik
Per Lönnborg  writes:

> I "forgot" to tell our version because it´s a bit embarrising - 19.05.8...

Haha! :D

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Bjørn-Helge Mevik
Per Lönnborg  writes:

> Greetings,

God dag!

> is there a way to lower the log rate on error messages in slurmctld for nodes 
> with hardware errors? 

You don't say which version of Slurm you are running, but I think this
was changed in 21.08, so the node will only try to register once if it
has too little memory, thus only giving one such message.  (The node
will then have state "inval" in sinfo.)

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] Strange memory limit behavior with --mem-per-gpu

2022-04-08 Thread Bjørn-Helge Mevik
Paul Raines  writes:

> Basically, it appears using --mem-per-gpu instead of just --mem gives
> you unlimited memory for your job.
>
> $ srun --account=sysadm -p rtx8000 -N 1 --time=1-10:00:00
> --ntasks-per-node=1 --cpus-per-task=1 --gpus=1 --mem-per-gpu=8G
> --mail-type=FAIL --pty /bin/bash
> rtx-07[0]:~$ find /sys/fs/cgroup/memory/ -name job_$SLURM_JOBID
> /sys/fs/cgroup/memory/slurm/uid_5829/job_1134067
> rtx-07[0]:~$ cat 
> /sys/fs/cgroup/memory/slurm/uid_5829/job_1134067/memory.limit_in_bytes
> 1621419360256
>
> That is a limit of 1.5TB which is all the memory on rtx-07, not
> the 8G I effectively asked for at 1 GPU and 8G per GPU.

Which version of Slurm is this?  We noticed a behaviour similar to this
on Slurm 20.11.8, but when we tested it on 21.08.1, we couldn't
reproduce it.  (We also noticed an issue with --gpus-per-task that
appears to have been fixed in 21.08.)

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] srun and --cpus-per-task

2022-03-25 Thread Bjørn-Helge Mevik
Hermann Schwärzler  writes:

> Do you happen to know if there is a difference between setting CPUs
> explicitely like you do it and not setting it but using
> "ThreadsPerCore=1"?
>
> My guess is that there is no difference and in both cases only the
> physical cores are "handed out to jobs". But maybe I am wrong?

I don't think we've ever tried that.  But I'd be sceptical about "lying"
to Slurm about the actual hardware structure - it might confuse the cpu
binding if Slurm and the kernel have different pictures of the hardware.

-- 
Bjørn-Helge


signature.asc
Description: PGP signature


Re: [slurm-users] Disable exclusive flag for users

2022-03-25 Thread Bjørn-Helge Mevik
pankajd  writes:

> We have slurm 21.08.6 and GPUs in our compute nodes. We want to restrict /
> disable the use of "exclusive" flag in srun for users. How should we do it?

Two options would be to use the CLI_filter plugin or the job_submit
plugin.  If you want the enforcement to be guaranteed, then the
job_submit plugin is the place (cli_filter can be circumvented by user).

For instance, in job_submit.lua:

   if job_desc.shared == 0 or job_desc.shared == 2 or job_desc.shared == 3 then
      slurm.user_msg("Warning! Please do not use --exclusive unless you really " ..
                     "know what you are doing.  Your job might be accounted for " ..
                     "more CPUs than it actually uses, sometimes many times more.  " ..
                     "There are better ways to specify using whole nodes, for " ..
                     "instance using all cpus on the node or all memory on the node.")
   end

or in cli_filter.lua:

   is_bad_exclusive = { exclusive = true, user = true, mcs = true }
   if is_bad_exclusive[options["exclusive"]] then
      slurm.log_info("Warning! Please do not use --exclusive unless you really " ..
                     "know what you are doing.  Your job might be accounted for " ..
                     "more CPUs than it actually uses, sometimes many times more.  " ..
                     "There are better ways to specify using whole nodes, for " ..
                     "instance using all cpus on the node or all memory on the node.")
   end

(both of these just warn, though, but should be easy to change into
rejecting the job.)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] srun and --cpus-per-task

2022-03-25 Thread Bjørn-Helge Mevik
For what it's worth, we have a similar setup, with one crucial
difference: we are handing out physical cores to jobs, not hyperthreads,
and we are *not* seeing this behaviour:

$ srun --cpus-per-task=1 -t 10 --mem-per-cpu=1g -A nnk -q devel echo foo
srun: job 5371678 queued and waiting for resources
srun: job 5371678 has been allocated resources
foo
$ srun --cpus-per-task=3 -t 10 --mem-per-cpu=1g -A nnk -q devel echo foo
srun: job 5371680 queued and waiting for resources
srun: job 5371680 has been allocated resources
foo

We have

SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory

and node definitions like

NodeName=DEFAULT CPUs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 
RealMemory=182784 Gres=localscratch:330G Weight=1000

(so we set CPUs to the number of *physical cores*, not *hyperthreads*).

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] monitoring and update regime for Power Saving nodes

2022-02-23 Thread Bjørn-Helge Mevik
David Simpson  writes:

>   * When you want to make changes to slurm.conf (or anything else) to
> a node which is down due to power saving (during a
> maintenance/reservation) what is your approach? Do you end up with 2
> slurm.confs (one for power saving and one that keeps everything up, to
> work on during the maintenance)?

For the slurm.conf part, I'd suggest using the "configless" mode - that
way at least the slurm config will always be up-to-date.  See, e.g.,
https://slurm.schedmd.com/configless_slurm.html
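
A minimal sketch (the sysconfig path is distro/packaging dependent):

  # in slurm.conf on the slurmctld host:
  SlurmctldParameters=enable_configless

  # on the compute nodes, point slurmd at the controller, e.g. in
  # /etc/sysconfig/slurmd:
  SLURMD_OPTIONS="--conf-server slurmctld.example.org"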

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] Problems with sun and TaskProlog

2022-02-11 Thread Bjørn-Helge Mevik
"Putnam, Harry"  writes:

> /opt/slurm/task_epilog
>
> #!/bin/bash
> mytmpdir=/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID
> rm -Rf $mytmpdir
> exit;

This might not be the reason for what you observe, but I believe
deleting the scratch dir in the task epilog is not a good idea.  The
task epilog is run after every "srun" or "mpirun" inside a job, which
means that the scratch dir will be created and deleted for each job
step.  On our systems, we create the scratch dir in the (slurmd) Prolog,
set the environment variable in the TaskProlog, and delete the dir in
the (slurmd) Epilog.  That way the dir is just created and deleted once.
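
A minimal sketch of that split (paths are just examples):

  # Prolog (runs as root on the node):
  mkdir -p /scratch/$SLURM_JOB_USER/$SLURM_JOB_ID
  chown "$SLURM_JOB_USER" /scratch/$SLURM_JOB_USER/$SLURM_JOB_ID

  # TaskProlog (lines printed as "export NAME=value" end up in the
  # task's environment):
  echo "export SCRATCH=/scratch/$(id -un)/$SLURM_JOB_ID"

  # Epilog (runs as root after the job finishes):
  rm -rf /scratch/$SLURM_JOB_USER/$SLURM_JOB_ID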

> I am not sure I understand what constitutes a job step.

In practice, every run of srun or mpirun creates a job step, and the job
script itself counts as a job step.

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-04 Thread Bjørn-Helge Mevik
Ole Holm Nielsen  writes:

> As Brian Andrus said, you must upgrade Slurm by at most 2 major
> versions, and that includes slurmd's as well!  Don't do a "direct 
> upgrade" of slurmd by more than 2 versions!

That should only be an issue if you have running jobs during the
upgrade, shouldn't it?  As I understand it, without any running jobs,
you can do pretty much what you want on the compute nodes.  Or am I
missing something here?

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis ?

2022-02-01 Thread Bjørn-Helge Mevik
This might not apply to your setup, but historically when we've seen
similar behaviour, it was often due to the affected compute nodes
missing from /etc/hosts on some *other* compute nodes.

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] Questions about default_queue_depth

2022-01-12 Thread Bjørn-Helge Mevik
David Henkemeyer  writes:

> 3) Is there a way to see the order of the jobs in the queue?  Perhaps
> squeue lists the jobs in order?

squeue -S -p

Sort jobs in descending priority order.
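
For example (sketch):

  $ squeue --state=PENDING --sort=-p -o '%.18i %.9P %.8u %.10Q %j'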

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] Is this a known error?

2021-12-08 Thread Bjørn-Helge Mevik
Sean McGrath  writes:

> I'm seeing something similar.
>
> slurmdbd version is 21.08.4
>
> All the slurmd's & slurmctld's are version 20.11.8
>
> This is what is in the slurmdbd.log
>
> [2021-12-07T17:16:50.001] error: unpack_header: protocol_version 8704 not 
> supported

I believe 8704 corresponds to 19.05.x, which is no longer accepted in
21.08.x.

> Can anyone advise how to identify the clients that are generating those
> errors please?

I don't think slurmd connects directly to slurmdbd, so perhaps it is
some frontend node or machine outside the cluster itself which has the
slurm commands installed and is doing requests to slurmdbd (sacct,
sacctmgr, etc.)?

With SlurmdbdDebug set to debug or higher, new client connections will
be logged with

[2021-12-08T09:00:07.992] debug:  REQUEST_PERSIST_INIT: CLUSTER:saga 
VERSION:9472 UID:51568 IP:10.2.3.185 CONN:8

in slurmdbd.log.  But perhaps that will not happen if slurmdbd fails to
unpack the header?

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links

2021-12-03 Thread Bjørn-Helge Mevik
Adrian Sevcenco  writes:

> On 01.12.2021 10:25, Bjørn-Helge Mevik wrote:
>
>> In the end we had to give up
>> using automount, and implement a manual procedure that mounts/umounts
>> the needed nfs areas.
>
> Thanks a lot for info! manual as in "script" or as in "systemd.mount service"?

Script.  We mount (if needed) in the prolog.  Then in the healthcheck
(run every 5 mins), we check if a job is still running on the node that
needs the mount, and unmounts if not.  (We could have done it in the
epilog, but feared it could lead to a lot of mount/umount cycles if a
set of jobs failed immediately.  Hence we put it in the healthcheck
script instead.)

I don't have much experience with the systemd.mount service, but it is
possible it would work fine (and be less hackish than our solution :).

> Also, the big and the only advantage that autofs had over static mounts was
> that whenever there was a problem with the server, after the passing of the 
> glitch
> the autofs would re-mount the target...

That's in theory. :) Our experience in practice is that if the client is
actively using the nfs mounted are when the problem arises, you will
often have to reboot the client to resolve the disk waits.  (I *think*
it has something to do with nfs using longer and longer timeouts when it
cannot reach the server, so eventually it will take too long to time out
and return an error to the running applications.)

> I'm not very sure that a static nfs mount have this capability ... did you 
> baked in
> your manual procedure also a recovery part?

No, we simply pretend it will not happen. :)  In fact, I think we've
only had this type of problems once or twice in the last four-five
years.  But this might be because we only mount the homedirs with nfs,
so most of the time, the jobs are not actively using the nfs mounted
area.  (The most activity happen in BeeGFS or GPFS mounted areas.)

-- 
Bjørn-Helge


signature.asc
Description: PGP signature


Re: [slurm-users] slurmstepd: error: Too many levels of symbolic links

2021-12-01 Thread Bjørn-Helge Mevik
Adrian Sevcenco  writes:

> Hi! Does anyone know what could the the cause of such error?
> I have a shared home, slurm 20.11.8 and i try a simple script in the submit 
> directory
> which is in the home that is nfs shared...

We had the "Too many levels of symbolic links" error some years ago,
while using a combination of automounting nfs areas and private fs name
spaces to get a private /tmp for each job.  In the end we had to give up
using automount, and implement a manual procedure that mounts/umounts
the needed nfs areas.

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] Per-job TMPDIR: how to lookup gres allocation in prolog?

2021-11-17 Thread Bjørn-Helge Mevik
Mark Dixon  writes:

> Unfortunately, I've not found anything in the Prolog environment (or
> stored on disk under /var/spool/slurmd) containing the gres
> allocations for the job.

[...]

> Is there a better way to get the job's gres information from within
> the prolog, please?

We are using basically the same setup, and have not found any other way
than running "scontrol show job ..." in the prolog (even though it is
not recommended).  I have yet to see any problems arising from it, but
YMMV.
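
Something along these lines (sketch; the field name in the output
differs between Slurm versions, so check what yours prints):

  scontrol show job "$SLURM_JOB_ID" | tr ' ' '\n' | grep -i gres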

If you find a different way, please share it with the list!

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] Warning: can't honor --ntasks-per-node

2021-11-17 Thread Bjørn-Helge Mevik
Ginés Guerrero  writes:

> Hi,
>
> If I submit this script:
>
> #!/bin/bash
> #SBATCH --get-user-env
> #SBATCH -p slims
> #SBATCH -N 2
> #SBATCH -n 40
> #SBATCH --ntasks-per-node=20
> #SBATCH -o log
> #SBATCH -e log
>
> /bin/env
>
> srun hostname
>
> I get the warning: “can't honor --ntasks-per-node set to 20 which
> doesn't match the requested tasks 40 with the number of requested
> nodes 1. Ignoring –ntasks-per-node”.

Are you using IntelMPI?  I've seen this type of warning in some
situations with IntelMPI.  In all our cases, "srun hostname" or "mpirun
hostname" shows that it *does* honor --ntasks-per-node.  (So we
generally just ask our users to check with "srun hostname", and ignore
the warning if it works as expected.)
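
For example:

  $ srun hostname | sort | uniq -c

which shows how many tasks actually landed on each node.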

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] Bug when I run "sinfo --states=idle"

2021-10-29 Thread Bjørn-Helge Mevik
David Henkemeyer  writes:

> I just noticed today that when I run "sinfo --states=idle", I get all the
> idle nodes, plus an additional node that is in the "DRAIN" state (notice
> how xavier6 is showing up below, even though its not in the idle state):

I *think* this could be because if you drain an idle node, it gets the
state "IDLE+DRAIN", and then "sinfo --states=idle" will include it.  For
instance:

# sinfo --states=idle   
 
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*  up 7-00:00:00 33   resv 
c1-[21-22,24-28,47-52,56],c2-[15-18,21-23,36-39,43,54-56],c3-[9-10,13,15]
bigmem   up 14-00:00:0  1   idle c3-55
accel    up 14-00:00:0      0    n/a 
optimist up   infinite 33   resv 
c1-[21-22,24-28,47-52,56],c2-[15-18,21-23,36-39,43,54-56],c3-[9-10,13,15]
optimist up   infinite  1   idle c3-55

# drain c3-55 bhmtest
# scontrol show node c3-55
NodeName=c3-55 Arch=x86_64 CoresPerSocket=20 
[...]
   State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1000 Owner=N/A 
MCS_label=N/A
[...]

# sinfo --states=idle
normal*  up 7-00:00:00 33   resv 
c1-[21-22,24-28,47-52,56],c2-[15-18,21-23,36-39,43,54-56],c3-[9-10,13,15]
bigmem   up 14-00:00:0  1  drain c3-55
accelup 14-00:00:0  0n/a 
optimist up   infinite  1  drain c3-55
optimist up   infinite 33   resv 
c1-[21-22,24-28,47-52,56],c2-[15-18,21-23,36-39,43,54-56],c3-[9-10,13,15]

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2021-09-21 Thread Bjørn-Helge Mevik
Amjad Syed  writes:

> We have users who have have defined unix secondary id on our login nodes.
>
> vas20xhu@login01 ~]$ groups
> BIO_pg BIO_AFMAKAY_LAB_USERS
>
> But when we run interactive  and go to compute node , the user does not
> have secondary  group of BIO_AFMAKAY_LAB_USERS
>
> vas20xhu@c0077 ~]$ groups
> BIO_pg

[...]
> When we ssh directly into node without using interactive script there are
> no issues  with groups.

Have you set up your Slurm to be NSS provider for user and group info?
I believe that will only send primary group to the job step processes.
See the enable_nss_slurm LaunchParameters in man slurm.conf, and the URL
in that description.
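
That is, check whether your slurm.conf has something like

  LaunchParameters=enable_nss_slurm

If so, that would match the behaviour you are seeing.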

-- 
Regards,
Bjørn-Helge Mevik



signature.asc
Description: PGP signature


Re: [slurm-users] Is this a known error?

2021-09-17 Thread Bjørn-Helge Mevik
Andreas Davour  writes:

> [2021-09-17T08:53:49.166] error: unpack_header: protocol_version 8448
> not supported
> [2021-09-17T08:53:49.166] error: unpacking header
> [2021-09-17T08:53:49.166] error: destroy_forward: no init
> [2021-09-17T08:53:49.166] error: slurm_receive_msg_and_forward:
> Message receive failure
> [2021-09-17T08:53:49.176] error: service_connection:
> slurm_receive_msg: Message receive failure
>
> Anyone seen that before, or immediately see that I did something wrong?

Sounds a lot like you have a different version of Slurm installed on some
compute node(s).

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] FreeMem is not equal to (RealMem - AllocMem)

2021-09-14 Thread Bjørn-Helge Mevik
Pavel Vashchenkov  writes:

> There  is a line "RealMemory=257433 AllocMem=155648 FreeMem=37773
> Sockets=2 Boards=1"
>
>
> My question is: Why is there so little FreeMem (37 GB instead of the expected
> 100 GB (RealMem - AllocMem))?

If I recall correctly, RealMem is what you have configured in
slurm.conf, and AllocMem is how much Slurm has allocated to jobs, while
FreeMem is how much ram is unused on the machine.

So RealMem and AllocMem do not necessarily correspond to what "free" or
"top" reports.

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-09 Thread Bjørn-Helge Mevik
Adrian Sevcenco  writes:

> Having just implemented some triggers i just noticed this:
>
> NODELISTNODES PARTITION   STATE CPUSS:C:T MEMORY TMP_DISK WEIGHT 
> AVAIL_FE REASON
> alien-0-47  1alien*draining   48   48:1:1 193324   214030  1 
> rack-0,4 Kill task failed
> alien-0-56  1alien* drained   48   48:1:1 193324   214030  1 
> rack-0,4 Kill task failed
>
> i was wondering why a node is drained when killing of task fails

I guess the heuristic is that something is wrong with the node, so it
should not run more jobs.  Like Disk-waits or similar that might require
a reboot.

> and how can i disable it? (i use cgroups)

I don't know how to disable it, but it can be tuned with:

   UnkillableStepTimeout
          The length of time, in seconds, that Slurm will wait before
          deciding that processes in a job step are unkillable (after
          they have been signaled with SIGKILL) and execute
          UnkillableStepProgram.  The default timeout value is 60
          seconds.  If exceeded, the compute node will be drained to
          prevent future jobs from being scheduled on the node.

(Note though, that according to
https://bugs.schedmd.com/show_bug.cgi?id=11103 it should not be set
higher than 127 s.)

You might also want to look at this setting to find out what is going on
on the machine when Slurm cannot kill the job step:

   UnkillableStepProgram
          If the processes in a job step are determined to be
          unkillable for a period of time specified by the
          UnkillableStepTimeout variable, the program specified by
          UnkillableStepProgram will be executed.  By default no
          program is run.

          See section UNKILLABLE STEP PROGRAM SCRIPT for more
          information.
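
For example, something like this in slurm.conf (the script path is just
a placeholder for whatever debugging script you want to run):

  UnkillableStepTimeout=120
  UnkillableStepProgram=/usr/local/sbin/report_unkillable.sh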

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] Building SLURM with X11 support

2021-05-28 Thread Bjørn-Helge Mevik
Thekla Loizou  writes:

> Also, when compiling SLURM in the config.log I get:
>
> configure:22291: checking whether Slurm internal X11 support is enabled
> configure:22306: result:
>
> The result is empty. I read that X11 is built by default so I don't
> expect a special flag to be given during compilation time right?

My guess is that some X development library is missing.  Perhaps look in
the configure script for how this test was done (typically it will try
to compile something with those devel libraries, and fail).  Then see
which package contains that library, install it and try again.

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] schedule mixed nodes first

2021-05-17 Thread Bjørn-Helge Mevik
Durai Arasan  writes:

> Is there a way of improving this situation? E.g. by not blocking IDLE nodes
> with jobs that only use a fraction of the 8 GPUs? Why are single GPU jobs
> not scheduled to fill already MIXED nodes before using IDLE ones?
>
> What parameters/configuration need to be adjusted for this to be enforced?

There are two SchedulerParameters you could experiment with (from man 
slurm.conf):

   bf_busy_nodes
          When selecting resources for pending jobs to reserve for
          future execution (i.e. the job can not be started
          immediately), then preferentially select nodes that are in
          use.  This will tend to leave currently idle resources
          available for backfilling longer running jobs, but may
          result in allocations having less than optimal network
          topology.  This option is currently only supported by the
          select/cons_res and select/cons_tres plugins (or
          select/cray_aries with SelectTypeParameters set to
          "OTHER_CONS_RES" or "OTHER_CONS_TRES", which layers the
          select/cray_aries plugin over the select/cons_res or
          select/cons_tres plugin respectively).

   pack_serial_at_end
          If used with the select/cons_res or select/cons_tres
          plugin, then put serial jobs at the end of the available
          nodes rather than using a best fit algorithm.  This may
          reduce resource fragmentation for some workloads.
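
I.e., something along these lines in slurm.conf (appended to whatever
SchedulerParameters you already have):

  SchedulerParameters=bf_busy_nodes,pack_serial_at_end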

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] How can I get complete field values with without specify the length

2021-03-08 Thread Bjørn-Helge Mevik
"xiaojingh...@163.com"  writes:

> I am doing a parsing job on slurm fields. Sometimes when one field is
> too long, Slurm will limit the length with a “+”.

You don't say which slurm command you are trying to parse the output
from, but if it is sacctmgr, it has an option --parsable2(*)
specifically designed for parsing output, and which does not truncate
long field values.

(*) There is also --parsable, but that puts an extra "|" at the end of
the line, so I prefer --parsable2.
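
For example (the field list is just an illustration):

  sacctmgr --parsable2 show associations format=cluster,account,user,qos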

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



signature.asc
Description: PGP signature


Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-24 Thread Bjørn-Helge Mevik
Thanks for the heads-up, Ole!

-- 
B/H


signature.asc
Description: PGP signature


Re: [slurm-users] Set a ramdom offset when starting node health check in SLURM

2020-11-27 Thread Bjørn-Helge Mevik
You can also check out

HealthCheckNodeState=CYCLE

man slurm.conf:

"Rather than running the health check program on all nodes at the same
time, cycle through running on all compute nodes through the course of
the HealthCheckInterval. May be combined with the various node state
options."
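
A sketch of how that can look in slurm.conf (the health check program
path is just an example; use whatever script you run, e.g. LBNL NHC):

  HealthCheckProgram=/usr/sbin/nhc
  HealthCheckInterval=300
  HealthCheckNodeState=CYCLE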

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] Slurm User Group Meeting (SLUG'20) Agenda Posted

2020-08-31 Thread Bjørn-Helge Mevik
Just wondering, will we get our t-shirts by email? :D

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] GrpMEMRunMins equivalent?

2020-06-06 Thread Bjørn-Helge Mevik
Corey Keasling  writes:

> And thank you also for the solution, I hadn't tried that
> syntax. Interesting that GrpCPURunMins works while GrpMemRunMins does
> not.

Historic reasons: GrpCPURunMins has been there a long time.  Instead of
adding GrpMemRunMins, GrpGPURunMins, etc. for all the TRES on can
specify, they chose to add GrpTRESRunMins instead.

> I also noticed that if the limit is specified as 
> GrpTRESRunMins=Memory=1000,Cpu=2000 only the CPU portion takes effect -- 
> the Memory= portion is silently dropped.  And, specifying Memory=1000
> by itself results in 'Unknown option: grptresrunmins=memory=1000'.
> Only Mem= works, and it works in both instances.

My bad.  Mem is correct, Memory is my false memory. :)

> In fact, it looks
> like any unknown option is silently ignored so long as at least one
> correctly named TRES appears in the list.

Interesting to know!

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] GrpMEMRunMins equivalent?

2020-06-04 Thread Bjørn-Helge Mevik
Corey Keasling  writes:

> The documentation only refers to GrpGRESRunMins, but I can't figure
> out what I might substitute for GRES that means Memory in the same way
> that substituting CPU means, well, CPUs.  Google turns up precisely
> nothing for GrpMemRunMins...  Am I missing something?

GrpTRESRunMins

For instance:

GrpTRESRunMins=Memory=1000,Cpu=2000

See man sacctmgr for details.
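
For example, to set it on an account (account name made up; as noted
elsewhere in the thread, the memory TRES must be spelled "mem"):

  sacctmgr modify account name=myacct set GrpTRESRunMins=cpu=2000,mem=1000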

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] How to get command from a finished job

2020-04-30 Thread Bjørn-Helge Mevik
Gestió Servidors  writes:

> For example, with "scontrol show jobid" I can know what command has
> been submited, its workir, the stderr file and the stdout one. This
> information, I think, cannot be obtained when the job is finished and
> I run "sacct".

The workdir is available with sacct, IIRC.  For other types of
information, I believe you can add code to your job_submit.lua that stores
it in the job's AdminComment field, which sacct can display.
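
For example (exact field names may vary between versions; check
"sacct --helpformat"):

  sacct -j 1234567 --format=JobID,WorkDir%60,AdminComment%80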

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] How to trap a SIGINT signal in a child process of a batch ?

2020-04-21 Thread Bjørn-Helge Mevik
Jean-mathieu CHANTREIN  writes:

> But that is not enough, it is also necessary to use srun in
> test.slurm, because the signals are sent to the child processes only
> if they are also children in the JOB sense.

Good to know!

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] How to trap a SIGINT signal in a child process of a batch ?

2020-04-21 Thread Bjørn-Helge Mevik
Jean-mathieu CHANTREIN  writes:

> test.sh: 
>
> #!/bin/bash 
>
> function sig_handler() 
> { 
> echo "Executable interrupted" 
> exit 2 
> } 
>
> trap 'sig_handler' SIGINT 
>
> echo "BEGIN" 
> sleep 200 
> echo "END"

Note that bash does not interrupt any running command (except "wait")
when it receives a trapped signal, so the "sleep 200" will not be
interrupted.  The "wait" command is special; it will be interrupted.
From man bash:

   If bash is waiting for a command to complete and receives a signal
   for which a trap has been set, the trap will not be executed until
   the command completes.  When bash is waiting for an asynchronous
   command via the wait builtin, the reception of a signal for which a
   trap has been set will cause the wait builtin to return immediately
   with an exit status greater than 128, immediately after which the
   trap is executed.

So try using

sleep 200 &
wait

instead.
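
In other words, the whole test.sh would look something like this:

#!/bin/bash

function sig_handler()
{
    echo "Executable interrupted"
    exit 2
}

trap 'sig_handler' SIGINT

echo "BEGIN"
sleep 200 &    # run in the background ...
wait           # ... and wait for it; wait *is* interrupted by the trap
echo "END"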

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] log rotation for slurmctld.

2020-03-16 Thread Bjørn-Helge Mevik
Marcus Wagner  writes:

> by concidence, I have stumbled today over the troubleshooting slides
> from slug 2019.
>
> SchedMD there explicitly tells us to use SIGUSR2 instead of restart /
> reload / reconfig / SIGHUP.

Right, I forgot about that. :) SIGUSR2 will not even do a reconfigure,
just reopend the log file.  Thanks for the reminder!

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] log rotation for slurmctld.

2020-03-13 Thread Bjørn-Helge Mevik
navin srivastava  writes:

> can i move the log file  to some other location and then restart.reload of
> slurm service will start a new log file.

Yes, restarting it will start a new log file if the old one is moved
away.  However, a reconfig will also do, and you can trigger that by
sending the process a HUP signal.  That way you don't have to restart
the daemon.  We have this in our logrotate file:

postrotate
## Using the newer feature of reconfig when getting a SIGHUP.
kill -hup $(ps -C slurmctld h -o pid)
kill -hup $(ps -C slurmdbd h -o pid)
endscript

(That is for both slurmctld.log and slurmdbd.log.)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] Question about SacctMgr....

2020-02-28 Thread Bjørn-Helge Mevik
Ole Holm Nielsen  writes:

> You may use the (undocumented) format=... option to select only the

A while ago, after meticulous study of the man page, I discovered that
the format option is not actually undocumented, it is just very well
hidden. :) All that "man sacctmgr" says about it is

GLOBAL FORMAT OPTION
   When using the format option for listing various fields you can
   put a %NUMBER afterwards to specify how many characters should be
   printed.

   e.g. format=name%30 will print 30 characters of field name right
   justified.  A -30 will print 30  characters left justified.

(in addition to using it in a couple of examples). :)
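
For example:

  sacctmgr show associations format=cluster%-10,account%-25,user%-15,qos%-40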

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] memory in job_submit.lua

2020-02-27 Thread Bjørn-Helge Mevik
Marcus Wagner  writes:

> does anyone know how to detect in the lua submission script, if the
> user used --mem or --mem-per-cpu?
>
> And also, if it is possible to "unset" this setting?

Something like this should work:

if job_desc.pn_min_memory ~= slurm.NO_VAL64 then
   -- --mem or --mem-per-cpu was used; unset it
   job_desc.pn_min_memory = slurm.NO_VAL64
end

> The reason is, we want to remove all memory thingies set by the user
> for exclusive jobs.

We just reject jobs if they use a setting we don't allow -- that avoids
jobs running differently than what the user believed.  For instance:

   -- Bigmem jobs should specify memory, no other job should
   if job_desc.pn_min_memory == slurm.NO_VAL64 then
  -- If bigmem: fail
  if job_desc.partition == "bigmem" then
 slurm.log_info(
"bigmem job from uid %d without memory specification: Denying.",
job_desc.user_id)
 slurm.user_msg("--mem or --mem-per-cpu required for bigmem jobs")
 return 2044 -- Signal ESLURM_INVALID_TASK_MEMORY
  end
   else
  -- If not bigmem: fail
  if job_desc.partition ~= "bigmem" then
 slurm.log_info(
"non-bigmem job from uid %d with memory specification: Denying.",
job_desc.user_id)
 slurm.user_msg("Memory specification only allowed for bigmem jobs")
     return 2044 -- Signal ESLURM_INVALID_TASK_MEMORY
  end
   end

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] Question on how to make slurm aware of a CVMFS revision

2020-02-27 Thread Bjørn-Helge Mevik
"Klein, Dennis"  writes:

>   * Can I (and if yes, how can I) update the GRES count dynamically
> (The idea would be to monitor the revision changes on all cvmfs
> mountpoints with a simple daemon process on each worker node which
> then notifies slurm on a revision change)?

Perhaps the daemon process could simply run "scontrol update
node= ..." when it detects a change?
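
Something along the lines of (gres name and values made up; the gres
must of course already be defined in your gres/node configuration):

  scontrol update NodeName=node042 Gres=cvmfs_rev:12345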

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] Slurm 18.08.8 --mem-per-cpu + --exclusive = strange behavior

2019-12-12 Thread Bjørn-Helge Mevik
Beatrice Charton  writes:

> Hi,
>
> We have a strange behaviour of Slurm after updating from 18.08.7 to
> 18.08.8, for jobs using --exclusive and --mem-per-cpu.
>
> Our nodes have 128GB of memory, 28 cores.
>   $ srun  --mem-per-cpu=30000 -n 1  --exclusive  hostname
> => works in 18.08.7 
> => doesn’t work in 18.08.8

I'm actually surprised it _worked_ in 18.08.7.  At one time - long before
v 18.08, the behaviour was changed when using --exclusive: In order to
account the job for all cpus on the node, the number of
cpus asked for with --ntasks would simply be multiplied by
"#cpus-on-node / --ntasks" (so in your case: 28).  Unfortunately, that
also means that the memory the job requires per node is "#cpus-on-node /
--ntasks" multiplied by --mem-per-cpu (in your case 28 * 30000 MiB ~=
820 GiB).  For this reason, we tend to ban --exclusive on our clusters
(or at least warn about it).

I haven't looked at the code for a long time, so I don't know whether
this is still the current behaviour, but every time I've tested, I've
seen the same problem.  I believe I've tested on 19.05 (but I might
remember wrong).

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] RHEL8 support

2019-10-28 Thread Bjørn-Helge Mevik
Taras Shapovalov  writes:

> Do I understand correctly that Slurm19 is not compatible with rhel8? It is
> not in the list https://slurm.schedmd.com/platforms.html

It says

"RedHat Enterprise Linux 7 (RHEL7), CentOS 7, Scientific Linux 7 (and newer)"

Perhaps that includes RHEL8, and CentOS 8, not only Scientific Linux 8?

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] Sacct selecting jobs outside range

2019-10-17 Thread Bjørn-Helge Mevik
Brian Andrus  writes:

> When running a report to try and get jobs that start during a particular
> day, sacct is returning a number of jobs that show as starting/ending
> outside the range.
> What could cause this?

sacct selects jobs that were eligible to run (including actually
running) between --starttime and --endtime.  (If you add --state, it will
select jobs that were in that state between the times.)

So _any_ job that were running between --starttime and --endtime will be
listed, even if it started before --starttime and/or ended after
--endtime.

Basically, you can think of [--starttime, --endtime] as a window in
time, and sacct will list the jobs that were in the requested state(s)
sometime inside that window.  It will not care in which states the jobs
were outside the window.

At least, this is how I have come to think of it.  IMHO, the sacct
manual is a bit difficult to understand sometimes.
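
For example, to list every job that was in the running state at some
point during 2019-10-16, regardless of when it actually started or ended:

  sacct --starttime=2019-10-16T00:00:00 --endtime=2019-10-17T00:00:00 \
        --state=RUNNING --format=JobID,State,Start,End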

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-10 Thread Bjørn-Helge Mevik
Matthew BETTINGER  writes:

> Just curious if this option or oom setting (which we use) can leave
> the nodes in CG "completing" state.

I don't think so.  As far as I know, jobs go into completing state when
Slurm is cancelling them or when they exit on their own, and stays in
that state until any epilogs are run.  In my experience, the most
typical reasons for jobs hanging in CG are disk system failures or other
failures leading to either the job processes or the epilog processes
hanging in "disk wait".

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Bjørn-Helge Mevik
Marcus Boden  writes:

> you're looking for KillOnBadExit in the slurm.conf:
> KillOnBadExit

[...]

> this should terminate the job if a step or a process gets oom-killed.

That is a good tip!

But as I read the documentation (I haven't tested it), it will only kill
the job step itself, it will not kill the whole job.  Also, it will only
have effect for things started with srun, mpirun or similar.  However,
in combination with "set -o errexit", I believe most OOM kills would get
the job itself terminated.
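
I.e., in slurm.conf:

  KillOnBadExit=1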

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Bjørn-Helge Mevik
Juergen Salk  writes:

> that is interesting. We have a very similar setup as well. However, in
> our Slurm test cluster I have noticed that it is not the *job* that
> gets killed. Instead, the OOM killer terminates one (or more)
> *processes*

Yes, that is how the kernel OOM killer works.

This is why we always tell users to use "set -o errexit" in their job
scripts.  Then at least the job script exits as soon as one of its
processes is killed.
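
I.e., something along these lines at the top of the job script (program
names made up):

  #!/bin/bash
  #SBATCH --mem-per-cpu=4G
  set -o errexit

  srun ./step_one    # if this gets OOM-killed, the script stops here
  srun ./step_two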

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] How to automatically kill a job that exceeds its memory limits (--mem-per-cpu)?

2019-10-08 Thread Bjørn-Helge Mevik
Jean-mathieu CHANTREIN  writes:

> I tried using, in slurm.conf 
> TaskPlugin=task/affinity, task/cgroup 
> SelectTypeParameters=CR_CPU_Memory 
> MemLimitEnforce=yes 
>
> and in cgroup.conf: 
> CgroupAutomount=yes 
> ConstrainCores=yes 
> ConstrainRAMSpace=yes 
> ConstrainSwapSpace=yes 
> MaxSwapPercent=10 
> TaskAffinity=no 

We have a very similar setup, the biggest difference being that we have
MemLimitEnforce=no, and leave the killing to the kernel's cgroup.  For
us, jobs are killed as they should.  Here are a couple of things you
could check:

- Does it work if you remove the space in "TaskPlugin=task/affinity,
  task/cgroup"? (Slurm can be quite picky when reading slurm.conf).

- See in slurmd.log on the node(s) of the job if cgroup actually gets
  activated and starts limit memory for the job, or if there are any
  errors related to cgroup.

- While a job is running, see in the cgroup memory directory (typically
  /sys/fs/cgroup/memory/slurm/uid_/job_ for the job (on the
  compute node).  Does the values there, for instance
  memory.limit_in_bytes and memory.max_usage_in_bytes, make sense?
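
For example (uid and job id made up; the path layout assumes cgroup v1):

  cat /sys/fs/cgroup/memory/slurm/uid_1000/job_123456/memory.limit_in_bytes
  cat /sys/fs/cgroup/memory/slurm/uid_1000/job_123456/memory.max_usage_in_bytes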

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] OverSubscribe parameter

2019-10-02 Thread Bjørn-Helge Mevik
Espen Tangen  writes:

> Hi all, I need a bullet proof way of checking the setting of the
> OverSubscribe parameter from within a runscript.

Perhaps

  squeue -o %h -j $SLURM_JOB_ID

is what you are looking for.  According to squeue(1):

 %h    Can the compute resources allocated to the job be over
       subscribed by other jobs.  The resources to be over subscribed
       can be nodes, sockets, cores, or hyperthreads depending upon
       configuration.  The value will be "YES" if the job was
       submitted with the oversubscribe option or the partition is
       configured with OverSubscribe=Force, "NO" if the job requires
       exclusive node access, "USER" if the allocated compute nodes
       are dedicated to a single user, "MCS" if the allocated compute
       nodes are dedicated to a single security class (See MCSPlugin
       and MCSParameters configuration parameters for more
       information), "OK" otherwise (typically allocated dedicated
       CPUs), (Valid for jobs only)

-- 
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] slurm config :: set up a workdir for each job

2019-09-20 Thread Bjørn-Helge Mevik
Adrian Sevcenco  writes:

> Hi! Is there a method for setting up a work directory unique for each
> job from a system setting? and than clean that up?
>
> can i use somehow the prologue and epilogue sections?

slurmd prolog and epilog scripts are commonly used to do this, yes.
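
A minimal sketch (the path is just an example; $SLURM_JOB_ID and
$SLURM_JOB_USER are available in the prolog/epilog environment):

  # Prolog:
  mkdir -p /scratch/$SLURM_JOB_ID
  chown "$SLURM_JOB_USER" /scratch/$SLURM_JOB_ID

  # Epilog:
  rm -rf /scratch/$SLURM_JOB_ID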

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] Up-to-date agenda for SLUG 2019?

2019-09-16 Thread Bjørn-Helge Mevik
Tim Wickberg  writes:

> Thanks for the reminder. The final version is online now.

Thanks!

> (The only important change is that the time for dinner has been filled
> in, and the schedule is no longer marked as preliminary.)

Hey, Squatters Pub! I was actually considering it for dinner tonight. :)
... hmm ... Does anyone know if the Porcupine Pub & Grill is an ok place?
It is conveniently close to the Guest House.

> See you folks tomorrow!

Cheers!

-- 
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


[slurm-users] Up-to-date agenda for SLUG 2019?

2019-09-16 Thread Bjørn-Helge Mevik
The agenda on https://slurm.schedmd.com/slurm_ug_agenda.html is still
called "Preliminary Schedule", and has not been updated since July 19.

Is this the latest agenda, or is there a newer one somewhere?

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-03 Thread Bjørn-Helge Mevik
Juergen Salk  writes:

> We are also going to implement disk quotas for the amount of local
> scratch space that has been allocated for the job by means of generic
> resources (e.g. `--gres=scratch:100´ for 100GB). This is especially
> important when several users share a node.

Indeed.

> This leads me to ask how you plan to determine the amount of local
> scratch space allocated for the job from within its prolog and epilog
> scripts.
[...]
> I already thought about running `scontrol show job $SLURM_JOB_ID´ from
> within the prolog/epilog scripts in order to get that piece of
> information.

This is exactly what we do. :)

> This line could eventually be parsed to get the amount of scratch
> allocated for this job (and then further used to increase/decrease the
> quota limits for the corresponding $SLURM_JOB_USER in the
> prolog/epilog scripts).

If you use separate directories for each job, and use "project" quotas
(a.k.a folder quotas), then you don't have to adjust the quota when a
new job arrives, even if it is from the same user.

> However, this still looks kind of clumsily to me and I wonder, if I
> have just overlooked a more obvious, cleaner or more robust solution.

Nope.  (We could have done something more elegant, like writing a
small Perl utility that extracted just the needed parts, but never got
around to it.)

I _think_ another option would be to write a SPANK plugin for the gres,
and let that create/remove the scratch directory and set the quota, but
I haven't looked into that.  That would probably count as a more elegant
solution.

> Since this is probably not an unusual requirement, I suppose this is
> something that many other sites have already solved for
> themselves. No?

Yes, please, let us know how you've solved this!

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-03 Thread Bjørn-Helge Mevik
Ole Holm Nielsen  writes:

> I figured that other sites need the free disk space feature as well
> :-)

:)

> How do you dynamically update your gres=localtmp resource according to
> the current disk free space?  I mean, there is already a TmpFS disk
> space size defined in slurm.conf, so how does your gres=localtmp
> differ from TmpFS?

We simply define the total "count" in the NodeLines for the compute nodes, like

Nodename=c11-[1-36] Gres=localtmp:170 ...

for nodes with 170 GB disk.

Then Slurm will do the rest; it will keep track of these 170 localtmp
"units" and not hand out more than that to jobs.  The jobs just specify
--gres=localtmp:50 for 50 "units".  (Slurm doesn't know how much disk
there is, or even that "localtmp" refers to disk space, it only keeps
count of the units in the Gres definition, so we could have chosen MB as
units (or multiples of pi, if we really wanted :) ).

So we don't use the TmpFS setting at all.  In our prolog, when a job has
asked for "localtmp", we create a directory for the job
(/localscratch/$SLURM_JOB_ID), and set an environment variable $LOCALTMP
to that directory, so the user can do "cp mydata $LOCALTMP" etc. in the
jobs script.  Then in the epilog, we delete the area.

The new thing we are looking into, then, is to set a "project" quota
(A.K.A folder quota) for the $LOCALTMP directory, and clear the quota
afterwards.  xfs supports this, and ext4 with recent enough version of
the e2fsprogs toolkit.
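
Roughly along these lines in the prolog and epilog (untested sketch; it
assumes the localtmp partition is xfs, mounted with the prjquota option,
(ab)uses the numeric job id as the project id, and $SIZE_GB would come
from the job's localtmp gres):

  # Prolog, after creating $LOCALTMP:
  xfs_quota -x -c "project -s -p $LOCALTMP $SLURM_JOB_ID" /localscratch
  xfs_quota -x -c "limit -p bhard=${SIZE_GB}g $SLURM_JOB_ID" /localscratch

  # Epilog, before removing $LOCALTMP:
  xfs_quota -x -c "limit -p bhard=0 $SLURM_JOB_ID" /localscratch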

> With "scontrol show node xxx" we get the node memory values such as
> "RealMemory=256000 AllocMem=24 FreeMem=160056".  Similarly it
> would be great to augment the TmpDisk with a FreeDisk parameter, for
> example "TmpDisk=14 FreeDisk=9".

That would have been nice, yes.

> Would a Slurm modification be required to include a FreeDisk
> parameter, and then change the meaning of "sbatch --tmp=xxx" to refer
> to the FreeDisk in stead of TmpDisk size?

I think it will, yes.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-03 Thread Bjørn-Helge Mevik
We are facing more or less the same problem.  We have historically
defined a Gres "localtmp" with the number of GB initially available
on local disk, and then jobs ask for --gres=localtmp:50 or similar.

That prevents slurm from allocating jobs on the cluster if they ask for
more disk than is currently "free" -- in the sense of "not handed out to
a job".  But it doesn't prevent jobs from using more than they have
asked for, so the disk might have less (real) free space than slurm
thinks.

As far as I can see, cgroups does not support limiting used disk space,
only amount of IO/s and similar.

We are currently considering using file system quotas for enforcing
this.  Our localtmp disk is a separate xfs partition, and the idea is to
make the prolog set up a "project" disk quota for the job on the
localtmp file system, and the epilog to remove it again.

I'm not 100% sure we will make it work, but I'm hopeful.  Fingers
crossed! :)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature


Re: [slurm-users] Slurm 19.05 --workdir non existent?

2019-08-15 Thread Bjørn-Helge Mevik
Christopher Benjamin Coffey  writes:

> It seems that --workdir= is no longer a valid option in batch jobs and
> srun in 19.05, and has been replaced by --chdir. I didn't see a change
> log about this, did I miss it? Going through the man pages it seems it
> hasn't existed for some time now actually! Maybe not since before
> 17.11 series. When did this happen?!

From the NEWS file:

* Changes in Slurm 17.11.0rc1
==
[...]
 -- Change --workdir in sbatch to be --chdir as in all other commands (salloc,
srun).

> I guess I'll have to write a
> jobsubmit rule to overwrite this in the meantime till we get users
> trained differently.

I think you would have to write a SPANK plugin that implements the
--workdir switch.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


signature.asc
Description: PGP signature

