Re: [slurm-users] slurmd startup problem

2021-08-16 Thread Brian Andrus
I suspect you may have set some "frontendname" or "frontendaddr" in your slurm.conf that triggered that. A FrontEnd is a node that is used to execute batch scripts rather than compute nodes (Cray ALPS systems). If that is not you, you should not set it. Brian Andrus

Re: [slurm-users] Preemption not working for jobs in higher priority partition

2021-08-20 Thread Brian Andrus
IIRC, Preemption is determined by partition first, not node. Since your pending job is in the 'day' partition, it will not preempt something in the 'night' partition (even if the node is in both). Brian Andrus On 8/19/2021 2:49 PM, Russell Jones wrote: Hi all, I co

Re: [slurm-users] using sacctmgr to change the parent of an account

2021-09-08 Thread Brian Andrus
Yep. I do it all the time when I forget to add a parent. Also when a project/account changes who owns it. sacctmgr will also tell you what it is going to change and gives you 30 seconds to say yes, else it doesn't make the change. Brian Andrus On 9/8/2021 3:41 AM, byron wrote: Hi

Re: [slurm-users] max_script_size

2021-09-13 Thread Brian Andrus
a cluster. This may be a good option for you. Brian Andrus On 9/13/2021 7:14 AM, Ozeryan, Vladimir wrote: *max_script_size=#* Specify the maximum size of a batch script, in bytes. The default value is 4 megabytes. Larger values may adversely impact system performance. I have users

Re: [slurm-users] [External] How can I do to prevent a specific job from being prempted?

2021-09-16 Thread Brian Andrus
Modify it and raise the priority to something very, very high. scontrol update job=JOBID priority=999 Brian Andrus On 9/16/2021 8:39 AM, 顏文 wrote: Dear users Thanks for the immediate replies. I currently have one important job running. How to prevent the running job from being preempted

Re: [slurm-users] Possible bug with Prologslurmctld and Epilogslurmctld scripts?

2021-09-27 Thread Brian Andrus
Those would be considered separate for each job. You may want to have your prolog check to see if there is an epilogue running and wait for the epilogue to be done before starting its prolog work. Brian Andrus On 9/27/2021 9:15 AM, Joe Teumer wrote: Should the Prologslurmctld script only

Re: [slurm-users] "Low RealMem" after upgrade

2021-10-01 Thread Brian Andrus
. Also helps with OOM killer situations. Brian Andrus On 10/1/2021 1:22 AM, Diego Zuccato wrote: Hello all. I just upgraded to Debian 11 that brings Slurm 21.08 and the newer nodes upgraded w/o too many issues (just minor config changes, one being RealMemory value in slurm.conf, since for

Re: [slurm-users] job is pending but resources are available

2021-10-12 Thread Brian Andrus
Something is very odd when you have the node reporting: RealMemory=1 AllocMem=0 FreeMem=47563 Sockets=2 Boards=1 What do you get when you run ‘slurmd -C’ on the node? Brian Andrus From: Adam Xu Sent: Tuesday, October 12, 2021 6:07 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] job is

Re: [slurm-users] Slurm Crashing - File has zero size

2021-10-28 Thread Brian Andrus
You may have space, but do you have enough inodes? Two different things to look at when trying to see why you cannot write to a disk. Also verify that it is writeable by SlurmUser. If something happened and it automatically remounted itself as read-only, that can do it too. Brian Andrus
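
The checks described above can be sketched as a small shell function. This is only a sketch: the function name `check_state_dir` is hypothetical, `df -i` is assumed to be available (GNU and BSD `df` have it), and `findmnt` is used only if it is installed. Run it as the SlurmUser account so the writability test reflects that user.

```shell
# Hypothetical helper: report space, inodes, writability and mount options
# for the directory given as $1 (for example your StateSaveLocation).
check_state_dir() {
    local dir=$1

    echo "== free space =="
    df -h "$dir"

    echo "== free inodes =="
    # Not every df implementation supports -i; keep going if it fails.
    df -i "$dir" || echo "df -i not supported here"

    # -w tests writability for the user running this function.
    if [ -w "$dir" ]; then
        echo "writable: yes"
    else
        echo "writable: no"
    fi

    # A filesystem remounted read-only shows "ro" in its mount options.
    if command -v findmnt >/dev/null 2>&1; then
        echo "mount options: $(findmnt -no OPTIONS --target "$dir")"
    fi
}
```

A disk can show plenty of free space in the first table while the inode table shows 100% use, which is the case the message warns about.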

Re: [slurm-users] Slurm Multi-cluster implementation

2021-10-31 Thread Brian Andrus
That is interesting to me. How do you use ulimit and systemd to limit user usage on the login nodes? This sounds like something very useful. Brian Andrus On 10/31/2021 1:08 AM, Yair Yarom wrote: Hi, If it helps, this is our setup: 6 clusters (actually a bit more) 1 mysql + slurmdbd on the

Re: [slurm-users] How to checkout a slurm node?

2021-11-12 Thread Brian Andrus
I don't think Slurm does what you think it does. It manages resources and scheduling, not the actual hardware of a node. You are likely looking for something more along the lines of a hypervisor (if you are doing VMs) or remote KVM (since you are mentioning BIOS access). Brian Andrus On 11/12/2021

Re: [slurm-users] Job Preemption Time

2021-11-22 Thread Brian Andrus
Maybe submit the job with the option to not start for 24 hours... From https://slurm.schedmd.com/sbatch.html : --begin=now+1hour Brian Andrus On 11/22/2021 8:28 PM, Jeherul Islam wrote: Dear All, Is there any way to configure slurm, that the High Priority job waits for a certain amount of

Re: [slurm-users] random allocation of resources

2021-12-01 Thread Brian Andrus
one and set the job to use that node. Brian Andrus On 12/1/2021 12:06 PM, Benjamin Nacar wrote: Based on some quick experiments, that doesn't do what I'm looking for. I set LLN=YES for the default partition and ran my test job several times, waiting each time for it to fin

Re: [slurm-users] TimeLimit parameter

2021-12-02 Thread Brian Andrus
imit, part_max_time and partition variables are mapped from job_desc and part_list Brian Andrus On 12/2/2021 6:01 AM, mercan wrote: Hi; The EnforcePartLimits parameter in slurm.conf, should be set to ALL or ANY to enforce time limit for partition. Regards. Ahmet M. 2.12.2021 16:18 tarihinde

Re: [slurm-users] slurmdbd does not work

2021-12-02 Thread Brian Andrus
Your Slurm needs to be built with MySQL support. If you have mysql-devel installed it should pick it up; otherwise you can specify the location with --with-mysql_config=PATH (the directory containing mysql_config) when you configure/build Slurm. Brian Andrus On 12/2/2021 12:40 PM, Giuseppe G. A. Celano wrote: Hi everyone, I am having trouble

Re: [slurm-users] slurmdbd does not work

2021-12-03 Thread Brian Andrus
:41.022] fatal: You are running with a database but for some reason we have no TRES from it.  This should only happen if the database is down and you don't have any state files. On Thu, Dec 2, 2021 at 10:36 PM Brian Andrus wrote: Your slurm needs built with the support. If you have

Re: [slurm-users] TimeLimit parameter

2021-12-03 Thread Brian Andrus
parts you need out of it. Brian Andrus On 12/3/2021 2:11 AM, Gestió Servidors wrote: Hi, Answering between lines... > Hi; > > The EnforcePartLimits parameter in slurm.conf, should be set to ALL or ANY > to enforce time limit for partition. > > Regards. > > Ahmet M

Re: [slurm-users] [EXT] Re: slurmdbd does not work

2021-12-03 Thread Brian Andrus
Which version of Mariadb are you using? Brian Andrus On 12/3/2021 4:20 PM, Giuseppe G. A. Celano wrote: After installation of libmariadb-dev, I have reinstalled the entire slurm with ./configure + options, make, and make install. Still, accounting_storage_mysql.so is missing. On Sat, Dec

Re: [slurm-users] Add new compute node without interruption

2021-12-13 Thread Brian Andrus
Indeed, this is accurate. We regularly add nodes on the fly (cloud based cluster). All that is needed is to get them all set in the slurm.conf, restart slurmctld and do 'scontrol reconfigure'. Brian Andrus On 12/13/2021 11:01 AM, Paul Brunk wrote: Hi: Normally, adding a new nod

[slurm-users] List only available and up partitions

2022-01-26 Thread Brian Andrus
All, Trying to see if there is a simpler way to do this other than awk.. Is there a way to list only partitions a user has access to that are in the 'UP' state? Brian Andrus

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Brian Andrus
file and folks will be able to login again. Brian Andrus On 1/31/2022 9:18 PM, Sid Young wrote: Sid Young W: https://off-grid-engineering.com W: (personal) https://sidyoung.com/ W: (personal) https://z900collector.wordpress.com/ On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel wrote:

Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis ?

2022-02-01 Thread Brian Andrus
That looks like a DNS issue. Verify all your nodes are able to resolve the names of each other. Check /etc/resolv.conf, /etc/hosts and /etc/slurm/slurm.conf on the nodes (including head/login nodes) to ensure they all match. Brian Andrus On 2/1/2022 1:37 AM, Jeremy Fix wrote: Hello
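
One way to run the name-resolution check is sketched below. The function name `unresolved_hosts` is hypothetical, and the sketch assumes `getent` is available on the node; it only tests that each name resolves, not that every node resolves it to the same address.

```shell
# Read host names on stdin (one per line) and print those that do not resolve.
unresolved_hosts() {
    while read -r host; do
        [ -n "$host" ] || continue
        getent hosts "$host" >/dev/null 2>&1 || echo "$host"
    done
}

# Example use on each node (the sinfo call is an assumption about your setup):
#   sinfo -h -o %n | sort -u | unresolved_hosts
```

Empty output means every name resolved on that node; running the same command on the head, login and compute nodes shows where the lookups differ.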

Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-02 Thread Brian Andrus
), you should be good. You will still need to do the incremental for the db changes, but no worries about state files either way. Brian Andrus On 2/2/2022 1:38 PM, Nathan Smith wrote: The "Upgrades" section of the quick-start guide [0] warns: Slurm permits upgrades to a new major rele

Re: [slurm-users] job_container/tmpfs mounts a private /tmp but the permission is root 700.Normal user can not read or write.

2022-02-03 Thread Brian Andrus
symlink that to /scratch which is where users are directed to. You could just do "chmod 1777 /tmp" as well Caveat: If this is the ephemeral ramdisk/ssd/etc disk that is created each time the node starts up, you have to do the above step every boot. Brian Andrus On 2/2/2022 8:59 P

Re: [slurm-users] Slurm 20.02 Dry-Run DB upgrade - Question on the cloned DB

2022-02-07 Thread Brian Andrus
So if you run sshare on the head node, it shows your dummy user? At any rate, just do a db dump (also known as a backup) and you can restore that if you have an issue of any sort. Brian Andrus On 2/7/2022 12:42 AM, Moshe Mergy wrote: Hi all I cloned the Slurm DB into a separated node

Re: [slurm-users] Slurm 20.02 Dry-Run DB upgrade - Question on the cloned DB

2022-02-07 Thread Brian Andrus
.conf after shutting down slurmd. You should not have anything listed as backuphost for anything on the cloned db node. It should only have 'localhost' for the SlurmctldHost and AccountingStorageHost (slurm.conf) and DbdHost (slurmdbd.conf) Brian Andrus On 2/7/2022 8:51 AM, Mo

Re: [slurm-users] Slurm 20.02 Dry-Run DB upgrade - Question on the cloned DB

2022-02-07 Thread Brian Andrus
Moshe, So it looks like you added the dummy user to the main database somehow. I would suggest to try again being cautious and make a dummy2 user or such. Your questions now are getting out of slurm and into mysql area, so may be more appropriate in another forum. Brian Andrus On 2/7

Re: [slurm-users] sbatch - accept jobs above limits

2022-02-09 Thread Brian Andrus
Just curious as to expectations out here. When should Slurm immediately reject a job? Brian Andrus On 2/8/2022 11:41 PM, Alexander Block wrote: Hi Mike, I'm just discussing a familiar case with SchedMD right now (ticket 13309). But it seems that it is not possible with Slurm to s

Re: [slurm-users] HA for slurmdbd

2022-02-15 Thread Brian Andrus
le way to do it would be to have round-robin DNS or a load balancer in front of the slurmdbd servers and let that be where clients access it. Brian Andrus On 2/15/2022 7:46 AM, Xand Meaden wrote: Hello, I'm wondering what others are doing to make their slurmdbd service resilient? W

Re: [slurm-users] Suspend QOS help

2022-02-18 Thread Brian Andrus
First look and I would guess that there are enough resources to satisfy the requests of both jobs, so no need to suspend. Having the node info and the job info to compare would be the next step. Brian Andrus On 2/18/2022 7:20 AM, Walls, Mitchell wrote: Hello, Hoping someone can shed some

Re: [slurm-users] monitoring and update regime for Power Saving nodes

2022-02-23 Thread Brian Andrus
configless so I use a symlink to the slurm.conf file a shared filesystem. This works great. Anytime there are changes, a simple 'scontrol reconfigure' brings all running nodes up to speed and any down nodes will automatically read the latest. Brian Andrus On 2/23/2022 2:31 AM, Dav

Re: [slurm-users] Slurmrestd JWT authentication [SEC=UNOFFICIAL]

2022-03-07 Thread Brian Andrus
Double-check you have all the packages. When slurm is built, slurmrestd is a separate package and is only built if the whole set was directed to do so. If they did not build it, you will need to do so yourself. This will mean using your custom built files throughout. Brian Andrus On 3/7

Re: [slurm-users] job requesting licenses would not be scheduled as expected

2022-03-15 Thread Brian Andrus
Depending on other variables, it is fine. The 7 license job cannot run because there are only 5 available, so that one has to wait. Since there are 5 available, the 1 license job can run, so it does. That is the simple view. Other variables such as job time could affect that. Brian Andrus

Re: [slurm-users] Use all cores when submitting to heterogeneous nodes

2022-03-22 Thread Brian Andrus
to know what is available versus what you asked for. When using exclusive, it becomes more like "I want at least X cores" and you get "Ok, here are X cores or more" Within your script, you could check for total cores. something like 'srun lscpu' and parse the ou
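
A minimal sketch of parsing the `lscpu` output mentioned above. The function name `total_cpus` is hypothetical, and the sketch assumes the English-locale `lscpu` layout with a `CPU(s):` line; that line counts logical CPUs, so with hyperthreading on it reports threads rather than physical cores.

```shell
# Read lscpu output on stdin and print the number from the "CPU(s):" line.
total_cpus() {
    awk -F: '/^CPU\(s\):/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Inside a batch script, for example:
#   cores=$(srun -N1 -n1 lscpu | total_cpus)
```

The script can then size its work from `$cores` instead of the number it originally requested.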

Re: [slurm-users] how to locate the problem when slurm failed to restrict gpu usage of user jobs

2022-03-23 Thread Brian Andrus
It should exist in the user environment as well. I would check the user's .bashrc and .bash_profile settings to see if they are doing anything that will change that. Brian Andrus On 3/23/2022 7:42 AM, taleinterve...@sjtu.edu.cn wrote: Hi, all: We found a problem that slurm job with

Re: [slurm-users] Make sacct show short job state codes?

2022-03-24 Thread Brian Andrus
("RUNNING","RU");print}' Just add a 'sub' command for each substitution. It is tedious to setup but will do the trick. You can also specify the specific field to do any substitution on. Brian Andrus On 3/24/2022 6:12 AM, Chip Seraphine wrote: I’m trying to sh
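
The approach of one `sub` per state could look like the sketch below. Only the RUNNING to RU mapping comes from the message; the other short codes and the function name `shorten_states` are illustrative assumptions.

```shell
# Read sacct-style text on stdin and replace long state names with short codes.
shorten_states() {
    awk '{
        sub("RUNNING",   "RU")   # mapping shown in the original message
        sub("COMPLETED", "CD")   # the remaining codes are illustrative choices
        sub("PENDING",   "PD")
        sub("FAILED",    "F")
        print
    }'
}

# Example use:
#   sacct -X -o JobID,State | shorten_states
```

Passing a field as the third argument, for example `sub("RUNNING", "RU", $2)`, limits the substitution to that column; note that awk then rebuilds the line with single spaces between fields.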

Re: [slurm-users] Why is --cpu_bind not an option for sbatch? Why only srun?

2022-03-31 Thread Brian Andrus
ep that is where you would do that and those are subsets of sbatch. Brian Andrus On 3/31/2022 11:14 AM, David Henkemeyer wrote: We noticed that we can pass --cpu_bind into an srun commandline, but not sbatch.  Why is that? Thanks David

Re: [slurm-users] Can I define and use custom env vars in slurm.conf?

2022-04-04 Thread Brian Andrus
, and are not usable outside of the partition configuration. Feature     All nodes with this single feature will be included as part of this nodeset. Nodes     List of nodes in this set. NodeSet     Unique name for a set of nodes. Must not overlap with any NodeName definitions. Brian Andrus

[slurm-users] Lua to reject if maintenance window

2022-04-05 Thread Brian Andrus
All, Not sure if this is already out there, but it would be nice to be able to immediately reject interactive jobs that are going to be held due to an upcoming maintenance window. Does anyone already have this? If not, I suspect I will work on it as a lua function for the job_submit.lua Brian

Re: [slurm-users] Node is not allocating all CPUs

2022-04-05 Thread Brian Andrus
You want to see what is output on the node itself when you run: slurmd -C Brian Andrus On 4/5/2022 2:11 PM, Guertin, David S. wrote: We've added a new GPU node to our cluster with 32 cores. It contains 2 16-core sockets, and hyperthreading is turned off, so the total is 32 cores. But

Re: [slurm-users] sinfo : Format NodeHost truncation

2022-04-07 Thread Brian Andrus
justified and size must be specified. By default output is left justified. suffix     Arbitrary string to append to the end of the field. Brian Andrus On 4/7/2022 11:02 AM, Nicholas Yue wrote: Hi,   I am spinning up an MPI/Slurm cluster on AWS   I am attempting to script the

Re: [slurm-users] Issues with pam_slurm_adopt

2022-04-08 Thread Brian Andrus
Check SELinux. Run "getenforce" on the node; if it returns "Enforcing", try running "setenforce 0". Slurm doesn't play well if SELinux is enabled. Brian Andrus On 4/8/2022 10:53 AM, Nicolas Greneche wrote: Hi, I have an issue with pam_slurm_adopt when I moved from 21.08.5

Re: [slurm-users] Issues with pam_slurm_adopt

2022-04-08 Thread Brian Andrus
Ok. Next I would check that the uid of the user is the same on the compute node as the head node. It looks like it is identifying the job, but doesn't see it as yours. Brian Andrus On 4/8/2022 1:40 PM, Nicolas Greneche wrote: Hi Brian, Thanks, SELinux is neither in strict or targeted

Re: [slurm-users] Looking for examples of daily job reports

2022-04-15 Thread Brian Andrus
Not to steal his thunder, but Ole has done a great job with quite a few things. He has some job scripts at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs I fully expect him to chime in and offer additional great advice. Brian Andrus On 4/15/2022 7:28 AM, David Henkemeyer

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Brian Andrus
'vanishes'.  I suspect Nagios even has the hooks to make that work. You could also email the user to let them know their job was ended due to spot being pulled. Just some ideas, Brian Andrus On 5/5/2022 6:28 AM, Steven Varga wrote: Hi Tina, Thank you for sharing. This matches my obser

Re: [slurm-users] Question about having 2 partitions that are mutually exclusive, but have unexpected interactions

2022-05-12 Thread Brian Andrus
on restart of the slurmctld daemon. May not exceed 65533. so if you already have (by default) 5000 jobs being considered, the remaining aren't even looked at. Brian Andrus On 5/12/2022 7:34 AM, David Henkemeyer wrote: Question for the braintrust: I have 3 partitions:

Re: [slurm-users] 21.08.6 srun fails with error "Invalid job credential" ; sbatch is fine.

2022-05-13 Thread Brian Andrus
Double-check the account info on that node (c0801). Could be the node does not recognize the uid being assigned to the user/job. Brian Andrus On 5/13/2022 2:31 PM, Williams, Jenny Avis wrote: Yesterday I upgraded slurmdbd and slurmctld nodes from RHEL7 / Slurm v. 20.11.8 to RHEL8.5 / Slurm

Re: [slurm-users] Performance tracking of array tasks

2022-05-16 Thread Brian Andrus
hings down into slurm steps, so you would be able to get pretty good detailed info. Brian Andrus On 5/16/2022 6:44 AM, William Dear wrote: Could anyone please recommend methods of tracking the performance of individual tasks in a task array job?  I have installed XDMoD but it is focused so

Re: [slurm-users] container on slurm cluster

2022-05-17 Thread Brian Andrus
You are starting to understand a major issue with most containers. I suggest you check out Singularity, which was built from the ground up to address most issues. And it can run other container types (eg: docker). Brian Andrus On 5/16/2022 10:49 PM, GHui wrote: I use podman 4.0.2. And slurm

Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Brian Andrus
You need to step upgrade through major versions (not minor). So 19.05=>20.x I would highly recommend going to 21.08 while you are at it. I just did the same migration (although they started at 18.x) with no issues. Running jobs were not impacted and users didn't even notice. Bria

Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Brian Andrus
watch the logs to see when it is happy). Don't start slurmctld until that is done. Waiting makes things easier. Brian Andrus On 5/17/2022 9:29 AM, Paul Edmon wrote: I think it should be, but you should be able to run a test and find out. -Paul Edmon- On 5/17/22 12:13 PM, byron wrote: Sor

Re: [slurm-users] container on slurm cluster

2022-05-18 Thread Brian Andrus
the people you give the permission to that they will not abuse it. Brian Andrus On 5/18/2022 12:22 AM, GHui wrote: Hi, Brian Andrus I think the main problem is that container can cheat Slurm. On 5/17/22 06:58:20, Brian Andrus wrote: > You are starting to understand a major issue with most co

Re: [slurm-users] How to Make AvailableFeatures Persist after Slurmctld Restart

2022-06-02 Thread Brian Andrus
Add it to your slurm.conf Then it is always there after a restart. Brian Andrus On 6/2/2022 12:05 PM, Hanby, Mike wrote: Howdy, I can’t seem to find a solution in ‘man slurm.conf’ for this. How can I make the following persist a slurmctld restart: scontrol update NodeName="

Re: [slurm-users] what is the possible reason for secondary slurmctld node not allocate job after takeover?

2022-06-03 Thread Brian Andrus
Offhand, I would suggest double check munge and versions of slurmd/slurmctld. Brian Andrus On 6/3/2022 3:17 AM, taleinterve...@sjtu.edu.cn wrote: Hi, all: Our cluster set up 2 slurm control node and scontrol show config as below: > scontrol show config … SlurmctldHost[0] = slu

Re: [slurm-users] Persistent Interactive Jobs

2022-06-09 Thread Brian Andrus
rent users/groups. Brian Andrus On 6/9/2022 5:19 PM, Willy Markuske wrote: Hello All, I have a request from users for the ability to have persistent interactive jobs. Currently some users are using srun to allocate an interactive job and run their scripts but sshd will close connections aft

Re: [slurm-users] detailed worker state with sinfo

2022-06-26 Thread Brian Andrus
respectively. *NOTE*: The suffix "*" identifies nodes that are presently not responding. Brian Andrus On 6/26/2022 5:39 AM, z1...@arcor.de wrote: Hello, if I call "sinfo -o %all", the worker state includes only a single state word like "DRNG". It is clearer in

Re: [slurm-users] Is there split-brain danger when using backup slurmdbd?

2022-06-27 Thread Brian Andrus
ensures both are getting accurate and current information. Brian Andrus On 6/27/2022 9:15 AM, taleinterve...@sjtu.edu.cn wrote: Hi, all: We noticed that slurmdbd provides the conf option *DbdBackupHost* for users to set a secondary slurmdbd node. Since slurmdbd is closely related to database

Re: [slurm-users] Problems building RPMs

2022-07-21 Thread Brian Andrus
Hmm. That would imply you could still use the tar file with something like: rpmbuild -v -ta --define "_lto_cflags %{nil}" slurm-22.05.2.tar.bz2 Note, I have not tried this (no immediate access to RHEL9 derivative), so YMMV. Brian Andrus On 7/21/2022 10:15 AM, Kilian Cavalotti

Re: [slurm-users] unable to ssh onto compute nodes on which I have running jobs

2022-07-27 Thread Brian Andrus
Verify that their uid on the node is the same as the uid your master sees Brian Andrus On 7/27/2022 8:53 AM, byron wrote: Hi When a user tries to login into a compute node on which they have a running job they get the error Access denied: user blahblah (uid=) has no active jobs on
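
The uid comparison can be sketched as below. The function name `compare_uid` is hypothetical, and the sketch assumes it is run on the master by an administrator who can ssh to the compute node.

```shell
# Compare a user's numeric uid on this host with the uid on a compute node.
# Prints the result and returns non-zero on a mismatch.
compare_uid() {
    local user=$1 node=$2 here there
    here=$(id -u "$user")
    there=$(ssh "$node" id -u "$user")
    if [ "$here" = "$there" ]; then
        echo "match: $user uid=$here"
    else
        echo "MISMATCH: $user local=$here $node=$there"
        return 1
    fi
}

# Example use (user and node names are placeholders):
#   compare_uid someuser somenode
```

A mismatch usually points at the node reading a different passwd source (local files versus LDAP or SSSD) than the master.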

Re: [slurm-users] unable to ssh onto compute nodes on which I have running jobs

2022-07-27 Thread Brian Andrus
Lloyd, You could check the order of entries in your pam.d/ssh (and related/included) files. See where pam_slurm_adopt is, how it is being called, and whether there are settings that are interfering. Does this occur only on a single node, or all of them? Brian Andrus On 7/27/2022 9:29

Re: [slurm-users] Does the slurmctld node need access to Parallel File system and Runtime libraries of the SW in the Compute nodes.

2022-08-02 Thread Brian Andrus
compute nodes do. Brian Andrus On 8/2/2022 6:45 AM, Paul Edmon wrote: No, the node running the slurmctld does not need access to any of the customer facing filesystems or home directories.  While all the login and client nodes do, the slurmctld does not. -Paul Edmon- On 8/2/2022 9:30 AM

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-03 Thread Brian Andrus
So an example of using slurm to reboot all nodes 3 at a time:     sinfo -h -o %n|xargs -I{} --max-procs=3 scontrol reboot {} If you want to get fancy, make a script that does the reboot and waits for the node to be back up before exiting and use that instead of the 'scontrol reboot' par
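
A rough sketch of the "reboot and wait" script suggested above. The function names and the `POLL_SECONDS` variable are hypothetical; the sketch assumes a RebootProgram is configured so that `scontrol reboot` works, and it treats a changed `BootTime` in `scontrol show node` as the sign that the node has come back.

```shell
# Print the BootTime field that "scontrol show node" reports for node $1.
boot_time() {
    scontrol show node "$1" | tr ' ' '\n' | sed -n 's/^BootTime=//p'
}

# Ask Slurm to reboot node $1, then block until its BootTime changes.
reboot_and_wait() {
    local node=$1 before now
    before=$(boot_time "$node")
    scontrol reboot "$node" || return 1
    while :; do
        now=$(boot_time "$node")
        # Done once a new, non-empty boot time is reported.
        if [ -n "$now" ] && [ "$now" != "None" ] && [ "$now" != "$before" ]; then
            break
        fi
        sleep "${POLL_SECONDS:-30}"
    done
}

# Example use, three nodes at a time (bash, because of export -f):
#   export -f boot_time reboot_and_wait
#   sinfo -h -o %n | sort -u | xargs -I{} --max-procs=3 bash -c 'reboot_and_wait "$1"' _ {}
```

Because each worker blocks until its node reports a new boot time, `--max-procs=3` then limits how many nodes are down at once, which a bare `scontrol reboot` does not do since it returns immediately.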

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-04 Thread Brian Andrus
This is actually brilliant! Brian Andrus On 8/3/2022 10:20 PM, Gerhard Strangar wrote: Phil Chiu wrote: - Individual slurm jobs which reboot nodes - With a for loop, I could submit a reboot job for each node. But I'm not sure how to limit this so at most N jobs are ru

Re: [slurm-users] Node status (without repeats)

2022-08-08 Thread Brian Andrus
It looks to me like you have the same node in multiple partitions. If the output you are getting is basically what you want just pipe it to 'sort -u' or 'uniq' Brian Andrus On 8/8/2022 10:14 AM, Borchert, Christopher B ERDC-RDE-ITL-MS CIV wrote: Hello. How can I simply

Re: [slurm-users] License management and invoking scontrol in the prolog

2022-08-30 Thread Brian Andrus
Not sure if you can do all the things you intend, but the job_submit script is precisely where you want to check submission options. https://slurm.schedmd.com/job_submit_plugins.html Brian Andrus On 8/30/2022 12:58 PM, Davide DelVento wrote: Hi, I would like to soft-enforce license

Re: [slurm-users] License management and invoking scontrol in the prolog

2022-09-01 Thread Brian Andrus
I would be surprised if it were compiled without the support. However, you could check and run something like: strings /sbin/slurmctld | grep job_submit (or where ever your slurmctld binary is). There should be quite a few lines with that in it. Brian Andrus On 9/1/2022 10:54 AM, Davide

Re: [slurm-users] License management and invoking scontrol in the prolog

2022-09-01 Thread Brian Andrus
) Usually it would be found at /usr/lib64/slurm/job_submit_lua.so If that is there, you should be good with trying out a job_submit lua script. Brian Andrus On 9/1/2022 1:24 PM, Davide DelVento wrote: Thanks again, Brian, indeed that grep returns many hits, but none of them includes lua, i.e

Re: [slurm-users] License management and invoking scontrol in the prolog

2022-09-01 Thread Brian Andrus
Try setting logging to debug mode, then you can get some info from the logs. Brian Andrus On 9/1/2022 8:15 PM, Davide DelVento wrote: Thanks. I did try a lua script as soon as I got your first email, but that never worked (yes, I enabled it in slurm.conf and ran "scontrol reconfigure"

Re: [slurm-users] License management and invoking scontrol in the prolog

2022-09-07 Thread Brian Andrus
Possibly way off base, but did you happen to do any of the editing in Windows? Maybe running into the cr/lf issue for how windows saves text files? Brian Andrus On 9/7/2022 5:21 AM, Davide DelVento wrote: Thanks Ole, your wiki page sheds some light on this mystery. Very frustrating that even
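
A quick way to test for the CR/LF issue is sketched below. The function names `has_crlf` and `strip_cr` are hypothetical; the cleanup removes every carriage return in the file, which is normally what you want for a script.

```shell
# Return success if file $1 contains any carriage-return characters.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Write a copy of file $1 to $2 with all carriage returns removed.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Example use (file names are placeholders):
#   has_crlf job_submit.lua && strip_cr job_submit.lua job_submit.lua.unix
```

Writing to a second file keeps the original intact until the cleaned copy has been checked.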

Re: [slurm-users] can a job run across partition in slurm

2022-09-08 Thread Brian Andrus
No, however a node can reside in multiple partitions. So if you add those nodes to the partition they are running in, they will be available to them. Brian Andrus On 9/8/2022 11:38 AM, Purvesh Parmar wrote: We require more nodes to run a single job which requires more nodes than present in

Re: [slurm-users] can a job run across partition in slurm

2022-09-12 Thread Brian Andrus
I had completely forgotten about HETJOB supporting multiple partitions. Thanks for reminding me. Brian Andrus On 9/12/2022 6:06 AM, Marcus Wagner wrote: yes, that is possible by submitting a hetjob. Best Marcus Am 08.09.2022 um 20:38 schrieb Purvesh Parmar: We require more nodes to run a

Re: [slurm-users] How to debug a prolog script?

2022-09-15 Thread Brian Andrus
configured? Brian Andrus On 9/15/2022 2:49 PM, Davide DelVento wrote: I have a super simple prolog script, as follows (very similar to the example one) #!/bin/bash if [[ $VAR == 1 ]]; then echo "True" fi exit 0 This fails (and obviously causes great disruption to my production j

Re: [slurm-users] remote license

2022-09-15 Thread Brian Andrus
is in the database updated to match the number free from flexlm to stop license starvation due to users outside slurm using them up so they really aren't available to slurm. Brian Andrus On 9/15/2022 3:34 PM, Davide DelVento wrote: I am a bit confused by remote licenses. https://li

Re: [slurm-users] Can I set dynamic weighting for nodes?

2022-09-15 Thread Brian Andrus
You can dynamically modify the weight of nodes with:     scontrol update nodename= weight= So, in theory, you could do that periodically to adjust the weights you may want. Brian Andrus On 9/15/2022 4:27 PM, Russell Smithies wrote: Can I set dynamic or calculated  “weights” for nodes

Re: [slurm-users] remote license

2022-09-16 Thread Brian Andrus
t) 2) Update the database (sacctmgr command) As you can see, that 1st step would be highly dependent on you and your environment. The 2nd step would be dependent on what things you are tracking within that. Brian Andrus On 9/16/2022 5:01 AM, Davide DelVento wrote: So if I understand corr

Re: [slurm-users] remote license

2022-09-16 Thread Brian Andrus
Feel free to do that. It is not something that scales well, but it looks like you have a rather beginner cluster that would never be impacted by such choices. Brian Andrus On 9/16/2022 10:00 AM, Davide DelVento wrote: Thanks Brian. I am still perplexed. What is a database to install, admin

Re: [slurm-users] remote license

2022-09-16 Thread Brian Andrus
ourself working on a large cluster sometime in your career, I would not recommend using it there. Brian Andrus On 9/16/2022 3:06 PM, Davide DelVento wrote: Hi Brian, From your response, I speculate that my wording sounded harsh or unrespectful. That was not my intention and therefore I sincer

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Brian Andrus
Paul, You are likely spot on with the inactiveLimit change. It may also be an environment variable of TMOUT (under bash) set. Brian Andrus On 9/19/2022 5:46 AM, Paul Raines wrote: I have had two nights where right at 3:35am a bunch of jobs were killed early with TIMEOUT way before  their

Re: [slurm-users] slurmd and dynamic nodes

2022-09-23 Thread Brian Andrus
out of the mix. Brian Andrus On 9/23/2022 7:09 AM, Groner, Rob wrote: I'm working through how to use the new dynamic node features in order to take down a particular node, reconfigure it (using nvidia MIG to change the number of graphic cores available) and give it back to slurm. I

Re: [slurm-users] slurmd and dynamic nodes

2022-09-23 Thread Brian Andrus
ynamic node.  What is the preferred method? Rob -------- *From:* slurm-users on behalf of Brian Andrus *Sent:* Friday, September 23, 2022 10:24 AM *To:* slurm-users@lists.schedmd.com *Subject:* Re: [slurm-users] slurmd and dyna

Re: [slurm-users] Ideal NFS exported StateSaveLocation size.

2022-10-24 Thread Brian Andrus
YMMV, but if you aren't having excessive traffic to the share, you should be good. I have yet to discover what would be excessive enough to impact things. The only use I have had for the HA is being able to keep the cluster running/happy during maintenance. Brian Andrus On 10/24/2022 1:

Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-10-31 Thread Brian Andrus
It caches up to a point. As I understand it, that is about an hour (depending on size and how busy the cluster is, as well as available memory, etc). Brian Andrus On 10/31/2022 9:20 PM, Richard Chang wrote: Hi, Just for my info, I would like to know what happens when SlurmDBD loses

Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-11-01 Thread Brian Andrus
Ole, Fair enough, it is actually slurmctld that does the caching. Technical typo on my part there. Just trying to let the user know, there is a window that they have to ensure no information is lost during a database outage. Brian Andrus On 11/1/2022 1:43 AM, Ole Holm Nielsen wrote: Hi

Re: [slurm-users] SlurmDBD losing connection to the backend MariaDB

2022-11-01 Thread Brian Andrus
processing data. There are many ways to do that, but those designs fall under MariaDB and not Slurm. Brian Andrus On 11/1/2022 6:49 PM, Richard Chang wrote: Does it mean it is best to use a single slurmdbd host in my case? My primary slurmctld is the backup slurmdbd host, and my worry is if t

Re: [slurm-users] Slurm: Handling nodes that fail to POWER_UP in a cloud scheduling system

2022-11-23 Thread Brian Andrus
reset/recreate it. That addresses even a miffed software change. Brian Andrus On 11/23/2022 5:11 AM, Xaver Stiensmeier wrote: Hello slurm-users, The question can be found in a similar fashion here: https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a

Re: [slurm-users] How to launch slurm services after installation

2022-11-27 Thread Brian Andrus
Steve, I suspect you did not install the packages. You need to install slurm-slurmctld to get the slurmctld systemd files: # rpm -qlp slurm-slurmctld-20.11.9-1.el7.x86_64.rpm /run/slurm/slurmctld.pid /usr/lib/systemd/system/slurmctld.service /usr/sbin/slurmctld

Re: [slurm-users] Licenses: Remote vs Reservation

2022-11-30 Thread Brian Andrus
ed to submit at all? The reservation method can cause an sbatch command to be rejected, if that is what you are looking for. Brian Andrus On 11/30/2022 6:29 AM, Richard Ems wrote: Hi all, I have to change our set up to be able to update the total number of available licenses due to users che

Re: [slurm-users] Slurm v22 for Alma 8

2022-12-02 Thread Brian Andrus
I successfully built it for Rocky straight from the tgz file as usual with rpmbuild -ta Brian Andrus On 12/2/2022 9:21 AM, David Thompson wrote: Hi folks, I’m working on getting Slurm v22 RPMs built for our Alma 8 Slurm cluster. We would like to be able to use the sbatch --prefer option

Re: [slurm-users] Job allocation from a heterogenous pool of nodes

2022-12-07 Thread Brian Andrus
You may want to look here: https://slurm.schedmd.com/heterogeneous_jobs.html Brian Andrus On 12/7/2022 12:42 AM, Le, Viet Duc wrote: Dear slurm community, I am encountering a unique situation where I need to allocate jobs to nodes with different numbers of CPU cores. For instance

Re: [slurm-users] I can't seem to use all the CPUs in my Cluster?

2022-12-13 Thread Brian Andrus
assigned to it. Also check the state of the nodes with 'sinfo' It would also be good to ensure the node settings are right. Run 'slurmd -C' on a node and see if the output matches what is in the config. Brian Andrus On 12/13/2022 1:38 AM, Gary Mansell wrote: Dear Slurm Us
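The 'slurmd -C' check suggested above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names get_field and compare_node are hypothetical, the two input lines are expected in the "Key=Value Key=Value" form that both 'slurmd -C' and a NodeName line in slurm.conf use, and the config path in the usage comment is a guess that may differ on your system.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Slurm): compare the hardware line
# printed by 'slurmd -C' with the matching NodeName line in slurm.conf.

# Print the value of one Key=Value field from a space-separated line.
# Returns 1 if the key is not present.
get_field() {
    gf_key=$1
    gf_line=" $2 "
    case "$gf_line" in
        *" $gf_key="*)
            gf_rest=${gf_line#*" $gf_key="}   # drop everything up to the value
            printf '%s\n' "${gf_rest%% *}"    # keep text up to the next space
            ;;
        *)
            return 1
            ;;
    esac
}

# Print one line per field that differs; return 1 if any field differs.
# Fields missing from either line are skipped rather than reported.
compare_node() {
    cn_detected=$1
    cn_configured=$2
    cn_status=0
    for cn_key in CPUs CoresPerSocket ThreadsPerCore; do
        cn_d=$(get_field "$cn_key" "$cn_detected") || continue
        cn_c=$(get_field "$cn_key" "$cn_configured") || continue
        if [ "$cn_d" != "$cn_c" ]; then
            printf '%s: detected=%s configured=%s\n' "$cn_key" "$cn_d" "$cn_c"
            cn_status=1
        fi
    done
    return "$cn_status"
}

# Possible usage on a compute node (path and grep pattern are assumptions;
# a NodeName range such as node[01-10] would need a different match):
#   detected=$(slurmd -C | head -n 1)
#   configured=$(grep "^NodeName=$(hostname -s) " /etc/slurm/slurm.conf)
#   compare_node "$detected" "$configured"
```

RealMemory is left out of the comparison on purpose, since the configured value is often set somewhat below what the node reports.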

Re: [slurm-users] I can't seem to use all the CPUs in my Cluster?

2022-12-13 Thread Brian Andrus
the many articles, wikis and videos out there. TLDR; If you are going to be running efficient HPC jobs, you are indeed better off with HT turned off. Brian Andrus On 12/13/2022 8:03 AM, Gary Mansell wrote: Hi, thanks for getting back to me. I have been doing some more experimenting, and I

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Brian Andrus
Try: sacctmgr list runawayjobs Brian Andrus On 12/20/2022 7:54 AM, Reed Dier wrote: Hoping this is a fairly simple one. This is a small internal cluster that we’ve been using for about 6 months now, and we’ve had some infrastructure instability in that time, which I think may be the

Re: [slurm-users] Job cancelled into the future

2022-12-20 Thread Brian Andrus
Seems like the time may have been off on the db server at the insert/update. You may want to dump the database, find what table/records need to be updated and try updating them. If anything went south, you could restore from the dump. Brian Andrus On 12/20/2022 11:51 AM, Reed Dier wrote: Just to

Re: [slurm-users] slurmrestd service broken by 22.05.07 update

2022-12-28 Thread Brian Andrus
I suspect if you delete /var/lib/slurmrestd.socket and then start slurmrestd, it will create it as the user you need it to be. Or just change the owner of it to the slurmrestd owner. I have been running slurmrestd as a separate user for some time. Brian Andrus On 12/28/2022 3:20 PM, Chris

Re: [slurm-users] slurmrestd service broken by 22.05.07 update

2022-12-29 Thread Brian Andrus
lurm/slurm.conf" You can change those as needed. This made it listen on port 8081 only (no socket and not 6820). I was then able to just use curl on port 8081 to test things. Hope that helps. Brian Andrus On 12/29/2022 6:49 AM, Chris Stackpole wrote: Greetings, Thanks for responding

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-04 Thread Brian Andrus
ready. Brian Andrus On 1/4/2023 9:22 AM, Groner, Rob wrote: We currently have a test cluster and a production cluster, all on the same network.  We try things on the test cluster, and then we gather those changes and make a change to the production cluster.  We're doing that through two diffe

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-17 Thread Brian Andrus
y with the new (known good) config. Brian Andrus On 1/17/2023 12:36 PM, Groner, Rob wrote: So, you have two equal sized clusters, one for test and one for production?  Our test cluster is a small handful of machines compared to our production. We have a test slurm control node on a test cl

Re: [slurm-users] slurm and singularity

2023-02-07 Thread Brian Andrus
Then cluster_run.sh would call sbatch along with the appropriate commands. Brian Andrus On 2/7/2023 9:31 AM, Groner, Rob wrote: I'm trying to set up the capability where a user can execute: $: sbatch script_to_run.sh and the end result is that a job is created on a node, and that job wi

Re: [slurm-users] slurm and singularity

2023-02-08 Thread Brian Andrus
commands are xterm, a shell script containing srun commands, and srun (see the EXAMPLES section). *If no command is specified, then salloc runs the user's default shell.* Brian Andrus On 2/8/2023 7:01 AM, Jeffrey T Frey wrote: You may need srun to allocate a pty for the command.

Re: [slurm-users] GPUs not available after making use of all threads?

2023-02-13 Thread Brian Andrus
efficient HPC jobs. The goal is that every process is utilizing the CPU as close to 100% as possible, which would render hyper-threading moot. Brian Andrus On 2/13/2023 12:15 AM, Hermann Schwärzler wrote: Hi Sebastian, I am glad I could help (although not exactly as expected :-). With
