[slurm-users] Re: slurmdbd not connecting to mysql (mariadb)

2024-05-30 Thread Brian Andrus via slurm-users
That SIGTERM message means something is telling slurmdbd to quit. Check your cron jobs, maintenance scripts, etc. Slurmdbd is being told to shut down. If you are running in the foreground, a ^C does that. If you run a kill or killall on it, you will get that same message. Brian Andrus On
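A minimal sketch of the kill behavior mentioned above, using `sleep` as a stand-in process (it does not touch slurmdbd). A plain `kill` with no signal option sends SIGTERM, and the shell reports the exit status as 128 + 15. The function name `demo_sigterm` is made up for illustration.

```shell
# Stand-in demo: `kill` with no signal option sends SIGTERM (signal 15).
# `sleep` plays the role of the daemon; this is not slurmdbd itself.
demo_sigterm() {
    sleep 30 >/dev/null 2>&1 &
    pid=$!
    kill "$pid"                      # equivalent to: kill -TERM "$pid"
    status=0
    wait "$pid" || status=$?         # a SIGTERM death is reported as 128 + 15
    echo "$status"
}
```

Calling `demo_sigterm` prints 143, which is the same signal a stray `kill`, `killall`, or maintenance script would deliver to the daemon.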

[slurm-users] Re: slurmdbd archive format

2024-05-28 Thread Brian Andrus via slurm-users
Oh, to address the passed train: Restore the archive data with "sacctmgr archive load", then you can do as you need. From man sacctmgr: archive {dump|load}: Write database information to a flat file or load information that has previously been written to a file. Brian Andrus Setup
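As a rough sketch of the restore step, the wrapper below loads one archive file back into the accounting database. The function name `load_archive` and the example path are hypothetical; `file=` is the sacctmgr option that names the archive to load.

```shell
# Load one archive file back into the accounting database.
# $1: path to an archive file written earlier by the archive process
#     (hypothetical; use wherever your site keeps its archive files).
load_archive() {
    sacctmgr archive load file="$1"
}

# Example (path is illustrative only):
#   load_archive /path/to/archive_file
```

Once the records are back in the database, the usual query tools can read them again.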

[slurm-users] Re: slurmdbd archive format

2024-05-28 Thread Brian Andrus via slurm-users
Instead of using the archive files, couldn't you query the db directly for the info you need? I would recommend sacct/sreport if those can get the info you need. Brian Andrus On 5/28/2024 9:59 AM, O'Neal, Doug (NIH/NCI) [C] via slurm-users wrote: My organization needs to access historic job
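One way the direct query could look, assuming the records of interest are still in the database (not yet purged). The function name `job_history` and the chosen field list are illustrative, not something the thread prescribes.

```shell
# Print one pipe-delimited line per job allocation in a date range.
# $1: start date (e.g. 2023-01-01), $2: end date (e.g. 2023-12-31)
job_history() {
    sacct --allusers --allocations --parsable2 --noheader \
          --starttime="$1" --endtime="$2" \
          --format=JobID,User,Account,Partition,Elapsed,State
}

# Example:
#   job_history 2023-01-01 2023-12-31 > jobs_2023.psv
```

The `--parsable2` output has no trailing delimiter, which makes it easy to feed into other tools.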

[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-23 Thread Brian Andrus via slurm-users
/23/2024 6:16 AM, Christopher Samuel via slurm-users wrote: On 5/22/24 3:33 pm, Brian Andrus via slurm-users wrote: A simple example is when you have nodes with and without GPUs. You can build slurmd packages without for those nodes and with for the ones that have them. FWIW we have both GPU

[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-22 Thread Brian Andrus via slurm-users
Not that I recommend it much, but you can build them for each environment and install the ones needed in each. A simple example is when you have nodes with and without GPUs. You can build slurmd packages without for those nodes and with for the ones that have them. Generally, so long as

[slurm-users] Re: Submitting from an untrusted node

2024-05-14 Thread Brian Andrus via slurm-users
Rike, Assuming the data, scripts and other dependencies are already on the cluster, you could just ssh and execute the sbatch command in a single shot: ssh submitnode sbatch some_script.sh It will ask for a password if appropriate and could use ssh keys to bypass that need. Brian Andrus
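The one-shot submission above can be wrapped so that the job ID comes back on stdout. This is only a sketch: `remote_submit` is a made-up name, and it assumes the script already exists on the submit host, as the message does.

```shell
# Submit a script that already lives on the cluster and print the job ID.
# $1: submit host (the message uses "submitnode")
# $2: script path on that host (paths with spaces would need extra quoting,
#     because ssh passes the arguments through the remote shell)
remote_submit() {
    ssh "$1" sbatch --parsable "$2"
}

# Example:
#   jobid=$(remote_submit submitnode some_script.sh)
```

With ssh keys in place this runs without a password prompt, so it can be called from other scripts.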

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Brian Andrus via slurm-users
/...). Wouldn't Slurm pick up that one? Thanks! Jeff On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users wrote: This is because you have no slurm.conf in /etc/slurm, so it is trying 'configless', which queries DNS to find out where to get the config. It is failing because

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Brian Andrus via slurm-users
This is because you have no slurm.conf in /etc/slurm, so it is trying 'configless', which queries DNS to find out where to get the config. It is failing because you do not have DNS configured to tell nodes where to ask about the config. Simple solution: put a copy of slurm.conf in
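A quick check for the situation described above might look like the sketch below. The function name `check_slurm_conf` is hypothetical, and the check only looks for a local file; it ignores other ways a daemon can be pointed at its configuration.

```shell
# Report whether a node has a local slurm.conf.
# $1: config directory; defaults to /etc/slurm as in the message above.
check_slurm_conf() {
    conf_dir=${1:-/etc/slurm}
    if [ -f "$conf_dir/slurm.conf" ]; then
        echo "local"        # a local file exists and can be read directly
    else
        echo "configless"   # no local file, so a DNS-based lookup would be tried
    fi
}
```

Running `check_slurm_conf` on a node that prints `configless` while DNS is not set up for it matches the failure described in the message.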

[slurm-users] Re: Slurm.conf and workers

2024-04-15 Thread Brian Andrus via slurm-users
Xaver, If you look at your slurmctld log, you will likely see messages about each node's slurm.conf not being the same as that on the master. So, yes, it can work temporarily, but unless there are some very specific settings done, issues will arise. The state you are in now, you will
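To spot the mismatch without reading the log, one could compare checksums of the file across nodes. This is a sketch under assumptions: passwordless ssh to each node, the file living at /etc/slurm/slurm.conf, and `sha256sum` being installed there. The function name `conf_hashes` is made up.

```shell
# Print "<hash> <node>" for each node's slurm.conf so mismatches stand out.
# Arguments: node names. Assumes passwordless ssh and that the file is at
# /etc/slurm/slurm.conf on every node (adjust the path for your site).
conf_hashes() {
    for node in "$@"; do
        hash=$(ssh "$node" sha256sum /etc/slurm/slurm.conf | cut -d' ' -f1)
        echo "$hash $node"
    done
}

# Example (node names are illustrative):
#   conf_hashes master node01 node02 | sort
```

Sorting the output groups identical hashes together, so any node whose copy differs from the master's is easy to pick out.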

[slurm-users] Re: Upgrading nodes

2024-04-10 Thread Brian Andrus via slurm-users
Yes. You can build the 8 rpms on 9. Look at 'mock' to do so. I did similar when I still had to support EL7. Fairly generic plan, the devil is in the details and verifying each step, but those are the basic bases you need to touch. Brian Andrus On 4/10/2024 1:48 PM, Steve Berg via
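A minimal sketch of the mock approach, assuming you already have a Slurm source RPM and that your host ships an EL8 mock configuration. The function name `mock_rebuild` is hypothetical, and the config name depends on which distribution's chroot you want.

```shell
# Rebuild a source RPM inside a chroot for another release.
# $1: mock config name (look in /etc/mock/ for the ones your host provides)
# $2: path to the Slurm .src.rpm
mock_rebuild() {
    mock -r "$1" --rebuild "$2"
}

# Example (both arguments are placeholders):
#   mock_rebuild <el8-config-name> /path/to/slurm.src.rpm
```

Because the build happens inside the chroot, the resulting packages link against the EL8 libraries rather than those of the EL9 build host.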

[slurm-users] Re: Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-08 Thread Brian Andrus via slurm-users
Xaver, You may want to look at the ResumeRate option in slurm.conf: ResumeRate The rate at which nodes in power save mode are returned to normal operation by ResumeProgram. The value is a number of nodes per minute and it can be used to prevent power surges if a large number of

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
, Brian Andrus via slurm-users wrote: Quick correction, it is SaveStateLocation not SlurmSaveState. Brian Andrus On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote: Dear all, I am having trouble finalizing the configuration of the backup controller for my slurm

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
Quick correction, it is SaveStateLocation not SlurmSaveState. Brian Andrus On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote: Dear all, I am having trouble finalizing the configuration of the backup controller for my slurm cluster. In principle, if no job is running everything seems

[slurm-users] Re: controller backup slurmctld error while takeover

2024-03-25 Thread Brian Andrus via slurm-users
Miriam, You need to ensure the SlurmSaveState directory is the same for both. And by 'the same', I mean all contents are exactly the same. This is usually achieved by using a shared drive or replication. Brian Andrus On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote: Dear all, I am
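As a first sanity check, one could confirm that both controllers at least point at the same path. In slurm.conf the parameter is spelled `StateSaveLocation`. The sketch below extracts it from a config file; the function name `state_dir` is made up, and it matches only that exact spelling.

```shell
# Print the StateSaveLocation value from a slurm.conf file.
# $1: path to the slurm.conf to inspect.
# Matches the canonical spelling only; trailing comments are dropped.
state_dir() {
    sed -n 's/^[[:space:]]*StateSaveLocation[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$1" | tail -n 1
}

# Example: run on each controller and compare the results by eye.
#   state_dir /etc/slurm/slurm.conf
```

Matching paths are necessary but not sufficient: the directory still has to be the same shared or replicated storage on both hosts, which can be checked separately (for example with `df` on each controller).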

[slurm-users] Re: We're Live! Check out the new SchedMD.com now!

2024-03-13 Thread Brian Andrus via slurm-users
Wow, snazzy! Looks very good. My compliments. Brian Andrus On 3/12/2024 11:24 AM, Victoria Hobson via slurm-users wrote: Our website has gone through some much needed change and we'd love for you to explore it! The new SchedMD.com is equipped with the latest information about Slurm, your

[slurm-users] Re: Slurm billback and sreport

2024-03-04 Thread Brian Andrus via slurm-users
Chip, I use 'sacct' rather than sreport and get individual job data. That is ingested into a db and PowerBI, which can then aggregate as needed. sreport is pretty general and likely not the best for accurate chargeback data. Brian Andrus On 3/4/2024 6:09 AM, Chip Seraphine via slurm-users
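For a sense of what aggregating per-job sacct data can look like, here is a small stand-in that uses awk instead of the database and PowerBI pipeline described above. It sums CPU-seconds per account from pipe-delimited input. The function name `cpu_seconds_by_account` is hypothetical.

```shell
# Sum CPU-seconds per account from "Account|CPUTimeRAW" lines on stdin,
# for example produced by:
#   sacct -a -X -n -P -S <start> -E <end> -o Account,CPUTimeRAW
cpu_seconds_by_account() {
    awk -F'|' '{ total[$1] += $2 } END { for (a in total) print a, total[a] }' | sort
}
```

Working from per-job rows like this keeps the aggregation rules (what counts, how it is rounded) under your control, which is the advantage over the more general sreport summaries.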

[slurm-users] Re: Is SWAP memory mandatory for SLURM

2024-03-04 Thread Brian Andrus via slurm-users
Joseph, You will likely get many perspectives on this. I disable swap completely on our compute nodes. I can be draconian that way. For the workflow supported, this works and is a good thing. Other workflows may benefit from swap. Brian Andrus On 3/3/2024 11:04 PM, John Joseph via

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Brian Andrus via slurm-users
oxy> Brian Andrus On 2/28/2024 12:54 PM, Dan Healy wrote: Are most of us using HAProxy or something else? On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users wrote: Magnus, That is a feature of the load balancer. Most of them have that these days. Brian

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Brian Andrus via slurm-users
Magnus, That is a feature of the load balancer. Most of them have that these days. Brian Andrus On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote: On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote: for us, we put a load balancer in front

[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-27 Thread Brian Andrus via slurm-users
Josef, for us, we put a load balancer in front of the login nodes with session affinity enabled. This makes them land on the same backend node each time. Also, for interactive X sessions, users start a desktop session on the node and then use vnc to connect there. This accommodates

[slurm-users] Re: [INTERNET] Re: question on sbatch --prefer

2024-02-10 Thread Brian Andrus via slurm-users
I imagine you could create a reservation for the node and then when you are completely done, remove the reservation. Each helper could then target the reservation for the job. Brian Andrus On 2/9/2024 5:52 PM, Alan Stange via slurm-users wrote: Chip, Thank you for your prompt response.  We
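The reservation idea could be sketched as three small steps: create, target, remove. All names here (function names, reservation name, node, user, duration) are placeholders for illustration, and creating reservations normally needs operator or administrator rights.

```shell
# Create a reservation on one node for one user.
# $1: reservation name, $2: node, $3: user, $4: duration in minutes
make_reservation() {
    scontrol create reservation ReservationName="$1" Nodes="$2" \
        Users="$3" StartTime=now Duration="$4"
}

# Submit a helper job into the reservation.
# $1: reservation name, $2: job script
submit_helper() {
    sbatch --reservation="$1" "$2"
}

# Remove the reservation once all helpers are done.
# $1: reservation name
drop_reservation() {
    scontrol delete ReservationName="$1"
}
```

For example, `make_reservation helpers node01 alice 120`, then one `submit_helper helpers job.sh` per helper, and finally `drop_reservation helpers` when everything has finished.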