Re: [slurm-users] Slurm User Group Meeting (SLUG'20) Agenda Posted

2020-08-31 Thread Ole Holm Nielsen
to view the presentations later on. Can the broadcasts be made available for viewing later on? Thanks, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

Re: [slurm-users] Adding Users to Slurm's Database

2020-08-18 Thread Ole Holm Nielsen
On 18-08-2020 17:36, Jason Simms wrote: Hello everyone! We have a script that queries our LDAP server for any users that have an entitlement to use the cluster, and if they don't already have an account on the cluster, one is created for them. In addition, they need to be added to the Slurm

[slurm-users] Compute node OS and firmware updates

2020-08-06 Thread Ole Holm Nielsen
the nodes as they become idle, thereby performing the updates that you want. Remove the crontab job as part of the update.sh script. The update.sh script and instructions for usage are in: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/nodes Comments are welcome. /Ole

Re: [slurm-users] Reservation vs. Draining for Maintenance?

2020-08-06 Thread Ole Holm Nielsen
On 06-08-2020 19:13, Jason Simms wrote: Later this month, I will have to bring down, patch, and reboot all nodes in our cluster for maintenance. The two options available to set nodes into a maintenance mode seem to be either: 1) creating a system-wide reservation, or 2) setting all nodes into

Re: [slurm-users] squeue reports ReqNodeNotAvail but node is available

2020-07-13 Thread Ole Holm Nielsen
://bugs.schedmd.com/show_bug.cgi?id=9257 Upgrade to Slurm 20.02 is highly recommended. /Ole On 7/12/20 3:36 PM, Ole Holm Nielsen wrote: In case your ARP cache is the problem, there is some advice in the Wiki page: https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks

Re: [slurm-users] squeue reports ReqNodeNotAvail but node is available

2020-07-12 Thread Ole Holm Nielsen
In case your ARP cache is the problem, there is some advice in the Wiki page: https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks I think there are other causes for ReqNodeNotAvail, for example, the node being allocated for other jobs. The "scontrol

Re: [slurm-users] save job comment into job completion data

2020-07-06 Thread Ole Holm Nielsen
On 7/5/20 5:42 PM, Fred Liu wrote: It looks like the job comment won't be saved into the job completion data, since I can't see it when I use sacct -c, but I can see it when I use sacct (without -c). Is it possible to make it work? The sacct manual page explains the -c parameter: -c, --completion Use

Re: [slurm-users] Module "pam_slurm_adopt"

2020-07-01 Thread Ole Holm Nielsen
On 7/1/20 11:10 AM, Gestió Servidors wrote: I want to limit users' SSH connections to compute nodes. I have read at https://slurm.schedmd.com/pam_slurm_adopt.html that “pam_slurm_adopt” allows an SSH connection if and only if that user has a job (or more than one) running on that node.

Re: [slurm-users] runtime priority

2020-07-01 Thread Ole Holm Nielsen
On 6/30/20 4:52 PM, Lawrence Stewart wrote: How does one configure the runtime priority of a job? That is, how do you set the CPU scheduling “nice” value? We’re using Slurm to share a large (16 core 768 GB) server among FPGA compilation jobs. Slurm handles core and memory reservations just

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-19 Thread Ole Holm Nielsen
On 19-06-2020 18:55, Mark Hahn wrote: The host-based SSH authentication is a good idea, but only inside the cluster's security perimeter, and one should not trust computers external to the cluster nodes in this way. Even more than that!  Hostbased allows you to define intersecting sets of

Re: [slurm-users] Changing job order

2020-06-18 Thread Ole Holm Nielsen
The scontrol command to set the nice level is on the list here: https://wiki.fysik.dtu.dk/niflheim/SLURM#useful-commands /Ole On 6/18/20 8:05 AM, navin srivastava wrote: Thanks. What is the command to modify the Nice value of an already submitted job? Regards Navin On Thu, Jun 18, 2020 at

Re: [slurm-users] [External] Re: ssh-keys on compute nodes?

2020-06-17 Thread Ole Holm Nielsen
On 6/16/20 3:17 PM, Durai Arasan wrote: Thank you. We are planning to put ssh keys on login nodes only and use the PAM module to control access to compute nodes. Will such a setup work? Or is it necessary for PAM to work to have the ssh keys on the compute nodes as well?  I'm sorry but this is

[slurm-users] Slurm 20.02.3 error: CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.

2020-06-16 Thread Ole Holm Nielsen
Today we upgraded the controller node from 19.05 to 20.02.3, and immediately all Slurm commands (on the controller node) give error messages for all partitions: # sinfo --version sinfo: error: NodeNames=a[001-140] CPUs=1 match no Sockets, Sockets*CoresPerSocket or
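The error text implies a consistency rule for each NodeName line: the configured CPUs= count must equal Sockets, Sockets*CoresPerSocket, or Sockets*CoresPerSocket*ThreadsPerCore. Below is a minimal sketch of that rule, inferred only from the wording of the message (not taken from Slurm's source); the function name and the example topology are hypothetical.

```python
def cpus_match_topology(cpus: int, sockets: int, cores_per_socket: int,
                        threads_per_core: int) -> bool:
    """True if CPUs equals one of the three products named in the error message."""
    allowed = {
        sockets,                                        # CPUs == number of sockets
        sockets * cores_per_socket,                     # CPUs == number of cores
        sockets * cores_per_socket * threads_per_core,  # CPUs == number of hardware threads
    }
    return cpus in allowed


# Hypothetical node with 2 sockets x 8 cores x 2 threads per core.
for cpus in (1, 2, 16, 32):
    verdict = "ok" if cpus_match_topology(cpus, 2, 8, 2) else "mismatch"
    print(f"CPUs={cpus}: {verdict}")
```

With that topology only CPUs=1 fails the check, which is the kind of mismatch the message reports before "Resetting CPUs".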

Re: [slurm-users] Slurm 20.02.3 error: CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.

2020-06-16 Thread Ole Holm Nielsen
ockets" before installing/upgrading to 20.02. It may be a good idea to track updates to bug https://bugs.schedmd.com/show_bug.cgi?id=9241 Best regards, Ole On 16.06.2020 11:12, Ole Holm Nielsen wrote: Today we upgraded the controller node from 19.05 to 20.02.3, and immediatel

Re: [slurm-users] unable to start slurmd process.

2020-06-11 Thread Ole Holm Nielsen
On 11-06-2020 14:24, navin srivastava wrote: Hi Team, when i am trying to start the slurmd process i am getting the below error. 2020-06-11T13:11:58.652711+02:00 oled3 systemd[1]: Starting Slurm node daemon... 2020-06-11T13:13:28.683840+02:00 oled3 systemd[1]: slurmd.service: Start operation

Re: [slurm-users] [External] Re: ssh-keys on compute nodes?

2020-06-09 Thread Ole Holm Nielsen
be used in special cases. On 6/9/20 11:45 AM, Michael Jennings wrote: On Tuesday, 09 June 2020, at 12:43:34 (+0200), Ole Holm Nielsen wrote: in which case you need to set up SSH authorized_keys files for such users. I'll admit that I didn't know about this until I came to LANL, but there's

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-09 Thread Ole Holm Nielsen
: On Tuesday, 09 June 2020, at 12:43:34 (+0200), Ole Holm Nielsen wrote: in which case you need to set up SSH authorized_keys files for such users. I'll admit that I didn't know about this until I came to LANL, but there's actually a much better alternative than having to create user key pairs

Re: [slurm-users] cluster reconfigure

2020-06-09 Thread Ole Holm Nielsen
On 6/9/20 12:12 PM, Steve Brasier wrote: Hi all, looking for some advice on the process to follow when doing one of the reconfigurations which requires a slurm daemon restart (as listed in docs for "scontrol reconfigure"). When reconfiguring slurm.conf, make sure to propagate that file to

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-09 Thread Ole Holm Nielsen
orrect? So is it not sufficient for only the "slurm" linux user to have passwordless ssh access to all nodes? Why do we have to give passwordless ssh access to every user of the cluster? Thanks, Durai Zentrum für Datenverarbeitung Tübingen On Mon, Jun 8, 2020 at 6:43 PM Ole Holm Nielsen mail

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-08 Thread Ole Holm Nielsen
On 08-06-2020 18:07, Jeffrey T Frey wrote: There's a Slurm PAM module you can use to gate ssh access -- basically it checks to see if the user has a job running on the node and moves any ssh sessions to the first cgroup associated with that user on that node. If you don't use cgroup resource

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-05 Thread Ole Holm Nielsen
Hi Geoffrey, I'm just curious as to what causes a user to decide that a given node has an issue? If a node is healthy in all respects, why would a user decide not to use the node? We can certainly perform all sorts of node health checks from Slurm by configuring the use of LBNL Node Health

Re: [slurm-users] Problem with permisions. CentOS 7.8

2020-06-03 Thread Ole Holm Nielsen
the update before figuring out what is happening. Best, Ferran -- From: slurm-users on behalf of Ole Holm Nielsen Sent: Tuesday, June 2, 2020 10:04:53 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-user

Re: [slurm-users] Problem with permisions. CentOS 7.8

2020-06-02 Thread Ole Holm Nielsen
file does not even exist. Any idea? Many thanks, Ferran From: slurm-users on behalf of Ole Holm Nielsen Sent

Re: [slurm-users] sacct

2020-06-02 Thread Ole Holm Nielsen
time, see how to make the setup in https://wiki.fysik.dtu.dk/niflheim/Slurm_database#setting-database-purge-parameters You can check your database purge configuration by: $ grep -i purge /etc/slurm/slurmdbd.conf Best regards, Ole -Original Message- From: slurm-users On Behalf Of Ole
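A Python equivalent of the grep above, which also skips commented-out lines (something a plain `grep -i purge` does not do). This is a sketch: the function name is hypothetical, and the sample slurmdbd.conf values are illustrations, not recommendations.

```python
def purge_settings(conf_text: str) -> dict[str, str]:
    """Return Purge*=value settings from slurmdbd.conf text, skipping comments."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.lower().startswith("purge"):
            settings[key] = value
    return settings


# Hypothetical slurmdbd.conf excerpt; the commented-out line is ignored.
sample = """
StorageType=accounting_storage/mysql
PurgeEventAfter=12months
PurgeJobAfter=12months
#PurgeStepAfter=1month
"""
print(purge_settings(sample))
```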

Re: [slurm-users] Problem with permisions. CentOS 7.8

2020-06-02 Thread Ole Holm Nielsen
/var/run/munge and my system now starts munge up fine during a reboot. Renata On Fri, 29 May 2020, Ole Holm Nielsen wrote: Hi Ferran, When you have a CentOS 7 system with the EPEL repo enabled, and you have installed the munge RPM from EPEL, then things should be working correctly. Since

Re: [slurm-users] sacct

2020-06-02 Thread Ole Holm Nielsen
On 6/2/20 10:16 AM, Sidhu, Khushwant wrote: When a job is running and I use the command: sacct --format="AveCPU,AveDiskRead,AveDiskWrite,user" -j 12345 I get values for all parameters. However, when a job is completed, the same command returns no values for all but ‘user’. Is there a reason

Re: [slurm-users] Problem with permisions. CentOS 7.8

2020-05-29 Thread Ole Holm Nielsen
ge: Unable to talk to NTP daemon. Is it running? It is the same message I get on the nodes that DO work. All nodes are in sync in time and date with the central node From: slurm-users on behalf of Ole Holm Nielsen S

Re: [slurm-users] Problem with permisions. CentOS 7.8

2020-05-29 Thread Ole Holm Nielsen
On 29-05-2020 08:46, Sudeep Narayan Banerjee wrote: also check: a) whether NTP has been set up and is communicating with the master node b) iptables may be flushed (iptables -L) c) set SELinux to disabled; to check: getenforce vim /etc/sysconfig/selinux (change SELINUX=enforcing to SELINUX=disabled and

Re: [slurm-users] Problem with permisions. CentOS 7.8

2020-05-28 Thread Ole Holm Nielsen
Good check, but I don't believe it is necessary to disable SELinux in order to run Munge correctly. Our slurmctld server (CentOS 7.8) reports Enforcing. /Ole On 28-05-2020 20:37, Ree, Jan-Albert van wrote: Have you checked if SELinux is perhaps blocking this? Give a 'getenforce' command.

Re: [slurm-users] How does slurm keep track of latest jobid

2020-05-20 Thread Ole Holm Nielsen
On 5/20/20 7:57 AM, Ole Holm Nielsen wrote: On 20-05-2020 00:03, Flynn, David P. (Dave) wrote: Where does Slurm keep track of the latest jobid.  Since it is persistent across reboots, I suspect it’s in a file somewhere. $ scontrol show config | grep MaxJobId Sorry, I should have written

Re: [slurm-users] How does slurm keep track of latest jobid

2020-05-19 Thread Ole Holm Nielsen
On 20-05-2020 00:03, Flynn, David P. (Dave) wrote: Where does Slurm keep track of the latest jobid.  Since it is persistent across reboots, I suspect it’s in a file somewhere. $ scontrol show config | grep MaxJobId

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Ole Holm Nielsen
, but still get counted against the user's current requests. From: Ole Holm Nielsen Sent: Friday, May 8, 2020 9:27 AM To: slurm-users@lists.schedmd.com Cc: Renfro, Michael Subject: Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue Hi Mic

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Ole Holm Nielsen
ncorrect sacct output). I've also gone through sacctmgr show runaway to clean up any runaway jobs. I had lots, but they were all from a different user, and had no effect on this particular user's values. From: slurm

Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

2020-05-08 Thread Ole Holm Nielsen
Hi Michael, Maybe you will find a couple of my Slurm tools useful for displaying data from the Slurm database in a more user-friendly format: showjob: Show status of Slurm job(s). Both queue information and accounting information is printed. showuserlimits: Print Slurm resource user limits

Re: [slurm-users] update or fresh install slurm 20.x

2020-05-01 Thread Ole Holm Nielsen
be mixed as follows: slurmdbd >= slurmctld >= slurmd >= commands You cannot mix 20.02 with an old 17.02 or 17.11 Slurm! /Ole On 5/1/2020 10:45 AM, Ole Holm Nielsen wrote: On 01-05-2020 09:21, Felix Farcas wrote: I did install a new server ARC-CE with slurm-20.x I installed only one node fo
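The ordering rule above can be expressed as a small check. This is a sketch under the assumption that only the major release (the first two version fields) matters, so patch levels within one release compare equal; the function names are hypothetical. It checks the ordering only and does not capture how far apart the versions may be (17.02 or 17.11 cannot be mixed with 20.02 even though the ordering would hold).

```python
def major_release(version: str) -> tuple[int, int]:
    """'20.02.3' -> (20, 2); patch levels within one release are treated as equal."""
    year, month = version.split(".")[:2]
    return int(year), int(month)


def version_mix_ok(slurmdbd: str, slurmctld: str, slurmd: str, commands: str) -> bool:
    """Check the ordering slurmdbd >= slurmctld >= slurmd >= commands."""
    chain = [major_release(v) for v in (slurmdbd, slurmctld, slurmd, commands)]
    return all(newer >= older for newer, older in zip(chain, chain[1:]))


print(version_mix_ok("20.02.1", "20.02.1", "19.05.5", "19.05.5"))  # nodes not yet upgraded
print(version_mix_ok("19.05.5", "20.02.1", "20.02.1", "20.02.1"))  # slurmdbd left behind
```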

Re: [slurm-users] update or fresh install slurm 20.x

2020-05-01 Thread Ole Holm Nielsen
On 01-05-2020 09:21, Felix Farcas wrote: I installed a new server ARC-CE with slurm-20.x. I installed only one node for testing. On my old nodes I have slurm-17.x. Now I am asking: if I am going to move my old nodes to the new server with slurm-20.x, do I have to remove old slurm and

Re: [slurm-users] How to get command from a finished job

2020-04-30 Thread Ole Holm Nielsen
On 30-04-2020 15:34, Bjørn-Helge Mevik wrote: Gestió Servidors writes: For example, with "scontrol show jobid" I can know what command has been submitted, its workdir, the stderr file and the stdout one. This information, I think, cannot be obtained when the job is finished and I run "sacct".

Re: [slurm-users] slurm-20.02.1-1 failed rpmbuild with error File not found

2020-04-23 Thread Ole Holm Nielsen
b in the spec file. Just makes it easier for people who aren't experts in building packages. And slurmdbd will work with both mysql and mariadb regardless of what it was built against. Thanks for the update, Christian. On 20/04/2020 17.57, Ole Holm Nielsen wrote: For the record: The Slurm developers h

Re: [slurm-users] pam_slurm_adopt seems not working properly under "configless" slurm mode

2020-04-21 Thread Ole Holm Nielsen
On 21-04-2020 04:58, Haoyang Liu wrote: I am setting up the latest slurm-20.02-1 on my clusters and trying to configure "configless" slurm on the compute nodes. After following the instructions from https://slurm.schedmd.com/configless_slurm.html, both slurmctld and slurmd work fine. The

Re: [slurm-users] slurm-20.02.1-1 failed rpmbuild with error File not found

2020-04-20 Thread Ole Holm Nielsen
For the record: The Slurm developers have found it tricky to write a slurm.spec file which requires the mysql-devel package and still works in all environments, see https://bugs.schedmd.com/show_bug.cgi?id=6488 My recommendation[1] is therefore to explicitly require mysql when building Slurm

Re: [slurm-users] slurm-20.02.1-1 failed rpmbuild with error File not found

2020-04-17 Thread Ole Holm Nielsen
On 17-04-2020 11:47, Ole Holm Nielsen wrote: On 17-04-2020 10:38, Christian Anthon wrote: It would be neat to have these build requirements / install requirements built into the spec file. I agree with you, and it seems that the SchedMD pages no longer list the build prerequisites (I think

Re: [slurm-users] slurm-20.02.1-1 failed rpmbuild with error File not found

2020-04-17 Thread Ole Holm Nielsen
would need to have a support contract with SchedMD. We get a lot of benefit from having such a support contract ;-) Best regards, Ole On 17/04/2020 10.08, Ole Holm Nielsen wrote: Hi Felix, Please make sure to install all prerequisite packages on the Slurm build host.  I have summarized this i

Re: [slurm-users] slurm-20.02.1-1 failed rpmbuild with error File not found

2020-04-17 Thread Ole Holm Nielsen
Hi Felix, Please make sure to install all prerequisite packages on the Slurm build host. I have summarized this information in my Slurm Wiki page: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms /Ole On 17-04-2020 09:11, Felix Farcas wrote: I am trying to build a

Re: [slurm-users] Munge decode failing on new node

2020-04-16 Thread Ole Holm Nielsen
You might want to check the Munge section in my Slurm Wiki page: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#munge-authentication-service /Ole On 15-04-2020 19:57, Dean Schulze wrote: I've installed two new nodes onto my slurm cluster.  One node works, but the other one complains

Re: [slurm-users] Error buildind rpm on Centos 7

2020-04-07 Thread Ole Holm Nielsen
Hi Alfonso, You just need to get the CentOS 7 prerequisites right, check out my Slurm installation Wiki page: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms HTH, Ole On 07-04-2020 13:07, Alfonso Núñez Slagado wrote:     I'm trying to build rpm packages running

Re: [slurm-users] How many users are running jobs per day on average in slurm ?

2020-04-02 Thread Ole Holm Nielsen
:00. Put the numbers into an Excel spreadsheet. As I said, you should look into proper Slurm accounting so that you can get answers to relevant questions. /Ole On 02/04/20 5:34 pm, Ole Holm Nielsen wrote: On 02-04-2020 13:27, Sudeep Narayan Banerjee wrote: any help in getting the right

Re: [slurm-users] How many users are running jobs per day on average in slurm ?

2020-04-02 Thread Ole Holm Nielsen
On 02-04-2020 13:27, Sudeep Narayan Banerjee wrote: any help in getting the right flags ? The question is not well-defined. If you just want to know the JobID number in the cluster, you could run this command every day and watch the NEXT_JOB_ID increase: # scontrol show config | grep
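If the daily `scontrol show config` output is saved to files, the NEXT_JOB_ID difference can be computed instead of read by eye. A minimal sketch, assuming the "Key = value" line layout that scontrol prints; the function names and sample numbers are hypothetical. Note that this counts job IDs handed out, not distinct users, and it does not handle the counter wrapping around at MaxJobId.

```python
def parse_scontrol_config(text: str) -> dict[str, str]:
    """Parse the 'Key = value' lines printed by `scontrol show config`."""
    config = {}
    for line in text.splitlines():
        if " = " in line:
            key, value = line.split(" = ", 1)
            config[key.strip()] = value.strip()
    return config


def job_ids_used(before: str, after: str) -> int:
    """Job IDs handed out between two saved `scontrol show config` outputs."""
    first = int(parse_scontrol_config(before)["NEXT_JOB_ID"])
    last = int(parse_scontrol_config(after)["NEXT_JOB_ID"])
    return last - first


# Hypothetical outputs saved on two consecutive days.
monday = "MaxJobId                = 67043328\nNEXT_JOB_ID             = 1234500\n"
tuesday = "MaxJobId                = 67043328\nNEXT_JOB_ID             = 1234650\n"
print(job_ids_used(monday, tuesday))
```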

Re: [slurm-users] Slurm - Maridb error

2020-03-27 Thread Ole Holm Nielsen
carefully: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation To remove existing packages: rpm -qa | grep slurm yum remove /Ole From: slurm-users on behalf of Ole Holm Nielsen Sent: Wednesday, March 25, 2020 3

Re: [slurm-users] Slurm - Maridb error

2020-03-25 Thread Ole Holm Nielsen
Hi Nilesh, You may find relevant information about installing on CentOS/RHEL 7 in my Slurm Wiki: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation /Ole On 3/25/20 2:48 AM, Dhumal, Dr. Nilesh wrote: Hello, I am installing slurm on centos . I installed all supporting libraries

Re: [slurm-users] Slurm on GCP Scripts v3 Now Available

2020-03-21 Thread Ole Holm Nielsen
On 20-03-2020 00:24, Brian Christiansen wrote: Google and SchedMD are pleased to announce the V3 release of the slurm-gcp scripts. Check it out at: https://github.com/schedmd/slurm-gcp ... Send Questions / Feedback to the community: google-cloud-slurm-disc...@google.com

Re: [slurm-users] Upgrade paths

2020-03-11 Thread Ole Holm Nielsen
On 11-03-2020 20:01, Will Dennis wrote: I have one cluster running v16.05.4 that I would like to upgrade if possible to 19.05.5; it was installed via a .deb package I created back in 2016. I have located a 17.11.7 Ubuntu PPA (https://launchpad.net/~jonathonf/+archive/ubuntu/slurm) and have

Re: [slurm-users] Job not cancelled after "TimeLimit" supered

2020-03-10 Thread Ole Holm Nielsen
On 3/10/20 9:03 AM, sysadmin.caos wrote: My SLURM cluster has a partition configured with a "TimeLimit" of 8 hours. Now, a job has been running for 9h30m and it has not been cancelled. During these nine and a half hours, a script has executed a "scontrol update partition=mypartition state=down" for

Re: [slurm-users] Meaning of --cpus-per-task and --mem-per-cpu when SMT processors are used

2020-03-04 Thread Ole Holm Nielsen
On 3/4/20 10:12 AM, Alexander Grund wrote: we have a Power9 partition with 44 processors having 4 cores each totaling 176. What is your hardware configuration? Do you have 1 server with 44 processor sockets, and each processor has 4 CPU cores? Or is it maybe 1 server with 1 or more sockets

[slurm-users] How to make sreport cluster reports for individual partitions?

2020-03-03 Thread Ole Holm Nielsen
er reports as indicated? I actually have a Slurm top-user accounting report tool which can select partitions: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct but it doesn't generate the desired account utilization tree as provided by sreport. Thanks, Ole

Re: [slurm-users] Question about SacctMgr....

2020-02-28 Thread Ole Holm Nielsen
sacctmgr show association You may use the (undocumented) format=... option to select only the columns you want, for example: sacctmgr show assoc format=user,account,qos Usage of the format option is only given in the Examples section of the sacctmgr page
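For scripting, sacctmgr's -P (--parsable2) option prints '|'-delimited output without a trailing '|', which is simple to split. A sketch that turns such output into one dictionary per association; the function name and the sample rows are hypothetical, and the header names are assumed to match the requested format fields.

```python
def parse_parsable2(text: str) -> list[dict[str, str]]:
    """Turn '|'-separated output (header line first) into one dict per row."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, row.split("|"))) for row in lines[1:]]


# Hypothetical output of: sacctmgr -P show assoc format=user,account,qos
sample = """User|Account|QOS
|root|normal
alice|physics|normal,long
"""
for row in parse_parsable2(sample):
    print(row["User"] or "(account-level)", row["Account"], row["QOS"].split(","))
```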

Re: [slurm-users] Slurm Upgrade from 17.02

2020-02-19 Thread Ole Holm Nielsen
On 2/19/20 3:10 PM, Ricardo Gregorio wrote: I am putting together an upgrade plan for slurm on our HPC. We are currently running old version 17.02.11. Would you guys advise us upgrading to 18.08 or 19.05? You should be able to upgrade 2 Slurm major versions in one step. The 18.08 version is
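The "at most 2 major versions per step" rule can be turned into an upgrade plan. A sketch, assuming the release list below (consecutive Slurm major releases from 16.05 to 20.02) and a maximum jump of two releases; the function names are hypothetical.

```python
# Assumed list of consecutive Slurm major releases, oldest first.
RELEASES = ["16.05", "17.02", "17.11", "18.08", "19.05", "20.02"]


def major(version: str) -> str:
    """'17.02.11' -> '17.02'."""
    return ".".join(version.split(".")[:2])


def upgrade_steps(current: str, target: str, max_jump: int = 2) -> list[str]:
    """Intermediate and final releases, jumping at most max_jump releases per step."""
    start, end = RELEASES.index(major(current)), RELEASES.index(major(target))
    if end < start:
        raise ValueError("downgrades are not supported")
    steps, i = [], start
    while i < end:
        i = min(i + max_jump, end)
        steps.append(RELEASES[i])
    return steps


print(upgrade_steps("17.02.11", "18.08"))  # a single step
print(upgrade_steps("17.02.11", "19.05"))  # needs an intermediate stop
```

Under these assumptions, 17.02 reaches 18.08 directly, while 19.05 requires a stop at 18.08 first.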

Re: [slurm-users] Job limit in slurm.

2020-02-17 Thread Ole Holm Nielsen
: Hi Ole, Thanks Ole. After setting the Enforce it worked. I am new to Slurm, so thanks for helping me. Regards Navin On Mon, Feb 17, 2020 at 5:36 PM Ole Holm Nielsen wrote: Hi Navin, I wonder if you have configured the Slurm da
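In this thread the fix was the Enforce setting: association limits such as MaxJobs are only applied when AccountingStorageEnforce in slurm.conf includes limits. Also note that MaxJobs caps the number of running jobs, not submissions; MaxSubmitJobs is the association limit on submitted jobs. A sketch that reads the option list from slurm.conf text; the function name is hypothetical, and it only looks for the literal value `limits` (other enforcement values are not interpreted).

```python
def enforce_options(slurm_conf: str) -> set[str]:
    """Options listed in AccountingStorageEnforce= (parameter names are case-insensitive)."""
    options: set[str] = set()
    for raw in slurm_conf.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.lower() == "accountingstorageenforce":
            # If the key appears more than once, the last occurrence wins here.
            options = {opt.strip().lower() for opt in value.split(",") if opt.strip()}
    return options


conf = """
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=associations,limits,qos
"""
print("limits" in enforce_options(conf))
```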

Re: [slurm-users] Cluster usage with Slurm

2020-02-17 Thread Ole Holm Nielsen
On 2/17/20 1:19 PM, Parag Khuraswar wrote: Hi Team, Does Slurm provide cluster usage reports like those mentioned below? Detailed reports about cluster usage statistics. Reports of every user and jobs including their monthly usage, node usage, percentage of utilization, History tracking, number of

Re: [slurm-users] Job limit in slurm.

2020-02-17 Thread Ole Holm Nielsen
limit is set it should allow only 3 jobs at any point in time. Regards Navin. On Mon, Feb 17, 2020 at 4:48 PM Ole Holm Nielsen wrote: Hi Navin, Why do you think the limit is not working? The MaxJobs limits the number of running job

Re: [slurm-users] Job limit in slurm.

2020-02-17 Thread Ole Holm Nielsen
tava wrote: Hi, Thanks for your script. With it I am able to show the limit that I set, but this limit is not working. MaxJobs = 3, current value = 0 Regards Navin. On Mon, Feb 17, 2020 at 4:13 PM Ole Holm Nielsen wrote: On 2/17/20

Re: [slurm-users] Job limit in slurm.

2020-02-17 Thread Ole Holm Nielsen
On 2/17/20 11:16 AM, navin srivastava wrote: I have an issue with the Slurm job limit. I applied the MaxJobs limit on a user using sacctmgr modify user navin1 set maxjobs=3 but I see it is still not being applied. I am still able to submit more jobs. Slurm version is 17.11.x Let me

Re: [slurm-users] Which ports does slurm use?

2020-02-10 Thread Ole Holm Nielsen
. -Original Message- From: slurm-users On Behalf Of Ole Holm Nielsen Sent: Friday, February 7, 2020 2:34 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Which ports does slurm use? On 06-02-2020 22:40, Dean Schulze wrote: I've moved two nodes to a different controller

Re: [slurm-users] Which ports does slurm use?

2020-02-07 Thread Ole Holm Nielsen
On 06-02-2020 22:40, Dean Schulze wrote: I've moved two nodes to a different controller.  The nodes are wired and the controller is networked via wifi.  I had to open up ports 6817 and 6818 between the wired and wireless sides of our network to get any connectivity. Now when I do srun -N2
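The default ports are 6817 for slurmctld (SlurmctldPort) and 6818 for slurmd (SlurmdPort). srun additionally listens on ports that the controller and compute nodes connect back to; these can be confined to a fixed range with SrunPortRange in slurm.conf. A sketch for testing plain TCP reachability across the wired/wireless boundary; the function name is hypothetical, and a successful connect says nothing about Munge authentication.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name not resolved
        return False
```

For example, run `port_open("controller-host", 6817)` from a compute node and `port_open("node001", 6818)` from the controller, where both hostnames are placeholders for your own.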

Re: [slurm-users] Print slurm cgroup parameters

2020-01-27 Thread Ole Holm Nielsen
On 27-01-2020 20:35, Mahmood Naderan wrote: Hi Is there any command to print current cgroup parameters or configurations that are used by Slurm? This works for me: # scontrol show config | tail -22 Cgroup Support Configuration: AllowedDevicesFile =

Re: [slurm-users] Question about slurm source code and libraries

2020-01-25 Thread Ole Holm Nielsen
On 24-01-2020 20:22, Dean Schulze wrote: Since there isn't a list for slurm development I'll ask here.  Does the slurm code include a library for making REST calls?  I'm writing a plugin that will make REST calls and if slurm already has one I'll use that, otherwise I'll find one with an

Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Ole Holm Nielsen
and suggestions for improvement are welcome! /Ole On 1/18/20 12:06 PM, Ole Holm Nielsen wrote: When we have created a new Slurm user with "sacctmgr create user name=xxx", I would like to inquire at a later date about the timestamp for the user creation. As far as I can tell, the sacctm

Re: [slurm-users] [External] Re: Partition question

2019-12-19 Thread Ole Holm Nielsen
Some examples are here: https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting#quality-of-service-qos /Ole On 19-12-2019 19:30, Prentice Bisbal wrote: On 12/19/19 10:44 AM, Ransom, Geoffrey M. wrote: The simplest is probably to just have a separate partition that will only allow job times of

Re: [slurm-users] Upgraded Slurm 17.02 to 19.05, now GRPTRESRunMin limits are applied incorrectly

2019-12-16 Thread Ole Holm Nielsen
Hi Mike, My showuserlimits tool prints nicely user limits from the Slurm database: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits Maybe this can give you further insights into the source of problems. /Ole On 16-12-2019 17:27, Renfro, Michael wrote: Hey, folks. I’ve

Re: [slurm-users] (no subject)

2019-12-08 Thread Ole Holm Nielsen
Forgot the link to the Wiki: https://wiki.fysik.dtu.dk/niflheim/SLURM On 12/8/19 9:18 PM, Ole Holm Nielsen wrote: Hi Dean, You may want to look at the links in my Slurm Wiki page.  Both the official Slurm documentation and other resources are listed.  I think most of your requirements

Re: [slurm-users] (no subject)

2019-12-08 Thread Ole Holm Nielsen
Hi Dean, You may want to look at the links in my Slurm Wiki page. Both the official Slurm documentation and other resources are listed. I think most of your requirements and questions are described in these pages. My Wiki gives detailed deployment information for a CentOS 7 cluster, but

Re: [slurm-users] Slurm 19.05: can not submit job

2019-11-28 Thread Ole Holm Nielsen
On 11/28/19 11:47 AM, Nguyen Dai Quy wrote: On Thu, Nov 28, 2019 at 11:20 AM Ole Holm Nielsen wrote: On 11/28/19 10:35 AM, Nguyen Dai Quy wrote: Hi list, I cannot submit my job: sbatch submit.sh sba

Re: [slurm-users] Slurm 19.05: can not submit job

2019-11-28 Thread Ole Holm Nielsen
On 11/28/19 10:35 AM, Nguyen Dai Quy wrote: Hi list, I can not submit my job: > sbatch submit.sh sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified After checking slurmdbd.log, I see: [2019-11-28T10:21:07.578] Accounting storage MYSQL

Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7

2019-11-13 Thread Ole Holm Nielsen
On 13-11-2019 18:04, Bas van der Vlies wrote: We currently have version 18.08.7 installed on our cluster and want to upgrade to 19.03.3. So I wanted to start small and installed it on one of our compute nodes. But if I start the 'slurmd' then our slurmctld will complain that: {{{

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-12 Thread Ole Holm Nielsen
Hi Daniel, Thanks for sharing your insights! I have updated my Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-mariadb-database now. /Ole On 11/12/19 8:52 AM, Daniel Letai wrote: On 11/12/19 9:34 AM, Ole Holm Nielsen wrote: On 11/11/19 10:14 PM, Daniel Letai

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Ole Holm Nielsen
On 11/12/19 8:10 AM, Nguyen Dai Quy wrote: I have the same issue by compiling RPM. Just add "--with mysql" at rpmbuild option and the error gone :-) HTH, That's an interesting observation! Do you know what the "--with mysql" actually does? IMHO, the Slurm .spec file should include all

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Ole Holm Nielsen
s required by any of the mariadb packages, it'll get pulled automatically. If not, you don't need it on the build system. On 11/11/19 10:56 PM, Ole Holm Nielsen wrote: Hi William, Interesting experiences with MariaDB 10.4!  I tried to collect the instructions from the MariaDB page, but I'm un

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Ole Holm Nielsen
server that will run slurmd, i.e. all compute nodes. I expect that if I looked harder at the build options there may be a way to do this, perhaps with linker flags. For now, I can progress. Thanks William -Original Message- From: slurm-users On Behalf Of Ole Holm Nielsen Sent: 11

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Ole Holm Nielsen
Hi, Maybe my Slurm Wiki can help you build Slurm on CentOS/RHEL 7? See https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms Note in particular: Important: Install the MariaDB (a replacement for MySQL) packages before you build Slurm RPMs (otherwise some libraries will be

Re: [slurm-users] [EXT] Re: How to find core count per job per node

2019-10-21 Thread Ole Holm Nielsen
om: slurm-users on behalf of Ole Holm Nielsen Sent: Friday, October 18, 2019 2:15 PM To: slurm-users@lists.schedmd.com Subject: [EXT] Re: [slurm-users] How to find core count per job per node

Re: [slurm-users] Running mix versions of slurm while upgrading

2019-10-21 Thread Ole Holm Nielsen
FWIW, you may be interested in my Wiki on upgrading Slurm: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm You should also read the pages on Upgrading in the presentation Technical: Field Notes From A MadMan, Tim Wickberg, SchedMD from last month's Slurm User Group

Re: [slurm-users] How to find core count per job per node

2019-10-18 Thread Ole Holm Nielsen
On 18-10-2019 19:56, Tom Wurgler wrote: I need to know how many cores a given job is using per node. Say my nodes have 24 cores each and I run a 36-way job. It takes a node and a half. scontrol show job id shows me 36 cores and the 2 nodes it is running on. But I want to know how it split the
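With the detail flag, `scontrol -d show job <jobid>` prints lines of the form Nodes=... CPU_IDs=... Mem=..., which show how the CPUs are split across nodes. A sketch that counts the IDs per node entry; the function names and the abbreviated sample output are hypothetical. A Nodes= value may be a hostlist expression such as node[001-002], in which case the count applies to each node in it, and CPU_IDs are Slurm's abstract CPU indices rather than guaranteed physical core numbers.

```python
import re


def expand_ids(spec: str) -> list[int]:
    """Expand a CPU_IDs range string such as '0-3,8,10-11'."""
    ids: list[int] = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            ids.extend(range(int(first), int(last) + 1))
        else:
            ids.append(int(part))
    return ids


def cpus_per_node(detail_output: str) -> dict[str, int]:
    """Map each Nodes= entry to the number of CPU IDs allocated there."""
    counts = {}
    # \b keeps 'NumNodes=' from matching; only 'Nodes=... CPU_IDs=...' pairs count.
    for match in re.finditer(r"\bNodes=(\S+)\s+CPU_IDs=(\S+)", detail_output):
        counts[match.group(1)] = len(expand_ids(match.group(2)))
    return counts


# Hypothetical, abbreviated output for a 36-CPU job on 24-core nodes.
sample = """
   NumNodes=2 NumCPUs=36 NumTasks=36 CPUs/Task=1
     Nodes=node001 CPU_IDs=0-23 Mem=0 GRES=
     Nodes=node002 CPU_IDs=0-11 Mem=0 GRES=
"""
print(cpus_per_node(sample))
```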

Re: [slurm-users] Removing user from slurm configuration

2019-10-10 Thread Ole Holm Nielsen
sacctmgr delete user XXX I would also like to mention my Slurm account and user updating tools: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmaccounts /Ole On 10/10/19 1:41 PM, Mahmood Naderan wrote: Hi I had created multiple test users, and then removed them. However, I see

Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-10 Thread Ole Holm Nielsen
an NHC check "check_fs_used /scratch 90%"). Best regards, Ole On 10-09-2019 20:41, Michael Jennings wrote: On Monday, 02 September 2019, at 20:02:57 (+0200), Ole Holm Nielsen wrote: We have some users requesting that a certain minimum size of the *Available* (i.e., free) TmpFS

Re: [slurm-users] slurm node weights

2019-09-08 Thread Ole Holm Nielsen
You should be able to assign node weights to accommodate your prioritization wishes. I've summarized this setting in my Slurm Wiki page: https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-weight I hope this helps. /Ole On 9/5/19 5:48 PM, Douglas Duckworth wrote: Hello We added
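Slurm allocates the eligible nodes with the lowest Weight first, so the nodes you want used first get the lowest weights. The toy sketch below models only that ordering rule; the Node class and the tie-break by node name are assumptions for illustration, and the real scheduler considers many more factors.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    weight: int  # the Weight= value from the node's NodeName line in slurm.conf
    idle: bool


def pick_nodes(nodes: list[Node], count: int) -> list[str]:
    """Choose `count` idle nodes, lowest Weight first; ties broken by name (assumption)."""
    candidates = sorted((n for n in nodes if n.idle), key=lambda n: (n.weight, n.name))
    if len(candidates) < count:
        raise ValueError("not enough idle nodes")
    return [n.name for n in candidates[:count]]


nodes = [Node("gpu01", 100, True), Node("cpu01", 1, True),
         Node("cpu02", 1, False), Node("cpu03", 1, True)]
print(pick_nodes(nodes, 2))
```

Here the high-weight GPU node is only chosen once the low-weight CPU nodes are exhausted.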

Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-03 Thread Ole Holm Nielsen
e idea is to make the prolog set up a "project" disk quota for the job on the localtmp file system, and the epilog to remove it again. I'm not 100% sure we will make it work, but I'm hopeful. Fingers crossed! :) On 9/2/19 8:02 PM, Ole Holm Nielsen wrote: We have some users requesting tha

[slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-02 Thread Ole Holm Nielsen
ce? And how can users specify the minimum *Available* disk space required by their jobs submitted by "sbatch"? If this is not feasible, are there other techniques that achieve the same goal? We're currently still at Slurm 18.08. Thanks, Ole -- Ole Holm Nielsen PhD, Senior HPC

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Ole Holm Nielsen
Hi Guillaume, The performance of the slurmctld server depends strongly on the server hardware on which it is running! This should be taken into account when considering your question. SchedMD recommends that the slurmctld server should have only a few, but very fast CPU cores, in order to

[slurm-users] ANNOUNCE: A new showuserlimits tool for printing Slurm user resource limits and usage

2019-08-21 Thread Ole Holm Nielsen
regards, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

Re: [slurm-users] Fwd: Getting information about AssocGrpCPUMinutesLimit for a job

2019-08-11 Thread Ole Holm Nielsen
Andreas made a good suggestion of looking at the user's TRESRunMin from sshare in order to answer Jeff's question about AssocGrpCPUMinutesLimit for a job. However, getting at this information is in practice really complicated, and I don't think any ordinary user will bother to look it up.
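As I understand the sshare documentation, TRESRunMins counts, per TRES, the amount allocated to running jobs multiplied by their remaining time limit. A small Python sketch of that arithmetic for CPUs, with hypothetical job values; it only illustrates the calculation and does not query Slurm.

```python
from dataclasses import dataclass

@dataclass
class RunningJob:
    cpus: int            # CPUs allocated to the job
    time_limit_min: int  # job time limit in minutes
    elapsed_min: int     # minutes the job has run so far

def cpu_run_minutes(jobs: list[RunningJob]) -> int:
    """Sum of allocated CPUs times remaining time limit over running jobs."""
    return sum(j.cpus * max(j.time_limit_min - j.elapsed_min, 0) for j in jobs)

if __name__ == "__main__":
    jobs = [
        RunningJob(cpus=36, time_limit_min=120, elapsed_min=30),  # 36 * 90 = 3240
        RunningJob(cpus=8, time_limit_min=60, elapsed_min=60),    # at its limit: 0
    ]
    print(cpu_run_minutes(jobs))
```

The example also shows why the figure is hard for users to interpret: it shrinks continuously as jobs run, even though no new work is submitted.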

[slurm-users] ANNOUNCE: A new showpartitions tool

2019-07-15 Thread Ole Holm Nielsen
n really supportive in testing showpartitions during development and comparing the output to spart. Thorsten Deilmann from University of Wuppertal has offered a number of useful suggestions, including the colored output. Best regards, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

Re: [slurm-users] Slurm topology.conf file

2019-07-09 Thread Ole Holm Nielsen
On 7/9/19 10:14 AM, Priya Mishra wrote: Hi Ole, I am using slurm emulator and would soon start working with the slurm simulator. I need these larger topology files for the purpose of a project and not actual job scheduling. If there are any suitable resources for me to use, please let me

Re: [slurm-users] Slurm topology.conf file

2019-07-09 Thread Ole Holm Nielsen
On 7/9/19 9:04 AM, Priya Mishra wrote: Hi, I am using the slurmibtopology tool to generate the topology.conf file from the cluster at my institute, which gives me a file with around 400 nodes. I need a topology file with a larger number of nodes for further use. Is there any way of generating a
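For emulation or simulation purposes, a synthetic tree topology of any size can be written in topology.conf's `SwitchName=... Nodes=...` / `SwitchName=... Switches=...` form. The Python sketch below generates a simple two-level tree: leaf switches with a fixed number of nodes each, and one top switch over all leaves. The node prefix, switch names and radix are hypothetical; node names must match the NodeName entries in slurm.conf, and a real fabric may look quite different.

```python
def hostlist(prefix: str, first: int, last: int, width: int) -> str:
    """Compact Slurm-style host list, e.g. node[0001-0032], or a single name."""
    if first == last:
        return f"{prefix}{first:0{width}d}"
    return f"{prefix}[{first:0{width}d}-{last:0{width}d}]"

def make_topology(num_nodes: int, nodes_per_switch: int, prefix: str = "node") -> str:
    """Return topology.conf text for a two-level tree with num_nodes nodes."""
    # Pad all node numbers to the same width so host list ranges are consistent.
    width = len(str(num_nodes))
    num_leaf = (num_nodes + nodes_per_switch - 1) // nodes_per_switch
    lines = []
    for s in range(num_leaf):
        first = s * nodes_per_switch + 1
        last = min(first + nodes_per_switch - 1, num_nodes)
        lines.append(f"SwitchName=leaf{s} Nodes={hostlist(prefix, first, last, width)}")
    # One top-level switch connecting all leaf switches.
    leaves = "leaf0" if num_leaf == 1 else f"leaf[0-{num_leaf - 1}]"
    lines.append(f"SwitchName=spine0 Switches={leaves}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(make_topology(1000, 32))
```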

Re: [slurm-users] Jobs waiting while plenty of cpu and memory available

2019-07-09 Thread Ole Holm Nielsen
Hi Edward, The squeue command tells you about job status. You can get extra information using format options (see the squeue man-page). I like to set this environment variable for squeue: export SQUEUE_FORMAT="%.18i %.9P %.6q %.8j %.8u %.8a %.10T %.9Q %.10M %.10V %.9l %.6D %.6C %m %R"

Re: [slurm-users] Hints, Cheatsheets, etc

2019-07-09 Thread Ole Holm Nielsen
Hi Edward, Besides my Slurm Wiki page https://wiki.fysik.dtu.dk/niflheim/SLURM, I have written a number of tools which we use for monitoring our cluster, see https://github.com/OleHolmNielsen/Slurm_tools. I recommend in particular these tools: * pestat Prints a Slurm cluster nodes status

Re: [slurm-users] dual slurmctld and slurmdbd

2019-07-03 Thread Ole Holm Nielsen
On 7/2/19 10:48 PM, Tina Fora wrote: We run mysql on a dedicated machine with slurmctld and slurmdbd running on another machine. Now I want to add another machine running slurmctld and slurmdbd and this machine will be on CentOS 7. Existing one is CentOS 6. Is this possible? Can I run two

Re: [slurm-users] Installation troubles

2019-07-01 Thread Ole Holm Nielsen
On 01-07-2019 21:47, HELLMERS Joe wrote: I’m having trouble installing Slurm 18.08.7 on Red Hat 7.3. I installed munge from source. It may be easier for you to install Slurm with RPMs. A complete guide is in my Slurm Wiki pages: https://wiki.fysik.dtu.dk/niflheim/SLURM

Re: [slurm-users] getting closer

2019-06-28 Thread Ole Holm Nielsen
On 6/28/19 9:57 AM, Valerio Bellizzomi wrote: On Fri, 2019-06-28 at 09:39 +0200, Ole Holm Nielsen wrote: On 6/28/19 9:18 AM, Valerio Bellizzomi wrote: On Fri, 2019-06-28 at 08:51 +0200, Valerio Bellizzomi wrote: On Thu, 2019-06-27 at 18:35 +0200, Valerio Bellizzomi wrote: The nodes are now

Re: [slurm-users] getting closer

2019-06-28 Thread Ole Holm Nielsen
On 6/28/19 9:18 AM, Valerio Bellizzomi wrote: On Fri, 2019-06-28 at 08:51 +0200, Valerio Bellizzomi wrote: On Thu, 2019-06-27 at 18:35 +0200, Valerio Bellizzomi wrote: The nodes are now communicating however when I run the command srun -w compute02 /bin/ls it remains stuck and there is no

Re: [slurm-users] Tiny feature request: sacct.1 man page should list SACCT_FORMAT

2019-06-26 Thread Ole Holm Nielsen
On 6/26/19 1:14 PM, John Marshall wrote: On 26 Jun 2019, at 11:51, Ole Holm Nielsen wrote: You should open a case with SchedMD containing your patch: https://bugs.schedmd.com/ Yes, I considered creating a Bugzilla account at SchedMD so that I could send them a three-line patch

Re: [slurm-users] Tiny feature request: sacct.1 man page should list SACCT_FORMAT

2019-06-26 Thread Ole Holm Nielsen
On 6/26/19 12:23 PM, John Marshall wrote: I have had $SQUEUE_FORMAT set in my environment for a long time, but have only today learnt that sacct will also listen to an environment variable to set a default output format. Previously I had only looked for it in the Environment Variables section
